summary of current pending changes in aio-next.git

2019-08-08 Thread Benjamin LaHaise
The first 10 patches of Kent's AIO optimization series have been pulled into my 
tree as summarised below.  I had to rework some of the patches to correctly 
apply after Gu Zheng's polishing of the AIO ringbuffer migration support, 
so any extra eyes on those changes would be helpful.  I also fixed a couple 
of bugs in the table lookup patch, as well as eliminating a BUG_ON() Kent 
added in order to be a bit more defensive.

These changes should show up in linux-next, so please give them a beating.  
I plan to post a few more changes against this tree in the next couple of 
weeks.

-ben
---
The following changes since commit 47188d39b5deeebf41f87a02af1b3935866364cf:

  Merge tag 'ext4_for_linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 (2013-07-14 21:47:51 
-0700)

are available in the git repository at:


  git://git.kvack.org/~bcrl/aio-next.git master

for you to fetch changes up to 6878ea72a5d1aa6caae86449975a50b7fe9abed5:

  aio: be defensive to ensure request batching is non-zero instead of BUG_ON() 
(2013-07-31 10:34:18 -0400)

--------
Benjamin LaHaise (4):
  aio: fix build when migration is disabled
  aio: double aio_max_nr in calculations
  aio: convert the ioctx list to table lookup v3
  aio: be defensive to ensure request batching is non-zero instead of 
BUG_ON()

Gu Zheng (2):
  fs/anon_inode: Introduce a new lib function anon_inode_getfile_private()
  fs/aio: Add support to aio ring pages migration

Kent Overstreet (9):
  aio: reqs_active -> reqs_available
  aio: percpu reqs_available
  aio: percpu ioctx refcount
  aio: io_cancel() no longer returns the io_event
  aio: Don't use ctx->tail unnecessarily
  aio: Kill aio_rw_vect_retry()
  aio: Kill unneeded kiocb members
  aio: Kill ki_users
  aio: Kill ki_dtor

 drivers/staging/android/logger.c |   2 +-
 drivers/usb/gadget/inode.c   |   9 +-
 fs/aio.c | 717 +--
 fs/anon_inodes.c |  66 
 fs/block_dev.c   |   2 +-
 fs/nfs/direct.c  |   1 -
 fs/ocfs2/file.c  |   6 +-
 fs/read_write.c  |   3 -
 fs/udf/file.c|   2 +-
 include/linux/aio.h  |  21 +-
 include/linux/anon_inodes.h  |   3 +
 include/linux/migrate.h  |   3 +
 include/linux/mm_types.h |   5 +-
 kernel/fork.c|   2 +-
 mm/migrate.c |   2 +-
 mm/page_io.c |   1 -
 net/socket.c |  15 +-
 17 files changed, 549 insertions(+), 311 deletions(-)
-- 
"Thought is the essence of where you are now."


Re: [PATCH] aio: Support read/write with non-iter file-ops

2019-07-18 Thread Benjamin LaHaise
On Thu, Jul 18, 2019 at 04:43:52PM -0700, Bjorn Andersson wrote:
> On Thu 18 Jul 16:17 PDT 2019, Al Viro wrote:
> 
> > On Thu, Jul 18, 2019 at 04:10:54PM -0700, Bjorn Andersson wrote:
> > > Implement a wrapper for aio_read()/write() to allow async IO on files
> > > not implementing the iter version of read/write, such as sysfs. This
> > > mimics how readv/writev uses non-iter ops in do_loop_readv_writev().
> > 
> > IDGI.  How would that IO manage to be async?  And what's the point
> > using aio in such situations in the first place?
> 
> The point is that an application using aio to submit io operations on a
> set of files can use the same mechanism to read/write files that
> happen to be backed by drivers implementing only read/write (not
> read_iter/write_iter) in the registered file_operations struct, such as
> kernfs.
> 
> In this particular case I have a sysfs file that is accessing hardware
> and hence will block for a while and using this patch I can io_submit()
> a write and handle the completion of this in my normal event loop.
> 
> 
> Each individual io operation will be just as synchronous as the current
> iter-based mechanism - for the drivers that implement that.

Just adding the fops is not enough.  I have patches floating around at
Solace that add thread based fallbacks for files that don't have an aio
read / write implementation, but I'm not working on that code any more.
The thread based methods were quite useful in applications that had a need
for using other kernel infrastructure in their main event loops.

-ben

> Regards,
> Bjorn
> 

-- 
"Thought is the essence of where you are now."


Re: aio poll, io_pgetevents and a new in-kernel poll API V4

2018-01-25 Thread Benjamin LaHaise
On Mon, Jan 22, 2018 at 09:12:07PM +0100, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds support for the IOCB_CMD_POLL operation to poll for the
> readiness of file descriptors using the aio subsystem.  The API is based
> on patches that existed in RHAS2.1 and RHEL3, which means it is already
> supported by libaio.  To implement the poll support efficiently, new
> methods to poll are introduced in struct file_operations:  get_poll_head
> and poll_mask.  The first one returns a wait_queue_head to wait on
> (lifetime is bound by the file), and the second does a non-blocking
> check for the POLL* events.  This allows aio poll to work without
> any additional context switches, unlike epoll.

I implemented something similar back in December, but did so without
changing the in-kernel poll API.  See below for the patch that implements
it.  Is changing the in-kernel poll API really desirable given how many
drivers it will touch?

The tree with that as a work-in-progress is available at
git://git.kvack.org/~bcrl/aio-wip-20171215.git .  I have some time
scheduled right now to work on cleaning up that series which also includes
support for more aio options by making use of queue_work() to dispatch
operations that make use of helper threads.  The aio poll implementation
patch is below as an alternative approach.  It has passed some amount of
QA by Solace where we use it for a unified event loop with various
filesystem / block AIO combined with poll on TCP, unix domains sockets and
pipes.  Reworking cancellation was required to fix several lock
ordering issues that are impossible to address with the in-kernel aio
cancellation support.

-ben


> To make the interface fully useful a new io_pgetevents system call is
> added, which atomically saves and restores the signal mask over the
> io_pgetevents system call.  It is the logical equivalent to pselect and
> ppoll for io_pgetevents.

That looks useful.  I'll have to look at this in detail.

-ben


commit a299c474b19107122eae846b53f742d876f304f9
Author: Benjamin LaHaise 
Date:   Fri Dec 15 13:19:23 2017 -0500

aio: add aio poll implementation

Some applications using AIO have a need to be able to poll on file
descriptors.  This is typically the case with libraries using
non-blocking operations inside of an application using an io_getevents()
based main event loop.  This patch implements IOCB_CMD_POLL directly
inside fs/aio.c using file_operations->poll() to insert a waiter.  This
avoids the need to use helper threads, enabling completion events to be
generated directly in interrupt or task context.

diff --git a/fs/aio.c b/fs/aio.c
index 3381b2e..23b9a06 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -43,6 +43,7 @@
 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -172,7 +173,7 @@ struct kioctx {
  * synchronize cancellation and completion - we only set it to KIOCB_CANCELLED
  * with xchg() or cmpxchg(), see batch_complete_aio() and kiocb_cancel().
  */
-#define KIOCB_CANCELLED((void *) (~0ULL))
+#define KIOCB_CANCELLED((void *) (~5ULL))
 
 struct aio_kiocb {
	struct kiocb		common;
@@ -205,6 +206,13 @@ struct aio_kiocb {
	aio_thread_work_fn_t	ki_work_fn;
	struct work_struct	ki_work;
 #endif
+
+	unsigned long		ki_data;
+	struct poll_table_struct poll;
+	struct tasklet_struct	tasklet;
+	int			wait_idx;
+	unsigned		events;
+	struct poll_table_entry	poll_wait[2];
 };
 
 /*-- sysctl variables*/
@@ -212,7 +220,7 @@ struct aio_kiocb {
 unsigned long aio_nr;  /* current system wide number of aio requests */
 unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
 #if IS_ENABLED(CONFIG_AIO_THREAD)
-unsigned long aio_auto_threads;/* Currently disabled by default */
+unsigned long aio_auto_threads = 1;/* Currently disabled by default */
 #endif
 /*end sysctl variables---*/
 
@@ -1831,6 +1839,213 @@ static ssize_t aio_write(struct aio_kiocb *kiocb, 
struct iocb *iocb, bool compat
return ret;
 }
 
+static int aio_poll_cancel_cb(struct kiocb *iocb, unsigned long data, int ret)
+{
+   struct aio_kiocb *req = container_of(iocb, struct aio_kiocb, common);
+   unsigned i;
+
+	for (i = 0; i < req->wait_idx; i++) {
+   wait_queue_head_t *head = req->poll_wait[i].wait_address;
+   remove_wait_queue(head, &req->poll_wait[i].wait);
+   }
+
+   aio_complete(iocb, -EINTR, 0);
+   return 0;
+}
+
+static int aio_poll_cancel(struct kiocb *iocb, kiocb_cancel_cb_fn **cb_p,
+  unsigned long *data_p)
+{
+   *cb_p = aio_poll_cancel_cb;
+   *data_p = 0;
+   return 0;
+}
+
+stati

Re: [PATCH 2/6] move _body_io_syscall to the generic syscall.h

2018-01-05 Thread Benjamin LaHaise
On Fri, Jan 05, 2018 at 11:25:17AM -0500, Jeff Moyer wrote:
> Christoph Hellwig  writes:
> 
> > This way it can be used for the fallback 6-argument version on
> > all architectures.
> >
> > Signed-off-by: Christoph Hellwig 
> 
> This is a strange way to do things.  However, I was never really sold on
> libaio having to implement its own system call wrappers.  That decision
> definitely resulted in some maintenance overhead.
> 
> Ben, what was your reasoning for not just using syscall?

The main issue was that glibc's pthreads implementation really sucked back
during initial development and there was a use-case for having the io_XXX
functions usable directly from clone()ed threads that didn't have all the
glibc pthread state setup for per-cpu areas to handle per-thread errno.
That made sense back then, but is rather silly today.

Technically, I'm not sure the generic syscall wrapper is safe to use.  The
io_XXX arch wrappers don't modify errno, while it appears the generic one
does.  That said, nobody has ever noticed...

-ben

> -Jeff
> 
> > ---
> >  src/syscall-generic.h | 6 --
> >  src/syscall.h | 7 +++
> >  2 files changed, 7 insertions(+), 6 deletions(-)
> >
> > diff --git a/src/syscall-generic.h b/src/syscall-generic.h
> > index 24d7c7c..35b8580 100644
> > --- a/src/syscall-generic.h
> > +++ b/src/syscall-generic.h
> > @@ -2,12 +2,6 @@
> >  #include 
> >  #include 
> >  
> > -#define _body_io_syscall(sname, args...) \
> > -{ \
> > -   int ret = syscall(__NR_##sname, ## args); \
> > -   return ret < 0 ? -errno : ret; \
> > -}
> > -
> >  #define io_syscall1(type,fname,sname,type1,arg1) \
> >  type fname(type1 arg1) \
> >  _body_io_syscall(sname, (long)arg1)
> > diff --git a/src/syscall.h b/src/syscall.h
> > index a2da030..3819519 100644
> > --- a/src/syscall.h
> > +++ b/src/syscall.h
> > @@ -10,6 +10,13 @@
> >  #define DEFSYMVER(compat_sym, orig_sym, ver_sym)   \
> > __asm__(".symver " SYMSTR(compat_sym) "," SYMSTR(orig_sym) "@@LIBAIO_" 
> > SYMSTR(ver_sym));
> >  
> > +/* generic fallback */
> > +#define _body_io_syscall(sname, args...) \
> > +{ \
> > +   int ret = syscall(__NR_##sname, ## args); \
> > +   return ret < 0 ? -errno : ret; \
> > +}
> > +
> >  #if defined(__i386__)
> >  #include "syscall-i386.h"
> >  #elif defined(__x86_64__)
> 

-- 
"Thought is the essence of where you are now."


Re: [PATCH 1/1] aio: make sure the input "timeout" value is valid

2017-12-13 Thread Benjamin LaHaise
On Wed, Dec 13, 2017 at 06:11:12AM -0800, Matthew Wilcox wrote:
> On Wed, Dec 13, 2017 at 09:42:52PM +0800, Zhen Lei wrote:
> > The information below was reported on an older kernel version, and I saw the
> > problem still exists in the current version.
> 
> I think you're right, but what an awful interface we have here!
> The user must not only fetch it, they must validate it separately?
> And if they forget, then userspace is provoking undefined behaviour?  Ugh.
> Why not this:

That looks far better!

Acked-by: Benjamin LaHaise 

-ben

> diff --git a/fs/aio.c b/fs/aio.c
> index 4adbdcbe753a..fdd16cf897c8 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1788,8 +1788,9 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
>   struct timespec64   ts;
>  
>   if (timeout) {
> - if (unlikely(get_timespec64(&ts, timeout)))
> - return -EFAULT;
> + int error = get_valid_timespec64(&ts, timeout);
> + if (error)
> + return error;
>   }
>  
>   return do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : 
> NULL);
> @@ -1805,9 +1806,9 @@ COMPAT_SYSCALL_DEFINE5(io_getevents, 
> compat_aio_context_t, ctx_id,
>   struct timespec64 t;
>  
>   if (timeout) {
> - if (compat_get_timespec64(&t, timeout))
> - return -EFAULT;
> -
> + int error = compat_get_valid_timespec64(&t, timeout);
> + if (error)
> + return error;
>   }
>  
>   return do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
> diff --git a/include/linux/compat.h b/include/linux/compat.h
> index 0fc36406f32c..578fc0f208d9 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -172,6 +172,26 @@ extern int get_compat_itimerspec64(struct itimerspec64 
> *its,
>  extern int put_compat_itimerspec64(const struct itimerspec64 *its,
>   struct compat_itimerspec __user *uits);
>  
> +static inline __must_check
> +int compat_get_valid_timespec64(struct timespec64 *ts, const void __user 
> *uts)
> +{
> + if (unlikely(compat_get_timespec64(ts, uts)))
> + return -EFAULT;
> + if (unlikely(!timespec64_valid(ts)))
> + return -EINVAL;
> + return 0;
> +}
> +
> +static inline __must_check
> +int compat_get_strict_timespec64(struct timespec64 *ts, const void __user 
> *uts)
> +{
> + if (unlikely(compat_get_timespec64(ts, uts)))
> + return -EFAULT;
> + if (unlikely(!timespec64_valid_strict(ts)))
> + return -EINVAL;
> + return 0;
> +}
> +
>  struct compat_iovec {
>   compat_uptr_t   iov_base;
>   compat_size_t   iov_len;
> diff --git a/include/linux/time.h b/include/linux/time.h
> index 4b62a2c0a661..506d87483d04 100644
> --- a/include/linux/time.h
> +++ b/include/linux/time.h
> @@ -18,6 +18,26 @@ int get_itimerspec64(struct itimerspec64 *it,
>  int put_itimerspec64(const struct itimerspec64 *it,
>   struct itimerspec __user *uit);
>  
> +static inline __must_check int get_valid_timespec64(struct timespec64 *ts,
> + const struct timespec __user *uts)
> +{
> + if (unlikely(get_timespec64(ts, uts)))
> + return -EFAULT;
> + if (unlikely(!timespec64_valid(ts)))
> + return -EINVAL;
> + return 0;
> +}
> +
> +static inline __must_check int get_strict_timespec64(struct timespec64 *ts,
> + const struct timespec __user *uts)
> +{
> + if (unlikely(get_timespec64(ts, uts)))
> + return -EFAULT;
> + if (unlikely(!timespec64_valid_strict(ts)))
> + return -EINVAL;
> + return 0;
> +}
> +
>  extern time64_t mktime64(const unsigned int year, const unsigned int mon,
>   const unsigned int day, const unsigned int hour,
>   const unsigned int min, const unsigned int sec);
> 

-- 
"Thought is the essence of where you are now."


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-06 Thread Benjamin LaHaise
On Wed, Dec 06, 2017 at 09:19:09PM +0300, Kirill Tkhai wrote:
> On 06.12.2017 20:44, Benjamin LaHaise wrote:
> > On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
> >>>> This memory lives in page-cache/lru, it is visible for shrinker which
> >>>> will unmap these pages for no reason on memory shortage. IOW, aio fools
> >>>> the kernel, this memory looks reclaimable but it is not. And we only do
> >>>> this for migration.
> >>>
> >>> It's the same as any other memory that's mlock()ed into RAM.
> >>
> >> No. Again, this memory is not properly accounted, and unlike mlock()ed
> >> memory it is visible to shrinker which will do the unnecessary work on
> >> memory shortage which in turn will lead to unnecessary page faults.
> >>
> >> So let me repeat, shouldn't we at least do mapping_set_unevictable() in
> >> aio_private_file() ?
> > 
> > Send a patch then!  I don't know why you're asking rather than sending a
> > patch to do this if you think it is needed.
> > 
> >>>> triggers OOM-killer which kills sshd and other daemons on my machine.
> >>>> These pages were not even faulted in (or the shrinker can unmap them),
> >>>> the kernel can not know who should be blamed.
> >>>
> >>> The OOM-killer killed the wrong process: News at 11.
> >>
> >> Well. I do not think we should blame OOM-killer in this case. But as I
> >> said this is not a bug-report or something like this, I agree this is
> >> a minor issue.
> > 
> > I do think the OOM-killer is doing the wrong thing here.  If process X is
> > the only one that is allocating gobs of memory, why kill process Y that
> > hasn't allocated memory in minutes or hours just because it is bigger?
> 
> I assume that if a process hasn't allocated memory in minutes or hours,
> then most probably all of its evictable memory has already been
> moved to swap, as its pages were last in the LRU list.

That assumption would be incorrect.  Just because memory isn't being
allocated doesn't mean that it isn't being used - the two are very
different things.  Most of the embedded systems I work on allocate memory
at startup and then use it until the system gets restarted or rebooted.
By only doing memory allocations at startup, code is simplified,
performance is more predictable and failure modes are constrained.

-ben
-- 
"Thought is the essence of where you are now."


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-06 Thread Benjamin LaHaise
On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
> > > This memory lives in page-cache/lru, it is visible for shrinker which
> > > will unmap these pages for no reason on memory shortage. IOW, aio fools
> > > the kernel, this memory looks reclaimable but it is not. And we only do
> > > this for migration.
> >
> > It's the same as any other memory that's mlock()ed into RAM.
> 
> No. Again, this memory is not properly accounted, and unlike mlock()ed
> memory it is visible to shrinker which will do the unnecessary work on
> memory shortage which in turn will lead to unnecessary page faults.
> 
> So let me repeat, shouldn't we at least do mapping_set_unevictable() in
> aio_private_file() ?

Send a patch then!  I don't know why you're asking rather than sending a
patch to do this if you think it is needed.

> > > triggers OOM-killer which kills sshd and other daemons on my machine.
> > > These pages were not even faulted in (or the shrinker can unmap them),
> > > the kernel can not know who should be blamed.
> >
> > The OOM-killer killed the wrong process: News at 11.
> 
> Well. I do not think we should blame OOM-killer in this case. But as I
> said this is not a bug-report or something like this, I agree this is
> a minor issue.

I do think the OOM-killer is doing the wrong thing here.  If process X is
the only one that is allocating gobs of memory, why kill process Y that
hasn't allocated memory in minutes or hours just because it is bigger?
We're not perfect at tracking who owns memory allocations, so why not
factor in memory allocation rate when deciding which process to kill?  We
keep throwing bandaids on the OOM-killer by annotating allocations, and we
keep missing the annotation of allocations.  Doesn't sound like a real fix
for the underlying problem to me when a better heuristic would solve the
current problem and probably several other future instances of the same
problem.

-ben
-- 
"Thought is the essence of where you are now."


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-05 Thread Benjamin LaHaise
On Tue, Dec 05, 2017 at 04:19:56PM +0100, Oleg Nesterov wrote:
> On 12/05, Kirill Tkhai wrote:
> >
> > Currently, aio_nr and aio_max_nr are global.
> 
> Yeah, I too tried to complain 2 years ago...
> 
> > In case of containers this
> > means that a single container may occupy all aio requests, which are
> > available in the system,
> 
> and memory. let me quote my old emails...
> 
> 
> This is off-topic, but the whole "vm" logic in aio_setup_ring()
> looks sub-optimal. I do not mean the code, just it seems to me it
> is pointless to pollute the page cache, and expose the pages we
> can not swap/free to lru. Afaics we _only_ need this for migration.

It is needed for migration, which is needed for hot unplug of memory.
There is no way around this.

> This memory lives in page-cache/lru, it is visible for shrinker which
> will unmap these pages for no reason on memory shortage. IOW, aio fools
> the kernel, this memory looks reclaimable but it is not. And we only do
> this for migration.

It's the same as any other memory that's mlock()ed into RAM.

> Even if this is not a problem, this does not look right. So perhaps at
> least mapping_set_unevictable() makes sense. But I simply do not know
> if migration will work with this change.
> 
> 
> 
> Perhaps I missed something, doesn't matter. But this means that
> this memory is not accounted, so if I increase aio-max-nr then
> this test-case
> 
>   #define __NR_io_setup   206
> 
>   int main(void)
>   {
>   int nr;
> 
>   for (nr = 0; ;++nr) {
>   void *ctx = NULL;
>   int ret = syscall(__NR_io_setup, 1, &ctx);
>   if (ret) {
>   printf("failed %d %m: ", nr);
>   getchar();
>   }
>   }
> 
>   return 0;
>   }
> 
> triggers OOM-killer which kills sshd and other daemons on my machine.
> These pages were not even faulted in (or the shrinker can unmap them),
> the kernel can not know who should be blamed.

The OOM-killer killed the wrong process: News at 11.  This is not new
behaviour, and has always been an issue.  If it bothered to kill the
process that was doing the allocations, the ioctxes would be freed and all
would be well.  Doesn't the OOM killer take into account which process is
allocating memory?  Seems like a pretty major deficiency if it doesn't.

> Shouldn't we account aio events/pages somehow, say per-user, or in
> mm->pinned_vm ?

Sure, I have no objection to that.  Please send a patch!

> I do not think this is unkown, and probably this all is fine. IOW,
> this is just a question, not a bug-report or something like this.
> 
> And of course, this is not exploitable because aio-max-nr limits
> the number of pages you can steal.

Which is why the aio-max-nr limit exists!  That it is imprecise and
imperfect is not being denied.

> But otoh, aio_max_nr is system-wide, so an unprivileged user can
> ddos (say) mysqld. And this leads to the same question: shouldn't
> we account nr_events at least?

Anyone can DDoS the local system - this is nothing new and has always been
the case.  I'm not opposed to improvements in this area.  The only issue I
have with Kirill's changes is that we need to have test cases for this new
functionality if the code is to be merged.

TBH, using memory accounting to limit things may be a better approach, as
that at least doesn't require changes to container implementations to
obtain a benefit from bounding aio's memory usage.

-ben

> Oleg.
> 

-- 
"Thought is the essence of where you are now."


Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

2017-12-04 Thread Benjamin LaHaise
Hi Kirill,

On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
> Hi,
> 
> this patch set introduces accounting of aio_nr and aio_max_nr per blkio cgroup.
> It may be used to limit number of aio requests, which are available for
> a cgroup, and could be useful for containers.
> 
> The accounting is hierarchical: aio contexts allocated in a child cgroup
> are accounted in the parent cgroups too.
> 
> Also, a new cgroup file "io.aio_nr" is introduced. It's used to set the
> cgroup's aio request limit, and to show the current limit and the number
> of currently occupied requests.

Where are your test cases to check this functionality in the libaio test
suite?

-ben

> Patches 1-3 are refactoring.
> Patch 4 is the place where the accounting actually introduced.
> Patch 5 adds "io.aio_nr" file.
> 
> ---
> 
> Kirill Tkhai (5):
>   aio: Move aio_nr increment to separate function
>   aio: Export aio_nr_lock and aio_max_nr initial value to 
> include/linux/aio.h
>   blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
>   blkcg: Charge aio requests in blkio cgroup hierarchy
>   blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr
> 
> 
>  block/blk-cgroup.c |   88 +-
>  fs/aio.c   |  151 
> 
>  include/linux/aio.h|   21 ++
>  include/linux/blk-cgroup.h |4 +
>  4 files changed, 247 insertions(+), 17 deletions(-)
> 
> --
> Signed-off-by: Kirill Tkhai 
> 

-- 
"Thought is the essence of where you are now."


[GIT PULL] aio: improve aio-nr counting

2017-09-07 Thread Benjamin LaHaise
Hello Linus,

Please consider pulling the following change for 4.14, which improves aio-nr
counting on large SMP systems and has been in linux-next for quite some
time.  Cheers,

-ben

The following changes since commit 569dbb88e80deb68974ef6fdd6a13edb9d686261:

  Linux 4.13 (2017-09-03 13:56:17 -0700)

are available in the git repository at:

  git://git.kvack.org/~bcrl/aio-next.git master

for you to fetch changes up to 2a8a98673c13cb2a61a6476153acf8344adfa992:

  fs: aio: fix the increment of aio-nr and counting against aio-max-nr 
(2017-09-07 12:28:28 -0400)


Mauricio Faria de Oliveira (1):
  fs: aio: fix the increment of aio-nr and counting against aio-max-nr

 fs/aio.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)
-- 
"Thought is the essence of where you are now."


Re: [PATCH] aio: Add command to wait completion of all requests

2017-06-14 Thread Benjamin LaHaise
On Wed, Jun 14, 2017 at 12:11:38PM +0300, Kirill Tkhai wrote:
> On 14.06.2017 02:26, Benjamin LaHaise wrote:
...
> > Nope.  An aio may not complete in a timely fashion, in which case your
> > wait for all aios to complete will simply wait forever.  I take it this is
> > not the desired behaviour of checkpointing.
> 
> But read_events() does it. What's the difference between them?

read_events() does not wait for all aios to complete.  The model here is
event-driven programming: the program waits for an event, wakes up, does
some work, goes back to sleep until another event comes in.  Some of the
aios in flight won't complete for significant amounts of time, but that
doesn't matter since those which do complete will be processed and cause
state machines to make progress.

> >> Could you please describe how cancelling aio requests will help to wait
> >> till their completion? Is there a guarantee they will be easily queued
> >> back? I suppose not, because there may be a memory limit or some
> >> low-level driver limitations, dependent on internal conditions.
> > 
> > This is the problem you're facing.  Quiescing aios is complicated!
> 
> Those are very generic words that could refer to anything. And it's a problem
> that is not limited to the small aio engine, because it's impossible to
> guarantee the availability of kernel resources from there. Imagine a
> parallel task eating the memory of your just-cancelled request: then you
> simply can't queue the request back.
> 
> This makes your suggestion unsuitable as a rock-solid solution for all cases:
> it will break user applications in such situations.

Again, you're failing to understand what the programming model is for aio.

> >> Also, it does not seem good to overload aio with the functionality of
> >> obtaining closed file descriptors of submitted requests.
> >>
> >> Do you mean it this way, or did I misunderstand you? Could you please
> >> make your idea more concrete?
> > 
> > Yes, this is what I'm talking about.
> > 
> >> In my view, cancelling requests does not make it possible to implement
> >> the need I described. If we can't queue a request back, it breaks
> >> snapshotting and the user application in general.
> > 
> > This is what you have to figure out.  This is why your patch is incomplete
> > and cannot be accepted.  You can't punt the complexity your feature
> > requires onto other maintainers -- your implementation has to be reasonably
> > complete at time of patch inclusion.  Can you see now why your patch can't
> > be included as-is?  The assumption you made that aios complete in a timely
> > fashion is incorrect.  Everything else stems from that.
> 
> I just can't see why read_events() simply waits for request completion
> without paying attention to all of the above. Could you please clarify the
> difference between these two situations?

Read my above comments about event-driven programming.  Waiting for all
aios to complete is very different from waiting for a non-specific
completion event to come along.

-ben
-- 
"Thought is the essence of where you are now."


Re: [PATCH] aio: Add command to wait completion of all requests

2017-06-13 Thread Benjamin LaHaise
On Tue, Jun 13, 2017 at 07:17:43PM +0300, Kirill Tkhai wrote:
> On 13.06.2017 18:26, Benjamin LaHaise wrote:
> > On Tue, Jun 13, 2017 at 06:11:03PM +0300, Kirill Tkhai wrote:
> > ...
> >> The functionality I implemented grew from real need and experience. We try
> >> to avoid kernel modifications where possible, but in-flight aio requests
> >> are not a case where that is feasible.
> > 
> > What you've done only works for *your* use-case, but not in general.  Like
> > in other subsystems, you need to provide hooks on a per file descriptor
> > basis for quiescing different kinds of file descriptors.
> 
> Which hooks do you suggest? It's possible there is no file descriptor open
> after a request is submitted. Do you suggest an interface to reopen a
> struct file?

That's what you have to contend with.  AIO keeps the struct file pinned
for the life of the request.  This is one of the complex issues you need
to address.

> > Your current
> > patch set completely ignores things like usb gadget.  You need to create
> > infrastructure for restarting i/os after your checkpointing occurs, which
> > you haven't put any thought into in this patchset.  If you want to discuss
> > how to do that, fine, but the approach in this patchset simply does not
> > work in general.  What happens when an aio doesn't complete or takes hours
> > to complete?
> 
> It uses wait_event_interruptible(), but it's possible to convert that
> to wait_event_interruptible_hrtimeout(), as is done in read_events().
> It's not a fatal flaw of the patch. The function read_events() simply
> waits with a timeout; can't we do the same?

Nope.  An aio may not complete in a timely fashion, in which case your
wait for all aios to complete will simply wait forever.  I take it this is
not the desired behaviour of checkpointing.

> Could you please describe how cancelling aio requests will help to wait
> till their completion? Is there a guarantee they will be easily queued back?
> I suppose not, because there may be a memory limit or some low-level
> driver limitations, dependent on internal conditions.

This is the problem you're facing.  Quiescing aios is complicated!

> Also, it does not seem good to overload aio with the functionality of
> obtaining closed file descriptors of submitted requests.
> 
> Do you mean it this way, or did I misunderstand you? Could you please make
> your idea more concrete?

Yes, this is what I'm talking about.

> In my view, cancelling requests does not make it possible to implement the
> need I described. If we can't queue a request back, it breaks snapshotting
> and the user application in general.

This is what you have to figure out.  This is why your patch is incomplete
and cannot be accepted.  You can't punt the complexity your feature
requires onto other maintainers -- your implementation has to be reasonably
complete at time of patch inclusion.  Can you see now why your patch can't
be included as-is?  The assumption you made that aios complete in a timely
fashion is incorrect.  Everything else stems from that.

-ben

> Thanks,
> Kirill
> 

-- 
"Thought is the essence of where you are now."


Re: [PATCH] aio: Add command to wait completion of all requests

2017-06-13 Thread Benjamin LaHaise
On Tue, Jun 13, 2017 at 06:11:03PM +0300, Kirill Tkhai wrote:
...
> The functionality I implemented grew from real need and experience. We try
> to avoid kernel modifications where possible, but in-flight aio requests
> are not a case where that is feasible.

What you've done only works for *your* use-case, but not in general.  Like
in other subsystems, you need to provide hooks on a per file descriptor
basis for quiescing different kinds of file descriptors.  Your current
patch set completely ignores things like usb gadget.  You need to create
infrastructure for restarting i/os after your checkpointing occurs, which
you haven't put any thought into in this patchset.  If you want to discuss
how to do that, fine, but the approach in this patchset simply does not
work in general.  What happens when an aio doesn't complete or takes hours
to complete?

> Checkpointing a live system is not as easy as it seems. The way
> you suggested makes a whole class of live-system snapshots
> impossible. You can't just kill a process, wait for the zombie,
> and then restart the process again: processes are connected in complex
> topologies. You need to restore the pgid and sid of the process,
> the namespaces, shared files (CLONE_FILES and CLONE_FS). All of
> this must be created in a certain order, and there are a lot
> of rules and limitations. You can't just recreate the same process
> in the same place: it's not merely hard, it's impossible. Your
> suggestion kills a big class of use cases, and it's not suitable
> in any way. You may refer to the criu project site if you are interested
> (criu.org).

Point.

> Benjamin, please, could you look at this once again? We really need
> this functionality; it's not an idle wish. Let's discuss the
> way we should implement it, if you don't like the patch.
> 
> There is plenty of functionality in the kernel supporting the concept
> I described. Check out the MSG_PEEK flag for receiving from a socket
> (see unix_dgram_recvmsg()), for example. AIO is now one of the
> last barriers to full snapshot support in criu.
...

Then please start looking at the big picture and think about things other
than short lived disk i/o.  Without some design in the infrastructure to
handle those cases, your solution is incomplete and will potentially leave
us with complex and unsupportable semantics that don't actually solve the
problem you're trying to solve.

Some of the things to think about: you need infrastructure to restart an
aio, which means you need some way of dumping aios that remain in flight,
as otherwise your application will see aios cancelled during checkpointing
that should not have been.  You need to actually cancel aios.  These
details need to be addressed if checkpointing is going to be a robust
feature that works for other than toy use-cases.

-ben
-- 
"Thought is the essence of where you are now."


Re: [PATCH] aio: Add command to wait completion of all requests

2017-06-13 Thread Benjamin LaHaise
On Fri, Jun 09, 2017 at 12:49:34PM +0300, Kirill Tkhai wrote:
> While implementing aio support for the Checkpoint-Restore
> in Userspace project (CRIU), we found that the current
> kernel functionality does not allow waiting for
> completion of all submitted requests. Checkpoint software
> can't determine that there are no in-flight requests, i.e.
> that the process's aio ring memory is ready for dumping.

Simply put: no.  There is no way this command will ever complete under
various conditions - if another thread submits i/o, if the in flight i/o
is waiting on non-disk i/o, etc.  If you want to wait until all aios of a
task are complete, kill the process and wait for the zombie to be reaped.  
This is a simplistic approach to checkpointing that does not solve all
issues related to checkpoint aio.

-ben

> The patch solves this problem. Also, the interface it
> introduces may be useful for other userspace users.
> Here is the new IOCB_CMD_WAIT_ALL cmd, whose idea is to make
> the caller wait until the sum of completed and available
> requests becomes equal to (ctx->nr_events - 1). If they
> are equal, then there are no aio requests in flight in
> the system.
> 
> Since available requests are per-cpu, we need synchronization
> with their other readers and writers. Variable ctx->dead
> is used for that, and we set it to 1 during synchronization.
> Now let's look at how we make sure that concurrent users
> can see it on other cpus.
> 
> There are two cases. When the task is the only user of its
> memory, it races with aio_complete() only. This function
> takes ctx->completion_lock, so we simply use it to synchronize
> during per-cpu reading.
> 
> When there are more users of current->mm, we rely on the fact
> that put_reqs_available() and get_reqs_available() already run
> with interrupts disabled. The primitives local_irq_disable() and
> local_irq_enable() work as rcu_read_lock_sched()/rcu_read_unlock_sched()
> barriers, and the waiter uses synchronize_sched() to propagate
> ctx->dead visibility to other cpus. This RCU primitive works as a
> full memory barrier, so nobody will use the percpu reqs_available
> after we return from it, and the waiting reduces
> to the further actions of the first case. The good point
> is that the solution does not affect existing code
> performance in any way, as it does not change it.
> 
> A couple more words about checkpoint/restore. We especially
> need this functionality, and we are not going to
> dump in-flight requests for now, because experiments
> show that if dumping of a container failed for
> another subsystem's reasons, it's not always possible
> to queue cancelled requests back (file descriptors
> may already be closed, memory pressure). So there is
> nothing here about dumping.
> 
> It seems this functionality will be useful not only
> for us, and others may use the new command like
> other standard aio commands.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  fs/aio.c |   71 +++---
>  include/uapi/linux/aio_abi.h |1 +
>  2 files changed, 67 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index f52d925ee259..447fa4283c7c 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -899,7 +899,11 @@ static void put_reqs_available(struct kioctx *ctx, unsigned nr)
>   struct kioctx_cpu *kcpu;
>   unsigned long flags;
>  
> - local_irq_save(flags);
> + local_irq_save(flags); /* Implies rcu_read_lock_sched() */
> + if (atomic_read(&ctx->dead)) {
> + atomic_add(nr, &ctx->reqs_available);
> + goto unlock;
> + }
>   kcpu = this_cpu_ptr(ctx->cpu);
>   kcpu->reqs_available += nr;
>  
> @@ -907,8 +911,8 @@ static void put_reqs_available(struct kioctx *ctx, unsigned nr)
>   kcpu->reqs_available -= ctx->req_batch;
>   atomic_add(ctx->req_batch, &ctx->reqs_available);
>   }
> -
> - local_irq_restore(flags);
> +unlock:
> + local_irq_restore(flags); /* Implies rcu_read_unlock_sched() */
>  }
>  
>  static bool get_reqs_available(struct kioctx *ctx)
> @@ -917,7 +921,9 @@ static bool get_reqs_available(struct kioctx *ctx)
>   bool ret = false;
>   unsigned long flags;
>  
> - local_irq_save(flags);
> + local_irq_save(flags); /* Implies rcu_read_lock_sched() */
> + if (atomic_read(&ctx->dead))
> + goto out;
>   kcpu = this_cpu_ptr(ctx->cpu);
>   if (!kcpu->reqs_available) {
>   int old, avail = atomic_read(&ctx->reqs_available);
> @@ -937,7 +943,7 @@ static bool get_reqs_available(struct kioctx *ctx)
>   ret = true;
>   kcpu->reqs_available--;
>  out:
> - local_irq_restore(flags);
> + local_irq_restore(flags); /* Implies rcu_read_unlock_sched() */
>   return ret;
>  }
>  
> @@ -1533,6 +1539,58 @@ static ssize_t aio_write(struct kiocb *req, struct iocb *iocb, bool vectored,
>   return ret;
>  }
>  
> +static bool reqs_completed(struct kioctx *ctx)
> +{
> + unsigned available;
> + spin_lock_irq(&ctx->complet

Re: aio: questions with ioctx_alloc() and large num_possible_cpus()

2016-10-05 Thread Benjamin LaHaise
On Wed, Oct 05, 2016 at 02:58:12PM -0300, Mauricio Faria de Oliveira wrote:
> Hi Benjamin,
> 
> On 10/05/2016 02:41 PM, Benjamin LaHaise wrote:
> >I'd suggest increasing the default limit by changing how it is calculated.
> >The current number came about 13 years ago when machines had orders of
> >magnitude less RAM than they do today.
> 
> Thanks for the suggestion.
> 
> Does the default also have implications other than memory usage?
> For example, the concurrency/performance of that many aio contexts running,
> or whether userspace could try to exploit something with a larger number?

Anything's possible when a local user can run code.  It's the same problem 
as determining how much memory can be mlock()ed, or how much i/o a process 
should be allowed to do.  Nothing prevents an app from doing a huge amount 
of readahead() calls to make the system prefetch gigabytes of data.  That 
said, local users tend not to DoS themselves.

> Wondering about it because it can be set based on num_possible_cpus(),
> but that might be really large on high-end systems.

Today's high end systems are tomorrow's desktops...  It probably makes 
sense to implement per-user limits rather than the current global limit, 
and maybe even convert them to an rlimit to better fit in with the 
available frameworks for managing these things.

-ben

> Regards,
> 
> -- 
> Mauricio Faria de Oliveira
> IBM Linux Technology Center

-- 
"Thought is the essence of where you are now."


Re: aio: questions with ioctx_alloc() and large num_possible_cpus()

2016-10-05 Thread Benjamin LaHaise
On Tue, Oct 04, 2016 at 07:55:12PM -0300, Mauricio Faria de Oliveira wrote:
> Hi Benjamin, Kent, and others,
> 
> Would you please comment / answer about this possible problem?
> Any feedback is appreciated.

I'd suggest increasing the default limit by changing how it is calculated.  
The current number came about 13 years ago when machines had orders of 
magnitude less RAM than they do today.  

-ben
-- 
"Thought is the essence of where you are now."


Re: linux-next: build failure after merge of the aio tree

2016-03-19 Thread Benjamin LaHaise
On Wed, Mar 16, 2016 at 02:59:38PM +0100, Arnd Bergmann wrote:
> On Wednesday 16 March 2016 13:12:36 Andy Shevchenko wrote:
> > 
> > > I've also sent a patch that fixes the link error on ARM and that should
> > > work on all other architectures too.
> > 
> > In the case of avr32, signalfd_read() fails. Does your patch help with it as 
> > well?
> > 
> > P.S. Bisecting shows same culprit: 150a0b4905f1 ("aio: add support for
> > async openat()")
> 
> I don't know. What is the symptom on avr32? My patch only removes the
> get_user() instances on 64-bit values and replaces them with a
> single copy_from_user() call.

Which is the wrong fix.  Arch code should be able to handle 64 bit values 
in all the get/put_user() variants.  We use 64 bit variables all over the 
place in interfaces to userspace.

-ben


Re: linux-next: build failure after merge of the aio tree

2016-03-15 Thread Benjamin LaHaise
On Tue, Mar 15, 2016 at 04:19:02PM +, Sudip Mukherjee wrote:
> On Tue, Mar 15, 2016 at 05:46:34PM +1100, Stephen Rothwell wrote:
> > Hi Benjamin,
> > 
> > After merging the aio tree, today's linux-next build (powerpc
> > ppc44x_defconfig) failed like this:
> > 
> > fs/built-in.o: In function `aio_thread_op_foo_at':
> > aio.c:(.text+0x4dab4): undefined reference to `__get_user_bad'
> > aio.c:(.text+0x4daec): undefined reference to `__get_user_bad'
> > 
> > Caused by commit
> > 
> >   150a0b4905f1 ("aio: add support for async openat()")
> > 
> > despite commit
> > 
> >   d2f7a973e11e ("aio: don't use __get_user() for 64 bit values")
> > 
> > This is due to a bug in the powerpc __get_user_check() macro (the return
> > value is defined to be "unsigned long" which is only 32 bits on a 32
> > bit platform).
> 
> m68k allmodconfig and all m32r defconfigs fail while building next-20160315.
> 
> regards
> sudip

I've removed everything from the aio-next.git tree for now.  Will revisit 
after the merge window.

-ben
-- 
"Thought is the essence of where you are now."


Re: linux-next: manual merge of the aio tree with the vfs tree

2016-03-15 Thread Benjamin LaHaise
On Tue, Mar 15, 2016 at 05:35:33AM +, Al Viro wrote:
> On Mon, Mar 14, 2016 at 11:24:38AM -0400, Benjamin LaHaise wrote:
> > On Mon, Mar 14, 2016 at 08:35:23AM +0100, Christoph Hellwig wrote:
> > > The aio changes have either been reviewed negatively or not at all.  That 
> > > tree should be dropped.
> > 
> > That isn't solely your decision.  If you have comments, please provide 
> > constructive feedback, as there are users and use-cases that need this 
> > kind of functionality.
> 
> "This kind of functionality" being what, exactly?  Bypassing code review?
> It's not that you've made trivial mistakes; everyone does from time to time.
> But failing to post patches with non-trivial interactions outside of the
> subsystem you are familiar with (be it on fsdevel or privately to people who
> work with the areas involved) *AND* failing to recognize that the lack
> of review might be a problem even after having been explicitly told so...

I'm not bypassing code review.  The code was sent out back in January; 
you were cc'd on it, and it has been waiting for you to provide feedback 
on the mailing lists, which you haven't done until yesterday.  Given that 
review has not occurred, I fully well don't expect this series to be 
merged until further work is done.

> For fuck sake, you should know better.  You are not a newbie with a pet set
> of half-baked patches for his pet application, refering to (unspecified)
> "users that need this kind of functionality" and getting very surprised when
> those mean, rude folks on development lists inform them that code review is
> more than just a good idea.

No, my comment was based on HCH essentially saying that exposure in 
linux-next is a bad thing and the tree should be removed.  Quite the 
opposite -- it's useful in that it lets people see what is going on and 
lets me know about conflicts with other work.  Removing the tree from 
linux-next gets rid of those benefits, and that is what I was objecting 
to.  Fwiw, I don't think someone should have a veto over what goes in 
linux-next.  People should provide feedback, not a blanket "drop that".

-ben
-- 
"Thought is the essence of where you are now."


Re: linux-next: manual merge of the aio tree with the vfs tree

2016-03-15 Thread Benjamin LaHaise
On Tue, Mar 15, 2016 at 05:19:39AM +, Al Viro wrote:
> On Tue, Mar 15, 2016 at 05:07:12AM +, Al Viro wrote:
> 
> > There *is* a reason for code review.  Or, at least, asking somebody familiar
> > with the code you are working with whether some assumption you are making
> > is true or false.  Me, for example, in our conversation regarding earlier 
> > parts
> > of aio.git queue about a week ago.  Or at any other point.
> 
> While we are at it, 150a0b49 ("aio: add support for async openat()") is also
> crap.  fs_struct and files_struct is nowhere near enough.  And yes, I realize
> that your application probably doesn't step into it.  Which means that these
> patches are just fine for your private kernel.  _Not_ for mainline.
> 
> Reviewed-and-NAKed-by: Al Viro 

You've had two months to make this comment, so I'm glad you've finally 
done so.

-ben
-- 
"Thought is the essence of where you are now."


aio openat Re: [PATCH 07/13] aio: enabled thread based async fsync

2016-03-14 Thread Benjamin LaHaise
On Sat, Jan 23, 2016 at 03:39:22PM +1100, Dave Chinner wrote:
> On Wed, Jan 20, 2016 at 03:07:26PM -0800, Linus Torvalds wrote:
...
> > We could do things like that for the name lookup for openat() too, where
> > we could handle the successful RCU lookup synchronously, but then if we
> > fall out of RCU mode we'd do the thread.
> 
> We'd have to do quite a bit of work to unwind back out to the AIO
> layer before we can dispatch the open operation again in a thread,
> wouldn't we?

I had some time last week to make an aio openat do what it can in 
submit context.  The results are an improvement: when openat is handled 
in submit context it completes in about half the time it takes compared 
to the round trip via the work queue, and it's not terribly much code 
either.

-ben
-- 
"Thought is the essence of where you are now."

 fs/aio.c  |  122 +-
 fs/internal.h |1 
 fs/namei.c|   16 --
 fs/open.c |2 
 include/linux/namei.h |1 
 5 files changed, 117 insertions(+), 25 deletions(-)

commit 5d3d80fcf99287decc4774af01967cebbb0242fd
Author: Benjamin LaHaise 
Date:   Thu Mar 10 17:15:07 2016 -0500

aio: add support for in-submit openat

Using the LOOKUP_RCU infrastructure added for open(), implement
functionality enabling openat() in io_submit() to do a non-blocking
file open operation.  This avoids the overhead of punting to another
kernel thread to complete the open operation when the files and data are
already in the dcache.  This helps cut simple aio openat() from ~60-90K
cycles to ~24-45K cycles on my test system.

Signed-off-by: Benjamin LaHaise 

diff --git a/fs/aio.c b/fs/aio.c
index 0a9309e..67c58b6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -42,6 +42,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include <../mm/internal.h>
 
 #include 
@@ -163,6 +165,7 @@ struct kioctx {
 
 struct aio_kiocb;
 typedef long (*aio_thread_work_fn_t)(struct aio_kiocb *iocb);
+typedef void (*aio_destruct_fn_t)(struct aio_kiocb *iocb);
 
 /*
  * We use ki_cancel == KIOCB_CANCELLED to indicate that a kiocb has been either
@@ -210,6 +213,7 @@ struct aio_kiocb {
 #if IS_ENABLED(CONFIG_AIO_THREAD)
struct task_struct  *ki_cancel_task;
unsigned long   ki_data;
+   unsigned long   ki_data2;
unsigned long   ki_rlimit_fsize;
unsignedki_thread_flags;/* AIO_THREAD_NEED... */
aio_thread_work_fn_tki_work_fn;
@@ -217,6 +221,7 @@ struct aio_kiocb {
struct fs_struct*ki_fs;
struct files_struct *ki_files;
const struct cred   *ki_cred;
+   aio_destruct_fn_t   ki_destruct_fn;
 #endif
 };
 
@@ -1093,6 +1098,8 @@ out_put:
 
 static void kiocb_free(struct aio_kiocb *req)
 {
+   if (req->ki_destruct_fn)
+   req->ki_destruct_fn(req);
if (req->common.ki_filp)
fput(req->common.ki_filp);
if (req->ki_eventfd != NULL)
@@ -1546,6 +1553,18 @@ static void aio_thread_fn(struct work_struct *work)
 ret == -ERESTARTNOHAND || ret == -ERESTART_RESTARTBLOCK))
ret = -EINTR;
 
+   /* Completion serializes cancellation by taking ctx_lock, so
+* aio_complete() will not return until after force_sig() in
+* aio_thread_queue_iocb_cancel().  This should ensure that
+* the signal is pending before being flushed in this thread.
+*/
+   aio_complete(&iocb->common, ret, 0);
+   if (fatal_signal_pending(current))
+   flush_signals(current);
+
+   /* Clean up state after aio_complete() since ki_destruct may still
+* need to access them.
+*/
if (iocb->ki_cred) {
current->cred = old_cred;
put_cred(iocb->ki_cred);
@@ -1558,15 +1577,6 @@ static void aio_thread_fn(struct work_struct *work)
exit_fs(current);
current->fs = old_fs;
}
-
-   /* Completion serializes cancellation by taking ctx_lock, so
-* aio_complete() will not return until after force_sig() in
-* aio_thread_queue_iocb_cancel().  This should ensure that
-* the signal is pending before being flushed in this thread.
-*/
-   aio_complete(&iocb->common, ret, 0);
-   if (fatal_signal_pending(current))
-   flush_signals(current);
 }
 
 /* aio_thread_queue_iocb
@@ -1758,11 +1768,6 @@ static long aio_poll(struct aio_kiocb *req, struct iocb 
*user_iocb, bool compat)
return aio_thread_queue_iocb(req, aio_thread_op_poll, 0);
 }
 
-static long aio_do_openat(int fd, const char *filename, int flags, int mode)
-{
-   return do_sys_open(fd, filename, flags, mode);
-}
-
 static long aio_do_unlinkat(i

Re: linux-next: build failure after merge of the aio tree

2016-03-14 Thread Benjamin LaHaise
On Mon, Mar 14, 2016 at 03:49:15PM +1100, Stephen Rothwell wrote:
> Hi Ben,

...
> OK, so at this point (just to get rid of the build failure I have done this:
...
> Well, you need to negotiate that with the affected architectures.

I put in a patch to use get_user() for now since the 32 bit architectures 
don't seem to have any plans for fixing this issue in a predictable 
timeframe.  There shouldn't be any build failures now -- I've checked ia64, 
i386 and x86_64.  The merge conflict looks trivial, so I won't touch 
those pieces for the time being.

-ben
-- 
"Thought is the essence of where you are now."


Re: [PATCH] x86_32: add support for 64 bit __get_user() v3

2016-03-14 Thread Benjamin LaHaise
Any more comments/feedback on this?  Can this be merged for 4.6?

-ben

On Wed, Mar 09, 2016 at 03:05:56PM -0500, Benjamin LaHaise wrote:
> The existing __get_user() implementation does not support fetching
> 64 bit values on 32 bit x86.  Implement this in a way that does not 
> generate any incorrect warnings as cautioned by Russell King.  Test 
> code available at http://www.kvack.org/~bcrl/x86_32-get_user.tar .
> 
> v2: use __inttype() as suggested by H. Peter Anvin, which cleans the
> code up nicely, and fix things to work on x86_64 as well.
> 
> v3: fix undefined behaviour on 32 bit x86 caused by multiple expansion 
> of ptr in the 64 bit case.
> 
> Signed-off-by: Benjamin LaHaise 
> 
> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> index a4a30e4..8743552 100644
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -333,7 +333,26 @@ do { \
>  } while (0)
>  
>  #ifdef CONFIG_X86_32
> -#define __get_user_asm_u64(x, ptr, retval, errret)   (x) = __get_user_bad()
> +#define __get_user_asm_u64(x, ptr, retval, errret)   \
> +({   \
> + __typeof__(ptr) __ptr = (ptr);  \
> + asm volatile(ASM_STAC "\n"  \
> +  "1:movl %2,%%eax\n"\
> +  "2:movl %3,%%edx\n"\
> +  "3: " ASM_CLAC "\n"\
> +  ".section .fixup,\"ax\"\n" \
> +  "4:mov %4,%0\n"\
> +  "  xorl %%eax,%%eax\n" \
> +  "  xorl %%edx,%%edx\n" \
> +  "  jmp 3b\n"   \
> +  ".previous\n"  \
> +  _ASM_EXTABLE(1b, 4b)   \
> +  _ASM_EXTABLE(2b, 4b)   \
> +  : "=r" (retval), "=A"(x)   \
> +  : "m" (__m(__ptr)), "m" __m(((u32 *)(__ptr)) + 1), \
> +"i" (errret), "0" (retval)); \
> +})
> +
>  #define __get_user_asm_ex_u64(x, ptr)  (x) = __get_user_bad()
>  #else
>  #define __get_user_asm_u64(x, ptr, retval, errret) \
> @@ -420,7 +439,7 @@ do { \
>  #define __get_user_nocheck(x, ptr, size) \
>  ({   \
>   int __gu_err;   \
> - unsigned long __gu_val; \
> + __inttype(*(ptr)) __gu_val; \
>   __uaccess_begin();  \
>   __get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT);\
>   __uaccess_end();\
> 
> -- 
> "Thought is the essence of where you are now."

-- 
"Thought is the essence of where you are now."


Re: linux-next: manual merge of the aio tree with the vfs tree

2016-03-14 Thread Benjamin LaHaise
On Mon, Mar 14, 2016 at 08:35:23AM +0100, Christoph Hellwig wrote:
> The aio changes have either been reviewed negatively or not at all.  That 
> tree should be dropped.

That isn't solely your decision.  If you have comments, please provide 
constructive feedback, as there are users and use-cases that need this 
kind of functionality.

-ben
-- 
"Thought is the essence of where you are now."


[PATCH] x86_32: add support for 64 bit __get_user() v3

2016-03-09 Thread Benjamin LaHaise
The existing __get_user() implementation does not support fetching
64 bit values on 32 bit x86.  Implement this in a way that does not 
generate any incorrect warnings as cautioned by Russell King.  Test 
code available at http://www.kvack.org/~bcrl/x86_32-get_user.tar .

v2: use __inttype() as suggested by H. Peter Anvin, which cleans the
code up nicely, and fix things to work on x86_64 as well.

v3: fix undefined behaviour on 32 bit x86 caused by multiple expansion 
of ptr in the 64 bit case.

Signed-off-by: Benjamin LaHaise 

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index a4a30e4..8743552 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -333,7 +333,26 @@ do { \
 } while (0)
 
 #ifdef CONFIG_X86_32
-#define __get_user_asm_u64(x, ptr, retval, errret) (x) = __get_user_bad()
+#define __get_user_asm_u64(x, ptr, retval, errret) \
+({ \
+   __typeof__(ptr) __ptr = (ptr);  \
+   asm volatile(ASM_STAC "\n"  \
+"1:movl %2,%%eax\n"\
+"2:movl %3,%%edx\n"\
+"3: " ASM_CLAC "\n"\
+".section .fixup,\"ax\"\n" \
+"4:mov %4,%0\n"\
+"  xorl %%eax,%%eax\n" \
+"  xorl %%edx,%%edx\n" \
+"  jmp 3b\n"   \
+".previous\n"  \
+_ASM_EXTABLE(1b, 4b)   \
+_ASM_EXTABLE(2b, 4b)   \
+: "=r" (retval), "=A"(x)   \
+: "m" (__m(__ptr)), "m" __m(((u32 *)(__ptr)) + 1), \
+  "i" (errret), "0" (retval)); \
+})
+
 #define __get_user_asm_ex_u64(x, ptr)  (x) = __get_user_bad()
 #else
 #define __get_user_asm_u64(x, ptr, retval, errret) \
@@ -420,7 +439,7 @@ do { \
 #define __get_user_nocheck(x, ptr, size)   \
 ({ \
int __gu_err;   \
-   unsigned long __gu_val; \
+   __inttype(*(ptr)) __gu_val; \
__uaccess_begin();  \
__get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT);\
__uaccess_end();\

-- 
"Thought is the essence of where you are now."


[PATCH] x86_32: add support for 64 bit __get_user() v2

2016-03-09 Thread Benjamin LaHaise
The existing __get_user() implementation does not support fetching
64 bit values on 32 bit x86.  Implement this in a way that does not 
generate any incorrect warnings as cautioned by Russell King.  Test 
code available at http://www.kvack.org/~bcrl/x86_32-get_user.tar .

v2: use __inttype() as suggested by H. Peter Anvin, which cleans the
code up nicely, and fix things to work on x86_64 as well.  Tested on 
both 32 bit and 64 bit x86.

Signed-off-by: Benjamin LaHaise 

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index a4a30e4..1284da2 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -333,7 +333,25 @@ do { \
 } while (0)
 
 #ifdef CONFIG_X86_32
-#define __get_user_asm_u64(x, ptr, retval, errret) (x) = __get_user_bad()
+#define __get_user_asm_u64(x, ptr, retval, errret) \
+({ \
+   asm volatile(ASM_STAC "\n"  \
+"1:movl %2,%%eax\n"\
+"2:movl %3,%%edx\n"\
+"3: " ASM_CLAC "\n"\
+".section .fixup,\"ax\"\n" \
+"4:mov %4,%0\n"\
+"  xorl %%eax,%%eax\n" \
+"  xorl %%edx,%%edx\n" \
+"  jmp 3b\n"   \
+".previous\n"  \
+_ASM_EXTABLE(1b, 4b)   \
+_ASM_EXTABLE(2b, 4b)   \
+: "=r" (retval), "=A"(x)   \
+: "m" (__m(ptr)), "m" __m(((u32 *)(ptr)) + 1), \
+  "i" (errret), "0" (retval)); \
+})
+
 #define __get_user_asm_ex_u64(x, ptr)  (x) = __get_user_bad()
 #else
 #define __get_user_asm_u64(x, ptr, retval, errret) \
@@ -420,7 +438,7 @@ do { \
 #define __get_user_nocheck(x, ptr, size)   \
 ({ \
int __gu_err;   \
-   unsigned long __gu_val; \
+   __inttype(*(ptr)) __gu_val; \
__uaccess_begin();  \
__get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT);\
__uaccess_end();\

-- 
"Thought is the essence of where you are now."


Re: [PATCH] x86_32: add support for 64 bit __get_user()

2016-03-09 Thread Benjamin LaHaise
On Wed, Mar 09, 2016 at 10:22:50AM -0800, H. Peter Anvin wrote:
> On 03/09/2016 09:50 AM, Benjamin LaHaise wrote:
> > On Wed, Mar 09, 2016 at 09:36:30AM -0800, H. Peter Anvin wrote:
> >> On March 9, 2016 9:22:25 AM PST, Benjamin LaHaise  wrote:
> >>> The existing __get_user() implementation does not support fetching
> >>> 64 bit values on 32 bit x86.  Implement this in a way that does not 
> >>> generate any incorrect warnings as cautioned by Russell King.  Test 
> >>> code available at http://www.kvack.org/~bcrl/x86_32-get_user.tar .  
> > ...
> >> Weird.  I could swear we had already fixed this a few years ago.
> > 
> > That surprised me as well, but Russell raised the fact that the approaches 
> > previously tried on 32 bit architectures had caused various incorrect 
> > compiler warnings for certain obscure cases -- see the code in 
> > test_module.c 
> > in that URL that Russell provided to demonstrate the problem across all the 
> > corner cases.
> 
> Oh, I see... I implemented it for put but not get... weird.  You may
> want to look at the __inttype() macro defined earlier in this file; it
> might be useful.

Ah, interesting.  I'll look at that.

> I presume you have already seen:
> 
> >> fs/select.c:710: Error: operand type mismatch for `movq'
> >> fs/select.c:714: Error: incorrect register `%cx' used with `q' suffix
>fs/select.c:711: Error: operand type mismatch for `movq'
> >> fs/select.c:715: Error: incorrect register `%si' used with `q' suffix
> --
>fs/aio.c: Assembler messages:
> >> fs/aio.c:1606: Error: operand type mismatch for `movq'
> >> fs/aio.c:1610: Error: incorrect register `%si' used with `q' suffix
> 
> ... which implies it used 16-bit registers for 64-bit operations when
> compiling for 64 bits.

Yup, will respin shortly.

-ben
-- 
"Thought is the essence of where you are now."


Re: [PATCH] x86_32: add support for 64 bit __get_user()

2016-03-09 Thread Benjamin LaHaise
On Wed, Mar 09, 2016 at 09:36:30AM -0800, H. Peter Anvin wrote:
> On March 9, 2016 9:22:25 AM PST, Benjamin LaHaise  wrote:
> >The existing __get_user() implementation does not support fetching
> >64 bit values on 32 bit x86.  Implement this in a way that does not 
> >generate any incorrect warnings as cautioned by Russell King.  Test 
> >code available at http://www.kvack.org/~bcrl/x86_32-get_user.tar .  
...
> Weird.  I could swear we had already fixed this a few years ago.

That surprised me as well, but Russell raised the fact that the approaches 
previously tried on 32 bit architectures had caused various incorrect 
compiler warnings for certain obscure cases -- see the code in test_module.c 
in that URL that Russell provided to demonstrate the problem across all the 
corner cases.

-ben
-- 
"Thought is the essence of where you are now."


[PATCH] x86_32: add support for 64 bit __get_user()

2016-03-09 Thread Benjamin LaHaise
The existing __get_user() implementation does not support fetching
64 bit values on 32 bit x86.  Implement this in a way that does not 
generate any incorrect warnings as cautioned by Russell King.  Test 
code available at http://www.kvack.org/~bcrl/x86_32-get_user.tar .  

Signed-off-by: Benjamin LaHaise 

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index a4a30e4..2d0607a 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -333,7 +333,23 @@ do { \
 } while (0)
 
 #ifdef CONFIG_X86_32
-#define __get_user_asm_u64(x, ptr, retval, errret) (x) = __get_user_bad()
+#define __get_user_asm_u64(x, addr, err, errret)   \
+   asm volatile(ASM_STAC "\n"  \
+"1:movl %2,%%eax\n"\
+"2:movl %3,%%edx\n"\
+"3: " ASM_CLAC "\n"\
+".section .fixup,\"ax\"\n" \
+"4:mov %4,%0\n"\
+"  xorl %%eax,%%eax\n" \
+"  xorl %%edx,%%edx\n" \
+"  jmp 3b\n"   \
+".previous\n"  \
+_ASM_EXTABLE(1b, 4b)   \
+_ASM_EXTABLE(2b, 4b)   \
+: "=r" (err), "=A"(x)  \
+: "m" (__m(addr)), "m" __m(((u32 *)(addr)) + 1),   \
+  "i" (errret), "0" (err))
+
 #define __get_user_asm_ex_u64(x, ptr)  (x) = __get_user_bad()
 #else
 #define __get_user_asm_u64(x, ptr, retval, errret) \
@@ -420,11 +436,20 @@ do { \
 #define __get_user_nocheck(x, ptr, size)   \
 ({ \
int __gu_err;   \
-   unsigned long __gu_val; \
-   __uaccess_begin();  \
-   __get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT);\
-   __uaccess_end();\
-   (x) = (__force __typeof__(*(ptr)))__gu_val; \
+   if (size == 8) {\
+   unsigned long __gu_val[2];  \
+   __gu_err = 0;   \
+   __uaccess_begin();  \
+   __get_user_asm_u64(__gu_val, ptr, __gu_err, -EFAULT);   \
+   __uaccess_end();\
+   (x) = *(__force __typeof__((ptr)))__gu_val; \
+   } else {\
+   unsigned long __gu_val; \
+   __uaccess_begin();  \
+   __get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT); \
+   __uaccess_end();\
+   (x) = (__force __typeof__(*(ptr)))__gu_val; \
+   }   \
__builtin_expect(__gu_err, 0);  \
 })
 
-- 
"Thought is the essence of where you are now."


Re: linux-next: Tree for Feb 24

2016-02-24 Thread Benjamin LaHaise
On Wed, Feb 24, 2016 at 07:32:17AM +0100, Sedat Dilek wrote:
> On Wed, Feb 24, 2016 at 6:34 AM, Stephen Rothwell  
> wrote:
> > Hi all,
> >
> > Changes since 20160223:
> >
> ...
> > The aio tree still had a build failure so I used the version from
> > next-20160111.
> >
> 
> Might be good to poke the maintainer as I am seeing this for a long
> time in Linux-next.

These are architecture code related build failures that arch maintainers 
need to fix.  Avoiding pulling the tree allows people to ignore the issue, 
which isn't going to get things fixed.  I provided an example how to 
implement the 64 bit __get_user() without generating warnings, and it is 
now up to maintainers to adapt it for their architecture.

-ben

> - Sedat -

-- 
"Thought is the essence of where you are now."


Re: linux-next: build failure after merge of the aio tree

2016-02-04 Thread Benjamin LaHaise
On Thu, Feb 04, 2016 at 04:17:42PM +, Russell King - ARM Linux wrote:
> __gu_val will be 32-bit, even when you're wanting a 64-bit quantity.
> That's where the fun and games start...

Okay, I figured out how to do it: instead of using a 64 bit unsigned long 
long for __gu_val, use an array of 2 unsigned longs.  See the patch below 
which I verified boots, passes your tests and doesn't truncate 64 bit 
values.  Comments?  A simple test module to verify things is located at 
http://www.kvack.org/~bcrl/test_module2.c .

-ben
-- 
"Thought is the essence of where you are now."


diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 09b1b0a..53244ae 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -326,7 +326,22 @@ do {                                                   \
 } while (0)
 
 #ifdef CONFIG_X86_32
-#define __get_user_asm_u64(x, ptr, retval, errret) (x) = __get_user_bad()
+#define __get_user_asm_u64(x, addr, err, errret)   \
+   asm volatile(ASM_STAC "\n"  \
+"1:movl %2,%%eax\n"\
+"2:movl %3,%%edx\n"\
+"3: " ASM_CLAC "\n"\
+".section .fixup,\"ax\"\n" \
+"4:mov %4,%0\n"\
+"  xorl %%eax,%%eax\n" \
+"  xorl %%edx,%%edx\n" \
+"  jmp 3b\n"   \
+".previous\n"  \
+_ASM_EXTABLE(1b, 4b)   \
+_ASM_EXTABLE(2b, 4b)   \
+: "=r" (err), "=A"(x)  \
+: "m" (__m(addr)), "m" __m(((u32 *)addr) + 1), "i" 
(errret), "0" (err))
+
 #define __get_user_asm_ex_u64(x, ptr)  (x) = __get_user_bad()
 #else
 #define __get_user_asm_u64(x, ptr, retval, errret) \
@@ -407,9 +422,16 @@ do {                                                   \
 #define __get_user_nocheck(x, ptr, size)   \
 ({ \
int __gu_err;   \
-   unsigned long __gu_val; \
-   __get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT);\
-   (x) = (__force __typeof__(*(ptr)))__gu_val; \
+   if (size == 8) {\
+   unsigned long __gu_val[2];  \
+   __gu_err = 0;   \
+   __get_user_asm_u64(__gu_val, ptr, __gu_err, -EFAULT);   \
+   (x) = *(__force __typeof__((ptr)))__gu_val; \
+   } else {\
+   unsigned long __gu_val; \
+   __get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT); \
+   (x) = (__force __typeof__(*(ptr)))__gu_val; \
+   }   \
__builtin_expect(__gu_err, 0);  \
 })
 


Re: linux-next: build failure after merge of the aio tree

2016-02-04 Thread Benjamin LaHaise
On Thu, Feb 04, 2016 at 04:17:42PM +, Russell King - ARM Linux wrote:
> That's the easy bit!
> 
> The problem you're going to run into is here:
> 
> #define __get_user_nocheck(x, ptr, size)\
> ({  \
> int __gu_err;   \
> unsigned long __gu_val; \
> __uaccess_begin();  \
> __get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT);\
> __uaccess_end();\
> (x) = (__force __typeof__(*(ptr)))__gu_val; \
> 
> __gu_val will be 32-bit, even when you're wanting a 64-bit quantity.
> That's where the fun and games start...

You're right -- it's quite non-trivial.  How evil would it be to make a 
separate __get_user64() macro?

-ben
-- 
"Thought is the essence of where you are now."


Re: linux-next: build failure after merge of the aio tree

2016-02-04 Thread Benjamin LaHaise
On Thu, Feb 04, 2016 at 04:17:42PM +, Russell King - ARM Linux wrote:
> On Thu, Feb 04, 2016 at 11:01:01AM -0500, Benjamin LaHaise wrote:
> > On Thu, Feb 04, 2016 at 02:39:07PM +, Russell King - ARM Linux wrote:
> > > However, this one should warn:
> > > 
> > > int test_wrong(char **v, const char **p)
> > > { return __get_user(*v, p); }
> > > 
> > > Good luck (I think you'll need lots of it to get a working solution)! :)
> > 
> > This works with your test cases on x86-32.  Note that it's only compile + 
> > link tested at present.
> 
> That's the easy bit!
> 
> The problem you're going to run into is here:
> 
> #define __get_user_nocheck(x, ptr, size)\
> ({  \
> int __gu_err;   \
> unsigned long __gu_val; \
> __uaccess_begin();  \
> __get_user_size(__gu_val, (ptr), (size), __gu_err, -EFAULT);\
> __uaccess_end();\
> (x) = (__force __typeof__(*(ptr)))__gu_val; \
> 
> __gu_val will be 32-bit, even when you're wanting a 64-bit quantity.
> That's where the fun and games start...

Ugh.  You're making me install a 32 bit distro!

-ben

> -- 
> RMK's Patch system: http://www.arm.linux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.

-- 
"Thought is the essence of where you are now."


Re: linux-next: build failure after merge of the aio tree

2016-02-04 Thread Benjamin LaHaise
On Thu, Feb 04, 2016 at 02:39:07PM +, Russell King - ARM Linux wrote:
> However, this one should warn:
> 
> int test_wrong(char **v, const char **p)
> { return __get_user(*v, p); }
> 
> Good luck (I think you'll need lots of it to get a working solution)! :)

This works with your test cases on x86-32.  Note that it's only compile + 
link tested at present.

-ben
-- 
"Thought is the essence of where you are now."

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 09b1b0a..d8834c2 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -326,7 +326,21 @@ do {                                                   \
 } while (0)
 
 #ifdef CONFIG_X86_32
-#define __get_user_asm_u64(x, ptr, retval, errret) (x) = __get_user_bad()
+#define __get_user_asm_u64(x, addr, err, errret)   \
+   asm volatile(ASM_STAC "\n"  \
+"1:movl %2,%%eax\n"\
+"  movl %3,%%edx\n"\
+"2: " ASM_CLAC "\n"\
+".section .fixup,\"ax\"\n" \
+"3:mov %4,%0\n"\
+"  xorl %%eax,%%eax\n" \
+"  xorl %%edx,%%edx\n" \
+"  jmp 2b\n"   \
+".previous\n"  \
+_ASM_EXTABLE(1b, 3b)   \
+: "=r" (err), "=A"(x)  \
+: "m" (__m(addr)), "m" __m(((u32 *)addr) + 1), "i" 
(errret), "0" (err))
+
 #define __get_user_asm_ex_u64(x, ptr)  (x) = __get_user_bad()
 #else
 #define __get_user_asm_u64(x, ptr, retval, errret) \


Re: linux-next: build failure after merge of the aio tree

2016-02-04 Thread Benjamin LaHaise
On Thu, Feb 04, 2016 at 02:12:53PM +, Russell King - ARM Linux wrote:
> Hence, __get_user() on x86-32 with a 64-bit quantity results in
> __get_user_bad() being called, which is an undefined function.
> Only if you build with x86-64 support enabled (iow, CONFIG_X86_32 not
> defined) then you get the 64-bit __get_user() support.
> 
> Given this, I fail to see how x86-32 can possibly work.

You're right; mea culpa.  It compiles without warning on x86-32, but it 
does not link.  I still think this is broken architecture stupidity since 
put_user() works for 64 bit data types.

-ben
-- 
"Thought is the essence of where you are now."


Re: linux-next: build failure after merge of the aio tree

2016-02-04 Thread Benjamin LaHaise
On Thu, Feb 04, 2016 at 01:50:56PM +, Russell King - ARM Linux wrote:
> > I am still convinced that this is an architecture issue.  Given that 64 bit 
> > values work in the *get_user implementations on other architectures, I see 
> > no reason there should need to be a workaround for this in common code.
> 
> So you're happy to break x86-32 then...

x86-32 works fine.

-ben
-- 
"Thought is the essence of where you are now."


Re: linux-next: build failure after merge of the aio tree

2016-02-04 Thread Benjamin LaHaise
On Thu, Feb 04, 2016 at 01:19:59PM +1100, Stephen Rothwell wrote:
> Hi Benjamin,
> 
> On Fri, 29 Jan 2016 13:03:39 +0100 Geert Uytterhoeven  
> wrote:
> >
> > On Fri, Jan 29, 2016 at 12:30 PM, Russell King - ARM Linux
> >  wrote:
> > >> Background: new aio code is adding __get_user() calls referencing 64
> > >> bit quantities (__u64 and __s64).  
> > >
> > > There's lots more architectures which do not support 64-bit get_user()
> > > _or_ __get_user(): avr32, blackfin, metag for example, and m68k which
> > > has this interesting thing "/* case 8: disabled because gcc-4.1 has a
> > > broken typeof \" in its *get_user() implementation.  
> > 
> > And if you enable it again, you get lots of "warning: cast to pointer from
> > integer of different size", like you mentioned.
> 
> Any thoughts?  I am still using the version of the aio tree from
> next-20160111.

I am still convinced that this is an architecture issue.  Given that 64 bit 
values work in the *get_user implementations on other architectures, I see 
no reason there should need to be a workaround for this in common code.

-ben

> -- 
> Cheers,
> Stephen Rothwell

-- 
"Thought is the essence of where you are now."


Re: [mm -next] mapping->tree_lock inconsistent lock state

2016-02-03 Thread Benjamin LaHaise
On Wed, Feb 03, 2016 at 01:45:09PM -0500, Johannes Weiner wrote:
> On Thu, Feb 04, 2016 at 12:36:33AM +0900, Sergey Senozhatsky wrote:
> > Hello,
> > 
> > next-20160203
> > 
> > [ 3587.997451] =
> > [ 3587.997453] [ INFO: inconsistent lock state ]
> > [ 3587.997456] 4.5.0-rc2-next-20160203-dbg-7-g37a0a9d-dirty #377 Not tainted
> > [ 3587.997457] -
> > [ 3587.997459] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
> 
> Thanks Sergey. Vladimir sent a patch to move mem_cgroup_migrate() out
> of the IRQ-disabled section:
> 
> http://marc.info/?l=linux-mm&m=145452460721208&w=2
> 
> It looks like Vladimir's original message didn't make it to linux-mm,
> only my reply to it; CC Ben LaHaise.

There's not much I can do when Google decides to filter it.

-ben
-- 
"Thought is the essence of where you are now."


Re: [PATCH 07/13] aio: enabled thread based async fsync

2016-01-22 Thread Benjamin LaHaise
On Sat, Jan 23, 2016 at 03:24:49PM +1100, Dave Chinner wrote:
> On Wed, Jan 20, 2016 at 04:56:30PM -0500, Benjamin LaHaise wrote:
> > On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> > > Filesystems *must take locks* in the IO path. We have to serialise
> > > against truncate and other operations at some point in the IO path
> > > (e.g. block mapping vs concurrent allocation and/or removal), and
> > > that can only be done sanely with sleeping locks.  There is no way
> > > of knowing in advance if we are going to block, and so either we
> > > always use threads for IO submission or we accept that occasionally
> > > the AIO submission will block.
> > 
> > I never said we don't take locks.  Still, we can be more intelligent 
> > about when and where we do so.  With the nonblocking pread() and pwrite() 
> > changes being proposed elsewhere, we can do the part of the I/O that 
> > doesn't block in the submitter, which is a huge win when possible.
> > 
> > As it stands today, *every* buffered write takes i_mutex immediately 
> > on entering ->write().  That one issue alone accounts for a nearly 10x 
> > performance difference between an O_SYNC write and an O_DIRECT write, 
> 
> Yes, that locking is for correct behaviour, not for performance
> reasons.  The i_mutex is providing the required semantics for POSIX
> write(2) functionality - writes must serialise against other reads
> and writes so that they are completed atomically w.r.t. other IO.
> i.e. writes to the same offset must not interleave, nor should reads
> be able to see partial data from a write in progress.

No, the locks are not *required* for POSIX semantics, they are a legacy
of how Linux filesystem code has been implemented and how we ensure the
necessary internal consistency needed inside our filesystems is
provided.  There are other ways to achieve the required semantics that
do not involve a single giant lock for the entire file/inode.  And no, I
am not saying that doing this is simple or easy to do.

-ben

> Direct IO does not conform to POSIX concurrency standards, so we
> don't have to serialise concurrent IO against each other.
> 
> > and using O_SYNC writes is a legitimate use-case for users who want 
> > caching of data by the kernel (duplicating that functionality is a huge 
> > amount of work for an application, plus if you want the cache to be 
> > persistent between runs of an app, you have to get the kernel to do it).
> 
> Yes, but you take what you get given. Buffered IO sucks in many ways;
> this is just one of them.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com

-- 
"Thought is the essence of where you are now."


Re: [PATCH -next] aio: Fix compile error due to unexpected use of cmpxchg()

2016-01-22 Thread Benjamin LaHaise
On Fri, Jan 22, 2016 at 10:14:39AM -0800, Guenter Roeck wrote:
> cmpxchg() on some architectures (ia64) doesn't like functions as parameters.
> This results in the following compile error on the affected architectures.
> 
> fs/aio.c: In function 'aio_thread_fn':
> fs/aio.c:1499:1: error: cast specifies function type

> Fixes: 6a81013efc40 ("aio: add support for IOCB_CMD_POLL via aio thread helper")

Ah, I was wondering if there was an easy way to fix this bug.  Applied -- 
thanks!

-ben

> Cc: Benjamin LaHaise 
> Signed-off-by: Guenter Roeck 
> ---
> No idea if this is the correct or an acceptable fix.
> If not, please consider this to be a bug report.
> 
>  fs/aio.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 56bcdf4105f4..229a91e391df 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1495,8 +1495,8 @@ static void aio_thread_fn(struct work_struct *work)
>* cancellation if it has not already occurred.
>*/
>   old_cancel = cmpxchg(&iocb->ki_cancel,
> -  aio_thread_queue_iocb_cancel_early,
> -  aio_thread_queue_iocb_cancel);
> +  (kiocb_cancel_fn *)aio_thread_queue_iocb_cancel_early,
> +  (kiocb_cancel_fn *)aio_thread_queue_iocb_cancel);
>   if (old_cancel != KIOCB_CANCELLED)
>   ret = iocb->ki_work_fn(iocb);
>   else
> -- 
> 2.1.4

-- 
"Thought is the essence of where you are now."


Re: [PATCH 07/13] aio: enabled thread based async fsync

2016-01-20 Thread Benjamin LaHaise
On Thu, Jan 21, 2016 at 08:45:46AM +1100, Dave Chinner wrote:
> Filesystems *must take locks* in the IO path. We have to serialise
> against truncate and other operations at some point in the IO path
> (e.g. block mapping vs concurrent allocation and/or removal), and
> that can only be done sanely with sleeping locks.  There is no way
> of knowing in advance if we are going to block, and so either we
> always use threads for IO submission or we accept that occasionally
> the AIO submission will block.

I never said we don't take locks.  Still, we can be more intelligent 
about when and where we do so.  With the nonblocking pread() and pwrite() 
changes being proposed elsewhere, we can do the part of the I/O that 
doesn't block in the submitter, which is a huge win when possible.

As it stands today, *every* buffered write takes i_mutex immediately 
on entering ->write().  That one issue alone accounts for a nearly 10x 
performance difference between an O_SYNC write and an O_DIRECT write, 
and using O_SYNC writes is a legitimate use-case for users who want 
caching of data by the kernel (duplicating that functionality is a huge 
amount of work for an application, plus if you want the cache to be 
persistent between runs of an app, you have to get the kernel to do it).

-ben

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com

-- 
"Thought is the essence of where you are now."


Re: [PATCH 07/13] aio: enabled thread based async fsync

2016-01-20 Thread Benjamin LaHaise
On Wed, Jan 20, 2016 at 12:29:32PM -0800, Linus Torvalds wrote:
> On Wed, Jan 20, 2016 at 11:59 AM, Dave Chinner  wrote:
> >>
> >> Are there other users outside of Solace? It would be good to get comments..
> >
> > I know of quite a few storage/db products that use AIO. The most
> > recent high profile project that have been reporting issues with AIO
> > on XFS is http://www.scylladb.com/. That project is architected
> > around non-blocking AIO for scalability reasons...
> 
> I was more wondering about the new interfaces, making sure that the
> feature set actually matches what people want to do..

I suspect this will be an ongoing learning exercise as people start to use 
the new functionality and find gaps in terms of what is needed.  Certainly 
there is a bunch of stuff we need to add to cover the cases where disk i/o 
is required.  getdents() is one example, but the ABI issues we have with it 
are somewhat more complicated given the history associated with that 
interface.

> That said, I also agree that it would be interesting to hear what the
> performance impact is for existing performance-sensitive users. Could
> we make that "aio_may_use_threads()" case be unconditional, making
> things simpler?

Making it unconditional is a goal, but some work is required before that 
can be the case.  The O_DIRECT issue is one such matter -- it requires some 
changes to the filesystems to ensure that they adhere to the non-blocking 
nature of the new interface (ie taking i_mutex is a Bad Thing that users 
really do not want to be exposed to; if taking it blocks, the code should 
punt to a helper thread).  Additional auditing of some of the read/write 
implementations is also required, which will likely need some minor changes 
in things like sysfs and other weird functionality we have.  Having the 
flag reflects that while the functionality is useful, not all of the bugs 
have been worked out yet.

What's the desired approach to merge these changes?  Does it make sense 
to merge what is ready now and prepare the next round of changes for 4.6?  
Or is it more important to grow things to a more complete state before 
merging?

Regards,

-ben

>   Linus

-- 
"Thought is the essence of where you are now."


Re: int overflow in io_getevents

2016-01-07 Thread Benjamin LaHaise
On Thu, Jan 07, 2016 at 04:37:43PM +0100, Dmitry Vyukov wrote:
> pass ts to the function

Yeah, I should have had my morning coffee before hitting send.  Updated 
below, and hopefully final.  Checked with a test program to confirm that 
the huge value of seconds in timespec correctly waits, and that negative 
or other invalid values fail with EINVAL (download from
http://www.kvack.org/~bcrl/aio-io_getevents-timespec.c ).

-ben
-- 
"Thought is the essence of where you are now."

commit 49b78150bc5762c58cfb8b19a859c354cf1a71ac
Author: Benjamin LaHaise 
Date:   Thu Jan 7 10:37:58 2016 -0500

aio: handle integer overflow in io_getevents() timespec usage

Dmitry Vyukov reported an integer overflow in io_getevents() when
running a fuzzer.  Upon investigation, the trigger appears to be that
an invalid value for the tv_sec or tv_nsec was passed in which is not
handled by timespec_to_ktime().  This patch fixes that by making
io_getevents() return -EINVAL when timespec_valid() checks fail.  We
use timespec_valid() instead of timespec_valid_strict() to avoid issues
caused by userspace not knowing the cutoff for KTIME_SEC_MAX.

Reported-by: Dmitry Vyukov 
Signed-off-by: Benjamin LaHaise 

diff --git a/fs/aio.c b/fs/aio.c
index 155f842..e0d5398 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1269,6 +1269,8 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
 
        if (unlikely(copy_from_user(&ts, timeout, sizeof(ts))))
                return -EFAULT;
+   if (!timespec_valid(&ts))
+   return -EINVAL;
 
until = timespec_to_ktime(ts);
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: int overflow in io_getevents

2016-01-07 Thread Benjamin LaHaise
On Thu, Jan 07, 2016 at 10:12:02AM +0100, Dmitry Vyukov wrote:
...
> Sorry, but the following program still prints -9223372036562067969. I
> think timespec_valid check will do.

Ah, right.  Yes, using timespec_valid() instead of  timespec_valid_strict() 
as initially proposed will address my concerns.  Updated commit below.

-ben
-- 
"Thought is the essence of where you are now."

commit 3a55c535cf3836257434518bd6bc11464c493492
Author: Benjamin LaHaise 
Date:   Thu Jan 7 10:25:44 2016 -0500

aio: handle integer overflow in io_getevents() timespec usage

Dmitry Vyukov reported an integer overflow in io_getevents() when
running a fuzzer.  Upon investigation, the trigger appears to be that
an invalid value for the tv_sec or tv_nsec was passed in which is not
handled by timespec_to_ktime().  This patch fixes that by making
io_getevents() return -EINVAL when timespec_valid() checks fail.  We
use timespec_valid() instead of timespec_valid_strict() to avoid issues
caused by userspace not knowing the cutoff for KTIME_SEC_MAX.

Reported-by: Dmitry Vyukov 
    Signed-off-by: Benjamin LaHaise 

diff --git a/fs/aio.c b/fs/aio.c
index 155f842..d161a2f 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1269,6 +1269,8 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
 
        if (unlikely(copy_from_user(&ts, timeout, sizeof(ts))))
                return -EFAULT;
+       if (!timespec_valid(&ts))
+               return -EINVAL;
 
until = timespec_to_ktime(ts);
}


Re: int overflow in io_getevents

2016-01-06 Thread Benjamin LaHaise
On Wed, Dec 16, 2015 at 07:38:33PM +0100, Dmitry Vyukov wrote:
> > Yup, looks correct. Will you send a patch?
> 
> I've drafted the verification:
> 
> @@ -1269,6 +1269,8 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
> 
> if (unlikely(copy_from_user(&ts, timeout, sizeof(ts))))
> return -EFAULT;
> +   if (!timespec_valid_strict(&ts))
> +   return -EINVAL;
> 
> until = timespec_to_ktime(ts);
> }
> 
> But now I am thinking whether it is the right solution.
> First, user does not know about KTIME_MAX, so it is not unreasonable
> to pass timespec{INT64_MAX, INT64_MAX} as timeout expecting that it
> will block for a long time. And it actually probably mostly works now,
> because after the overflow you still get something large with high
> probability. If we do the fix, then users will need to pass seconds <
> KTIME_MAX, while they don't know KTIME_MAX value.
> Second, there seems to be more serious issue in ktime_set() which
> checks seconds for KTIME_MAX, but on the next line addition still
> overflows int64.
> Thoughts?

I finally had some time to look over this after the holidays, and I 
don't think using timespec_valid_strict() is the right approach here, 
as userspace will have no idea what KTIME_MAX is.  Instead, I think the 
right approach is to -EINVAL for negative values (which should avoid 
the overflow), and to allow too large values to be silently truncated 
by timespec_to_ktime().  The truncation doesn't matter all that much 
given that it's in the hundreds of years ballpark.  I'll push the patch 
below if there are no objections.

    -ben
-- 
"Thought is the essence of where you are now."

commit 4304367826d0df42086ef24428c6c277acd822a9
Author: Benjamin LaHaise 
Date:   Wed Jan 6 12:46:12 2016 -0500

aio: handle integer overflow in io_getevents() timespec usage

Dmitry Vyukov reported an integer overflow in io_getevents() when
running a fuzzer.  Upon investigation, the trigger appears to be that a
negative value for the tv_sec or tv_nsec was passed in which is not
handled by timespec_to_ktime().  This patch fixes that by making
io_getevents() return -EINVAL when negative timeouts are passed in.

Signed-off-by: Benjamin LaHaise 

diff --git a/fs/aio.c b/fs/aio.c
index 155f842..f325ed4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1269,6 +1269,8 @@ static long read_events(struct kioctx *ctx, long min_nr, long nr,
 
        if (unlikely(copy_from_user(&ts, timeout, sizeof(ts))))
                return -EFAULT;
+   if ((ts.tv_sec < 0) || (ts.tv_nsec < 0))
+   return -EINVAL;
 
until = timespec_to_ktime(ts);
}


Re: use-after-free in sock_wake_async

2015-11-24 Thread Benjamin LaHaise
On Tue, Nov 24, 2015 at 04:30:01PM -0500, Jason Baron wrote:
> So looking at this trace I think its the other->sk_socket that gets
> freed and then we call sk_wake_async() on it.
> 
> We could I think grab the socket reference there with unix_state_lock(),
> since that is held by unix_release_sock() before the final iput() is called.
> 
> So something like below might work (compile tested only):

That just adds the performance regression back in.  It should be possible 
to protect the other socket dereference using RCU.  I haven't had time to 
look at this yet today, but will try to find some time this evening to come 
up with a suggested patch.

-ben

> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index aaa0b58..2b014f1 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -196,6 +196,19 @@ static inline int unix_recvq_full(struct sock const
> *sk)
>   return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
>  }
> 
> +struct socket *unix_peer_get_socket(struct sock *s)
> +{
> + struct socket *peer;
> +
> + unix_state_lock(s);
> + peer = s->sk_socket;
> + if (peer)
> + __iget(SOCK_INODE(s->sk_socket));
> + unix_state_unlock(s);
> +
> + return peer;
> +}
> +
>  struct sock *unix_peer_get(struct sock *s)
>  {
>   struct sock *peer;
> @@ -1639,6 +1652,7 @@ static int unix_stream_sendmsg(struct socket
> *sock, struct msghdr *msg,
>  {
>   struct sock *sk = sock->sk;
>   struct sock *other = NULL;
> + struct socket *other_socket = NULL;
>   int err, size;
>   struct sk_buff *skb;
>   int sent = 0;
> @@ -1662,7 +1676,10 @@ static int unix_stream_sendmsg(struct socket
> *sock, struct msghdr *msg,
>   } else {
>   err = -ENOTCONN;
>   other = unix_peer(sk);
> - if (!other)
> + if (other)
> + other_socket = unix_peer_get_socket(other);
> +
> + if (!other_socket)
>   goto out_err;
>   }
> 
> @@ -1721,6 +1738,9 @@ static int unix_stream_sendmsg(struct socket
> *sock, struct msghdr *msg,
>   sent += size;
>   }
> 
> + if (other_socket)
> + iput(SOCK_INODE(other_socket));
> +
>   scm_destroy(&scm);
> 
>   return sent;
> @@ -1733,6 +1753,8 @@ pipe_err:
>   send_sig(SIGPIPE, current, 0);
>   err = -EPIPE;
>  out_err:
> + if (other_socket)
> + iput(SOCK_INODE(other_socket));
>   scm_destroy(&scm);
>   return sent ? : err;
>  }

-- 
"Thought is the essence of where you are now."


Re: [PATCH] aio: Add few code style fixes

2015-11-11 Thread Benjamin LaHaise
On Wed, Nov 11, 2015 at 11:28:08PM +0800, 刘长冬 wrote:
> >From 1609d68dee344925d182631922cd98790109588b Mon Sep 17 00:00:00 2001
> From: Liu Changdong 
> Date: Tue, 10 Nov 2015 00:04:18 +0800
> Subject: [PATCH] aio: Add few code style fixes
> 
> Add a blank line after declarations
> 
> Signed-off-by: Liu Changdong 
> ---
>  fs/aio.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 155f842..48e4fb0 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -208,6 +208,7 @@ static struct file *aio_private_file(struct kioctx
> *ctx, loff_t nr_pages)
>   struct file *file;
>   struct path path;
>   struct inode *inode = alloc_anon_inode(aio_mnt->mnt_sb);
> +
>   if (IS_ERR(inode))
>   return ERR_CAST(inode);

Your patch is white space damaged.  Please resend in plain text without 
HTML in a format that can be applied using git-am, and test that by sending 
the email to yourself before reposting.

-ben

> @@ -269,6 +270,7 @@ __initcall(aio_setup);
>  static void put_aio_ring_file(struct kioctx *ctx)
>  {
>   struct file *aio_ring_file = ctx->aio_ring_file;
> +
>   if (aio_ring_file) {
>   truncate_setsize(aio_ring_file->f_inode, 0);
> 
> @@ -293,6 +295,7 @@ static void aio_free_ring(struct kioctx *ctx)
> 
>   for (i = 0; i < ctx->nr_pages; i++) {
>   struct page *page;
> +
>   pr_debug("pid(%d) [%d] page->count=%d\n", current->pid, i,
>   page_count(ctx->ring_pages[i]));
>   page = ctx->ring_pages[i];
> @@ -475,6 +478,7 @@ static int aio_setup_ring(struct kioctx *ctx)
> 
>   for (i = 0; i < nr_pages; i++) {
>   struct page *page;
> +
>   page = find_or_create_page(file->f_inode->i_mapping,
> i, GFP_HIGHUSER | __GFP_ZERO);
>   if (!page)
> @@ -1352,6 +1356,7 @@ out:
>  SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
>  {
>   struct kioctx *ioctx = lookup_ioctx(ctx);
> +
>   if (likely(NULL != ioctx)) {
>   struct ctx_rq_wait wait;
>   int ret;
> -- 
> 2.1.4

-- 
"Thought is the essence of where you are now."


Re: [PATCH v2] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

2015-07-21 Thread Benjamin LaHaise
On Tue, Jul 21, 2015 at 05:29:03PM +0200, Oleg Nesterov wrote:
> Andrew,
> 
> this replaces
> 
>   mm-move-mremap-from-file_operations-to-vm_operations_struct-fix.patch
> 
> in -mm tree nacked by Benjamin.
> 
> fs/aio.c can't be compiled if CONFIG_MMU=n, filemap_page_mkwrite() is
> not defined in this case. Add ifdef(CONFIG_MMU) into aio_ring_vm_ops.
> 
> Note that this only fixes the build error, currently aio doesn't work
> if !CONFIG_MMU anyway.
> 
> Reported-by: Fengguang Wu 
> Reported-by: Stephen Rothwell 
> Signed-off-by: Oleg Nesterov 
> ---
>  fs/aio.c |2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index c122624..822c199 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -338,9 +338,11 @@ static int aio_ring_mremap(struct vm_area_struct *vma)
>  
>  static const struct vm_operations_struct aio_ring_vm_ops = {
>   .mremap = aio_ring_mremap,
> +#ifdef CONFIG_MMU
>   .fault  = filemap_fault,
>   .map_pages  = filemap_map_pages,
>   .page_mkwrite   = filemap_page_mkwrite,
> +#endif
>  };

One nit: using #if IS_ENABLED(CONFIG_MMU) is recommended these days, not 
#ifdef CONFIG_MMU.

-ben

>  static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
> -- 
> 1.7.1
> 

-- 
"Thought is the essence of where you are now."


Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

2015-07-20 Thread Benjamin LaHaise
On Mon, Jul 20, 2015 at 09:24:40PM +0200, Oleg Nesterov wrote:
> On 07/20, Oleg Nesterov wrote:
> >
> > But again, again, please ignore. This all is off-topic and my understanding
> > is very limited.
> 
> Yes, yes, but sorry for noise and let me repeat...
> 
> This memory lives in page-cache/lru, it is visible for shrinker which
> will unmap these pages for no reason on memory shortage. IOW, aio fools
> the kernel, this memory looks reclaimable but it is not. And we only do
> this for migration.

And we have the same problem with O_DIRECT.  Given the size of the LRU in 
a modern system, I highly doubt a handful of pages getting scanned is a 
major problem.  If you want to improve this, go ahead, but we need to 
retain support for page migration as people have run into the need for it.

> Even if this is not a problem, this does not look right. So perhaps at
> least mapping_set_unevictable() makes sense. But I simply do not know
> if migration will work with this change.

Nor do I know if that will work.

> And I should have changed the subject long ago... So what do you think
> we should do with the build failure?

I honestly don't care which of the options you pick -- please just don't go 
about adding BUG()s.

-ben

> Oleg.



Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

2015-07-20 Thread Benjamin LaHaise
On Mon, Jul 20, 2015 at 07:33:11PM +0200, Oleg Nesterov wrote:
> Hi Jeff,
> 
> On 07/20, Jeff Moyer wrote:
> >
> > Hi, Oleg,
> >
> > Oleg Nesterov  writes:
> >
> > > Shouldn't we account aio events/pages somehow, say per-user, or in
> > > mm->pinned_vm ?
> >
> > Ages ago I wrote a patch to account the completion ring to a process'
> > memlock limit:
> >   "[patch] aio: remove aio-max-nr and instead use the memlock rlimit to
> >limit the number of pages pinned for the aio completion ring"
> >   http://marc.info/?l=linux-aio&m=123661380807041&w=2
> >
> > The problem with that patch is that it modifies the user/kernel
> > interface.  It could be done over time, as Andrew outlined in that
> > thread, but I've been reluctant to take that on.
> 
> See also the usage of mm->pinned_vm and user->locked_vm in perf_mmap(),
> perhaps aio can do the same...
> 
> > If you just mean we should account the memory so that the right process
> > can be killed, that sounds like a good idea to me.
> 
> Not sure we actually need this. I only meant that this looks confusing
> because this memory is actually locked but the kernel doesn't know this.
> 
> And btw, I forgot to mention that I triggered OOM on the testing machine
> with only 512mb ram, and aio-max-nr was huge. So, once again, while this
> all doesn't look right to me, I do not think this is the real problem.
> 
> Except the fact that an unprivileged user can steal all aio-max-nr events.
> This is probably worth fixing in any case.
> 
> 
> 
> And if we accept the fact this memory is locked and if we properly account
> it, then may be we can just kill aio_migratepage(), aio_private_file(), and
> change aio_setup_ring() to simply use install_special_mapping(). This will
> greatly simplify the code. But let me remind that I know nothing about aio,
> so please don't take my thoughts seriously.

No, you can't get rid of that code.  Page migration is required when 
CPUs/memory are offlined and data needs to be moved to another node.  
Similarly, support for mremap() is also required for container migration / 
restoration.

As for accounting locked memory, we don't do that for memory pinned by 
O_DIRECT either.  Given how small the amount of memory aio can pin is 
compared to O_DIRECT or mlock(), it is unlikely that the accounting of 
how much aio has pinned will make any real difference in the big picture.  
A single O_DIRECT i/o can pin megabytes of memory.

-ben

> Oleg.



Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

2015-07-17 Thread Benjamin LaHaise
On Fri, Jul 17, 2015 at 07:27:26PM +0200, Oleg Nesterov wrote:
> Benjamin,
> 
> it seems that we do not understand each other,
...
> >
> > Either try to fix it correctly,
> 
> And I think this fix is correct. In a sense that we only add
> filemap_page_mkwrite() to make the linker happy, it can never be called
> and thus we can never hit this BUG().
> 
> Please look at filemap_fault() in nommu.c,
> 
>   int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>   {
>   BUG();
>   return 0;
>   }
> 
> this is the same thing. If nothing else, mm/memory.c is not even compiled
> if NOMMU.

Using BUG() is the wrong approach.  If the code is not needed on NOMMU, then 
#ifdef it out.  Think about it: NOMMU systems are very low-memory systems 
and should not have dead code compiled in.

> > or disable the config.
> 
> Yes, I think this makes more sense. but see below...
> 
> > Making it just
> > compile but be knowingly broken is worse than either of those 2 options.
> 
> Why? See above. I think this change makes no difference except it fixes
> the build.
> 
> Again, of course I could miss something. Could you explain your point?

Don't add BUG().  It's the equivalent of saying "I think this code isn't 
needed, but I'm lazy and not going to remove it properly."

> > My point was that it is valid for someone to want to use the functionality
> > on a nommu system, and given that it should have worked before the page
> > migration code was added, It Would Be Nice(tm) to return it to that state.
> 
> Perhaps it worked on NOMMU before, I have no idea. But currently, afaics,
> it can not. Even sys_io_setup() can't succeed. So I do not understand why
> we allow NOMMU && CONFIG_AIO.
> 
> But this is another issue. Of course I won't insist, please forget.

As I said in my earlier email: aio on NOMMU was broken by the page migration 
code that was added in mid 3.1x.  Fixing it should not be that hard.
> 
> > Adding a BUG() like that to the code is just plain broken.
> 
> Why? Could you explain what I have missed?

It's doing half the job.  Either the code should be #if'd out or not.

-ben

> Oleg.



Re: [PATCH] mm-move-mremap-from-file_operations-to-vm_operations_struct-fix

2015-07-17 Thread Benjamin LaHaise
On Fri, Jul 17, 2015 at 01:52:28AM +0200, Oleg Nesterov wrote:
> On 07/16, Andrew Morton wrote:
> >
> > On Fri, 17 Jul 2015 01:14:05 +0200 Oleg Nesterov  wrote:
> >
> > > fs/aio.c can't be compiled if CONFIG_MMU=n, filemap_page_mkwrite()
> > > is not defined in this case. Add yet another "must not be called"
> > > helper into nommu.c to make the linker happy.
> > >
> > > I still think this is pointless, afaics sys_io_setup() simply can't
> > > succeed if CONFIG_MMU=n. Instead we should make CONFIG_AIO depend
> > > on CONFIG_MMU.
> > >
> > > ..
> > >
> > > --- a/mm/nommu.c
> > > +++ b/mm/nommu.c
> > > @@ -2008,6 +2008,12 @@ void filemap_map_pages(struct vm_area_struct *vma, 
> > > struct vm_fault *vmf)
> > >  }
> > >  EXPORT_SYMBOL(filemap_map_pages);
> > >
> > > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault 
> > > *vmf)
> > > +{
> > > + BUG();
> > > + return 0;
> > > +}
> > > +
> > >  static int __access_remote_vm(struct task_struct *tsk, struct mm_struct 
> > > *mm,
> > >   unsigned long addr, void *buf, int len, int write)
> > >  {
> >
> > So if anyone starts testing aio on NOMMU, this patch will make the
> > whole thing immediately go BUG.  This isn't helpful :(
> 
> Well, I'm afraid I could miss something, but _afaics_ this can not
> happen. filemap_page_mkwrite() can't be called if NOMMU.
> 
> In particular, simply because sys_io_setup() is the only user (if
> NOMMU) and it can't succeed. But even if I missed something and it
> can succeed, ->page_mkwrite() must not be called anyway. But this,
> again, unless I missed something ;)
> 
> > Yes, making AIO depend on MMU sounds better.
> 
> Perhaps Benjamin can change his mind or correct me.

Either try to fix it correctly, or disable the config.  Making it just 
compile but be knowingly broken is worse than either of those 2 options.  
My point was that it is valid for someone to want to use the functionality 
on a nommu system, and given that it should have worked before the page 
migration code was added, It Would Be Nice(tm) to return it to that state.  
Adding a BUG() like that to the code is just plain broken.

-ben

> > Because if it wasn't
> > busted before, it sure is now!
> 
> I hope this change can't make any difference.
> 
> Oleg.



Re: [PATCH v5 0/3] aio: ctx->dead cleanups

2015-06-22 Thread Benjamin LaHaise
Hi Oleg,

On Fri, Jun 19, 2015 at 08:18:40PM +0200, Oleg Nesterov wrote:
> > > As Jeff suggested, 1/3 was changed to simply remove the ->dead check.
> > > I also updated the changelog.
> > >
> > > Jeff, I preserved your acks in 2-3.
> >
> > also remove ctx->dead setting in ioctx_alloc().
> 
> Remove the note about -EINVAL from the changelog.
> 
> Oleg.

These look pretty reasonable.  I'll apply them to aio-next later today.

-ben


Re: [GIT PULL] aio: fix sleeping while TASK_INTERRUPTIBLE

2015-02-03 Thread Benjamin LaHaise
On Wed, Feb 04, 2015 at 09:23:12AM +1100, Dave Chinner wrote:
> On Mon, Feb 02, 2015 at 10:45:42AM -0800, Linus Torvalds wrote:
> > On Sun, Feb 1, 2015 at 9:54 PM, Dave Chinner  wrote:
> > >
> > > Simple enough - the patch below removes the warning from generic/036
> > > for me.
> > 
> > So because this is a debugging thing, I'd actually prefer these
> > "sched_annotate_sleep()" calls to always come with short  comments in
> > code why they exist and why they are fine.
> 
> Ok, I just copied the existing users which don't have any comments.
> 
> > In this case, it might be as simple as
> > 
> >  "If the mutex blocks and wakes us up, the loop in
> > wait_event_interruptible_hrtimeout() will just schedule without
> sleeping and repeat. The ring-lock doesn't block often enough for this
> > to be a performance issue".
> > 
> > or perhaps just point to the comment in read_events().
> 
> Both. New patch below.

I've added my Signed-off-by and applied it to my aio-fixes tree.  I'll send 
a pull for this later this evening once I commit a fix for an mremap case 
that just got pointed out earlier today as well.

-ben

> -Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 
> aio: annotate aio_read_event_ring for sleep patterns
> 
> From: Dave Chinner 
> 
> Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw
> warnings like the following due to being called from wait_event
> context:
> 
>  WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 
> __might_sleep+0x7f/0x90()
>  do not call blocking ops when !TASK_RUNNING; state=1 set at 
> [] prepare_to_wait_event+0x63/0x110
>  Modules linked in:
>  CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>   821c0372 88003c117cd8 81daf2bd d8d8
>   88003c117d28 88003c117d18 8109beda 88003c117cf8
>   821c115e 0061  7e4aa300
>  Call Trace:
>   [] dump_stack+0x4c/0x65
>   [] warn_slowpath_common+0x8a/0xc0
>   [] warn_slowpath_fmt+0x46/0x50
>   [] ? prepare_to_wait_event+0x63/0x110
>   [] ? prepare_to_wait_event+0x63/0x110
>   [] __might_sleep+0x7f/0x90
>   [] mutex_lock+0x24/0x45
>   [] aio_read_events+0x4c/0x290
>   [] read_events+0x1ec/0x220
>   [] ? prepare_to_wait_event+0x110/0x110
>   [] ? hrtimer_get_res+0x50/0x50
>   [] SyS_io_getevents+0x4d/0xb0
>   [] system_call_fastpath+0x12/0x17
>  ---[ end trace bde69eaf655a4fea ]---
> 
> There is not actually a bug here, so annotate the code to tell the
> debug logic that everything is just fine and not to fire a false
> positive.
> 
> Signed-off-by: Dave Chinner 
> ---
> V2: added comment to explain the annotation.
> 
>  fs/aio.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 1b7893e..327ef6d 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1140,6 +1140,13 @@ static long aio_read_events_ring(struct kioctx *ctx,
>   long ret = 0;
>   int copy_ret;
>  
> + /*
> +  * The mutex can block and wake us up and that will cause
> +  * wait_event_interruptible_hrtimeout() to schedule without sleeping
> +  * and repeat. This should be rare enough that it doesn't cause
> +  * performance issues. See the comment in read_events() for more detail.
> +  */
> + sched_annotate_sleep();
>   mutex_lock(&ctx->ring_lock);
>  
>   /* Access to ->ring_pages here is protected by ctx->ring_lock. */



Re: [GIT PULL] aio: fix sleeping while TASK_INTERRUPTIBLE

2015-02-01 Thread Benjamin LaHaise
On Sun, Feb 01, 2015 at 03:33:25PM -0800, Linus Torvalds wrote:
> On Sun, Feb 1, 2015 at 2:14 PM, Benjamin LaHaise  wrote:
> >
> > It's ugly, but it actually is revealing a bug.  Spurious wake ups caused
> > by the task already being added to ctx->wait when calling into mutex_lock()
> > could inadvertently cause things to go wrong.  I can envision there being
> > code invoked that possibly expects a 1-1 relationship between sleeps and
> > wake ups, which being on the additional wait queue might break.
> 
> So I'm looking at it, and I don't see it.
> 
> One side uses wait_event_interruptible_hrtimeout(), which waits for
> the return value (or the timeout), and it doesn't matter how many
> times it gets woken up, regardless of what it's waiting for. If it
> gets extra wakeups, it will just go through the loop again.
> 
> The other side is just a plain aio_read_events() ->
> aio_read_events_ring(), and that one just reads as many events as it
> can, unless some error happens.
> 
> In other words, it really looks like the warning is spurious, and the
> comments about how the semaphore could cause it to loop around but it
> all works look entirely correct.
> 
> So no, I don't see it revealing a bug at all. All I see is a spurious warning.
> 
> What's the bug you think could happen?

The bug would be in code that gets run via mutex_lock(), kmap(), or (more 
likely) in the random mm or filesystem code that could be invoked via 
copy_to_user().

If someone has code that works like the following inside of, say, a filesystem 
that's doing i/o somewhere:

static void foo_done(struct foo *foo)
{
/* stuff err somewhere */
wake_up(foo->task);
}

void some_random_code_in_some_fs()
{
struct foo *foo;

/* setup the foo */
foo->task = current;
set_current_state(TASK_UNINTERRUPTIBLE);
/* send the foo on to do some other work */
schedule();
/* foo_done should have been called at this point. */
}

When the task in question can receive spurious wake ups, there is the 
possibility that schedule() ends up returning without foo_done() having 
been called, which is not the case normally (that is, there should be a 
one to one relationship between the wake_up and schedule returning in 
this case).  While I don't immediately see any code that relies on this, 
I'm not convinced that every possible code path that can be invoked 
(especially via copy_to_user()) does not rely on these semantics.  Maybe 
I'm just being paranoid, but this is one of the concerns I raised when 
this issue came forth.  Nobody has addressed it yet, though.

-ben

>  Linus



Re: [GIT PULL] aio: fix sleeping while TASK_INTERRUPTIBLE

2015-02-01 Thread Benjamin LaHaise
On Sun, Feb 01, 2015 at 01:01:06PM -0800, Linus Torvalds wrote:
> On Sun, Feb 1, 2015 at 6:40 AM, Benjamin LaHaise  wrote:
> >
> > Chris Mason (1):
> >   fs/aio: fix sleeping while TASK_INTERRUPTIBLE
> 
> Ugh.
> 
> This patch is too ugly to live. As far as I can tell, this is another
> case of people just mindlessly trying to make the warning go away,
> rather than fixing anything in teh code itself. In fact, the code gets
> less readable, and more hacky, with that insane "running" variable
> that doesn't actually add anything to the logic, just makes the code
> harder to read, and it's *very* non-obvious why this is done in the
> first place.
> 
> If you want to shut up the warning without actually changing the code,
> use sched_annotate_sleep(). The comment about why the nested sleep
> isn't a problem ("sleeps in kmap or copy_to_user don't trigger
> warnings: If we don't copy enough events out, we'll loop through
> schedule() one time before sleeping").

It's ugly, but it actually is revealing a bug.  Spurious wake ups caused 
by the task already being added to ctx->wait when calling into mutex_lock() 
could inadvertently cause things to go wrong.  I can envision there being  
code invoked that possibly expects a 1-1 relationship between sleeps and 
wake ups, which being on the additional wait queue might break.

> I'm just about to push out a commit that makes
> "sched_annotate_sleep()" do the right thing, and *not* set
> TASK_RUNNING, but instead just disable the warning for that case.
> Which makes all these games unnecessary. I'm just waiting for my
> 'allmodconfig' build to finish before I push it out. Just another
> minute or two, I think.
> 
> I really detest debug code (or compiler warnings) that encourage
> people to write code that is *worse* than the code that causes the
> debug code or warning to trigger. It's fundamentally wrong when those
> "fixes" actually  make the code less readable and maintainable in the
> long run.

I think in this case the debug code reveals an actual bug.  I looked at 
other ways to fix it, and after a few attempts, Chris Mason's solution was 
the least-bad.  An alternative approach would be to go back to making 
ctx->ring_lock back into a spinlock, but it ends up being just as much 
(or even more) code churn.

>  Linus

-ben


[GIT PULL] aio: fix sleeping while TASK_INTERRUPTIBLE

2015-02-01 Thread Benjamin LaHaise
Hello Linus,

The following changes since commit 5f785de588735306ec4d7c875caf9d28481c8b21:

  aio: Skip timer for io_getevents if timeout=0 (2014-12-13 17:50:20 -0500)

are available in the git repository at:

  git://git.kvack.org/~bcrl/aio-fixes.git master

for you to fetch changes up to f84249f4cfef5ffa8a3e6d588a1d185f3f1fef45:

  fs/aio: fix sleeping while TASK_INTERRUPTIBLE (2015-01-30 11:18:36 -0500)


Chris Mason (1):
  fs/aio: fix sleeping while TASK_INTERRUPTIBLE

 fs/aio.c | 117 ++-
 1 file changed, 93 insertions(+), 24 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 1b7893e..71b192a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1131,6 +1131,8 @@ EXPORT_SYMBOL(aio_complete);
 /* aio_read_events_ring
  * Pull an event off of the ioctx's event ring.  Returns the number of
  * events fetched
+ *
+ * must be called with ctx->ring locked
  */
 static long aio_read_events_ring(struct kioctx *ctx,
 struct io_event __user *event, long nr)
@@ -1139,8 +1141,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
unsigned head, tail, pos;
long ret = 0;
int copy_ret;
-
-   mutex_lock(&ctx->ring_lock);
+   int running = current->state == TASK_RUNNING;
 
/* Access to ->ring_pages here is protected by ctx->ring_lock. */
ring = kmap_atomic(ctx->ring_pages[0]);
@@ -1179,6 +1180,17 @@ static long aio_read_events_ring(struct kioctx *ctx,
page = ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE];
pos %= AIO_EVENTS_PER_PAGE;
 
+   /* we're called after calling prepare_to_wait, which means
+* our current state might not be TASK_RUNNING.  Set it
> +* to running here so sleeps in kmap or copy_to_user don't
+* trigger warnings.  If we don't copy enough events out, we'll
+* loop through schedule() one time before sleeping.
+*/
+   if (!running) {
+   __set_current_state(TASK_RUNNING);
+   running = 1;
+   }
+
ev = kmap(page);
copy_ret = copy_to_user(event + ret, ev + pos,
sizeof(*ev) * avail);
@@ -1201,11 +1213,10 @@ static long aio_read_events_ring(struct kioctx *ctx,
 
pr_debug("%li  h%u t%u\n", ret, head, tail);
 out:
-   mutex_unlock(&ctx->ring_lock);
-
return ret;
 }
 
+/* must be called with ctx->ring locked */
 static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr,
struct io_event __user *event, long *i)
 {
@@ -1223,6 +1234,73 @@ static bool aio_read_events(struct kioctx *ctx, long 
min_nr, long nr,
return ret < 0 || *i >= min_nr;
 }
 
+/*
+ * called without ctx->ring_lock held
+ */
+static long aio_wait_events(struct kioctx *ctx, long min_nr, long nr,
+   struct io_event __user *event,
+   long *i_ret, ktime_t until)
+{
+   struct hrtimer_sleeper t;
+   long ret;
+   DEFINE_WAIT(wait);
+
+   mutex_lock(&ctx->ring_lock);
+
+   /*
+* this is an open coding of wait_event_interruptible_hrtimeout
+* so we can deal with the ctx->mutex nicely.  It starts with the
+* fast path for existing events:
+*/
+   ret = aio_read_events(ctx, min_nr, nr, event, i_ret);
+   if (ret) {
+   mutex_unlock(&ctx->ring_lock);
+   return ret;
+   }
+
+   hrtimer_init_on_stack(&t.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+   hrtimer_init_sleeper(&t, current);
+   if (until.tv64 != KTIME_MAX) {
+   hrtimer_start_range_ns(&t.timer, until, current->timer_slack_ns,
+ HRTIMER_MODE_REL);
+   }
+
+   while (1) {
+   ret = prepare_to_wait_event(&ctx->wait, &wait,
+   TASK_INTERRUPTIBLE);
+   if (ret)
+   break;
+
+   ret = aio_read_events(ctx, min_nr, nr, event, i_ret);
+   if (ret)
+   break;
+
+   /* make sure we didn't timeout taking the mutex */
+   if (!t.task) {
+   ret = -ETIME;
+   break;
+   }
+
+   mutex_unlock(&ctx->ring_lock);
+   schedule();
+
+   if (!t.task) {
+   ret = -ETIME;
+   goto out;
+   }
+   mutex_lock(&ctx->ring_lock);
+
+   }
+   mutex_unlock(&ctx->ring_lock);
+
+out:
+   finish_wait(&ctx->wait, &wait);
+   hrtimer_cancel(&t.timer);
+   destroy_hrtimer_on_stack(&t.timer);
+   return ret;
+
+}
+
 static long read_events(struct kioctx *ctx, long min_nr, long nr,

Re: [v3.19-rc1] read_events => aio_read_events => WARNING: "do not call blocking ops when !TASK_RUNNING"

2014-12-24 Thread Benjamin LaHaise
On Wed, Dec 24, 2014 at 02:50:02PM +0800, Fam Zheng wrote:
> Could you trace down to the source code context of this io_getevents call in
> mysqld?  The change merely made io_getevents return a lot more quickly than
> before, so it seems like a polling loop is spinning too fast as a result.

This warning is unrelated to your changes -- see my email exchange with Chris 
Mason for the details.

-ben


Re: [GIT PULL] aio: changes for 3.19

2014-12-14 Thread Benjamin LaHaise
On Mon, Dec 15, 2014 at 01:02:08AM +0200, Kirill A. Shutemov wrote:
> But it seems the problem is bigger than what the patch fixes. To me we are
> too permisive on what vma can be remapped.
> 
> How can we know that it's okay to move vma around for random driver which
> provide .mmap? Or I miss something obvious?

Most drivers do not care if a vma is moved within the virtual address space 
of a process.  The aio ring buffer is special in that it gets unmapped when 
userspace does an io_destroy(), and io_destroy() has to know what the address 
is moved to in order to perform the unmap.  Normal drivers don't perform the 
unmap themselves.

-ben


Re: [GIT PULL] aio: changes for 3.19

2014-12-14 Thread Benjamin LaHaise
On Sun, Dec 14, 2014 at 02:13:36PM -0800, Andrew Morton wrote:
> The patch appears to be a bugfix which coincidentally helps CRIU?

Yes.

> If it weren't for the bugfix part, I'd be asking "why not pass the
> desired virtual address into io_setup()?".

It's entirely possible someone might have a need to mremap the event ring, 
but nobody seems to have tried until now.

> The patch overall is a no-op from an MM perspective and seems OK to me.

How about the documentation/comment updates below?

I'll try to be more aggressive about getting signoffs on these wider changes 
in the future.

-ben

aio/mm: update documentation for mremap hook in commit 
e4a0d3e720e7e508749c1439b5ba3aff56c92976

Add a few more comments and documentation to explain the mremap hook introduced
in commit e4a0d3e720e7e508749c1439b5ba3aff56c92976.

Signed-off-by: Benjamin LaHaise 

 Documentation/filesystems/vfs.txt |4 
 fs/aio.c  |8 
 2 files changed, 12 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index 43ce050..a9f3df5 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -820,6 +820,7 @@ struct file_operations {
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+   void (*mremap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
@@ -868,6 +869,9 @@ otherwise noted.
 
   mmap: called by the mmap(2) system call
 
+  mremap: called by the mremap(2) system call.  Called holding mmap_sem for
+   write.
+
   open: called by the VFS when an inode should be opened. When the VFS
opens a file, it creates a new "struct file". It then calls the
open method for the newly allocated file structure. You might
diff --git a/fs/aio.c b/fs/aio.c
index 1b7893e..aba5385 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -286,11 +286,19 @@ static void aio_free_ring(struct kioctx *ctx)
 
 static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
 {
+   /* Resizing the event ring is not supported; mark it so. */
vma->vm_flags |= VM_DONTEXPAND;
vma->vm_ops = &generic_file_vm_ops;
return 0;
 }
 
+/* aio_ring_remap()
+ * Called when the aio event ring is being relocated within the process'
+ * address space.  The primary purpose is to update the saved address of
+ * the aio event ring so that when the ioctx is destroyed, it gets removed
+ * from the correct userspace address.  This is typically used when
+ * reloading a process back into memory by checkpoint-restore.
+ */
 static void aio_ring_remap(struct file *file, struct vm_area_struct *vma)
 {
struct mm_struct *mm = vma->vm_mm;


Re: [GIT PULL] aio: changes for 3.19

2014-12-14 Thread Benjamin LaHaise
On Sun, Dec 14, 2014 at 01:47:32PM -0800, Linus Torvalds wrote:
> On Sun, Dec 14, 2014 at 12:22 PM, Benjamin LaHaise  wrote:
> >
> > Pavel Emelyanov (1):
> >   aio: Make it possible to remap aio ring
> 
> So quite frankly, I think this should have had more acks from VM
> people. The patch looks ok to me, but it took me by surprise, and I
> don't see much any discussion about it on linux-mm either..

Sadly, nobody responded.  Maybe akpm can chime in on this change (included 
below for ease of review and akpm added to the To:)?

-ben

commit e4a0d3e720e7e508749c1439b5ba3aff56c92976
Author: Pavel Emelyanov 
Date:   Thu Sep 18 19:56:17 2014 +0400

aio: Make it possible to remap aio ring

There are actually two issues this patch addresses. Let me start with
the one I tried to solve in the beginning.

So, in the checkpoint-restore project (criu) we try to dump tasks'
state and restore one back exactly as it was. One of the tasks' state
bits is rings set up with io_setup() call. There's (almost) no problems
in dumping them, there's a problem restoring them -- if I dump a task
with aio ring originally mapped at address A, I want to restore one
back at exactly the same address A. Unfortunately, the io_setup() does
not allow for that -- it mmaps the ring at whatever place mm finds
appropriate (it calls do_mmap_pgoff() with zero address and without
the MAP_FIXED flag).

To make restore possible I'm going to mremap() the freshly created ring
into the address A (under which it was seen before dump). The problem is
that the ring's virtual address is passed back to the user-space as the
context ID and this ID is then used as search key by all the other io_foo()
calls. Reworking this ID to be just some integer doesn't seem to work, as
the value is already used by libaio as a pointer through which the library
accesses its aio metadata.

So, to make restore work we need to make sure that

a) ring is mapped at desired virtual address
b) kioctx->user_id matches this value

Having said that, the patch makes mremap() on aio region update the
kioctx's user_id and mmap_base values.

Here appears the 2nd issue I mentioned in the beginning of this mail.
If (regardless of the C/R dances I do) someone creates an io context
with io_setup(), then mremap()-s the ring and then destroys the context,
the kill_ioctx() routine will call munmap() on wrong (old) address.
This will result in a) aio ring remaining in memory and b) some other
vma get unexpectedly unmapped.

What do you think?

    Signed-off-by: Pavel Emelyanov 
Acked-by: Dmitry Monakhov 
Signed-off-by: Benjamin LaHaise 

diff --git a/fs/aio.c b/fs/aio.c
index 14b9315..bfab556 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -286,12 +286,37 @@ static void aio_free_ring(struct kioctx *ctx)
 
 static int aio_ring_mmap(struct file *file, struct vm_area_struct *vma)
 {
+   vma->vm_flags |= VM_DONTEXPAND;
vma->vm_ops = &generic_file_vm_ops;
return 0;
 }
 
+static void aio_ring_remap(struct file *file, struct vm_area_struct *vma)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   struct kioctx_table *table;
+   int i;
+
+   spin_lock(&mm->ioctx_lock);
+   rcu_read_lock();
+   table = rcu_dereference(mm->ioctx_table);
+   for (i = 0; i < table->nr; i++) {
+   struct kioctx *ctx;
+
+   ctx = table->table[i];
+   if (ctx && ctx->aio_ring_file == file) {
+   ctx->user_id = ctx->mmap_base = vma->vm_start;
+   break;
+   }
+   }
+
+   rcu_read_unlock();
+   spin_unlock(&mm->ioctx_lock);
+}
+
 static const struct file_operations aio_ring_fops = {
.mmap = aio_ring_mmap,
+   .mremap = aio_ring_remap,
 };
 
 #if IS_ENABLED(CONFIG_MIGRATION)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ab779e..85f378c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1497,6 +1497,7 @@ struct file_operations {
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+   void (*mremap)(struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
diff --git a/mm/mremap.c b/mm/mremap.c
index b147f66..c855922 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -288,7 +288,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
old_len = new

[GIT PULL] aio: changes for 3.19

2014-12-14 Thread Benjamin LaHaise
Hello Linus & everyone,

The following changes since commit b2776bf7149bddd1f4161f14f79520f17fc1d71d:

  Linux 3.18 (2014-12-07 14:21:05 -0800)

are available in the git repository at:

  git://git.kvack.org/~bcrl/aio-next.git master

for you to fetch changes up to 5f785de588735306ec4d7c875caf9d28481c8b21:

  aio: Skip timer for io_getevents if timeout=0 (2014-12-13 17:50:20 -0500)


Fam Zheng (1):
  aio: Skip timer for io_getevents if timeout=0

Pavel Emelyanov (1):
  aio: Make it possible to remap aio ring

 fs/aio.c   | 33 +++--
 include/linux/fs.h |  1 +
 mm/mremap.c|  3 ++-
 3 files changed, 34 insertions(+), 3 deletions(-)

Regards,

-ben
-- 
"Thought is the essence of where you are now."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] aio: dirty page accounting fix

2014-11-24 Thread Benjamin LaHaise
Good morning Linus and all,

The following changes since commit 6098b45b32e6baeacc04790773ced9340601d511:

  aio: block exit_aio() until all context requests are completed (2014-09-04 
16:54:47 -0400)

are available in the git repository at:

  git://git.kvack.org/~bcrl/aio-fixes.git master

for you to fetch changes up to 835f252c6debd204fcd607c79975089b1ecd3472:

  aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer 
(2014-11-06 14:27:19 -0500)


Gu Zheng (1):
  aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer

 fs/aio.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)
-- 
"Thought is the essence of where you are now."


Re: [PATCH] aio: Skip timer for io_getevents if timeout=0

2014-11-06 Thread Benjamin LaHaise
Hi Fam,

On Thu, Nov 06, 2014 at 08:44:36PM +0800, Fam Zheng wrote:
> In this case, it is basically polling. Let's not involve a timer at all
> because that would hurt performance for application event loops.
> 
> In an arbitrary test I've done, io_getevents syscall elapsed time
> reduces from 5+ microseconds to a few hundred nanoseconds.

This looks quite reasonable.  I've applied this to my aio-next tree.

-ben

> Signed-off-by: Fam Zheng 
> ---
>  fs/aio.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 84a7510..7c0b561 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1221,8 +1221,12 @@ static long read_events(struct kioctx *ctx, long 
> min_nr, long nr,
>* the ringbuffer empty. So in practice we should be ok, but it's
>* something to be aware of when touching this code.
>*/
> - wait_event_interruptible_hrtimeout(ctx->wait,
> - aio_read_events(ctx, min_nr, nr, event, &ret), until);
> + if (until.tv64 == 0)
> + aio_read_events(ctx, min_nr, nr, event, &ret);
> + else
> + wait_event_interruptible_hrtimeout(ctx->wait,
> + aio_read_events(ctx, min_nr, nr, event, &ret),
> + until);
>  
>   if (!ret && signal_pending(current))
>   ret = -EINTR;
> -- 
> 1.9.3

-- 
"Thought is the essence of where you are now."


Re: [PATCH V2] aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer

2014-11-06 Thread Benjamin LaHaise
Hi Gu,

On Thu, Nov 06, 2014 at 05:46:21PM +0800, Gu Zheng wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=86831
> 
> Markus reported that shutting down mysqld (with AIO support, on an
> ext3-formatted hard drive) leads to a negative number of dirty pages
> (the counter underruns). The negative number results in a drastic
> reduction of write performance, because the page cache is not used: the
> kernel thinks there are still 2^32 dirty pages outstanding.

I've applied this to my aio-fixes.git tree and will forward to Linus 
shortly.  Andrew -- I added your Acked-by as well based on your email.  

-ben

> Adding a warning trace in __dec_zone_state catches this easily:
> 
> static inline void __dec_zone_state(struct zone *zone, enum
>   zone_stat_item item)
> {
>  atomic_long_dec(&zone->vm_stat[item]);
> +WARN_ON_ONCE(item == NR_FILE_DIRTY &&
>   atomic_long_read(&zone->vm_stat[item]) < 0);
>  atomic_long_dec(&vm_stat[item]);
> }
> 
> [   21.341632] [ cut here ]
> [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
> cancel_dirty_page+0x164/0x224()
> [   21.355296] Modules linked in: wutbox_cp sata_mv
> [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
> [   21.366793] Workqueue: events free_ioctx
> [   21.370760] [] (unwind_backtrace) from []
> (show_stack+0x20/0x24)
> [   21.378562] [] (show_stack) from []
> (dump_stack+0x24/0x28)
> [   21.385840] [] (dump_stack) from []
> (warn_slowpath_common+0x84/0x9c)
> [   21.393976] [] (warn_slowpath_common) from []
> (warn_slowpath_null+0x2c/0x34)
> [   21.402800] [] (warn_slowpath_null) from []
> (cancel_dirty_page+0x164/0x224)
> [   21.411524] [] (cancel_dirty_page) from []
> (truncate_inode_page+0x8c/0x158)
> [   21.420272] [] (truncate_inode_page) from []
> (truncate_inode_pages_range+0x11c/0x53c)
> [   21.429890] [] (truncate_inode_pages_range) from
> [] (truncate_pagecache+0x88/0xac)
> [   21.439252] [] (truncate_pagecache) from []
> (truncate_setsize+0x5c/0x74)
> [   21.447731] [] (truncate_setsize) from []
> (put_aio_ring_file.isra.14+0x34/0x90)
> [   21.456826] [] (put_aio_ring_file.isra.14) from
> [] (aio_free_ring+0x20/0xcc)
> [   21.465660] [] (aio_free_ring) from []
> (free_ioctx+0x24/0x44)
> [   21.473190] [] (free_ioctx) from []
> (process_one_work+0x134/0x47c)
> [   21.481132] [] (process_one_work) from []
> (worker_thread+0x130/0x414)
> [   21.489350] [] (worker_thread) from []
> (kthread+0xd4/0xec)
> [   21.496621] [] (kthread) from []
> (ret_from_fork+0x14/0x20)
> [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
> 
> The cause is that we set the aio ring file pages *DIRTY* via SetPageDirty
> (bypassing the VFS dirty page increment) at init time, and the aio fs uses
> *default_backing_dev_info* as its backing dev, which does not disable
> the dirty page accounting capability.
> So truncating the aio ring file contributes to the dirty page accounting
> (VFS dirty page decrement), and the counter underruns.
> 
> The original goal was to keep these pages in memory (unable to be
> reclaimed or swapped) for their lifetime by marking them dirty. But on
> reflection, we have already pinned the pages by elevating their
> refcounts, which achieves that goal on its own, so the SetPageDirty
> seems unnecessary.
> 
> To fix the issue, use __set_page_dirty_no_writeback instead of the nop
> .set_page_dirty, and drop the SetPageDirty (don't manually set the dirty
> flag, don't disable set_page_dirty(), rely on the default behaviour).
> 
> With the above change, dirty page accounting works correctly. But as we
> know, the aio fs is an anonymous one that should never cause any real
> write-back, so we can skip dirty page (write-back) accounting entirely
> by disabling that capability. We therefore introduce an aio-private
> backing dev info (with the ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities
> disabled) to replace the default one.
> 
> Reported-by: Markus Königshaus 
> Signed-off-by: Gu Zheng 
> Cc: stable 
> ---
> v2:
>  -explain more and add necessary code comment as akpm suggested.
> ---
>  fs/aio.c |   21 ++---
>  1 files changed, 14 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 84a7510..14b9315 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -165,6 +165,15 @@ static struct vfsmount *aio_mnt;
>  static const struct file_operations aio_ring_fops;
>  static const struct address_space_operations aio_ctx_aops;
>  
> +/* Backing dev info for aio fs.
> + * -no dirty page accounting or writeback happens
> + */
> +static struct backing_dev_info aio_fs_backing_dev_info = {
> + .name   = "aiofs",
> + .state  = 0,
> + .capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_MAP_COPY,
> +};
> +
>  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
>  {
>   struct qstr this = QSTR_INIT("[aio]", 5);
> @@ -176,6 +185,7 @@ static struct file *a

Re: [PATCH] aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer

2014-11-05 Thread Benjamin LaHaise
On Wed, Nov 05, 2014 at 05:53:11PM +0800, Gu Zheng wrote:
> ping...

I need someone a bit more familiar with this area of code to chime in on 
reviewing this.  Andrew, can you provide any feedback on this fix?

-ben

> On 10/31/2014 06:07 PM, Gu Zheng wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=86831
> > 
> > Markus reported that shutting down mysqld (with AIO support, on an
> > ext3-formatted hard drive) leads to a negative number of dirty pages
> > (the counter underruns). The negative number results in a drastic
> > reduction of write performance, because the page cache is not used:
> > the kernel thinks there are still 2^32 dirty pages outstanding.
> > 
> > Adding a warning trace in __dec_zone_state catches this easily:
> > 
> > static inline void __dec_zone_state(struct zone *zone, enum
> > zone_stat_item item)
> > {
> >  atomic_long_dec(&zone->vm_stat[item]);
> > +WARN_ON_ONCE(item == NR_FILE_DIRTY &&
> > atomic_long_read(&zone->vm_stat[item]) < 0);
> >  atomic_long_dec(&vm_stat[item]);
> > }
> > 
> > [   21.341632] [ cut here ]
> > [   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
> > cancel_dirty_page+0x164/0x224()
> > [   21.355296] Modules linked in: wutbox_cp sata_mv
> > [   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
> > [   21.366793] Workqueue: events free_ioctx
> > [   21.370760] [] (unwind_backtrace) from []
> > (show_stack+0x20/0x24)
> > [   21.378562] [] (show_stack) from []
> > (dump_stack+0x24/0x28)
> > [   21.385840] [] (dump_stack) from []
> > (warn_slowpath_common+0x84/0x9c)
> > [   21.393976] [] (warn_slowpath_common) from []
> > (warn_slowpath_null+0x2c/0x34)
> > [   21.402800] [] (warn_slowpath_null) from []
> > (cancel_dirty_page+0x164/0x224)
> > [   21.411524] [] (cancel_dirty_page) from []
> > (truncate_inode_page+0x8c/0x158)
> > [   21.420272] [] (truncate_inode_page) from []
> > (truncate_inode_pages_range+0x11c/0x53c)
> > [   21.429890] [] (truncate_inode_pages_range) from
> > [] (truncate_pagecache+0x88/0xac)
> > [   21.439252] [] (truncate_pagecache) from []
> > (truncate_setsize+0x5c/0x74)
> > [   21.447731] [] (truncate_setsize) from []
> > (put_aio_ring_file.isra.14+0x34/0x90)
> > [   21.456826] [] (put_aio_ring_file.isra.14) from
> > [] (aio_free_ring+0x20/0xcc)
> > [   21.465660] [] (aio_free_ring) from []
> > (free_ioctx+0x24/0x44)
> > [   21.473190] [] (free_ioctx) from []
> > (process_one_work+0x134/0x47c)
> > [   21.481132] [] (process_one_work) from []
> > (worker_thread+0x130/0x414)
> > [   21.489350] [] (worker_thread) from []
> > (kthread+0xd4/0xec)
> > [   21.496621] [] (kthread) from []
> > (ret_from_fork+0x14/0x20)
> > [   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---
> > 
> > The cause is that we set the aio ring file pages *DIRTY* via SetPageDirty
> > (bypassing the VFS dirty page increment) at init time, and the aio fs
> > uses *default_backing_dev_info* as its backing dev, which does not
> > disable the dirty page accounting capability.
> > So truncating the aio ring file contributes to the dirty page
> > accounting, and the counter underruns.
> > The original goal was to keep these pages in memory (unable to be
> > reclaimed or swapped) for their lifetime by marking them dirty. But on
> > reflection, pinning the pages already achieves that goal, so the
> > SetPageDirty seems unnecessary.
> > To fix the issue, use __set_page_dirty_no_writeback instead of the nop
> > .set_page_dirty, and drop the SetPageDirty (don't manually set the
> > dirty flag, don't disable set_page_dirty(), rely on the default
> > behaviour).
> > An aio-private backing dev info (with the ACCT_DIRTY/WRITEBACK/ACCT_WB
> > capabilities disabled) is introduced to replace the default one, to
> > avoid accounting dirty pages.
> > 
> > Reported-by: Markus Königshaus 
> > Signed-off-by: Gu Zheng 
> > Cc: stable 
> > ---
> >  fs/aio.c |   18 +++---
> >  1 files changed, 11 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/aio.c b/fs/aio.c
> > index 84a7510..b23dc31 100644
> > --- a/fs/aio.c
> > +++ b/fs/aio.c
> > @@ -165,6 +165,12 @@ static struct vfsmount *aio_mnt;
> >  static const struct file_operations aio_ring_fops;
> >  static const struct address_space_operations aio_ctx_aops;
> >  
> > +static struct backing_dev_info aio_fs_backing_dev_info = {
> > +   .name   = "aiofs",
> > +   .state  = 0,
> > +   .capabilities   = BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_MAP_COPY,
> > +};
> > +
> >  static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
> >  {
> > struct qstr this = QSTR_INIT("[aio]", 5);
> > @@ -176,6 +182,7 @@ static struct file *aio_private_file(struct kioctx 
> > *ctx, loff_t nr_pages)
> >  
> > inode->i_mapping->a_ops = &aio_ctx_aops;
> > inode->i_mapping->private_data = ctx;
> > +   inode->i_mapping->backing_dev_info = &aio_fs_backing_dev_info;
> > inode->i_size = PAGE_SIZE * nr_pages;
>

Re: [PATCH] aio: Fix return code of io_submit() (RFC)

2014-10-03 Thread Benjamin LaHaise
On Fri, Oct 03, 2014 at 12:13:39PM -0600, Jens Axboe wrote:
> On 2014-10-03 12:08, Kent Overstreet wrote:
> >io_submit() could return -EAGAIN on memory allocation failure when it 
> >should
> >really have been returning -ENOMEM. This could confuse applications (i.e. 
> >fio)
> >since -EAGAIN means "too many requests outstanding, wait until completions 
> >have
> >been reaped" and if the application actually was tracking outstanding
> >completions this wouldn't make a lot of sense.
> >
> >NOTE:
> >
> >the man page seems to imply that the current behaviour (-EAGAIN on 
> >allocation
> >failure) has always been the case. I don't think it makes a lot of sense, 
> >but
> >this should probably be discussed more widely in case applications have 
> >somehow
> >come to rely on the current behaviour...
> 
> We can't really feasibly fix this, is my worry. Fio does track the 
> pending requests and does not get into a getevents() forever wait if it 
> gets -EAGAIN on submission. But before the fix, it would loop forever in 
> submission on -EAGAIN.

There are lots of instances in the kernel where out of memory is potentially 
exposed to the user.  If we're failing a memory allocation that is well under 
1KB, the system is probably completely hosed.

> How are applications supposed to deal with ENOMEM? I think the answer 
> here is that they can't; it would be a fatal condition. AIO must provide 
> its own guarantee of progress, with a mempool or similar.

I'm not sure if using a mempool is appropriate for allocations that are 
driven by userland code.  At least with an ENOMEM error, an application 
could free up some of the memory it allocated and possibly recover the 
system.

-ben
-- 
"Thought is the essence of where you are now."


Re: [PATCH] aio: Fix return code of io_submit() (RFC)

2014-10-03 Thread Benjamin LaHaise
Hi Kent,

On Fri, Oct 03, 2014 at 11:08:13AM -0700, Kent Overstreet wrote:
> io_submit() could return -EAGAIN on memory allocation failure when it should
> really have been returning -ENOMEM. This could confuse applications (i.e. fio)
> since -EAGAIN means "too many requests outstanding, wait until completions 
> have
> been reaped" and if the application actually was tracking outstanding
> completions this wouldn't make a lot of sense.

Wouldn't it be simpler to just return an ERR_PTR with the appropriate 
return code rather than move all that code around?

-ben

> NOTE:
> 
> the man page seems to imply that the current behaviour (-EAGAIN on allocation
> failure) has always been the case. I don't think it makes a lot of sense, but
> this should probably be discussed more widely in case applications have 
> somehow
> come to rely on the current behaviour...
> 
> Signed-off-by: Kent Overstreet 
> Cc: Benjamin LaHaise 
> Cc: Zach Brown 
> Cc: Jeff Moyer 
> Cc: Jens Axboe 
> Cc: Slava Pestov 
> ---
>  fs/aio.c | 26 +-
>  1 file changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 733750096b..556547044b 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -933,23 +933,14 @@ static inline struct kiocb *aio_get_req(struct kioctx 
> *ctx)
>  {
>   struct kiocb *req;
>  
> - if (!get_reqs_available(ctx)) {
> - user_refill_reqs_available(ctx);
> - if (!get_reqs_available(ctx))
> - return NULL;
> - }
> -
>   req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
>   if (unlikely(!req))
> - goto out_put;
> + return NULL;
>  
>   percpu_ref_get(&ctx->reqs);
>  
>   req->ki_ctx = ctx;
>   return req;
> -out_put:
> - put_reqs_available(ctx, 1);
> - return NULL;
>  }
>  
>  static void kiocb_free(struct kiocb *req)
> @@ -1489,9 +1480,17 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>   return -EINVAL;
>   }
>  
> + if (!get_reqs_available(ctx)) {
> + user_refill_reqs_available(ctx);
> + if (!get_reqs_available(ctx))
> + return -EAGAIN;
> + }
> +
>   req = aio_get_req(ctx);
> - if (unlikely(!req))
> - return -EAGAIN;
> + if (unlikely(!req)) {
> + ret = -ENOMEM;
> + goto out_put;
> + }
>  
>   req->ki_filp = fget(iocb->aio_fildes);
>   if (unlikely(!req->ki_filp)) {
> @@ -1533,9 +1532,10 @@ static int io_submit_one(struct kioctx *ctx, struct 
> iocb __user *user_iocb,
>  
>   return 0;
>  out_put_req:
> - put_reqs_available(ctx, 1);
>   percpu_ref_put(&ctx->reqs);
>   kiocb_free(req);
> +out_put:
> + put_reqs_available(ctx, 1);
>   return ret;
>  }
>  
> -- 
> 2.1.1

-- 
"Thought is the essence of where you are now."


Re: [RFC 1/2] aio: async readahead

2014-09-19 Thread Benjamin LaHaise
On Fri, Sep 19, 2014 at 04:26:12AM -0700, Christoph Hellwig wrote:
> Requiring the block mappings to be entirely async is why we never went
> for full buffered aio.  What would seem more useful is to offload all
> readahead to workqueues to make sure they never block the caller for
> sys_readahead or if we decide to readahead for the nonblocking read.

I can appreciate that it may be difficult for some filesystems to implement 
a fully asynchronous readpage, but at least for some, it is possible 
and not too difficult.

> I tried to implement this, but I couldn't find a good place to hang
> the work_struct for it off.  If we decide to dynamically allocate
> the ra structure separate from struct file that might be an obvious
> place.

The approach I used in the async ext2/3/4 indirect style metadata readpage 
was to put the async state into the page's memory.  That won't work very 
well on 32 bit systems, but it works well and avoids having to perform 
another memory allocation on 64 bit systems.

I'm still of the opinion that the readpage operation should be started by 
the submitting process.  Some of the work I did in tuning things for my 
employer with async reads found that punting reads to another thread 
caused significant degradation of our workload (basically, reading in a 
bunch of persistent messages from disk, with small messages being an 
important corner of performance).  What ended up being the best performing 
for me was to have an async readahead operation to fill the page cache 
with data from the file, and then to issue a read that was essentially 
non-blocking.  This approach meant that the copy of data from the kernel 
into userspace was performed by the thread that was actually using the 
data.  By doing the copy only once all i/o completed, the data was primed 
in the CPU's cache, allowing the code that actually operates on the data 
to benefit.  Any gradual copy over time ended up performing significantly 
worse.

-ben
-- 
"Thought is the essence of where you are now."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 2/2] ext4: async readpage for indirect style inodes

2014-09-17 Thread Benjamin LaHaise
Hi all,

And here is the version of readpage for ext3/ext4 that performs async 
metadata reads for old-style indirect block based ext3/ext4 filesystems.  
This version only includes the changes against ext4 -- the changes to 
ext3 are pretty much identical.  This is only an RFC and has at least 
one known issue, that being that it only works on ext3 filesystems with 
block size equal to page size.

-ben
-- 
"Thought is the essence of where you are now."

 fs/ext4/ext4.h  |3 
 fs/ext4/indirect.c  |6 
 fs/ext4/inode.c |  294 
 fs/mpage.c  |4 
 include/linux/mpage.h   |3 
 include/linux/pagemap.h |2 
 6 files changed, 307 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b0c225c..8136284 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2835,6 +2835,9 @@ extern struct mutex ext4__aio_mutex[EXT4_WQ_HASH_SZ];
 extern int ext4_resize_begin(struct super_block *sb);
 extern void ext4_resize_end(struct super_block *sb);
 
+extern int ext4_block_to_path(struct inode *inode,
+ ext4_lblk_t i_block,
+ ext4_lblk_t offsets[4], int *boundary);
 #endif /* __KERNEL__ */
 
 #endif /* _EXT4_H */
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index e75f840..689267a 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -69,9 +69,9 @@ static inline void add_chain(Indirect *p, struct buffer_head 
*bh, __le32 *v)
  * get there at all.
  */
 
-static int ext4_block_to_path(struct inode *inode,
- ext4_lblk_t i_block,
- ext4_lblk_t offsets[4], int *boundary)
+int ext4_block_to_path(struct inode *inode,
+  ext4_lblk_t i_block,
+  ext4_lblk_t offsets[4], int *boundary)
 {
int ptrs = EXT4_ADDR_PER_BLOCK(inode->i_sb);
int ptrs_bits = EXT4_ADDR_PER_BLOCK_BITS(inode->i_sb);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3aa26e9..4b36000 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2820,6 +2820,297 @@ static sector_t ext4_bmap(struct address_space 
*mapping, sector_t block)
return generic_block_bmap(mapping, block, ext4_get_block);
 }
 
+#include 
+
+struct ext4_readpage_state {
+   struct page *page;
+   struct inode*inode;
+   int offsets[4];
+   int depth;
+   int blocks_to_boundary;
+   int cur_depth;
+   int cur_block;
+   int waiting_on_lock;
+   struct buffer_head  *cur_bh;
+   struct bio  *bio;
+   struct wait_bit_queue   wait_bit;
+   struct work_struct  work;
+};
+
+#define dprintk(x...) do { ; } while (0)
+
+static void ext4_readpage_statemachine(struct ext4_readpage_state *state);
+static void ext4_readpage_work_func(struct work_struct *work)
+{
+   struct ext4_readpage_state *state;
+
+   state = container_of(work, struct ext4_readpage_state, work);
+   dprintk("ext4_readpage_work_func(%p): state=%p\n", work, state);
+   ext4_readpage_statemachine(state);
+}
+
+static int ext4_readpage_wait_func(wait_queue_t *wait, unsigned mode, int 
flags,
+  void *arg)
+{
+   struct ext4_readpage_state *state = wait->private;
+   struct wait_bit_key *key = arg;
+
+   dprintk("ext4_readpage_wait_func: state=%p\n", state);
+   dprintk("key->flags=%p key->bit_nr=%d, page->flags=%p\n",
+   key->flags, key->bit_nr, &state->page->flags);
+   if (state->wait_bit.key.flags != key->flags ||
+   state->wait_bit.key.bit_nr != key->bit_nr ||
+   test_bit(key->bit_nr, key->flags))
+return 0;
+   dprintk("ext4_readpage_wait_func: page is unlocked/uptodate\n");
+   list_del_init(&wait->task_list);
+   INIT_WORK(&state->work, ext4_readpage_work_func);
+   schedule_work(&state->work);
+   return 1;
+}
+
+static void ext4_readpage_wait_on_bh(struct ext4_readpage_state *state)
+{
+   wait_queue_head_t *wq;
+   unsigned long flags;
+   int ret = 0;
+
+   state->wait_bit.key.flags = &state->cur_bh->b_state;
+   state->wait_bit.key.bit_nr = BH_Lock;
+   state->wait_bit.wait.private = state;
+   state->wait_bit.wait.func = ext4_readpage_wait_func;
+
+   wq = bit_waitqueue(state->wait_bit.key.flags,
+  state->wait_bit.key.bit_nr);
+   spin_lock_irqsave(&wq->lock, flags);
+   __add_wait_queue(wq, &state->wait_bit.wait);
+   if (!buffer_locked(state->cur_bh)) {
+   dprintk("ext4_readpage_wait_on_bh(%p): buffer not locked\n", 
state);
+   list_del_init(&state->wait_bit.wait.task_list);
+   ret = 1;
+   }
+   spin_unlock_irqrestore(&wq->lock, flags);
+
+   dprintk("ext4

[RFC 1/2] aio: async readahead

2014-09-17 Thread Benjamin LaHaise
Hi Milosz et al,

This code is probably relevant to the non-blocking read thread.  A 
non-blocking read is pretty useless without some way to trigger and 
become aware of data being read into the page cache, and the attached 
patch is one way to do so.

The changes below introduce an async readahead operation that is based 
on readpage (sorry, I haven't done an mpage version of this code yet).  
Please note that this code was written against an older kernel (3.4) 
and hasn't been extensively tested against recent kernels, so there may 
be a few bugs lingering.  That said, the code has been enabled in our 
internal kernel at Solace Systems for a few months now with no reported 
issues.

There is a companion patch, to follow, that makes ext3's readpage operation 
use async metadata reads.  A test program that uses the new readahead 
operation can be found at http://www.kvack.org/~bcrl/aio-readahead.c .

-ben
-- 
"Thought is the essence of where you are now."

 fs/aio.c |  220 +++
 include/linux/pagemap.h  |2 
 include/uapi/linux/aio_abi.h |2 
 mm/filemap.c |2 
 4 files changed, 225 insertions(+), 1 deletion(-)
diff --git a/fs/aio.c b/fs/aio.c
index 7337500..f1c0f74 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -46,6 +46,8 @@
 
 #include "internal.h"
 
+static long aio_readahead(struct kiocb *iocb);
+
 #define AIO_RING_MAGIC 0xa10a10a1
 #define AIO_RING_COMPAT_FEATURES   1
 #define AIO_RING_INCOMPAT_FEATURES 0
@@ -1379,6 +1381,12 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned 
opcode,
iter_op = file->f_op->read_iter;
goto rw_common;
 
+   case IOCB_CMD_READAHEAD:
+   ret = -EBADF;
+   if (unlikely(!(file->f_mode & FMODE_READ)))
+   break;
+   return aio_readahead(req);
+
case IOCB_CMD_PWRITE:
case IOCB_CMD_PWRITEV:
mode= FMODE_WRITE;
@@ -1710,3 +1718,215 @@ SYSCALL_DEFINE5(io_getevents, aio_context_t, ctx_id,
}
return ret;
 }
+
+/* for readahead */
+struct readahead_state;
+struct readahead_pginfo {
+   struct wait_bit_queue   wait_bit;
+   struct page *page;
+};
+
+struct readahead_state {
+   struct kiocb*iocb;
+   unsignednr_pages;
+   atomic_tnr_pages_reading;
+
+   struct readahead_pginfo pginfo[];
+};
+
+static void aio_readahead_complete(struct readahead_state *state)
+{
+   unsigned i, nr_uptodate = 0;
+   struct kiocb *iocb;
+   long res;
+   if (!atomic_dec_and_test(&state->nr_pages_reading))
+   return;
+   for (i = 0; i < state->nr_pages; i++) {
+   struct page *page = state->pginfo[i].page;
+
+   if (PageUptodate(page))
+   nr_uptodate++;
+   page_cache_release(page);
+   }
+   iocb = state->iocb;
+   if (nr_uptodate == state->nr_pages) {
+   res = iocb->ki_nbytes;
+   } else
+   res = -EIO;
+   kfree(state);
+   aio_complete(iocb, res, 0);
+}
+
+static int pginfo_wait_func(wait_queue_t *wait, unsigned mode, int flags,
+   void *arg)
+{
+   struct readahead_state *state = wait->private;
+   struct readahead_pginfo *pginfo;
+   struct wait_bit_key *key = arg;
+   unsigned idx;
+
+   pginfo = container_of(wait, struct readahead_pginfo, wait_bit.wait);
+   idx = pginfo - state->pginfo;
+   BUG_ON(idx >= state->nr_pages);
+
+   if (pginfo->wait_bit.key.flags != key->flags ||
+   pginfo->wait_bit.key.bit_nr != key->bit_nr ||
+   test_bit(key->bit_nr, key->flags))
+   return 0;
+   list_del_init(&wait->task_list);
+   aio_readahead_complete(state);
+   return 1;
+}
+
+static void pginfo_wait_on_page(struct readahead_state *state,
+   struct readahead_pginfo *pginfo)
+{
+   struct page *page = pginfo->page;
+   wait_queue_head_t *wq;
+   unsigned long flags;
+   int ret;
+
+   pginfo->wait_bit.key.flags = &page->flags;
+   pginfo->wait_bit.key.bit_nr = PG_locked;
+   pginfo->wait_bit.wait.private = state;
+   pginfo->wait_bit.wait.func = pginfo_wait_func;
+   
+   page = pginfo->page;
+   wq = page_waitqueue(page);
+   atomic_inc(&state->nr_pages_reading);
+
+   spin_lock_irqsave(&wq->lock, flags);
+   __add_wait_queue(wq, &pginfo->wait_bit.wait);
+   if (!PageLocked(page))
+   ret = pginfo_wait_func(&pginfo->wait_bit.wait, 0, 0,
+  &pginfo->wait_bit.key);
+   spin_unlock_irqrestore(&wq->lock, flags);
+}
+
+
+/*
+ * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates 
all
+ * the pages first, then submits them all for I/O

Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances

2014-09-17 Thread Benjamin LaHaise
On Wed, Sep 17, 2014 at 09:47:02AM -0400, Theodore Ts'o wrote:
...
> % git version
> git version 2.1.0
> 
> Perhaps you and other people are using your own scripts, and not using
> git send-email?

That would be because none of my systems have git 2.1.0 on them.  Fedora 
(and EPEL) appear to still be back on git 1.9.3 which does not have git 
send-email.  Until that command is more widely propagated, I expect we'll 
see people making this mistake every once in a while.

-ben
-- 
"Thought is the essence of where you are now."


Re: [PATCH 7/7] check for O_NONBLOCK in all read_iter instances

2014-09-17 Thread Benjamin LaHaise
On Wed, Sep 17, 2014 at 07:42:54AM +1000, Dave Chinner wrote:
> [Please don't top post!]
> 
> On Tue, Sep 16, 2014 at 03:45:52PM -0400, Milosz Tanski wrote:
> > On Tue, Sep 16, 2014 at 3:27 PM, Jeff Moyer  wrote:
> > > Christoph Hellwig  writes:
> > >
> > > Hrm, you're not Christoph...
> >
> > I am not Christoph, we collaborated and he sent me this patch.
> 
> You're missing Jeff's point: have a look at the name and email
> address the mail appears to be from. It's completely mangled (forged,
> if you will), and Linus had a major rant about doing exactly this to a
> patch series recently.  There is a perfectly acceptable way of
> correctly crediting who the patch is from without resorting to games
> like this.

Linus flamed me for that a few weeks ago.  The problem is that if one 
uses "git format-patch" to prepare a series of emails to post, it 
uses the patch's author for the From: entry.  I think that is a bug 
in git, since multiple people have encountered this issue.  It's not 
like git doesn't know what one's email address is...

-ben

> Also, this patch doesn't have a description or a valid SOB on it
> 
> Please read Documentation/SubmittingPatches so you get the format of
> the patches correct for V2. ;)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to majord...@kvack.org.  For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: a...@kvack.org

-- 
"Thought is the essence of where you are now."


Re: [PATCH] aio: block exit_aio() until all context requests are completed

2014-09-04 Thread Benjamin LaHaise
On Thu, Sep 04, 2014 at 08:47:19AM +0800, Gu Zheng wrote:
> That's good, but I did not find it.  Have you sent it to the open list?

Sorry, the aio list wasn't cc'd on that one -- it was on one of the other 
development lists.

> IMO, there's no need to check the return value of kill_ioctx(), because
> the context can be marked as DEAD only by io_destroy(), and io_destroy()
> runs in blocking mode.  So if we reach the call to kill_ioctx() in
> exit_aio(), the context is surely valid.

Fair point.  I applied your version.  A pull request has been sent.

-ben

> Thanks,
> Gu
> 
> > 
> > -ben
> > 
> >> Signed-off-by: Gu Zheng 
> >> ---
> >>  fs/aio.c |7 ++-
> >>  1 files changed, 6 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/fs/aio.c b/fs/aio.c
> >> index 97bc62c..3e35b12 100644
> >> --- a/fs/aio.c
> >> +++ b/fs/aio.c
> >> @@ -793,6 +793,8 @@ void exit_aio(struct mm_struct *mm)
> >>  
> >>for (i = 0; i < table->nr; ++i) {
> >>struct kioctx *ctx = table->table[i];
> >> +  struct completion requests_done =
> >> +  COMPLETION_INITIALIZER_ONSTACK(requests_done);
> >>  
> >>if (!ctx)
> >>continue;
> >> @@ -804,7 +806,10 @@ void exit_aio(struct mm_struct *mm)
> >> * that it needs to unmap the area, just set it to 0.
> >> */
> >>ctx->mmap_size = 0;
> >> -  kill_ioctx(mm, ctx, NULL);
> >> +  kill_ioctx(mm, ctx, &requests_done);
> >> +
> >> +  /* Wait until all IO for the context are done. */
> >> +  wait_for_completion(&requests_done);
> >>}
> >>  
> >>RCU_INIT_POINTER(mm->ioctx_table, NULL);
> >> -- 
> >> 1.7.7
> > 
> 

-- 
"Thought is the essence of where you are now."


[GIT PULL] aio: 2 small fixes for 3.17

2014-09-04 Thread Benjamin LaHaise
The following changes since commit 7be141d0554921751db103b54e9f794956aa4f65:

  Merge branch 'x86-urgent-for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2014-08-24 16:17:41 
-0700)

are available in the git repository at:


  git://git.kvack.org/~bcrl/aio-fixes.git master

for you to fetch changes up to 6098b45b32e6baeacc04790773ced9340601d511:

  aio: block exit_aio() until all context requests are completed (2014-09-04 
16:54:47 -0400)


Gu Zheng (1):
  aio: block exit_aio() until all context requests are completed

Jeff Moyer (1):
  aio: add missing smp_rmb() in read_events_ring

 fs/aio.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)
-- 
"Thought is the essence of where you are now."


Re: [PATCH] aio: block exit_aio() until all context requests are completed

2014-09-03 Thread Benjamin LaHaise
Hi Gu,

On Wed, Sep 03, 2014 at 05:45:44PM +0800, Gu Zheng wrote:
> It seems that exit_aio() also needs to wait for all iocbs to complete (like
> io_destroy() does), but we missed the wait step in the current
> implementation, so fix it in the same way as we did in io_destroy().

I actually already have another version of this patch queued up that 
checks the return value of kill_ioctx().

-ben

> Signed-off-by: Gu Zheng 
> ---
>  fs/aio.c |7 ++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 97bc62c..3e35b12 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -793,6 +793,8 @@ void exit_aio(struct mm_struct *mm)
>  
>   for (i = 0; i < table->nr; ++i) {
>   struct kioctx *ctx = table->table[i];
> + struct completion requests_done =
> + COMPLETION_INITIALIZER_ONSTACK(requests_done);
>  
>   if (!ctx)
>   continue;
> @@ -804,7 +806,10 @@ void exit_aio(struct mm_struct *mm)
>* that it needs to unmap the area, just set it to 0.
>*/
>   ctx->mmap_size = 0;
> - kill_ioctx(mm, ctx, NULL);
> + kill_ioctx(mm, ctx, &requests_done);
> +
> + /* Wait until all IO for the context are done. */
> + wait_for_completion(&requests_done);
>   }
>  
>   RCU_INIT_POINTER(mm->ioctx_table, NULL);
> -- 
> 1.7.7

-- 
"Thought is the essence of where you are now."


Re: [PATCH v1 5/9] block: loop: convert to blk-mq

2014-08-27 Thread Benjamin LaHaise
On Wed, Aug 27, 2014 at 08:08:59PM +0400, Maxim Patlasov wrote:
...
> 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops
> 2) the same as above, but call loop_queue_work() directly from 
> loop_queue_rq() -- 270K iops
> 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops
> 
> Taking into account such a big difference (11K vs. 270K), would it be 
> worthwhile to implement a pure non-blocking version of aio_kernel_submit() 
> that returns an error if blocking is needed?  Then the loop driver (or any 
> other in-kernel user) could first try that non-blocking submit as a 
> fast path and fall back to queueing only if it fails.

What filesystem is the backing file for loop0 on?  O_DIRECT access as 
Ming's patches use should be non-blocking, and if not, that's something 
to fix.

-ben
-- 
"Thought is the essence of where you are now."


Re: Revert "aio: fix aio request leak when events are reaped by user space"

2014-08-25 Thread Benjamin LaHaise
On Mon, Aug 25, 2014 at 03:06:12PM +, Elliott, Robert (Server Storage) 
wrote:
> Using this version of the patch, I ran into this crash after 36
> hours of scsi-mq testing over the weekend.
...
> 
> local_irq_save(flags);
> kcpu = this_cpu_ptr(ctx->cpu);
> kcpu->reqs_available += nr;
> 
> while (kcpu->reqs_available >= ctx->req_batch * 2) {

This crash is not particularly useful, as that version of the patch was 
buggy and could result in reqs_available getting corrupted (ie, it could 
overflow), which could lead to the system getting stuck for a significant 
amount of time in put_reqs_available().  In other words, the crash you 
reported matches the known problem with that patch.

-ben
-- 
"Thought is the essence of where you are now."


Re: Revert "aio: fix aio request leak when events are reaped by user space"

2014-08-24 Thread Benjamin LaHaise
On Fri, Aug 22, 2014 at 02:43:56PM -0700, Linus Torvalds wrote:
> Ugh.
> 
> Ben, at this point my gut feel is that we should just revert the
> original "fix", and you should take a much deeper look at this all.
> The original "fix" was more broken than the leak it purported to fix,
> and now the patch to fix your fix has gone through two iterations and
> *still* Dan is finding bugs in it.  I'm getting the feeling that this
> code needs more thinking than you are actually putting into it.

That's why I hadn't sent it out as an official [PATCH] just yet.  I think 
things worked out okay, since the untested patch I sent out pointed Dan in 
the right direction and he was able to put some effort into it while I 
didn't have the time to do so immediately.  I just put in a few hours to 
polish off the final details on this fix now, and it should be coming your 
way as soon as I get an ack back from Dan.  Hopefully Kent can review it 
as well, since I had to modify the approach to try to retain the advantages 
of his batched reqs_available handling and avoid bouncing the cacheline 
ctx->completion_lock is on during io_submit().  Cheers,

-ben

> Linus

-- 
"Thought is the essence of where you are now."


Re: Revert "aio: fix aio request leak when events are reaped by user space"

2014-08-24 Thread Benjamin LaHaise
On Fri, Aug 22, 2014 at 09:51:10PM +0300, Dan Aloni wrote:
> Ben, seems that the test program needs some twidling to make the bug
> appear still by setting MAX_IOS to 256 (and it still passes on a
> kernel with the original patch reverted). Under this condition the
> ring buffer size remains 128 (here, 32*4 CPUs), and it is overrun on
> the second attempt.

Thanks for checking that.  I've made some further changes to your test 
program to be able to support both userspace event reaping and using the 
io_getevents() syscall.  I also added a --max-ios= parameter to allow 
using different values of MAX_IOS for testing.  A copy of the modified 
source is available at http://www.kvack.org/~bcrl/20140824-aio_bug.c .  
With this version of the patch, both means of reaping events now work 
correctly.  The aio_bug.c code still needs to be added to libaio's set 
of tests, but that's a separate email.

> I think I have found two problems with your patch: first, the
> completed_events field is never decremented so it goes up until 2^32
> wraparound. So I tested with 'ctx->completed_events -= completed;'
> there (along with some prints), but testing revealed that this didn't
> solve the problem, so secondly, I also fixed the 'avail = ' line. The
> case where the 'head > tail' case didn't look correct to me.

Yeah, sorry for not checking that.  I didn't have time to sit down for a 
couple of hours to test and verify it until today.  Your changes for 
handling ring wrap around correctly look good after inspecting the code 
and how it reacts to the various conditions.

> So the good news is that it works now with fix below and MAX_IOS=256
> and even with MAX_IOS=512. You can git-amend this it to your patch I
> guess.

I ended up making a bunch of changes to the patch to ensure the user 
space tainted value of ring->head is clamped.  I also made the code use 
ctx->tail to avoid having to do a mod operation on the value of tail as 
well.  The code is rearranged to optimistically call refill_reqs_available() 
from aio_complete() so that the refill is not deferred entirely to 
io_submit() time; this keeps the refill out of the io_submit() path, 
which does not otherwise have to take ctx->completion_lock and 
potentially bounce that cache line around.
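[Editor's aside: a hedged, user-space Python model of the ring accounting
described above.  This is not the kernel code; the function names and the
reclaimable() helper are illustrative only.  It shows how the number of
unreaped events is computed with the user-tainted head clamped and ring
wrap-around handled, and how many request slots can then be handed back.]

```python
def unreaped_events(head, tail, nr_events):
    """Events still sitting in the completion ring between head and tail.

    head comes from a page shared with user space, so it must be clamped
    before use; tail models the kernel-private ctx->tail.
    """
    head %= nr_events                  # clamp the user-tainted value
    if head <= tail:
        return tail - head             # no wrap-around
    return nr_events - head + tail     # the ring has wrapped

def reclaimable(completed_events, head, tail, nr_events):
    """Slots a refill step may return to the free pool: completions that
    user space has already reaped out of the ring."""
    avail = unreaped_events(head, tail, nr_events)
    return completed_events - avail if avail < completed_events else 0
```

Under this model, slots are reclaimed only once events leave the ring, which
is exactly why user-space event reaping no longer leaks requests.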

If you can have a quick look and acknowledge with your Signed-off-by, I'll 
add it to the commit and send a pull request.  With these changes and the 
modified aio_bug.c, the test passes using various random values for 
max_ios, both with and without --user_getevents, validating that 
both regressions involved have now been fixed.

...
> BTW, I am not an expert on this code so I am still not sure that
> 'ctx->completed_events' wouldn't get wrong if for instance - one SMP
> core does userspace event reaping and another calls io_submit(). Do
> you think it would work alright in that case?

This should be SMP safe: both ctx->completed_events and ctx->tail are 
protected by ctx->completion_lock.  Additionally, possible races with 
grabbing the value of ring->head should also be safe, and I added a 
comment explaining why.  Does that address your concerns?  Cheers,

-ben
-- 
"Thought is the essence of where you are now."
 SNIP: from git://git.kvack.org/~bcrl/aio-fixes.git 
From 46faee0c63f6712874215b7ca878c52a0960c5da Mon Sep 17 00:00:00 2001
From: Benjamin LaHaise 
Date: Sun, 24 Aug 2014 13:14:05 -0400
Subject: [PATCH] aio: fix reqs_available handling

As reported by Dan Aloni , the commit "aio: fix aio request
leak when events are reaped by userspace" introduces a regression when user
code attempts to perform io_submit() with more events than are available in
the ring buffer.  Reverting that commit would reintroduce a regression when
user space event reaping is used.

Fixing this bug is a bit more involved than the previous attempts to fix
this regression.  Since we do not have a single point at which we can count
events as being reaped by user space and io_getevents(), we have to track
event completion by looking at the number of events left in the event ring.
So long as there are as many events in the ring buffer as there have been
completion events generated, we cannot call put_reqs_available().  The code
to check for this is now placed in refill_reqs_available().

A test program from Dan and modified by me for verifying this bug is available
at http://www.kvack.org/~bcrl/20140824-aio_bug.c .

Reported-by: Dan Aloni 
Signed-off-by: Benjamin LaHaise 
---
 fs/aio.c | 77 
 1 file changed, 73 insertions(+), 4 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index ae63587..24c4fe4 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -141,6 +141,7 @@ struct kioctx {
 
struct {

Re: Revert "aio: fix aio request leak when events are reaped by user space"

2014-08-22 Thread Benjamin LaHaise
On Fri, Aug 22, 2014 at 07:15:02PM +0300, Dan Aloni wrote:
> Sorry, I was waiting for a new patch from your direction, I should
> have replied earlier. What bothered me about the patch you sent is that
> completed_events is added as a new field but nothing assigns to it, so I 
> wonder how it can be effective.

Ah, that was missing a hunk then.  Try this version instead.

-ben
-- 
"Thought is the essence of where you are now."


diff --git a/fs/aio.c b/fs/aio.c
index ae63587..fbdcc47 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -142,6 +142,7 @@ struct kioctx {
struct {
unsignedtail;
spinlock_t  completion_lock;
+   unsignedcompleted_events;
} ____cacheline_aligned_in_smp;
 
struct page *internal_pages[AIO_RING_PAGES];
@@ -857,6 +858,31 @@ out:
return ret;
 }
 
+static void refill_reqs_available(struct kioctx *ctx)
+{
+   spin_lock_irq(&ctx->completion_lock);
+   if (ctx->completed_events) {
+   unsigned head, tail, avail, completed;
+   struct aio_ring *ring;
+
+   ring = kmap_atomic(ctx->ring_pages[0]);
+   head = ACCESS_ONCE(ring->head);
+   tail = ACCESS_ONCE(ring->tail);
+   kunmap_atomic(ring);
+
+   avail = (head <= tail ?  tail : ctx->nr_events) - head;
+   completed = ctx->completed_events;
+   if (avail < completed)
+   completed -= avail;
+   else
+   completed = 0;
+   put_reqs_available(ctx, completed);
+   }
+
+   spin_unlock_irq(&ctx->completion_lock);
+}
+
+
 /* aio_get_req
  * Allocate a slot for an aio request.
  * Returns NULL if no requests are free.
@@ -865,8 +891,11 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
 {
struct kiocb *req;
 
-   if (!get_reqs_available(ctx))
-   return NULL;
+   if (!get_reqs_available(ctx)) {
+   refill_reqs_available(ctx);
+   if (!get_reqs_available(ctx))
+   return NULL;
+   }
 
req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
if (unlikely(!req))
@@ -958,6 +987,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 */
spin_lock_irqsave(&ctx->completion_lock, flags);
 
+   ctx->completed_events++;
tail = ctx->tail;
pos = tail + AIO_EVENTS_OFFSET;
 
@@ -1005,7 +1035,6 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 
/* everything turned out well, dispose of the aiocb. */
kiocb_free(iocb);
-   put_reqs_available(ctx, 1);
 
/*
 * We have to order our ring_info tail store above and test


Re: Revert "aio: fix aio request leak when events are reaped by user space"

2014-08-22 Thread Benjamin LaHaise
On Tue, Aug 19, 2014 at 08:46:51PM -0400, Benjamin LaHaise wrote:
> You can trigger the behaviour with fio by using userspace event reaping.  
> Adding a test case for that behaviour to libaio would be a good idea.

> I thought about how to fix this, and it isn't actually that hard.  Move 
> the put_reqs_available() call back into event consumption, and then add 
> code in the submit path to call put_reqs_available() if the system runs 
> out of events by noticing that there is free space in the event ring.  
> Something along the lines below should do it (please note, this is 
> completely untested!).  I'll test and polish this off tomorrow, as it's 
> getting a bit late here.

Dan, does this patch work for you?  It seems to pass your test program 
when I run it in a vm...

-ben
> -- 
> "Thought is the essence of where you are now."
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index ae63587..749ac22 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -142,6 +142,7 @@ struct kioctx {
>   struct {
>   unsignedtail;
>   spinlock_t  completion_lock;
> + unsignedcompleted_events;
>   } ____cacheline_aligned_in_smp;
>  
>   struct page *internal_pages[AIO_RING_PAGES];
> @@ -857,6 +858,31 @@ out:
>   return ret;
>  }
>  
> +static void refill_reqs_available(struct kioctx *ctx)
> +{
> + spin_lock_irq(&ctx->completion_lock);
> + if (ctx->completed_events) {
> + unsigned head, tail, avail, completed;
> + struct aio_ring *ring;
> +
> + ring = kmap_atomic(ctx->ring_pages[0]);
> + head = ACCESS_ONCE(ring->head);
> + tail = ACCESS_ONCE(ring->tail);
> + kunmap_atomic(ring);
> +
> + avail = (head <= tail ?  tail : ctx->nr_events) - head;
> + completed = ctx->completed_events;
> + if (avail < completed)
> + completed -= avail;
> + else
> + completed = 0;
> + put_reqs_available(ctx, completed);
> + }
> +
> + spin_unlock_irq(&ctx->completion_lock);
> +}
> +
> +
>  /* aio_get_req
>   *   Allocate a slot for an aio request.
>   * Returns NULL if no requests are free.
> @@ -865,8 +891,11 @@ static inline struct kiocb *aio_get_req(struct kioctx 
> *ctx)
>  {
>   struct kiocb *req;
>  
> - if (!get_reqs_available(ctx))
> - return NULL;
> + if (!get_reqs_available(ctx)) {
> + refill_reqs_available(ctx);
> + if (!get_reqs_available(ctx))
> + return NULL;
> + }
>  
>   req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
>   if (unlikely(!req))
> @@ -1005,7 +1034,6 @@ void aio_complete(struct kiocb *iocb, long res, long 
> res2)
>  
>   /* everything turned out well, dispose of the aiocb. */
>   kiocb_free(iocb);
> - put_reqs_available(ctx, 1);
>  
>   /*
>* We have to order our ring_info tail store above and test

-- 
"Thought is the essence of where you are now."


Re: Revert "aio: fix aio request leak when events are reaped by user space"

2014-08-19 Thread Benjamin LaHaise
On Tue, Aug 19, 2014 at 08:14:26PM +0300, Dan Aloni wrote:
> On Tue, Aug 19, 2014 at 12:54:04PM -0400, Benjamin LaHaise wrote:
> > On Tue, Aug 19, 2014 at 07:37:33PM +0300, Dan Aloni wrote:
> > > Some testing I've done today indicates that the original commit broke
> > > AIO with regard to users that overflow the maximum number of request
> > > per IO context (where -EAGAIN is returned).
> > > 
> > > In fact, it did worse - the attached C program can easily overrun the
> > > ring buffer that is responsible for managing the completed requests,
> > > and caused notification about their completion never to be returned.
> > 
> > Argh, that would be a problem.
> > 
> > ...
> > > This reverts commit b34e0e1319b31202eb142dcd9688cf7145a30bf6.
> > 
> > Reverting isn't okay, as that reintroduces another regression.  We need 
> > to come up with a fix for this issue that doesn't reintroduce the other 
> > regression for events reaped in user space.  Let me have a look and see 
> > what I can come up with...
> 
> About the original regression you mention, is there a program you can
> indicate that reproduces it? On my setups, the regression testing in 
> libaio was not able to detect the current regression too.

You can trigger the behaviour with fio by using userspace event reaping.  
Adding a test case for that behaviour to libaio would be a good idea.

I thought about how to fix this, and it isn't actually that hard.  Move 
the put_reqs_available() call back into event consumption, and then add 
code in the submit path to call put_reqs_available() if the system runs 
out of events by noticing that there is free space in the event ring.  
Something along the lines below should do it (please note, this is 
completely untested!).  I'll test and polish this off tomorrow, as it's 
getting a bit late here.
 
-ben
-- 
"Thought is the essence of where you are now."

diff --git a/fs/aio.c b/fs/aio.c
index ae63587..749ac22 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -142,6 +142,7 @@ struct kioctx {
struct {
unsignedtail;
spinlock_t  completion_lock;
+   unsignedcompleted_events;
} ____cacheline_aligned_in_smp;
 
struct page *internal_pages[AIO_RING_PAGES];
@@ -857,6 +858,31 @@ out:
return ret;
 }
 
+static void refill_reqs_available(struct kioctx *ctx)
+{
+   spin_lock_irq(&ctx->completion_lock);
+   if (ctx->completed_events) {
+   unsigned head, tail, avail, completed;
+   struct aio_ring *ring;
+
+   ring = kmap_atomic(ctx->ring_pages[0]);
+   head = ACCESS_ONCE(ring->head);
+   tail = ACCESS_ONCE(ring->tail);
+   kunmap_atomic(ring);
+
+   avail = (head <= tail ?  tail : ctx->nr_events) - head;
+   completed = ctx->completed_events;
+   if (avail < completed)
+   completed -= avail;
+   else
+   completed = 0;
+   put_reqs_available(ctx, completed);
+   }
+
+   spin_unlock_irq(&ctx->completion_lock);
+}
+
+
 /* aio_get_req
  * Allocate a slot for an aio request.
  * Returns NULL if no requests are free.
@@ -865,8 +891,11 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
 {
struct kiocb *req;
 
-   if (!get_reqs_available(ctx))
-   return NULL;
+   if (!get_reqs_available(ctx)) {
+   refill_reqs_available(ctx);
+   if (!get_reqs_available(ctx))
+   return NULL;
+   }
 
req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
if (unlikely(!req))
@@ -1005,7 +1034,6 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
 
/* everything turned out well, dispose of the aiocb. */
kiocb_free(iocb);
-   put_reqs_available(ctx, 1);
 
/*
 * We have to order our ring_info tail store above and test


Re: Revert "aio: fix aio request leak when events are reaped by user space"

2014-08-19 Thread Benjamin LaHaise
On Tue, Aug 19, 2014 at 07:37:33PM +0300, Dan Aloni wrote:
> Some testing I've done today indicates that the original commit broke
> AIO with regard to users that overflow the maximum number of request
> per IO context (where -EAGAIN is returned).
> 
> In fact, it did worse - the attached C program can easily overrun the
> ring buffer that is responsible for managing the completed requests,
> and caused notification about their completion never to be returned.

Argh, that would be a problem.

...
> This reverts commit b34e0e1319b31202eb142dcd9688cf7145a30bf6.

Reverting isn't okay, as that reintroduces another regression.  We need 
to come up with a fix for this issue that doesn't reintroduce the other 
regression for events reaped in user space.  Let me have a look and see 
what I can come up with...

-ben
-- 
"Thought is the essence of where you are now."


[GIT PULL] aio changes for 3.17

2014-08-15 Thread Benjamin LaHaise
The following changes since commit 263782c1c95bbddbb022dc092fd89a36bb8d5577:

  aio: protect reqs_available updates from changes in interrupt handlers 
(2014-07-14 13:05:26 -0400)

are available in the git repository at:

  git://git.kvack.org/~bcrl/aio-next.git master

for you to fetch changes up to 00fefb9cf2b5493a86912de55ba912bdfae4a207:

  aio: use iovec array rather than the single one (2014-07-24 10:59:40 -0400)


Benjamin LaHaise (2):
  Merge ../aio-fixes
  aio: remove no longer needed preempt_disable()

Gu Zheng (4):
  aio: remove the needless registration of ring file's private_data
  aio: use the macro rather than the inline magic number
  aio: fix some comments
  aio: use iovec array rather than the single one

Oleg Nesterov (2):
  aio: change exit_aio() to load mm->ioctx_table once and avoid 
rcu_read_lock()
  aio: kill the misleading rcu read locks in ioctx_add_table() and 
kill_ioctx()

 fs/aio.c | 86 ++--
 1 file changed, 30 insertions(+), 56 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 1c9c5f0..0fd9181 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -192,7 +192,6 @@ static struct file *aio_private_file(struct kioctx *ctx, 
loff_t nr_pages)
}
 
file->f_flags = O_RDWR;
-   file->private_data = ctx;
return file;
 }
 
@@ -202,7 +201,7 @@ static struct dentry *aio_mount(struct file_system_type 
*fs_type,
static const struct dentry_operations ops = {
.d_dname= simple_dname,
};
-   return mount_pseudo(fs_type, "aio:", NULL, &ops, 0xa10a10a1);
+   return mount_pseudo(fs_type, "aio:", NULL, &ops, AIO_RING_MAGIC);
 }
 
 /* aio_setup
@@ -554,8 +553,7 @@ static int ioctx_add_table(struct kioctx *ctx, struct 
mm_struct *mm)
struct aio_ring *ring;
 
spin_lock(&mm->ioctx_lock);
-   rcu_read_lock();
-   table = rcu_dereference(mm->ioctx_table);
+   table = rcu_dereference_raw(mm->ioctx_table);
 
while (1) {
if (table)
@@ -563,7 +561,6 @@ static int ioctx_add_table(struct kioctx *ctx, struct 
mm_struct *mm)
if (!table->table[i]) {
ctx->id = i;
table->table[i] = ctx;
-   rcu_read_unlock();
spin_unlock(&mm->ioctx_lock);
 
/* While kioctx setup is in progress,
@@ -577,8 +574,6 @@ static int ioctx_add_table(struct kioctx *ctx, struct 
mm_struct *mm)
}
 
new_nr = (table ? table->nr : 1) * 4;
-
-   rcu_read_unlock();
spin_unlock(&mm->ioctx_lock);
 
table = kzalloc(sizeof(*table) + sizeof(struct kioctx *) *
@@ -589,8 +584,7 @@ static int ioctx_add_table(struct kioctx *ctx, struct 
mm_struct *mm)
table->nr = new_nr;
 
spin_lock(&mm->ioctx_lock);
-   rcu_read_lock();
-   old = rcu_dereference(mm->ioctx_table);
+   old = rcu_dereference_raw(mm->ioctx_table);
 
if (!old) {
rcu_assign_pointer(mm->ioctx_table, table);
@@ -737,12 +731,9 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx 
*ctx,
 
 
spin_lock(&mm->ioctx_lock);
-   rcu_read_lock();
-   table = rcu_dereference(mm->ioctx_table);
-
+   table = rcu_dereference_raw(mm->ioctx_table);
WARN_ON(ctx != table->table[ctx->id]);
table->table[ctx->id] = NULL;
-   rcu_read_unlock();
spin_unlock(&mm->ioctx_lock);
 
/* percpu_ref_kill() will do the necessary call_rcu() */
@@ -791,40 +782,30 @@ EXPORT_SYMBOL(wait_on_sync_kiocb);
  */
 void exit_aio(struct mm_struct *mm)
 {
-   struct kioctx_table *table;
-   struct kioctx *ctx;
-   unsigned i = 0;
-
-   while (1) {
-   rcu_read_lock();
-   table = rcu_dereference(mm->ioctx_table);
-
-   do {
-   if (!table || i >= table->nr) {
-   rcu_read_unlock();
-   rcu_assign_pointer(mm->ioctx_table, NULL);
-   if (table)
-   kfree(table);
-   return;
-   }
+   struct kioctx_table *table = rcu_dereference_raw(mm->ioctx_table);
+   int i;
 
-   ctx = table->table[i++];
-   } while (!ctx);
+   if (!table)
+   return;
 
-   rcu_read_unlock();
+   for (i = 0; i < table->nr; ++i) {
+   struct kioctx *ct

Re: [PATCH 7/7] aio: use iovec array rather than the single one

2014-08-15 Thread Benjamin LaHaise
On Fri, Aug 15, 2014 at 06:35:25PM -0600, Linus Torvalds wrote:
> Ben,
> 
> *please* stop doing this kind of crap.
> 
> You're creating fake emails with invalid senders. That's simply not 
> acceptable.
...

I'll admit this was my mistake and I should have caught it while reviewing 
the emails.  At least in the past my mail client has generated the correct 
From: headers, but I made a mistake somewhere in the process today.  Evidently 
passing in the --from= parameter to git-format-patch has been missing from 
my scripts forever.

That said: git should *not* default to generating fake From: headers in 
the emails it generates!  Having the git default be completely unreasonable 
in this case makes absolutely no sense to me, especially when git already 
knows what my bloody email address is!  It's wrong, it's clear that it's 
wrong, so why is the default that way?

Damned rope shooting my foot off.  Please excuse me while I go find a medic 
to stop the bleeding...

-ben
-- 
"Thought is the essence of where you are now."


[PATCH 3/7] aio: remove no longer needed preempt_disable()

2014-08-15 Thread Benjamin LaHaise
Based on feedback from Jens Axboe on 263782c1c95bbddbb022dc092fd89a36bb8d5577,
clean up get/put_reqs_available() to remove the no longer needed 
preempt_disable()
and preempt_enable() pair.

Signed-off-by: Benjamin LaHaise 
Cc: Jens Axboe 
---
 fs/aio.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 8216aa0..9ce9e8e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -814,10 +814,8 @@ static void put_reqs_available(struct kioctx *ctx, 
unsigned nr)
struct kioctx_cpu *kcpu;
unsigned long flags;
 
-   preempt_disable();
-   kcpu = this_cpu_ptr(ctx->cpu);
-
local_irq_save(flags);
+   kcpu = this_cpu_ptr(ctx->cpu);
kcpu->reqs_available += nr;
 
while (kcpu->reqs_available >= ctx->req_batch * 2) {
@@ -826,7 +824,6 @@ static void put_reqs_available(struct kioctx *ctx, unsigned 
nr)
}
 
local_irq_restore(flags);
-   preempt_enable();
 }
 
 static bool get_reqs_available(struct kioctx *ctx)
@@ -835,10 +832,8 @@ static bool get_reqs_available(struct kioctx *ctx)
bool ret = false;
unsigned long flags;
 
-   preempt_disable();
-   kcpu = this_cpu_ptr(ctx->cpu);
-
local_irq_save(flags);
+   kcpu = this_cpu_ptr(ctx->cpu);
if (!kcpu->reqs_available) {
int old, avail = atomic_read(&ctx->reqs_available);
 
@@ -858,7 +853,6 @@ static bool get_reqs_available(struct kioctx *ctx)
kcpu->reqs_available--;
 out:
local_irq_restore(flags);
-   preempt_enable();
return ret;
 }
 
-- 
1.8.2.1
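[Editor's aside: the get_reqs_available()/put_reqs_available() pair above
implements per-CPU credit batching: each CPU caches request slots locally
and touches the shared counter only in batches of req_batch, which is what
keeps the shared cache line quiet.  A hedged user-space model follows; the
names are illustrative, not the kernel's.]

```python
shared = {"reqs_available": 128}      # models the atomic ctx->reqs_available

class PerCpuCredits:
    """Models struct kioctx_cpu: a per-CPU cache of request slots."""

    def __init__(self, req_batch):
        self.req_batch = req_batch
        self.reqs_available = 0       # locally cached slots

    def get_req(self):
        # Slow path: refill the local cache from the shared counter,
        # taking up to req_batch slots in one shared-line access.
        if self.reqs_available == 0:
            take = min(shared["reqs_available"], self.req_batch)
            if take == 0:
                return False          # truly out of slots (-EAGAIN)
            shared["reqs_available"] -= take
            self.reqs_available = take
        self.reqs_available -= 1
        return True

    def put_req(self, nr=1):
        # Completions credit the local cache; flush back to the shared
        # counter only once the cache reaches twice the batch size.
        self.reqs_available += nr
        while self.reqs_available >= self.req_batch * 2:
            shared["reqs_available"] += self.req_batch
            self.reqs_available -= self.req_batch
```

Because both paths run with interrupts off on the local CPU in the kernel,
the preempt_disable()/preempt_enable() pair removed by the patch above was
indeed redundant.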


-- 
"Thought is the essence of where you are now."


[PATCH 0/7] aio: changes for 3.17

2014-08-15 Thread Benjamin LaHaise
Hello Linus,

Please pull the following changes from the aio-next git tree at 
git://git.kvack.org/~bcrl/aio-next.git .  Most of the changes this time 
around are code cleanups.

-ben

Benjamin LaHaise (1):
  aio: remove no longer needed preempt_disable()

Gu Zheng (4):
  aio: remove the needless registration of ring file's private_data
  aio: use the macro rather than the inline magic number
  aio: fix some comments
  aio: use iovec array rather than the single one

Oleg Nesterov (2):
  aio: change exit_aio() to load mm->ioctx_table once and avoid
rcu_read_lock()
  aio: kill the misleading rcu read locks in ioctx_add_table() and
kill_ioctx()


-- 
"Thought is the essence of where you are now."


Re: [RESEND PATCH 2/4] aio: use the macro rather than the inline magic number

2014-07-24 Thread Benjamin LaHaise
On Wed, Jul 23, 2014 at 06:03:52PM +0800, Gu Zheng wrote:
> Replace the inline magic number with the ready-made macro (AIO_RING_MAGIC);
> this is just a cleanup.

Thanks Gu -- applied to my aio-next tree at   
git://git.kvack.org/~bcrl/aio-next.git .

-ben

> Reviewed-by: Jeff Moyer 
> Signed-off-by: Gu Zheng 
> ---
>  fs/aio.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 94546ab..e0f838d 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -201,7 +201,7 @@ static struct dentry *aio_mount(struct file_system_type *fs_type,
>   static const struct dentry_operations ops = {
>   .d_dname= simple_dname,
>   };
> - return mount_pseudo(fs_type, "aio:", NULL, &ops, 0xa10a10a1);
> + return mount_pseudo(fs_type, "aio:", NULL, &ops, AIO_RING_MAGIC);
>  }
>  
>  /* aio_setup
> -- 
> 1.7.7

-- 
"Thought is the essence of where you are now."


Re: [RESEND PATCH 3/4] aio: fix some comments

2014-07-24 Thread Benjamin LaHaise
On Wed, Jul 23, 2014 at 06:03:53PM +0800, Gu Zheng wrote:
> The function comments of aio_run_iocb and aio_read_events are out of date, so
> fix them here.

Thanks Gu -- applied to my aio-next tree at   
git://git.kvack.org/~bcrl/aio-next.git .

-ben

> Reviewed-by: Jeff Moyer 
> Signed-off-by: Gu Zheng 
> ---
>  fs/aio.c |7 +++
>  1 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index e0f838d..f1fede2 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1044,7 +1044,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
>  }
>  EXPORT_SYMBOL(aio_complete);
>  
> -/* aio_read_events
> +/* aio_read_events_ring
>   *   Pull an event off of the ioctx's event ring.  Returns the number of
>   *   events fetched
>   */
> @@ -1296,9 +1296,8 @@ static ssize_t aio_setup_single_vector(struct kiocb *kiocb,
>  }
>  
>  /*
> - * aio_setup_iocb:
> - *   Performs the initial checks and aio retry method
> - *   setup for the kiocb at the time of io submission.
> + * aio_run_iocb:
> + *   Performs the initial checks and io submission.
>   */
>  static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
>   char __user *buf, bool compat)
> -- 
> 1.7.7

-- 
"Thought is the essence of where you are now."


Re: [RESEND PATCH 1/4] aio: remove the needless registration of ring file's private_data

2014-07-24 Thread Benjamin LaHaise
On Wed, Jul 23, 2014 at 06:03:51PM +0800, Gu Zheng wrote:
> Remove the registration of the ring file's private_data; we do not use
> it.

Thanks Gu -- applied to my aio-next tree at 
git://git.kvack.org/~bcrl/aio-next.git .

-ben

> Reviewed-by: Jeff Moyer 
> Signed-off-by: Gu Zheng 
> ---
>  fs/aio.c |1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 1c9c5f0..94546ab 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -192,7 +192,6 @@ static struct file *aio_private_file(struct kioctx *ctx, loff_t nr_pages)
>   }
>  
>   file->f_flags = O_RDWR;
> - file->private_data = ctx;
>   return file;
>  }
>  
> -- 
> 1.7.7

-- 
"Thought is the essence of where you are now."


Re: [RESEND PATCH 4/4] aio: use iovec array rather than the single one

2014-07-24 Thread Benjamin LaHaise
On Wed, Jul 23, 2014 at 06:03:54PM +0800, Gu Zheng wrote:
> Previously, we only offered a single inline iovec to handle all the
> read/write cases, so PREADV/PWRITEV requests always needed to allocate an
> extra iovec buffer when copying user vectors.
> If we use a tmp iovec array rather than the single one, small
> PREADV/PWRITEV workloads (vector size smaller than the tmp buffer) will
> not need to allocate an extra iovec buffer when copying user vectors.

Thanks Gu -- applied to my aio-next tree at   
git://git.kvack.org/~bcrl/aio-next.git .

-ben

> Reviewed-by: Jeff Moyer 
> Signed-off-by: Gu Zheng 
> ---
>  fs/aio.c |   10 +-
>  1 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index f1fede2..df3491a 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1267,12 +1267,12 @@ static ssize_t aio_setup_vectored_rw(struct kiocb *kiocb,
>   if (compat)
>   ret = compat_rw_copy_check_uvector(rw,
>   (struct compat_iovec __user *)buf,
> - *nr_segs, 1, *iovec, iovec);
> + *nr_segs, UIO_FASTIOV, *iovec, iovec);
>   else
>  #endif
>   ret = rw_copy_check_uvector(rw,
>   (struct iovec __user *)buf,
> - *nr_segs, 1, *iovec, iovec);
> + *nr_segs, UIO_FASTIOV, *iovec, iovec);
>   if (ret < 0)
>   return ret;
>  
> @@ -1309,7 +1309,7 @@ static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
>   fmode_t mode;
>   aio_rw_op *rw_op;
>   rw_iter_op *iter_op;
> - struct iovec inline_vec, *iovec = &inline_vec;
> + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs;
>   struct iov_iter iter;
>  
>   switch (opcode) {
> @@ -1344,7 +1344,7 @@ rw_common:
>   if (!ret)
>   ret = rw_verify_area(rw, file, &req->ki_pos, req->ki_nbytes);
>   if (ret < 0) {
> - if (iovec != &inline_vec)
> + if (iovec != inline_vecs)
>   kfree(iovec);
>   return ret;
>   }
> @@ -1391,7 +1391,7 @@ rw_common:
>   return -EINVAL;
>   }
>  
> - if (iovec != &inline_vec)
> + if (iovec != inline_vecs)
>   kfree(iovec);
>  
>   if (ret != -EIOCBQUEUED) {
> -- 
> 1.7.7

-- 
"Thought is the essence of where you are now."


Re: [PATCH 3/4] aio: fix some comments

2014-07-22 Thread Benjamin LaHaise
On Tue, Jul 22, 2014 at 10:40:03AM +0800, Gu Zheng wrote:
> Signed-off-by: Gu Zheng 

Again, you're missing a commit message here.  Please resubmit with a commit 
message.

-ben

> ---
>  fs/aio.c |7 +++
>  1 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index 1dc6158..0cd0479 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1037,7 +1037,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
>  }
>  EXPORT_SYMBOL(aio_complete);
>  
> -/* aio_read_events
> +/* aio_read_events_ring
>   *   Pull an event off of the ioctx's event ring.  Returns the number of
>   *   events fetched
>   */
> @@ -1289,9 +1289,8 @@ static ssize_t aio_setup_single_vector(struct kiocb *kiocb,
>  }
>  
>  /*
> - * aio_setup_iocb:
> - *   Performs the initial checks and aio retry method
> - *   setup for the kiocb at the time of io submission.
> + * aio_run_iocb:
> + *   Performs the initial checks and io submission.
>   */
>  static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
>   char __user *buf, bool compat)
> -- 
> 1.7.7

-- 
"Thought is the essence of where you are now."


Re: [PATCH 2/4] aio: use the macro rather than the inline magic number

2014-07-22 Thread Benjamin LaHaise
On Tue, Jul 22, 2014 at 10:40:02AM +0800, Gu Zheng wrote:
> Signed-off-by: Gu Zheng 

You're missing a commit message here.  The patch is otherwise fine.

-ben

> ---
>  fs/aio.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/aio.c b/fs/aio.c
> index ad35876..1dc6158 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -201,7 +201,7 @@ static struct dentry *aio_mount(struct file_system_type *fs_type,
>   static const struct dentry_operations ops = {
>   .d_dname= simple_dname,
>   };
> - return mount_pseudo(fs_type, "aio:", NULL, &ops, 0xa10a10a1);
> + return mount_pseudo(fs_type, "aio:", NULL, &ops, AIO_RING_MAGIC);
>  }
>  
>  /* aio_setup
> -- 
> 1.7.7

-- 
"Thought is the essence of where you are now."


Re: [PATCH] aio: protect reqs_available updates from changes in interrupt handlers

2014-07-14 Thread Benjamin LaHaise
On Mon, Jul 14, 2014 at 07:30:43PM +0200, Jens Axboe wrote:
> Why not just make this:
> 
>   local_irq_save(flags);
>   kcpu = this_cpu_ptr(...);
>   ...
> 
> and get rid of the preemption bits, it's a bit redundant now you need to 
> kill local interrupts anyway.

*nod*.  I'll add the following cleanup patch, as the first version already 
got merged into another tree.

-ben

diff --git a/fs/aio.c b/fs/aio.c
index 1c9c5f0..104da62 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -832,10 +832,8 @@ static void put_reqs_available(struct kioctx *ctx, unsigned nr)
struct kioctx_cpu *kcpu;
unsigned long flags;
 
-   preempt_disable();
-   kcpu = this_cpu_ptr(ctx->cpu);
-
local_irq_save(flags);
+   kcpu = this_cpu_ptr(ctx->cpu);
kcpu->reqs_available += nr;
 
while (kcpu->reqs_available >= ctx->req_batch * 2) {
@@ -844,7 +842,6 @@ static void put_reqs_available(struct kioctx *ctx, unsigned nr)
}
 
local_irq_restore(flags);
-   preempt_enable();
 }
 
 static bool get_reqs_available(struct kioctx *ctx)
@@ -853,10 +850,8 @@ static bool get_reqs_available(struct kioctx *ctx)
bool ret = false;
unsigned long flags;
 
-   preempt_disable();
-   kcpu = this_cpu_ptr(ctx->cpu);
-
local_irq_save(flags);
+   kcpu = this_cpu_ptr(ctx->cpu);
if (!kcpu->reqs_available) {
int old, avail = atomic_read(&ctx->reqs_available);
 
@@ -876,7 +871,6 @@ static bool get_reqs_available(struct kioctx *ctx)
kcpu->reqs_available--;
 out:
local_irq_restore(flags);
-   preempt_enable();
return ret;
 }
 

-ben
-- 
"Thought is the essence of where you are now."


Re: scsi-mq V2

2014-07-14 Thread Benjamin LaHaise
Hi Robert,

On Sun, Jul 13, 2014 at 05:15:15PM +, Elliott, Robert (Server Storage) 
wrote:
> > > I will see if that solves the problem with the scsi-mq-3 tree, or
> > > at least some of the bisect trees leading up to it.
> > 
> > scsi-mq-3 is still going after 45 minutes.  I'll leave it running
> > overnight.
> > 
> 
> That has been going strong for 18 hours, so I think that's the patch
> we need.

Thanks for taking the time to narrow this down.  I've applied the fix to 
my aio-fixes tree at git://git.kvack.org/~bcrl/aio-fixes.git and forwarded 
it on to Linus as well.

-ben
-- 
"Thought is the essence of where you are now."


[PATCH] aio: protect reqs_available updates from changes in interrupt handlers

2014-07-14 Thread Benjamin LaHaise
Hello everyone,

Please pull the following commit (263782c1c95bbddbb022dc092fd89a36bb8d5577) 
from git://git.kvack.org/aio-fixes.git to fix an aio bug reported by Robert 
Elliott.
---
As of commit f8567a3845ac05bb28f3c1b478ef752762bd39ef it is now possible to
have put_reqs_available() called from irq context.  While put_reqs_available()
is per cpu, it did not protect itself from interrupts on the same CPU.  This
led to aio_complete() corrupting the available io requests count when run
under heavy O_DIRECT workloads, as reported by Robert Elliott.  Fix this by
disabling irq updates around the per cpu batch updates of reqs_available.

Many thanks to Robert and folks for testing and tracking this down.

Reported-by: Robert Elliott 
Tested-by: Robert Elliott 
Signed-off-by: Benjamin LaHaise 
Cc: Jens Axboe , Christoph Hellwig 
Cc: sta...@vger.kernel.org
---
 fs/aio.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index 955947e..1c9c5f0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -830,16 +830,20 @@ void exit_aio(struct mm_struct *mm)
 static void put_reqs_available(struct kioctx *ctx, unsigned nr)
 {
struct kioctx_cpu *kcpu;
+   unsigned long flags;
 
preempt_disable();
kcpu = this_cpu_ptr(ctx->cpu);
 
+   local_irq_save(flags);
kcpu->reqs_available += nr;
+
while (kcpu->reqs_available >= ctx->req_batch * 2) {
kcpu->reqs_available -= ctx->req_batch;
atomic_add(ctx->req_batch, &ctx->reqs_available);
}
 
+   local_irq_restore(flags);
preempt_enable();
 }
 
@@ -847,10 +851,12 @@ static bool get_reqs_available(struct kioctx *ctx)
 {
struct kioctx_cpu *kcpu;
bool ret = false;
+   unsigned long flags;
 
preempt_disable();
kcpu = this_cpu_ptr(ctx->cpu);
 
+   local_irq_save(flags);
if (!kcpu->reqs_available) {
int old, avail = atomic_read(&ctx->reqs_available);
 
@@ -869,6 +875,7 @@ static bool get_reqs_available(struct kioctx *ctx)
ret = true;
kcpu->reqs_available--;
 out:
+   local_irq_restore(flags);
preempt_enable();
return ret;
 }
-- 
1.8.2.1

-- 
"Thought is the essence of where you are now."


Re: scsi-mq V2

2014-07-11 Thread Benjamin LaHaise
On Fri, Jul 11, 2014 at 02:33:12PM +, Elliott, Robert (Server Storage) 
wrote:
> That ran 9 total hours with no problem.
> 
> Rather than revert in the bisect trees, I added just this single additional
> patch to the no-rebase tree, and the problem appeared:

Can you try the below totally untested patch instead?  It looks like
put_reqs_available() is not irq-safe.

-ben
-- 
"Thought is the essence of where you are now."


diff --git a/fs/aio.c b/fs/aio.c
index 955947e..4b97180 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -830,16 +830,20 @@ void exit_aio(struct mm_struct *mm)
 static void put_reqs_available(struct kioctx *ctx, unsigned nr)
 {
struct kioctx_cpu *kcpu;
+   unsigned long flags;
 
preempt_disable();
kcpu = this_cpu_ptr(ctx->cpu);
 
+   local_irq_save(flags);
kcpu->reqs_available += nr;
+
while (kcpu->reqs_available >= ctx->req_batch * 2) {
kcpu->reqs_available -= ctx->req_batch;
atomic_add(ctx->req_batch, &ctx->reqs_available);
}
 
+   local_irq_restore(flags);
preempt_enable();
 }
 


Re: scsi-mq V2

2014-07-10 Thread Benjamin LaHaise
On Thu, Jul 10, 2014 at 02:36:40PM +, Elliott, Robert (Server Storage) 
wrote:
> 
> 
> > -Original Message-
> > From: Jens Axboe [mailto:ax...@kernel.dk]
> > Sent: Thursday, 10 July, 2014 8:53 AM
> > To: Christoph Hellwig; Benjamin LaHaise
> > Cc: Elliott, Robert (Server Storage); dgilb...@interlog.com; James 
> > Bottomley;
> > Bart Van Assche; linux-s...@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: Re: scsi-mq V2
> > 
> > On 2014-07-10 15:50, Christoph Hellwig wrote:
> > > On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
> > >> There is one possible concern that could be exacerbated by other changes
> > in
> > >> the system: if the application is running close to the bare minimum 
> > >> number
> > >> of requests allocated in io_setup(), the per cpu reference counters will
> > >> have a hard time batching things.  It might be worth testing with an
> > >> increased number of requests being allocated if this is the case.
> > >
> > > Well, Robert said reverting the two aio commits didn't help.  Either he
> > > didn't manage to boot into the right kernel, or we need to look
> > > elsewhere for the culprit.
> > 
> > Rob, let me know what scsi_debug setup you use, and I can try and
> > reproduce it here as well.
> > 
> > --
> > Jens Axboe
> 
> This system has 6 online CPUs and 64 possible CPUs.
> 
> Printing avail and req_batch in that loop results in many of these:
> ** 3813 printk messages dropped ** [10643.503772] ctx 88042d8d4cc0 
> avail=0 req_batch=2
> 
> Adding CFLAGS_aio.o := -DDEBUG to the Makefile to enable
> those pr_debug prints results in nothing extra printing,
> so it's not hitting an error.
> 
> Printing nr_events and aio_max_nr at the top of ioctx_alloc results in
> these as fio starts:
> 
> [  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536

Something is horribly wrong here.  There is no way that value for nr_events 
should be passed in to ioctx_alloc().  This implies that userland is calling 
io_setup() with an impossibly large value for nr_events.  Can you post the 
actual diff for your fs/aio.c relative to linus' tree?

-ben


> [  186.339070] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.339074] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.339076] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.339076] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.359772] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.359971] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.359972] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.359985] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.359986] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.359987] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.359995] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.359995] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.359998] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.359998] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.362529] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.362529] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.363510] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.363513] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.363520] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.363521] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.398113] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.398115] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.398121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.398122] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.398124] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.398124] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.398130] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.398131] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.398164] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.398165] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.398499] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.400489] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.401478] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.401491] sd 5:

Re: scsi-mq V2

2014-07-10 Thread Benjamin LaHaise
On Thu, Jul 10, 2014 at 03:48:10PM +0200, Jens Axboe wrote:
> On 2014-07-10 15:44, Benjamin LaHaise wrote:
> >On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:
> >>That's how fio always runs, it sets up the context with the exact queue
> >>depth that it needs. Do we have a good enough understanding of other aio
> >>use cases to say that this isn't the norm? I would expect it to be, it's
> >>the way that the API would most obviously be used.
> >
> >The problem with this approach is that it works very poorly with per cpu
> >reference counting's batching of references, which is pretty much a
> >requirement now that many core systems are the norm.  Allocating the bare
> >minimum is not the right thing to do today.  That said, the default limits
> >on the number of requests probably needs to be raised.
> 
> Sorry, that's a complete cop-out. Then you handle this internally, 
> allocate a bigger pool and cap the limit if you need to. Look at the 
> API. You pass in the number of requests you will use. Do you expect 
> anyone to double up, just in case? Will never happen.
> 
> But all of this is side stepping the point that there's a real bug 
> reported here. The above could potentially explain the "it's using X 
> more CPU, or it's Y slower". The above is a softlock, it never completes.

I'm not trying to cop out on this -- I'm asking for a data point to see 
if changing the request limits has any effect.

-ben

> -- 
> Jens Axboe

-- 
"Thought is the essence of where you are now."

