Re: [PATCH] mm: Revert pinned_vm braindamage

2013-06-13 Thread Andrew Morton
Let's try to get this wrapped up?

On Thu, 6 Jun 2013 14:43:51 +0200 Peter Zijlstra  wrote:

> 
> Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
> broke RLIMIT_MEMLOCK.

I rather like what bc3e53f682 did, actually.  RLIMIT_MEMLOCK limits the
amount of memory you can mlock().  Nice and simple.

This pinning thing which infiniband/perf are doing is conceptually
different and if we care at all, perhaps we should be looking at adding
RLIMIT_PINNED.

> Before that patch: mm_struct::locked_vm < RLIMIT_MEMLOCK; after that
> patch we have: mm_struct::locked_vm < RLIMIT_MEMLOCK &&
> mm_struct::pinned_vm < RLIMIT_MEMLOCK.

But this is a policy decision which was implemented in perf_mmap() and
perf can alter that decision.  How bad would it be if perf just ignored
RLIMIT_MEMLOCK?


drivers/infiniband/hw/qib/qib_user_pages.c has issues, btw.  It
compares the amount-to-be-pinned with rlimit(RLIMIT_MEMLOCK), but
forgets to also look at current->mm->pinned_vm.  Duh.

It also does the pinned accounting in __qib_get_user_pages() but in
__qib_release_user_pages(), the caller is supposed to do it, which is
rather awkward.


Longer-term I don't think that inifinband or perf should be dinking
around with rlimit(RLIMIT_MEMLOCK) or ->pinned_vm.  Those policy
decisions should be hoisted into a core mm helper where we can do it
uniformly (and more correctly than infiniband's attempt!).

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -next v3] IB/ipath: use GFP_NOWAIT under spin lock

2013-02-22 Thread Andrew Morton
On Fri, 22 Feb 2013 14:21:58 -0800
Tejun Heo  wrote:

> Hello, Andrew.
> 
> On Fri, Feb 22, 2013 at 2:16 PM, Andrew Morton
>  wrote:
> > Given that this code is performing allocations under
> > spin_lock_irqsave(), it should be using idr_preload()/idr_preload_end()
> > (and perhaps even a repeat loop around those) to make the allocations
> > more reliable.
> 
> It's already using preload so GFP_NOWAIT is enough to guarantee
> GFP_KERNEL level allocation.
> 

Not "guarantee".

- idr_preload() might fail.  It is a design error that idr_preload()
  returns void - it should do what radix_tree_preload() does.

- even if idr_preload() completed the preload, an interrupt might
  come in and its handler might drain this cpu's preload pool.

This all sounds like crazy will-never-happen stuff.  That's what I
thought about the radix-tree code when used for pagecache.  But under
extreme worloads, the cant-happens kept on happening, which is how all
that funky preload/preempt-disable stuff occurred.

It would be more robust still if the per-cpu preload magazines were
per-radixtree and per-idr, rather than kernel-wide.  That would reduce
the risk of interrupt-time stealing.  Or make foo_preload() return with
interrupts disabled rather than jsut preempt_disable().

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -next v3] IB/ipath: use GFP_NOWAIT under spin lock

2013-02-22 Thread Andrew Morton
On Fri, 22 Feb 2013 10:48:12 +0800
Wei Yongjun  wrote:

> From: Wei Yongjun 
> 
> A spin lock is taken here so we should use GFP_NOWAIT like
> other case.
> 
> ..
>
> --- a/drivers/infiniband/hw/ipath/ipath_driver.c
> +++ b/drivers/infiniband/hw/ipath/ipath_driver.c
> @@ -204,7 +204,7 @@ static struct ipath_devdata *ipath_alloc_devdata(struct 
> pci_dev *pdev)
>   idr_preload(GFP_KERNEL);
>   spin_lock_irqsave(&ipath_devs_lock, flags);
>  
> - ret = idr_alloc(&unit_table, dd, 0, 0, GFP_KERNEL);
> + ret = idr_alloc(&unit_table, dd, 0, 0, GFP_NOWAIT);
>   if (ret < 0) {
>   printk(KERN_ERR IPATH_DRV_NAME
>  ": Could not allocate unit ID: error %d\n", -ret);

I'd say we're entitled to use GFP_ATOMIC here: it will dip into page
reserves and is less likely to fail.

Given that this code is performing allocations under
spin_lock_irqsave(), it should be using idr_preload()/idr_preload_end()
(and perhaps even a repeat loop around those) to make the allocations
more reliable.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: linux-next: build failure after merge of the akpm tree

2012-09-24 Thread Andrew Morton
On Mon, 24 Sep 2012 12:36:43 -0700
Roland Dreier  wrote:

> On Mon, Sep 24, 2012 at 7:02 AM, Stephen Rothwell  
> wrote:
> > After merging the akpm tree, today's linux-next build (powerpc
> > ppc64_defconfig) failed like this:
> >
> > drivers/infiniband/hw/mlx4/cm.c: In function 'id_map_alloc':
> > drivers/infiniband/hw/mlx4/cm.c:228:36: error: 'MAX_ID_MASK' undeclared 
> > (first use in this function)
> >
> > Caused by commit d7a4e9b679e9 ("IB/mlx4: Add CM paravirtualization") from
> > the infiniband tree interacting with commit "idr: rename MAX_LEVEL to
> > MAX_IDR_LEVEL" from the akpm tree.
> >
> > I have added the following merge fix patch for today:
> >
> > From: Stephen Rothwell 
> > Date: Mon, 24 Sep 2012 23:57:53 +1000
> > Subject: [PATCH] IB/mlx4: fix for MAX_ID_MASK to MAX_IDR_MASK name change
> >
> > Signed-off-by: Stephen Rothwell 
> > ---
> >  drivers/infiniband/hw/mlx4/cm.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/infiniband/hw/mlx4/cm.c 
> > b/drivers/infiniband/hw/mlx4/cm.c
> > index e25e4da..80079e5 100644
> > --- a/drivers/infiniband/hw/mlx4/cm.c
> > +++ b/drivers/infiniband/hw/mlx4/cm.c
> > @@ -225,7 +225,7 @@ id_map_alloc(struct ib_device *ibdev, int slave_id, u32 
> > sl_cm_id)
> > ret = idr_get_new_above(&sriov->pv_id_table, ent,
> > next_id, &id);
> > if (!ret) {
> > -   next_id = ((unsigned) id + 1) & MAX_ID_MASK;
> > +   next_id = ((unsigned) id + 1) & MAX_IDR_MASK;
> > ent->pv_cm_id = (u32)id;
> > sl_id_map_add(ibdev, ent);
> > }
> 
> Andrew, any preference on how to handle this merge?

I'm fixing up that patch as things change under its feet.  I usually
merge things like this late in the merge window after everything else
has landed, to cause minimum breakage/disruption.

If there was something in linux-next which didn't get into the merge
window (bad) then I cheerily break it and the tree maintainer fixes
things up.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] mm: Distinguish between mlocked and pinned pages

2011-08-17 Thread Andrew Morton
On Wed, 10 Aug 2011 15:21:47 -0500 (CDT)
Christoph Lameter  wrote:

> Some kernel components pin user space memory (infiniband and perf)
> (by increasing the page count) and account that memory as "mlocked".
> 
> The difference between mlocking and pinning is:
> 
> A. mlocked pages are marked with PG_mlocked and are exempt from
>swapping. Page migration may move them around though.
>They are kept on a special LRU list.
> 
> B. Pinned pages cannot be moved because something needs to
>directly access physical memory. They may not be on any
>LRU list.
> 
> I recently saw an mlockalled process where mm->locked_vm became
> bigger than the virtual size of the process (!) because some
> memory was accounted for twice:
> 
> Once when the page was mlocked and once when the Infiniband
> layer increased the refcount because it needt to pin the RDMA
> memory.
> 
> This patch introduces a separate counter for pinned pages and
> accounts them seperately.
> 

Sounds reasonable.  But how do we prevent future confusion?  We should
carefully define these terms in an obvious place, please.

> --- linux-2.6.orig/include/linux/mm_types.h   2011-08-10 14:08:42.0 
> -0500
> +++ linux-2.6/include/linux/mm_types.h2011-08-10 14:09:02.0 
> -0500
> @@ -281,7 +281,7 @@ struct mm_struct {
>   unsigned long hiwater_rss;  /* High-watermark of RSS usage */
>   unsigned long hiwater_vm;   /* High-water virtual memory usage */
> 
> - unsigned long total_vm, locked_vm, shared_vm, exec_vm;
> + unsigned long total_vm, locked_vm, pinned_vm, shared_vm, exec_vm;
>   unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
>   unsigned long start_code, end_code, start_data, end_data;
>   unsigned long start_brk, brk, start_stack;

This is an obvious place.  Could I ask that you split all these up into
one-definition-per-line and we can start in on properly documenting
each field?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: idr_get_new_exact ?

2010-09-20 Thread Andrew Morton
On Mon, 20 Sep 2010 16:11:31 +0200
Ohad Ben-Cohen  wrote:

> Occasionally, drivers care about the value that idr associates with
> their pointers.
> 
> Today we have idr_get_new_above() which allocates a new idr entry
> above or equal to a given starting id, but sometimes drivers need to
> force an exact value.
> 
> To overcome this small API gap, drivers are wrapping idr_get_new_above
> and then either BUG_ON() or just call idr_remove() and returns -EBUSY
> when idr allocates them an id which is different than their requested
> value.
> 
> There are only a handful of users who need this (see below. especially
> note the i2c comment :), but it might be nice to have such an API (a
> bit less of code, and a bit less error prone).
> 
> Would something like the below be desirable/acceptable ?

It seems OK to me - it's an improvement over what we have now.

> (untested. and i just picked the simplest and straight-forward way to
> implement this; obviously it's not optimal since there's no reason to
> even allocate an id if we know it's not the id we're looking for. but
> it's enough to get the idea, it's not a hot path, and it's what
> drivers are doing today)

Sure, we can speed it up later if that appears to be necessary.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


i386 allmodconfig, current mainline

2010-08-09 Thread Andrew Morton

Guys.  What's goin' on out there?


drivers/power/olpc_battery.c:387: error: unknown field 'owner' specified in 
initializer
drivers/power/olpc_battery.c:387: warning: initialization from incompatible 
pointer type
make[2]: *** [drivers/power/olpc_battery.o] Error 1
make[1]: *** [drivers/power] Error 2


drivers/mtd/maps/gpio-addr-flash.c: In function 'gpio_flash_probe':
drivers/mtd/maps/gpio-addr-flash.c:212: warning: cast to pointer from integer 
of different size
drivers/mtd/maps/gpio-addr-flash.c:224: warning: cast to pointer from integer 
of different size


drivers/pci/intel-iommu.c: In function 'dma_pte_addr':
drivers/pci/intel-iommu.c:239: warning: passing argument 1 of '__cmpxchg64' 
from incompatible pointer type


drivers/net/wan/farsync.c: In function 'fst_intr_rx':
drivers/net/wan/farsync.c:1312: warning: cast to pointer from integer of 
different size
drivers/net/wan/farsync.c: In function 'do_bottom_half_tx':
drivers/net/wan/farsync.c:1407: warning: cast to pointer from integer of 
different size


fs/squashfs/xattr.c:37: warning: 'squashfs_xattr_handler' declared inline after 
being called
fs/squashfs/xattr.c:37: warning: previous declaration of 
'squashfs_xattr_handler' was here


drivers/net/wireless/ipw2x00/ipw2100.c: In function 'ipw2100_tx_send_commands':
drivers/net/wireless/ipw2x00/ipw2100.c:3063: warning: cast to pointer from 
integer of different size


In file included from drivers/platform/x86/intel_scu_ipc.c:26:
/usr/src/devel/arch/x86/include/asm/mrst.h:14: warning: 'struct 
sfi_table_header' declared inside parameter list
/usr/src/devel/arch/x86/include/asm/mrst.h:14: warning: its scope is only this 
definition or declaration, which is probably not what you want


drivers/infiniband/hw/nes/nes_verbs.c: In function 
'nes_alloc_fast_reg_page_list':
drivers/infiniband/hw/nes/nes_verbs.c:477: warning: cast to pointer from 
integer of different size
drivers/infiniband/hw/nes/nes_verbs.c: In function 'nes_post_send':
drivers/infiniband/hw/nes/nes_verbs.c:3486: warning: cast to pointer from 
integer of different size
drivers/infiniband/hw/nes/nes_verbs.c:3486: warning: cast to pointer from 
integer of different size


drivers/infiniband/hw/cxgb4/cq.c: In function 'destroy_cq':
drivers/infiniband/hw/cxgb4/cq.c:58: warning: cast from pointer to integer of 
different size
drivers/infiniband/hw/cxgb4/cq.c: In function 'create_cq':
drivers/infiniband/hw/cxgb4/cq.c:135: warning: cast from pointer to integer of 
different size


drivers/dma/timb_dma.c: In function 'td_fill_desc':
drivers/dma/timb_dma.c:203: warning: cast to pointer from integer of different 
size


drivers/net/wireless/iwmc3200wifi/rx.c: In function 'iwm_ntf_wifi_if_wrapper':
drivers/net/wireless/iwmc3200wifi/rx.c:1198: warning: comparison is always true 
due to limited range of data type
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ummunotify: Userspace support for MMU notifications

2010-04-12 Thread Andrew Morton
On Mon, 12 Apr 2010 07:22:17 +0100
Eric B Munson  wrote:

> Andrew,
> 
> I am resubmitting this patch because I believe that the discussion
> has shown this to be an acceptable solution.

To whom?  Some acked-by's would clarify.

>  I have fixed the 32 bit
> build errors, but other than that change, the code is the same as
> Roland's V3 patch.
> 
> From: Roland Dreier 
> 
> As discussed in 
> and follow-up messages, libraries using RDMA would like to track
> precisely when application code changes memory mapping via free(),
> munmap(), etc.  Current pure-userspace solutions using malloc hooks
> and other tricks are not robust, and the feeling among experts is that
> the issue is unfixable without kernel help.

But this info could be reassembled by tracking syscall activity, yes? 
Perhaps some discussion here explaining why the (possibly enhanced)
ptrace, audit, etc interfaces are unsuitable.

> We solve this not by implementing the full API proposed in the email
> linked above but rather with a simpler and more generic interface,
> which may be useful in other contexts.  Specifically, we implement a
> new character device driver, ummunotify, that creates a /dev/ummunotify
> node.  A userspace process can open this node read-only and use the fd
> as follows:
> 
>  1. ioctl() to register/unregister an address range to watch in the
> kernel (cf struct ummunotify_register_ioctl in ).
> 
>  2. read() to retrieve events generated when a mapping in a watched
> address range is invalidated (cf struct ummunotify_event in
> ).  select()/poll()/epoll() and SIGIO are
> handled for this IO.
> 
>  3. mmap() one page at offset 0 to map a kernel page that contains a
> generation counter that is incremented each time an event is
> generated.  This allows userspace to have a fast path that checks
> that no events have occurred without a system call.

OK, what's missing from this whole description and from ummunotify.txt
is: how does one specify the target process?  Does /dev/ummunotify
implicitly attach to current->mm?  If so, why, and what are the
implications of this?

If instead it is possible to attach to some other process's mmu
activity (/proc//ummunotity?) then how is that done and what are
the security/permissions implications?

Also, the whole thing is obviously racy: by the time userspace finds
out that something has happened, it might have changed.  This
inevitably reduces the applicability/usefulness of the whole thing as
compared to some synchronous mechanism which halts the monitored thread
until the request has been processed and acked.  All this should (IMO)
be explored, explained and justified.

Also, what prevents the obvious DoS which occurs when I register for
events and just let them queue up until the kernel runs out of memory? 
presumably events get dropped - what are the reliability implications
of this and how is the max queue length managed?

Also, ioctls are unpopular.  Were other intefaces considered?

> Thanks to Jason Gunthorpe  obsidianresearch.com> for
> suggestions on the interface design.  Also thanks to Jeff Squyres
>  cisco.com> for prototyping support for this in Open MPI, which
> helped find several bugs during development.
> 
> Signed-off-by: Roland Dreier 
> Signed-off-by: Eric B Munson 
> 
> ---
> 
> Changes since v3:
>  - Fixed replaced [get|put] user with copy_[from|to]_user to fix x86
>builds
> ---
>  Documentation/Makefile  |3 +-
>  Documentation/ummunotify/Makefile   |7 +
>  Documentation/ummunotify/ummunotify.txt |  150 
>  Documentation/ummunotify/umn-test.c |  200 +++
>  drivers/char/Kconfig|   12 +
>  drivers/char/Makefile   |1 +
>  drivers/char/ummunotify.c   |  567 
> +++
>  include/linux/ummunotify.h  |  121 +++
>  8 files changed, 1060 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/ummunotify/Makefile
>  create mode 100644 Documentation/ummunotify/ummunotify.txt
>  create mode 100644 Documentation/ummunotify/umn-test.c
>  create mode 100644 drivers/char/ummunotify.c
>  create mode 100644 include/linux/ummunotify.h
> 
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index 6fc7ea1..27ba76a 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -1,3 +1,4 @@
>  obj-m := DocBook/ accounting/ auxdisplay/ connector/ \
>   filesystems/ filesystems/configfs/ ia64/ laptops/ networking/ \
> - pcmcia/ spi/ timers/ video4linux/ vm/ watchdog/src/
> + pcmcia/ spi/ timers/ video4linux/ vm/ ummunotify/ \
> + watchdog/src/
> diff --git a/Documentation/ummunotify/Makefile 
> b/Documentation/ummunotify/Makefile
> new file mode 100644
> index 000..89f31a0
> --- /dev/null
> +++ b/Documentation/ummunotify/Makefile
> @@ -0,0 +1,7 @@
> +# List of programs to build
> +hostprogs-y := umn-test
>