Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

2013-05-30 Thread KOSAKI Motohiro
 I'm unhappy that you guys use such offensive words. Please cool down, all of
 you. :-/ In fact, _BOTH_ the behavior before and the behavior after
 Christoph's patch lack clean semantics.

 Erm, this feature _regressed_ after the patch. All other concerns are
 secondary. What's so difficult to understand about that?

Because it is not a new commit at all. Christoph's patch was introduced about
10 releases ago.

$ git describe bc3e53f682
v3.1-7235-gbc3e53f

If we just revert it, we may get another, opposite regression report; I'm
worried about that. Moreover, I don't think discussing a better fix is too
difficult for us.


[RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

2013-05-29 Thread KOSAKI Motohiro

Hi

I'm unhappy that you guys use such offensive words. Please cool down, all of
you. :-/
In fact, _BOTH_ the behavior before and the behavior after Christoph's patch
lack clean semantics.
And PeterZ proposed making a new, cleaner one rather than reverting. No need
to hassle.

I'm 100% sure the -rt people need a stronger mlock API. Please join the
discussion so we can make a better API.
My humble opinion: we should add a new mlock3(addr, len, flags) syscall (*)
and support the -rt requirement directly, and the current strange IB
RLIMIT_MEMLOCK usage should gradually migrate to it.
(*) Or enhancing mbind() is an option, because I expect apps will need to pin
pages near specific NUMA nodes in many cases.
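
For illustration only, such an interface might look roughly like the sketch
below. Nothing here exists today; the syscall, the MLOCK_PIN flag and the
buffer names are all hypothetical.

	/* hypothetical syscall: like mlock(), plus behaviour flags */
	#define MLOCK_PIN	0x01	/* hypothetical: pin pages, guarantee no minor faults */

	long mlock3(void *addr, size_t len, int flags);

	/* an -rt or RDMA application would then request pinned memory explicitly */
	if (mlock3(buf, buf_len, MLOCK_PIN) < 0)
		perror("mlock3");

The point is that the pinning guarantee would be requested through a dedicated
flag instead of being implied by driver-internal get_user_pages() accounting.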

As you know, the current IB pinning implementation doesn't guarantee the
absence of minor faults when fork() is used. That's OK for IB, since it uses
madvise(MADV_DONTFORK) too. But I'm not sure *all* -rt applications would be
satisfied with this. We might need to implement copy-on-fork, or might not.
I'd like to hear other people's opinions.
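
For reference, the fork workaround in question is just the standard madvise()
call, roughly like this (the buffer names are made up):

	/* keep the registered buffer out of the child's address space */
	if (madvise(ib_buf, ib_buf_len, MADV_DONTFORK) != 0)
		perror("madvise");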

Also, all developers should know that this pinning breaks when memory hot-plug
happens, just as CPU binding via sched_setaffinity() may break on CPU
hot-remove.



Re: [RFC][PATCH] mm: Fix RLIMIT_MEMLOCK

2013-05-25 Thread KOSAKI Motohiro
On Fri, May 24, 2013 at 11:40 AM, Christoph Lameter c...@linux.com wrote:
 On Fri, 24 May 2013, Peter Zijlstra wrote:

 Patch bc3e53f682 (mm: distinguish between mlocked and pinned pages)
 broke RLIMIT_MEMLOCK.

 Nope the patch fixed a problem with double accounting.

 The problem that we seem to have is to define what mlocked and pinned mean
 and how this relates to RLIMIT_MEMLOCK.

 mlocked pages are pages that are movable (not pinned!!!) and that are
 marked in some way by user space actions as mlocked (POSIX semantics).
 They are marked with a special page flag (PG_mlocked).

 Pinned pages are pages that have an elevated refcount because the hardware
 needs to use these pages for I/O. The elevated refcount may be temporary
 (then we don't care about this) or for a longer time (such as the memory
 registration of the IB subsystem). That is when we account the memory as
 pinned. The elevated refcount stops page migration and other things from
 trying to move that memory.

 Pages can be both pinned and mlocked. Before my patch those two issues were
 conflated, since the same counter was used, and therefore these pages were
 counted twice. If an RDMA application was running using mlockall() and was
 performing large scale I/O then the counters could show extraordinarily large
 numbers and the VM would start to behave erratically.

 It is important for the VM to know which pages cannot be evicted but that
 involves many more pages due to dirty pages etc etc.

 So far the assumption has been that RLIMIT_MEMLOCK is a limit on the pages
 that userspace has mlocked.

 You want the counter to mean something different it seems. What is it?

 I think we need to be first clear on what we want to accomplish and what
 these counters actually should count before changing things.

Hm.
If pinned and mlocked are intentionally two totally different things, why does
IB use RLIMIT_MEMLOCK? Why doesn't IB use an IB-specific limit, and why does
only IB raise the number of pinned pages while other gup users don't?
I can't guess the IB folks' intent.

And now the IB code has duplicated RLIMIT_MEMLOCK checks, and at least
__ipath_get_user_pages() forgets to check capable(CAP_IPC_LOCK).
That's bad.
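
For reference, the usual check looks roughly like the ib_umem_get() pattern
below (a simplified sketch, error paths omitted); a per-driver copy of this
that drops the capable() test is what I mean by an incomplete duplicate:

	locked     = npages + current->mm->pinned_vm;
	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK))
		return -ENOMEM;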


 Certainly would appreciate improvements in this area but resurrecting the
 conflation between mlocked and pinned pages is not the way to go.

 This patch proposes to properly fix the problem by introducing
 VM_PINNED. This also provides the groundwork for a possible mpin()
 syscall or MADV_PIN -- although these are not included.

 Maybe add a new PIN page flag? Pages are not pinned per vma as the patch
 seems to assume.

Generally, you are right. But if VM_PINNED is really only for IB, this is an
acceptable limitation. They can split the VMA for their own purposes.

Anyway, I agree we should clearly understand the semantics of IB pinning and
the userland usage and assumptions.


Re: Infiniband use of get_user_pages()

2013-04-26 Thread KOSAKI Motohiro
   when checking users of get_user_pages() (I'm doing some cleanups in that
 area to fix filesystems' issues with mmap_sem locking) I've noticed that
 infiniband drivers add the number of pages obtained from get_user_pages() to
 the mm->pinned_vm counter. Although this makes some sense, it doesn't match
 any other user of get_user_pages() (e.g. direct IO), so does infiniband have
 some special reason why it does so?

I'm also puzzled, because the mm->pinned_vm counter is only used from /proc.
Who uses this accounting, and how?


   Also that seems to be the only real reason why mmap_sem has to be grabbed
 in exclusive mode, am I right?

I think so. get_user_pages() itself doesn't need the write lock.
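
In other words, for the lookup itself something like the sketch below should
be enough (assuming the 3.x-era get_user_pages() signature quoted in the
ib_umem_get() call below, and assuming the pinned_vm update is protected
separately):

	down_read(&current->mm->mmap_sem);
	ret = get_user_pages(current, current->mm, start, npages,
			     1 /* write */, 0 /* force */,
			     page_list, NULL);
	up_read(&current->mm->mmap_sem);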


   Another suspicious thing (at least in drivers/infiniband/core/umem.c:
 ib_umem_get()) is that the arguments of get_user_pages() are like:
 ret = get_user_pages(current, current->mm, cur_base,
                      min_t(unsigned long, npages,
                            PAGE_SIZE / sizeof (struct page *)),
                      1, !umem->writable, page_list, vma_list);
 So we always have the write argument set to 1 and the force argument set to
 !umem->writable. Is that really intentional? My naive guess would be that the
 arguments should be switched... Although even in that case I fail to see
 why the 'force' argument should be set. Can someone please explain?

If I understand correctly, IB and direct I/O have different setup sequences.

Direct I/O
 1. write to buf
 2. write(buf), i.e. get_user_pages_fast(write=0)

Infiniband
 1. reg_mr, i.e. get_user_pages(write=1)
 2. write to buf

I mean, if direct I/O is passed the zero page, it is a user mistake; but a
user process which uses infiniband sets up the MR before writing to the I/O
buffer, IIUC.
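
As a rough userspace-side illustration of that ordering (standard libibverbs
calls; the protection domain, buffer and size are placeholders):

	buf = malloc(BUF_SIZE);			/* pages may still be untouched */
	mr = ibv_reg_mr(pd, buf, BUF_SIZE,
			IBV_ACCESS_LOCAL_WRITE);	/* pages get pinned here, write=1 */
	memset(buf, 0, BUF_SIZE);		/* application writes only afterwards */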

In this case, I think a user process which uses infiniband needs to deregister
the MR before fork(). But that is another story.


[PATCH] infiniband, ipath: convert old cpumask api into new one

2011-05-18 Thread KOSAKI Motohiro
Adapt to the new API; we plan to remove the old one later. Most of the changes
are trivial, but there is one real fix: the following code is unsafe.

	int ncpus = num_online_cpus();
	for (i = 0; i < ncpus; i++) {
		...
	}

because 1) there is no guarantee that the highest online CPU number is below
num_online_cpus(), since some architectures assign sparse CPU numbers, and
2) CPU hotplug may change cpu_online_mask at the same time, so we need to pin
it with get_online_cpus().

Cc: Roland Dreier rol...@kernel.org
Cc: Sean Hefty sean.he...@intel.com
Cc: Hal Rosenstock hal.rosenst...@gmail.com
Cc: linux-rdma@vger.kernel.org
Cc: Ralph Campbell infinip...@qlogic.com
Signed-off-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com
---
 drivers/infiniband/hw/ipath/ipath_file_ops.c |   11 +++
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c
index bdf4422..aab5d89 100644
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
@@ -40,6 +40,7 @@
 #include <linux/highmem.h>
 #include <linux/io.h>
 #include <linux/jiffies.h>
+#include <linux/cpu.h>
 #include <asm/pgtable.h>
 
 #include "ipath_kernel.h"
@@ -1684,17 +1685,19 @@ static int find_best_unit(struct file *fp,
 	 * information.  There may be some issues with dual core numbering
 	 * as well.  This needs more work prior to release.
 	 */
-	if (!cpumask_empty(&current->cpus_allowed) &&
-	    !cpumask_full(&current->cpus_allowed)) {
+	if (!cpumask_empty(tsk_cpus_allowed(current)) &&
+	    !cpumask_full(tsk_cpus_allowed(current))) {
 		int ncpus = num_online_cpus(), curcpu = -1, nset = 0;
-		for (i = 0; i < ncpus; i++)
-			if (cpumask_test_cpu(i, &current->cpus_allowed)) {
+		get_online_cpus();
+		for_each_online_cpu(i)
+			if (cpumask_test_cpu(i, tsk_cpus_allowed(current))) {
 				ipath_cdbg(PROC, "%s[%u] affinity set for "
 					   "cpu %d/%d\n", current->comm,
 					   current->pid, i, ncpus);
 				curcpu = i;
 				nset++;
 			}
+		put_online_cpus();
 		if (curcpu != -1 && nset != ncpus) {
 			if (npresent) {
 				prefunit = curcpu / (ncpus / npresent);
-- 
1.7.3.1




[PATCH] infiniband, qib: convert old cpumask api into new one.

2011-05-18 Thread KOSAKI Motohiro
Adapt to the new APIs. We plan to remove the old ones later and also plan to
change the current->cpus_allowed implementation.

No functional change.

Signed-off-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com
Cc: Roland Dreier rol...@kernel.org
Cc: Sean Hefty sean.he...@intel.com
Cc: Hal Rosenstock hal.rosenst...@gmail.com
Cc: linux-rdma@vger.kernel.org
Cc: Mike Marciniszyn mike.marcinis...@qlogic.com
Cc: Ralph Campbell ralph.campb...@qlogic.com
---
 drivers/infiniband/hw/qib/qib_file_ops.c |   11 ++-
 1 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_file_ops.c b/drivers/infiniband/hw/qib/qib_file_ops.c
index de7d17d..dd9e1a2 100644
--- a/drivers/infiniband/hw/qib/qib_file_ops.c
+++ b/drivers/infiniband/hw/qib/qib_file_ops.c
@@ -1527,6 +1527,7 @@ done_chk_sdma:
 		struct qib_filedata *fd = fp->private_data;
 		const struct qib_ctxtdata *rcd = fd->rcd;
 		const struct qib_devdata *dd = rcd->dd;
+		unsigned int weight;
 
 		if (dd->flags & QIB_HAS_SEND_DMA) {
 			fd->pq = qib_user_sdma_queue_create(&dd->pcidev->dev,
@@ -1545,8 +1546,8 @@ done_chk_sdma:
 		 * it just means that sooner or later we don't recommend
 		 * a cpu, and let the scheduler do it's best.
 		 */
-		if (!ret && cpus_weight(current->cpus_allowed) <=
-		    qib_cpulist_count) {
+		weight = cpumask_weight(tsk_cpus_allowed(current));
+		if (!ret && weight <= qib_cpulist_count) {
 			int cpu;
 			cpu = find_first_zero_bit(qib_cpulist,
 						  qib_cpulist_count);
@@ -1554,13 +1555,13 @@ done_chk_sdma:
 				__set_bit(cpu, qib_cpulist);
 				fd->rec_cpu_num = cpu;
 			}
-		} else if (cpus_weight(current->cpus_allowed) == 1 &&
-			   test_bit(first_cpu(current->cpus_allowed),
+		} else if (weight == 1 &&
+			   test_bit(cpumask_first(tsk_cpus_allowed(current)),
 				    qib_cpulist))
 			qib_devinfo(dd->pcidev, "%s PID %u affinity "
 				    "set to cpu %d; already allocated\n",
 				    current->comm, current->pid,
-				    first_cpu(current->cpus_allowed));
+				    cpumask_first(tsk_cpus_allowed(current)));
 		}
 
 		mutex_unlock(&qib_mutex);
-- 
1.7.3.1




Re: [PATCH 1/9] mm: add generic adaptive large memory allocation APIs

2010-05-16 Thread KOSAKI Motohiro
 Peter Zijlstra wrote:
  NAK, I really utterly dislike that inatomic argument. The alloc side
  doesn't function in atomic context either. Please keep the thing
  symmetric in that regards.
 
 Excuse me. kmalloc(GFP_KERNEL) may sleep (and therefore cannot be used in
 atomic context). However, kfree() for memory allocated with
 kmalloc(GFP_KERNEL) never sleeps (and therefore can be used in atomic
 context).
 Why are kmalloc() and kfree() NOT kept symmetric?

In the kmalloc case, we need to consider both the kmalloc(GFP_KERNEL)/kfree()
pair and the kmalloc(GFP_ATOMIC)/kfree() pair; the latter is mainly used in
atomic context. Making kfree() atomic helps keep the implementation simple.

But kvmalloc doesn't have a GFP_ATOMIC feature. That's the big difference.
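
For illustration, a minimal sketch of why the GFP_ATOMIC pair matters
(standard kmalloc()/kfree() only, nothing specific to the proposed kvmalloc
API):

	static DEFINE_SPINLOCK(lock);

	static void atomic_path(void)
	{
		void *p;

		spin_lock(&lock);		/* atomic context from here on */
		p = kmalloc(64, GFP_ATOMIC);	/* allowed: non-sleeping allocation */
		kfree(p);			/* kfree() never sleeps either */
		spin_unlock(&lock);
	}

A vmalloc-backed allocation has no GFP_ATOMIC equivalent because vmalloc() may
sleep, so an adaptive kmalloc-or-vmalloc allocator can only run in process
context; that is the asymmetry I mean above.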





Re: [ofa-general] Re: [GIT PULL] please pull ummunotify

2009-09-15 Thread KOSAKI Motohiro
 
   So.. What is the problem with fork? The semantics of what should
   happen seem natural enough to me, the PD doesn't get copied to the
   child, so the MR stays with the parent. COW events on the pinned
   region must be resolved so that the physical page stays with the
   process that has pinned it - the pin is logically released in the
   child because the MR doesn't exist because the PD doesn't exist.
 
 This is getting away from the problem that ummunotify is solving, but
 handling a COW fault generated by the parent by doing the copy in the
 child seems like a pretty major, tricky change to make.  The child may
 have forked 100 more times in the meantime, meaning we now have to
 change 101 memory maps ... the cost of page faults goes through the roof
 probably...

Ummm...
Perhaps my first question was wrong. I don't intend to NAK your patch.
I merely want to know the details of your patch...

OK, let me ask again in other words.

 - I guess you have your MPI implementation with ummunotify, right?
 - I guess you have tested several patterns, right?
   If so, can we see your test results?
 - I think you can explain your MPI implementation's advantages and
   disadvantages against current OpenMPI (or MPICH et al.).
 - I guess your patch dramatically improves the MPI implementation, but
   it's not free; it imposes some limitations on MPI applications, right?
 - I am thinking of multi-threading and fork. Are there other limitations?
 - In a past discussion, you said ummunotify users should not use
   multi-threading. Do you also think users should not fork?


