Re: [PATCH 04/36] cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE

2022-07-31 Thread Michel Lespinasse
On Sat, Jul 30, 2022 at 09:52:34PM +0200, Rafael J. Wysocki wrote:
> On Sat, Jul 30, 2022 at 11:48 AM Michel Lespinasse
>  wrote:
> > I'm not sure if that was the patch you meant to send though, as it
> > seems it's only adding a tracepoint so shouldn't make any difference
> > if I'm not actually using the tracepoint ?
> 
> You are right, it looks like I pasted a link to a different patch by
> mistake.  Sorry about that.
> 
> I meant this one:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=pm&id=d295ad34f236c3518634fb6403d4c0160456e470
> 
> which will appear in the final 5.19.

Thanks. I can confirm that this patch fixes the boot time debug
warnings for me. And I see that Linus already merged it, nice!

--
Michel "walken" Lespinasse.


Re: [PATCH 04/36] cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE

2022-07-30 Thread Michel Lespinasse
On Fri, Jul 29, 2022 at 04:59:50PM +0200, Rafael J. Wysocki wrote:
> On Fri, Jul 29, 2022 at 12:25 PM Michel Lespinasse
>  wrote:
> >
> > On Thu, Jul 28, 2022 at 10:20:53AM -0700, Paul E. McKenney wrote:
> > > On Mon, Jul 25, 2022 at 12:43:06PM -0700, Michel Lespinasse wrote:
> > > > On Wed, Jun 08, 2022 at 04:27:27PM +0200, Peter Zijlstra wrote:
> > > > > Commit c227233ad64c ("intel_idle: enable interrupts before C1 on
> > > > > Xeons") wrecked intel_idle in two ways:
> > > > >
> > > > >  - must not have tracing in idle functions
> > > > >  - must return with IRQs disabled
> > > > >
> > > > > Additionally, it added a branch for no good reason.
> > > > >
> > > > > Fixes: c227233ad64c ("intel_idle: enable interrupts before C1 on 
> > > > > Xeons")
> > > > > Signed-off-by: Peter Zijlstra (Intel) 
> > > >
> > > > After this change was introduced, I am seeing "WARNING: suspicious RCU
> > > > usage" when booting a kernel with debug options compiled in. Please
> > > > see the attached dmesg output. The issue starts with commit 32d4fd5751ea
> > > > and is still present in v5.19-rc8.
> > > >
> > > > I'm not sure, is this too late to fix or revert in v5.19 final ?
> > >
> > > I finally got a chance to take a quick look at this.
> > >
> > > The rcu_eqs_exit() function is making a lockdep complaint about
> > > being invoked with interrupts enabled.  This function is called from
> > > rcu_idle_exit(), which is an expected code path from cpuidle_enter_state()
> > > via its call to rcu_idle_exit().  Except that rcu_idle_exit() disables
> > > interrupts before invoking rcu_eqs_exit().
> > >
> > > The only other call to rcu_idle_exit() does not disable interrupts,
> > > but it is via rcu_user_exit(), which would be a very odd choice for
> > > cpuidle_enter_state().
> > >
> > > It seems unlikely, but it might be that it is the use of local_irq_save()
> > > instead of raw_local_irq_save() within rcu_idle_exit() that is causing
> > > the trouble.  If this is the case, then the commit shown below would
> > > help.  Note that this commit removes the warning from lockdep, so it
> > > is necessary to build the kernel with CONFIG_RCU_EQS_DEBUG=y to enable
> > > equivalent debugging.
> > >
> > > Could you please try your test with the -rcu commit shown below applied?
> >
> > Thanks for looking into it.
> >
> > After checking out Peter's commit 32d4fd5751ea,
> > cherry picking your commit ed4ae5eff4b3,
> > and setting CONFIG_RCU_EQS_DEBUG=y in addition to my usual debug config,
> > I am now seeing this a few seconds into the boot:
> >
> > [3.010650] [ cut here ]
> > [3.010651] WARNING: CPU: 0 PID: 0 at kernel/sched/clock.c:397 
> > sched_clock_tick+0x27/0x60
> > [3.010657] Modules linked in:
> > [3.010660] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
> > 5.19.0-rc1-test-5-g1be22fea0611 #1
> > [3.010662] Hardware name: LENOVO 30BFS44D00/1036, BIOS S03KT51A 
> > 01/17/2022
> > [3.010663] RIP: 0010:sched_clock_tick+0x27/0x60
> > [3.010665] Code: 1f 40 00 53 eb 02 5b c3 66 90 8b 05 2f c3 40 01 85 c0 
> > 74 18 65 8b 05 60 88 8f 4e 85 c0 75 0d 65 8b 05 a9 85 8f 4e 85 c0 74 02 
> > <0f> 0b e8 e2 6c 89 00 48 c7 c3 40 d5 02 00
> >  89 c0 48 03 1c c5 c0 98
> > [3.010667] RSP: :b2803e28 EFLAGS: 00010002
> > [3.010670] RAX: 0001 RBX: c8ce7fa07060 RCX: 
> > 0001
> > [3.010671] RDX:  RSI: b268dd21 RDI: 
> > b269ab13
> > [3.010673] RBP: 0001 R08: ffc300d5 R09: 
> > 0002be80
> > [3.010674] R10: 03625b53183a R11: a012b802b7a4 R12: 
> > b2aa9e80
> > [3.010675] R13: b2aa9e00 R14: 0001 R15: 
> > 
> > [3.010677] FS:  () GS:a012b800() 
> > knlGS:
> > [3.010678] CS:  0010 DS:  ES:  CR0: 80050033
> > [3.010680] CR2: a012f81ff000 CR3: 000c99612001 CR4: 
> > 003706f0
> > [3.010681] DR0:  DR1:  DR2: 
> > 
> > [3.010682] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [3.010683] Call Trace:

Re: [PATCH 04/36] cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE

2022-07-30 Thread Michel Lespinasse
On Fri, Jul 29, 2022 at 08:26:22AM -0700, Paul E. McKenney wrote:
> Would you be willing to try another shot in the dark, but untested
> this time?  I freely admit that this is getting strange.
> 
>   Thanx, Paul

Yes, adding this second change got rid of the boot time warning for me.

> 
> 
> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> index e374c0c923dae..279f557bf60bb 100644
> --- a/kernel/sched/clock.c
> +++ b/kernel/sched/clock.c
> @@ -394,7 +394,7 @@ notrace void sched_clock_tick(void)
>   if (!static_branch_likely(&sched_clock_running))
>   return;
>  
> - lockdep_assert_irqs_disabled();
> + WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !raw_irqs_disabled());
>  
>   scd = this_scd();
>   __scd_stamp(scd);
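
(For context: lockdep_assert_irqs_disabled() consults lockdep's per-cpu
hardirq tracking, which the raw_local_irq_save()/raw_local_irq_restore()
primitives deliberately do not update, while raw_irqs_disabled() inspects
the actual flags register. A rough illustration of the difference - not
the actual RCU or sched code, and assuming lockdep's tracking currently
says hardirqs are enabled:)

    unsigned long flags;

    raw_local_irq_save(flags);           /* hardirqs really are off now,
                                          * but lockdep was not told */

    lockdep_assert_irqs_disabled();      /* checks lockdep's tracking:
                                          * would warn here */

    WARN_ON_ONCE(!raw_irqs_disabled());  /* checks the real flags:
                                          * stays quiet */

    raw_local_irq_restore(flags);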


Re: [PATCH 04/36] cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE

2022-07-29 Thread Michel Lespinasse
On Thu, Jul 28, 2022 at 10:20:53AM -0700, Paul E. McKenney wrote:
> On Mon, Jul 25, 2022 at 12:43:06PM -0700, Michel Lespinasse wrote:
> > On Wed, Jun 08, 2022 at 04:27:27PM +0200, Peter Zijlstra wrote:
> > > Commit c227233ad64c ("intel_idle: enable interrupts before C1 on
> > > Xeons") wrecked intel_idle in two ways:
> > > 
> > >  - must not have tracing in idle functions
> > >  - must return with IRQs disabled
> > > 
> > > Additionally, it added a branch for no good reason.
> > > 
> > > Fixes: c227233ad64c ("intel_idle: enable interrupts before C1 on Xeons")
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > 
> > After this change was introduced, I am seeing "WARNING: suspicious RCU
> > usage" when booting a kernel with debug options compiled in. Please
> > see the attached dmesg output. The issue starts with commit 32d4fd5751ea
> > and is still present in v5.19-rc8.
> > 
> > I'm not sure, is this too late to fix or revert in v5.19 final ?
> 
> I finally got a chance to take a quick look at this.
> 
> The rcu_eqs_exit() function is making a lockdep complaint about
> being invoked with interrupts enabled.  This function is called from
> rcu_idle_exit(), which is an expected code path from cpuidle_enter_state()
> via its call to rcu_idle_exit().  Except that rcu_idle_exit() disables
> interrupts before invoking rcu_eqs_exit().
> 
> The only other call to rcu_idle_exit() does not disable interrupts,
> but it is via rcu_user_exit(), which would be a very odd choice for
> cpuidle_enter_state().
> 
> It seems unlikely, but it might be that it is the use of local_irq_save()
> instead of raw_local_irq_save() within rcu_idle_exit() that is causing
> the trouble.  If this is the case, then the commit shown below would
> help.  Note that this commit removes the warning from lockdep, so it
> is necessary to build the kernel with CONFIG_RCU_EQS_DEBUG=y to enable
> equivalent debugging.
> 
> Could you please try your test with the -rcu commit shown below applied?

Thanks for looking into it.

After checking out Peter's commit 32d4fd5751ea,
cherry picking your commit ed4ae5eff4b3,
and setting CONFIG_RCU_EQS_DEBUG=y in addition to my usual debug config,
I am now seeing this a few seconds into the boot:

[3.010650] [ cut here ]
[3.010651] WARNING: CPU: 0 PID: 0 at kernel/sched/clock.c:397 
sched_clock_tick+0x27/0x60
[3.010657] Modules linked in:
[3.010660] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
5.19.0-rc1-test-5-g1be22fea0611 #1
[3.010662] Hardware name: LENOVO 30BFS44D00/1036, BIOS S03KT51A 01/17/2022
[3.010663] RIP: 0010:sched_clock_tick+0x27/0x60
[3.010665] Code: 1f 40 00 53 eb 02 5b c3 66 90 8b 05 2f c3 40 01 85 c0 74 
18 65 8b 05 60 88 8f 4e 85 c0 75 0d 65 8b 05 a9 85 8f 4e 85 c0 74 02 <0f> 0b e8 
e2 6c 89 00 48 c7 c3 40 d5 02 00
 89 c0 48 03 1c c5 c0 98
[3.010667] RSP: :b2803e28 EFLAGS: 00010002
[3.010670] RAX: 0001 RBX: c8ce7fa07060 RCX: 0001
[3.010671] RDX:  RSI: b268dd21 RDI: b269ab13
[3.010673] RBP: 0001 R08: ffc300d5 R09: 0002be80
[3.010674] R10: 03625b53183a R11: a012b802b7a4 R12: b2aa9e80
[3.010675] R13: b2aa9e00 R14: 0001 R15: 
[3.010677] FS:  () GS:a012b800() 
knlGS:
[3.010678] CS:  0010 DS:  ES:  CR0: 80050033
[3.010680] CR2: a012f81ff000 CR3: 000c99612001 CR4: 003706f0
[3.010681] DR0:  DR1:  DR2: 
[3.010682] DR3:  DR6: fffe0ff0 DR7: 0400
[3.010683] Call Trace:
[3.010685]  
[3.010688]  cpuidle_enter_state+0xb7/0x4b0
[3.010694]  cpuidle_enter+0x29/0x40
[3.010697]  do_idle+0x1d4/0x210
[3.010702]  cpu_startup_entry+0x19/0x20
[3.010704]  rest_init+0x117/0x1a0
[3.010708]  arch_call_rest_init+0xa/0x10
[3.010711]  start_kernel+0x6d8/0x6ff
[3.010716]  secondary_startup_64_no_verify+0xce/0xdb
[3.010728]  
[3.010729] irq event stamp: 44179
[3.010730] hardirqs last  enabled at (44179): [] 
asm_sysvec_apic_timer_interrupt+0x1b/0x20
[3.010734] hardirqs last disabled at (44177): [] 
__do_softirq+0x3f0/0x498
[3.010736] softirqs last  enabled at (44178): [] 
__do_softirq+0x332/0x498
[3.010738] softirqs last disabled at (44171): [] 
irq_exit_rcu+0xab/0xf0
[3.010741] ---[ end trace  ]---


Re: [PATCH 04/36] cpuidle,intel_idle: Fix CPUIDLE_FLAG_IRQ_ENABLE

2022-07-25 Thread Michel Lespinasse
On Wed, Jun 08, 2022 at 04:27:27PM +0200, Peter Zijlstra wrote:
> Commit c227233ad64c ("intel_idle: enable interrupts before C1 on
> Xeons") wrecked intel_idle in two ways:
> 
>  - must not have tracing in idle functions
>  - must return with IRQs disabled
> 
> Additionally, it added a branch for no good reason.
> 
> Fixes: c227233ad64c ("intel_idle: enable interrupts before C1 on Xeons")
> Signed-off-by: Peter Zijlstra (Intel) 

After this change was introduced, I am seeing "WARNING: suspicious RCU
usage" when booting a kernel with debug options compiled in. Please
see the attached dmesg output. The issue starts with commit 32d4fd5751ea
and is still present in v5.19-rc8.

I'm not sure, is this too late to fix or revert in v5.19 final ?

Thanks,

--
Michel "walken" Lespinasse
[0.00] microcode: microcode updated early to revision 0x2006d05, date = 
2021-11-13
[0.00] Linux version 5.19.0-rc8-test-3-ge3a8d97e6a35 (walken@zeus) 
(gcc (Debian 8.3.0-6) 8.3.0, GNU ld (GNU Binutils for Debian) 2.31.1) #1 SMP 
PREEMPT_DYNAMIC Mon Jul 25 00:32:16 PDT 2022
[0.00] Command line: 
BOOT_IMAGE=/vmlinuz-5.19.0-rc8-test-3-ge3a8d97e6a35 
root=/dev/mapper/budai--vg-root ro consoleblank=600 quiet
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point 
registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[0.00] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[0.00] x86/fpu: Supporting XSAVE feature 0x020: 'AVX-512 opmask'
[0.00] x86/fpu: Supporting XSAVE feature 0x040: 'AVX-512 Hi256'
[0.00] x86/fpu: Supporting XSAVE feature 0x080: 'AVX-512 ZMM_Hi256'
[0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[0.00] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
[0.00] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
[0.00] x86/fpu: xstate_offset[5]:  960, xstate_sizes[5]:   64
[0.00] x86/fpu: xstate_offset[6]: 1024, xstate_sizes[6]:  512
[0.00] x86/fpu: xstate_offset[7]: 1536, xstate_sizes[7]: 1024
[0.00] x86/fpu: Enabled xstate features 0xff, context size is 2560 
bytes, using 'compacted' format.
[0.00] signal: max sigframe size: 3632
[0.00] BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x0003efff] usable
[0.00] BIOS-e820: [mem 0x0003f000-0x0003] reserved
[0.00] BIOS-e820: [mem 0x0004-0x0009] usable
[0.00] BIOS-e820: [mem 0x000a-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0x62a8efff] usable
[0.00] BIOS-e820: [mem 0x62a8f000-0x6aef4fff] reserved
[0.00] BIOS-e820: [mem 0x6aef5000-0x6b0c2fff] ACPI data
[0.00] BIOS-e820: [mem 0x6b0c3000-0x6be0dfff] ACPI NVS
[0.00] BIOS-e820: [mem 0x6be0e000-0x6c7f1fff] reserved
[0.00] BIOS-e820: [mem 0x6c7f2000-0x6c8e5fff] type 20
[0.00] BIOS-e820: [mem 0x6c8e6000-0x6c8e6fff] reserved
[0.00] BIOS-e820: [mem 0x6c8e7000-0x6c9a7fff] type 20
[0.00] BIOS-e820: [mem 0x6c9a8000-0x6e6f] usable
[0.00] BIOS-e820: [mem 0x6e70-0x8fff] reserved
[0.00] BIOS-e820: [mem 0xfd00-0xfe010fff] reserved
[0.00] BIOS-e820: [mem 0xfec0-0xfec00fff] reserved
[0.00] BIOS-e820: [mem 0xfed0-0xfed00fff] reserved
[0.00] BIOS-e820: [mem 0xfed2-0xfed44fff] reserved
[0.00] BIOS-e820: [mem 0xfee0-0xfee00fff] reserved
[0.00] BIOS-e820: [mem 0xff00-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x00107fff] usable
[0.00] NX (Execute Disable) protection: active
[0.00] efi: EFI v2.70 by American Megatrends
[0.00] efi: TPMFinalLog=0x6bd97000 ACPI=0x6b0c2000 ACPI 2.0=0x6b0c2014 
SMBIOS=0x6c47b000 SMBIOS 3.0=0x6c47a000 MEMATTR=0x5d04d018 ESRT=0x604e0718 
[0.00] SMBIOS 3.0.0 present.
[0.00] DMI: LENOVO 30BFS44D00/1036, BIOS S03KT51A 01/17/2022
[0.00] tsc: Detected 3700.000 MHz processor
[0.00] tsc: Detected 3699.850 MHz TSC
[0.000278] e820: update [mem 0x-0x0fff] usable ==> reserved
[0.000281] e820: remove [mem 0x000a-0x000f] usable
[0.000289] last_pfn = 0x108 max_arch_pfn = 0x4
[0.000397] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[0.001626] e820: update [mem 0x7e00-0xf

Re: [PATCH v12 00/31] Speculative page faults

2019-04-22 Thread Michel Lespinasse
Hi Laurent,

Thanks a lot for copying me on this patchset. It took me a few days to
go through it - I had not been following the previous iterations of
this series so I had to catch up. I will be sending comments for
individual commits, but before that I would like to discuss the series
as a whole.

I think these changes are a big step in the right direction. My main
reservation about them is that they are additive - adding some complexity
for speculative page faults - and I wonder if it'd be possible, over the
long term, to replace the existing complexity we have in mmap_sem retry
mechanisms instead of adding to it. This is not something that should
block your progress, but I think it would be good, as we introduce spf,
to evaluate whether we could eventually get all the way to removing the
mmap_sem retry mechanism, or if we will actually have to keep both.


The proposed spf mechanism only handles anon vmas. Is there a
fundamental reason why it couldn't handle mapped files too ?
My understanding is that the mechanism of verifying the vma after
taking back the ptl at the end of the fault would work there too ?
The file has to stay referenced during the fault, but holding the vma's
refcount could be made to cover that ? the vm_file refcount would have
to be released in __free_vma() instead of remove_vma; I'm not quite sure
if that has more implications than I realize ?
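
(To make the question concrete, the revalidation pattern referred to here
is roughly the following; vma_snapshot_still_valid() and the surrounding
variables are placeholder names for illustration, not identifiers from
the series:)

    pte = pte_offset_map_lock(mm, pmd, address, &ptl);
    if (!vma_snapshot_still_valid(vma, seq)) {
        /* the vma changed under us: drop the ptl and let the
         * caller retry the fault under mmap_sem */
        pte_unmap_unlock(pte, ptl);
        return VM_FAULT_RETRY;
    }
    /* the vma is unchanged: safe to install the pte here */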

The proposed spf mechanism only works at the pte level after the page
tables have already been created. The non-spf page fault path takes the
mm->page_table_lock to protect against concurrent page table allocation
by multiple page faults; I think unmapping/freeing page tables could
be done under mm->page_table_lock too so that spf could implement
allocating new page tables by verifying the vma after taking the
mm->page_table_lock ?

The proposed spf mechanism depends on ARCH_HAS_PTE_SPECIAL.
I am not sure what the issue is there - is this due to the vma->vm_start
and vma->vm_pgoff reads in *__vm_normal_page() ?


My last potential concern is about performance. The numbers you have
look great, but I worry about potential regressions in PF performance
for threaded processes that don't currently encounter contention
(i.e. there may be just one thread actually doing all the work while
the others are blocked). I think one good proxy for measuring that
would be to measure a single threaded workload - kernbench would be
fine - without the special-case optimization in patch 22 where
handle_speculative_fault() immediately aborts in the single-threaded case.

Reviewed-by: Michel Lespinasse 
This is for the series as a whole; I expect to do another review pass on
individual commits in the series when we have agreement on the toplevel
stuff (I noticed a few things like out-of-date commit messages but that's
really minor stuff).


I want to add a note about mmap_sem. In the past there has been
discussions about replacing it with an interval lock, but these never
went anywhere, mostly because such mechanisms were too expensive to use
in the page fault path. I think adding the spf
mechanism would invite us to revisit this issue - interval locks may
be a great way to avoid blocking between unrelated mmap_sem writers
(for example, do not delay stack creation for new threads while a
large mmap or munmap may be going on), and probably also to handle
mmap_sem readers that can't easily use the spf mechanism (for example,
gup callers which make use of the returned vmas). But again that is a
separate topic to explore which doesn't have to get resolved before
spf goes in.


Re: [patch 2/2] mm: use vm_unmapped_area() on powerpc architecture

2013-03-18 Thread Michel Lespinasse
On Mon, Mar 18, 2013 at 4:12 AM, Aneesh Kumar K.V
 wrote:
> how about  ?
>
> static bool slice_scan_available(unsigned long addr,
>  struct slice_mask available,
>  int end,
>  unsigned long *boundary_addr)
> {
> unsigned long slice;
> if (addr < SLICE_LOW_TOP) {
> slice = GET_LOW_SLICE_INDEX(addr);
> *boundary_addr = (slice + end) << SLICE_LOW_SHIFT;
> return !!(available.low_slices & (1u << slice));
> } else {
> slice = GET_HIGH_SLICE_INDEX(addr);

> if ((slice + end) >= SLICE_NUM_HIGH)
> /* loop back in the high slice */
> *boundary_addr = SLICE_LOW_TOP;
> else
> *boundary_addr = (slice + end) << SLICE_HIGH_SHIFT;

I don't mind having this section as an if..else rather than a ?:
statement. However, the condition would need to be if (slice == 0 &&
end == 0) or, equivalently, if (slice + end == 0).

This is because the beginning of high slice 0 is at SLICE_LOW_TOP and not at 0,
and the end of high slice 63 is at 64TB not at SLICE_LOW_TOP.

> return !!(available.high_slices & (1u << slice));
> }
> }
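
(For reference, an if..else form consistent with the fixed condition
above - matching the ?: form in the actual slice_scan_available() patch
later in this archive - would look like this; untested in this exact
shape:)

    } else {
        slice = GET_HIGH_SLICE_INDEX(addr);
        if (slice + end == 0)
            /* high slice 0 starts at SLICE_LOW_TOP, not at 0 */
            *boundary_addr = SLICE_LOW_TOP;
        else
            *boundary_addr = (slice + end) << SLICE_HIGH_SHIFT;
        return !!(available.high_slices & (1u << slice));
    }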

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [patch 1/2] mm: remove free_area_cache use in powerpc architecture

2013-03-18 Thread Michel Lespinasse
On Thu, Feb 21, 2013 at 3:05 PM,   wrote:
> From: Michel Lespinasse 
> Subject: mm: remove free_area_cache use in powerpc architecture
>
> As all other architectures have been converted to use vm_unmapped_area(),
> we are about to retire the free_area_cache.

Ping ? I haven't seen any replies from the ppc maintainers, and
powerpc is now the only remaining user of free_area_cache. Could you
please take these two patches ?

Thanks,

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] lglock: add read-preference local-global rwlock

2013-03-05 Thread Michel Lespinasse
On Tue, Mar 5, 2013 at 7:54 AM, Lai Jiangshan  wrote:
> On 03/03/13 01:06, Oleg Nesterov wrote:
>> On 03/02, Michel Lespinasse wrote:
>>>
>>> My version would be slower if it needs to take the
>>> slow path in a reentrant way, but I'm not sure it matters either :)
>>
>> I'd say, this doesn't matter at all, simply because this can only happen
>> if we race with the active writer.
>>
>
> It can also happen when interrupted. (still very rarely)
>
> arch_spin_trylock()
> --->interrupted,
> __this_cpu_read() returns 0.
> arch_spin_trylock() fails
> slowpath, any nested will be slowpath too.
> ...
> ..._read_unlock()
> <---interrupt
> __this_cpu_inc()
> 

Yes (and I think this is actually the most likely way for it to happen).

We do need this to work correctly, but I don't expect we need it to be fast.
(could be wrong, this is only my intuition)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH V2] lglock: add read-preference local-global rwlock

2013-03-05 Thread Michel Lespinasse
Hi Lai,

Just a few comments about your v2 proposal. Hopefully you'll catch
these before you send out v3 :)

- I would prefer reader_refcnt to be unsigned int instead of unsigned long
- I would like some comment to indicate that lgrwlocks don't have
  reader-writer fairness and are thus somewhat discouraged
  (people could use plain lglock if they don't need reader preference,
  though even that use (as brlock) is discouraged already :)
- I don't think FALLBACK_BASE is necessary (you already mentioned you'd
  drop it)
- I prefer using the fallback_rwlock's dep_map for lockdep tracking.
  I feel this is more natural since we want the lgrwlock to behave as
  the rwlock, not as the lglock.
- I prefer to avoid return statements in the middle of functions when
it's easy to do so.

Attached is my current version (based on an earlier version of your code).
You don't have to take it as is but I feel it makes for a more concrete
suggestion :)

Thanks,

8<---
lglock: add read-preference lgrwlock

Current lglock may be used as a fair rwlock; however sometimes a
read-preference rwlock is preferred. One such use case recently came
up for get_cpu_online_atomic().

This change adds a new lgrwlock with the following properties:
- high performance read side, using only cpu-local structures when there
  is no write side to contend with;
- correctness guarantees similar to rwlock_t: recursive readers are allowed
  and the lock's read side is not ordered vs other locks;
- low performance write side (comparable to lglocks' global side).

The implementation relies on the following principles:
- reader_refcnt is a local lock count; it indicates how many recursive
  read locks are taken using the local lglock;
- lglock is used by readers for local locking; it must be acquired
  before reader_refcnt becomes nonzero and released after reader_refcnt
  goes back to zero;
- fallback_rwlock is used by readers for global locking; it is acquired
  when reader_refcnt is zero and the trylock fails on lglock.
- writers take both the lglock write side and the fallback_rwlock, thus
  making sure to exclude both local and global readers.

Thanks to Srivatsa S. Bhat for proposing a lock with these requirements
and Lai Jiangshan for proposing this algorithm as an lglock extension.

Signed-off-by: Michel Lespinasse 

---
 include/linux/lglock.h | 46 +++
 kernel/lglock.c| 58 ++
 2 files changed, 104 insertions(+)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index 0d24e932db0b..8b59084935d5 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -67,4 +67,50 @@ void lg_local_unlock_cpu(struct lglock *lg, int cpu);
 void lg_global_lock(struct lglock *lg);
 void lg_global_unlock(struct lglock *lg);
 
+/*
+ * lglock may be used as a read write spinlock if desired (though this is
+ * not encouraged as the write side scales badly on high CPU count machines).
+ * It has reader/writer fairness when used that way.
+ *
+ * However, sometimes it is desired to have an unfair rwlock instead, with
+ * reentrant readers that don't need to be ordered vs other locks, comparable
+ * to rwlock_t. lgrwlock implements such semantics.
+ */
+struct lgrwlock {
+   unsigned int __percpu *reader_refcnt;
+   struct lglock lglock;
+   rwlock_t fallback_rwlock;
+};
+
+#define __DEFINE_LGRWLOCK_PERCPU_DATA(name)\
+   static DEFINE_PER_CPU(unsigned int, name ## _refcnt);   \
+   static DEFINE_PER_CPU(arch_spinlock_t, name ## _lock)   \
+   = __ARCH_SPIN_LOCK_UNLOCKED;
+
+#define __LGRWLOCK_INIT(name) {
\
+   .reader_refcnt = &name ## _refcnt,  \
+   .lglock = { .lock = &name ## _lock },   \
+   .fallback_rwlock = __RW_LOCK_UNLOCKED(name.fallback_rwlock) \
+}
+
+#define DEFINE_LGRWLOCK(name)  \
+   __DEFINE_LGRWLOCK_PERCPU_DATA(name) \
+   struct lgrwlock name = __LGRWLOCK_INIT(name)
+
+#define DEFINE_STATIC_LGRWLOCK(name)   \
+   __DEFINE_LGRWLOCK_PERCPU_DATA(name) \
+   static struct lgrwlock name = __LGRWLOCK_INIT(name)
+
+static inline void lg_rwlock_init(struct lgrwlock *lgrw, char *name)
+{
+   lg_lock_init(&lgrw->lglock, name);
+}
+
+void lg_read_lock(struct lgrwlock *lgrw);
+void lg_read_unlock(struct lgrwlock *lgrw);
+void lg_write_lock(struct lgrwlock *lgrw);
+void lg_write_unlock(struct lgrwlock *lgrw);
+void __lg_read_write_lock(struct lgrwlock *lgrw);
+void __lg_read_write_unlock(struct lgrwlock *lgrw);
+
 #endif
diff --git a/kernel/lglock.c b/kernel/lglock.c
index 86ae

Re: [PATCH V2] lglock: add read-preference local-global rwlock

2013-03-04 Thread Michel Lespinasse
On Sun, Mar 3, 2013 at 9:40 AM, Oleg Nesterov  wrote:
>> However, I still think that FALLBACK_BASE only adds the unnecessary
>> complications. But even if I am right this is subjective of course, please
>> feel free to ignore.

Would it help if I sent out that version (without FALLBACK_BASE) as a
formal proposal ?

> Hmm. But then I do not understand the lglock annotations. Obviously,
> rwlock_acquire_read() in lg_local_lock() can't even detect the simplest
> deadlock, say, lg_local_lock(LOCK) + lg_local_lock(LOCK). Not to mention
> spin_lock(X) + lg_local_lock(Y) vs lg_local_lock(Y) + spin_lock(X).
>
> OK, I understand that it is not easy to make these annotations correct...

I am going to send out a proposal to fix the existing lglock
annotations and detect the two cases you noticed. It's actually not
that hard :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] lglock: add read-preference local-global rwlock

2013-03-02 Thread Michel Lespinasse
On Sat, Mar 2, 2013 at 2:28 AM, Oleg Nesterov  wrote:
> Lai, I didn't read this discussion except the code posted by Michel.
> I'll try to read this patch carefully later, but I'd like to ask
> a couple of questions.
>
> This version looks more complex than Michel's, why? Just curious, I
> am trying to understand what I missed. See
> http://marc.info/?l=linux-kernel&m=136196350213593

From what I can see, my version used local_refcnt to count how many
reentrant locks are represented by the fastpath lglock spinlock; Lai's
version uses it to count how many reentrant locks are represented by
either the fastpath lglock spinlock or the global rwlock, with
FALLBACK_BASE being a bit thrown in so we can remember which of these
locks was acquired. My version would be slower if it needs to take the
slow path in a reentrant way, but I'm not sure it matters either :)

> Interrupt handler on CPU_1 does _read_lock() notices ->reader_refcnt != 0
> and simply does this_cpu_inc(), so reader_refcnt == FALLBACK_BASE + 1.
>
> Then irq does _read_unlock(), and
>
>> +void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
>> +{
>> + switch (__this_cpu_dec_return(*lgrw->reader_refcnt)) {
>> + case 0:
>> + lg_local_unlock(&lgrw->lglock);
>> + return;
>> + case FALLBACK_BASE:
>> + __this_cpu_sub(*lgrw->reader_refcnt, FALLBACK_BASE);
>> + read_unlock(&lgrw->fallback_rwlock);
>
> hits this case?
>
> Doesn't look right, but most probably I missed something.

Good catch. I think this is easily fixed by setting reader_refcnt
directly to FALLBACK_BASE+1, instead of setting it to FALLBACK_BASE
and then incrementing it to FALLBACK_BASE+1.
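
(Lai's v2 lock-side code is not quoted in this thread, so the following
is only a sketch of the intended change in its fallback path:)

    read_lock(&lgrw->fallback_rwlock);
    /* one store instead of "write FALLBACK_BASE, then increment":
     * an interrupt can no longer observe the intermediate
     * FALLBACK_BASE value */
    __this_cpu_write(*lgrw->reader_refcnt, FALLBACK_BASE + 1);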

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-28 Thread Michel Lespinasse
On Thu, Feb 28, 2013 at 3:25 AM, Oleg Nesterov  wrote:
> On 02/27, Michel Lespinasse wrote:
>>
>> +void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
>> +{
>> +   preempt_disable();
>> +
>> +   if (__this_cpu_read(*lgrw->local_refcnt) ||
>> +   arch_spin_trylock(this_cpu_ptr(lgrw->lglock->lock))) {
>> +   __this_cpu_inc(*lgrw->local_refcnt);
>
> Please look at __this_cpu_generic_to_op(). You need this_cpu_inc()
> to avoid the race with irqs. The same for _read_unlock.

Hmmm, I was thinking that this was safe because while interrupts might
modify local_refcnt to acquire a nested read lock, they are expected
to release that lock as well which would set local_refcnt back to its
original value ???

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-27 Thread Michel Lespinasse
Hi Srivatsa,

I think there is some elegance in Lai's proposal of using a local
trylock for the reader uncontended case and global rwlock to deal with
the contended case without deadlocks. He apparently didn't realize
initially that nested read locks are common, and he seems to have
confused you because of that, but I think his proposal could be
changed easily to account for that and result in short, easily
understood code. What about the following:

- local_refcnt is a local lock count; it indicates how many recursive
locks are taken using the local lglock
- lglock is used by readers for local locking; it must be acquired
before local_refcnt becomes nonzero and released after local_refcnt
goes back to zero.
- fallback_rwlock is used by readers for global locking; it is
acquired when local_refcnt is zero and the trylock fails on
lglock

+void lg_rwlock_local_read_lock(struct lgrwlock *lgrw)
+{
+   preempt_disable();
+
+   if (__this_cpu_read(*lgrw->local_refcnt) ||
+   arch_spin_trylock(this_cpu_ptr(lgrw->lglock->lock))) {
+   __this_cpu_inc(*lgrw->local_refcnt);
+
rwlock_acquire_read(&lgrw->fallback_rwlock->lock_dep_map, 0, 0,
_RET_IP_);
+   } else {
+   read_lock(&lgrw->fallback_rwlock);
+   }
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_lock);
+
+void lg_rwlock_local_read_unlock(struct lgrwlock *lgrw)
+{
+   if (likely(__this_cpu_read(*lgrw->local_refcnt))) {
+   rwlock_release(&lgrw->fallback_rwlock->lock_dep_map,
1, _RET_IP_);
+   if (!__this_cpu_dec_return(*lgrw->local_refcnt))
+   arch_spin_unlock(this_cpu_ptr(lgrw->lglock->lock));
+   } else {
+   read_unlock(&lgrw->fallback_rwlock);
+   }
+
+   preempt_enable();
+}
+EXPORT_SYMBOL(lg_rwlock_local_read_unlock);
+
+void lg_rwlock_global_write_lock(struct lgrwlock *lgrw)
+{
+   int i;
+
+   preempt_disable();
+
+   for_each_possible_cpu(i)
+   arch_spin_lock(per_cpu_ptr(lgrw->lglock->lock, i));
+   write_lock(&lgrw->fallback_rwlock);
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_lock);
+
+void lg_rwlock_global_write_unlock(struct lgrwlock *lgrw)
+{
+   int i;
+
+   write_unlock(&lgrw->fallback_rwlock);
+   for_each_possible_cpu(i)
+   arch_spin_unlock(per_cpu_ptr(lgrw->lglock->lock, i));
+
+   preempt_enable();
+}
+EXPORT_SYMBOL(lg_rwlock_global_write_unlock);

This is to me relatively easier to understand than Srivatsa's
proposal. Now I'm not sure who wins efficiency wise, but I think it
should be relatively close as readers at least don't touch shared
state in the uncontended case (even with some recursion going on).

There is an interesting case where lg_rwlock_local_read_lock could be
interrupted after getting the local lglock but before incrementing
local_refcnt to 1; if that happens any nested readers within that
interrupt will have to take the global rwlock read side. I think this
is perfectly acceptable as this should not be a common case though
(and thus the global rwlock cache line probably wouldn't even bounce
between cpus then).

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 08/46] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

2013-02-19 Thread Michel Lespinasse
On Tue, Feb 19, 2013 at 2:50 AM, Srivatsa S. Bhat
 wrote:
> But, the whole intention behind removing the parts depending on the
> recursive property of rwlocks would be to make it easier to make rwlocks
> fair (going forward) right? Then, that won't work for CPU hotplug, because,
> just like we have a legitimate reason to have recursive
> get_online_cpus_atomic(), we also have a legitimate reason to have
> unfairness in locking (i.e., for deadlock-safety). So we simply can't
> afford to make the locking fair - we'll end up in too many deadlock
> possibilities, as hinted in the changelog of patch 1.

Grumpf - I hadn't realized that making the underlying rwlock fair
would break your hotplug use case. But you are right, it would. Oh
well :/

> So the only long-term solution I can think of is to decouple
> percpu-rwlocks and rwlock_t (like what Tejun suggested) by implementing
> our own unfair locking scheme inside. What do you think?

I have no idea how hard it would be to change get_online_cpus_atomic()
call sites so that the hotplug rwlock read side has a defined order vs
other locks (thus making sure the situation you describe in patch 1
doesn't happen). I agree we shouldn't base our short term plans around
that, but maybe that's doable in the long term ???

Otherwise, I think we should add some big-fat-warning that percpu
rwlocks don't have reader/writer fairness, that the hotplug use case
actually depends on the unfairness / would break if the rwlock was
made fair, and that any new uses of percpu rwlocks should be very
carefully considered because of the reader/writer fairness issues.
Maybe even give percpu rwlocks a less generic sounding name, given how
constrained they are by the hotplug use case.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-18 Thread Michel Lespinasse
On Tue, Feb 19, 2013 at 1:56 AM, Srivatsa S. Bhat
 wrote:
> On 02/18/2013 09:51 PM, Srivatsa S. Bhat wrote:
>> On 02/18/2013 09:15 PM, Michel Lespinasse wrote:
>>> I don't see anything preventing a race with the corresponding code in
>>> percpu_write_unlock() that sets writer_signal back to false. Did I
>>> miss something here ? It seems to me we don't have any guarantee that
>>> all writer signals will be set to true at the end of the loop...
>>
>> Ah, thanks for pointing that out! IIRC Oleg had pointed this issue in the 
>> last
>> version, but back then, I hadn't fully understood what he meant. Your
>> explanation made it clear. I'll work on fixing this.
>
> We can fix this by using the simple patch (untested) shown below.
> The alternative would be to acquire the rwlock for write, update the
> ->writer_signal values, release the lock, wait for readers to switch,
> again acquire the rwlock for write with interrupts disabled etc... which
> makes it kinda messy, IMHO. So I prefer the simple version shown below.

Looks good.

Another alternative would be to make writer_signal an atomic integer
instead of a bool. That way writers can increment it before locking
and decrement it while unlocking.

To reduce the number of atomic ops during writer lock/unlock, the
writer_signal could also be a global read_mostly variable (I don't see
any downsides to that compared to having it percpu - or is it because
you wanted all the fastpath state to be in one single cacheline ?)
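
(A sketch of that alternative, with invented names, not code from the
series:)

    /* global, read-mostly: readers only test it for zero / non-zero */
    static atomic_t writers_pending __read_mostly = ATOMIC_INIT(0);

    /* writer side */
    atomic_inc(&writers_pending);   /* before taking the write side   */
    /* ... write-side critical section ... */
    atomic_dec(&writers_pending);   /* while releasing the write side */

    /* reader fast path */
    if (!atomic_read(&writers_pending)) {
        /* no writer pending: use the per-cpu fast path */
    } else {
        /* writer pending: switch over to the global rwlock */
    }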

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 08/46] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

2013-02-18 Thread Michel Lespinasse
On Tue, Feb 19, 2013 at 12:43 AM, Srivatsa S. Bhat
 wrote:
> On 02/18/2013 09:53 PM, Michel Lespinasse wrote:
>> I am wondering though, if you could take care of recursive uses in
>> get/put_online_cpus_atomic() instead of doing it as a property of your
>> rwlock:
>>
>> get_online_cpus_atomic()
>> {
>>     unsigned long flags;
>>     local_irq_save(flags);
>>     if (this_cpu_inc_return(hotplug_recursion_count) == 1)
>>         percpu_read_lock_irqsafe(&hotplug_pcpu_rwlock);
>>     local_irq_restore(flags);
>> }
>>
>> Once again, the idea there is to avoid baking the reader side
>> recursive properties into your rwlock itself, so that it won't be
>> impossible to implement reader/writer fairness into your rwlock in the
>> future (which may not be very important for the hotplug use, but
>> could be when other uses get introduced).
>
> Hmm, your proposal above looks good to me, at first glance.
> (Sorry, I had mistaken your earlier mails to mean that you were against
> recursive reader-side, while you actually meant that you didn't like
> implementing the recursive reader-side logic using the recursive property
> of rwlocks).

To be honest, I was replying as I went through the series, so I hadn't
digested your hotplug use case yet :)

But yes - I don't like having the rwlock itself be recursive, but I do
understand that you have a legitimate requirement for
get_online_cpus_atomic() to be recursive. This IMO points to the
direction I suggested, of explicitly handling the recursion in
get_online_cpus_atomic() so that the underlying rwlock doesn't have to
support recursive reader side itself.

(And this would work for the idea of making writers own the reader
side as well - you can do it with the hotplug_recursion_count instead
of with the underlying rwlock).

> While your idea above looks good, it might introduce more complexity
> in the unlock path, since this would allow nesting of heterogeneous readers
> (ie., if hotplug_recursion_count == 1, you don't know whether you need to
> simply decrement the counter or unlock the rwlock as well).

Well, I think the idea doesn't make the underlying rwlock more
complex, since you could in principle keep your existing
percpu_read_lock_irqsafe implementation as is and just remove the
recursive behavior from its documentation.

Now ideally if we're adding a bit of complexity in
get_online_cpus_atomic() it'd be nice if we could remove some from
percpu_read_lock_irqsafe, but I haven't thought about that deeply
either. I think you'd still want to have the equivalent of a percpu
reader_refcnt, except it could now be a bool instead of an int, and
percpu_read_lock_irqsafe would still set it back to 0/false after
acquiring the global read side if a writer is signaled. Basically your
existing percpu_read_lock_irqsafe code should still work, and we could
remove just the parts that were only there to deal with the recursive
property.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 08/46] CPU hotplug: Provide APIs to prevent CPU offline from atomic context

2013-02-18 Thread Michel Lespinasse
On Mon, Feb 18, 2013 at 8:39 PM, Srivatsa S. Bhat
 wrote:
> Some important design requirements and considerations:
> -
>
> 1. Scalable synchronization at the reader-side, especially in the fast-path
>
>Any synchronization at the atomic hotplug readers side must be highly
>scalable - avoid global single-holder locks/counters etc. Because, these
>paths currently use the extremely fast preempt_disable(); our replacement
>to preempt_disable() should not become ridiculously costly and also should
>not serialize the readers among themselves needlessly.
>
>At a minimum, the new APIs must be extremely fast at the reader side
>at least in the fast-path, when no CPU offline writers are active.
>
> 2. preempt_disable() was recursive. The replacement should also be recursive.
>
> 3. No (new) lock-ordering restrictions
>
>preempt_disable() was super-flexible. It didn't impose any ordering
>restrictions or rules for nesting. Our replacement should also be equally
>flexible and usable.
>
> 4. No deadlock possibilities
>
>Regular per-cpu locking is not the way to go if we want to have relaxed
>rules for lock-ordering. Because, we can end up in circular-locking
>dependencies as explained in https://lkml.org/lkml/2012/12/6/290
>
>So, avoid the usual per-cpu locking schemes (per-cpu locks/per-cpu atomic
>counters with spin-on-contention etc) as much as possible, to avoid
>numerous deadlock possibilities from creeping in.
>
>
> Implementation of the design:
> 
>
> We use per-CPU reader-writer locks for synchronization because:
>
>   a. They are quite fast and scalable in the fast-path (when no writers are
>  active), since they use fast per-cpu counters in those paths.
>
>   b. They are recursive at the reader side.
>
>   c. They provide a good amount of safety against deadlocks; they don't
>  spring new deadlock possibilities on us from out of nowhere. As a
>  result, they have relaxed locking rules and are quite flexible, and
>  thus are best suited for replacing usages of preempt_disable() or
>  local_irq_disable() at the reader side.
>
> Together, these satisfy all the requirements mentioned above.

Thanks for this detailed design explanation.

> +/*
> + * Invoked by atomic hotplug reader (a task which wants to prevent
> + * CPU offline, but which can't afford to sleep), to prevent CPUs from
> + * going offline. So, you can call this function from atomic contexts
> + * (including interrupt handlers).
> + *
> + * Note: This does NOT prevent CPUs from coming online! It only prevents
> + * CPUs from going offline.
> + *
> + * You can call this function recursively.
> + *
> + * Returns with preemption disabled (but interrupts remain as they are;
> + * they are not disabled).
> + */
> +void get_online_cpus_atomic(void)
> +{
> +   percpu_read_lock_irqsafe(&hotplug_pcpu_rwlock);
> +}
> +EXPORT_SYMBOL_GPL(get_online_cpus_atomic);
> +
> +void put_online_cpus_atomic(void)
> +{
> +   percpu_read_unlock_irqsafe(&hotplug_pcpu_rwlock);
> +}
> +EXPORT_SYMBOL_GPL(put_online_cpus_atomic);

So, you made it clear why you want a recursive read side here.

I am wondering though, if you could take care of recursive uses in
get/put_online_cpus_atomic() instead of doing it as a property of your
rwlock:

get_online_cpus_atomic()
{
    unsigned long flags;
    local_irq_save(flags);
    if (this_cpu_inc_return(hotplug_recursion_count) == 1)
        percpu_read_lock_irqsafe(&hotplug_pcpu_rwlock);
    local_irq_restore(flags);
}
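
(For symmetry, the unlock side would be the obvious counterpart - again
just a sketch, not code from the series:)

put_online_cpus_atomic()
{
    unsigned long flags;
    local_irq_save(flags);
    if (this_cpu_dec_return(hotplug_recursion_count) == 0)
        percpu_read_unlock_irqsafe(&hotplug_pcpu_rwlock);
    local_irq_restore(flags);
}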

Once again, the idea there is to avoid baking the reader side
recursive properties into your rwlock itself, so that it won't be
impossible to implement reader/writer fairness into your rwlock in the
future (which may not be very important for the hotplug use, but
could be when other uses get introduced).

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 07/46] percpu_rwlock: Allow writers to be readers, and add lockdep annotations

2013-02-18 Thread Michel Lespinasse
On Mon, Feb 18, 2013 at 8:39 PM, Srivatsa S. Bhat
 wrote:
> @@ -200,6 +217,16 @@ void percpu_write_lock_irqsave(struct percpu_rwlock 
> *pcpu_rwlock,
>
> smp_mb(); /* Complete the wait-for-readers, before taking the lock */
> write_lock_irqsave(&pcpu_rwlock->global_rwlock, *flags);
> +
> +   /*
> +* It is desirable to allow the writer to acquire the percpu-rwlock
> +* for read (if necessary), without deadlocking or getting complaints
> +* from lockdep. To achieve that, just increment the reader_refcnt of
> +* this CPU - that way, any attempt by the writer to acquire the
> +* percpu-rwlock for read, will get treated as a case of nested percpu
> +* reader, which is safe, from a locking perspective.
> +*/
> +   this_cpu_inc(pcpu_rwlock->rw_state->reader_refcnt);

I find this quite disgusting, but once again this may be because I
don't like unfair recursive rwlocks.

In my opinion, the alternative of explicitly not taking the read lock
when one already has the write lock sounds *much* nicer.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v6 04/46] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks

2013-02-18 Thread Michel Lespinasse
Hi Srivatsa,

I admit not having followed in detail the threads about the previous
iteration, so some of my comments may have been discussed already
before - apologies if that is the case.

On Mon, Feb 18, 2013 at 8:38 PM, Srivatsa S. Bhat
 wrote:
> Reader-writer locks and per-cpu counters are recursive, so they can be
> used in a nested fashion in the reader-path, which makes per-CPU rwlocks also
> recursive. Also, this design of switching the synchronization scheme ensures
> that you can safely nest and use these locks in a very flexible manner.

I like the general idea of switching between per-cpu and global
rwlocks as needed; however I dislike unfair locks, and especially
unfair recursive rwlocks.

If you look at rwlock_t, the main reason we haven't been able to
implement reader/writer fairness there is because tasklist_lock makes
use of the recursive nature of the rwlock_t read side. I'm worried
about introducing more lock usages that would make use of the same
property for your proposed lock.

I am fine with your proposal not implementing reader/writer fairness
from day 1, but I am worried about your proposal having a recursive
reader side. Or, to put it another way: if your proposal didn't have a
recursive reader side, and rwlock_t could somehow be changed to
implement reader/writer fairness, then this property could
automatically propagate into your proposed rwlock; but if anyone makes
use of the recursive nature of your proposal then implementing
reader/writer fairness later won't be as easy.

I see that the very next change in this series is talking about
acquiring the read side from interrupts, so it does look like you're
planning to make use of the recursive nature of the read side. I kinda
wish you didn't, as this is exactly replicating the design of
tasklist_lock which is IMO problematic. Your prior proposal of
disabling interrupts during the read side had other disadvantages, but
I think it was nice that it didn't rely on having a recursive read
side.

> +#define reader_yet_to_switch(pcpu_rwlock, cpu) \
> +   (ACCESS_ONCE(per_cpu_ptr((pcpu_rwlock)->rw_state, 
> cpu)->reader_refcnt))
> +
> +#define reader_percpu_nesting_depth(pcpu_rwlock) \
> +   (__this_cpu_read((pcpu_rwlock)->rw_state->reader_refcnt))
> +
> +#define reader_uses_percpu_refcnt(pcpu_rwlock) \
> +   reader_percpu_nesting_depth(pcpu_rwlock)
> +
> +#define reader_nested_percpu(pcpu_rwlock)  \
> +   (reader_percpu_nesting_depth(pcpu_rwlock) > 1)
> +
> +#define writer_active(pcpu_rwlock) \
> +   (__this_cpu_read((pcpu_rwlock)->rw_state->writer_signal))

I'm personally not a fan of such one-line shorthand functions - I
think they tend to make the code harder to read instead of easier, as
one constantly has to refer to them to understand what's actually
going on.

>  void percpu_write_lock(struct percpu_rwlock *pcpu_rwlock)
>  {
> +   unsigned int cpu;
> +
> +   /*
> +* Tell all readers that a writer is becoming active, so that they
> +* start switching over to the global rwlock.
> +*/
> +   for_each_possible_cpu(cpu)
> +   per_cpu_ptr(pcpu_rwlock->rw_state, cpu)->writer_signal = true;

I don't see anything preventing a race with the corresponding code in
percpu_write_unlock() that sets writer_signal back to false. Did I
miss something here ? It seems to me we don't have any guarantee that
all writer signals will be set to true at the end of the loop...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH v5 01/45] percpu_rwlock: Introduce the global reader-writer lock backend

2013-01-23 Thread Michel Lespinasse
On Tue, Jan 22, 2013 at 11:32 AM, Steven Rostedt  wrote:
> On Tue, 2013-01-22 at 13:03 +0530, Srivatsa S. Bhat wrote:
>> A straight-forward (and obvious) algorithm to implement Per-CPU Reader-Writer
>> locks can also lead to too many deadlock possibilities which can make it very
>> hard/impossible to use. This is explained in the example below, which helps
>> justify the need for a different algorithm to implement flexible Per-CPU
>> Reader-Writer locks.
>>
>> We can use global rwlocks as shown below safely, without fear of deadlocks:
>>
>> Readers:
>>
>>  CPU 0CPU 1
>>  --   --
>>
>> 1.spin_lock(&random_lock); read_lock(&my_rwlock);
>>
>>
>> 2.read_lock(&my_rwlock);   spin_lock(&random_lock);
>>
>>
>> Writer:
>>
>>  CPU 2:
>>  --
>>
>>write_lock(&my_rwlock);
>>
>
> I thought global locks are now fair. That is, a reader will block if a
> writer is waiting. Hence, the above should deadlock on the current
> rwlock_t types.

I believe you are mistaken here. struct rw_semaphore is fair (and
blocking), but rwlock_t is unfair. The reason we can't easily make
rwlock_t fair is because tasklist_lock currently depends on the
rwlock_t unfairness - tasklist_lock readers typically don't disable
local interrupts, and tasklist_lock may be acquired again from within
an interrupt, which would deadlock if rwlock_t was fair and a writer
was queued by the time the interrupt is processed.
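
(Spelled out, in the same style as the example quoted above, the
scenario that the unfairness avoids is:)

     CPU 0                              CPU 1
     --                                 --

1.   read_lock(&tasklist_lock);

2.                                      write_lock_irq(&tasklist_lock);
                                        (writer is now queued; a fair
                                        rwlock would block new readers)

3.   <interrupt>
       read_lock(&tasklist_lock);

With a fair rwlock_t, the nested read_lock() at step 3 would wait behind
the queued writer, while CPU 0 still holds the outer read lock that the
writer is waiting for: deadlock. The current unfair rwlock_t lets the
nested reader through, which is exactly what tasklist_lock relies on.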

> We need to fix those locations (or better yet, remove all rwlocks ;-)

tasklist_lock is the main remaining user. I'm not sure about removing
rwlock_t, but I would like to at least make it fair somehow :)

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 8/8] mm: remove free_area_cache

2013-01-23 Thread Michel Lespinasse
Since all architectures have been converted to use vm_unmapped_area(),
there is no remaining use for the free_area_cache.

Signed-off-by: Michel Lespinasse 
Acked-by: Rik van Riel 

---
 arch/arm/mm/mmap.c   |2 --
 arch/arm64/mm/mmap.c |2 --
 arch/mips/mm/mmap.c  |2 --
 arch/powerpc/mm/mmap_64.c|2 --
 arch/s390/mm/mmap.c  |4 
 arch/sparc/kernel/sys_sparc_64.c |2 --
 arch/tile/mm/mmap.c  |2 --
 arch/x86/ia32/ia32_aout.c|2 --
 arch/x86/mm/mmap.c   |2 --
 fs/binfmt_aout.c |2 --
 fs/binfmt_elf.c  |2 --
 include/linux/mm_types.h |3 ---
 include/linux/sched.h|2 --
 kernel/fork.c|4 
 mm/mmap.c|   28 
 mm/nommu.c   |4 
 mm/util.c|1 -
 17 files changed, 0 insertions(+), 66 deletions(-)

diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c
index 10062ceadd1c..0c6356255fe3 100644
--- a/arch/arm/mm/mmap.c
+++ b/arch/arm/mm/mmap.c
@@ -181,11 +181,9 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
mm->get_unmapped_area = arch_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base(random_factor);
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
 
diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
index 7c7be7855638..8ed6cb1a900f 100644
--- a/arch/arm64/mm/mmap.c
+++ b/arch/arm64/mm/mmap.c
@@ -90,11 +90,9 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE;
mm->get_unmapped_area = arch_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base();
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
 EXPORT_SYMBOL_GPL(arch_pick_mmap_layout);
diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c
index d9be7540a6be..f4e63c29d044 100644
--- a/arch/mips/mm/mmap.c
+++ b/arch/mips/mm/mmap.c
@@ -158,11 +158,9 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
mm->get_unmapped_area = arch_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base(random_factor);
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
 
diff --git a/arch/powerpc/mm/mmap_64.c b/arch/powerpc/mm/mmap_64.c
index 67a42ed0d2fc..cb8bdbe4972f 100644
--- a/arch/powerpc/mm/mmap_64.c
+++ b/arch/powerpc/mm/mmap_64.c
@@ -92,10 +92,8 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE;
mm->get_unmapped_area = arch_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base();
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c
index c59a5efa58b1..f2a462625c9e 100644
--- a/arch/s390/mm/mmap.c
+++ b/arch/s390/mm/mmap.c
@@ -91,11 +91,9 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE;
mm->get_unmapped_area = arch_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base();
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
 
@@ -173,11 +171,9 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE;
mm->get_unmapped_area = s390_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base();
mm->get_unmapped_area = s390_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
 
diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c
index 708bc29d36a8..f3c169f9d3a1 100644
--- a/arch/sparc/kernel/sys_sparc_64.

[PATCH 7/8] mm: use vm_unmapped_area() on powerpc architecture

2013-01-23 Thread Michel Lespinasse
Update the powerpc slice_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 

---
 arch/powerpc/mm/slice.c |  123 ++-
 1 files changed, 78 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 999a74f25ebe..3e99c149271a 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -237,36 +237,69 @@ static void slice_convert(struct mm_struct *mm, struct 
slice_mask mask, int psiz
 #endif
 }
 
+/*
+ * Compute which slice addr is part of;
+ * set *boundary_addr to the start or end boundary of that slice
+ * (depending on 'end' parameter);
+ * return boolean indicating if the slice is marked as available in the
+ * 'available' slice_mark.
+ */
+static bool slice_scan_available(unsigned long addr,
+struct slice_mask available,
+int end,
+unsigned long *boundary_addr)
+{
+   unsigned long slice;
+   if (addr < SLICE_LOW_TOP) {
+   slice = GET_LOW_SLICE_INDEX(addr);
+   *boundary_addr = (slice + end) << SLICE_LOW_SHIFT;
+   return !!(available.low_slices & (1u << slice));
+   } else {
+   slice = GET_HIGH_SLICE_INDEX(addr);
+   *boundary_addr = (slice + end) ?
+   ((slice + end) << SLICE_HIGH_SHIFT) : SLICE_LOW_TOP;
+   return !!(available.high_slices & (1u << slice));
+   }
+}
+
 static unsigned long slice_find_area_bottomup(struct mm_struct *mm,
  unsigned long len,
  struct slice_mask available,
  int psize)
 {
-   struct vm_area_struct *vma;
-   unsigned long addr;
-   struct slice_mask mask;
int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT);
+   unsigned long addr, found, next_end;
+   struct vm_unmapped_area_info info;
 
-   addr = TASK_UNMAPPED_BASE;
-
-   for (;;) {
-   addr = _ALIGN_UP(addr, 1ul << pshift);
-   if ((TASK_SIZE - len) < addr)
-   break;
-   vma = find_vma(mm, addr);
-   BUG_ON(vma && (addr >= vma->vm_end));
+   info.flags = 0;
+   info.length = len;
+   info.align_mask = PAGE_MASK & ((1ul << pshift) - 1);
+   info.align_offset = 0;
 
-   mask = slice_range_to_mask(addr, len);
-   if (!slice_check_fit(mask, available)) {
-   if (addr < SLICE_LOW_TOP)
-   addr = _ALIGN_UP(addr + 1,  1ul << 
SLICE_LOW_SHIFT);
-   else
-   addr = _ALIGN_UP(addr + 1,  1ul << 
SLICE_HIGH_SHIFT);
+   addr = TASK_UNMAPPED_BASE;
+   while (addr < TASK_SIZE) {
+   info.low_limit = addr;
+   if (!slice_scan_available(addr, available, 1, &addr))
continue;
+
+ next_slice:
+   /*
+* At this point [info.low_limit; addr) covers
+* available slices only and ends at a slice boundary.
+* Check if we need to reduce the range, or if we can
+* extend it to cover the next available slice.
+*/
+   if (addr >= TASK_SIZE)
+   addr = TASK_SIZE;
+   else if (slice_scan_available(addr, available, 1, &next_end)) {
+   addr = next_end;
+   goto next_slice;
}
-   if (!vma || addr + len <= vma->vm_start)
-   return addr;
-   addr = vma->vm_end;
+   info.high_limit = addr;
+
+   found = vm_unmapped_area(&info);
+   if (!(found & ~PAGE_MASK))
+   return found;
}
 
return -ENOMEM;
@@ -277,39 +310,39 @@ static unsigned long slice_find_area_topdown(struct 
mm_struct *mm,
 struct slice_mask available,
 int psize)
 {
-   struct vm_area_struct *vma;
-   unsigned long addr;
-   struct slice_mask mask;
int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT);
+   unsigned long addr, found, prev;
+   struct vm_unmapped_area_info info;
 
-   addr = mm->mmap_base;
-   while (addr > len) {
-   /* Go down by chunk size */
-   addr = _ALIGN_DOWN(addr - len, 1ul << pshift);
+   info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+   info.length = len;
+   info.align_mask = PAGE_MASK & ((1ul << pshift) - 1);
+   inf

[PATCH 6/8] mm: remove free_area_cache use in powerpc architecture

2013-01-23 Thread Michel Lespinasse
As all other architectures have been converted to use vm_unmapped_area(),
we are about to retire the free_area_cache.

This change simply removes the use of that cache in
slice_get_unmapped_area(), which will most certainly have a
performance cost. Next one will convert that function to use the
vm_unmapped_area() infrastructure and regain the performance.

Signed-off-by: Michel Lespinasse 
Acked-by: Rik van Riel 

---
 arch/powerpc/include/asm/page_64.h   |3 +-
 arch/powerpc/mm/hugetlbpage.c|2 +-
 arch/powerpc/mm/slice.c  |  108 +
 arch/powerpc/platforms/cell/spufs/file.c |2 +-
 4 files changed, 22 insertions(+), 93 deletions(-)

diff --git a/arch/powerpc/include/asm/page_64.h 
b/arch/powerpc/include/asm/page_64.h
index cd915d6b093d..88693cef4f3d 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -99,8 +99,7 @@ extern unsigned long slice_get_unmapped_area(unsigned long 
addr,
 unsigned long len,
 unsigned long flags,
 unsigned int psize,
-int topdown,
-int use_cache);
+int topdown);
 
 extern unsigned int get_slice_psize(struct mm_struct *mm,
unsigned long addr);
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 1a6de0a7d8eb..5dc52d803ed8 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -742,7 +742,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,
struct hstate *hstate = hstate_file(file);
int mmu_psize = shift_to_mmu_psize(huge_page_shift(hstate));
 
-   return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1, 0);
+   return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1);
 }
 #endif
 
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index cf9dada734b6..999a74f25ebe 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -240,23 +240,15 @@ static void slice_convert(struct mm_struct *mm, struct 
slice_mask mask, int psiz
 static unsigned long slice_find_area_bottomup(struct mm_struct *mm,
  unsigned long len,
  struct slice_mask available,
- int psize, int use_cache)
+ int psize)
 {
struct vm_area_struct *vma;
-   unsigned long start_addr, addr;
+   unsigned long addr;
struct slice_mask mask;
int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT);
 
-   if (use_cache) {
-   if (len <= mm->cached_hole_size) {
-   start_addr = addr = TASK_UNMAPPED_BASE;
-   mm->cached_hole_size = 0;
-   } else
-   start_addr = addr = mm->free_area_cache;
-   } else
-   start_addr = addr = TASK_UNMAPPED_BASE;
+   addr = TASK_UNMAPPED_BASE;
 
-full_search:
for (;;) {
addr = _ALIGN_UP(addr, 1ul << pshift);
if ((TASK_SIZE - len) < addr)
@@ -272,63 +264,24 @@ full_search:
addr = _ALIGN_UP(addr + 1,  1ul << 
SLICE_HIGH_SHIFT);
continue;
}
-   if (!vma || addr + len <= vma->vm_start) {
-   /*
-* Remember the place where we stopped the search:
-*/
-   if (use_cache)
-   mm->free_area_cache = addr + len;
+   if (!vma || addr + len <= vma->vm_start)
return addr;
-   }
-   if (use_cache && (addr + mm->cached_hole_size) < vma->vm_start)
-   mm->cached_hole_size = vma->vm_start - addr;
addr = vma->vm_end;
}
 
-   /* Make sure we didn't miss any holes */
-   if (use_cache && start_addr != TASK_UNMAPPED_BASE) {
-   start_addr = addr = TASK_UNMAPPED_BASE;
-   mm->cached_hole_size = 0;
-   goto full_search;
-   }
return -ENOMEM;
 }
 
 static unsigned long slice_find_area_topdown(struct mm_struct *mm,
 unsigned long len,
 struct slice_mask available,
-int psize, int use_cache)
+int psize)
 {
struct vm_area_struct *vma;
unsigned long addr;
struct slice_mask mask;
int

[PATCH 5/8] mm: use vm_unmapped_area() in hugetlbfs on ia64 architecture

2013-01-23 Thread Michel Lespinasse
Update the ia64 hugetlb_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 
Acked-by: Rik van Riel 

---
 arch/ia64/mm/hugetlbpage.c |   20 +---
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index 5ca674b74737..76069c18ee42 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -148,7 +148,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, 
unsigned long len,
unsigned long pgoff, unsigned long flags)
 {
-   struct vm_area_struct *vmm;
+   struct vm_unmapped_area_info info;
 
if (len > RGN_MAP_LIMIT)
return -ENOMEM;
@@ -165,16 +165,14 @@ unsigned long hugetlb_get_unmapped_area(struct file 
*file, unsigned long addr, u
/* This code assumes that RGN_HPAGE != 0. */
if ((REGION_NUMBER(addr) != RGN_HPAGE) || (addr & (HPAGE_SIZE - 1)))
addr = HPAGE_REGION_BASE;
-   else
-   addr = ALIGN(addr, HPAGE_SIZE);
-   for (vmm = find_vma(current->mm, addr); ; vmm = vmm->vm_next) {
-   /* At this point:  (!vmm || addr < vmm->vm_end). */
-   if (REGION_OFFSET(addr) + len > RGN_MAP_LIMIT)
-   return -ENOMEM;
-   if (!vmm || (addr + len) <= vmm->vm_start)
-   return addr;
-   addr = ALIGN(vmm->vm_end, HPAGE_SIZE);
-   }
+
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = addr;
+   info.high_limit = HPAGE_REGION_BASE + RGN_MAP_LIMIT;
+   info.align_mask = PAGE_MASK & (HPAGE_SIZE - 1);
+   info.align_offset = 0;
+   return vm_unmapped_area(&info);
 }
 
 static int __init hugetlb_setup_sz(char *str)
-- 
1.7.7.3


[PATCH 4/8] mm: use vm_unmapped_area() on ia64 architecture

2013-01-23 Thread Michel Lespinasse
Update the ia64 arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 
Acked-by: Rik van Riel 

---
 arch/ia64/kernel/sys_ia64.c |   37 -
 1 files changed, 12 insertions(+), 25 deletions(-)

diff --git a/arch/ia64/kernel/sys_ia64.c b/arch/ia64/kernel/sys_ia64.c
index d9439ef2f661..41e33f84c185 100644
--- a/arch/ia64/kernel/sys_ia64.c
+++ b/arch/ia64/kernel/sys_ia64.c
@@ -25,9 +25,9 @@ arch_get_unmapped_area (struct file *filp, unsigned long 
addr, unsigned long len
unsigned long pgoff, unsigned long flags)
 {
long map_shared = (flags & MAP_SHARED);
-   unsigned long start_addr, align_mask = PAGE_SIZE - 1;
+   unsigned long align_mask = 0;
struct mm_struct *mm = current->mm;
-   struct vm_area_struct *vma;
+   struct vm_unmapped_area_info info;
 
if (len > RGN_MAP_LIMIT)
return -ENOMEM;
@@ -44,7 +44,7 @@ arch_get_unmapped_area (struct file *filp, unsigned long 
addr, unsigned long len
addr = 0;
 #endif
if (!addr)
-   addr = mm->free_area_cache;
+   addr = TASK_UNMAPPED_BASE;
 
if (map_shared && (TASK_SIZE > 0xul))
/*
@@ -53,28 +53,15 @@ arch_get_unmapped_area (struct file *filp, unsigned long 
addr, unsigned long len
 * tasks, we prefer to avoid exhausting the address space too 
quickly by
 * limiting alignment to a single page.
 */
-   align_mask = SHMLBA - 1;
-
-  full_search:
-   start_addr = addr = (addr + align_mask) & ~align_mask;
-
-   for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
-   /* At this point:  (!vma || addr < vma->vm_end). */
-   if (TASK_SIZE - len < addr || RGN_MAP_LIMIT - len < 
REGION_OFFSET(addr)) {
-   if (start_addr != TASK_UNMAPPED_BASE) {
-   /* Start a new search --- just in case we 
missed some holes.  */
-   addr = TASK_UNMAPPED_BASE;
-   goto full_search;
-   }
-   return -ENOMEM;
-   }
-   if (!vma || addr + len <= vma->vm_start) {
-   /* Remember the address where we stopped this search:  
*/
-   mm->free_area_cache = addr + len;
-   return addr;
-   }
-   addr = (vma->vm_end + align_mask) & ~align_mask;
-   }
+   align_mask = PAGE_MASK & (SHMLBA - 1);
+
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = addr;
+   info.high_limit = TASK_SIZE;
+   info.align_mask = align_mask;
+   info.align_offset = 0;
+   return vm_unmapped_area(&info);
 }
 
 asmlinkage long
-- 
1.7.7.3


[PATCH 3/8] mm: use vm_unmapped_area() on frv architecture

2013-01-23 Thread Michel Lespinasse
Update the frv arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 
Acked-by: Rik van Riel 

---
 arch/frv/mm/elf-fdpic.c |   49 --
 1 files changed, 17 insertions(+), 32 deletions(-)

diff --git a/arch/frv/mm/elf-fdpic.c b/arch/frv/mm/elf-fdpic.c
index 385fd30b142f..836f14707a62 100644
--- a/arch/frv/mm/elf-fdpic.c
+++ b/arch/frv/mm/elf-fdpic.c
@@ -60,7 +60,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, 
unsigned long addr, unsi
 unsigned long pgoff, unsigned long flags)
 {
struct vm_area_struct *vma;
-   unsigned long limit;
+   struct vm_unmapped_area_info info;
 
if (len > TASK_SIZE)
return -ENOMEM;
@@ -79,39 +79,24 @@ unsigned long arch_get_unmapped_area(struct file *filp, 
unsigned long addr, unsi
}
 
/* search between the bottom of user VM and the stack grow area */
-   addr = PAGE_SIZE;
-   limit = (current->mm->start_stack - 0x0020);
-   if (addr + len <= limit) {
-   limit -= len;
-
-   if (addr <= limit) {
-   vma = find_vma(current->mm, PAGE_SIZE);
-   for (; vma; vma = vma->vm_next) {
-   if (addr > limit)
-   break;
-   if (addr + len <= vma->vm_start)
-   goto success;
-   addr = vma->vm_end;
-   }
-   }
-   }
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = PAGE_SIZE;
+   info.high_limit = (current->mm->start_stack - 0x0020);
+   info.align_mask = 0;
+   info.align_offset = 0;
+   addr = vm_unmapped_area(&info);
+   if (!(addr & ~PAGE_MASK))
+   goto success;
+   VM_BUG_ON(addr != -ENOMEM);
 
/* search from just above the WorkRAM area to the top of memory */
-   addr = PAGE_ALIGN(0x8000);
-   limit = TASK_SIZE - len;
-   if (addr <= limit) {
-   vma = find_vma(current->mm, addr);
-   for (; vma; vma = vma->vm_next) {
-   if (addr > limit)
-   break;
-   if (addr + len <= vma->vm_start)
-   goto success;
-   addr = vma->vm_end;
-   }
-
-   if (!vma && addr <= limit)
-   goto success;
-   }
+   info.low_limit = PAGE_ALIGN(0x8000);
+   info.high_limit = TASK_SIZE;
+   addr = vm_unmapped_area(&info);
+   if (!(addr & ~PAGE_MASK))
+   goto success;
+   VM_BUG_ON(addr != -ENOMEM);
 
 #if 0
printk("[area] l=%lx (ENOMEM) f='%s'\n",
-- 
1.7.7.3


[PATCH 2/8] mm: use vm_unmapped_area() on alpha architecture

2013-01-23 Thread Michel Lespinasse
Update the alpha arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 
Acked-by: Rik van Riel 

---
 arch/alpha/kernel/osf_sys.c |   20 +---
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 14db93e4c8a8..ba707e23ef37 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -1298,17 +1298,15 @@ static unsigned long
 arch_get_unmapped_area_1(unsigned long addr, unsigned long len,
 unsigned long limit)
 {
-   struct vm_area_struct *vma = find_vma(current->mm, addr);
-
-   while (1) {
-   /* At this point:  (!vma || addr < vma->vm_end). */
-   if (limit - len < addr)
-   return -ENOMEM;
-   if (!vma || addr + len <= vma->vm_start)
-   return addr;
-   addr = vma->vm_end;
-   vma = vma->vm_next;
-   }
+   struct vm_unmapped_area_info info;
+
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = addr;
+   info.high_limit = limit;
+   info.align_mask = 0;
+   info.align_offset = 0;
+   return vm_unmapped_area(&info);
 }
 
 unsigned long
-- 
1.7.7.3


[PATCH 1/8] mm: use vm_unmapped_area() on parisc architecture

2013-01-23 Thread Michel Lespinasse
Update the parisc arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 
Acked-by: Rik van Riel 

---
 arch/parisc/kernel/sys_parisc.c |   46 ++
 1 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/arch/parisc/kernel/sys_parisc.c b/arch/parisc/kernel/sys_parisc.c
index f76c10863c62..6ab138088076 100644
--- a/arch/parisc/kernel/sys_parisc.c
+++ b/arch/parisc/kernel/sys_parisc.c
@@ -35,18 +35,15 @@
 
 static unsigned long get_unshared_area(unsigned long addr, unsigned long len)
 {
-   struct vm_area_struct *vma;
+   struct vm_unmapped_area_info info;
 
-   addr = PAGE_ALIGN(addr);
-
-   for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
-   /* At this point:  (!vma || addr < vma->vm_end). */
-   if (TASK_SIZE - len < addr)
-   return -ENOMEM;
-   if (!vma || addr + len <= vma->vm_start)
-   return addr;
-   addr = vma->vm_end;
-   }
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = PAGE_ALIGN(addr);
+   info.high_limit = TASK_SIZE;
+   info.align_mask = 0;
+   info.align_offset = 0;
+   return vm_unmapped_area(&info);
 }
 
 #define DCACHE_ALIGN(addr) (((addr) + (SHMLBA - 1)) &~ (SHMLBA - 1))
@@ -63,30 +60,21 @@ static unsigned long get_unshared_area(unsigned long addr, 
unsigned long len)
  */
 static int get_offset(struct address_space *mapping)
 {
-   int offset = (unsigned long) mapping << (PAGE_SHIFT - 8);
-   return offset & 0x3FF000;
+   return (unsigned long) mapping >> 8;
 }
 
 static unsigned long get_shared_area(struct address_space *mapping,
unsigned long addr, unsigned long len, unsigned long pgoff)
 {
-   struct vm_area_struct *vma;
-   int offset = mapping ? get_offset(mapping) : 0;
-
-   offset = (offset + (pgoff << PAGE_SHIFT)) & 0x3FF000;
+   struct vm_unmapped_area_info info;
 
-   addr = DCACHE_ALIGN(addr - offset) + offset;
-
-   for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
-   /* At this point:  (!vma || addr < vma->vm_end). */
-   if (TASK_SIZE - len < addr)
-   return -ENOMEM;
-   if (!vma || addr + len <= vma->vm_start)
-   return addr;
-   addr = DCACHE_ALIGN(vma->vm_end - offset) + offset;
-   if (addr < vma->vm_end) /* handle wraparound */
-   return -ENOMEM;
-   }
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = PAGE_ALIGN(addr);
+   info.high_limit = TASK_SIZE;
+   info.align_mask = PAGE_MASK & (SHMLBA - 1);
+   info.align_offset = (get_offset(mapping) + pgoff) << PAGE_SHIFT;
+   return vm_unmapped_area(&info);
 }
 
 unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
-- 
1.7.7.3


[PATCH 0/8] convert remaining archs to use vm_unmapped_area()

2013-01-23 Thread Michel Lespinasse
This is a resend of my "finish the mission" patch series. I need arch
maintainers to approve so I can push this to Andrew's -mm tree.

These patches, which apply on top of v3.8-rc kernels, are to complete the
VMA gap finding code I introduced (following Rik's initial proposal) in
v3.8-rc1.

First 5 patches introduce the use of vm_unmapped_area() to replace brute
force searches on parisc, alpha, frv and ia64 architectures (all relatively
trivial uses of the vm_unmapped_area() infrastructure)
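
As a quick reference, this is the common shape of those conversions,
condensed from the per-arch patches below (the limit and alignment
values here are illustrative; each architecture fills in its own):

struct vm_unmapped_area_info info;

info.flags = 0;                    /* or VM_UNMAPPED_AREA_TOPDOWN */
info.length = len;
info.low_limit = TASK_UNMAPPED_BASE;
info.high_limit = TASK_SIZE;
info.align_mask = 0;               /* e.g. PAGE_MASK & (SHMLBA - 1) */
info.align_offset = 0;
addr = vm_unmapped_area(&info);
/*
 * vm_unmapped_area() returns a page-aligned address on success and a
 * negative errno (such as -ENOMEM) otherwise, hence the
 * !(addr & ~PAGE_MASK) checks in the frv patch.
 */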

Next 2 patches do the same as above for the powerpc architecture. This
change is not as trivial as for the other architectures, because we
need to account for each address space slice potentially having a
different page size.

The last patch removes the free_area_cache, which was used by all the
brute force searches before they got converted to the
vm_unmapped_area() infrastructure.

I did some basic testing on x86 and powerpc; however the first 5 (simpler)
patches for parisc, alpha, frv and ia64 architectures are untested.

Michel Lespinasse (8):
  mm: use vm_unmapped_area() on parisc architecture
  mm: use vm_unmapped_area() on alpha architecture
  mm: use vm_unmapped_area() on frv architecture
  mm: use vm_unmapped_area() on ia64 architecture
  mm: use vm_unmapped_area() in hugetlbfs on ia64 architecture
  mm: remove free_area_cache use in powerpc architecture
  mm: use vm_unmapped_area() on powerpc architecture
  mm: remove free_area_cache

 arch/alpha/kernel/osf_sys.c  |   20 ++--
 arch/arm/mm/mmap.c   |2 -
 arch/arm64/mm/mmap.c |2 -
 arch/frv/mm/elf-fdpic.c  |   49 +++
 arch/ia64/kernel/sys_ia64.c  |   37 ++
 arch/ia64/mm/hugetlbpage.c   |   20 ++--
 arch/mips/mm/mmap.c  |2 -
 arch/parisc/kernel/sys_parisc.c  |   46 +++
 arch/powerpc/include/asm/page_64.h   |3 +-
 arch/powerpc/mm/hugetlbpage.c|2 +-
 arch/powerpc/mm/mmap_64.c|2 -
 arch/powerpc/mm/slice.c  |  228 +-
 arch/powerpc/platforms/cell/spufs/file.c |2 +-
 arch/s390/mm/mmap.c  |4 -
 arch/sparc/kernel/sys_sparc_64.c |2 -
 arch/tile/mm/mmap.c  |2 -
 arch/x86/ia32/ia32_aout.c|2 -
 arch/x86/mm/mmap.c   |2 -
 fs/binfmt_aout.c |2 -
 fs/binfmt_elf.c  |2 -
 include/linux/mm_types.h |3 -
 include/linux/sched.h|2 -
 kernel/fork.c|4 -
 mm/mmap.c|   28 
 mm/nommu.c   |4 -
 mm/util.c|1 -
 26 files changed, 163 insertions(+), 310 deletions(-)

-- 
1.7.7.3


Re: [PATCH 7/8] mm: use vm_unmapped_area() on powerpc architecture

2013-01-09 Thread Michel Lespinasse
On Wed, Jan 09, 2013 at 02:32:56PM +1100, Benjamin Herrenschmidt wrote:
> Ok. I think at least you can move that construct:
> 
> +   if (addr < SLICE_LOW_TOP) {
> +   slice = GET_LOW_SLICE_INDEX(addr);
> +   addr = (slice + 1) << SLICE_LOW_SHIFT;
> +   if (!(available.low_slices & (1u << slice)))
> +   continue;
> +   } else {
> +   slice = GET_HIGH_SLICE_INDEX(addr);
> +   addr = (slice + 1) << SLICE_HIGH_SHIFT;
> +   if (!(available.high_slices & (1u << slice)))
> +   continue;
> +   }
> 
> Into some kind of helper. It will probably compile to the same thing but
> at least it's more readable and it will avoid a fuckup in the future if
> somebody changes the algorithm and forgets to update one of the
> copies :-)

All right, does the following look more palatable then?
(didn't re-test it, though)

Signed-off-by: Michel Lespinasse 

---
 arch/powerpc/mm/slice.c |  123 ++-
 1 files changed, 78 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 999a74f25ebe..3e99c149271a 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -237,36 +237,69 @@ static void slice_convert(struct mm_struct *mm, struct 
slice_mask mask, int psiz
 #endif
 }
 
+/*
+ * Compute which slice addr is part of;
+ * set *boundary_addr to the start or end boundary of that slice
+ * (depending on 'end' parameter);
+ * return boolean indicating if the slice is marked as available in the
+ * 'available' slice_mask.
+ */
+static bool slice_scan_available(unsigned long addr,
+struct slice_mask available,
+int end,
+unsigned long *boundary_addr)
+{
+   unsigned long slice;
+   if (addr < SLICE_LOW_TOP) {
+   slice = GET_LOW_SLICE_INDEX(addr);
+   *boundary_addr = (slice + end) << SLICE_LOW_SHIFT;
+   return !!(available.low_slices & (1u << slice));
+   } else {
+   slice = GET_HIGH_SLICE_INDEX(addr);
+   *boundary_addr = (slice + end) ?
+   ((slice + end) << SLICE_HIGH_SHIFT) : SLICE_LOW_TOP;
+   return !!(available.high_slices & (1u << slice));
+   }
+}
+
 static unsigned long slice_find_area_bottomup(struct mm_struct *mm,
  unsigned long len,
  struct slice_mask available,
  int psize)
 {
-   struct vm_area_struct *vma;
-   unsigned long addr;
-   struct slice_mask mask;
int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT);
+   unsigned long addr, found, next_end;
+   struct vm_unmapped_area_info info;
 
-   addr = TASK_UNMAPPED_BASE;
-
-   for (;;) {
-   addr = _ALIGN_UP(addr, 1ul << pshift);
-   if ((TASK_SIZE - len) < addr)
-   break;
-   vma = find_vma(mm, addr);
-   BUG_ON(vma && (addr >= vma->vm_end));
+   info.flags = 0;
+   info.length = len;
+   info.align_mask = PAGE_MASK & ((1ul << pshift) - 1);
+   info.align_offset = 0;
 
-   mask = slice_range_to_mask(addr, len);
-   if (!slice_check_fit(mask, available)) {
-   if (addr < SLICE_LOW_TOP)
-   addr = _ALIGN_UP(addr + 1,  1ul << 
SLICE_LOW_SHIFT);
-   else
-   addr = _ALIGN_UP(addr + 1,  1ul << 
SLICE_HIGH_SHIFT);
+   addr = TASK_UNMAPPED_BASE;
+   while (addr < TASK_SIZE) {
+   info.low_limit = addr;
+   if (!slice_scan_available(addr, available, 1, &addr))
continue;
+
+ next_slice:
+   /*
+* At this point [info.low_limit; addr) covers
+* available slices only and ends at a slice boundary.
+* Check if we need to reduce the range, or if we can
+* extend it to cover the next available slice.
+*/
+   if (addr >= TASK_SIZE)
+   addr = TASK_SIZE;
+   else if (slice_scan_available(addr, available, 1, &next_end)) {
+   addr = next_end;
+   goto next_slice;
}
-   if (!vma || addr + len <= vma->vm_start)
-   return addr;
-   addr = vma->vm_end;
+   info.high

Re: [PATCH 7/8] mm: use vm_unmapped_area() on powerpc architecture

2013-01-08 Thread Michel Lespinasse
On Tue, Jan 8, 2013 at 6:15 PM, Benjamin Herrenschmidt
 wrote:
> On Tue, 2013-01-08 at 17:28 -0800, Michel Lespinasse wrote:
>> Update the powerpc slice_get_unmapped_area function to make use of
>> vm_unmapped_area() instead of implementing a brute force search.
>>
>> Signed-off-by: Michel Lespinasse 
>>
>> ---
>>  arch/powerpc/mm/slice.c |  128 
>> +-
>>  1 files changed, 81 insertions(+), 47 deletions(-)
>
> That doesn't look good ... the resulting code is longer than the
> original, which makes me wonder how it is an improvement...

Well no fair, the previous patch (for powerpc as well) has 22
insertions and 93 deletions :)

The benefit is that the new code has lower algorithmic complexity: it
replaces an O(N) per-vma loop with an outer loop that finds contiguous
slice blocks and passes them to vm_unmapped_area(), which is only
O(log N). So the new code will be faster for workloads which use lots
of vmas.
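
For the record, the O(log N) comes from the gap search itself: the vma
rbtree is augmented so that each node caches the size of the largest
free gap anywhere in its subtree, which lets the search skip whole
subtrees that cannot fit the request. A simplified, stand-alone sketch
of the idea (not the actual mm/mmap.c code, just the principle):

#include <errno.h>

/*
 * Toy model: each node stands for a vma plus the free gap just below
 * it, ordered by address; subtree_max_gap caches the largest gap
 * anywhere in the node's subtree.
 */
struct gap_node {
        unsigned long gap_start, gap_end;  /* free range below this vma */
        unsigned long subtree_max_gap;     /* largest gap in this subtree */
        struct gap_node *left, *right;
};

/* Lowest-address-first search for a free range of at least len bytes. */
static unsigned long find_lowest_gap(struct gap_node *node, unsigned long len)
{
        while (node) {
                if (node->left && node->left->subtree_max_gap >= len)
                        node = node->left;      /* a fit exists at lower addresses */
                else if (node->gap_end - node->gap_start >= len)
                        return node->gap_start; /* the gap below this vma fits */
                else if (node->right && node->right->subtree_max_gap >= len)
                        node = node->right;     /* only higher addresses can fit */
                else
                        break;
        }
        return -ENOMEM;
}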

That said, I do agree that the code that looks for contiguous
available slices looks kinda ugly - just not sure how to make it look
nicer though.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH 0/8] vm_unmapped_area: finish the mission

2013-01-08 Thread Michel Lespinasse
Whoops, I was supposed to find a more appropriate subject line before
sending this :]

On Tue, Jan 8, 2013 at 5:28 PM, Michel Lespinasse  wrote:
> These patches, which apply on top of v3.8-rc kernels, are to complete the
> VMA gap finding code I introduced (following Rik's initial proposal) in
> v3.8-rc1.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


[PATCH 8/8] mm: remove free_area_cache

2013-01-08 Thread Michel Lespinasse
Since all architectures have been converted to use vm_unmapped_area(),
there is no remaining use for the free_area_cache.

Signed-off-by: Michel Lespinasse 

---
 arch/arm/mm/mmap.c   |2 --
 arch/arm64/mm/mmap.c |2 --
 arch/mips/mm/mmap.c  |2 --
 arch/powerpc/mm/mmap_64.c|2 --
 arch/s390/mm/mmap.c  |4 
 arch/sparc/kernel/sys_sparc_64.c |2 --
 arch/tile/mm/mmap.c  |2 --
 arch/x86/ia32/ia32_aout.c|2 --
 arch/x86/mm/mmap.c   |2 --
 fs/binfmt_aout.c |2 --
 fs/binfmt_elf.c  |2 --
 include/linux/mm_types.h |3 ---
 include/linux/sched.h|2 --
 kernel/fork.c|4 
 mm/mmap.c|   28 
 mm/nommu.c   |4 
 mm/util.c|1 -
 17 files changed, 0 insertions(+), 66 deletions(-)

diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c
index 10062ceadd1c..0c6356255fe3 100644
--- a/arch/arm/mm/mmap.c
+++ b/arch/arm/mm/mmap.c
@@ -181,11 +181,9 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
mm->get_unmapped_area = arch_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base(random_factor);
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
 
diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
index 7c7be7855638..8ed6cb1a900f 100644
--- a/arch/arm64/mm/mmap.c
+++ b/arch/arm64/mm/mmap.c
@@ -90,11 +90,9 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE;
mm->get_unmapped_area = arch_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base();
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
 EXPORT_SYMBOL_GPL(arch_pick_mmap_layout);
diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c
index d9be7540a6be..f4e63c29d044 100644
--- a/arch/mips/mm/mmap.c
+++ b/arch/mips/mm/mmap.c
@@ -158,11 +158,9 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
mm->get_unmapped_area = arch_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base(random_factor);
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
 
diff --git a/arch/powerpc/mm/mmap_64.c b/arch/powerpc/mm/mmap_64.c
index 67a42ed0d2fc..cb8bdbe4972f 100644
--- a/arch/powerpc/mm/mmap_64.c
+++ b/arch/powerpc/mm/mmap_64.c
@@ -92,10 +92,8 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE;
mm->get_unmapped_area = arch_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base();
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
diff --git a/arch/s390/mm/mmap.c b/arch/s390/mm/mmap.c
index c59a5efa58b1..f2a462625c9e 100644
--- a/arch/s390/mm/mmap.c
+++ b/arch/s390/mm/mmap.c
@@ -91,11 +91,9 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE;
mm->get_unmapped_area = arch_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base();
mm->get_unmapped_area = arch_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
 
@@ -173,11 +171,9 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
if (mmap_is_legacy()) {
mm->mmap_base = TASK_UNMAPPED_BASE;
mm->get_unmapped_area = s390_get_unmapped_area;
-   mm->unmap_area = arch_unmap_area;
} else {
mm->mmap_base = mmap_base();
mm->get_unmapped_area = s390_get_unmapped_area_topdown;
-   mm->unmap_area = arch_unmap_area_topdown;
}
 }
 
diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c
index 708bc29d36a8..f3c169f9d3a1 100644
--- a/arch/sparc/kernel/sys_sparc_64.c
+++ b/arch/sparc/kernel/sys_sparc_6

[PATCH 7/8] mm: use vm_unmapped_area() on powerpc architecture

2013-01-08 Thread Michel Lespinasse
Update the powerpc slice_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 

---
 arch/powerpc/mm/slice.c |  128 +-
 1 files changed, 81 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 999a74f25ebe..048346b7eed5 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -242,31 +242,51 @@ static unsigned long slice_find_area_bottomup(struct 
mm_struct *mm,
  struct slice_mask available,
  int psize)
 {
-   struct vm_area_struct *vma;
-   unsigned long addr;
-   struct slice_mask mask;
int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT);
+   unsigned long addr, found, slice;
+   struct vm_unmapped_area_info info;
 
-   addr = TASK_UNMAPPED_BASE;
+   info.flags = 0;
+   info.length = len;
+   info.align_mask = PAGE_MASK & ((1ul << pshift) - 1);
+   info.align_offset = 0;
 
-   for (;;) {
-   addr = _ALIGN_UP(addr, 1ul << pshift);
-   if ((TASK_SIZE - len) < addr)
-   break;
-   vma = find_vma(mm, addr);
-   BUG_ON(vma && (addr >= vma->vm_end));
+   addr = TASK_UNMAPPED_BASE;
+   while (addr < TASK_SIZE) {
+   info.low_limit = addr;
+   if (addr < SLICE_LOW_TOP) {
+   slice = GET_LOW_SLICE_INDEX(addr);
+   addr = (slice + 1) << SLICE_LOW_SHIFT;
+   if (!(available.low_slices & (1u << slice)))
+   continue;
+   } else {
+   slice = GET_HIGH_SLICE_INDEX(addr);
+   addr = (slice + 1) << SLICE_HIGH_SHIFT;
+   if (!(available.high_slices & (1u << slice)))
+   continue;
+   }
 
-   mask = slice_range_to_mask(addr, len);
-   if (!slice_check_fit(mask, available)) {
-   if (addr < SLICE_LOW_TOP)
-   addr = _ALIGN_UP(addr + 1,  1ul << 
SLICE_LOW_SHIFT);
-   else
-   addr = _ALIGN_UP(addr + 1,  1ul << 
SLICE_HIGH_SHIFT);
-   continue;
+ next_slice:
+   if (addr >= TASK_SIZE)
+   addr = TASK_SIZE;
+   else if (addr < SLICE_LOW_TOP) {
+   slice = GET_LOW_SLICE_INDEX(addr);
+   if (available.low_slices & (1u << slice)) {
+   addr = (slice + 1) << SLICE_LOW_SHIFT;
+   goto next_slice;
+   }
+   } else {
+   slice = GET_HIGH_SLICE_INDEX(addr);
+   if (available.high_slices & (1u << slice)) {
+   addr = (slice + 1) << SLICE_HIGH_SHIFT;
+   goto next_slice;
+   }
}
-   if (!vma || addr + len <= vma->vm_start)
-   return addr;
-   addr = vma->vm_end;
+   info.high_limit = addr;
+
+   found = vm_unmapped_area(&info);
+   if (!(found & ~PAGE_MASK))
+   return found;
}
 
return -ENOMEM;
@@ -277,39 +297,53 @@ static unsigned long slice_find_area_topdown(struct 
mm_struct *mm,
 struct slice_mask available,
 int psize)
 {
-   struct vm_area_struct *vma;
-   unsigned long addr;
-   struct slice_mask mask;
int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT);
+   unsigned long addr, found, slice;
+   struct vm_unmapped_area_info info;
 
-   addr = mm->mmap_base;
-   while (addr > len) {
-   /* Go down by chunk size */
-   addr = _ALIGN_DOWN(addr - len, 1ul << pshift);
+   info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+   info.length = len;
+   info.align_mask = PAGE_MASK & ((1ul << pshift) - 1);
+   info.align_offset = 0;
 
-   /* Check for hit with different page size */
-   mask = slice_range_to_mask(addr, len);
-   if (!slice_check_fit(mask, available)) {
-   if (addr < SLICE_LOW_TOP)
-   addr = _ALIGN_DOWN(addr, 1ul << 
SLICE_LOW_SHIFT);
-   else if (addr < (1ul << SLICE_HIGH_SHIFT))
-   addr = SLICE_LOW_TOP;
-   els

[PATCH 6/8] mm: remove free_area_cache use in powerpc architecture

2013-01-08 Thread Michel Lespinasse
As all other architectures have been converted to use vm_unmapped_area(),
we are about to retire the free_area_cache.

This change simply removes the use of that cache in
slice_get_unmapped_area(), which will most certainly have a
performance cost. Next one will convert that function to use the
vm_unmapped_area() infrastructure and regain the performance.

Signed-off-by: Michel Lespinasse 

---
 arch/powerpc/include/asm/page_64.h   |3 +-
 arch/powerpc/mm/hugetlbpage.c|2 +-
 arch/powerpc/mm/slice.c  |  108 +
 arch/powerpc/platforms/cell/spufs/file.c |2 +-
 4 files changed, 22 insertions(+), 93 deletions(-)

diff --git a/arch/powerpc/include/asm/page_64.h 
b/arch/powerpc/include/asm/page_64.h
index cd915d6b093d..88693cef4f3d 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -99,8 +99,7 @@ extern unsigned long slice_get_unmapped_area(unsigned long 
addr,
 unsigned long len,
 unsigned long flags,
 unsigned int psize,
-int topdown,
-int use_cache);
+int topdown);
 
 extern unsigned int get_slice_psize(struct mm_struct *mm,
unsigned long addr);
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 1a6de0a7d8eb..5dc52d803ed8 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -742,7 +742,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,
struct hstate *hstate = hstate_file(file);
int mmu_psize = shift_to_mmu_psize(huge_page_shift(hstate));
 
-   return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1, 0);
+   return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1);
 }
 #endif
 
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index cf9dada734b6..999a74f25ebe 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -240,23 +240,15 @@ static void slice_convert(struct mm_struct *mm, struct 
slice_mask mask, int psiz
 static unsigned long slice_find_area_bottomup(struct mm_struct *mm,
  unsigned long len,
  struct slice_mask available,
- int psize, int use_cache)
+ int psize)
 {
struct vm_area_struct *vma;
-   unsigned long start_addr, addr;
+   unsigned long addr;
struct slice_mask mask;
int pshift = max_t(int, mmu_psize_defs[psize].shift, PAGE_SHIFT);
 
-   if (use_cache) {
-   if (len <= mm->cached_hole_size) {
-   start_addr = addr = TASK_UNMAPPED_BASE;
-   mm->cached_hole_size = 0;
-   } else
-   start_addr = addr = mm->free_area_cache;
-   } else
-   start_addr = addr = TASK_UNMAPPED_BASE;
+   addr = TASK_UNMAPPED_BASE;
 
-full_search:
for (;;) {
addr = _ALIGN_UP(addr, 1ul << pshift);
if ((TASK_SIZE - len) < addr)
@@ -272,63 +264,24 @@ full_search:
addr = _ALIGN_UP(addr + 1,  1ul << 
SLICE_HIGH_SHIFT);
continue;
}
-   if (!vma || addr + len <= vma->vm_start) {
-   /*
-* Remember the place where we stopped the search:
-*/
-   if (use_cache)
-   mm->free_area_cache = addr + len;
+   if (!vma || addr + len <= vma->vm_start)
return addr;
-   }
-   if (use_cache && (addr + mm->cached_hole_size) < vma->vm_start)
-   mm->cached_hole_size = vma->vm_start - addr;
addr = vma->vm_end;
}
 
-   /* Make sure we didn't miss any holes */
-   if (use_cache && start_addr != TASK_UNMAPPED_BASE) {
-   start_addr = addr = TASK_UNMAPPED_BASE;
-   mm->cached_hole_size = 0;
-   goto full_search;
-   }
return -ENOMEM;
 }
 
 static unsigned long slice_find_area_topdown(struct mm_struct *mm,
 unsigned long len,
 struct slice_mask available,
-int psize, int use_cache)
+int psize)
 {
struct vm_area_struct *vma;
unsigned long addr;
struct slice_mask mask;
int pshift = max_t(int, mmu_psize_de

[PATCH 5/8] mm: use vm_unmapped_area() in hugetlbfs on ia64 architecture

2013-01-08 Thread Michel Lespinasse
Update the ia64 hugetlb_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 

---
 arch/ia64/mm/hugetlbpage.c |   20 +---
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index 5ca674b74737..76069c18ee42 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -148,7 +148,7 @@ void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr, 
unsigned long len,
unsigned long pgoff, unsigned long flags)
 {
-   struct vm_area_struct *vmm;
+   struct vm_unmapped_area_info info;
 
if (len > RGN_MAP_LIMIT)
return -ENOMEM;
@@ -165,16 +165,14 @@ unsigned long hugetlb_get_unmapped_area(struct file 
*file, unsigned long addr, u
/* This code assumes that RGN_HPAGE != 0. */
if ((REGION_NUMBER(addr) != RGN_HPAGE) || (addr & (HPAGE_SIZE - 1)))
addr = HPAGE_REGION_BASE;
-   else
-   addr = ALIGN(addr, HPAGE_SIZE);
-   for (vmm = find_vma(current->mm, addr); ; vmm = vmm->vm_next) {
-   /* At this point:  (!vmm || addr < vmm->vm_end). */
-   if (REGION_OFFSET(addr) + len > RGN_MAP_LIMIT)
-   return -ENOMEM;
-   if (!vmm || (addr + len) <= vmm->vm_start)
-   return addr;
-   addr = ALIGN(vmm->vm_end, HPAGE_SIZE);
-   }
+
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = addr;
+   info.high_limit = HPAGE_REGION_BASE + RGN_MAP_LIMIT;
+   info.align_mask = PAGE_MASK & (HPAGE_SIZE - 1);
+   info.align_offset = 0;
+   return vm_unmapped_area(&info);
 }
 
 static int __init hugetlb_setup_sz(char *str)
-- 
1.7.7.3


[PATCH 4/8] mm: use vm_unmapped_area() on ia64 architecture

2013-01-08 Thread Michel Lespinasse
Update the ia64 arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 

---
 arch/ia64/kernel/sys_ia64.c |   37 -
 1 files changed, 12 insertions(+), 25 deletions(-)

diff --git a/arch/ia64/kernel/sys_ia64.c b/arch/ia64/kernel/sys_ia64.c
index d9439ef2f661..41e33f84c185 100644
--- a/arch/ia64/kernel/sys_ia64.c
+++ b/arch/ia64/kernel/sys_ia64.c
@@ -25,9 +25,9 @@ arch_get_unmapped_area (struct file *filp, unsigned long 
addr, unsigned long len
unsigned long pgoff, unsigned long flags)
 {
long map_shared = (flags & MAP_SHARED);
-   unsigned long start_addr, align_mask = PAGE_SIZE - 1;
+   unsigned long align_mask = 0;
struct mm_struct *mm = current->mm;
-   struct vm_area_struct *vma;
+   struct vm_unmapped_area_info info;
 
if (len > RGN_MAP_LIMIT)
return -ENOMEM;
@@ -44,7 +44,7 @@ arch_get_unmapped_area (struct file *filp, unsigned long 
addr, unsigned long len
addr = 0;
 #endif
if (!addr)
-   addr = mm->free_area_cache;
+   addr = TASK_UNMAPPED_BASE;
 
if (map_shared && (TASK_SIZE > 0xul))
/*
@@ -53,28 +53,15 @@ arch_get_unmapped_area (struct file *filp, unsigned long 
addr, unsigned long len
 * tasks, we prefer to avoid exhausting the address space too 
quickly by
 * limiting alignment to a single page.
 */
-   align_mask = SHMLBA - 1;
-
-  full_search:
-   start_addr = addr = (addr + align_mask) & ~align_mask;
-
-   for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
-   /* At this point:  (!vma || addr < vma->vm_end). */
-   if (TASK_SIZE - len < addr || RGN_MAP_LIMIT - len < 
REGION_OFFSET(addr)) {
-   if (start_addr != TASK_UNMAPPED_BASE) {
-   /* Start a new search --- just in case we 
missed some holes.  */
-   addr = TASK_UNMAPPED_BASE;
-   goto full_search;
-   }
-   return -ENOMEM;
-   }
-   if (!vma || addr + len <= vma->vm_start) {
-   /* Remember the address where we stopped this search:  
*/
-   mm->free_area_cache = addr + len;
-   return addr;
-   }
-   addr = (vma->vm_end + align_mask) & ~align_mask;
-   }
+   align_mask = PAGE_MASK & (SHMLBA - 1);
+
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = addr;
+   info.high_limit = TASK_SIZE;
+   info.align_mask = align_mask;
+   info.align_offset = 0;
+   return vm_unmapped_area(&info);
 }
 
 asmlinkage long
-- 
1.7.7.3


[PATCH 3/8] mm: use vm_unmapped_area() on frv architecture

2013-01-08 Thread Michel Lespinasse
Update the frv arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 

---
 arch/frv/mm/elf-fdpic.c |   49 --
 1 files changed, 17 insertions(+), 32 deletions(-)

diff --git a/arch/frv/mm/elf-fdpic.c b/arch/frv/mm/elf-fdpic.c
index 385fd30b142f..836f14707a62 100644
--- a/arch/frv/mm/elf-fdpic.c
+++ b/arch/frv/mm/elf-fdpic.c
@@ -60,7 +60,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, 
unsigned long addr, unsi
 unsigned long pgoff, unsigned long flags)
 {
struct vm_area_struct *vma;
-   unsigned long limit;
+   struct vm_unmapped_area_info info;
 
if (len > TASK_SIZE)
return -ENOMEM;
@@ -79,39 +79,24 @@ unsigned long arch_get_unmapped_area(struct file *filp, 
unsigned long addr, unsi
}
 
/* search between the bottom of user VM and the stack grow area */
-   addr = PAGE_SIZE;
-   limit = (current->mm->start_stack - 0x0020);
-   if (addr + len <= limit) {
-   limit -= len;
-
-   if (addr <= limit) {
-   vma = find_vma(current->mm, PAGE_SIZE);
-   for (; vma; vma = vma->vm_next) {
-   if (addr > limit)
-   break;
-   if (addr + len <= vma->vm_start)
-   goto success;
-   addr = vma->vm_end;
-   }
-   }
-   }
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = PAGE_SIZE;
+   info.high_limit = (current->mm->start_stack - 0x0020);
+   info.align_mask = 0;
+   info.align_offset = 0;
+   addr = vm_unmapped_area(&info);
+   if (!(addr & ~PAGE_MASK))
+   goto success;
+   VM_BUG_ON(addr != -ENOMEM);
 
/* search from just above the WorkRAM area to the top of memory */
-   addr = PAGE_ALIGN(0x8000);
-   limit = TASK_SIZE - len;
-   if (addr <= limit) {
-   vma = find_vma(current->mm, addr);
-   for (; vma; vma = vma->vm_next) {
-   if (addr > limit)
-   break;
-   if (addr + len <= vma->vm_start)
-   goto success;
-   addr = vma->vm_end;
-   }
-
-   if (!vma && addr <= limit)
-   goto success;
-   }
+   info.low_limit = PAGE_ALIGN(0x8000);
+   info.high_limit = TASK_SIZE;
+   addr = vm_unmapped_area(&info);
+   if (!(addr & ~PAGE_MASK))
+   goto success;
+   VM_BUG_ON(addr != -ENOMEM);
 
 #if 0
printk("[area] l=%lx (ENOMEM) f='%s'\n",
-- 
1.7.7.3


[PATCH 2/8] mm: use vm_unmapped_area() on alpha architecture

2013-01-08 Thread Michel Lespinasse
Update the alpha arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 

---
 arch/alpha/kernel/osf_sys.c |   20 +---
 1 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 14db93e4c8a8..ba707e23ef37 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -1298,17 +1298,15 @@ static unsigned long
 arch_get_unmapped_area_1(unsigned long addr, unsigned long len,
 unsigned long limit)
 {
-   struct vm_area_struct *vma = find_vma(current->mm, addr);
-
-   while (1) {
-   /* At this point:  (!vma || addr < vma->vm_end). */
-   if (limit - len < addr)
-   return -ENOMEM;
-   if (!vma || addr + len <= vma->vm_start)
-   return addr;
-   addr = vma->vm_end;
-   vma = vma->vm_next;
-   }
+   struct vm_unmapped_area_info info;
+
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = addr;
+   info.high_limit = limit;
+   info.align_mask = 0;
+   info.align_offset = 0;
+   return vm_unmapped_area(&info);
 }
 
 unsigned long
-- 
1.7.7.3


[PATCH 0/8] vm_unmapped_area: finish the mission

2013-01-08 Thread Michel Lespinasse
These patches, which apply on top of v3.8-rc kernels, are to complete the
VMA gap finding code I introduced (following Rik's initial proposal) in
v3.8-rc1.

First 5 patches introduce the use of vm_unmapped_area() to replace brute
force searches on parisc, alpha, frv and ia64 architectures (all relatively
trivial uses of the vm_unmapped_area() infrastructure)

Next 2 patches do the same as above for the powerpc architecture. This
change is not as trivial as for the other architectures, because we
need to account for each address space slice potentially having a
different page size.

The last patch removes the free_area_cache, which was used by all the
brute force searches before they got converted to the
vm_unmapped_area() infrastructure.

I did some basic testing on x86 and powerpc; however the first 5 (simpler)
patches for parisc, alpha, frv and ia64 architectures are untested.

Michel Lespinasse (8):
  mm: use vm_unmapped_area() on parisc architecture
  mm: use vm_unmapped_area() on alpha architecture
  mm: use vm_unmapped_area() on frv architecture
  mm: use vm_unmapped_area() on ia64 architecture
  mm: use vm_unmapped_area() in hugetlbfs on ia64 architecture
  mm: remove free_area_cache use in powerpc architecture
  mm: use vm_unmapped_area() on powerpc architecture
  mm: remove free_area_cache

 arch/alpha/kernel/osf_sys.c  |   20 ++--
 arch/arm/mm/mmap.c   |2 -
 arch/arm64/mm/mmap.c |2 -
 arch/frv/mm/elf-fdpic.c  |   49 +++
 arch/ia64/kernel/sys_ia64.c  |   37 ++
 arch/ia64/mm/hugetlbpage.c   |   20 ++--
 arch/mips/mm/mmap.c  |2 -
 arch/parisc/kernel/sys_parisc.c  |   46 +++
 arch/powerpc/include/asm/page_64.h   |3 +-
 arch/powerpc/mm/hugetlbpage.c|2 +-
 arch/powerpc/mm/mmap_64.c|2 -
 arch/powerpc/mm/slice.c  |  228 +-
 arch/powerpc/platforms/cell/spufs/file.c |2 +-
 arch/s390/mm/mmap.c  |4 -
 arch/sparc/kernel/sys_sparc_64.c |2 -
 arch/tile/mm/mmap.c  |2 -
 arch/x86/ia32/ia32_aout.c|2 -
 arch/x86/mm/mmap.c   |2 -
 fs/binfmt_aout.c |2 -
 fs/binfmt_elf.c  |2 -
 include/linux/mm_types.h |3 -
 include/linux/sched.h|2 -
 kernel/fork.c|4 -
 mm/mmap.c|   28 
 mm/nommu.c   |4 -
 mm/util.c|1 -
 26 files changed, 163 insertions(+), 310 deletions(-)

-- 
1.7.7.3


[PATCH 1/8] mm: use vm_unmapped_area() on parisc architecture

2013-01-08 Thread Michel Lespinasse
Update the parisc arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse 

---
 arch/parisc/kernel/sys_parisc.c |   46 ++
 1 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/arch/parisc/kernel/sys_parisc.c b/arch/parisc/kernel/sys_parisc.c
index f76c10863c62..6ab138088076 100644
--- a/arch/parisc/kernel/sys_parisc.c
+++ b/arch/parisc/kernel/sys_parisc.c
@@ -35,18 +35,15 @@
 
 static unsigned long get_unshared_area(unsigned long addr, unsigned long len)
 {
-   struct vm_area_struct *vma;
+   struct vm_unmapped_area_info info;
 
-   addr = PAGE_ALIGN(addr);
-
-   for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
-   /* At this point:  (!vma || addr < vma->vm_end). */
-   if (TASK_SIZE - len < addr)
-   return -ENOMEM;
-   if (!vma || addr + len <= vma->vm_start)
-   return addr;
-   addr = vma->vm_end;
-   }
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = PAGE_ALIGN(addr);
+   info.high_limit = TASK_SIZE;
+   info.align_mask = 0;
+   info.align_offset = 0;
+   return vm_unmapped_area(&info);
 }
 
 #define DCACHE_ALIGN(addr) (((addr) + (SHMLBA - 1)) &~ (SHMLBA - 1))
@@ -63,30 +60,21 @@ static unsigned long get_unshared_area(unsigned long addr, 
unsigned long len)
  */
 static int get_offset(struct address_space *mapping)
 {
-   int offset = (unsigned long) mapping << (PAGE_SHIFT - 8);
-   return offset & 0x3FF000;
+   return (unsigned long) mapping >> 8;
 }
 
 static unsigned long get_shared_area(struct address_space *mapping,
unsigned long addr, unsigned long len, unsigned long pgoff)
 {
-   struct vm_area_struct *vma;
-   int offset = mapping ? get_offset(mapping) : 0;
-
-   offset = (offset + (pgoff << PAGE_SHIFT)) & 0x3FF000;
+   struct vm_unmapped_area_info info;
 
-   addr = DCACHE_ALIGN(addr - offset) + offset;
-
-   for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
-   /* At this point:  (!vma || addr < vma->vm_end). */
-   if (TASK_SIZE - len < addr)
-   return -ENOMEM;
-   if (!vma || addr + len <= vma->vm_start)
-   return addr;
-   addr = DCACHE_ALIGN(vma->vm_end - offset) + offset;
-   if (addr < vma->vm_end) /* handle wraparound */
-   return -ENOMEM;
-   }
+   info.flags = 0;
+   info.length = len;
+   info.low_limit = PAGE_ALIGN(addr);
+   info.high_limit = TASK_SIZE;
+   info.align_mask = PAGE_MASK & (SHMLBA - 1);
+   info.align_offset = (get_offset(mapping) + pgoff) << PAGE_SHIFT;
+   return vm_unmapped_area(&info);
 }
 
 unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
-- 
1.7.7.3


Re: [PATCH] mm: Check we have the right vma in access_process_vm()

2011-04-04 Thread Michel Lespinasse
On Mon, Apr 4, 2011 at 11:24 PM, Michael Ellerman
 wrote:
> In access_process_vm() we need to check that we have found the right
> vma, not the following vma, before we try to access it. Otherwise
> we might call the vma's access routine with an address which does
> not fall inside the vma.
>
> Signed-off-by: Michael Ellerman 

Please note that the code has moved into __access_remote_vm() in the
current Linus tree. Also, should len be truncated before calling
vma->vm_ops->access(), so that we can guarantee it won't overflow past
the end of the vma?
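
Something along these lines is what I have in mind (a rough, untested
sketch against the current __access_remote_vm() loop, reusing the
locals that are already there):

        if (vma->vm_ops && vma->vm_ops->access) {
                /* clamp so the handler cannot run past the end of the vma */
                int bytes = min_t(int, len, vma->vm_end - addr);

                ret = vma->vm_ops->access(vma, addr, buf, bytes, write);
        }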

> diff --git a/mm/memory.c b/mm/memory.c
> index 5823698..7e6f17b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3619,7 +3619,7 @@ int access_process_vm(struct task_struct *tsk, unsigned 
> long addr, void *buf, in
>                         */
>  #ifdef CONFIG_HAVE_IOREMAP_PROT
>                        vma = find_vma(mm, addr);
> -                       if (!vma)
> +                       if (!vma || vma->vm_start > addr)
>                                break;
>                        if (vma->vm_ops && vma->vm_ops->access)
>                                ret = vma->vm_ops->access(vma, addr, buf,
> --
> 1.7.1

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


Re: [PATCH 0/5] make *_gate_vma accept mm_struct instead of task_struct

2011-03-09 Thread Michel Lespinasse
On Tue, Mar 8, 2011 at 4:31 PM, Stephen Wilson  wrote:
> Morally, the question of whether an address lies in a gate vma should be asked
> with respect to an mm, not a particular task.
>
> Practically, dropping the dependency on task_struct will help make current and
> future operations on mm's more flexible and convenient.  In particular, it
> allows some code paths to avoid the need to hold task_lock.

Reviewed-by: Michel Lespinasse 

May I suggest ia32_compat instead of just compat for the flag name?
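
To restate the shape of the interface change being reviewed (prototypes
approximate, going by the series description):

/*
 * Before (task-based):
 *   struct vm_area_struct *get_gate_vma(struct task_struct *tsk);
 *   int in_gate_area(struct task_struct *task, unsigned long addr);
 */

/* After (mm-based): callers no longer need a task, or the task_lock. */
struct mm_struct;
struct vm_area_struct;

struct vm_area_struct *get_gate_vma(struct mm_struct *mm);
int in_gate_area(struct mm_struct *mm, unsigned long addr);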

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.