Re: [git pull] Please pull powerpc.git merge branch (updated)
Ben,

Poke. :)

- k

On Aug 10, 2012, at 8:07 AM, Kumar Gala wrote:

> Ben,
>
> Two updates from last week (one dts bug fix, one minor defconfig update)
>
> - k
>
> The following changes since commit 0d7614f09c1ebdbaa1599a5aba7593f147bf96ee:
>
>   Linux 3.6-rc1 (2012-08-02 16:38:10 -0700)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/galak/powerpc.git merge
>
> for you to fetch changes up to 09a3017a585eb8567a7de15b426bb1dfb548bf0f:
>
>   powerpc/p4080ds: dts - add usb controller version info and port0 (2012-08-10 07:47:02 -0500)
>
> Jia Hongtao (1):
>       powerpc/fsl-pci: Only scan PCI bus if configured as a host
>
> Shengzhou Liu (1):
>       powerpc/p4080ds: dts - add usb controller version info and port0
>
> Zhao Chenhui (1):
>       powerpc/85xx: mpc85xx_defconfig - add VIA PATA support for MPC85xxCDS
>
>  arch/powerpc/boot/dts/fsl/p4080si-post.dtsi |  7 +++
>  arch/powerpc/configs/mpc85xx_defconfig      |  1 +
>  arch/powerpc/sysdev/fsl_pci.c               | 13 -
>  3 files changed, 16 insertions(+), 5 deletions(-)

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc
On Thu, Aug 16, 2012 at 05:21:12PM +0200, Oleg Nesterov wrote:
...
> > So, the arch agnostic code itself
> > takes care of this case...
>
> Yes. I forgot about install_breakpoint()->is_swbp_insn() check which
> returns -ENOTSUPP, somehow I thought arch_uprobe_analyze_insn() does
> this.
>
> > or am I missing something?
>
> No, it is me.
>
> > However, I see that we need a powerpc specific is_swbp_insn()
> > implementation since we will have to take care of all the trap variants.
>
> Hmm, I am not sure. is_swbp_insn(insn), as it is used in the arch agnostic
> code, should only return true if insn == UPROBE_SWBP_INSN (just in case,
> this logic needs more fixes but this is offtopic).

I think it does...

> If powerpc has another insn(s) which can trigger powerpc's do_int3()
> counterpart, they should be rejected by arch_uprobe_analyze_insn().
> I think.

The insn that gets passed to arch_uprobe_analyze_insn() is copy_insn()'s
version, which is the file copy of the instruction. We should also take
care of the in-memory copy, in case gdb had inserted a breakpoint at the
same location, right? Updating is_swbp_insn() per-arch where needed will
take care of both the cases, 'cos it gets called before
arch_analyze_uprobe_insn() too.

> > I will need to update the patches based on changes being made by Oleg
> > and Sebastien for the single-step issues.
>
> Perhaps you can do this in a separate change?
>
> We need some (simple) changes in the arch agnostic code first, they
> should not break powerpc. These changes are still under discussion.
> Once we have "__weak arch_uprobe_step*" you can reimplement these
> hooks and fix the problems with single-stepping.

OK. Agreed.

Ananth
Re: powerpc/perf: hw breakpoints return ENOSPC
On Thu, 2012-08-16 at 16:15 +0200, Peter Zijlstra wrote:
> On Fri, 2012-08-17 at 00:02 +1000, Michael Ellerman wrote:
> > You do want to guarantee that the task will always be subject to the
> > breakpoint, even if it moves cpus. So is there any way to guarantee that
> > other than reserving a breakpoint slot on every cpu ahead of time?
>
> That's not how regular perf works.. regular perf can overload hw
> resources at will and stuff is strictly per-cpu.
..
> For regular (!pinned) events, we'll RR the created events on the
> available hardware resources.

Yeah I know, but that isn't really the semantics you want for a
breakpoint. You don't want to sometimes have the breakpoint active and
sometimes not, it needs to be active at all times when the task is
running.

At the very least you want it to behave like a pinned event, ie. if it
can't be scheduled you get notified and can tell the user.

> HWBP does things completely different and reserves a slot over all CPUs
> for everything, thus stuff completely falls apart.

So it would seem :)

I guess my point was that reserving a slot on each cpu seems like a
reasonable way of guaranteeing that wherever the task goes we will be
able to install the breakpoint. But obviously we need some way to make
it play nice with perf.

cheers
Re: powerpc/perf: hw breakpoints return ENOSPC
> > > > On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
> > > > despite there being no breakpoint on this CPU. This is because the call
> > > > to task_bp_pinned checks all CPUs, rather than just the current CPU.
> > > > POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
> > > > return ENOSPC.
> > >
> > > I think this comes from the ptrace legacy, we register a breakpoint on
> > > all cpus because when we migrate a task it cannot fail to migrate the
> > > breakpoint.
> > >
> > > Its one of the things I hate most about the hwbp stuff as it relates to
> > > perf.
> > >
> > > Frederic knows more...
> >
> > Maybe I should wait for Frederic to respond but I'm not sure I
> > understand what you're saying.
> >
> > I can see how using ptrace hw breakpoints and perf hw breakpoints at the
> > same time could be a problem, but I'm not sure how this would stop it.
>
> ptrace uses perf for hwbp support so we're stuck with all kinds of
> stupid ptrace constraints.. or somesuch.

OK

> > Are you saying that we need to keep at least 1 slot free at all times,
> > so that we can use it for ptrace?
>
> No, I'm saying perf-hwbp is weird because of ptrace, maybe the ptrace
> weirdness shouldn't live in perf-hwbp but in the ptrace-perf glue
> however..

OK.

> > Is "perf record -e mem:0x1000 true" ever going to be able to work on
> > POWER7 with only one hw breakpoint resource per CPU?
>
> I think it should work... but I'm fairly sure it currently doesn't
> because of how things are done. 'perf record -ie mem:0x100... true'
> might just work.

Adding -i doesn't help.

Mikey
Re: therm_pm72 units, interface
> If you have more things to print/offer via sysfs, I'm all for it.
>
> The XsG5 really has (by looking into the casing): 1 PCI Fan,
> 6 center fans, 1 PSU intake and 1 PSU outblow fan (this last one
> seems rather slow-turning, but maybe that's normal).
> It is not quite clear which is which in the sysfs display.

The cpu intake & exhaust are the same, they are handled by groups of 3,
ie. cpu0_* is the 3 fans on CPU 0, cpu1_* is the 3 fans on CPU 1.

Backside fan is supposed to blow on the U3 chip, I don't remember where
it's located, and slots fan is the PCI one afaik. The PSU's own fan
isn't under our direct control.

> What I did figure out: at the PROM, fans run at what seems
> to be full speed (some 8000-9000 rpm?). Once Linux and therm_pm72
> are loaded, the fans settle down towards 4000 rpm, and if the machine
> has warmed up, that is then when it powers off. (The kernel is indeed
> 3.4. I now need to figure out how to place a new kernel on it without
> it powering off inbetween.)

You can try netbooting... OF netboot is limited to 4M sized zImages,
which can be a bit tough nowadays, but modern yaboot can netboot larger
files. Another option is USB sticks.

> >> $ cd /sys/devices/temperature; grep '' *
> >> backside_fan_pwm:32
> >> backside_temperature:54.000
> >> cpu0_current:34.423
> >> cpu0_exhaust_fan_rpm:5340
> >> cpu0_intake_fan_rpm:5340
> >> cpu0_temperature:72.889
> >> cpu0_voltage:1.252
> >> cpu1_current:34.179
> >> cpu1_exhaust_fan_rpm:4584
> >> cpu1_intake_fan_rpm:4584
> >> cpu1_temperature:68.526
> >> cpu1_voltage:1.259
> >> dimms_temperature:53.000
> >> grep: driver: Er en filkatalog
> >> modalias:platform:temperature
> >> grep: power: Er en filkatalog
> >> slots_fan_pwm:20
> >> slots_temperature:38.500
> >> grep: subsystem: Er en filkatalog
> >> uevent:DRIVER=temperature
> >> uevent:OF_NAME=fan
> >> uevent:OF_FULLNAME=/u3@0,f800/i2c@f8001000/fan@15e
> >> uevent:OF_TYPE=fcu
> >> uevent:OF_COMPATIBLE_0=fcu
> >> uevent:OF_COMPATIBLE_N=1
> >> uevent:MODALIAS=of:NfanTfcuCfcu

Cheers,
Ben.
Re: [PATCH] scsi/ibmvscsi: /sys/class/scsi_host/hostX/config doesn't show any information
On Sun, Jul 29, 2012 at 8:33 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2012-07-18 at 18:49 +0200, o...@aepfle.de wrote:
>> From: Linda Xie
>>
>> Expected result:
>> It should show something like this:
>> x1521p4:~ # cat /sys/class/scsi_host/host1/config
>> PARTITIONNAME='x1521p4'
>> NWSDNAME='X1521P4'
>> HOSTNAME='X1521P4'
>> DOMAINNAME='RCHLAND.IBM.COM'
>> NAMESERVERS='9.10.244.100 9.10.244.200'
>>
>> Actual result:
>> x1521p4:~ # cat /sys/class/scsi_host/host0/config
>> x1521p4:~ #
>>
>> This patch changes the size of the buffer used for transferring config
>> data to 4K. It was tested against the 2.6.19-rc2 tree.
>>
>> Reported by IBM during SLES11 beta testing:
>
> So this patch just seems to blindly replace all occurrences of PAGE_SIZE
> with HOST_PAGE_SIZE, which is utterly wrong. Only one of those needs to
> be changed, the one passed to ibmvscsi_do_host_config() which is what's
> visible to the server; all the rest is just sysfs attributes and should
> remain as-is.
>
> Additionally (not even mentioning that there is no explanation as to
> what the real problem is anywhere in the changeset) I don't like the
> fix. The root of the problem is that the MAD header has a 16-bit length
> field, so writing 0x10000 (64K PAGE_SIZE) into it doesn't quite work.
>
> So in addition to a better comment, I would suggest a fix more like
> this:
>
> scsi/ibmvscsi: Fix host config length field overflow
>
> The length field in the host config packet is only 16-bit long, so
> passing it 0x10000 (64K which is our standard PAGE_SIZE) doesn't
> work and results in an empty config from the server.
>
> Signed-off-by: Benjamin Herrenschmidt
> CC:

Acked-by: Robert Jennings

Tested with an IBM i host and confirmed the fix.

> ---
>
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
> index 3a6c474..337e8b3 100644
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -1541,6 +1541,9 @@ static int ibmvscsi_do_host_config(struct ibmvscsi_host_data *hostdata,
>
>  	host_config = &evt_struct->iu.mad.host_config;
>
> +	/* The transport length field is only 16-bit */
> +	length = min(0xffff, length);
> +
>  	/* Set up a lun reset SRP command */
>  	memset(host_config, 0x00, sizeof(*host_config));
>  	host_config->common.type = VIOSRP_HOST_CONFIG_TYPE;
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address
On Thu, Aug 16, 2012 at 09:37:25PM +0300, Kirill A. Shutemov wrote:
> On Thu, Aug 16, 2012 at 08:29:44PM +0200, Andrea Arcangeli wrote:
> > On Thu, Aug 16, 2012 at 07:43:56PM +0300, Kirill A. Shutemov wrote:
> > > Hm.. I think with static_key we can avoid cache overhead here. I'll try.
> >
> > Could you elaborate on the static_key? Is it some sort of self
> > modifying code?
>
> Runtime code patching. See Documentation/static-keys.txt. We can patch it
> on sysctl.

I guessed it had to be patching the code, thanks for the pointer.
It looks a perfect fit for this one, agreed.
Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address
On Thu, Aug 16, 2012 at 08:29:44PM +0200, Andrea Arcangeli wrote:
> On Thu, Aug 16, 2012 at 07:43:56PM +0300, Kirill A. Shutemov wrote:
> > Hm.. I think with static_key we can avoid cache overhead here. I'll try.
>
> Could you elaborate on the static_key? Is it some sort of self
> modifying code?

Runtime code patching. See Documentation/static-keys.txt. We can patch it
on sysctl.

> > Thanks, for review. Could you take a look at huge zero page patchset? ;)
>
> I've noticed that too, nice :). I'm checking some detail on the
> wrprotect fault behavior but I'll comment there.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org

-- 
Kirill A. Shutemov
Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address
On Thu, Aug 16, 2012 at 07:43:56PM +0300, Kirill A. Shutemov wrote:
> Hm.. I think with static_key we can avoid cache overhead here. I'll try.

Could you elaborate on the static_key? Is it some sort of self
modifying code?

> Thanks, for review. Could you take a look at huge zero page patchset? ;)

I've noticed that too, nice :). I'm checking some detail on the
wrprotect fault behavior but I'll comment there.
Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address
On Thu, Aug 16, 2012 at 06:16:47PM +0200, Andrea Arcangeli wrote:
> Hi Kirill,
>
> On Thu, Aug 16, 2012 at 06:15:53PM +0300, Kirill A. Shutemov wrote:
> > 	for (i = 0; i < pages_per_huge_page;
> > 	     i++, p = mem_map_next(p, page, i)) {
>
> It may be more optimal to avoid a multiplication/shiftleft before the
> add, and to do:
>
> 	for (i = 0, vaddr = haddr; i < pages_per_huge_page;
> 	     i++, p = mem_map_next(p, page, i), vaddr += PAGE_SIZE) {
>
> > 		cond_resched();
> > -		clear_user_highpage(p, addr + i * PAGE_SIZE);
> > +		vaddr = haddr + i*PAGE_SIZE;
>
> Not sure if gcc can optimize it away because of the external calls.
>
> > +		if (!ARCH_HAS_USER_NOCACHE || i == target)
> > +			clear_user_highpage(page + i, vaddr);
> > +		else
> > +			clear_user_highpage_nocache(page + i, vaddr);
> > 	}
>
> My only worry overall is if there can be some workload where this may
> actually slow down userland if the CPU cache is very large and
> userland would access most of the faulted in memory after the first
> fault.
>
> So I wouldn't mind to add one more check in addition of
> !ARCH_HAS_USER_NOCACHE above to check a runtime sysctl variable. It'll
> waste a cacheline yes but I doubt it's measurable compared to the time
> it takes to do a >=2M hugepage copy.

Hm.. I think with static_key we can avoid cache overhead here. I'll try.

> Furthermore it would allow people to benchmark its effect without
> having to rebuild the kernel themself.
>
> All other patches looks fine to me.

Thanks, for review. Could you take a look at huge zero page patchset? ;)

-- 
Kirill A. Shutemov
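For readers unfamiliar with the mechanism being proposed: a kernel-style pseudocode sketch (untested and illustrative only; the helper name and the sysctl wiring are assumptions, not the posted patch) of how a static key could replace the sysctl variable load. The disabled branch compiles to a patchable no-op, so the default fast path never touches the key's cacheline:

```
/* Kernel-style sketch, not a buildable patch. */
#include <linux/jump_label.h>

static struct static_key clear_nocache_enabled = STATIC_KEY_INIT_FALSE;

static void clear_subpage(struct page *p, unsigned long vaddr, int is_target)
{
	/* static_key_false() is a patched jump, not a memory load, so
	 * the disabled case costs no data-cache traffic. */
	if (!is_target && static_key_false(&clear_nocache_enabled))
		clear_user_highpage_nocache(p, vaddr);
	else
		clear_user_highpage(p, vaddr);
}

/* A sysctl handler would then flip the key with:
 *	static_key_slow_inc(&clear_nocache_enabled);
 *	static_key_slow_dec(&clear_nocache_enabled);
 */
```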
Re: [PATCH v3 0/7] mv643xx.c: Add basic device tree support.
Ping :)

Can we get some consensus on the right approach here? I'm loath to code
this if it's going to be rejected. I'd prefer the driver to be properly
split so we don't have the MDIO driver mapping the ethernet driver's
address spaces, but if that's not going to be merged, I don't feel like
doing the work for nothing. If the driver is to use the
overlapping-address mapped-by-the-mdio scheme, then so be it, but I
could do with knowing.

Another point against the latter scheme is that the MDIO driver could
sensibly be used (the block is identical) on the Armada XP, which has
4 ethernet blocks rather than two, yet grouped in two pairs with a
discontiguous address range.

I'd like to get this moved along as soon as possible though.

-Ian
Re: [PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address
Hi Kirill,

On Thu, Aug 16, 2012 at 06:15:53PM +0300, Kirill A. Shutemov wrote:
> 	for (i = 0; i < pages_per_huge_page;
> 	     i++, p = mem_map_next(p, page, i)) {

It may be more optimal to avoid a multiplication/shiftleft before the
add, and to do:

	for (i = 0, vaddr = haddr; i < pages_per_huge_page;
	     i++, p = mem_map_next(p, page, i), vaddr += PAGE_SIZE) {

> 		cond_resched();
> -		clear_user_highpage(p, addr + i * PAGE_SIZE);
> +		vaddr = haddr + i*PAGE_SIZE;

Not sure if gcc can optimize it away because of the external calls.

> +		if (!ARCH_HAS_USER_NOCACHE || i == target)
> +			clear_user_highpage(page + i, vaddr);
> +		else
> +			clear_user_highpage_nocache(page + i, vaddr);
> 	}

My only worry overall is if there can be some workload where this may
actually slow down userland if the CPU cache is very large and
userland would access most of the faulted in memory after the first
fault.

So I wouldn't mind to add one more check in addition of
!ARCH_HAS_USER_NOCACHE above to check a runtime sysctl variable. It'll
waste a cacheline yes but I doubt it's measurable compared to the time
it takes to do a >=2M hugepage copy.

Furthermore it would allow people to benchmark its effect without
having to rebuild the kernel themself.

All other patches looks fine to me.

Thanks!
Andrea
Re: [PATCH v3 2/2] powerpc: Uprobes port to powerpc
On 08/16, Ananth N Mavinakayanahalli wrote:
>
> On Thu, Aug 16, 2012 at 07:41:53AM +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2012-08-15 at 18:59 +0200, Oleg Nesterov wrote:
> > > On 07/26, Ananth N Mavinakayanahalli wrote:
> > > >
> > > > From: Ananth N Mavinakayanahalli
> > > >
> > > > This is the port of uprobes to powerpc. Usage is similar to x86.
> > >
> > > I am just curious why this series was ignored by powerpc maintainers...
> >
> > Because it arrived too late for the previous merge window considering my
> > limited bandwidth for reviewing things and that nobody else seems to
> > have reviewed it :-)
> >
> > It's still on track for the next one, and I'm hoping to dedicate most of
> > next week going through patches & doing a powerpc -next.
>
> Thanks Ben!

Great!

> > > Just one question... Shouldn't arch_uprobe_pre_xol() forbid to probe
> > > UPROBE_SWBP_INSN (at least) ?
> > >
> > > (I assume that emulate_step() can't handle this case but of course I
> > > do not understand arch/powerpc/lib/sstep.c)
> > >
> > > Note that uprobe_pre_sstep_notifier() sets utask->state = UTASK_BP_HIT
> > > without any checks. This doesn't look right if it was UTASK_SSTEP...
> > >
> > > But again, I do not know what powepc will actually do if we try to
> > > single-step over UPROBE_SWBP_INSN.
> >
> > Ananth ?
>
> set_swbp() will return -EEXIST to install_breakpoint if we are trying to
> put a breakpoint on UPROBE_SWBP_INSN.

not really, this -EEXIST (already removed by recent changes) means that
bp was already installed. But this doesn't matter,

> So, the arch agnostic code itself
> takes care of this case...

Yes. I forgot about install_breakpoint()->is_swbp_insn() check which
returns -ENOTSUPP, somehow I thought arch_uprobe_analyze_insn() does
this.

> or am I missing something?

No, it is me.

> However, I see that we need a powerpc specific is_swbp_insn()
> implementation since we will have to take care of all the trap variants.

Hmm, I am not sure. is_swbp_insn(insn), as it is used in the arch agnostic
code, should only return true if insn == UPROBE_SWBP_INSN (just in case,
this logic needs more fixes but this is offtopic).

If powerpc has another insn(s) which can trigger powerpc's do_int3()
counterpart, they should be rejected by arch_uprobe_analyze_insn().
I think.

> I will need to update the patches based on changes being made by Oleg
> and Sebastien for the single-step issues.

Perhaps you can do this in a separate change?

We need some (simple) changes in the arch agnostic code first, they
should not break powerpc. These changes are still under discussion.
Once we have "__weak arch_uprobe_step*" you can reimplement these
hooks and fix the problems with single-stepping.

Oleg.
Re: therm_pm72 units, interface
On Wednesday 2012-08-15 23:35, Benjamin Herrenschmidt wrote:

>> XServe G5 of mine started powering off more or less
>> randomly
>
> BTW. There's a new windfarm driver for these in recent kernels...
>
> Appart from that, the trip points are coming from a calibration EEPROM,
> you may want to tweak the driver to warn a bit earlier or that sort of
> things ? (Or just to print more things out ?)

If you have more things to print/offer via sysfs, I'm all for it.

The XsG5 really has (by looking into the casing): 1 PCI Fan,
6 center fans, 1 PSU intake and 1 PSU outblow fan (this last one
seems rather slow-turning, but maybe that's normal).
It is not quite clear which is which in the sysfs display.

What I did figure out: at the PROM, fans run at what seems
to be full speed (some 8000-9000 rpm?). Once Linux and therm_pm72
are loaded, the fans settle down towards 4000 rpm, and if the machine
has warmed up, that is then when it powers off. (The kernel is indeed
3.4. I now need to figure out how to place a new kernel on it without
it powering off inbetween.)

>> $ cd /sys/devices/temperature; grep '' *
>> backside_fan_pwm:32
>> backside_temperature:54.000
>> cpu0_current:34.423
>> cpu0_exhaust_fan_rpm:5340
>> cpu0_intake_fan_rpm:5340
>> cpu0_temperature:72.889
>> cpu0_voltage:1.252
>> cpu1_current:34.179
>> cpu1_exhaust_fan_rpm:4584
>> cpu1_intake_fan_rpm:4584
>> cpu1_temperature:68.526
>> cpu1_voltage:1.259
>> dimms_temperature:53.000
>> grep: driver: Er en filkatalog
>> modalias:platform:temperature
>> grep: power: Er en filkatalog
>> slots_fan_pwm:20
>> slots_temperature:38.500
>> grep: subsystem: Er en filkatalog
>> uevent:DRIVER=temperature
>> uevent:OF_NAME=fan
>> uevent:OF_FULLNAME=/u3@0,f800/i2c@f8001000/fan@15e
>> uevent:OF_TYPE=fcu
>> uevent:OF_COMPATIBLE_0=fcu
>> uevent:OF_COMPATIBLE_N=1
>> uevent:MODALIAS=of:NfanTfcuCfcu
[PATCH v3 6/7] mm: make clear_huge_page cache clear only around the fault address
From: Andi Kleen

Clearing a 2MB huge page will typically blow away several levels of CPU
caches. To avoid this only cache clear the 4K area around the fault
address and use a cache avoiding clears for the rest of the 2MB area.

Signed-off-by: Andi Kleen
Signed-off-by: Kirill A. Shutemov
---
 mm/memory.c | 34 +-
 1 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index dfc179b..d4626b9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3969,18 +3969,34 @@ EXPORT_SYMBOL(might_fault);
 #endif
 
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
+
+#ifndef ARCH_HAS_USER_NOCACHE
+#define ARCH_HAS_USER_NOCACHE 0
+#endif
+
+#if ARCH_HAS_USER_NOCACHE == 0
+#define clear_user_highpage_nocache clear_user_highpage
+#endif
+
 static void clear_gigantic_page(struct page *page,
-				unsigned long addr,
-				unsigned int pages_per_huge_page)
+				unsigned long haddr, unsigned long fault_address,
+				unsigned int pages_per_huge_page)
 {
 	int i;
 	struct page *p = page;
+	unsigned long vaddr;
+	int target = (fault_address - haddr) >> PAGE_SHIFT;
 
 	might_sleep();
+	vaddr = haddr;
 	for (i = 0; i < pages_per_huge_page;
 	     i++, p = mem_map_next(p, page, i)) {
 		cond_resched();
-		clear_user_highpage(p, addr + i * PAGE_SIZE);
+		vaddr = haddr + i*PAGE_SIZE;
+		if (!ARCH_HAS_USER_NOCACHE || i == target)
+			clear_user_highpage(p, vaddr);
+		else
+			clear_user_highpage_nocache(p, vaddr);
 	}
 }
 
 void clear_huge_page(struct page *page,
@@ -3988,16 +4004,24 @@ void clear_huge_page(struct page *page,
 		     unsigned int pages_per_huge_page)
 {
 	int i;
+	unsigned long vaddr;
+	int target = (fault_address - haddr) >> PAGE_SHIFT;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, haddr, pages_per_huge_page);
+		clear_gigantic_page(page, haddr, fault_address,
+				pages_per_huge_page);
 		return;
 	}
 
 	might_sleep();
+	vaddr = haddr;
 	for (i = 0; i < pages_per_huge_page; i++) {
 		cond_resched();
-		clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
+		vaddr = haddr + i*PAGE_SIZE;
+		if (!ARCH_HAS_USER_NOCACHE || i == target)
+			clear_user_highpage(page + i, vaddr);
+		else
+			clear_user_highpage_nocache(page + i, vaddr);
 	}
 }
-- 
1.7.7.6
[PATCH v3 5/7] x86: Add clear_page_nocache
From: Andi Kleen

Add a cache avoiding version of clear_page. Straight forward integer
variant of the existing 64bit clear_page, for both 32bit and 64bit.

Also add the necessary glue for highmem including a layer that non cache
coherent architectures that use the virtual address for flushing can
hook in. This is not needed on x86 of course.

If an architecture wants to provide cache avoiding version of clear_page
it should to define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().

Signed-off-by: Andi Kleen
Signed-off-by: Kirill A. Shutemov
---
 arch/x86/include/asm/page.h      |  2 +
 arch/x86/include/asm/string_32.h |  5 +++
 arch/x86/include/asm/string_64.h |  5 +++
 arch/x86/lib/Makefile            |  3 +-
 arch/x86/lib/clear_page_32.S     | 72 ++
 arch/x86/lib/clear_page_64.S     | 29 +++
 arch/x86/mm/fault.c              |  7 
 7 files changed, 122 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
index 8ca8283..aa83a1b 100644
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -29,6 +29,8 @@ static inline void copy_user_page(void *to, void *from, unsigned long vaddr,
 	copy_page(to, from);
 }
 
+void clear_user_highpage_nocache(struct page *page, unsigned long vaddr);
+
 #define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
 	alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

diff --git a/arch/x86/include/asm/string_32.h b/arch/x86/include/asm/string_32.h
index 3d3e835..3f2fbcf 100644
--- a/arch/x86/include/asm/string_32.h
+++ b/arch/x86/include/asm/string_32.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include
+
 /* Let gcc decide whether to inline or use the out of line functions */
 
 #define __HAVE_ARCH_STRCPY
@@ -337,6 +339,9 @@ void *__constant_c_and_count_memset(void *s, unsigned long pattern,
 #define __HAVE_ARCH_MEMSCAN
 extern void *memscan(void *addr, int c, size_t size);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_X86_STRING_32_H */

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..ca23d1d 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -3,6 +3,8 @@
 
 #ifdef __KERNEL__
 
+#include
+
 /* Written 2002 by Andi Kleen */
 
 /* Only used for special circumstances. Stolen from i386/string.h */
@@ -63,6 +65,9 @@ char *strcpy(char *dest, const char *src);
 char *strcat(char *dest, const char *src);
 int strcmp(const char *cs, const char *ct);
 
+#define ARCH_HAS_USER_NOCACHE 1
+asmlinkage void clear_page_nocache(void *page);
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_X86_STRING_64_H */

diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index b00f678..14e47a2 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -23,6 +23,7 @@ lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_SMP) += rwlock.o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
 lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
+lib-y += clear_page_$(BITS).o
 
 obj-y += msr.o msr-reg.o msr-reg-export.o
 
@@ -40,7 +41,7 @@ endif
 else
         obj-y += iomap_copy_64.o
         lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
-        lib-y += thunk_64.o clear_page_64.o copy_page_64.o
+        lib-y += thunk_64.o copy_page_64.o
         lib-y += memmove_64.o memset_64.o
         lib-y += copy_user_64.o copy_user_nocache_64.o
         lib-y += cmpxchg16b_emu.o

diff --git a/arch/x86/lib/clear_page_32.S b/arch/x86/lib/clear_page_32.S
new file mode 100644
index 000..9592161
--- /dev/null
+++ b/arch/x86/lib/clear_page_32.S
@@ -0,0 +1,72 @@
+#include
+#include
+#include
+#include
+
+/*
+ * Fallback version if SSE2 is not avaible.
+ */
+ENTRY(clear_page_nocache)
+	CFI_STARTPROC
+	mov    %eax,%edx
+	xorl   %eax,%eax
+	movl   $4096/32,%ecx
+	.p2align 4
+.Lloop:
+	decl	%ecx
+#define PUT(x) mov %eax,x*4(%edx)
+	PUT(0)
+	PUT(1)
+	PUT(2)
+	PUT(3)
+	PUT(4)
+	PUT(5)
+	PUT(6)
+	PUT(7)
+#undef PUT
+	lea	32(%edx),%edx
+	jnz	.Lloop
+	nop
+	ret
+	CFI_ENDPROC
+ENDPROC(clear_page_nocache)
+
+	.section .altinstr_replacement,"ax"
+1:	.byte 0xeb	/* jmp */
+	.byte (clear_page_nocache_sse2 - clear_page_nocache) - (2f - 1b)
+	/* offset */
+2:
+	.previous
+	.section .altinstructions,"a"
+	altinstruction_entry clear_page_nocache,1b,X86_FEATURE_XMM2,\
+		16, 2b-1b
+	.previous
+
+/*
+ * Zero a page avoiding the caches
+ * eax	page
+ */
+ENTRY(clear_page_nocache_sse2)
+	CFI_STARTPROC
+	mov    %eax,%edx
+	xorl   %eax,%eax
+
[PATCH v3 3/7] hugetlb: pass fault address to hugetlb_no_page()
From: "Kirill A. Shutemov"

Signed-off-by: Kirill A. Shutemov
---
 mm/hugetlb.c | 38 +++---
 1 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc72712..3c86d3d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2672,7 +2672,8 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
 }
 
 static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, unsigned int flags)
+			unsigned long haddr, unsigned long fault_address,
+			pte_t *ptep, unsigned int flags)
 {
 	struct hstate *h = hstate_vma(vma);
 	int ret = VM_FAULT_SIGBUS;
@@ -2696,7 +2697,7 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	mapping = vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, vma, address);
+	idx = vma_hugecache_offset(h, vma, haddr);
 
 	/*
 	 * Use page lock to guard against racing truncation
@@ -2708,7 +2709,7 @@ retry:
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (idx >= size)
 			goto out;
-		page = alloc_huge_page(vma, address, 0);
+		page = alloc_huge_page(vma, haddr, 0);
 		if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
 			if (ret == -ENOMEM)
@@ -2717,7 +2718,7 @@ retry:
 				ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
-		clear_huge_page(page, address, pages_per_huge_page(h));
+		clear_huge_page(page, haddr, pages_per_huge_page(h));
 		__SetPageUptodate(page);
 
 		if (vma->vm_flags & VM_MAYSHARE) {
@@ -2763,7 +2764,7 @@ retry:
 	 * the spinlock.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
-		if (vma_needs_reservation(h, vma, address) < 0) {
+		if (vma_needs_reservation(h, vma, haddr) < 0) {
 			ret = VM_FAULT_OOM;
 			goto backout_unlocked;
 		}
@@ -2778,16 +2779,16 @@ retry:
 		goto backout;
 
 	if (anon_rmap)
-		hugepage_add_new_anon_rmap(page, vma, address);
+		hugepage_add_new_anon_rmap(page, vma, haddr);
 	else
 		page_dup_rmap(page);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
-	set_huge_pte_at(mm, address, ptep, new_pte);
+	set_huge_pte_at(mm, haddr, ptep, new_pte);
 
 	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_cow(mm, vma, address, ptep, new_pte, page);
+		ret = hugetlb_cow(mm, vma, haddr, ptep, new_pte, page);
 	}
 
 	spin_unlock(&mm->page_table_lock);
@@ -2813,21 +2814,20 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *pagecache_page = NULL;
 	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 	struct hstate *h = hstate_vma(vma);
+	unsigned long haddr = address & huge_page_mask(h);
 
-	address &= huge_page_mask(h);
-
-	ptep = huge_pte_offset(mm, address);
+	ptep = huge_pte_offset(mm, haddr);
 	if (ptep) {
 		entry = huge_ptep_get(ptep);
 		if (unlikely(is_hugetlb_entry_migration(entry))) {
-			migration_entry_wait(mm, (pmd_t *)ptep, address);
+			migration_entry_wait(mm, (pmd_t *)ptep, haddr);
 			return 0;
 		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
 			return VM_FAULT_HWPOISON_LARGE |
 				VM_FAULT_SET_HINDEX(hstate_index(h));
 	}
 
-	ptep = huge_pte_alloc(mm, address, huge_page_size(h));
+	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
 	if (!ptep)
 		return VM_FAULT_OOM;
 
@@ -2839,7 +2839,7 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	mutex_lock(&hugetlb_instantiation_mutex);
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, flags);
+		ret = hugetlb_no_page(mm, vma, haddr, address, ptep, flags);
 		goto out_mutex;
 	}
 
@@ -2854,14 +2854,14 @@ int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * consumed.
 	 */
 	if ((flags & FAULT_FLAG_WRITE) && !pte_write(entry)) {
-		if (vma_needs_reservation(h, vma, address) < 0) {
+		if (vma_needs_reservation(h, vma, haddr) < 0) {
 			ret = VM_FAULT_OOM;
[PATCH v3 2/7] THP: Pass fault address to __do_huge_pmd_anonymous_page()
From: Andi Kleen

Signed-off-by: Andi Kleen
Signed-off-by: Kirill A. Shutemov
---
 mm/huge_memory.c |    7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 70737ec..6f0825b611 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -633,7 +633,8 @@ static inline pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 
 static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 					struct vm_area_struct *vma,
-					unsigned long haddr, pmd_t *pmd,
+					unsigned long haddr,
+					unsigned long address, pmd_t *pmd,
 					struct page *page)
 {
 	pgtable_t pgtable;
@@ -720,8 +721,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			put_page(page);
 			goto out;
 		}
-		if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
-							  page))) {
+		if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr,
+							  address, pmd, page))) {
 			mem_cgroup_uncharge_page(page);
 			put_page(page);
 			goto out;
-- 
1.7.7.6

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v3 0/7] Avoid cache trashing on clearing huge/gigantic page
From: "Kirill A. Shutemov"

Clearing a 2MB huge page will typically blow away several levels of CPU
caches. To avoid this, only cache-clear the 4K area around the fault
address and use cache-avoiding clears for the rest of the 2MB area.

This patchset implements a cache-avoiding version of clear_page only for
x86. If an architecture wants to provide a cache-avoiding version of
clear_page, it should define ARCH_HAS_USER_NOCACHE to 1 and implement
clear_page_nocache() and clear_user_highpage_nocache().

v3:
 - Rebased to current Linus' tree. kmap_atomic() build issue is fixed;
 - Pass fault address to clear_huge_page(). v2 had a problem with
   clearing for sizes other than HPAGE_SIZE;
 - x86: fix 32bit variant. A fallback version of clear_page_nocache()
   has been added for non-SSE2 systems;
 - x86: clear_page_nocache() moved to clear_page_{32,64}.S;
 - x86: use pushq_cfi/popq_cfi instead of push/pop;
v2:
 - No code change. Only commit messages are updated.
 - RFC mark is dropped.

Andi Kleen (5):
  THP: Use real address for NUMA policy
  THP: Pass fault address to __do_huge_pmd_anonymous_page()
  x86: Add clear_page_nocache
  mm: make clear_huge_page cache clear only around the fault address
  x86: switch the 64bit uncached page clear to SSE/AVX v2

Kirill A. Shutemov (2):
  hugetlb: pass fault address to hugetlb_no_page()
  mm: pass fault address to clear_huge_page()

 arch/x86/include/asm/page.h      |  2 +
 arch/x86/include/asm/string_32.h |  5 ++
 arch/x86/include/asm/string_64.h |  5 ++
 arch/x86/lib/Makefile            |  3 +-
 arch/x86/lib/clear_page_32.S     | 72 +++
 arch/x86/lib/clear_page_64.S     | 78 ++
 arch/x86/mm/fault.c              |  7 +++
 include/linux/mm.h               |  2 +-
 mm/huge_memory.c                 | 17 
 mm/hugetlb.c                     | 39 ++-
 mm/memory.c                      | 37 +++---
 11 files changed, 232 insertions(+), 35 deletions(-)
 create mode 100644 arch/x86/lib/clear_page_32.S

-- 
1.7.7.6
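The scheme described in the cover letter can be sketched in plain C. This is an illustrative user-space model only, not the kernel code: `clear_page_cached()` and `clear_page_nocache_stub()` are stand-ins for the kernel's `clear_user_highpage()` and the new `clear_page_nocache()`, and here both are plain `memset()`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/* Stand-in for the kernel's cached clear_user_highpage(). */
static void clear_page_cached(void *p) { memset(p, 0, PAGE_SIZE); }

/* Stand-in for clear_page_nocache(); a real implementation would use
 * non-temporal stores (movnti/movntdq) to bypass the caches. */
static void clear_page_nocache_stub(void *p) { memset(p, 0, PAGE_SIZE); }

/* Sketch of the patchset's idea: clear only the 4K page containing the
 * fault address through the cache, and clear everything else with the
 * cache-avoiding variant. */
static void clear_huge_page_sketch(void *huge, unsigned long haddr,
                                   unsigned long fault_address,
                                   unsigned int pages)
{
    unsigned int fault_idx = (fault_address - haddr) / PAGE_SIZE;
    for (unsigned int i = 0; i < pages; i++) {
        char *p = (char *)huge + i * PAGE_SIZE;
        if (i == fault_idx)
            clear_page_cached(p);       /* likely touched next: keep it hot */
        else
            clear_page_nocache_stub(p); /* avoid trashing the caches */
    }
}
```

For a 2MB huge page, `pages` would be 512 (2MB / 4K), with `haddr` the huge-page-aligned address and `fault_address` the address the process actually touched.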
[PATCH v3 4/7] mm: pass fault address to clear_huge_page()
From: "Kirill A. Shutemov"

Signed-off-by: Kirill A. Shutemov
---
 include/linux/mm.h |    2 +-
 mm/huge_memory.c   |    2 +-
 mm/hugetlb.c       |    3 ++-
 mm/memory.c        |    7 ---
 4 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..2858723 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1638,7 +1638,7 @@ extern void dump_page(struct page *page);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
-			unsigned long addr,
+			unsigned long haddr, unsigned long fault_address,
 			unsigned int pages_per_huge_page);
 extern void copy_user_huge_page(struct page *dst, struct page *src,
 			unsigned long addr, struct vm_area_struct *vma,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6f0825b611..070bf89 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -644,7 +644,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	if (unlikely(!pgtable))
 		return VM_FAULT_OOM;
 
-	clear_huge_page(page, haddr, HPAGE_PMD_NR);
+	clear_huge_page(page, haddr, address, HPAGE_PMD_NR);
 	__SetPageUptodate(page);
 
 	spin_lock(&mm->page_table_lock);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c86d3d..5182192 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2718,7 +2718,8 @@ retry:
 		ret = VM_FAULT_SIGBUS;
 		goto out;
 	}
-	clear_huge_page(page, haddr, pages_per_huge_page(h));
+	clear_huge_page(page, haddr, fault_address,
+			pages_per_huge_page(h));
 	__SetPageUptodate(page);
 
 	if (vma->vm_flags & VM_MAYSHARE) {
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..dfc179b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3984,19 +3984,20 @@ static void clear_gigantic_page(struct page *page,
 	}
 }
 
 void clear_huge_page(struct page *page,
-		     unsigned long addr, unsigned int pages_per_huge_page)
+		     unsigned long haddr, unsigned long fault_address,
+		     unsigned int pages_per_huge_page)
 {
 	int i;
 
 	if (unlikely(pages_per_huge_page > MAX_ORDER_NR_PAGES)) {
-		clear_gigantic_page(page, addr, pages_per_huge_page);
+		clear_gigantic_page(page, haddr, pages_per_huge_page);
 		return;
 	}
 
 	might_sleep();
 	for (i = 0; i < pages_per_huge_page; i++) {
 		cond_resched();
-		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
+		clear_user_highpage(page + i, haddr + i * PAGE_SIZE);
 	}
 }
-- 
1.7.7.6
[PATCH v3 7/7] x86: switch the 64bit uncached page clear to SSE/AVX v2
From: Andi Kleen

With multiple threads vector stores are more efficient, so use them.
This will cause the page clear to run non-preemptable and add some
overhead. However on 32bit it was already non-preemptable (due to
kmap_atomic) and there is a preemption opportunity every 4K unit.

On a NPB (Nasa Parallel Benchmark) 128GB run on a Westmere this improves
the performance regression of enabling transparent huge pages by ~2%
(2.81% to 0.81%), near the runtime variability now. On a system with
AVX support more is expected.

Signed-off-by: Andi Kleen
[kirill.shute...@linux.intel.com: Properly save/restore arguments]
Signed-off-by: Kirill A. Shutemov
---
 arch/x86/lib/clear_page_64.S | 79 ++
 1 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 9d2f3c2..b302cff 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -73,30 +73,79 @@ ENDPROC(clear_page)
 	.Lclear_page_end-clear_page,3b-2b
 	.previous
 
+#define SSE_UNROLL 128
+
 /*
  * Zero a page avoiding the caches
  * rdi	page
  */
 ENTRY(clear_page_nocache)
 	CFI_STARTPROC
-	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+	pushq_cfi %rdi
+	call   kernel_fpu_begin
+	popq_cfi  %rdi
+	sub    $16,%rsp
+	CFI_ADJUST_CFA_OFFSET 16
+	movdqu %xmm0,(%rsp)
+	xorpd  %xmm0,%xmm0
+	movl   $4096/SSE_UNROLL,%ecx
 	.p2align 4
 .Lloop_nocache:
 	decl	%ecx
-#define PUT(x) movnti %rax,x*8(%rdi)
-	movnti %rax,(%rdi)
-	PUT(1)
-	PUT(2)
-	PUT(3)
-	PUT(4)
-	PUT(5)
-	PUT(6)
-	PUT(7)
-#undef PUT
-	leaq	64(%rdi),%rdi
+	.set x,0
+	.rept SSE_UNROLL/16
+	movntdq %xmm0,x(%rdi)
+	.set x,x+16
+	.endr
+	leaq	SSE_UNROLL(%rdi),%rdi
 	jnz	.Lloop_nocache
-	nop
-	ret
+	movdqu (%rsp),%xmm0
+	addq   $16,%rsp
+	CFI_ADJUST_CFA_OFFSET -16
+	jmp	kernel_fpu_end
 	CFI_ENDPROC
 ENDPROC(clear_page_nocache)
+
+#ifdef CONFIG_AS_AVX
+
+	.section .altinstr_replacement,"ax"
+1:	.byte 0xeb					/* jmp */
+	.byte (clear_page_nocache_avx - clear_page_nocache) - (2f - 1b)
+	/* offset */
+2:
+	.previous
+	.section .altinstructions,"a"
+	altinstruction_entry clear_page_nocache,1b,X86_FEATURE_AVX,\
+			     16, 2b-1b
+	.previous
+
+#define AVX_UNROLL 256 /* TUNE ME */
+
+ENTRY(clear_page_nocache_avx)
+	CFI_STARTPROC
+	pushq_cfi %rdi
+	call   kernel_fpu_begin
+	popq_cfi  %rdi
+	sub    $32,%rsp
+	CFI_ADJUST_CFA_OFFSET 32
+	vmovdqu %ymm0,(%rsp)
+	vxorpd %ymm0,%ymm0,%ymm0
+	movl   $4096/AVX_UNROLL,%ecx
+	.p2align 4
+.Lloop_avx:
+	decl	%ecx
+	.set x,0
+	.rept AVX_UNROLL/32
+	vmovntdq %ymm0,x(%rdi)
+	.set x,x+32
+	.endr
+	leaq	AVX_UNROLL(%rdi),%rdi
+	jnz	.Lloop_avx
+	vmovdqu (%rsp),%ymm0
+	addq   $32,%rsp
+	CFI_ADJUST_CFA_OFFSET -32
+	jmp	kernel_fpu_end
+	CFI_ENDPROC
+ENDPROC(clear_page_nocache_avx)
+
+#endif
-- 
1.7.7.6
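For readers who don't speak AT&T assembly, roughly the same non-temporal store pattern can be written with SSE2 intrinsics in user space. This is an illustrative sketch of the technique, not the kernel code; `clear_page_nt` is a hypothetical name, the page must be 16-byte aligned, and `movntdq` (`_mm_stream_si128`) writes combine to memory without allocating cache lines.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_si128 */
#include <stdlib.h>
#include <string.h>

/* User-space analogue of clear_page_nocache: zero one 4K page with
 * non-temporal stores so the clear does not displace cached data. */
static void clear_page_nt(void *page)
{
    __m128i zero = _mm_setzero_si128();
    char *p = page;
    for (size_t off = 0; off < 4096; off += 64) {
        /* 64 bytes (one cache line) per unrolled step, in the spirit
         * of the SSE_UNROLL loop above. */
        _mm_stream_si128((__m128i *)(p + off),      zero);
        _mm_stream_si128((__m128i *)(p + off + 16), zero);
        _mm_stream_si128((__m128i *)(p + off + 32), zero);
        _mm_stream_si128((__m128i *)(p + off + 48), zero);
    }
    _mm_sfence();  /* order the non-temporal stores before later reads */
}
```

The `kernel_fpu_begin()`/`kernel_fpu_end()` bracketing in the patch has no user-space equivalent here; in the kernel it is what makes the XMM/YMM register use safe (and what makes the clear non-preemptable).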
[PATCH v3 1/7] THP: Use real address for NUMA policy
From: Andi Kleen

Use the fault address, not the rounded down hpage address, for NUMA
policy purposes. In some circumstances this can give more exact NUMA
policy.

Signed-off-by: Andi Kleen
Signed-off-by: Kirill A. Shutemov
---
 mm/huge_memory.c |    8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..70737ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -681,11 +681,11 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 
 static inline struct page *alloc_hugepage_vma(int defrag,
 					      struct vm_area_struct *vma,
-					      unsigned long haddr, int nd,
+					      unsigned long address, int nd,
 					      gfp_t extra_gfp)
 {
 	return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
-			       HPAGE_PMD_ORDER, vma, haddr, nd);
+			       HPAGE_PMD_ORDER, vma, address, nd);
 }
 
 #ifndef CONFIG_NUMA
@@ -710,7 +710,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (unlikely(khugepaged_enter(vma)))
 			return VM_FAULT_OOM;
 		page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					  vma, haddr, numa_node_id(), 0);
+					  vma, address, numa_node_id(), 0);
 		if (unlikely(!page)) {
 			count_vm_event(THP_FAULT_FALLBACK);
 			goto out;
@@ -944,7 +944,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (transparent_hugepage_enabled(vma) &&
 	    !transparent_hugepage_debug_cow())
 		new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
-					      vma, haddr, numa_node_id(), 0);
+					      vma, address, numa_node_id(), 0);
 	else
 		new_page = NULL;
-- 
1.7.7.6
Re: [PATCH] Powerpc 8xx CPM_UART delay in receive
> MAX_IDL: Maximum idle characters. When a character is received, the
> receiver begins counting idle characters. If MAX_IDL idle characters
> are received before the next data character, an idle timeout occurs
> and the buffer is closed, generating a maskable interrupt request to
> the core to receive the data from the buffer. Thus, MAX_IDL offers a
> way to demarcate frames. To disable the feature, clear MAX_IDL. The
> bit length of an idle character is calculated as follows:
> 1 + data length (5-9) + 1 (if parity is used) + number of stop bits
> (1-2). For 8 data bits, no parity, and 1 stop bit, the character
> length is 10 bits

So if you have slightly bursty high-speed data, as is quite typical,
before your change you would get one interrupt per buffer of 32 bytes;
with it you'll get a lot more interrupts.

You have two available hints about the way to set this - one of them is
the baud rate (low baud rates mean the fifo isn't a big win and the
latency is high), the other is the low_latency flag if the driver
supports the low latency feature (and arguably you can still use a
request for it as a hint even if you refuse the actual feature).

So I think a reasonable approach would be to set the idle timeout down
for low baud rates or if low_latency is requested.

> generated if there is at least one word in the FIFO and for a time
> equivalent to the transmission of four characters

Which is a bit more reasonable than one, although problematic at low
speed (hence the fifo on/off).
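The "set the idle timeout down for low baud rates or if low_latency is requested" approach might look something like the sketch below. `pick_max_idl` and its thresholds are made up for illustration; they are not taken from the CPM UART driver or any other driver.

```c
#include <assert.h>

/* Sketch of the heuristic suggested above: MAX_IDL is measured in idle
 * characters, so scale it with the baud rate, and drop to the minimum
 * when userspace asked for low latency. Thresholds are illustrative. */
static unsigned int pick_max_idl(unsigned int baud, int low_latency)
{
    if (low_latency)
        return 1;      /* close the buffer after a single idle character */
    if (baud <= 2400)
        return 2;      /* low speed: idle time per character dominates */
    if (baud <= 38400)
        return 8;
    return 32;         /* high speed: amortize interrupts over the buffer */
}
```

At 300 baud this keeps the idle timeout to a couple of character times instead of 32, while at 115200 baud it preserves the one-interrupt-per-32-byte-buffer behaviour.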
Re: [PATCH] Powerpc 8xx CPM_UART delay in receive
On 16/08/2012 16:29, Alan Cox wrote:
>> The PowerPC CPM is working differently. It doesn't use a fifo but
>> buffers. Buffers are handed to the microprocessor only when they are
>> full or after a timeout period which is adjustable. In the driver, the
>
> Which is different how - remembering we empty the FIFO on an IRQ
>
>> buffers are configured with a size of 32 bytes. And the timeout is set
>> to the size of the buffer. It is this timeout that I'm reducing to 1
>> byte in my proposed patch. I can't see what it would break for high
>> speed I/O.
>
> How can a timeout be measured in "bytes". Can we have a bit more
> clarity on how the hardware works and take it from there ?
>
> Alan

The reference manual of the MPC885 says the following about the MAX_IDL
parameter:

  MAX_IDL: Maximum idle characters. When a character is received, the
  receiver begins counting idle characters. If MAX_IDL idle characters
  are received before the next data character, an idle timeout occurs
  and the buffer is closed, generating a maskable interrupt request to
  the core to receive the data from the buffer. Thus, MAX_IDL offers a
  way to demarcate frames. To disable the feature, clear MAX_IDL. The
  bit length of an idle character is calculated as follows: 1 + data
  length (5-9) + 1 (if parity is used) + number of stop bits (1-2). For
  8 data bits, no parity, and 1 stop bit, the character length is 10
  bits.

  If the UART is receiving data and gets an idle character (all ones),
  the channel begins counting consecutive idle characters received. If
  MAX_IDL is reached, the buffer is closed and an RX interrupt is
  generated if not masked. If no buffer is open, this event does not
  generate an interrupt or any status information. The internal idle
  counter (IDLC) is reset every time a character is received. To
  disable the idle sequence function, clear MAX_IDL.

The datasheet of the 16550 UART says:

  Besides, for FIFO mode operation a time out mechanism is implemented.
  Independently of the trigger level of the FIFO, an interrupt will be
  generated if there is at least one word in the FIFO and for a time
  equivalent to the transmission of four characters
  - no new character has been received and
  - the microprocessor has not read the RHR
  To compute the time out, the current total number of bits (start,
  data, parity and stop(s)) is used, together with the current baud
  rate (i.e., it depends on the contents of the LCR, DLL, DLM and PSD
  registers).

Christophe
Re: [PATCH] Powerpc 8xx CPM_UART delay in receive
> The PowerPC CPM is working differently. It doesn't use a fifo but
> buffers. Buffers are handed to the microprocessor only when they are
> full or after a timeout period which is adjustable. In the driver, the

Which is different how - remembering we empty the FIFO on an IRQ

> buffers are configured with a size of 32 bytes. And the timeout is set
> to the size of the buffer. It is this timeout that I'm reducing to 1
> byte in my proposed patch. I can't see what it would break for high
> speed I/O.

How can a timeout be measured in "bytes". Can we have a bit more clarity
on how the hardware works and take it from there ?

Alan
Re: powerpc/perf: hw breakpoints return ENOSPC
On Fri, 2012-08-17 at 00:02 +1000, Michael Ellerman wrote:
> You do want to guarantee that the task will always be subject to the
> breakpoint, even if it moves cpus. So is there any way to guarantee
> that other than reserving a breakpoint slot on every cpu ahead of time?

That's not how regular perf works.. regular perf can overload hw
resources at will and stuff is strictly per-cpu.

So the regular perf record has perf_event_attr::inherit enabled by
default, this will result in it creating a per-task-per-cpu event for
each cpu and this will succeed because there's no strict reservation to
avoid/detect starvation against perf_event_attr::pinned events.

For regular (!pinned) events, we'll RR the created events on the
available hardware resources.

HWBP does things completely different and reserves a slot over all CPUs
for everything, thus stuff completely falls apart.
Re: powerpc/perf: hw breakpoints return ENOSPC
On Thu, 2012-08-16 at 13:44 +0200, Peter Zijlstra wrote:
> On Thu, 2012-08-16 at 21:17 +1000, Michael Neuling wrote:
> > Peter,
> >
> > > > On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
> > > > despite there being no breakpoint on this CPU. This is because the call
> > > > to task_bp_pinned checks all CPUs, rather than just the current CPU.
> > > > POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
> > > > return ENOSPC.
> > >
> > > I think this comes from the ptrace legacy, we register a breakpoint on
> > > all cpus because when we migrate a task it cannot fail to migrate the
> > > breakpoint.
> > >
> > > Its one of the things I hate most about the hwbp stuff as it relates to
> > > perf.
> > >
> > > Frederic knows more...
> >
> > Maybe I should wait for Frederic to respond but I'm not sure I
> > understand what you're saying.
> >
> > I can see how using ptrace hw breakpoints and perf hw breakpoints at the
> > same time could be a problem, but I'm not sure how this would stop it.
>
> ptrace uses perf for hwbp support so we're stuck with all kinds of
> stupid ptrace constraints.. or somesuch.
>
> > Are you saying that we need to keep at least 1 slot free at all times,
> > so that we can use it for ptrace?
>
> No, I'm saying perf-hwbp is weird because of ptrace, maybe the ptrace
> weirdness shouldn't live in perf-hwpb but in the ptrace-perf glue
> however..

But how else would it work, even if ptrace wasn't in the picture?

You do want to guarantee that the task will always be subject to the
breakpoint, even if it moves cpus. So is there any way to guarantee that
other than reserving a breakpoint slot on every cpu ahead of time?

Or can a hwbp event go into error state if it can't be installed on the
new cpu, like a pinned event does? I can't see any code that does that.

cheers
Re: [PATCH] Powerpc 8xx CPM_UART delay in receive
On 14/08/2012 16:52, Alan Cox wrote:
> On Tue, 14 Aug 2012 16:26:28 +0200
> Christophe Leroy wrote:
>
>> Hello,
>>
>> I'm not sure who to address this Patch to either. It fixes a delay
>> issue with CPM UART driver on Powerpc MPC8xx. The problem is that
>> with the actual code, the driver waits 32 IDLE patterns before
>> returning the received data to the upper level. It means for instance
>> about 1 second at 300 bauds. This fix limits to one byte the waiting
>> period.
>
> Take a look how the 8250 does it - I think you want to set the value
> based upon the data rate. Your patch will break it for everyone doing
> high speed I/O.
>
> Alan

I'm not sure I understand what you mean. As far as I can see the
8250/16550 works a bit differently, as it is based on a FIFO and
triggers an interrupt as soon as a given number of bytes is received. I
also see that in case this amount is not reached, there is a receive
timeout which fires after no byte is received for a duration of more
than 4 bytes.

The PowerPC CPM works differently. It doesn't use a FIFO but buffers.
Buffers are handed to the microprocessor only when they are full or
after a timeout period which is adjustable. In the driver, the buffers
are configured with a size of 32 bytes. And the timeout is set to the
size of the buffer. It is this timeout that I'm reducing to 1 byte in
my proposed patch. I can't see what it would break for high-speed I/O.

Christophe
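The "about 1 second at 300 bauds" figure follows directly from the MAX_IDL arithmetic quoted elsewhere in the thread: 32 idle characters of 10 bits each at 300 baud. A small C sketch of the calculation (helper names are hypothetical, for illustration only):

```c
#include <assert.h>

/* Bit length of one idle character per the MPC885 manual:
 * 1 start bit + data bits (5-9) + 1 (if parity is used) + stop bits (1-2). */
static unsigned int idle_char_bits(unsigned int data_bits,
                                   unsigned int parity,
                                   unsigned int stop_bits)
{
    return 1 + data_bits + (parity ? 1 : 0) + stop_bits;
}

/* Milliseconds of line idle time before MAX_IDL idle characters have
 * elapsed and the receive buffer is closed. */
static unsigned int max_idl_timeout_ms(unsigned int max_idl,
                                       unsigned int char_bits,
                                       unsigned int baud)
{
    return (max_idl * char_bits * 1000u) / baud;
}
```

With 8N1 (10-bit characters) and MAX_IDL set to the 32-byte buffer size, the timeout is 32 × 10 / 300 ≈ 1.07 s at 300 baud, but under 3 ms at 115200 baud, which is why reducing it to 1 character only matters at low speeds.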
Re: powerpc/perf: hw breakpoints return ENOSPC
On Thu, 2012-08-16 at 21:17 +1000, Michael Neuling wrote:
> Peter,
>
> > > On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
> > > despite there being no breakpoint on this CPU. This is because the call
> > > to task_bp_pinned checks all CPUs, rather than just the current CPU.
> > > POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
> > > return ENOSPC.
> >
> > I think this comes from the ptrace legacy, we register a breakpoint on
> > all cpus because when we migrate a task it cannot fail to migrate the
> > breakpoint.
> >
> > Its one of the things I hate most about the hwbp stuff as it relates to
> > perf.
> >
> > Frederic knows more...
>
> Maybe I should wait for Frederic to respond but I'm not sure I
> understand what you're saying.
>
> I can see how using ptrace hw breakpoints and perf hw breakpoints at the
> same time could be a problem, but I'm not sure how this would stop it.

ptrace uses perf for hwbp support so we're stuck with all kinds of
stupid ptrace constraints.. or somesuch.

> Are you saying that we need to keep at least 1 slot free at all times,
> so that we can use it for ptrace?

No, I'm saying perf-hwbp is weird because of ptrace, maybe the ptrace
weirdness shouldn't live in perf-hwpb but in the ptrace-perf glue
however..

> Is "perf record -e mem:0x1000 true" ever going to be able to work on
> POWER7 with only one hw breakpoint resource per CPU?

I think it should work... but I'm fairly sure it currently doesn't
because of how things are done.

'perf record -ie mem:0x100... true' might just work.

I always forget all the ptrace details but I am forever annoyed at the
mess that is perf-hwbp..

Frederic is there really nothing we can do about this? The fact that
ptrace hwbp semantics are different per architecture doesn't help of
course.
Re: powerpc/perf: hw breakpoints return ENOSPC
Peter,

> > On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
> > despite there being no breakpoint on this CPU. This is because the call
> > to task_bp_pinned checks all CPUs, rather than just the current CPU.
> > POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
> > return ENOSPC.
>
> I think this comes from the ptrace legacy, we register a breakpoint on
> all cpus because when we migrate a task it cannot fail to migrate the
> breakpoint.
>
> Its one of the things I hate most about the hwbp stuff as it relates to
> perf.
>
> Frederic knows more...

Maybe I should wait for Frederic to respond but I'm not sure I
understand what you're saying.

I can see how using ptrace hw breakpoints and perf hw breakpoints at the
same time could be a problem, but I'm not sure how this would stop it.

Are you saying that we need to keep at least 1 slot free at all times,
so that we can use it for ptrace?

Is "perf record -e mem:0x1000 true" ever going to be able to work on
POWER7 with only one hw breakpoint resource per CPU?

Thanks,
Mikey
Re: powerpc/perf: hw breakpoints return ENOSPC
On Thu, 2012-08-16 at 14:23 +1000, Michael Neuling wrote:
>
> On this second syscall, fetch_bp_busy_slots() sets slots.pinned to be 1,
> despite there being no breakpoint on this CPU. This is because the call
> to task_bp_pinned checks all CPUs, rather than just the current CPU.
> POWER7 only has one hardware breakpoint per CPU (ie. HBP_NUM=1), so we
> return ENOSPC.

I think this comes from the ptrace legacy, we register a breakpoint on
all cpus because when we migrate a task it cannot fail to migrate the
breakpoint.

Its one of the things I hate most about the hwbp stuff as it relates to
perf.

Frederic knows more...
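The ENOSPC behaviour discussed in this thread can be modelled in a few lines. This is a toy model, not the kernel's constraint code: it only shows why counting a task's pinned breakpoints over all CPUs, rather than per CPU, exhausts a one-slot-per-CPU budget after a single registration.

```c
#include <assert.h>

#define NR_CPUS 2
#define HBP_NUM 1  /* one hw breakpoint slot per CPU, as on POWER7 */

/* Toy model: per-cpu count of CPU-bound breakpoints, plus a task count
 * that - mirroring the behaviour described above - is summed over all
 * CPUs instead of being tracked per CPU. */
static int cpu_pinned[NR_CPUS];
static int task_pinned_all_cpus;

static int slots_pinned(int cpu)
{
    return cpu_pinned[cpu] + task_pinned_all_cpus;
}

/* 0 means the reservation would fail with -ENOSPC in the real code. */
static int can_reserve(int cpu)
{
    return slots_pinned(cpu) < HBP_NUM;
}
```

With a per-CPU task count instead, a task breakpoint installed on CPU 0 would leave CPU 1's slot free until the task actually migrated there.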