Re: [patch V2 22/29] lockup_detector: Make watchdog_nmi_reconfigure() two stage

2017-10-03 Thread Michael Ellerman
Thomas Gleixner  writes:

> On Tue, 3 Oct 2017, Thomas Gleixner wrote:
>> On Tue, 3 Oct 2017, Thomas Gleixner wrote:
>> > On Tue, 3 Oct 2017, Michael Ellerman wrote:
>> > > Hmm, I tried that patch, it makes the warning go away. But then I
>> > > triggered a deliberate hard lockup and got nothing.
>> > > 
>> > > Then I went back to the existing code (in linux-next), and I still get
>> > > no warning from a deliberate hard lockup.
>> > > 
>> > > So seems there may be some more gremlins. Will test more in the morning.
>> > 
>> > Hrm. That's weird. I'll have a look and send a proper patch series on top
>> > of next.
>> 
>> The major difference is that the reworked code utilizes
>> watchdog_nmi_reconfigure() for both init and the sysctl updates, but I
>> can't for my life figure out why that doesn't work.
>
> I collected the changes which Linus requested along with the nmi_probe()
> one and pushed them into:
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.core/urgent
>
> That's based on 4.13 final so it neither contains 4.14 nor -next material.

Thanks. I tested that here and it seems fine. The warning at boot is
gone and it is correctly catching a hard lockup triggered via LKDTM, eg:

  # mount -t debugfs none /sys/kernel/debug
  # echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
  lkdtm: Performing direct entry HARDLOCKUP
  Watchdog CPU:0 Hard LOCKUP
  Modules linked in:
  CPU: 0 PID: 1215 Comm: sh Not tainted 4.13.0-gcc6-11846-g86be5ee #162
  task: c000f1fc4c00 task.stack: c000ee3ac000
  NIP:  c07205a4 LR: c071f950 CTR: c0720570
  REGS: c0003d80 TRAP: 0900   Not tainted  (4.13.0-gcc6-11846-g86be5ee)
  MSR:  90009033   CR: 28002228  XER: 
  CFAR: c07205a8 SOFTE: 0 
  GPR00: c071f950 c000ee3afbb0 c107cf00 c10604f0 
  GPR04: c000ffa05d90 c000ffa1c968   
  GPR08: 0007 0001  900030001003 
  GPR12: c0720570 cfd4   
  GPR16:    100b8fd0 
  GPR20: 01002f5a3485 100b8f90   
  GPR24: c1060778 c000ee3afe00 c000ee3afe00 c10603b0 
  GPR28: 000b c000f1fe 0140 c10604f0 
  NIP [c07205a4] lkdtm_HARDLOCKUP+0x34/0x40
  LR [c071f950] lkdtm_do_action+0x50/0x70
  Call Trace:
  [c000ee3afbb0] [0140] 0x140 (unreliable)
  [c000ee3afbd0] [c071f950] lkdtm_do_action+0x50/0x70
  [c000ee3afc00] [c071fdc0] direct_entry+0x110/0x1b0
  [c000ee3afc90] [c050141c] full_proxy_write+0x9c/0x110
  [c000ee3afcf0] [c0336a3c] __vfs_write+0x6c/0x210
  [c000ee3afd90] [c0338960] vfs_write+0xd0/0x270
  [c000ee3afde0] [c033a93c] SyS_write+0x6c/0x110
  [c000ee3afe30] [c000b220] system_call+0x58/0x6c
  Instruction dump:
  3842c990 7c0802a6 f8010010 f821ffe1 6000 6000 3940 892d027a 
  994d027a 6000 6042 7c210b78 <7c421378> 4bf8 6042 3c4c0096 
  Kernel panic - not syncing: Hard LOCKUP

Acked-by: Michael Ellerman  (powerpc)

cheers


[PATCH v2] powerpc: configs: Add Skiroot defconfig

2017-10-03 Thread Joel Stanley
This configuration is used by the OpenPower firmware for its
Linux-as-bootloader implementation. Also known as the Petitboot kernel,
this configuration broke in 4.12 (CPU_HOTPLUG=n), so add it to the
upstream tree in order to get better coverage.

Signed-off-by: Joel Stanley 
---
v2:
 - refresh for 4.13

 arch/powerpc/configs/skiroot_defconfig | 232 +
 1 file changed, 232 insertions(+)
 create mode 100644 arch/powerpc/configs/skiroot_defconfig

diff --git a/arch/powerpc/configs/skiroot_defconfig 
b/arch/powerpc/configs/skiroot_defconfig
new file mode 100644
index 000000000000..6bd5e7261335
--- /dev/null
+++ b/arch/powerpc/configs/skiroot_defconfig
@@ -0,0 +1,232 @@
+CONFIG_PPC64=y
+CONFIG_ALTIVEC=y
+CONFIG_VSX=y
+CONFIG_NR_CPUS=2048
+CONFIG_CPU_LITTLE_ENDIAN=y
+# CONFIG_SWAP is not set
+CONFIG_SYSVIPC=y
+CONFIG_POSIX_MQUEUE=y
+# CONFIG_CROSS_MEMORY_ATTACH is not set
+CONFIG_NO_HZ=y
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_TASKSTATS=y
+CONFIG_TASK_DELAY_ACCT=y
+CONFIG_TASK_XACCT=y
+CONFIG_TASK_IO_ACCOUNTING=y
+CONFIG_IKCONFIG=y
+CONFIG_IKCONFIG_PROC=y
+CONFIG_LOG_BUF_SHIFT=20
+CONFIG_RELAY=y
+CONFIG_BLK_DEV_INITRD=y
+# CONFIG_RD_GZIP is not set
+# CONFIG_RD_BZIP2 is not set
+# CONFIG_RD_LZMA is not set
+# CONFIG_RD_LZO is not set
+# CONFIG_RD_LZ4 is not set
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
+CONFIG_PERF_EVENTS=y
+# CONFIG_COMPAT_BRK is not set
+CONFIG_JUMP_LABEL=y
+CONFIG_STRICT_KERNEL_RWX=y
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+CONFIG_MODULE_SIG=y
+CONFIG_MODULE_SIG_FORCE=y
+CONFIG_MODULE_SIG_SHA512=y
+CONFIG_PARTITION_ADVANCED=y
+# CONFIG_IOSCHED_DEADLINE is not set
+# CONFIG_PPC_PSERIES is not set
+CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
+CONFIG_CPU_IDLE=y
+CONFIG_HZ_100=y
+CONFIG_KEXEC=y
+CONFIG_IRQ_ALL_CPUS=y
+CONFIG_NUMA=y
+# CONFIG_COMPACTION is not set
+# CONFIG_MIGRATION is not set
+# CONFIG_BOUNCE is not set
+CONFIG_PPC_64K_PAGES=y
+CONFIG_SCHED_SMT=y
+CONFIG_CMDLINE_BOOL=y
+CONFIG_CMDLINE="console=tty0 console=hvc0 powersave=off"
+# CONFIG_SECCOMP is not set
+CONFIG_NET=y
+CONFIG_PACKET=y
+CONFIG_UNIX=y
+CONFIG_INET=y
+CONFIG_IP_MULTICAST=y
+CONFIG_NET_IPIP=y
+CONFIG_SYN_COOKIES=y
+# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
+# CONFIG_INET_XFRM_MODE_TUNNEL is not set
+# CONFIG_INET_XFRM_MODE_BEET is not set
+# CONFIG_IPV6 is not set
+CONFIG_DNS_RESOLVER=y
+# CONFIG_WIRELESS is not set
+CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
+CONFIG_DEVTMPFS=y
+CONFIG_DEVTMPFS_MOUNT=y
+CONFIG_MTD=m
+CONFIG_MTD_POWERNV_FLASH=m
+CONFIG_BLK_DEV_LOOP=y
+CONFIG_BLK_DEV_RAM=y
+CONFIG_BLK_DEV_RAM_SIZE=65536
+CONFIG_VIRTIO_BLK=m
+CONFIG_BLK_DEV_NVME=m
+CONFIG_EEPROM_AT24=y
+# CONFIG_CXL is not set
+CONFIG_BLK_DEV_SD=m
+CONFIG_BLK_DEV_SR=m
+CONFIG_BLK_DEV_SR_VENDOR=y
+CONFIG_CHR_DEV_SG=m
+CONFIG_SCSI_CONSTANTS=y
+CONFIG_SCSI_SCAN_ASYNC=y
+CONFIG_SCSI_FC_ATTRS=y
+CONFIG_SCSI_CXGB3_ISCSI=m
+CONFIG_SCSI_CXGB4_ISCSI=m
+CONFIG_SCSI_BNX2_ISCSI=m
+CONFIG_BE2ISCSI=m
+CONFIG_SCSI_AACRAID=m
+CONFIG_MEGARAID_NEWGEN=y
+CONFIG_MEGARAID_MM=m
+CONFIG_MEGARAID_MAILBOX=m
+CONFIG_MEGARAID_SAS=m
+CONFIG_SCSI_MPT2SAS=m
+CONFIG_SCSI_IPR=m
+# CONFIG_SCSI_IPR_TRACE is not set
+# CONFIG_SCSI_IPR_DUMP is not set
+CONFIG_SCSI_QLA_FC=m
+CONFIG_SCSI_QLA_ISCSI=m
+CONFIG_SCSI_LPFC=m
+CONFIG_SCSI_VIRTIO=m
+CONFIG_SCSI_DH=y
+CONFIG_SCSI_DH_ALUA=m
+CONFIG_ATA=y
+CONFIG_SATA_AHCI=y
+# CONFIG_ATA_SFF is not set
+CONFIG_MD=y
+CONFIG_BLK_DEV_MD=m
+CONFIG_MD_LINEAR=m
+CONFIG_MD_RAID0=m
+CONFIG_MD_RAID1=m
+CONFIG_MD_RAID10=m
+CONFIG_MD_RAID456=m
+CONFIG_MD_MULTIPATH=m
+CONFIG_MD_FAULTY=m
+CONFIG_BLK_DEV_DM=m
+CONFIG_DM_CRYPT=m
+CONFIG_DM_SNAPSHOT=m
+CONFIG_DM_MIRROR=m
+CONFIG_DM_ZERO=m
+CONFIG_DM_MULTIPATH=m
+CONFIG_ACENIC=m
+CONFIG_ACENIC_OMIT_TIGON_I=y
+CONFIG_TIGON3=y
+CONFIG_BNX2X=m
+CONFIG_CHELSIO_T1=y
+CONFIG_BE2NET=m
+CONFIG_S2IO=m
+CONFIG_E100=m
+CONFIG_E1000=m
+CONFIG_E1000E=m
+CONFIG_IXGB=m
+CONFIG_IXGBE=m
+CONFIG_MLX4_EN=m
+CONFIG_MLX5_CORE=m
+CONFIG_MLX5_CORE_EN=y
+CONFIG_MYRI10GE=m
+CONFIG_QLGE=m
+CONFIG_NETXEN_NIC=m
+CONFIG_SFC=m
+# CONFIG_USB_NET_DRIVERS is not set
+# CONFIG_WLAN is not set
+CONFIG_INPUT_EVDEV=y
+CONFIG_INPUT_MISC=y
+# CONFIG_SERIO_SERPORT is not set
+# CONFIG_DEVMEM is not set
+CONFIG_SERIAL_8250=y
+CONFIG_SERIAL_8250_CONSOLE=y
+CONFIG_IPMI_HANDLER=y
+CONFIG_IPMI_DEVICE_INTERFACE=y
+CONFIG_IPMI_POWERNV=y
+CONFIG_HW_RANDOM=y
+CONFIG_TCG_TIS_I2C_NUVOTON=y
+# CONFIG_I2C_COMPAT is not set
+CONFIG_I2C_CHARDEV=y
+# CONFIG_I2C_HELPER_AUTO is not set
+CONFIG_DRM=y
+CONFIG_DRM_RADEON=y
+CONFIG_DRM_AST=m
+CONFIG_FIRMWARE_EDID=y
+CONFIG_FB_MODE_HELPERS=y
+CONFIG_FB_OF=y
+CONFIG_FB_MATROX=y
+CONFIG_FB_MATROX_MILLENIUM=y
+CONFIG_FB_MATROX_MYSTIQUE=y
+CONFIG_FB_MATROX_G=y
+# CONFIG_LCD_CLASS_DEVICE is not set
+# CONFIG_BACKLIGHT_GENERIC is not set
+# CONFIG_VGA_CONSOLE is not set
+CONFIG_LOGO=y
+# CONFIG_LOGO_LINUX_MONO is not set
+# CONFIG_LOGO_LINUX_VGA16 is not set
+CONFIG_USB_HIDDEV=y
+CONFIG_USB=y
+CONFIG_USB_MON=y
+CONFIG_USB_XHCI_HCD=y
+CONFIG_USB_EHCI_HCD=y
+# CONFIG_USB_EHCI_HCD_PPC_OF is not set

Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless

2017-10-03 Thread Jerome Glisse
On Tue, Oct 03, 2017 at 05:43:47PM -0700, Nadav Amit wrote:
> Jerome Glisse  wrote:
> 
> > On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> > 
> >> I'd like some more explanation about the inner working of "that new
> >> user" as per comment above.
> >> 
> >> It would be enough to drop mmu_notifier_invalidate_range from above
> >> without adding it to the filebacked case. The above gives higher prio
> >> to the hypothetical and uncertain future case, than to the current
> >> real filebacked case that doesn't need ->invalidate_range inside the
> >> PT lock, or do you see something that might already need such
> >> ->invalidate_range?
> > 
> > No i don't see any new user today that might need such invalidate but
> > i was trying to be extra cautious as i have a tendency to assume that
> > someone might do a patch that use try_to_unmap() without going through
> > all the comments in the function and thus possibly using it in a an
> > unexpected way from mmu_notifier callback point of view. I am fine
> > with putting the burden on new user to get it right and adding an
> > extra warning in the function description to try to warn people in a
> > sensible way.
> 
> I must be missing something. After the PTE is changed, but before the
> secondary TLB notification/invalidation - What prevents another thread from
> changing the mappings (e.g., using munmap/mmap), and setting a new page
> at that PTE?
> 
> Wouldn’t it end with the page being mapped without a secondary TLB flush in
> between?

munmap would call the mmu_notifier to invalidate the range too, so the
secondary TLB would be properly flushed before any new pte can be set up
for that particular virtual address range. Unlike a CPU TLB flush,
secondary TLB flushes are unconditional, and thus the current pte value
does not play any role.
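
In other words, the required ordering looks roughly like this (a rough
sketch in pseudo-kernel code; the call names are illustrative, not lifted
from the patch):

    pte_clear(mm, addr, ptep);                      /* CPU mapping gone */
    mmu_notifier_invalidate_range(mm, start, end);  /* secondary TLB shot
                                                       down, unconditionally */
    /* only after both steps may a new pte be installed for this range */
    set_pte_at(mm, addr, ptep, new_pte);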

Cheers,
Jérôme


Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless

2017-10-03 Thread Jerome Glisse
On Wed, Oct 04, 2017 at 01:42:15AM +0200, Andrea Arcangeli wrote:
> Hello Jerome,
> 
> On Fri, Sep 01, 2017 at 01:30:11PM -0400, Jerome Glisse wrote:
> > +Case A is obvious you do not want to take the risk for the device to write 
> > to
> > +a page that might now be use by some completely different task.
> 
> used
> 
> > +is true ven if the thread doing the page table update is preempted right 
> > after
> 
> even
> 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 90731e3b7e58..5706252b828a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1167,8 +1167,15 @@ static int do_huge_pmd_wp_page_fallback(struct 
> > vm_fault *vmf, pmd_t orig_pmd,
> > goto out_free_pages;
> > VM_BUG_ON_PAGE(!PageHead(page), page);
> >  
> > +   /*
> > +* Leave pmd empty until pte is filled note we must notify here as
> > +* concurrent CPU thread might write to new page before the call to
> > +* mmu_notifier_invalidate_range_end() happen which can lead to a
> 
> happens
> 
> > +* device seeing memory write in different order than CPU.
> > +*
> > +* See Documentation/vm/mmu_notifier.txt
> > +*/
> > pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> > -   /* leave pmd empty until pte is filled */
> >  
> 
> Here we can change the following mmu_notifier_invalidate_range_end to
> skip calling ->invalidate_range. It could be called
> mmu_notifier_invalidate_range_only_end, or other suggestions
> welcome. Otherwise we'll repeat the call for nothing.
> 
> We need it inside the PT lock for the ordering issue, but we don't
> need to run it twice.
> 
> Same in do_huge_pmd_wp_page, wp_page_copy and
> migrate_vma_insert_page. Every time *clear_flush_notify is used
> mmu_notifier_invalidate_range_only_end should be called after it,
> instead of mmu_notifier_invalidate_range_end.
> 
> I think optimizing that part too, fits in the context of this patchset
> (if not in the same patch), because the objective is still the same:
> to remove unnecessary ->invalidate_range calls.

Yes you are right, good idea, I will respin with that too (and with the
various typos you noted, thank you for that). I can do 2 patches or 1, I
don't mind either way. I will probably do 2 at first and they can be folded
into 1 if people prefer just one.


> 
> > +* No need to notify as we downgrading page
> 
> are
> 
> > +* table protection not changing it to point
> > +* to a new page.
> > +*
> 
> > +* No need to notify as we downgrading page table to read only
> 
> are
> 
> > +* No need to notify as we replacing a read only page with another
> 
> are
> 
> > @@ -1510,13 +1515,43 @@ static bool try_to_unmap_one(struct page *page, 
> > struct vm_area_struct *vma,
> > if (pte_soft_dirty(pteval))
> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > set_pte_at(mm, address, pvmw.pte, swp_pte);
> > -   } else
> > +   } else {
> > +   /*
> > +* We should not need to notify here as we reach this
> > +* case only from freeze_page() itself only call from
> > +* split_huge_page_to_list() so everything below must
> > +* be true:
> > +*   - page is not anonymous
> > +*   - page is locked
> > +*
> > +* So as it is a shared page and it is locked, it can
> > +* not be remove from the page cache and replace by
> > +* a new page before mmu_notifier_invalidate_range_end
> > +* so no concurrent thread might update its page table
> > +* to point at new page while a device still is using
> > +* this page.
> > +*
> > +* But we can not assume that new user of try_to_unmap
> > +* will have that in mind so just to be safe here call
> > +* mmu_notifier_invalidate_range()
> > +*
> > +* See Documentation/vm/mmu_notifier.txt
> > +*/
> > dec_mm_counter(mm, mm_counter_file(page));
> > +   mmu_notifier_invalidate_range(mm, address,
> > + address + PAGE_SIZE);
> > +   }
> >  discard:
> > +   /*
> > +* No need to call mmu_notifier_invalidate_range() as we are
> > +* either replacing a present pte with non present one (either
> > +* a swap or special one). We handling the clearing pte case
> > +* above.
> > +*
> > +* See Documentation/vm/mmu_notifier.txt
> > +*/
> > page_remove_rmap(subpage, PageHuge(page));
> > 

Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless

2017-10-03 Thread Andrea Arcangeli
Hello Jerome,

On Fri, Sep 01, 2017 at 01:30:11PM -0400, Jerome Glisse wrote:
> +Case A is obvious you do not want to take the risk for the device to write to
> +a page that might now be use by some completely different task.

used

> +is true ven if the thread doing the page table update is preempted right 
> after

even

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 90731e3b7e58..5706252b828a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1167,8 +1167,15 @@ static int do_huge_pmd_wp_page_fallback(struct 
> vm_fault *vmf, pmd_t orig_pmd,
>   goto out_free_pages;
>   VM_BUG_ON_PAGE(!PageHead(page), page);
>  
> + /*
> +  * Leave pmd empty until pte is filled note we must notify here as
> +  * concurrent CPU thread might write to new page before the call to
> +  * mmu_notifier_invalidate_range_end() happen which can lead to a

happens

> +  * device seeing memory write in different order than CPU.
> +  *
> +  * See Documentation/vm/mmu_notifier.txt
> +  */
>   pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> - /* leave pmd empty until pte is filled */
>  

Here we can change the following mmu_notifier_invalidate_range_end to
skip calling ->invalidate_range. It could be called
mmu_notifier_invalidate_range_only_end, or other suggestions
welcome. Otherwise we'll repeat the call for nothing.

We need it inside the PT lock for the ordering issue, but we don't
need to run it twice.

Same in do_huge_pmd_wp_page, wp_page_copy and
migrate_vma_insert_page. Every time *clear_flush_notify is used
mmu_notifier_invalidate_range_only_end should be called after it,
instead of mmu_notifier_invalidate_range_end.

I think optimizing that part too, fits in the context of this patchset
(if not in the same patch), because the objective is still the same:
to remove unnecessary ->invalidate_range calls.
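
A minimal sketch of what such a helper could look like (the name and the
skip-the-callbacks behaviour are just the suggestion above, not an existing
API; it assumes __mmu_notifier_invalidate_range_end() grows a flag):

    /*
     * Like mmu_notifier_invalidate_range_end(), but skip the
     * ->invalidate_range() callbacks for callers that already
     * invalidated via *_clear_flush_notify() under the PT lock.
     */
    static inline void
    mmu_notifier_invalidate_range_only_end(struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end)
    {
            if (mm_has_notifiers(mm))
                    __mmu_notifier_invalidate_range_end(mm, start, end,
                                    true /* skip ->invalidate_range() */);
    }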

> +  * No need to notify as we downgrading page

are

> +  * table protection not changing it to point
> +  * to a new page.
> +  *

> +  * No need to notify as we downgrading page table to read only

are

> +  * No need to notify as we replacing a read only page with another

are

> @@ -1510,13 +1515,43 @@ static bool try_to_unmap_one(struct page *page, 
> struct vm_area_struct *vma,
>   if (pte_soft_dirty(pteval))
>   swp_pte = pte_swp_mksoft_dirty(swp_pte);
>   set_pte_at(mm, address, pvmw.pte, swp_pte);
> - } else
> + } else {
> + /*
> +  * We should not need to notify here as we reach this
> +  * case only from freeze_page() itself only call from
> +  * split_huge_page_to_list() so everything below must
> +  * be true:
> +  *   - page is not anonymous
> +  *   - page is locked
> +  *
> +  * So as it is a shared page and it is locked, it can
> +  * not be remove from the page cache and replace by
> +  * a new page before mmu_notifier_invalidate_range_end
> +  * so no concurrent thread might update its page table
> +  * to point at new page while a device still is using
> +  * this page.
> +  *
> +  * But we can not assume that new user of try_to_unmap
> +  * will have that in mind so just to be safe here call
> +  * mmu_notifier_invalidate_range()
> +  *
> +  * See Documentation/vm/mmu_notifier.txt
> +  */
>   dec_mm_counter(mm, mm_counter_file(page));
> + mmu_notifier_invalidate_range(mm, address,
> +   address + PAGE_SIZE);
> + }
>  discard:
> + /*
> +  * No need to call mmu_notifier_invalidate_range() as we are
> +  * either replacing a present pte with non present one (either
> +  * a swap or special one). We handling the clearing pte case
> +  * above.
> +  *
> +  * See Documentation/vm/mmu_notifier.txt
> +  */
>   page_remove_rmap(subpage, PageHuge(page));
>   put_page(page);
> - mmu_notifier_invalidate_range(mm, address,
> -   address + PAGE_SIZE);
>   }
>  
>   mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);

That is the path that unmaps filebacked pages (btw, not necessarily
shared, unlike the comment says; they can be private but still filebacked).

I'd like some more explanation about the inner working of "that new

Re: [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap

2017-10-03 Thread Pasha Tatashin

Hi Michal,

I decided not to merge these two patches, because in addition to the sparc
optimization move, we have these dependencies:


mm: zero reserved and unavailable struct pages

must be before

mm: stop zeroing memory during allocation in vmemmap.

Otherwise, we can end up with struct pages that are not zeroed properly.

However, the first patch depends on
mm: zero struct pages during initialization

As it uses mm_zero_struct_page().
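
For context, the generic form of that helper is essentially a
struct-page-sized memset (a sketch; sparc64 overrides it with an optimized
version in patch 07/12):

    #ifndef mm_zero_struct_page
    #define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
    #endif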

Pasha


On 10/03/2017 11:34 AM, Pasha Tatashin wrote:

On 10/03/2017 09:19 AM, Michal Hocko wrote:

On Wed 20-09-17 16:17:14, Pavel Tatashin wrote:

vmemmap_alloc_block() will no longer zero the block, so zero memory
at its call sites for everything except struct pages. Struct page memory
is zero'd by struct page initialization.

Replace allocators in sparse-vmemmap to use the non-zeroing version. So,
we will get the performance improvement by zeroing the memory in parallel
when struct pages are zeroed.


Is it possible to merge this patch with 
http://lkml.kernel.org/r/20170920201714.19817-7-pasha.tatas...@oracle.com


Yes, I will do that. It would also require re-arranging
[PATCH v9 07/12] sparc64: optimized struct page zeroing
optimization to come after this patch.

Pasha


RE: [PATCH 02/11] x86: make dma_cache_sync a no-op

2017-10-03 Thread Thomas Gleixner
On Tue, 3 Oct 2017, David Laight wrote:

> From: Christoph Hellwig
> > Sent: 03 October 2017 11:43
> > x86 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
> > make any sense to do any work in dma_cache_sync given that it must be a
> > no-op when dma_alloc_attrs returns coherent memory.
> 
> I believe it is just about possible to require an explicit
> write flush on x86.
> ISTR this can happen with something like write combining.

As the changelog says: x86 only implements dma_alloc_coherent() which as
the name says returns coherent memory, i.e. type WB (write back), which is
not subject to dma_cache_sync() operations.

If the driver converts that memory to WC (write combine) on its own via
PAT/MTRR, then it needs to take care of flushing the write buffer on its
own. It's not covered by this interface.
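
To illustrate, a driver-side sketch (assuming a driver that created its own
WC mapping via ioremap_wc(); ring_phys, ring_size, buf, len and doorbell are
hypothetical names):

    void __iomem *ring = ioremap_wc(ring_phys, ring_size);

    memcpy_toio(ring, buf, len);  /* stores may linger in WC buffers      */
    wmb();                        /* sfence on x86: drains the WC buffers */
    writel(len, doorbell);        /* only now tell the device to look     */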

Thanks,

tglx



Re: [PATCH v2] powerpc: Blacklist GCC 5.4 6.1 and 6.2

2017-10-03 Thread Michal Suchánek
On Mon, 13 Feb 2017 19:49:16 -0600
Segher Boessenkool  wrote:

> On Tue, Feb 14, 2017 at 11:25:43AM +1100, Cyril Bur wrote:
> > > > At the time of writing GCC 5.4 is the most recent and is
> > > > affected. GCC 6.3 contains the backported fix, has been tested
> > > > and appears safe to use.  
> > > 
> > > 6.3 is (of course) the newer release; 5.4 is a maintenance
> > > release of a compiler that is a year older.  
> > 
> > Yes. I think the point I was trying to make is that since they
> > backported the fix to 5.x and 6.x then I expect that 5.5 will have
> > the fix but since it doesn't exist yet, I can't be sure. I'll add
> > something to that effect.  
> 
> The patch has been backported to the GCC 5 branch; it will be part of
> any future GCC 5 release (5.5 and later, if any later will happen; 5.5
> will).
> 
> Don't be so unsure about these things, we aren't *that*
> incompetent ;-)

Nonetheless the message is factually correct.

> 
> > > Please mention the GCC PR # somewhere in the code, too?  
> > 
> > Sure.  

It seems this has never happened?

Anyway, having the bug number in the text reported to the user seems
excessive. So far the bug number has been mentioned only in the commit
messages and it looks fine to me. You can always look it up if you
really want.

Thanks

Michal



Re: [patch V2 22/29] lockup_detector: Make watchdog_nmi_reconfigure() two stage

2017-10-03 Thread Thomas Gleixner
On Tue, 3 Oct 2017, Thomas Gleixner wrote:
> On Tue, 3 Oct 2017, Thomas Gleixner wrote:
> > On Tue, 3 Oct 2017, Michael Ellerman wrote:
> > > Hmm, I tried that patch, it makes the warning go away. But then I
> > > triggered a deliberate hard lockup and got nothing.
> > > 
> > > Then I went back to the existing code (in linux-next), and I still get
> > > no warning from a deliberate hard lockup.
> > > 
> > > So seems there may be some more gremlins. Will test more in the morning.
> > 
> > Hrm. That's weird. I'll have a look and send a proper patch series on top
> > of next.
> 
> The major difference is that the reworked code utilizes
> watchdog_nmi_reconfigure() for both init and the sysctl updates, but I
> can't for my life figure out why that doesn't work.

I collected the changes which Linus requested along with the nmi_probe()
one and pushed them into:

 git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.core/urgent

That's based on 4.13 final so it neither contains 4.14 nor -next material.

Thanks,

tglx


[PATCH 8/8] powerpc: Enable support of ibm,dynamic-memory-v2

2017-10-03 Thread Nathan Fontenot
Add required bits to the architecture vector to enable support
of the ibm,dynamic-memory-v2 device tree property.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/include/asm/firmware.h   |3 ++-
 arch/powerpc/include/asm/prom.h   |1 +
 arch/powerpc/kernel/prom_init.c   |1 +
 arch/powerpc/platforms/pseries/firmware.c |1 +
 4 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 8645897472b1..832df61f30ef 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -51,6 +51,7 @@
 #define FW_FEATURE_BEST_ENERGY ASM_CONST(0x8000)
 #define FW_FEATURE_TYPE1_AFFINITY ASM_CONST(0x0001)
 #define FW_FEATURE_PRRN        ASM_CONST(0x0000000200000000)
+#define FW_FEATURE_DRMEM_V2    ASM_CONST(0x0000000400000000)
 
 #ifndef __ASSEMBLY__
 
@@ -67,7 +68,7 @@ enum {
FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
-   FW_FEATURE_HPT_RESIZE,
+   FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index f0a30a003bd8..9f27866e3126 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -143,6 +143,7 @@ extern int of_get_ibm_chip_id(struct device_node *np);
 #define OV5_PFO_HW_842 0x1140  /* PFO Compression Accelerator */
 #define OV5_PFO_HW_ENCR0x1120  /* PFO Encryption Accelerator */
 #define OV5_SUB_PROCESSORS 0x1501  /* 1,2,or 4 Sub-Processors supported */
+#define OV5_DRMEM_V2   0x1680  /* ibm,dynamic-reconfiguration-v2 */
 #define OV5_XIVE_SUPPORT   0x17C0  /* XIVE Exploitation Support Mask */
 #define OV5_XIVE_LEGACY0x1700  /* XIVE legacy mode Only */
 #define OV5_XIVE_EXPLOIT   0x1740  /* XIVE exploitation mode Only */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 02190e90c7ae..acf4b2e0530c 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -869,6 +869,7 @@ struct ibm_arch_vec __cacheline_aligned 
ibm_architecture_vec = {
.reserved2 = 0,
.reserved3 = 0,
.subprocessors = 1,
+   .byte22 = OV5_FEAT(OV5_DRMEM_V2),
.intarch = 0,
.mmu = 0,
.hash_ext = 0,
diff --git a/arch/powerpc/platforms/pseries/firmware.c 
b/arch/powerpc/platforms/pseries/firmware.c
index 63cc82ad58ac..aac3ea2911b2 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -114,6 +114,7 @@ static __initdata struct vec5_fw_feature
 vec5_fw_features_table[] = {
{FW_FEATURE_TYPE1_AFFINITY, OV5_TYPE1_AFFINITY},
{FW_FEATURE_PRRN,   OV5_PRRN},
+   {FW_FEATURE_DRMEM_V2,   OV5_DRMEM_V2},
 };
 
 static void __init fw_vec5_feature_init(const char *vec5, unsigned long len)



[PATCH 7/8] powerpc/pseries: Add support for ibm,dynamic-memory-v2 property

2017-10-03 Thread Nathan Fontenot
The Power Hypervisor has introduced a new device tree format for
the property describing the dynamic reconfiguration LMBs for a system.
This new format condenses the size of the property, especially
on large memory systems, to provide an entry per range of LMBs
that possess the same flags and associativity index instead of
specifying an entry per LMB.

This patch updates the drmem code to be able to parse the new
property format at boot and create the LMB array. This also updates
the device tree updating routine to build a device tree property
in this format.
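
As a back-of-the-envelope illustration (hypothetical numbers, not from this
patch): a system with 32768 LMBs that all share the same flags and
associativity index collapses into a single LMB set under v2:

    v1: 4-byte count + one 24-byte of_drconf_cell_v1 per LMB
          4 + 32768 * 24 = 786436 bytes (~768 KB)
    v2: 4-byte count + one 24-byte of_drconf_cell_v2 per LMB set
          4 +     1 * 24 = 28 bytes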

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/include/asm/drmem.h |   12 ++
 arch/powerpc/mm/drmem.c  |  188 ++
 2 files changed, 182 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 32d859c84202..b7becafc528d 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -49,6 +49,18 @@ struct of_drconf_cell_v1 {
u32 flags;
 };
 
+/* Version 2 of the ibm,dynamic-memory property is defined as a
+ * 32-bit value specifying the number of LMB sets followed by an
+ * array of of_drconf_cell_v2 entries, one per LMB set.
+ */
+struct of_drconf_cell_v2 {
+   u32 seq_lmbs;
+   u64 base_addr;
+   u32 drc_index;
+   u32 aa_index;
+   u32 flags;
+} __attribute__((packed));
+
 #define DRCONF_MEM_ASSIGNED    0x00000008
 #define DRCONF_MEM_AI_INVALID  0x00000040
 #define DRCONF_MEM_RESERVED    0x00000080
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 9cd9e680874e..b5b8b8f46292 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -21,25 +21,13 @@
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
 
-int __init init_drmem_lmbs(unsigned long node)
+static int __init init_drmem_lmbs_v1(const __be32 *prop, u32 len)
 {
struct drmem_lmb *lmb;
-   const __be32 *prop;
-   int prop_sz;
-   u32 len;
-
-   prop = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
-   if (!prop || len < dt_root_size_cells * sizeof(__be32))
-   return -1;
-
-   drmem_info->lmb_size = dt_mem_next_cell(dt_root_size_cells, &prop);
-
-   prop = of_get_flat_dt_prop(node, "ibm,dynamic-memory", &len);
-   if (!prop || len < dt_root_size_cells * sizeof(__be32))
-   return -1;
+   u32 prop_sz;
 
drmem_info->n_lmbs = of_read_number(prop++, 1);
-   prop_sz = drmem_info->n_lmbs * sizeof(struct of_drconf_cell)
+   prop_sz = drmem_info->n_lmbs * sizeof(struct of_drconf_cell_v1)
  + sizeof(__be32);
if (prop_sz < len)
return -1;
@@ -61,6 +49,89 @@ int __init init_drmem_lmbs(unsigned long node)
return 0;
 }
 
+static void read_one_drconf_v2_cell(const __be32 **cell,
+   struct of_drconf_cell_v2 *dr_cell)
+{
+   const __be32 *p = *cell;
+
+   dr_cell->seq_lmbs = of_read_number(p++, 1);
+   dr_cell->base_addr = dt_mem_next_cell(dt_root_addr_cells, &p);
+   dr_cell->drc_index = of_read_number(p++, 1);
+   dr_cell->aa_index = of_read_number(p++, 1);
+   dr_cell->flags = of_read_number(p++, 1);
+
+   *cell = p;
+}
+
+static int __init init_drmem_lmbs_v2(const __be32 *prop, u32 len)
+{
+   struct drmem_lmb *lmb;
+   struct of_drconf_cell_v2 dr_cell;
+   const __be32 *p;
+   u32 lmb_sets, prop_sz;
+   int i, j, lmb_index;
+
+   lmb_sets = of_read_number(prop++, 1);
+   prop_sz = lmb_sets * sizeof(struct of_drconf_cell_v2)
+ + sizeof(__be32);
+   if (prop_sz < len)
+   return -1;
+
+   /* first pass, calculate the number of LMBs */
+   p = prop;
+   for (i = 0; i < lmb_sets; i++) {
+   read_one_drconf_v2_cell(&p, &dr_cell);
+   drmem_info->n_lmbs += dr_cell.seq_lmbs;
+   }
+
+   drmem_info->lmbs = alloc_bootmem(drmem_info->n_lmbs * sizeof(*lmb));
+   if (!drmem_info->lmbs)
+   return -1;
+
+   lmb_index = 0;
+   p = prop;
+   for (i = 0; i < lmb_sets; i++) {
+   read_one_drconf_v2_cell(&p, &dr_cell);
+
+   for (j = 0; j < dr_cell.seq_lmbs; j++) {
+   lmb = &drmem_info->lmbs[lmb_index++];
+
+   lmb->base_addr = dr_cell.base_addr;
+   dr_cell.base_addr += drmem_info->lmb_size;
+
+   lmb->drc_index = dr_cell.drc_index;
+   dr_cell.drc_index++;
+
+   lmb->aa_index = dr_cell.aa_index;
+   lmb->flags = dr_cell.flags;
+   }
+   }
+
+   return 0;
+}
+
+int __init init_drmem_lmbs(unsigned long node)
+{
+   const __be32 *prop;
+   u32 len;
+
+   prop = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
+   if (!prop || len < dt_root_size_cells * 

[PATCH 6/8] powerpc: Move of_drconf_cell struct to asm/drmem.h

2017-10-03 Thread Nathan Fontenot
Now that the powerpc code parses dynamic reconfiguration memory
LMB information from the LMB array and not the device tree
directly we can move the of_drconf_cell struct to drmem.h where
it fits better.

In addition, the struct is renamed to of_drconf_cell_v1 in
anticipation of upcoming support for version 2 of the dynamic
reconfiguration property.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/include/asm/drmem.h|   18 ++
 arch/powerpc/include/asm/prom.h |   16 
 arch/powerpc/mm/drmem.c |4 ++--
 arch/powerpc/platforms/pseries/hotplug-memory.c |6 +++---
 4 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index ccd8e1aa0cec..32d859c84202 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -35,6 +35,24 @@ extern struct drmem_lmb_info *drmem_info;
&drmem_info->lmbs[0],   \
&drmem_info->lmbs[drmem_info->n_lmbs - 1])
 
+/* The of_drconf_cell struct defines the layout of the LMB array
+ * specified in the ibm,dynamic-memory device tree property.
+ * The property itself is a 32-bit value specifying the number of
+ * LMBs followed by an array of of_drconf_cell_v1 entries, one
+ * per LMB.
+ */
+struct of_drconf_cell_v1 {
+   u64 base_addr;
+   u32 drc_index;
+   u32 reserved;
+   u32 aa_index;
+   u32 flags;
+};
+
+#define DRCONF_MEM_ASSIGNED    0x00000008
+#define DRCONF_MEM_AI_INVALID  0x00000040
+#define DRCONF_MEM_RESERVED    0x00000080
+
 static inline u32 drmem_lmb_size(void)
 {
return drmem_info->lmb_size;
diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 825bd5998701..f0a30a003bd8 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -80,22 +80,6 @@ extern void of_instantiate_rtc(void);
 
 extern int of_get_ibm_chip_id(struct device_node *np);
 
-/* The of_drconf_cell struct defines the layout of the LMB array
- * specified in the device tree property
- * ibm,dynamic-reconfiguration-memory/ibm,dynamic-memory
- */
-struct of_drconf_cell {
-   u64 base_addr;
-   u32 drc_index;
-   u32 reserved;
-   u32 aa_index;
-   u32 flags;
-};
-
-#define DRCONF_MEM_ASSIGNED    0x00000008
-#define DRCONF_MEM_AI_INVALID  0x00000040
-#define DRCONF_MEM_RESERVED    0x00000080
-
 /*
  * There are two methods for telling firmware what our capabilities are.
  * Newer machines have an "ibm,client-architecture-support" method on the
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 5aaee23b315c..9cd9e680874e 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -100,7 +100,7 @@ static int drmem_update_dt_v1(struct device_node *memory,
  struct property *prop)
 {
struct property *new_prop;
-   struct of_drconf_cell *dr_cell;
+   struct of_drconf_cell_v1 *dr_cell;
struct drmem_lmb *lmb;
u32 *p;
 
@@ -111,7 +111,7 @@ static int drmem_update_dt_v1(struct device_node *memory,
p = new_prop->value;
*p++ = cpu_to_be32(drmem_info->n_lmbs);
 
-   dr_cell = (struct of_drconf_cell *)p;
+   dr_cell = (struct of_drconf_cell_v1 *)p;
 
for_each_drmem_lmb(lmb) {
dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 11a9982b163d..76f558bf84f3 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -996,7 +996,7 @@ static int pseries_add_mem_node(struct device_node *np)
 
 static int pseries_update_drconf_memory(struct of_reconfig_data *pr)
 {
-   struct of_drconf_cell *new_drmem, *old_drmem;
+   struct of_drconf_cell_v1 *new_drmem, *old_drmem;
unsigned long memblock_size;
u32 entries;
__be32 *p;
@@ -1019,11 +1019,11 @@ static int pseries_update_drconf_memory(struct 
of_reconfig_data *pr)
 * of_drconf_cell's.
 */
entries = be32_to_cpu(*p++);
-   old_drmem = (struct of_drconf_cell *)p;
+   old_drmem = (struct of_drconf_cell_v1 *)p;
 
p = (__be32 *)pr->prop->value;
p++;
-   new_drmem = (struct of_drconf_cell *)p;
+   new_drmem = (struct of_drconf_cell_v1 *)p;
 
for (i = 0; i < entries; i++) {
if ((be32_to_cpu(old_drmem[i].flags) & DRCONF_MEM_ASSIGNED) &&



[PATCH 5/8] powerpc/pseries: Update memory hotplug code to use drmem LMB array

2017-10-03 Thread Nathan Fontenot
Update the pseries memory hotplug code to use the newly added
dynamic reconfiguration LMB array. Doing this is required for the
upcoming support of version 2 of the dynamic reconfiguration
device tree property.

In addition, making this change cleans up the code that parses the
LMB information as we no longer need to worry about device tree
format. This allows us to discard one of the first steps on memory
hotplug where we make a working copy of the device tree property and
convert the entire property to cpu format. Instead we just use the
LMB array directly while holding the memory hotplug lock.

This patch also moves the updating of the device tree property to
powerpc/mm/drmem.c. This allows the hotplug code to work without
needing to know the device tree format and provides a single
routine for updating the device tree property. This new routine
will handle determination of the proper device tree format to
output to.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/include/asm/drmem.h|   18 +
 arch/powerpc/mm/drmem.c |   81 
 arch/powerpc/platforms/pseries/hotplug-memory.c |  516 +--
 3 files changed, 297 insertions(+), 318 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 912712ecf6c6..ccd8e1aa0cec 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -40,7 +40,25 @@ static inline u32 drmem_lmb_size(void)
return drmem_info->lmb_size;
 }
 
+#define DRMEM_LMB_RESERVED 0x80000000
+
+static inline void drmem_mark_lmb_reserved(struct drmem_lmb *lmb)
+{
+   lmb->flags |= DRMEM_LMB_RESERVED;
+}
+
+static inline void drmem_remove_lmb_reservation(struct drmem_lmb *lmb)
+{
+   lmb->flags &= ~DRMEM_LMB_RESERVED;
+}
+
+static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
+{
+   return lmb->flags & DRMEM_LMB_RESERVED;
+}
+
 extern int __init init_drmem_lmbs(unsigned long node);
 extern u64 drmem_lmb_memory_max(void);
+extern int drmem_update_dt(void);
 
 #endif
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 96e453e1fdd7..5aaee23b315c 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -69,6 +69,87 @@ u64 drmem_lmb_memory_max(void)
return last_lmb->base_addr + drmem_lmb_size();
 }
 
+static u32 drmem_lmb_flags(struct drmem_lmb *lmb)
+{
+   return lmb->flags & ~DRMEM_LMB_RESERVED;
+}
+
+static struct property *clone_property(struct property *prop, u32 prop_sz)
+{
+   struct property *new_prop;
+
+   new_prop = kzalloc(sizeof(*new_prop), GFP_KERNEL);
+   if (!new_prop)
+   return NULL;
+
+   new_prop->name = kstrdup(prop->name, GFP_KERNEL);
+   new_prop->value = kzalloc(prop_sz, GFP_KERNEL);
+   if (!new_prop->name || !new_prop->value) {
+   kfree(new_prop->name);
+   kfree(new_prop->value);
+   kfree(new_prop);
+   return NULL;
+   }
+
+   new_prop->length = prop_sz;
+   of_property_set_flag(new_prop, OF_DYNAMIC);
+   return new_prop;
+}
+
+static int drmem_update_dt_v1(struct device_node *memory,
+ struct property *prop)
+{
+   struct property *new_prop;
+   struct of_drconf_cell *dr_cell;
+   struct drmem_lmb *lmb;
+   u32 *p;
+
+   new_prop = clone_property(prop, prop->length);
+   if (!new_prop)
+   return -1;
+
+   p = new_prop->value;
+   *p++ = cpu_to_be32(drmem_info->n_lmbs);
+
+   dr_cell = (struct of_drconf_cell *)p;
+
+   for_each_drmem_lmb(lmb) {
+   dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
+   dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
+   dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
+
+   /* Do not copy out the bit we use internally to mark
+* an lmb as reserved during hotplug processing.
+*/
+   dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
+
+   dr_cell++;
+   }
+
+   of_update_property(memory, new_prop);
+   return 0;
+}
+
+int drmem_update_dt(void)
+{
+   struct device_node *memory;
+   struct property *prop;
+   int rc;
+
+   memory = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+   if (!memory)
+   return -1;
+
+   prop = of_find_property(memory, "ibm,dynamic-memory", NULL);
+   if (prop)
+   rc = drmem_update_dt_v1(memory, prop);
+   else
+   rc = -1;
+
+   of_node_put(memory);
+   return rc;
+}
+
 static int __init drmem_init(void)
 {
struct drmem_lmb *lmbs;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 1d48ab424bd9..11a9982b163d 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ 

[PATCH 4/8] powerpc/numa: Update numa code to use drmem LMB array

2017-10-03 Thread Nathan Fontenot
Update code in powerpc/numa.c to use the array of dynamic
reconfiguration memory LMBs instead of parsing the device tree
property directly. This allows for the removal of several
helper routines and makes gathering LMB information easier.

This patch also prepares the numa code for support of the
version 2 dynamic reconfiguration memory property.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/include/asm/drmem.h |1 
 arch/powerpc/mm/drmem.c  |9 ++
 arch/powerpc/mm/numa.c   |  153 --
 3 files changed, 40 insertions(+), 123 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index cafe1a3b7da6..912712ecf6c6 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -41,5 +41,6 @@ static inline u32 drmem_lmb_size(void)
 }
 
 extern int __init init_drmem_lmbs(unsigned long node);
+extern u64 drmem_lmb_memory_max(void);
 
 #endif
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 8ad7cf36b2c4..96e453e1fdd7 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -61,6 +61,14 @@ int __init init_drmem_lmbs(unsigned long node)
return 0;
 }
 
+u64 drmem_lmb_memory_max(void)
+{
+   struct drmem_lmb *last_lmb;
+
+   last_lmb = &drmem_info->lmbs[drmem_info->n_lmbs - 1];
+   return last_lmb->base_addr + drmem_lmb_size();
+}
+
 static int __init drmem_init(void)
 {
struct drmem_lmb *lmbs;
@@ -81,4 +89,3 @@ static int __init drmem_init(void)
 }
 
 late_initcall(drmem_init);
-
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index a9aa353d41cd..d03a57100d4f 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static int numa_enabled = 1;
 
@@ -395,69 +396,6 @@ static unsigned long read_n_cells(int n, const __be32 
**buf)
return result;
 }
 
-/*
- * Read the next memblock list entry from the ibm,dynamic-memory property
- * and return the information in the provided of_drconf_cell structure.
- */
-static void read_drconf_cell(struct of_drconf_cell *drmem, const __be32 
**cellp)
-{
-   const __be32 *cp;
-
-   drmem->base_addr = read_n_cells(n_mem_addr_cells, cellp);
-
-   cp = *cellp;
-   drmem->drc_index = of_read_number(cp, 1);
-   drmem->reserved = of_read_number(&cp[1], 1);
-   drmem->aa_index = of_read_number(&cp[2], 1);
-   drmem->flags = of_read_number(&cp[3], 1);
-
-   *cellp = cp + 4;
-}
-
-/*
- * Retrieve and validate the ibm,dynamic-memory property of the device tree.
- *
- * The layout of the ibm,dynamic-memory property is a number N of memblock
- * list entries followed by N memblock list entries.  Each memblock list entry
- * contains information as laid out in the of_drconf_cell struct above.
- */
-static int of_get_drconf_memory(struct device_node *memory, const __be32 **dm)
-{
-   const __be32 *prop;
-   u32 len, entries;
-
-   prop = of_get_property(memory, "ibm,dynamic-memory", );
-   if (!prop || len < sizeof(unsigned int))
-   return 0;
-
-   entries = of_read_number(prop++, 1);
-
-   /* Now that we know the number of entries, revalidate the size
-* of the property read in to ensure we have everything
-*/
-   if (len < (entries * (n_mem_addr_cells + 4) + 1) * sizeof(unsigned int))
-   return 0;
-
-   *dm = prop;
-   return entries;
-}
-
-/*
- * Retrieve and validate the ibm,lmb-size property for drconf memory
- * from the device tree.
- */
-static u64 of_get_lmb_size(struct device_node *memory)
-{
-   const __be32 *prop;
-   u32 len;
-
-   prop = of_get_property(memory, "ibm,lmb-size", );
-   if (!prop || len < sizeof(unsigned int))
-   return 0;
-
-   return read_n_cells(n_mem_size_cells, &prop);
-}
-
 struct assoc_arrays {
u32 n_arrays;
u32 array_sz;
@@ -509,7 +447,7 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
  */
-static int of_drconf_to_nid_single(struct of_drconf_cell *drmem,
+static int of_drconf_to_nid_single(struct drmem_lmb *lmb,
   struct assoc_arrays *aa)
 {
int default_nid = 0;
@@ -517,16 +455,16 @@ static int of_drconf_to_nid_single(struct of_drconf_cell 
*drmem,
int index;
 
if (min_common_depth > 0 && min_common_depth <= aa->array_sz &&
-   !(drmem->flags & DRCONF_MEM_AI_INVALID) &&
-   drmem->aa_index < aa->n_arrays) {
-   index = drmem->aa_index * aa->array_sz + min_common_depth - 1;
+   !(lmb->flags & DRCONF_MEM_AI_INVALID) &&
+   lmb->aa_index < aa->n_arrays) {
+   index = lmb->aa_index * aa->array_sz + min_common_depth - 1;
nid = of_read_number(&aa->arrays[index], 1);
 
  

[PATCH 3/8] powerpc/mm: Separate ibm,dynamic-memory data from DT format

2017-10-03 Thread Nathan Fontenot
We currently have code to parse the dynamic reconfiguration LMB
information from the ibm,dynamic-memory device tree property in
multiple locations (numa.c, prom.c, and pseries/hotplug-memory.c).
In anticipation of adding support for version 2 of the
ibm,dynamic-memory property, this patch aims to separate the device
tree information from the device tree format.

To do this an array of per-LMB information is created at boot time
and any pieces of code that need to look at LMB information are then
updated to use this array instead of parsing the device tree directly.

Doing this provides several benefits. The device tree property
parsing is located in a single place, it makes walking through the
possible LMBs on a system easier, and provides a common data structure
so that any piece of the kernel needing LMB information does not
have to provide multiple parsing routines for the supported device
tree formats.

This patch consolidates the device tree parsing into a common
location, builds the LMB array at boot time, and updates the prom
code to use the new LMB array.
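
As a sketch of the intended usage, a consumer can now walk every possible
LMB with the helpers this patch introduces (the pr_debug() body is
illustrative):

    struct drmem_lmb *lmb;

    for_each_drmem_lmb(lmb) {
            if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
                    continue;
            pr_debug("LMB %#x at %#llx, size %u\n", lmb->drc_index,
                     (unsigned long long)lmb->base_addr, drmem_lmb_size());
    }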

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/include/asm/drmem.h |   45 ++
 arch/powerpc/kernel/prom.c   |  126 +++---
 arch/powerpc/mm/Makefile |2 -
 arch/powerpc/mm/drmem.c  |   84 +
 4 files changed, 194 insertions(+), 63 deletions(-)
 create mode 100644 arch/powerpc/include/asm/drmem.h
 create mode 100644 arch/powerpc/mm/drmem.c

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
new file mode 100644
index 000000000000..cafe1a3b7da6
--- /dev/null
+++ b/arch/powerpc/include/asm/drmem.h
@@ -0,0 +1,45 @@
+/*
+ * drmem.h: Power specific logical memory block representation
+ *
+ * Copyright 2017 IBM Corporation
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef _ASM_POWERPC_LMB_H
+#define _ASM_POWERPC_LMB_H
+
+struct drmem_lmb {
+   u64 base_addr;
+   u32 drc_index;
+   u32 aa_index;
+   u32 flags;
+};
+
+struct drmem_lmb_info {
+   struct drmem_lmb*lmbs;
+   int n_lmbs;
+   u32 lmb_size;
+};
+
+extern struct drmem_lmb_info *drmem_info;
+
+#define for_each_drmem_lmb_in_range(lmb, start, end)   \
+   for ((lmb) = (start); (lmb) <= (end); (lmb)++)
+
+#define for_each_drmem_lmb(lmb)\
+   for_each_drmem_lmb_in_range((lmb),  \
+   &drmem_info->lmbs[0],   \
+   &drmem_info->lmbs[drmem_info->n_lmbs - 1])
+
+static inline u32 drmem_lmb_size(void)
+{
+   return drmem_info->lmb_size;
+}
+
+extern int __init init_drmem_lmbs(unsigned long node);
+
+#endif
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index f83056297441..a41d6348ef9e 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -454,80 +455,74 @@ static int __init early_init_dt_scan_chosen_ppc(unsigned 
long node,
 
 #ifdef CONFIG_PPC_PSERIES
 /*
- * Interpret the ibm,dynamic-memory property in the
- * /ibm,dynamic-reconfiguration-memory node.
+ * Interpret the ibm dynamic reconfiguration memory LMBs.
  * This contains a list of memory blocks along with NUMA affinity
  * information.
  */
-static int __init early_init_dt_scan_drconf_memory(unsigned long node)
+
+static void __init add_drmem_lmb(struct drmem_lmb *lmb, const __be32 **usm)
 {
-   const __be32 *dm, *ls, *usm;
-   int l;
-   unsigned long n, flags;
-   u64 base, size, memblock_size;
-   unsigned int is_kexec_kdump = 0, rngs;
+   u64 base, size;
+   int is_kexec_kdump = 0, rngs;
 
-   ls = of_get_flat_dt_prop(node, "ibm,lmb-size", &l);
-   if (ls == NULL || l < dt_root_size_cells * sizeof(__be32))
-   return 0;
-   memblock_size = dt_mem_next_cell(dt_root_size_cells, &ls);
+   base = lmb->base_addr;
+   size = drmem_lmb_size();
+   rngs = 1;
 
-   dm = of_get_flat_dt_prop(node, "ibm,dynamic-memory", &l);
-   if (dm == NULL || l < sizeof(__be32))
-   return 0;
+   if (*usm)
+   is_kexec_kdump = 1;
 
-   n = of_read_number(dm++, 1);/* number of entries */
-   if (l < (n * (dt_root_addr_cells + 4) + 1) * sizeof(__be32))
-   return 0;
+   if (is_kexec_kdump) {
+   /*
+* For each memblock in ibm,dynamic-memory, a
+* corresponding entry in linux,drconf-usable-memory
+* property contains a counter 'p' followed by 'p'
+* (base, size) duple. Now read the 

[PATCH 2/8] powerpc/numa: Look up device node in of_get_usable_memory()

2017-10-03 Thread Nathan Fontenot
Look up the device node for the usable memory property instead
of having it passed in as a parameter. This change precedes an update
in which the calling routines for of_get_usable_memory() will not have
the device node pointer to pass in.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/mm/numa.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index ca5cc1d4d387..a9aa353d41cd 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -184,11 +184,19 @@ static const __be32 *of_get_associativity(struct 
device_node *dev)
  * it exists (the property exists only in kexec/kdump kernels,
  * added by kexec-tools)
  */
-static const __be32 *of_get_usable_memory(struct device_node *memory)
+static const __be32 *of_get_usable_memory(void)
 {
+   struct device_node *memory;
const __be32 *prop;
u32 len;
+
+   memory = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+   if (!memory)
+   return NULL;
+
prop = of_get_property(memory, "linux,drconf-usable-memory", );
+   of_node_put(memory);
+
if (!prop || len < sizeof(unsigned int))
return NULL;
return prop;
@@ -674,7 +682,7 @@ static void __init parse_drconf_memory(struct device_node 
*memory)
return;
 
/* check if this is a kexec/kdump kernel */
-   usm = of_get_usable_memory(memory);
+   usm = of_get_usable_memory();
if (usm != NULL)
is_kexec_kdump = 1;
 



[PATCH 1/8] powerpc/numa: Look up device node in of_get_assoc_arrays()

2017-10-03 Thread Nathan Fontenot
Look up the device node for the associativity array property instead
of having it passed in as a parameter. This change precedes an update
in which the calling routines for of_get_assoc_arrays() will not have
the device node pointer to pass in.

Signed-off-by: Nathan Fontenot 
---
 arch/powerpc/mm/numa.c |   18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b95c584ce19d..ca5cc1d4d387 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -466,19 +466,27 @@ struct assoc_arrays {
  * indicating the size of each associativity array, followed by a list
  * of N associativity arrays.
  */
-static int of_get_assoc_arrays(struct device_node *memory,
-  struct assoc_arrays *aa)
+static int of_get_assoc_arrays(struct assoc_arrays *aa)
 {
+   struct device_node *memory;
const __be32 *prop;
u32 len;
 
+   memory = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+   if (!memory)
+   return -1;
+
prop = of_get_property(memory, "ibm,associativity-lookup-arrays", );
-   if (!prop || len < 2 * sizeof(unsigned int))
+   if (!prop || len < 2 * sizeof(unsigned int)) {
+   of_node_put(memory);
return -1;
+   }
 
aa->n_arrays = of_read_number(prop++, 1);
aa->array_sz = of_read_number(prop++, 1);
 
+   of_node_put(memory);
+
/* Now that we know the number of arrays and size of each array,
 * revalidate the size of the property read in.
 */
@@ -661,7 +669,7 @@ static void __init parse_drconf_memory(struct device_node 
*memory)
if (!lmb_size)
return;
 
-   rc = of_get_assoc_arrays(memory, &aa);
+   rc = of_get_assoc_arrays(&aa);
if (rc)
return;
 
@@ -996,7 +1004,7 @@ static int hot_add_drconf_scn_to_nid(struct device_node 
*memory,
if (!lmb_size)
return -1;
 
-   rc = of_get_assoc_arrays(memory, &aa);
+   rc = of_get_assoc_arrays(&aa);
if (rc)
return -1;
 



[PATCH 0/8] Support for ibm,dynamic-memory-v2 device tree property

2017-10-03 Thread Nathan Fontenot
This patch set provides a set of updates to de-couple the LMB information
provided in the ibm,dynamic-memory device tree property from the device
tree property format. The goal is to provide a data set of LMB information
so that consumers of this data do not need to understand and provide
multiple parsing routines for the supported device tree property formats.

The first two patches update the of_get_assoc_arrays() and
of_get_usable_memory() routines to look up the device node for the
properties they parse. This is needed because the calling routines for
these two functions will not have the device node to pass in in
subsequent patches.

The third patch adds a new kernel structure, struct drmem_lmb, that
is used to represent each of the possible LMBs specified in the
ibm,dynamic-reconfiguration* device tree properties. The patch adds code
to parse the property and build the LMB array data, and updates prom.c
to use this new data structure instead of parsing the device tree directly.

The fourth and fifth patches update the numa and pseries hotplug code
respectively to use the new LMB array data instead of parsing the
device tree directly.

The sixth patch moves the of_drconf_cell struct to drmem.h, where it
fits better than in prom.h.

The seventh patch introduces support for the ibm,dynamic-memory-v2
property format by updating the new drmem.c code to be able to parse
and create this new device tree format.

The last patch in the series updates the architecture vector to indicate
support for ibm,dynamic-memory-v2.


-Nathan
---

Nathan Fontenot (8):
  powerpc/numa: Look up device node in of_get_assoc_arrays()
  powerpc/numa: Look up device node in of_get_usable_memory()
  powerpc/mm: Separate ibm,dynamic-memory data from DT format
  powerpc/numa: Update numa code to use drmem LMB array
  powerpc/pseries: Update memory hotplug code to use drmem LMB array
  powerpc: Move of_drconf_cell struct to asm/drmem.h
  powerpc/pseries: Add support for ibm,dynamic-memory-v2 property
  powerpc: Enable support of ibm,dynamic-memory-v2


 arch/powerpc/include/asm/drmem.h|   94 
 arch/powerpc/include/asm/firmware.h |3 
 arch/powerpc/include/asm/prom.h |   17 -
 arch/powerpc/kernel/prom.c  |  126 +++---
 arch/powerpc/kernel/prom_init.c |1 
 arch/powerpc/mm/Makefile|2 
 arch/powerpc/mm/drmem.c |  324 ++
 arch/powerpc/mm/numa.c  |  183 ++--
 arch/powerpc/platforms/pseries/firmware.c   |1 
 arch/powerpc/platforms/pseries/hotplug-memory.c |  522 +--
 10 files changed, 743 insertions(+), 530 deletions(-)
 create mode 100644 arch/powerpc/include/asm/drmem.h
 create mode 100644 arch/powerpc/mm/drmem.c



Re: [PATCH v2 4/5] powerpc: pseries: only store the device node basename in full_name

2017-10-03 Thread Rob Herring
On Tue, Oct 3, 2017 at 4:26 AM, Michael Ellerman  wrote:
> Hi Rob,
>
> Unfortunately this one has a bug, which only showed up after some stress
> testing.
>
> Rob Herring  writes:
>> With dependencies on full_name containing the entire device node path
>> removed, stop storing the full_name in nodes created by
>> dlpar_configure_connector() and pSeries_reconfig_add_node().
>>
>> Signed-off-by: Rob Herring 
>> Cc: Benjamin Herrenschmidt 
>> Cc: Paul Mackerras 
>> Cc: Michael Ellerman 
>> Cc: linuxppc-dev@lists.ozlabs.org
>> ---
>> v2:
>> - Rework dlpar_parse_cc_node() to a single allocation and strcpy instead
>>   of kasprintf
>>
>>  arch/powerpc/platforms/pseries/dlpar.c| 30 
>> +++---
>>  arch/powerpc/platforms/pseries/reconfig.c |  2 +-
>>  2 files changed, 8 insertions(+), 24 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
>> b/arch/powerpc/platforms/pseries/dlpar.c
>> index 783f36364690..31a0615aeea1 100644
>> --- a/arch/powerpc/platforms/pseries/dlpar.c
>> +++ b/arch/powerpc/platforms/pseries/dlpar.c
>> @@ -75,28 +75,17 @@ static struct property *dlpar_parse_cc_property(struct 
>> cc_workarea *ccwa)
>>   return prop;
>>  }
>>
>> -static struct device_node *dlpar_parse_cc_node(struct cc_workarea *ccwa,
>> -const char *path)
>> +static struct device_node *dlpar_parse_cc_node(struct cc_workarea *ccwa)
>>  {
>>   struct device_node *dn;
>> - char *name;
>> -
>> - /* If parent node path is "/" advance path to NULL terminator to
>> -  * prevent double leading slashs in full_name.
>> -  */
>> - if (!path[1])
>> - path++;
>> + const char *name = (const char *)ccwa + be32_to_cpu(ccwa->name_offset);
>>
>> - dn = kzalloc(sizeof(*dn), GFP_KERNEL);
>> + dn = kzalloc(sizeof(*dn) + strlen(name) + 1, GFP_KERNEL);
>>   if (!dn)
>>   return NULL;
>>
>> - name = (char *)ccwa + be32_to_cpu(ccwa->name_offset);
>> - dn->full_name = kasprintf(GFP_KERNEL, "%s/%s", path, name);
>> - if (!dn->full_name) {
>> - kfree(dn);
>> - return NULL;
>> - }
>> + dn->full_name = (char *)(dn + 1);
>> + strcpy((char *)dn->full_name, name);
>
> Allocating the full_name on the tail of the struct this way breaks the
> call to kfree(node->full_name) in of_node_release().
>
> eg:
>
>
> [  322.543581] Freeing node->full_name c007f4ed4090
> [  322.543583] 
> =
> [  322.543586] BUG kmalloc-192 (Tainted: GB  ): Invalid object 
> pointer 0xc007f4ed4090
> [  322.543588] 
> -
>
> [  322.543592] INFO: Slab 0xf1fd3b40 objects=292 used=72 
> fp=0xc007f4ed41a8 flags=0x1380101
> [  322.543595] CPU: 40 PID: 5 Comm: kworker/u192:0 Tainted: GB   
> 4.14.0-rc2-gcc-6.3.1-next-20170929-dirty #429
> [  322.543600] Workqueue: pseries hotplug workque pseries_hp_work_fn
> [  322.543602] Call Trace:
> [  322.543605] [c003f769f660] [c0a6b710] dump_stack+0xb0/0xf0 
> (unreliable)
> [  322.543609] [c003f769f6a0] [c032cd5c] slab_err+0x9c/0xc0
> [  322.543612] [c003f769f790] [c033440c] 
> free_debug_processing+0x19c/0x490
> [  322.543615] [c003f769f860] [c03349fc] __slab_free+0x2fc/0x4b0
> [  322.543619] [c003f769f960] [c08f2484] 
> of_node_release+0xe4/0x1a0
> [  322.543622] [c003f769fa00] [c0a71c04] kobject_put+0xd4/0x160
> [  322.543625] [c003f769fa80] [c08f10c4] of_node_put+0x34/0x50
> [  322.543628] [c003f769fab0] [c00c0248] 
> dlpar_cpu_remove_by_index+0x108/0x130
> [  322.543631] [c003f769fb40] [c00c166c] dlpar_cpu+0x27c/0x510
> [  322.543634] [c003f769fbf0] [c00bb2d0] 
> handle_dlpar_errorlog+0xc0/0x160
> [  322.543638] [c003f769fc60] [c00bb3ac] 
> pseries_hp_work_fn+0x3c/0xa0
> [  322.543641] [c003f769fc90] [c012200c] 
> process_one_work+0x2bc/0x560
> [  322.543645] [c003f769fd20] [c0122348] worker_thread+0x98/0x5d0
> [  322.543648] [c003f769fdc0] [c012b4e4] kthread+0x164/0x1b0
> [  322.543651] [c003f769fe30] [c000bae0] 
> ret_from_kernel_thread+0x5c/0x7c
>
>
> The obvious fix is just to allocate it separately as before, eg ~=:

Yes, I'll go back to doing 2 allocs like v1, but using kstrdup as was
also pointed out.
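
Roughly (untested sketch of what the v3 version would look like, combining
v1's two allocations with kstrdup):

static struct device_node *dlpar_parse_cc_node(struct cc_workarea *ccwa)
{
	const char *name = (const char *)ccwa + be32_to_cpu(ccwa->name_offset);
	struct device_node *dn;

	dn = kzalloc(sizeof(*dn), GFP_KERNEL);
	if (!dn)
		return NULL;

	/* separate allocation, so of_node_release() can kfree() it */
	dn->full_name = kstrdup(name, GFP_KERNEL);
	if (!dn->full_name) {
		kfree(dn);
		return NULL;
	}

	of_node_set_flag(dn, OF_DYNAMIC);
	of_node_init(dn);

	return dn;
}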

Rob


Re: [PATCH] mm: remove unnecessary WARN_ONCE in page_vma_mapped_walk().

2017-10-03 Thread Kirill A. Shutemov
On Tue, Oct 03, 2017 at 10:26:06AM -0400, Zi Yan wrote:
> From: Zi Yan 
> 
> A non present pmd entry can appear after pmd_lock is taken in
> page_vma_mapped_walk(), even if THP migration is not enabled.
> The WARN_ONCE is unnecessary.
> 
> Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
> Reported-and-tested-by: Abdul Haleem 
> Signed-off-by: Zi Yan 
> Cc: "Kirill A. Shutemov" 
> Cc: Anshuman Khandual 
> Cc: Andrew Morton 

Acked-by: Kirill A. Shutemov 

-- 
 Kirill A. Shutemov


Re: [PATCH 1/1] KVM: PPC: Book3S: Fix server always zero from kvmppc_xive_get_xive()

2017-10-03 Thread Radim Krčmář
2017-10-02, David Gibson:
> Paolo,
> 
> Here's BenH's ack.  Again, this is a pretty important fix for us, and
> Paulus is away.  Can you take this into the KVM tree please.

Applied to kvm/master, thanks.

(Turns out I'm not on any cc'd list, so the author date might be off.)


RE: [PATCH 02/11] x86: make dma_cache_sync a no-op

2017-10-03 Thread David Laight
From: Christoph Hellwig
> Sent: 03 October 2017 11:43
> x86 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
> make any sense to do any work in dma_cache_sync given that it must be a
> no-op when dma_alloc_attrs returns coherent memory.

I believe it is just about possible to require an explicit
write flush on x86.
ISTR this can happen with something like write combining.

Whether this can actually happen in the kernel is another matter.
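
(A minimal sketch of that case, assuming wc_buf is an ioremap_wc() mapping:

	memcpy_toio(wc_buf, data, len);	/* stores may sit in WC buffers */
	wmb();				/* sfence on x86, drains WC buffers */

That is a barrier on the mapping though, not cache maintenance on memory
returned by dma_alloc_attrs.)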

David



Re: [PATCH v9 03/12] mm: deferred_init_memmap improvements

2017-10-03 Thread Pasha Tatashin

Hi Michal,

Are you OK if I replace the DEFERRED_FREE() macro with a function like this:

/*
 * Helper for deferred_init_range, free the given range, and reset the
 * counters
 */
static inline unsigned long __def_free(unsigned long *nr_free,
   unsigned long *free_base_pfn,
   struct page **page)
{
unsigned long nr = *nr_free;

deferred_free_range(*free_base_pfn, nr);
*free_base_pfn = 0;
*nr_free = 0;
*page = NULL;

return nr;
}

Since it is inline, and we operate on non-volatile counters, the compiler 
will be smart enough to remove all the unnecessary dereferences. As a 
plus, we won't be adding any new branches, and the code is still going 
to stay compact.
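
The call sites then collapse to one line each, e.g.:

	/* flush the accumulated range and account the freed pages */
	nr_pages += __def_free(&nr_free, &free_base_pfn, &page);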


Pasha

On 10/03/2017 11:15 AM, Pasha Tatashin wrote:

Hi Michal,



Please be explicit that this is possible only because we discard
memblock data later after 3010f876500f ("mm: discard memblock data
later"). Also be more explicit how the new code works.


OK



I like how the resulting code is more compact and smaller.


That was the goal :)


for_each_free_mem_range also looks more appropriate but I really detest
the DEFERRED_FREE thingy. Maybe we can handle all that in a single goto
section. I know this is not an art but manipulating variables from
macros is more error prone and much more ugly IMHO.


Sure, I can re-arrange to have a goto section. The function won't be as 
small, and if the compiler is not smart enough we might end up with more 
branches than what my current code has.




please do not use macros. Btw. this deserves its own fix. I suspect that
no CONFIG_HOLES_IN_ZONE arch enables DEFERRED_STRUCT_PAGE_INIT but
purely from the review point of view it should be its own patch.


Sure, I will submit this patch separately from the rest of the project. 
In my opinion DEFERRED_STRUCT_PAGE_INIT is the way of the future, so we 
should make sure it is working with as many configs as possible.


Thank you,
Pasha


Re: [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap

2017-10-03 Thread Pasha Tatashin

On 10/03/2017 09:19 AM, Michal Hocko wrote:

On Wed 20-09-17 16:17:14, Pavel Tatashin wrote:

vmemmap_alloc_block() will no longer zero the block, so zero memory
at its call sites for everything except struct pages.  Struct page memory
is zero'd by struct page initialization.

Replace allocators in sparse-vmemmap to use the non-zeroing version. So,
we will get the performance improvement by zeroing the memory in parallel
when struct pages are zeroed.


Is it possible to merge this patch with 
http://lkml.kernel.org/r/20170920201714.19817-7-pasha.tatas...@oracle.com


Yes, I will do that. It would also require re-arranging
[PATCH v9 07/12] sparc64: optimized struct page zeroing
optimization to come after this patch.

Pasha


Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-03 Thread Pasha Tatashin

On 10/03/2017 09:18 AM, Michal Hocko wrote:

On Wed 20-09-17 16:17:10, Pavel Tatashin wrote:

Some memory is reserved but unavailable: not present in memblock.memory
(because not backed by physical pages), but present in memblock.reserved.
Such memory has backing struct pages, but they are not initialized by going
through __init_single_page().


Could you be more specific where is such a memory reserved?



I know of one example: trim_low_memory_range() unconditionally reserves 
from pfn 0, but e820__memblock_setup() might provide the existing memory 
from pfn 1 (i.e. KVM).


But, there could be more based on this comment from linux/page-flags.h:

 * PG_reserved is set for special pages, which can never be swapped out.
 * Some of them might not even exist (eg empty_bad_page)...
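
For the record, the patch handles these by walking memblock.reserved for
ranges not covered by memblock.memory and zeroing their struct pages;
roughly (simplified sketch of the patch):

static void __init zero_resv_unavail(void)
{
	phys_addr_t start, end;
	unsigned long pfn;
	u64 i;

	/* ranges that are in memblock.reserved but not in memblock.memory */
	for_each_resv_unavail_range(i, &start, &end) {
		for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++)
			mm_zero_struct_page(pfn_to_page(pfn));
	}
}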

Pasha


Re: [PATCH v9 06/12] mm: zero struct pages during initialization

2017-10-03 Thread Pasha Tatashin



On 10/03/2017 09:08 AM, Michal Hocko wrote:

On Wed 20-09-17 16:17:08, Pavel Tatashin wrote:

Add struct page zeroing as a part of initialization of other fields in
__init_single_page().

This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):

                         BASE            FIX
sparse_init      11.244671836s   0.007199623s
zone_sizes_init   4.879775891s   8.355182299s
                 ----------------------------
Total            16.124447727s   8.362381922s


Hmm, this is confusing. This assumes that sparse_init doesn't zero pages
anymore, right? So these number depend on the last patch in the series?


Correct, without the last patch sparse_init time won't change.

Pasha


Re: [PATCH v9 04/12] sparc64: simplify vmemmap_populate

2017-10-03 Thread Pasha Tatashin




Acked-by: Michal Hocko 


Thank you,
Pasha


Re: [PATCH v9 03/12] mm: deferred_init_memmap improvements

2017-10-03 Thread Pasha Tatashin

Hi Michal,



Please be explicit that this is possible only because we discard
memblock data later after 3010f876500f ("mm: discard memblock data
later"). Also be more explicit how the new code works.


OK



I like how the resulting code is more compact and smaller.


That was the goal :)


for_each_free_mem_range also looks more appropriate but I really detest
the DEFERRED_FREE thingy. Maybe we can handle all that in a single goto
section. I know this is not an art but manipulating variables from
macros is more error prone and much more ugly IMHO.


Sure, I can re-arrange to have a goto section. The function won't be as 
small, and if the compiler is not smart enough we might end up with more 
branches than what my current code has.




please do not use macros. Btw. this deserves its own fix. I suspect that
no CONFIG_HOLES_IN_ZONE arch enables DEFERRED_STRUCT_PAGE_INIT but
purely from the review point of view it should be its own patch.


Sure, I will submit this patch separately from the rest of the project. 
In my opinion DEFERRED_STRUCT_PAGE_INIT is the way of the future, so we 
should make sure it is working with as many configs as possible.


Thank you,
Pasha


Re: [PATCH v9 02/12] sparc64/mm: setting fields in deferred pages

2017-10-03 Thread Pasha Tatashin


As you separated x86 and sparc patches doing essentially the same I
assume David is going to take this patch?


Correct, I noticed that usually platform specific changes are done in 
separate patches even if they are small. Dave already Acked this patch. 
So, I do not think it should be separated from the rest of the patches 
when this project goes into the mm-tree.




Acked-by: Michal Hocko 


Thank you,
Pasha



Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function

2017-10-03 Thread Pasha Tatashin

Hi Mark,

I considered using a new *populate() function for the shadow without using 
vmemmap_populate(), but that makes things unnecessarily complicated: 
vmemmap_populate() has built-in:


1. large page support
2. device memory support
3. node locality support
4. several config based variants on different platforms

All of that will cause the code to simply be duplicated on each platform 
if we want to support that in kasan.


We could limit ourselves to only supporting base pages in memory by 
using something like vmemmap_populate_basepages(), but that is a step 
backward.  Kasan benefits from using large pages now, why remove it?


So, the solution I provide is walking the page table right after memory is 
mapped. Since we are using the actual page table, it is guaranteed that 
we are not going to miss any mapped memory, and it is also in common 
code, which makes things smaller and nicer.


Thank you,
Pasha

On 10/03/2017 10:48 AM, Mark Rutland wrote:


I've given this a spin on arm64, and can confirm that it works.

Given that this involves redundant walking of page tables, I still think
it'd be preferable to have some common *_populate() helper that took a
gfp argument, but I guess it's not the end of the world.

I'll leave it to Will and Catalin to say whether they're happy with the
page table walking and the new p{u,m}d_large() helpers added to arm64.



Re: [PATCH v9 01/12] x86/mm: setting fields in deferred pages

2017-10-03 Thread Pasha Tatashin

Hi Michal,



I hope I haven't missed anything but it looks good to me.

Acked-by: Michal Hocko 


Thank you for your review.



one nit below

---
  arch/x86/mm/init_64.c | 9 +++--
  1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ea1c3c2636e..30fe22558720 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1182,12 +1182,17 @@ void __init mem_init(void)
  
  	/* clear_bss() already clear the empty_zero_page */
  
-	register_page_bootmem_info();

-
/* this will put all memory onto the freelists */
free_all_bootmem();
after_bootmem = 1;
  
+	/* Must be done after boot memory is put on freelist, because here we


standard code style is to do
/*
 * text starts here


OK, will change for both patch 1 and 2.
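
i.e. for both patches:

	/*
	 * Must be done after boot memory is put on freelist, because here we
	 * might set fields in deferred struct pages that have not yet been
	 * initialized, and free_all_bootmem() initializes all the reserved
	 * deferred pages for us.
	 */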

Pasha


Re: [PATCH v9 09/12] mm/kasan: kasan specific map populate function

2017-10-03 Thread Mark Rutland
Hi Pavel,

On Wed, Sep 20, 2017 at 04:17:11PM -0400, Pavel Tatashin wrote:
> During early boot, kasan uses vmemmap_populate() to establish its shadow
> memory. But, that interface is intended for struct pages use.
> 
> Because of the current project, vmemmap won't be zeroed during allocation,
> but kasan expects that memory to be zeroed. We are adding a new
> kasan_map_populate() function to resolve this difference.

Thanks for putting this together.

I've given this a spin on arm64, and can confirm that it works.

Given that this involves redundant walking of page tables, I still think
it'd be preferable to have some common *_populate() helper that took a
gfp argument, but I guess it's not the end of the world.
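
Something along these lines (hypothetical interface, not in this series):

	/* hypothetical: like vmemmap_populate(), but would let kasan pass
	 * __GFP_ZERO instead of re-walking the tables it just created */
	int vmemmap_populate_gfp(unsigned long start, unsigned long end,
				 int node, gfp_t gfp_mask);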

I'll leave it to Will and Catalin to say whether they're happy with the
page table walking and the new p{u,m}d_large() helpers added to arm64.

Thanks,
Mark.

> 
> Signed-off-by: Pavel Tatashin 
> ---
>  arch/arm64/include/asm/pgtable.h |  3 ++
>  include/linux/kasan.h|  2 ++
>  mm/kasan/kasan_init.c| 67 
> 
>  3 files changed, 72 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/pgtable.h 
> b/arch/arm64/include/asm/pgtable.h
> index bc4e92337d16..d89713f04354 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -381,6 +381,9 @@ extern pgprot_t phys_mem_access_prot(struct file *file, 
> unsigned long pfn,
>PUD_TYPE_TABLE)
>  #endif
>  
> +#define pmd_large(pmd)   pmd_sect(pmd)
> +#define pud_large(pud)   pud_sect(pud)
> +
>  static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
>  {
>   *pmdp = pmd;
> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> index a5c7046f26b4..7e13df1722c2 100644
> --- a/include/linux/kasan.h
> +++ b/include/linux/kasan.h
> @@ -78,6 +78,8 @@ size_t kasan_metadata_size(struct kmem_cache *cache);
>  
>  bool kasan_save_enable_multi_shot(void);
>  void kasan_restore_multi_shot(bool enabled);
> +int __meminit kasan_map_populate(unsigned long start, unsigned long end,
> +  int node);
>  
>  #else /* CONFIG_KASAN */
>  
> diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c
> index 554e4c0f23a2..57a973f05f63 100644
> --- a/mm/kasan/kasan_init.c
> +++ b/mm/kasan/kasan_init.c
> @@ -197,3 +197,70 @@ void __init kasan_populate_zero_shadow(const void 
> *shadow_start,
>   zero_p4d_populate(pgd, addr, next);
>   } while (pgd++, addr = next, addr != end);
>  }
> +
> +/* Creates mappings for kasan during early boot. The mapped memory is zeroed 
> */
> +int __meminit kasan_map_populate(unsigned long start, unsigned long end,
> +  int node)
> +{
> + unsigned long addr, pfn, next;
> + unsigned long long size;
> + pgd_t *pgd;
> + p4d_t *p4d;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte;
> + int ret;
> +
> + ret = vmemmap_populate(start, end, node);
> + /*
> +  * We might have partially populated memory, so check for no entries,
> +  * and zero only those that actually exist.
> +  */
> + for (addr = start; addr < end; addr = next) {
> + pgd = pgd_offset_k(addr);
> + if (pgd_none(*pgd)) {
> + next = pgd_addr_end(addr, end);
> + continue;
> + }
> +
> + p4d = p4d_offset(pgd, addr);
> + if (p4d_none(*p4d)) {
> + next = p4d_addr_end(addr, end);
> + continue;
> + }
> +
> + pud = pud_offset(p4d, addr);
> + if (pud_none(*pud)) {
> + next = pud_addr_end(addr, end);
> + continue;
> + }
> + if (pud_large(*pud)) {
> + /* This is PUD size page */
> + next = pud_addr_end(addr, end);
> + size = PUD_SIZE;
> + pfn = pud_pfn(*pud);
> + } else {
> + pmd = pmd_offset(pud, addr);
> + if (pmd_none(*pmd)) {
> + next = pmd_addr_end(addr, end);
> + continue;
> + }
> + if (pmd_large(*pmd)) {
> + /* This is PMD size page */
> + next = pmd_addr_end(addr, end);
> + size = PMD_SIZE;
> + pfn = pmd_pfn(*pmd);
> + } else {
> + pte = pte_offset_kernel(pmd, addr);
> + next = addr + PAGE_SIZE;
> + if (pte_none(*pte))
> + continue;
> + /* This is base size page */
> + size = PAGE_SIZE;
> + pfn = 

Re: [patch V2 22/29] lockup_detector: Make watchdog_nmi_reconfigure() two stage

2017-10-03 Thread Thomas Gleixner
On Tue, 3 Oct 2017, Thomas Gleixner wrote:
> On Tue, 3 Oct 2017, Michael Ellerman wrote:
> > Hmm, I tried that patch, it makes the warning go away. But then I
> > triggered a deliberate hard lockup and got nothing.
> > 
> > Then I went back to the existing code (in linux-next), and I still get
> > no warning from a deliberate hard lockup.
> > 
> > So seems there may be some more gremlins. Will test more in the morning.
> 
> Hrm. That's weird. I'll have a look and send a proper patch series on top
> of next.

The major difference is that the reworked code utilizes
watchdog_nmi_reconfigure() for both init and the sysctl updates, but I
can't for my life figure out why that doesn't work.

Thanks,

tglx


Re: [PATCH v9 12/12] mm: stop zeroing memory during allocation in vmemmap

2017-10-03 Thread Michal Hocko
On Wed 20-09-17 16:17:14, Pavel Tatashin wrote:
> vmemmap_alloc_block() will no longer zero the block, so zero memory
> at its call sites for everything except struct pages.  Struct page memory
> is zero'd by struct page initialization.
> 
> Replace allocators in sparse-vmemmap to use the non-zeroing version. So,
> we will get the performance improvement by zeroing the memory in parallel
> when struct pages are zeroed.

Is it possible to merge this patch with 
http://lkml.kernel.org/r/20170920201714.19817-7-pasha.tatas...@oracle.com

> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 
> ---
>  include/linux/mm.h  | 11 +++
>  mm/sparse-vmemmap.c | 15 +++
>  mm/sparse.c |  6 +++---
>  3 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a7bba4ce79ba..25848764570f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned 
> long size, int node)
>   return __vmemmap_alloc_block_buf(size, node, NULL);
>  }
>  
> +static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
> +{
> + void *p = vmemmap_alloc_block(size, node);
> +
> + if (!p)
> + return NULL;
> + memset(p, 0, size);
> +
> + return p;
> +}
> +
>  void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
>  int vmemmap_populate_basepages(unsigned long start, unsigned long end,
>  int node);
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index d1a39b8051e0..c2f5654e7c9d 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
>   unsigned long align,
>   unsigned long goal)
>  {
> - return memblock_virt_alloc_try_nid(size, align, goal,
> + return memblock_virt_alloc_try_nid_raw(size, align, goal,
>   BOOTMEM_ALLOC_ACCESSIBLE, node);
>  }
>  
> @@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, 
> int node)
>   if (slab_is_available()) {
>   struct page *page;
>  
> - page = alloc_pages_node(node,
> - GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
> - get_order(size));
> + page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
> + get_order(size));
>   if (page)
>   return page_address(page);
>   return NULL;
> @@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, 
> unsigned long addr, int node)
>  {
>   pmd_t *pmd = pmd_offset(pud, addr);
>   if (pmd_none(*pmd)) {
> - void *p = vmemmap_alloc_block(PAGE_SIZE, node);
> + void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>   if (!p)
>   return NULL;
>   pmd_populate_kernel(_mm, pmd, p);
> @@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, 
> unsigned long addr, int node)
>  {
>   pud_t *pud = pud_offset(p4d, addr);
>   if (pud_none(*pud)) {
> - void *p = vmemmap_alloc_block(PAGE_SIZE, node);
> + void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>   if (!p)
>   return NULL;
>   pud_populate(_mm, pud, p);
> @@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, 
> unsigned long addr, int node)
>  {
>   p4d_t *p4d = p4d_offset(pgd, addr);
>   if (p4d_none(*p4d)) {
> - void *p = vmemmap_alloc_block(PAGE_SIZE, node);
> + void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>   if (!p)
>   return NULL;
>   p4d_populate(_mm, p4d, p);
> @@ -219,7 +218,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long 
> addr, int node)
>  {
>   pgd_t *pgd = pgd_offset_k(addr);
>   if (pgd_none(*pgd)) {
> - void *p = vmemmap_alloc_block(PAGE_SIZE, node);
> + void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>   if (!p)
>   return NULL;
>   pgd_populate(_mm, pgd, p);
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 83b3bf6461af..d22f51bb7c79 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -437,9 +437,9 @@ void __init sparse_mem_maps_populate_node(struct page 
> **map_map,
>   }
>  
>   size = PAGE_ALIGN(size);
> - map = memblock_virt_alloc_try_nid(size * map_count,
> -   PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> -   BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
> + map = 

Re: [PATCH v9 08/12] mm: zero reserved and unavailable struct pages

2017-10-03 Thread Michal Hocko
On Wed 20-09-17 16:17:10, Pavel Tatashin wrote:
> Some memory is reserved but unavailable: not present in memblock.memory
> (because not backed by physical pages), but present in memblock.reserved.
> Such memory has backing struct pages, but they are not initialized by going
> through __init_single_page().

Could you be more specific where is such a memory reserved?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 06/12] mm: zero struct pages during initialization

2017-10-03 Thread Michal Hocko
On Wed 20-09-17 16:17:08, Pavel Tatashin wrote:
> Add struct page zeroing as a part of initialization of other fields in
> __init_single_page().
> 
> This single thread performance collected on: Intel(R) Xeon(R) CPU E7-8895
> v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes):
> 
>                          BASE            FIX
> sparse_init      11.244671836s   0.007199623s
> zone_sizes_init   4.879775891s   8.355182299s
>                  ----------------------------
> Total            16.124447727s   8.362381922s

Hmm, this is confusing. This assumes that sparse_init doesn't zero pages
anymore, right? So these number depend on the last patch in the series?

> sparse_init is where memory for struct pages is zeroed, and the zeroing
> part is moved later in this patch into __init_single_page(), which is
> called from zone_sizes_init().
> 
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 
> Acked-by: Michal Hocko 
> ---
>  include/linux/mm.h | 9 +
>  mm/page_alloc.c| 1 +
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f8c10d336e42..50b74d628243 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly;
>  #define mm_forbids_zeropage(X)   (0)
>  #endif
>  
> +/*
> + * On some architectures it is expensive to call memset() for small sizes.
> + * Those architectures should provide their own implementation of "struct 
> page"
> + * zeroing by defining this macro in .
> + */
> +#ifndef mm_zero_struct_page
> +#define mm_zero_struct_page(pp)  ((void)memset((pp), 0, sizeof(struct page)))
> +#endif
> +
>  /*
>   * Default maximum number of active map areas, this limits the number of vmas
>   * per mm struct. Users can overwrite this number by sysctl but there is a
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a8dbd405ed94..4b630ee91430 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone,
>  static void __meminit __init_single_page(struct page *page, unsigned long 
> pfn,
>   unsigned long zone, int nid)
>  {
> + mm_zero_struct_page(page);
>   set_page_links(page, zone, nid, pfn);
>   init_page_count(page);
>   page_mapcount_reset(page);
> -- 
> 2.14.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/2] powerpc/xive: fix CPU hot unplug

2017-10-03 Thread Cédric Le Goater
Hello Michael,

On 10/03/2017 01:23 PM, Michael Ellerman wrote:
> Cédric Le Goater  writes:
> 
>> On 09/23/2017 10:26 AM, Cédric Le Goater wrote:
>>> Hi,
>>>
>>> Here are a couple of small fixes to support CPU hot unplug. There are
>>> still some issues to be investigated as, in some occasions, after a
>>> couple of plug and unplug, the cpu which was removed receives a 'lost'
>>> interrupt. This showed to be the decrementer under QEMU.
>>
>> So this seems to be a QEMU issue only which can be solved by 
>> removing the DEE bit from the LPCR on P9 processor when the CPU 
>> is stopped in rtas. PECE3 bit on P8 processors. 
>>
>> I think these patches are valuable fixes for 4.14. The first 
>> is trivial and the second touches the common xive part but it
>> is only called on the pseries platform.  
>>
>> Could you please take a look ? 
> 
> I can.
> 
> You didn't give me much indication in the change logs of what the
> failure mode is if I _don't_ have the fixes, which makes it hard for me
> to assess the severity.
Support for CPU removal came recently on PowerVM and was not
tested when the patchset was sent. A couple of cleanups of the 
XIVE internal structures are missing, resulting in a kernel crash 
when the CPU is released. 

There are still some corner cases when stressing the lpar with 
plug/unplug loops. Investigation in progress.  

> Can you flesh out the ".. or else" case in the change logs?

OK. I will improve patch 2 changelog.

> And should both be tagged?
> 
>   Fixes: eac1e731b59e ("powerpc/xive: guest exploitation of the XIVE 
> interrupt controller")
Yes, I think so. These patches are too small to be tagged as 
adding hotplug support.

Thanks,

C. 


Re: [PATCH v9 04/12] sparc64: simplify vmemmap_populate

2017-10-03 Thread Michal Hocko
On Wed 20-09-17 16:17:06, Pavel Tatashin wrote:
> Remove duplicating code by using common functions
> vmemmap_pud_populate and vmemmap_pgd_populate.
> 
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 
> Acked-by: David S. Miller 

Acked-by: Michal Hocko 

> ---
>  arch/sparc/mm/init_64.c | 23 ++-
>  1 file changed, 6 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 310c6754bcaa..99aea4d15a5f 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -2651,30 +2651,19 @@ int __meminit vmemmap_populate(unsigned long vstart, 
> unsigned long vend,
>   vstart = vstart & PMD_MASK;
>   vend = ALIGN(vend, PMD_SIZE);
>   for (; vstart < vend; vstart += PMD_SIZE) {
> - pgd_t *pgd = pgd_offset_k(vstart);
> + pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
>   unsigned long pte;
>   pud_t *pud;
>   pmd_t *pmd;
>  
> - if (pgd_none(*pgd)) {
> - pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
> + if (!pgd)
> + return -ENOMEM;
>  
> - if (!new)
> - return -ENOMEM;
> - pgd_populate(_mm, pgd, new);
> - }
> -
> - pud = pud_offset(pgd, vstart);
> - if (pud_none(*pud)) {
> - pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
> -
> - if (!new)
> - return -ENOMEM;
> - pud_populate(_mm, pud, new);
> - }
> + pud = vmemmap_pud_populate(pgd, vstart, node);
> + if (!pud)
> + return -ENOMEM;
>  
>   pmd = pmd_offset(pud, vstart);
> -
>   pte = pmd_val(*pmd);
>   if (!(pte & _PAGE_VALID)) {
>   void *block = vmemmap_alloc_block(PMD_SIZE, node);
> -- 
> 2.14.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 03/12] mm: deferred_init_memmap improvements

2017-10-03 Thread Michal Hocko
On Wed 20-09-17 16:17:05, Pavel Tatashin wrote:
> This patch fixes two issues in deferred_init_memmap
> 
> =
> In deferred_init_memmap() where all deferred struct pages are initialized
> we have a check like this:
> 
> if (page->flags) {
>   VM_BUG_ON(page_zone(page) != zone);
>   goto free_range;
> }
> 
> This way we are checking if the current deferred page has already been
> initialized. It works, because memory for struct pages has been zeroed, and
> the only way flags are not zero is if it went through __init_single_page()
> before.  But, once we change the current behavior and no longer zero the
> memory in the memblock allocator, we cannot trust anything inside "struct
> page"es until they are initialized. This patch fixes this.
> 
> The deferred_init_memmap() is re-written to loop through only free memory
> ranges provided by memblock.

Please be explicit that this is possible only because we discard
memblock data later after 3010f876500f ("mm: discard memblock data
later"). Also be more explicit how the new code works.

I like how the resulting code is more compact and smaller.
for_each_free_mem_range also looks more appropriate but I really detest
the DEFERRED_FREE thingy. Maybe we can handle all that in a single goto
section. I know this is not an art but manipulating variables from
macros is more error prone and much more ugly IMHO.

> =
> This patch fixes another existing issue on systems that have holes in
> zones i.e CONFIG_HOLES_IN_ZONE is defined.
> 
> In for_each_mem_pfn_range() we have code like this:
> 
> if (!pfn_valid_within(pfn)
>   goto free_range;
> 
> Note: 'page' is not set to NULL and is not incremented, but 'pfn' advances.
> This means that if deferred struct pages are enabled on systems with this
> kind of holes, Linux would get memory corruption. I have fixed this issue by
> defining a new macro that performs all the necessary operations when we
> free the current set of pages.

please do not use macros. Btw. this deserves its own fix. I suspect that
no CONFIG_HOLES_IN_ZONE arch enables DEFERRED_STRUCT_PAGE_INIT but
purely from the review point of view it should be its own patch.

> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 
> ---
>  mm/page_alloc.c | 161 
> +++-
>  1 file changed, 78 insertions(+), 83 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c841af88836a..d132c801d2c1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
>  }
>  
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> -static void __init deferred_free_range(struct page *page,
> - unsigned long pfn, int nr_pages)
> +static void __init deferred_free_range(unsigned long pfn,
> +unsigned long nr_pages)
>  {
> - int i;
> + struct page *page;
> + unsigned long i;
>  
> - if (!page)
> + if (!nr_pages)
>   return;
>  
> + page = pfn_to_page(pfn);
> +
>   /* Free a large naturally-aligned chunk if possible */
>   if (nr_pages == pageblock_nr_pages &&
>   (pfn & (pageblock_nr_pages - 1)) == 0) {
> @@ -1443,19 +1446,82 @@ static inline void __init 
> pgdat_init_report_one_done(void)
>   complete(_init_all_done_comp);
>  }
>  
> +#define DEFERRED_FREE(nr_free, free_base_pfn, page)  \
> +({   \
> + unsigned long nr = (nr_free);   \
> + \
> + deferred_free_range((free_base_pfn), (nr)); \
> + (free_base_pfn) = 0;\
> + (nr_free) = 0;  \
> + page = NULL;\
> + nr; \
> +})
> +
> +static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
> +  unsigned long end_pfn)
> +{
> + struct mminit_pfnnid_cache nid_init_state = { };
> + unsigned long nr_pgmask = pageblock_nr_pages - 1;
> + unsigned long free_base_pfn = 0;
> + unsigned long nr_pages = 0;
> + unsigned long nr_free = 0;
> + struct page *page = NULL;
> +
> + for (; pfn < end_pfn; pfn++) {
> + /*
> +  * First we check if pfn is valid on architectures where it is
> +  * possible to have holes within pageblock_nr_pages. On systems
> +  * where it is not possible, this function is optimized out.
> +  *
> +  * Then, we check if a current large page 

Re: [PATCH v9 02/12] sparc64/mm: setting fields in deferred pages

2017-10-03 Thread Michal Hocko
On Wed 20-09-17 16:17:04, Pavel Tatashin wrote:
> Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
> flags and other fields in "struct page"es are never changed prior to first
> initializing struct pages by going through __init_single_page().
> 
> With deferred struct page feature enabled there is a case where we set some
> fields prior to initializing:
> 
> mem_init() {
>  register_page_bootmem_info();
>  free_all_bootmem();
>  ...
> }
> 
> When register_page_bootmem_info() is called only non-deferred struct pages
> are initialized. But, this function goes through some reserved pages which
> might be part of the deferred, and thus are not yet initialized.
> 
> mem_init
> register_page_bootmem_info
> register_page_bootmem_info_node
>  get_page_bootmem
>   .. setting fields here ..
>   such as: page->freelist = (void *)type;
> 
> free_all_bootmem()
> free_low_memory_core_early()
>  for_each_reserved_mem_region()
>   reserve_bootmem_region()
>init_reserved_page() <- Only if this is deferred reserved page
> __init_single_pfn()
>  __init_single_page()
>   memset(0) <-- Lose the set fields here
> 
> We end up with a similar issue as in the previous patch, where currently we
> do not observe the problem as memory is zeroed. But, if flag asserts are
> changed we can start hitting issues.
> 
> Also, because in this patch series we will stop zeroing struct page memory
> during allocation, we must make sure that struct pages are properly
> initialized prior to using them.
> 
> The deferred-reserved pages are initialized in free_all_bootmem().
> Therefore, the fix is to switch the above calls.
> 
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 
> Acked-by: David S. Miller 

As you separated x86 and sparc patches doing essentially the same I
assume David is going to take this patch?

Acked-by: Michal Hocko 

> ---
>  arch/sparc/mm/init_64.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 6034569e2c0d..310c6754bcaa 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -2548,9 +2548,15 @@ void __init mem_init(void)
>  {
>   high_memory = __va(last_valid_pfn << PAGE_SHIFT);
>  
> - register_page_bootmem_info();
>   free_all_bootmem();
>  
> + /* Must be done after boot memory is put on freelist, because here we
> +  * might set fields in deferred struct pages that have not yet been
> +  * initialized, and free_all_bootmem() initializes all the reserved
> +  * deferred pages for us.
> +  */
> + register_page_bootmem_info();
> +
>   /*
>* Set up the zero page, mark it reserved, so that page count
>* is not manipulated when freeing the page from user ptes.
> -- 
> 2.14.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 01/12] x86/mm: setting fields in deferred pages

2017-10-03 Thread Michal Hocko
On Wed 20-09-17 16:17:03, Pavel Tatashin wrote:
> Without deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
> flags and other fields in "struct page"es are never changed prior to first
> initializing struct pages by going through __init_single_page().
> 
> With deferred struct page feature enabled, however, we set fields in
> register_page_bootmem_info that are subsequently clobbered right after in
> free_all_bootmem:
> 
> mem_init() {
> register_page_bootmem_info();
> free_all_bootmem();
> ...
> }
> 
> When register_page_bootmem_info() is called only non-deferred struct pages
> are initialized. But, this function goes through some reserved pages which
> might be part of the deferred, and thus are not yet initialized.
> 
>   mem_init
>register_page_bootmem_info
> register_page_bootmem_info_node
>  get_page_bootmem
>   .. setting fields here ..
>   such as: page->freelist = (void *)type;
> 
>   free_all_bootmem()
>free_low_memory_core_early()
> for_each_reserved_mem_region()
>  reserve_bootmem_region()
>   init_reserved_page() <- Only if this is deferred reserved page
>__init_single_pfn()
> __init_single_page()
> memset(0) <-- Lose the set fields here
> 
> We end up with an issue where currently we do not observe the problem, as
> memory is explicitly zeroed. But, if flag asserts are changed we can start
> hitting issues.
> 
> Also, because in this patch series we will stop zeroing struct page memory
> during allocation, we must make sure that struct pages are properly
> initialized prior to using them.
> 
> The deferred-reserved pages are initialized in free_all_bootmem().
> Therefore, the fix is to switch the above calls.

Thanks for extending the changelog. This is more informative now.
 
> Signed-off-by: Pavel Tatashin 
> Reviewed-by: Steven Sistare 
> Reviewed-by: Daniel Jordan 
> Reviewed-by: Bob Picco 

I hope I haven't missed anything but it looks good to me.

Acked-by: Michal Hocko 

one nit below
> ---
>  arch/x86/mm/init_64.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 5ea1c3c2636e..30fe22558720 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1182,12 +1182,17 @@ void __init mem_init(void)
>  
>   /* clear_bss() already clear the empty_zero_page */
>  
> - register_page_bootmem_info();
> -
>   /* this will put all memory onto the freelists */
>   free_all_bootmem();
>   after_bootmem = 1;
>  
> + /* Must be done after boot memory is put on freelist, because here we

standard code style is to do
/*
 * text starts here

> +  * might set fields in deferred struct pages that have not yet been
> +  * initialized, and free_all_bootmem() initializes all the reserved
> +  * deferred pages for us.
> +  */
> + register_page_bootmem_info();
> +
>   /* Register memory areas for /proc/kcore */
>   kclist_add(_vsyscall, (void *)VSYSCALL_ADDR,
>PAGE_SIZE, KCORE_OTHER);
> -- 
> 2.14.1

-- 
Michal Hocko
SUSE Labs


[RFC PATCH] powerpc/perf: Add compact mode pmu support for powernv

2017-10-03 Thread Madhavan Srinivasan
Most of the power processor generation performance monitoring
unit (PMU) driver code is bundled in the kernel and one of those
is enabled/registered based on the oprofile_cpu_type check at
the boot.

But things get a little tricky in case of a "compact" mode boot.
IBM POWER System Server based processors have a compatibility
mode feature which, simply put, means that an Nth generation
processor (let's say POWER8) will act and appear in a mode
consistent with an earlier generation (N-1) processor (that is,
POWER7). And in this "compact" mode boot, the kernel modifies the
"oprofile_cpu_type" to be the Nth generation (POWER8). If the Nth
generation pmu driver is bundled (POWER8), it gets registered.

The key dependency here is having distro support for the latest
processor's performance monitoring. The patch here adds a generic
"compact-mode" performance monitoring driver to be registered in
the absence of a powernv platform specific pmu driver.

The driver supports only "cycles" and "instructions" events.
"0x0001e" is used as the event code for "cycles" and "0x2" is
used as the event code for "instructions". A new file called
"compact-pmu.c" is created to contain the driver specific code,
with the base raw event code format modeled on power9. Currently
the patch checks "oprofile_cpu_type" in cpufeatures_setup_finished()
to identify whether a platform specific pmu driver has already
been initialized.
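
With the compact-mode driver registered, only the two generic events are
usable, e.g.:

  # perf stat -e cycles,instructions ./workload

where "cycles" is backed by raw event code 0x0001e and "instructions"
by 0x2.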

Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/kernel/dt_cpu_ftrs.c |  33 +
 arch/powerpc/perf/Makefile|   3 +-
 arch/powerpc/perf/compact-pmu.c   | 247 ++
 3 files changed, 282 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/perf/compact-pmu.c

diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 1df770e8cbe0..c283b6cc5132 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -474,6 +474,34 @@ static int __init feat_enable_pmu_power9(struct 
dt_cpu_feature *f)
return 1;
 }
 
+static void check_pmu_setup(void)
+{
+   /*
+* Both "oprofile_cpu_type" and CPU_FTR_ARCH_300 checked
+* for enabling compact mode pmu. Compact mode pmu is a
+* basic pmu driver that will be installed in absence of
+* platform specific pmu driver.
+*/
+   if ((cur_cpu_spec->cpu_features & CPU_FTR_ARCH_300) &&
+   !cur_cpu_spec->oprofile_cpu_type) {
+
+   hfscr_pmu_enable();
+
+   /* Use power9 setup function since CPU_FTR_ARCH_300 enabled */
+   init_pmu_power9();
+   init_pmu_registers = init_pmu_power9;
+
+   cur_cpu_spec->cpu_features |= CPU_FTR_MMCRA;
+   cur_cpu_spec->cpu_user_features |= 
PPC_FEATURE_PSERIES_PERFMON_COMPAT;
+
+   cur_cpu_spec->num_pmcs  = 6;
+   cur_cpu_spec->pmc_type  = PPC_PMC_IBM;
+   cur_cpu_spec->oprofile_cpu_type = "ppc64/compact-mode";
+
+   pr_debug("Enabling compact mode pmu\n");
+   }
+}
+
 static int __init feat_enable_tm(struct dt_cpu_feature *f)
 {
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
@@ -750,6 +778,11 @@ static void __init cpufeatures_setup_finished(void)
system_registers.hfscr = mfspr(SPRN_HFSCR);
system_registers.fscr = mfspr(SPRN_FSCR);
 
+   /*
+* If no pmu driver loaded, load a default basic pmu driver 
+*/
+   check_pmu_setup();
+
cpufeatures_flush_tlb();
 
pr_info("final cpu/mmu features = 0x%016lx 0x%08x\n",
diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 3f3a5ce66495..311144929b6d 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -5,7 +5,8 @@ obj-$(CONFIG_PERF_EVENTS)   += callchain.o perf_regs.o
 obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += power4-pmu.o ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
-  isa207-common.o power8-pmu.o power9-pmu.o
+  isa207-common.o power8-pmu.o power9-pmu.o\
+  compact-pmu.o
 obj32-$(CONFIG_PPC_PERF_CTRS)  += mpc7450-pmu.o
 
 obj-$(CONFIG_PPC_POWERNV)  += imc-pmu.o
diff --git a/arch/powerpc/perf/compact-pmu.c b/arch/powerpc/perf/compact-pmu.c
new file mode 100644
index ..67fec7706fef
--- /dev/null
+++ b/arch/powerpc/perf/compact-pmu.c
@@ -0,0 +1,247 @@
+/*
+ * Performance counter support.
+ *
+ * Copyright 2017 Madhavan Srinivasan, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or later version.
+ */
+
+#define pr_fmt(fmt)"compact-pmu: " fmt
+
+#include "isa207-common.h"
+
+/*
+ * Raw event encoding for Power9:
+ *
+ *        60        56        52        48

Re: [patch V2 22/29] lockup_detector: Make watchdog_nmi_reconfigure() two stage

2017-10-03 Thread Thomas Gleixner
On Tue, 3 Oct 2017, Michael Ellerman wrote:
> Thomas Gleixner  writes:
> >> The first call is new because previously watchdog_nmi_reconfigure()
> >> wasn't called from softlockup_reconfigure_threads().
> >
> > Hmm, don't you have the same problem with CPU hotplug or do you just get
> > lucky because the hotplug callback in your code is ordered vs. the
> > softlockup thread hotplug callback in a way that this does not hit?
> 
> I don't see it with CPU hotplug.
> 
> AFAICS that's because softlockup_reconfigure_threads() isn't called for
> CPU hotplug. Unless there's a path I'm missing?

As I said in the other reply, I assumed that it's called via
watchdog_nmi_enable(cpu), but that's a weak function which is not
implemented on power. So no issue.

> >> I'm not sure what the easiest fix is. One option would be to just drop
> >> the WARN_ON, it's just there for paranoia AFAICS.
> >
> > The straight forward way is to make use of the new probe function. Patch
> > below.
> 
> Hmm, I tried that patch, it makes the warning go away. But then I
> triggered a deliberate hard lockup and got nothing.
> 
> Then I went back to the existing code (in linux-next), and I still get
> no warning from a deliberate hard lockup.
> 
> So seems there may be some more gremlins. Will test more in the morning.

Hrm. That's weird. I'll have a look and send a proper patch series on top
of next.

Thanks,

tglx


Re: refactor dma_cache_sync V2

2017-10-03 Thread Robin Murphy
On 03/10/17 11:43, Christoph Hellwig wrote:
> The dma_cache_sync routines is used to flush caches for memory returned
> by dma_alloc_attrs with the DMA_ATTR_NON_CONSISTENT flag (or previously
> from dma_alloc_noncoherent), but the requirements for it seems to be
> frequently misunderstood.  dma_cache_sync is documented to be a no-op for
> allocations that do not have the DMA_ATTR_NON_CONSISTENT flag set, and
> yet a lot of architectures implement it in some way despite not
> implementing DMA_ATTR_NON_CONSISTENT.
> 
> This series removes a few abuses of dma_cache_sync for non-DMA API
> purposes, then changes all remaining architectures that do not implement
> DMA_ATTR_NON_CONSISTENT to implement dma_cache_sync as a no-op, and
> then adds the struct dma_map_ops indirection we use for all other
> DMA mapping operations to dma_cache_sync as well, thus removing all but
> two implementations of the function.
> 
> Changes since V1:
>  - drop the mips fd_cacheflush, merged via maintainer
>  - spelling fix in last commit descriptions (thanks Geert)
> 

For the series,

Reviewed-by: Robin Murphy 

The MIPS DMA ops are a little fiddly, but I've satisfied myself that
CONFIG_DMA_NONCOHERENT and CONFIG_SWIOTLB are mutually exclusive for
Loongson64 such that patch #11 isn't missing anything.

Robin.


Re: [PATCH 07/11] powerpc: make dma_cache_sync a no-op

2017-10-03 Thread Christoph Hellwig
On Tue, Oct 03, 2017 at 01:24:57PM +0200, Christophe LEROY wrote:
>> powerpc does not implement DMA_ATTR_NON_CONSISTENT allocations, so it
>> doesn't make any sense to do any work in dma_cache_sync given that it
>> must be a no-op when dma_alloc_attrs returns coherent memory.
> What about arch/powerpc/mm/dma-noncoherent.c ?
>
> Powerpc 8xx doesn't have coherent memory.

It doesn't implement the DMA_ATTR_NON_CONSISTENT interface either,
so if it really doesn't have a way to provide dma coherent allocation
(although the code in __dma_alloc_coherent suggests it does provide
dma coherent allocations) I have no idea how it could ever have
worked.


Re: [PATCH 07/11] powerpc: make dma_cache_sync a no-op

2017-10-03 Thread Robin Murphy
On 03/10/17 12:24, Christophe LEROY wrote:
> 
> 
> Le 03/10/2017 à 12:43, Christoph Hellwig a écrit :
>> powerpc does not implement DMA_ATTR_NON_CONSISTENT allocations, so it
>> doesn't make any sense to do any work in dma_cache_sync given that it
>> must be a no-op when dma_alloc_attrs returns coherent memory.
> What about arch/powerpc/mm/dma-noncoherent.c ?

AFAICS, just like the ARM code from which it is derived, that will
always return a non-cacheable remap of the allocation, which should thus
not require explicit maintenance after CPU accesses. dma_cache_sync()
would only matter if dma_alloc_attrs(..., DMA_ATTR_NON_CONSISTENT) got
special treatment and was allowed to return a cacheable address, but PPC
doesn't even propagate the attrs to its internal __dma_alloc_coherent()
implementations.
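
(For reference, the intended contract is roughly:

	/* driver accepts possibly non-coherent memory it maintains itself */
	buf = dma_alloc_attrs(dev, size, &dma_handle, GFP_KERNEL,
			      DMA_ATTR_NON_CONSISTENT);
	/* ... CPU fills buf ... */
	dma_cache_sync(dev, buf, size, DMA_TO_DEVICE);	/* before device reads */

so a no-op dma_cache_sync() is only correct when the allocation is
genuinely coherent, which is the case for PPC as described above.)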

Robin.

> Powerpc 8xx doesn't have coherent memory.
> 
> Christophe
> 
>>
>> Signed-off-by: Christoph Hellwig 
>> ---
>>   arch/powerpc/include/asm/dma-mapping.h | 2 --
>>   1 file changed, 2 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/dma-mapping.h
>> b/arch/powerpc/include/asm/dma-mapping.h
>> index eaece3d3e225..320846442bfb 100644
>> --- a/arch/powerpc/include/asm/dma-mapping.h
>> +++ b/arch/powerpc/include/asm/dma-mapping.h
>> @@ -144,8 +144,6 @@ static inline phys_addr_t dma_to_phys(struct
>> device *dev, dma_addr_t daddr)
>>   static inline void dma_cache_sync(struct device *dev, void *vaddr,
>> size_t size,
>>   enum dma_data_direction direction)
>>   {
>> -    BUG_ON(direction == DMA_NONE);
>> -    __dma_sync(vaddr, size, (int)direction);
>>   }
>>     #endif /* __KERNEL__ */
>>



Re: [patch V2 22/29] lockup_detector: Make watchdog_nmi_reconfigure() two stage

2017-10-03 Thread Michael Ellerman
Thomas Gleixner  writes:

> On Tue, 3 Oct 2017, Michael Ellerman wrote:
>> Hi Thomas,
>> Unfortunately this is hitting the WARN_ON in start_wd_cpu() on powerpc
>> because we're calling it multiple times for the boot CPU.
>> 
>> The first call is via:
>> 
>>   start_wd_on_cpu+0x80/0x2f0
>>   watchdog_nmi_reconfigure+0x124/0x170
>>   softlockup_reconfigure_threads+0x110/0x130
>>   lockup_detector_init+0xbc/0xe0
>>   kernel_init_freeable+0x18c/0x37c
>>   kernel_init+0x2c/0x160
>>   ret_from_kernel_thread+0x5c/0xbc
>> 
>> And then again via the CPU hotplug registration:
>> 
>>   start_wd_on_cpu+0x80/0x2f0
>>   cpuhp_invoke_callback+0x194/0x620
>>   cpuhp_thread_fun+0x7c/0x1b0
>>   smpboot_thread_fn+0x290/0x2a0
>>   kthread+0x168/0x1b0
>>   ret_from_kernel_thread+0x5c/0xbc
>> 
>> 
>> The first call is new because previously watchdog_nmi_reconfigure()
>> wasn't called from softlockup_reconfigure_threads().
>
> Hmm, don't you have the same problem with CPU hotplug or do you just get
> lucky because the hotplug callback in your code is ordered vs. the
> softlockup thread hotplug callback in a way that this does not hit?

I don't see it with CPU hotplug.

AFAICS that's because softlockup_reconfigure_threads() isn't called for
CPU hotplug. Unless there's a path I'm missing?

>> I'm not sure what the easiest fix is. One option would be to just drop
>> the WARN_ON, it's just there for paranoia AFAICS.
>
> The straight forward way is to make use of the new probe function. Patch
> below.

Thanks.

Hmm, I tried that patch, it makes the warning go away. But then I
triggered a deliberate hard lockup and got nothing.

Then I went back to the existing code (in linux-next), and I still get
no warning from a deliberate hard lockup.

So seems there may be some more gremlins. Will test more in the morning.

cheers


Re: [PATCH 07/11] powerpc: make dma_cache_sync a no-op

2017-10-03 Thread Christophe LEROY



Le 03/10/2017 à 12:43, Christoph Hellwig a écrit :

powerpc does not implement DMA_ATTR_NON_CONSISTENT allocations, so it
doesn't make any sense to do any work in dma_cache_sync given that it
must be a no-op when dma_alloc_attrs returns coherent memory.

What about arch/powerpc/mm/dma-noncoherent.c ?

Powerpc 8xx doesn't have coherent memory.

Christophe



Signed-off-by: Christoph Hellwig 
---
  arch/powerpc/include/asm/dma-mapping.h | 2 --
  1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/dma-mapping.h 
b/arch/powerpc/include/asm/dma-mapping.h
index eaece3d3e225..320846442bfb 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -144,8 +144,6 @@ static inline phys_addr_t dma_to_phys(struct device *dev, 
dma_addr_t daddr)
  static inline void dma_cache_sync(struct device *dev, void *vaddr, size_t 
size,
enum dma_data_direction direction)
  {
-   BUG_ON(direction == DMA_NONE);
-   __dma_sync(vaddr, size, (int)direction);
  }
  
  #endif /* __KERNEL__ */




Re: [PATCH 0/2] powerpc/xive: fix CPU hot unplug

2017-10-03 Thread Michael Ellerman
Cédric Le Goater  writes:

> On 09/23/2017 10:26 AM, Cédric Le Goater wrote:
>> Hi,
>> 
>> Here are a couple of small fixes to support CPU hot unplug. There are
>> still some issues to be investigated as, in some occasions, after a
>> couple of plug and unplug, the cpu which was removed receives a 'lost'
>> interrupt. This showed to be the decrementer under QEMU.
>
> So this seems to be a QEMU issue only which can be solved by 
> removing the DEE bit from the LPCR on P9 processor when the CPU 
> is stopped in rtas. PECE3 bit on P8 processors. 
>
> I think these patches are valuable fixes for 4.14. The first 
> is trivial and the second touches the common xive part but it
> is only called on the pseries platform.  
>
> Could you please take a look ? 

I can.

You didn't give me much indication in the change logs of what the
failure mode is if I _don't_ have the fixes, which makes it hard for me
to assess the severity. Can you flesh out the ".. or else" case in the
change logs?

And should both be tagged?

  Fixes: eac1e731b59e ("powerpc/xive: guest exploitation of the XIVE interrupt 
controller")

cheers


[PATCH 2/2] powerpc/perf: Fix unit_sel/cache_sel checks

2017-10-03 Thread Madhavan Srinivasan
The raw event code has a couple of fields, "unit" and "cache", to capture
the "unit" to monitor for a given pmcxsel and the cache reload qualifier
to program in MMCR1.

isa207_get_constraint() uses the "unit" field to update the MMCRC
L2/L3 Event bus control fields with the "cache" bits of the raw event code.
These are power8 specific and not supported by the PowerISA v3.0 pmu, so
wrap the checks to be power8 specific. Also, the "cache" bit field is used
to update MMCR1[16:17], and this check too can be made power8 specific.

Fixes: 7ffd948fae4cd ('powerpc/perf: factor out power8 pmu functions')
Signed-off-by: Madhavan Srinivasan 
---
These changes should not have any impact since none of the published events
in power9 pmu-event json use the "cache" fields.

 arch/powerpc/perf/isa207-common.c | 25 ++---
 arch/powerpc/perf/isa207-common.h |  2 ++
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/perf/isa207-common.c 
b/arch/powerpc/perf/isa207-common.c
index 2efee3f196f5..f47564a9f223 100644
--- a/arch/powerpc/perf/isa207-common.c
+++ b/arch/powerpc/perf/isa207-common.c
@@ -150,6 +150,14 @@ static bool is_thresh_cmp_valid(u64 event)
return true;
 }
 
+static unsigned int dc_ic_rld_quad_l1_sel(u64 event)
+{
+   unsigned int cache;
+
+   cache = (event >> EVENT_CACHE_SEL_SHIFT) & MMCR1_DC_IC_QUAL_MASK;
+   return cache;
+}
+
 static inline u64 isa207_find_source(u64 idx, u32 sub_idx)
 {
u64 ret = PERF_MEM_NA;
@@ -275,7 +283,7 @@ int isa207_get_constraint(u64 event, unsigned long *maskp, 
unsigned long *valp)
value |= CNST_NC_VAL;
}
 
-   if (unit >= 6 && unit <= 9) {
+   if ( !cpu_has_feature(CPU_FTR_ARCH_300) && (unit >= 6 && unit <= 9)) {
/*
 * L2/L3 events contain a cache selector field, which is
 * supposed to be programmed into MMCRC. However MMCRC is only
@@ -288,7 +296,7 @@ int isa207_get_constraint(u64 event, unsigned long *maskp, 
unsigned long *valp)
if (cache & 0x7)
return -1;
 
-   } else if (event & EVENT_IS_L1) {
+   } else if (cpu_has_feature(CPU_FTR_ARCH_300) || (event & EVENT_IS_L1)) {
mask  |= CNST_L1_QUAL_MASK;
value |= CNST_L1_QUAL_VAL(cache);
}
@@ -391,11 +399,14 @@ int isa207_compute_mmcr(u64 event[], int n_ev,
/* In continuous sampling mode, update SDAR on TLB miss */
mmcra_sdar_mode(event[i], );
 
-   if (event[i] & EVENT_IS_L1) {
-   cache = event[i] >> EVENT_CACHE_SEL_SHIFT;
-   mmcr1 |= (cache & 1) << MMCR1_IC_QUAL_SHIFT;
-   cache >>= 1;
-   mmcr1 |= (cache & 1) << MMCR1_DC_QUAL_SHIFT;
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   cache = dc_ic_rld_quad_l1_sel(event[i]);
+   mmcr1 |= (cache) << MMCR1_DC_IC_QUAL_SHIFT;
+   } else {
+   if (event[i] & EVENT_IS_L1) {
+   cache = dc_ic_rld_quad_l1_sel(event[i]);
+   mmcr1 |= (cache) << MMCR1_DC_IC_QUAL_SHIFT;
+   }
}
 
if (is_event_marked(event[i])) {
diff --git a/arch/powerpc/perf/isa207-common.h 
b/arch/powerpc/perf/isa207-common.h
index 6c737d675792..c9fa037e3ae4 100644
--- a/arch/powerpc/perf/isa207-common.h
+++ b/arch/powerpc/perf/isa207-common.h
@@ -234,6 +234,8 @@
 #define MMCR1_FAB_SHIFT36
 #define MMCR1_DC_QUAL_SHIFT47
 #define MMCR1_IC_QUAL_SHIFT46
+#define MMCR1_DC_IC_QUAL_MASK  0x3
+#define MMCR1_DC_IC_QUAL_SHIFT 46
 
 /* MMCR1 Combine bits macro for power9 */
 #define p9_MMCR1_COMBINE_SHIFT(pmc)(38 - ((pmc - 1) * 2))
-- 
2.7.4



[PATCH 1/2] powerpc/perf: Cleanup cache_sel bits comment

2017-10-03 Thread Madhavan Srinivasan
Update the raw event code comment in power9-pmu.c with respect to
"cache" bits, since power9 MMCRC does not support these.

Signed-off-by: Madhavan Srinivasan 
---
 arch/powerpc/perf/power9-pmu.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/perf/power9-pmu.c b/arch/powerpc/perf/power9-pmu.c
index 24b5b5b7a206..88b0a78eff4d 100644
--- a/arch/powerpc/perf/power9-pmu.c
+++ b/arch/powerpc/perf/power9-pmu.c
@@ -63,16 +63,8 @@
  * MMCRA[9:11] = thresh_cmp[0:2]
  * MMCRA[12:18] = thresh_cmp[3:9]
  *
- * if unit == 6 or unit == 7
- * MMCRC[53:55] = cache_sel[1:3]  (L2EVENT_SEL)
- * else if unit == 8 or unit == 9:
- * if cache_sel[0] == 0: # L3 bank
- * MMCRC[47:49] = cache_sel[1:3]  (L3EVENT_SEL0)
- * else if cache_sel[0] == 1:
- * MMCRC[50:51] = cache_sel[2:3]  (L3EVENT_SEL1)
- * else if cache_sel[1]: # L1 event
- * MMCR1[16] = cache_sel[2]
- * MMCR1[17] = cache_sel[3]
+ * MMCR1[16] = cache_sel[2]
+ * MMCR1[17] = cache_sel[3]
  *
  * if mark:
  * MMCRA[63]= 1(SAMPLE_ENABLE)
-- 
2.7.4



Re: [patch V2 22/29] lockup_detector: Make watchdog_nmi_reconfigure() two stage

2017-10-03 Thread Thomas Gleixner
On Tue, 3 Oct 2017, Nicholas Piggin wrote:
> On Tue, 3 Oct 2017 09:04:03 +0200 (CEST)
> Thomas Gleixner  wrote:
> 
> > On Tue, 3 Oct 2017, Thomas Gleixner wrote:
> > > On Tue, 3 Oct 2017, Michael Ellerman wrote:  
> > > > Hi Thomas,
> > > > Unfortunately this is hitting the WARN_ON in start_wd_cpu() on powerpc
> > > > because we're calling it multiple times for the boot CPU.
> > > > 
> > > > The first call is via:
> > > > 
> > > >   start_wd_on_cpu+0x80/0x2f0
> > > >   watchdog_nmi_reconfigure+0x124/0x170
> > > >   softlockup_reconfigure_threads+0x110/0x130
> > > >   lockup_detector_init+0xbc/0xe0
> > > >   kernel_init_freeable+0x18c/0x37c
> > > >   kernel_init+0x2c/0x160
> > > >   ret_from_kernel_thread+0x5c/0xbc
> > > > 
> > > > And then again via the CPU hotplug registration:
> > > > 
> > > >   start_wd_on_cpu+0x80/0x2f0
> > > >   cpuhp_invoke_callback+0x194/0x620
> > > >   cpuhp_thread_fun+0x7c/0x1b0
> > > >   smpboot_thread_fn+0x290/0x2a0
> > > >   kthread+0x168/0x1b0
> > > >   ret_from_kernel_thread+0x5c/0xbc
> > > > 
> > > > 
> > > > The first call is new because previously watchdog_nmi_reconfigure()
> > > > wasn't called from softlockup_reconfigure_threads().  
> > > 
> > > Hmm, don't you have the same problem with CPU hotplug or do you just get
> > > lucky because the hotplug callback in your code is ordered vs. the
> > > softlockup thread hotplug callback in a way that this does not hit?
> 
> I had the idea that watchdog_nmi_reconfigure() being only called
> with get_online_cpus held would prevent hotplug callbacks running.
>   
> > 
> > Which leads me to the question why you need the hotplug state at all if the
> > softlockup detector is enabled. Wouldn't it make more sense to only
> > register the state if softlockup detector is turned off in Kconfig and
> > actually move it to the core code?
> 
> I don't understand what you mean exactly, but it was done to avoid
> relying on the softlockup detector at all, because it wasn't needed
> for anything else (unlike the perf lockup detector).

If the softlockup detector is enabled along with your hardlockup detector,
then the current code in mainline invokes watchdog_nmi_enable(cpu), which
is a weak function and, as I just noticed, not implemented by powerpc. So
it's a non-issue because it's not implemented.

Thanks,

tglx






[PATCH 11/11] dma-mapping: turn dma_cache_sync into a dma_map_ops method

2017-10-03 Thread Christoph Hellwig
After we removed all the dead wood it turns out only two architectures
actually implement dma_cache_sync as a real op: mips and parisc.  Add
a cache_sync method to struct dma_map_ops and implement it for the
mips default DMA ops, and the parisc pa11 ops.

Note that arm, arc and openrisc support DMA_ATTR_NON_CONSISTENT, but
never provided a functional dma_cache_sync implementation, which
seems somewhat odd.
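
For reference, the common dma_cache_sync wrapper then simply dispatches
through the ops table; roughly like this (a sketch of the
include/linux/dma-mapping.h side, which may not match the actual hunk
byte-for-byte):

	static inline void
	dma_cache_sync(struct device *dev, void *vaddr, size_t size,
			enum dma_data_direction dir)
	{
		const struct dma_map_ops *ops = get_dma_ops(dev);

		BUG_ON(!valid_dma_direction(dir));
		if (ops->cache_sync)
			ops->cache_sync(dev, vaddr, size, dir);
	}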

Signed-off-by: Christoph Hellwig 
---
 arch/alpha/include/asm/dma-mapping.h  |  2 --
 arch/cris/include/asm/dma-mapping.h   |  6 --
 arch/frv/include/asm/dma-mapping.h|  6 --
 arch/hexagon/include/asm/dma-mapping.h|  3 ---
 arch/ia64/include/asm/dma-mapping.h   |  6 --
 arch/m32r/include/asm/dma-mapping.h   |  5 -
 arch/m68k/include/asm/dma-mapping.h   |  6 --
 arch/metag/include/asm/dma-mapping.h  | 10 --
 arch/microblaze/include/asm/dma-mapping.h |  5 -
 arch/mips/include/asm/dma-mapping.h   |  3 ---
 arch/mips/mm/dma-default.c|  7 +++
 arch/mn10300/include/asm/dma-mapping.h|  6 --
 arch/nios2/include/asm/dma-mapping.h  |  9 -
 arch/parisc/include/asm/dma-mapping.h |  8 
 arch/parisc/kernel/pci-dma.c  |  8 
 arch/powerpc/include/asm/dma-mapping.h|  5 -
 arch/s390/include/asm/dma-mapping.h   |  5 -
 arch/sh/include/asm/dma-mapping.h |  6 --
 arch/sparc/include/asm/dma-mapping.h  |  8 
 arch/tile/include/asm/dma-mapping.h   |  9 -
 arch/unicore32/include/asm/dma-mapping.h  |  5 -
 arch/x86/include/asm/dma-mapping.h|  6 --
 arch/xtensa/include/asm/dma-mapping.h |  5 -
 include/linux/dma-mapping.h   | 13 +
 24 files changed, 24 insertions(+), 128 deletions(-)

diff --git a/arch/alpha/include/asm/dma-mapping.h 
b/arch/alpha/include/asm/dma-mapping.h
index 5d53666935e6..399a4f49355e 100644
--- a/arch/alpha/include/asm/dma-mapping.h
+++ b/arch/alpha/include/asm/dma-mapping.h
@@ -8,6 +8,4 @@ static inline const struct dma_map_ops *get_arch_dma_ops(struct 
bus_type *bus)
return dma_ops;
 }
 
-#define dma_cache_sync(dev, va, size, dir)   ((void)0)
-
 #endif /* _ALPHA_DMA_MAPPING_H */
diff --git a/arch/cris/include/asm/dma-mapping.h 
b/arch/cris/include/asm/dma-mapping.h
index 256169de3743..e30adde42beb 100644
--- a/arch/cris/include/asm/dma-mapping.h
+++ b/arch/cris/include/asm/dma-mapping.h
@@ -16,10 +16,4 @@ static inline const struct dma_map_ops 
*get_arch_dma_ops(struct bus_type *bus)
 }
 #endif
 
-static inline void
-dma_cache_sync(struct device *dev, void *vaddr, size_t size,
-  enum dma_data_direction direction)
-{
-}
-
 #endif
diff --git a/arch/frv/include/asm/dma-mapping.h 
b/arch/frv/include/asm/dma-mapping.h
index da0e5c9744c4..da24ae943f02 100644
--- a/arch/frv/include/asm/dma-mapping.h
+++ b/arch/frv/include/asm/dma-mapping.h
@@ -14,10 +14,4 @@ static inline const struct dma_map_ops 
*get_arch_dma_ops(struct bus_type *bus)
return _dma_ops;
 }
 
-static inline
-void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
-   enum dma_data_direction direction)
-{
-}
-
 #endif  /* _ASM_DMA_MAPPING_H */
diff --git a/arch/hexagon/include/asm/dma-mapping.h 
b/arch/hexagon/include/asm/dma-mapping.h
index 463dbc18f853..5208de242e79 100644
--- a/arch/hexagon/include/asm/dma-mapping.h
+++ b/arch/hexagon/include/asm/dma-mapping.h
@@ -37,9 +37,6 @@ static inline const struct dma_map_ops 
*get_arch_dma_ops(struct bus_type *bus)
return dma_ops;
 }
 
-extern void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
-  enum dma_data_direction direction);
-
 static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t 
size)
 {
if (!dev->dma_mask)
diff --git a/arch/ia64/include/asm/dma-mapping.h 
b/arch/ia64/include/asm/dma-mapping.h
index 99dfc1aa9d3c..9e5b5df76ff8 100644
--- a/arch/ia64/include/asm/dma-mapping.h
+++ b/arch/ia64/include/asm/dma-mapping.h
@@ -44,10 +44,4 @@ static inline phys_addr_t dma_to_phys(struct device *dev, 
dma_addr_t daddr)
return daddr;
 }
 
-static inline void
-dma_cache_sync (struct device *dev, void *vaddr, size_t size,
-   enum dma_data_direction dir)
-{
-}
-
 #endif /* _ASM_IA64_DMA_MAPPING_H */
diff --git a/arch/m32r/include/asm/dma-mapping.h 
b/arch/m32r/include/asm/dma-mapping.h
index aff3ae8b62f7..9e993daed7a0 100644
--- a/arch/m32r/include/asm/dma-mapping.h
+++ b/arch/m32r/include/asm/dma-mapping.h
@@ -13,11 +13,6 @@ static inline const struct dma_map_ops 
*get_arch_dma_ops(struct bus_type *bus)
return _noop_ops;
 }
 
-static inline void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
- enum dma_data_direction direction)
-{
-}
-
 static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t 
size)
 {
if (!dev->dma_mask)

[PATCH 10/11] sh: make dma_cache_sync a no-op

2017-10-03 Thread Christoph Hellwig
sh does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
make any sense to do any work in dma_cache_sync given that it
must be a no-op when dma_alloc_attrs returns coherent memory.

On the other hand sh uses dma_cache_sync internally in the dma_ops
implementation and for the maple bus that does not use the DMA API,
so the old functionality of dma_cache_sync is still provided under
the name sh_sync_dma_for_device, and without the redundant dev
argument.  While at it, two of the syncing dma_ops also gain the proper
_for_device suffix.

Signed-off-by: Christoph Hellwig 
---
 arch/sh/include/asm/dma-mapping.h |  9 +++--
 arch/sh/kernel/dma-nommu.c| 17 +
 arch/sh/mm/consistent.c   |  6 +++---
 drivers/sh/maple/maple.c  |  5 ++---
 4 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/arch/sh/include/asm/dma-mapping.h 
b/arch/sh/include/asm/dma-mapping.h
index 9b06be07db4d..b46194ecef17 100644
--- a/arch/sh/include/asm/dma-mapping.h
+++ b/arch/sh/include/asm/dma-mapping.h
@@ -9,8 +9,10 @@ static inline const struct dma_map_ops 
*get_arch_dma_ops(struct bus_type *bus)
return dma_ops;
 }
 
-void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
-   enum dma_data_direction dir);
+static inline void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
+   enum dma_data_direction dir)
+{
+}
 
 /* arch/sh/mm/consistent.c */
 extern void *dma_generic_alloc_coherent(struct device *dev, size_t size,
@@ -20,4 +22,7 @@ extern void dma_generic_free_coherent(struct device *dev, 
size_t size,
  void *vaddr, dma_addr_t dma_handle,
  unsigned long attrs);
 
+void sh_sync_dma_for_device(void *vaddr, size_t size,
+   enum dma_data_direction dir);
+
 #endif /* __ASM_SH_DMA_MAPPING_H */
diff --git a/arch/sh/kernel/dma-nommu.c b/arch/sh/kernel/dma-nommu.c
index d24c707b2181..62b485107eae 100644
--- a/arch/sh/kernel/dma-nommu.c
+++ b/arch/sh/kernel/dma-nommu.c
@@ -9,6 +9,7 @@
  */
 #include 
 #include 
+#include 
 
 static dma_addr_t nommu_map_page(struct device *dev, struct page *page,
 unsigned long offset, size_t size,
@@ -20,7 +21,7 @@ static dma_addr_t nommu_map_page(struct device *dev, struct 
page *page,
WARN_ON(size == 0);
 
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-   dma_cache_sync(dev, page_address(page) + offset, size, dir);
+   sh_sync_dma_for_device(page_address(page) + offset, size, dir);
 
return addr;
 }
@@ -38,7 +39,7 @@ static int nommu_map_sg(struct device *dev, struct 
scatterlist *sg,
BUG_ON(!sg_page(s));
 
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-   dma_cache_sync(dev, sg_virt(s), s->length, dir);
+   sh_sync_dma_for_device(sg_virt(s), s->length, dir);
 
s->dma_address = sg_phys(s);
s->dma_length = s->length;
@@ -48,20 +49,20 @@ static int nommu_map_sg(struct device *dev, struct 
scatterlist *sg,
 }
 
 #ifdef CONFIG_DMA_NONCOHERENT
-static void nommu_sync_single(struct device *dev, dma_addr_t addr,
+static void nommu_sync_single_for_device(struct device *dev, dma_addr_t addr,
  size_t size, enum dma_data_direction dir)
 {
-   dma_cache_sync(dev, phys_to_virt(addr), size, dir);
+   sh_sync_dma_for_device(phys_to_virt(addr), size, dir);
 }
 
-static void nommu_sync_sg(struct device *dev, struct scatterlist *sg,
+static void nommu_sync_sg_for_device(struct device *dev, struct scatterlist 
*sg,
  int nelems, enum dma_data_direction dir)
 {
struct scatterlist *s;
int i;
 
for_each_sg(sg, s, nelems, i)
-   dma_cache_sync(dev, sg_virt(s), s->length, dir);
+   sh_sync_dma_for_device(sg_virt(s), s->length, dir);
 }
 #endif
 
@@ -71,8 +72,8 @@ const struct dma_map_ops nommu_dma_ops = {
.map_page   = nommu_map_page,
.map_sg = nommu_map_sg,
 #ifdef CONFIG_DMA_NONCOHERENT
-   .sync_single_for_device = nommu_sync_single,
-   .sync_sg_for_device = nommu_sync_sg,
+   .sync_single_for_device = nommu_sync_single_for_device,
+   .sync_sg_for_device = nommu_sync_sg_for_device,
 #endif
.is_phys= 1,
 };
diff --git a/arch/sh/mm/consistent.c b/arch/sh/mm/consistent.c
index d1275adfa0ef..6ea3aab508f2 100644
--- a/arch/sh/mm/consistent.c
+++ b/arch/sh/mm/consistent.c
@@ -49,7 +49,7 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t 
size,
 * Pages from the page allocator may have data present in
 * cache. So flush the cache before using uncached memory.
 */
-   dma_cache_sync(dev, ret, size, DMA_BIDIRECTIONAL);
+   sh_sync_dma_for_device(ret, size, DMA_BIDIRECTIONAL);
 
ret_nocache = (void 

[PATCH 09/11] xtensa: make dma_cache_sync a no-op

2017-10-03 Thread Christoph Hellwig
xtensa does not implement DMA_ATTR_NON_CONSISTENT allocations, so it
doesn't make any sense to do any work in dma_cache_sync given that it
must be a no-op when dma_alloc_attrs returns coherent memory.

Signed-off-by: Christoph Hellwig 
---
 arch/xtensa/include/asm/dma-mapping.h |  6 --
 arch/xtensa/kernel/pci-dma.c  | 23 ---
 2 files changed, 4 insertions(+), 25 deletions(-)

diff --git a/arch/xtensa/include/asm/dma-mapping.h 
b/arch/xtensa/include/asm/dma-mapping.h
index 269738dc9d1d..353e0314d6ba 100644
--- a/arch/xtensa/include/asm/dma-mapping.h
+++ b/arch/xtensa/include/asm/dma-mapping.h
@@ -23,8 +23,10 @@ static inline const struct dma_map_ops 
*get_arch_dma_ops(struct bus_type *bus)
return _dma_map_ops;
 }
 
-void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
-   enum dma_data_direction direction);
+static inline void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
+   enum dma_data_direction direction)
+{
+}
 
 static inline dma_addr_t phys_to_dma(struct device *dev, phys_addr_t paddr)
 {
diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c
index cec86a1c2acc..623720a11143 100644
--- a/arch/xtensa/kernel/pci-dma.c
+++ b/arch/xtensa/kernel/pci-dma.c
@@ -26,29 +26,6 @@
 #include 
 #include 
 
-void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
-   enum dma_data_direction dir)
-{
-   switch (dir) {
-   case DMA_BIDIRECTIONAL:
-   __flush_invalidate_dcache_range((unsigned long)vaddr, size);
-   break;
-
-   case DMA_FROM_DEVICE:
-   __invalidate_dcache_range((unsigned long)vaddr, size);
-   break;
-
-   case DMA_TO_DEVICE:
-   __flush_dcache_range((unsigned long)vaddr, size);
-   break;
-
-   case DMA_NONE:
-   BUG();
-   break;
-   }
-}
-EXPORT_SYMBOL(dma_cache_sync);
-
 static void do_cache_op(dma_addr_t dma_handle, size_t size,
void (*fn)(unsigned long, unsigned long))
 {
-- 
2.14.1



[PATCH 08/11] unicore32: make dma_cache_sync a no-op

2017-10-03 Thread Christoph Hellwig
unicore32 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it
doesn't make any sense to do any work in dma_cache_sync given that it
must be a no-op when dma_alloc_attrs returns coherent memory.

Signed-off-by: Christoph Hellwig 
---
 arch/unicore32/include/asm/cacheflush.h  |  9 -
 arch/unicore32/include/asm/dma-mapping.h | 17 -
 arch/unicore32/mm/proc-syms.c|  3 ---
 3 files changed, 29 deletions(-)

diff --git a/arch/unicore32/include/asm/cacheflush.h 
b/arch/unicore32/include/asm/cacheflush.h
index c0301e6c8b81..a5e08e2d5d6d 100644
--- a/arch/unicore32/include/asm/cacheflush.h
+++ b/arch/unicore32/include/asm/cacheflush.h
@@ -101,15 +101,6 @@ extern void __cpuc_coherent_user_range(unsigned long, 
unsigned long);
 extern void __cpuc_flush_dcache_area(void *, size_t);
 extern void __cpuc_flush_kern_dcache_area(void *addr, size_t size);
 
-/*
- * These are private to the dma-mapping API.  Do not use directly.
- * Their sole purpose is to ensure that data held in the cache
- * is visible to DMA, or data written by DMA to system memory is
- * visible to the CPU.
- */
-extern void __cpuc_dma_clean_range(unsigned long, unsigned long);
-extern void __cpuc_dma_flush_range(unsigned long, unsigned long);
-
 /*
  * Copy user data from/to a page which is mapped into a different
  * processes address space.  Really, we want to allow our "user
diff --git a/arch/unicore32/include/asm/dma-mapping.h 
b/arch/unicore32/include/asm/dma-mapping.h
index 518ba5848dd6..e949855bb794 100644
--- a/arch/unicore32/include/asm/dma-mapping.h
+++ b/arch/unicore32/include/asm/dma-mapping.h
@@ -18,9 +18,6 @@
 #include 
 #include 
 
-#include 
-#include 
-
 extern const struct dma_map_ops swiotlb_dma_map_ops;
 
 static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
@@ -51,20 +48,6 @@ static inline void dma_mark_clean(void *addr, size_t size) {}
 static inline void dma_cache_sync(struct device *dev, void *vaddr,
size_t size, enum dma_data_direction direction)
 {
-   unsigned long start = (unsigned long)vaddr;
-   unsigned long end   = start + size;
-
-   switch (direction) {
-   case DMA_NONE:
-   BUG();
-   case DMA_FROM_DEVICE:
-   case DMA_BIDIRECTIONAL: /* writeback and invalidate */
-   __cpuc_dma_flush_range(start, end);
-   break;
-   case DMA_TO_DEVICE: /* writeback only */
-   __cpuc_dma_clean_range(start, end);
-   break;
-   }
 }
 
 #endif /* __KERNEL__ */
diff --git a/arch/unicore32/mm/proc-syms.c b/arch/unicore32/mm/proc-syms.c
index 21c00fc85c99..df215fd6d639 100644
--- a/arch/unicore32/mm/proc-syms.c
+++ b/arch/unicore32/mm/proc-syms.c
@@ -20,6 +20,3 @@ EXPORT_SYMBOL(cpu_dcache_clean_area);
 EXPORT_SYMBOL(cpu_set_pte);
 
 EXPORT_SYMBOL(__cpuc_coherent_kern_range);
-
-EXPORT_SYMBOL(__cpuc_dma_flush_range);
-EXPORT_SYMBOL(__cpuc_dma_clean_range);
-- 
2.14.1



[PATCH 07/11] powerpc: make dma_cache_sync a no-op

2017-10-03 Thread Christoph Hellwig
powerpc does not implement DMA_ATTR_NON_CONSISTENT allocations, so it
doesn't make any sense to do any work in dma_cache_sync given that it
must be a no-op when dma_alloc_attrs returns coherent memory.

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/include/asm/dma-mapping.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/dma-mapping.h 
b/arch/powerpc/include/asm/dma-mapping.h
index eaece3d3e225..320846442bfb 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -144,8 +144,6 @@ static inline phys_addr_t dma_to_phys(struct device *dev, 
dma_addr_t daddr)
 static inline void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
enum dma_data_direction direction)
 {
-   BUG_ON(direction == DMA_NONE);
-   __dma_sync(vaddr, size, (int)direction);
 }
 
 #endif /* __KERNEL__ */
-- 
2.14.1



[PATCH 06/11] mn10300: make dma_cache_sync a no-op

2017-10-03 Thread Christoph Hellwig
mn10300 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it
doesn't make any sense to do any work in dma_cache_sync given that it must
be a no-op when dma_alloc_attrs returns coherent memory.

Signed-off-by: Christoph Hellwig 
---
 arch/mn10300/include/asm/dma-mapping.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/mn10300/include/asm/dma-mapping.h 
b/arch/mn10300/include/asm/dma-mapping.h
index 737ef574b3ea..dc24163b190f 100644
--- a/arch/mn10300/include/asm/dma-mapping.h
+++ b/arch/mn10300/include/asm/dma-mapping.h
@@ -11,9 +11,6 @@
 #ifndef _ASM_DMA_MAPPING_H
 #define _ASM_DMA_MAPPING_H
 
-#include 
-#include 
-
 extern const struct dma_map_ops mn10300_dma_ops;
 
 static inline const struct dma_map_ops *get_arch_dma_ops(struct bus_type *bus)
@@ -25,7 +22,6 @@ static inline
 void dma_cache_sync(void *vaddr, size_t size,
enum dma_data_direction direction)
 {
-   mn10300_dcache_flush_inv();
 }
 
 #endif
-- 
2.14.1



[PATCH 05/11] microblaze: make dma_cache_sync a no-op

2017-10-03 Thread Christoph Hellwig
microblaze does not implement DMA_ATTR_NON_CONSISTENT allocations, so it
doesn't make any sense to do any work in dma_cache_sync given that it
must be a no-op when dma_alloc_attrs returns coherent memory.

This also allows moving __dma_sync out of the microblaze asm/dma-mapping.h
and thus greatly reduce the amount of includes there.

Signed-off-by: Christoph Hellwig 
---
 arch/microblaze/include/asm/dma-mapping.h | 34 ---
 arch/microblaze/kernel/dma.c  | 17 
 2 files changed, 17 insertions(+), 34 deletions(-)

diff --git a/arch/microblaze/include/asm/dma-mapping.h 
b/arch/microblaze/include/asm/dma-mapping.h
index e15cd2f76e23..ad448e4aedb6 100644
--- a/arch/microblaze/include/asm/dma-mapping.h
+++ b/arch/microblaze/include/asm/dma-mapping.h
@@ -15,22 +15,6 @@
 #ifndef _ASM_MICROBLAZE_DMA_MAPPING_H
 #define _ASM_MICROBLAZE_DMA_MAPPING_H
 
-/*
- * See Documentation/DMA-API-HOWTO.txt and
- * Documentation/DMA-API.txt for documentation.
- */
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#define __dma_alloc_coherent(dev, gfp, size, handle)   NULL
-#define __dma_free_coherent(size, addr)((void)0)
-
 /*
  * Available generic sets of operations
  */
@@ -41,27 +25,9 @@ static inline const struct dma_map_ops 
*get_arch_dma_ops(struct bus_type *bus)
return _direct_ops;
 }
 
-static inline void __dma_sync(unsigned long paddr,
- size_t size, enum dma_data_direction direction)
-{
-   switch (direction) {
-   case DMA_TO_DEVICE:
-   case DMA_BIDIRECTIONAL:
-   flush_dcache_range(paddr, paddr + size);
-   break;
-   case DMA_FROM_DEVICE:
-   invalidate_dcache_range(paddr, paddr + size);
-   break;
-   default:
-   BUG();
-   }
-}
-
 static inline void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
enum dma_data_direction direction)
 {
-   BUG_ON(direction == DMA_NONE);
-   __dma_sync(virt_to_phys(vaddr), size, (int)direction);
 }
 
 #endif /* _ASM_MICROBLAZE_DMA_MAPPING_H */
diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index 94700c5270a9..e52b684f8ea2 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define NOT_COHERENT_CACHE
 
@@ -51,6 +52,22 @@ static void dma_direct_free_coherent(struct device *dev, 
size_t size,
 #endif
 }
 
+static inline void __dma_sync(unsigned long paddr,
+ size_t size, enum dma_data_direction direction)
+{
+   switch (direction) {
+   case DMA_TO_DEVICE:
+   case DMA_BIDIRECTIONAL:
+   flush_dcache_range(paddr, paddr + size);
+   break;
+   case DMA_FROM_DEVICE:
+   invalidate_dcache_range(paddr, paddr + size);
+   break;
+   default:
+   BUG();
+   }
+}
+
 static int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl,
 int nents, enum dma_data_direction direction,
 unsigned long attrs)
-- 
2.14.1



[PATCH 04/11] ia64: make dma_cache_sync a no-op

2017-10-03 Thread Christoph Hellwig
ia64 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
make any sense to do any work in dma_cache_sync given that it must be a
no-op when dma_alloc_attrs returns coherent memory.

Signed-off-by: Christoph Hellwig 
---
 arch/ia64/include/asm/dma-mapping.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/ia64/include/asm/dma-mapping.h 
b/arch/ia64/include/asm/dma-mapping.h
index 3ce5ab4339f3..99dfc1aa9d3c 100644
--- a/arch/ia64/include/asm/dma-mapping.h
+++ b/arch/ia64/include/asm/dma-mapping.h
@@ -48,11 +48,6 @@ static inline void
 dma_cache_sync (struct device *dev, void *vaddr, size_t size,
enum dma_data_direction dir)
 {
-   /*
-* IA-64 is cache-coherent, so this is mostly a no-op.  However, we do 
need to
-* ensure that dma_cache_sync() enforces order, hence the mb().
-*/
-   mb();
 }
 
 #endif /* _ASM_IA64_DMA_MAPPING_H */
-- 
2.14.1



[PATCH 03/11] frv: make dma_cache_sync a no-op

2017-10-03 Thread Christoph Hellwig
frv does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
make any sense to do any work in dma_cache_sync given that it must be a
no-op when dma_alloc_attrs returns coherent memory.

Signed-off-by: Christoph Hellwig 
---
 arch/frv/include/asm/dma-mapping.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/frv/include/asm/dma-mapping.h 
b/arch/frv/include/asm/dma-mapping.h
index 354900917585..da0e5c9744c4 100644
--- a/arch/frv/include/asm/dma-mapping.h
+++ b/arch/frv/include/asm/dma-mapping.h
@@ -18,7 +18,6 @@ static inline
 void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
enum dma_data_direction direction)
 {
-   flush_write_buffers();
 }
 
 #endif  /* _ASM_DMA_MAPPING_H */
-- 
2.14.1



[PATCH 01/11] floppy: consolidate the dummy fd_cacheflush definition

2017-10-03 Thread Christoph Hellwig
Only mips implements this helper non-trivially, so consolidate the dummy
definition in the floppy driver and remove the other arch definitions.

Signed-off-by: Christoph Hellwig 
---
 arch/alpha/include/asm/floppy.h| 2 --
 arch/powerpc/include/asm/floppy.h  | 2 --
 arch/sparc/include/asm/floppy_32.h | 1 -
 arch/sparc/include/asm/floppy_64.h | 1 -
 drivers/block/floppy.c | 4 
 5 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/include/asm/floppy.h b/arch/alpha/include/asm/floppy.h
index bae97eb19d26..942924756cf2 100644
--- a/arch/alpha/include/asm/floppy.h
+++ b/arch/alpha/include/asm/floppy.h
@@ -24,7 +24,6 @@
 #define fd_set_dma_count(count) set_dma_count(FLOPPY_DMA,count)
 #define fd_enable_irq() enable_irq(FLOPPY_IRQ)
 #define fd_disable_irq()disable_irq(FLOPPY_IRQ)
-#define fd_cacheflush(addr,size) /* nothing */
 #define fd_request_irq()request_irq(FLOPPY_IRQ, floppy_interrupt,\
0, "floppy", NULL)
 #define fd_free_irq()   free_irq(FLOPPY_IRQ, NULL)
@@ -62,7 +61,6 @@ alpha_fd_dma_setup(char *addr, unsigned long size, int mode, 
int io)
prev_dir = dir;
 
fd_clear_dma_ff();
-   fd_cacheflush(addr, size);
fd_set_dma_mode(mode);
set_dma_addr(FLOPPY_DMA, bus_addr);
fd_set_dma_count(size);
diff --git a/arch/powerpc/include/asm/floppy.h 
b/arch/powerpc/include/asm/floppy.h
index 936a904ae78c..167c44b58848 100644
--- a/arch/powerpc/include/asm/floppy.h
+++ b/arch/powerpc/include/asm/floppy.h
@@ -25,7 +25,6 @@
 #define fd_get_dma_residue()fd_ops->_get_dma_residue(FLOPPY_DMA)
 #define fd_enable_irq() enable_irq(FLOPPY_IRQ)
 #define fd_disable_irq()disable_irq(FLOPPY_IRQ)
-#define fd_cacheflush(addr,size) /* nothing */
 #define fd_free_irq()   free_irq(FLOPPY_IRQ, NULL);
 
 #include 
@@ -152,7 +151,6 @@ static int hard_dma_setup(char *addr, unsigned long size, 
int mode, int io)
prev_dir = dir;
 
fd_clear_dma_ff();
-   fd_cacheflush(addr, size);
fd_set_dma_mode(mode);
set_dma_addr(FLOPPY_DMA, bus_addr);
fd_set_dma_count(size);
diff --git a/arch/sparc/include/asm/floppy_32.h 
b/arch/sparc/include/asm/floppy_32.h
index 071b83e52f15..dab58525229e 100644
--- a/arch/sparc/include/asm/floppy_32.h
+++ b/arch/sparc/include/asm/floppy_32.h
@@ -70,7 +70,6 @@ static struct sun_floppy_ops sun_fdops;
 #define fd_set_dma_count(count)   sun_fd_set_dma_count(count)
 #define fd_enable_irq()   /* nothing... */
 #define fd_disable_irq()  /* nothing... */
-#define fd_cacheflush(addr, size) /* nothing... */
 #define fd_request_irq()  sun_fd_request_irq()
 #define fd_free_irq() /* nothing... */
 #if 0  /* P3: added by Alain, these cause a MMU corruption. 19960524 XXX */
diff --git a/arch/sparc/include/asm/floppy_64.h 
b/arch/sparc/include/asm/floppy_64.h
index 625756406a7e..a1db35a22c99 100644
--- a/arch/sparc/include/asm/floppy_64.h
+++ b/arch/sparc/include/asm/floppy_64.h
@@ -72,7 +72,6 @@ static struct sun_floppy_ops sun_fdops;
 #define fd_set_dma_addr(addr) sun_fdops.fd_set_dma_addr(addr)
 #define fd_set_dma_count(count)   sun_fdops.fd_set_dma_count(count)
 #define get_dma_residue(x)sun_fdops.get_dma_residue()
-#define fd_cacheflush(addr, size) /* nothing... */
 #define fd_request_irq()  sun_fdops.fd_request_irq()
 #define fd_free_irq() sun_fdops.fd_free_irq()
 #define fd_eject(drive)   sun_fdops.fd_eject(drive)
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 60c086a53609..a54183935aa1 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -275,6 +275,10 @@ static int set_next_request(void);
 #define fd_dma_mem_alloc(size) __get_dma_pages(GFP_KERNEL, get_order(size))
 #endif
 
+#ifndef fd_cacheflush
+#define fd_cacheflush(addr, size) /* nothing... */
+#endif
+
 static inline void fallback_on_nodma_alloc(char **addr, size_t l)
 {
 #ifdef FLOPPY_CAN_FALLBACK_ON_NODMA
-- 
2.14.1



[PATCH 02/11] x86: make dma_cache_sync a no-op

2017-10-03 Thread Christoph Hellwig
x86 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
make any sense to do any work in dma_cache_sync given that it must be a
no-op when dma_alloc_attrs returns coherent memory.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Thomas Gleixner 
---
 arch/x86/include/asm/dma-mapping.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/include/asm/dma-mapping.h 
b/arch/x86/include/asm/dma-mapping.h
index 1387dafdba2d..c6d3367be916 100644
--- a/arch/x86/include/asm/dma-mapping.h
+++ b/arch/x86/include/asm/dma-mapping.h
@@ -71,7 +71,6 @@ static inline void
 dma_cache_sync(struct device *dev, void *vaddr, size_t size,
enum dma_data_direction dir)
 {
-   flush_write_buffers();
 }
 
 static inline unsigned long dma_alloc_coherent_mask(struct device *dev,
-- 
2.14.1



refactor dma_cache_sync V2

2017-10-03 Thread Christoph Hellwig
The dma_cache_sync routine is used to flush caches for memory returned
by dma_alloc_attrs with the DMA_ATTR_NON_CONSISTENT flag (or previously
from dma_alloc_noncoherent), but the requirements for it seem to be
frequently misunderstood.  dma_cache_sync is documented to be a no-op for
allocations that do not have the DMA_ATTR_NON_CONSISTENT flag set, and
yet a lot of architectures implement it in some way despite not
implementing DMA_ATTR_NON_CONSISTENT.

This series removes a few abuses of dma_cache_sync for non-DMA API
purposes, then changes all remaining architectures that do not implement
DMA_ATTR_NON_CONSISTENT to implement dma_cache_sync as a no-op, and
then adds the struct dma_map_ops indirection we use for all other
DMA mapping operations to dma_cache_sync as well, thus removing all but
two implementations of the function.
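
As a reminder of the documented protocol (an illustrative fragment; "dev"
and SIZE are placeholders, not code from this series):

	dma_addr_t handle;
	void *buf = dma_alloc_attrs(dev, SIZE, &handle, GFP_KERNEL,
				    DMA_ATTR_NON_CONSISTENT);

	/* ... CPU writes to buf ... */

	/*
	 * The caller must sync before handing the memory to the device;
	 * for a plain (coherent) dma_alloc_attrs allocation this call is
	 * documented to be a no-op, which is why most architectures can
	 * implement it as an empty function.
	 */
	dma_cache_sync(dev, buf, SIZE, DMA_TO_DEVICE);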

Changes since V1:
 - drop the mips fd_cacheflush, merged via maintainer
 - spelling fix in last commit descriptions (thanks Geert)


Re: [linux-next][Oops] memory hot-unplug results fault instruction address at /include/linux/list.h:104

2017-10-03 Thread Abdul Haleem
On Wed, 2017-09-20 at 12:54 -0700, Kees Cook wrote:
> On Wed, Sep 20, 2017 at 12:40 AM, Abdul Haleem
>  wrote:
> > On Tue, 2017-09-12 at 12:11 +0530, abdul wrote:
> >> Hi,
> >>
> >> Memory hot-unplug on PowerVM LPAR running next-20170911 results in
> >> Faulting instruction address: 0xc02b56c4
> >>
> >> which maps to the below code path:
> >>
> >> 0xc02b56c4 is in __rmqueue (./include/linux/list.h:104).
> >> 99 * This is only for internal list manipulation where we know
> >> 100* the prev/next entries already!
> >> 101*/
> >> 102   static inline void __list_del(struct list_head * prev, struct
> >> list_head * next)
> >> 103   {
> >> 104   next->prev = prev;
> >> 105   WRITE_ONCE(prev->next, next);
> >> 106   }
> >> 107
> >> 108   /**
> >>
> >
> > I see another kernel Oops when running transparent hugepages
> > de-fragmentation test.
> >
> > And the faulting instruction address is again pointing to the same code line
> > 0xc026f9f4 is in compaction_alloc (./include/linux/list.h:104)
> >
> > steps to recreate:
> > -
> > 1. Enable transparent hugepages ("always")
> > 2. Turn off the defrag $ echo 0 > khugepaged/defrag
> > 3. Write random to memory path
> > 4. Set huge pages numbers
> > 5. Turn on defrag $ echo 1 > khugepaged/defrag
> >
> >
> > new trace:
> > --
> > Unable to handle kernel paging request for data at address
> > 0x5deadbeef108
> 
> This looks like a use-after-list-removal; that value appears to be LIST_POISON1.
> 
> Try enabling CONFIG_DEBUG_LIST to see if you get better details?

Trace messages after enabling CONFIG_DEBUG_LIST

BUG: Bad page state in process in:imklog  pfn:6cbb3
page:f1b2ecc0 count:2 mapcount:0 mapping:c00769aafd20 index:0x1
flags: 0x3381068(uptodate|lru|active|private)
raw: 03381068 c00769aafd20 0001 0002
raw: 5deadbeef100 5deadbeef200  c000feca3400
page dumped because: page still charged to cgroup
page->mem_cgroup:c000feca3400
bad because of flags: 0x1068(uptodate|lru|active|private)
kernel BUG at mm/vmscan.c:1556!
[c5da79f0] [c02bfe74] __alloc_pages_nodemask+0x754/0x1160
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: xt_addrtype xt_conntrack ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
iptable_filter 
[c5da7bf0] [c034c238] alloc_pages_vma+0xb8/0x290
[c5da7c60] [c03102b0] __handle_mm_fault+0x1150/0x1ad0
[c5da7d40] [c0310d58] handle_mm_fault+0x128/0x210
[c5da7d80] [c0067878] __do_page_fault+0x218/0x8e0
[c5da7e30] [c000a4a4] handle_page_fault+0x18/0x38
Instruction dump:
38210060 e8010010 7c0803a6 4e800020 6042 3c62ff93 7ca62b78 7d244b78 
7d455378 3863edc8 4bafe4d1 6000 <0fe0> 3860 4b60 6000 
---[ end trace 1e619608a776e913 ]---
list_add corruption. next->prev should be prev (c0077ff54710), but was 
5deadbeef200. (next=f1b2ece0).
[ cut here ]
WARNING: CPU: 5 PID: 308 at lib/list_debug.c:25 __list_add_valid+0xa4/0xf0
Modules linked in: xt_addrtype xt_conntrack ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 
iptable_filter ip_tables x_tables nf_nat nf_conntrack bridge stp llc 
dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c vmx_crypto 
pseries_rng
 ip_tables x_tables nf_nat nf_conntrack bridge stp llc dm_thin_pool 
dm_persistent_data dm_bio_prison dm_bufio libcrc32c vmx_crypto pseries_rng 
rtc_generic autofs4
CPU: 2 PID: 1 Comm: systemd Tainted: GB   W   
4.14.0-rc2-next-20170929-autotest #2
task: c00777e0 task.stack: c00777e8
NIP:  c02d5900 LR: c02d586c CTR: 
REGS: c00777e82c20 TRAP: 0700   Tainted: GB   W
(4.14.0-rc2-next-20170929-autotest)
MSR:  80029033   CR: 22248428  XER: 200a  
CFAR: c02d587c SOFTE: 0 
GPR00: c02d586c c00777e82ea0 c15ac700 ffea 
GPR04:  c00777e830a0 00014f28 0001 
GPR08:  033800010008  3563376431303030 
GPR12: 8800 ce741500 f1d7c4a0 0001 
 rtc_generic
GPR16: c00777e833ac c00777e830b0 0002 c00777e830a0 
GPR20:  c00777e833c4 c00777e82f10 0006 
GPR24: c00777e82f50 0020 0007 c00774193800 
GPR28: 0006 000c c00774193820 f1d7c560 
 autofs4
NIP [c02d5900] isolate_lru_pages.isra.21+0x360/0x580
LR [c02d586c] isolate_lru_pages.isra.21+0x2cc/0x580
Call Trace:
[c00777e82ea0] [c02d586c] isolate_lru_pages.isra.21+0x2cc/0x580 
(unreliable)
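
For reference, the 0x5deadbeef... values in these traces are the list
poison constants (the earlier fault address is LIST_POISON1 plus the
offset of ->prev). From include/linux/poison.h, approximately:

	/*
	 * POISON_POINTER_DELTA comes from CONFIG_ILLEGAL_POINTER_VALUE,
	 * the 0x5deadbeef... base used on ppc64.
	 */
	#define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
	#define LIST_POISON2  ((void *) 0x200 + POISON_POINTER_DELTA)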

Re: [patch V2 22/29] lockup_detector: Make watchdog_nmi_reconfigure() two stage

2017-10-03 Thread Nicholas Piggin
On Tue, 3 Oct 2017 09:04:03 +0200 (CEST)
Thomas Gleixner  wrote:

> On Tue, 3 Oct 2017, Thomas Gleixner wrote:
> > On Tue, 3 Oct 2017, Michael Ellerman wrote:  
> > > Hi Thomas,
> > > Unfortunately this is hitting the WARN_ON in start_wd_cpu() on powerpc
> > > because we're calling it multiple times for the boot CPU.
> > > 
> > > The first call is via:
> > > 
> > >   start_wd_on_cpu+0x80/0x2f0
> > >   watchdog_nmi_reconfigure+0x124/0x170
> > >   softlockup_reconfigure_threads+0x110/0x130
> > >   lockup_detector_init+0xbc/0xe0
> > >   kernel_init_freeable+0x18c/0x37c
> > >   kernel_init+0x2c/0x160
> > >   ret_from_kernel_thread+0x5c/0xbc
> > > 
> > > And then again via the CPU hotplug registration:
> > > 
> > >   start_wd_on_cpu+0x80/0x2f0
> > >   cpuhp_invoke_callback+0x194/0x620
> > >   cpuhp_thread_fun+0x7c/0x1b0
> > >   smpboot_thread_fn+0x290/0x2a0
> > >   kthread+0x168/0x1b0
> > >   ret_from_kernel_thread+0x5c/0xbc
> > > 
> > > 
> > > The first call is new because previously watchdog_nmi_reconfigure()
> > > wasn't called from softlockup_reconfigure_threads().  
> > 
> > Hmm, don't you have the same problem with CPU hotplug or do you just get
> > lucky because the hotplug callback in your code is ordered vs. the
> > softlockup thread hotplug callback in a way that this does not hit?

I had the idea that watchdog_nmi_reconfigure() being only called
with get_online_cpus held would prevent hotplug callbacks running.
  
> 
> Which leads me to the question why you need the hotplug state at all if the
> softlockup detector is enabled. Wouldn't it make more sense to only
> register the state if softlockup detector is turned off in Kconfig and
> actually move it to the core code?

I don't understand what you mean exactly, but it was done to avoid
relying on the softlockup detector at all, because it wasn't needed
for anything else (unlike the perf lockup detector).

Thanks,
Nick


Re: [PATCH] kernel/module_64.c: Add REL24 relocation support of livepatch symbols

2017-10-03 Thread Naveen N . Rao
Hi Kamalesh,

On 2017/10/03 05:36AM, Kamalesh Babulal wrote:
> With commit 425595a7fc20 ("livepatch: reuse module loader code to
> write relocations") livepatch uses module loader to write relocations
> of livepatch symbols, instead of managing them by arch-dependent
> klp_write_module_reloc() function.
> 
> livepatch module managed relocation entries are written to sections
> marked with SHF_RELA_LIVEPATCH flag and livepatch symbols within the
> section are marked with SHN_LIVEPATCH symbol section index. When the
> livepatching module is loaded, the livepatch symbols are resolved
> before calling apply_relocate_add() to apply the relocations.
> 
> The R_PPC64_REL24 relocation type resolves to a function address, which may
> be local to the livepatch module or available in the kernel/other modules.
> For every such non-local function, apply_relocate_add() constructs a stub
> (a.k.a. trampoline) to branch to the function. The stub code is responsible
> for saving the TOC onto the stack before calling the function via its
> global entry point. A NOP instruction is expected after every non-local
> function branch, i.e. after the REL24 relocation; apply_relocate_add()
> replaces it with a TOC restore instruction.
> 
> Livepatch symbols with the R_PPC64_REL24 relocation type may not be
> reachable within the current TOC range, or might not have a NOP instruction
> following the branch. The latter are local symbols which have become global
> to the livepatched function. As per ABIv2, local functions are accessed via
> their local entry point, which is relative to the current module's
> TOC value.
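
For readers less familiar with ABIv2, the dual entry point described above
looks roughly like this (illustrative asm, not taken from the patch):

	my_func:
		addis	r2, r12, (.TOC.-my_func)@ha	/* global entry: derive TOC from r12 */
		addi	r2, r2, (.TOC.-my_func)@l
		.localentry my_func, .-my_func	/* local entry: caller's r2/TOC assumed valid */
		...
		blr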
> 
> For example, consider the following livepatch relocations (the example is
> from a livepatch module generated by the kpatch tool):
> 
> Relocation section '.klp.rela.vmlinux..text.meminfo_proc_show' at offset 
> 0x84530 contains 44 entries:
> Offset Info Type  Symbol's Value   Symbol's 
> Name + Addend
> 0054  0056000a R_PPC64_REL24   
> .klp.sym.vmlinux.si_swapinfo,0 + 0
> 007c  0057000a R_PPC64_REL24   
> .klp.sym.vmlinux.total_swapcache_pages,0 + 0
> 00e8  0058000a R_PPC64_REL24   
> .klp.sym.vmlinux.show_val_kb,1 + 0
> [...]
> 
> 1. .klp.sym.vmlinux.si_swapinfo and .klp.sym.vmlinux.total_swapcache_pages
>are not available within the livepatch module TOC range.
> 
> 2. .klp.sym.vmlinux.show_val_kb livepatch symbol was previously local
>but now global to fs/proc/meminfo.c::meminfo_proc_show().
> 
> While the livepatch module is being loaded, the livepatch symbols mentioned
> in case 1 will fail with an error:
> [   74.485405] module_64: kpatch_meminfo_string: REL24 -1152921504751525976 
> out of range!
> 
> and the livepatch symbols mentioned in case 2 will fail with an error:
> [   24.568425] module_64: kpatch_meminfo_string: Expect noop after relocate, 
> got 3d22
> 
> Both failures with REL24 livepatch symbol relocations can be resolved
> by constructing a new livepatch stub. The newly set up klp_stub mimics the
> functionality of entry_64.S::livepatch_handler introduced by commit
> 85baa095497f ("powerpc/livepatch: Add live patching support on ppc64le").
> 
> That handler introduces a "livepatch stack" growing upwards from the base
> of the regular stack, which is used to store/restore TOC/LR values around
> the stub setup and branch. The additional instruction sequences to handle
> klp_stub increase the stub size, and the current ppc64_stub_insn[] is not
> sufficient to hold them. This patch therefore also introduces a new
> ppc64le_klp_stub_entry[], along with helpers to find/allocate the
> livepatch stub.
> 
> Signed-off-by: Kamalesh Babulal 
> Cc: Balbir Singh 
> Cc: Naveen N. Rao 
> Cc: Josh Poimboeuf 
> Cc: Jessica Yu 
> Cc: Ananth N Mavinakayanahalli 
> Cc: Aravinda Prasad 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: live-patch...@vger.kernel.org
> ---
>  arch/powerpc/include/asm/module.h  |   4 +
>  arch/powerpc/kernel/module_64.c| 119 
> -
>  arch/powerpc/kernel/trace/ftrace_64_mprofile.S |  77 
>  3 files changed, 197 insertions(+), 3 deletions(-)
> 
 
[snip]

A few small nits focusing on just the trampoline...

> diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S 
> b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
> index c98e90b..708a96d 100644
> --- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
> +++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
> @@ -249,6 +249,83 @@ livepatch_handler:
> 
>   /* Return to original caller of live patched function */
>   blr
> +
> + /*
> +  * This livepatch stub code is called from a livepatch module to jump into
> +  * the kernel or other modules. It replicates the livepatch_handler code,
> +  * 

Re: [PATCH v2 4/5] powerpc: pseries: only store the device node basename in full_name

2017-10-03 Thread Michael Ellerman
Hi Rob,

Unfortunately this one has a bug, which only showed up after some stress
testing.

Rob Herring  writes:
> With dependencies on full_name containing the entire device node path
> removed, stop storing the full_name in nodes created by
> dlpar_configure_connector() and pSeries_reconfig_add_node().
>
> Signed-off-by: Rob Herring 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
> v2:
> - Rework dlpar_parse_cc_node() to a single allocation and strcpy instead
>   of kasprintf
>   
>  arch/powerpc/platforms/pseries/dlpar.c| 30 +++---
>  arch/powerpc/platforms/pseries/reconfig.c |  2 +-
>  2 files changed, 8 insertions(+), 24 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
> b/arch/powerpc/platforms/pseries/dlpar.c
> index 783f36364690..31a0615aeea1 100644
> --- a/arch/powerpc/platforms/pseries/dlpar.c
> +++ b/arch/powerpc/platforms/pseries/dlpar.c
> @@ -75,28 +75,17 @@ static struct property *dlpar_parse_cc_property(struct 
> cc_workarea *ccwa)
>   return prop;
>  }
>  
> -static struct device_node *dlpar_parse_cc_node(struct cc_workarea *ccwa,
> -const char *path)
> +static struct device_node *dlpar_parse_cc_node(struct cc_workarea *ccwa)
>  {
>   struct device_node *dn;
> - char *name;
> -
> - /* If parent node path is "/" advance path to NULL terminator to
> -  * prevent double leading slashs in full_name.
> -  */
> - if (!path[1])
> - path++;
> + const char *name = (const char *)ccwa + be32_to_cpu(ccwa->name_offset);
>  
> - dn = kzalloc(sizeof(*dn), GFP_KERNEL);
> + dn = kzalloc(sizeof(*dn) + strlen(name) + 1, GFP_KERNEL);
>   if (!dn)
>   return NULL;
>  
> - name = (char *)ccwa + be32_to_cpu(ccwa->name_offset);
> - dn->full_name = kasprintf(GFP_KERNEL, "%s/%s", path, name);
> - if (!dn->full_name) {
> - kfree(dn);
> - return NULL;
> - }
> + dn->full_name = (char *)(dn + 1);
> + strcpy((char *)dn->full_name, name);

Allocating the full_name on the tail of the struct this way breaks the
call to kfree(node->full_name) in of_node_release().

eg:


[  322.543581] Freeing node->full_name c007f4ed4090
[  322.543583] 
=
[  322.543586] BUG kmalloc-192 (Tainted: GB  ): Invalid object 
pointer 0xc007f4ed4090
[  322.543588] 
-

[  322.543592] INFO: Slab 0xf1fd3b40 objects=292 used=72 
fp=0xc007f4ed41a8 flags=0x1380101
[  322.543595] CPU: 40 PID: 5 Comm: kworker/u192:0 Tainted: GB   
4.14.0-rc2-gcc-6.3.1-next-20170929-dirty #429
[  322.543600] Workqueue: pseries hotplug workque pseries_hp_work_fn
[  322.543602] Call Trace:
[  322.543605] [c003f769f660] [c0a6b710] dump_stack+0xb0/0xf0 
(unreliable)
[  322.543609] [c003f769f6a0] [c032cd5c] slab_err+0x9c/0xc0
[  322.543612] [c003f769f790] [c033440c] 
free_debug_processing+0x19c/0x490
[  322.543615] [c003f769f860] [c03349fc] __slab_free+0x2fc/0x4b0
[  322.543619] [c003f769f960] [c08f2484] of_node_release+0xe4/0x1a0
[  322.543622] [c003f769fa00] [c0a71c04] kobject_put+0xd4/0x160
[  322.543625] [c003f769fa80] [c08f10c4] of_node_put+0x34/0x50
[  322.543628] [c003f769fab0] [c00c0248] 
dlpar_cpu_remove_by_index+0x108/0x130
[  322.543631] [c003f769fb40] [c00c166c] dlpar_cpu+0x27c/0x510
[  322.543634] [c003f769fbf0] [c00bb2d0] 
handle_dlpar_errorlog+0xc0/0x160
[  322.543638] [c003f769fc60] [c00bb3ac] 
pseries_hp_work_fn+0x3c/0xa0
[  322.543641] [c003f769fc90] [c012200c] 
process_one_work+0x2bc/0x560
[  322.543645] [c003f769fd20] [c0122348] worker_thread+0x98/0x5d0
[  322.543648] [c003f769fdc0] [c012b4e4] kthread+0x164/0x1b0
[  322.543651] [c003f769fe30] [c000bae0] 
ret_from_kernel_thread+0x5c/0x7c


The obvious fix is just to allocate it separately as before, e.g. something like:

diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
b/arch/powerpc/platforms/pseries/dlpar.c
index 267b01e30ac0..8617d7e28bf7 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -80,11 +80,16 @@ static struct device_node *dlpar_parse_cc_node(struct 
cc_workarea *ccwa)
struct device_node *dn;
const char *name = (const char *)ccwa + be32_to_cpu(ccwa->name_offset);
 
-   dn = kzalloc(sizeof(*dn) + strlen(name) + 1, GFP_KERNEL);
+   dn = kzalloc(sizeof(*dn), GFP_KERNEL);
if (!dn)
return NULL;
 
-   dn->full_name = (char *)(dn + 1);
+   dn->full_name = 

Re: [linux-next] [bisected a4615d11] Memory DLPAR triggers WARN_ONCE() in mm/page_vma_mapped.c

2017-10-03 Thread Abdul Haleem
On Fri, 2017-09-29 at 10:07 -0400, Zi Yan wrote:
> Hi Abdul,
> 
> I just want to follow up with this.
> 
> Did you have a chance to test my patch? Does it fix your original problem?

Yes I did test the patch. it fixes the warning.

Reported-and-tested-by:  Abdul Haleem 

Thanks for the fix.

--
Regards,

Abdul Haleem
IBM Linux Technology Centre

> 
> On 13 Sep 2017, at 1:48, abdul wrote:
> 
> > On Mon, 2017-09-11 at 10:53 -0400, Zi Yan wrote:
> >> Hi Abdul,
> >>
> >> Can you try this patch below? I think I missed that pmd entries
> >> can be zapped, so the WARN_ONCE is unnecessary.
> >>
> >> Thanks.
> >>
> >> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> >> index 6a03946469a9..eb462e7db0a9 100644
> >> --- a/mm/page_vma_mapped.c
> >> +++ b/mm/page_vma_mapped.c
> >> @@ -167,8 +167,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk 
> >> *pvmw)
> >> return not_found(pvmw);
> >> return true;
> >> }
> >> -   } else
> >> -   WARN_ONCE(1, "Non present huge pmd without 
> >> pmd migration enabled!");
> >> +   }
> >> return not_found(pvmw);
> >> } else {
> >> /* THP pmd was split under us: handle on pte level 
> >> */
> >>
> >> --
> >> Best Regards
> >> Yan Zi




Re: [PATCH 4.4] cxl: Fix driver use count

2017-10-03 Thread Greg KH
On Wed, Sep 27, 2017 at 10:43:17AM +1000, Andrew Donnellan wrote:
> From: Frederic Barrat 
> 
> commit 197267d0356004a31c4d6b6336598f5dff3301e1 upstream.
> 
> cxl keeps a driver use count, which is used with the hash memory model
> on p8 to know when to upgrade local TLBIs to global and to trigger
> callbacks to manage the MMU for PSL8.
> 
> If a process opens a context and closes without attaching or fails the
> attachment, the driver use count is never decremented. As a
> consequence, TLB invalidations remain global, even if there are no
> active cxl contexts.
> 
> We should increment the driver use count when the process is attaching
> to the cxl adapter, and not on open. It's not needed before the
> adapter starts using the context and the use count is decremented on
> the detach path, so it makes more sense.
> 
> It affects only the user api. The kernel api is already doing The
> Right Thing.
> 
> Signed-off-by: Frederic Barrat 
> Cc: sta...@vger.kernel.org # v4.2+
> Fixes: 7bb5d91a4dda ("cxl: Rework context lifetimes")
> Acked-by: Andrew Donnellan 
> Signed-off-by: Michael Ellerman 
> [ajd: backport to stable v4.4 tree]
> Signed-off-by: Andrew Donnellan 

Thanks for this and the 4.9 backport.

greg k-h


Re: [PATCH v3] powerpc: fix compile error on 64K pages on 40x, 44x

2017-10-03 Thread Christophe LEROY



On 01/10/2017 at 16:33, Christian Lamparter wrote:

The mmu context on the 40x and 44x does not define a pte_frag
entry. This causes gcc to abort the compilation due to:

setup-common.c: In function ‘setup_arch’:
setup-common.c:908: error: ‘mm_context_t’ has no ‘pte_frag’

This patch fixes the issue by removing the pte_frag
initialization in setup-common.c.

This is possible because the compiler will do the
initialization, since the mm_context is a sub-struct of
init_mm, and init_mm is declared in mm_types.h with external
linkage. According to C99 6.2.4.3:
"An object whose identifier is declared with external linkage
[...] has static storage duration."

C99 defines in 6.7.8.10 that: "
If an object that has static storage duration is not
initialized explicitly, then:
- if it has pointer type, it is initialized to a null pointer
[...]
"

Signed-off-by: Christian Lamparter 


Reviewed-by: Christophe Leroy 


---
  arch/powerpc/kernel/setup-common.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 0ac741fae90e..2e3bc16d02b2 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -904,9 +904,6 @@ void __init setup_arch(char **cmdline_p)
  #endif
  #endif
  
-#ifdef CONFIG_PPC_64K_PAGES
-   init_mm.context.pte_frag = NULL;
-#endif
  #ifdef CONFIG_SPAPR_TCE_IOMMU
mm_iommu_init(_mm);
  #endif



Re: [PATCH 0/2] powerpc/xive: fix CPU hot unplug

2017-10-03 Thread Benjamin Herrenschmidt
On Tue, 2017-10-03 at 17:58 +1100, David Gibson wrote:
> 
> Ok.. why do you think this isn't of use?  I'm pretty sure this is
> necessary for the TCG case, since MSR is checked in cpu_has_work(),
> which could otherwise wake up the "dead" cpu.

Only if it's not in a PM state; in that case we check the corresponding
LPCR:PECE* bit. At least on P7 and later.

Cheers,
Ben.



Re: [PATCH v3 00/20] Speculative page faults

2017-10-03 Thread Laurent Dufour
On 03/10/2017 03:27, Michael Ellerman wrote:
> Laurent Dufour  writes:
> 
>> Hi Andrew,
>>
>> On 28/09/2017 22:38, Andrew Morton wrote:
>>> On Thu, 28 Sep 2017 14:29:02 +0200 Laurent Dufour 
>>>  wrote:
>>>
> Laurent's [0/n] provides some nice-looking performance benefits for
> workloads which are chosen to show performance benefits(!) but, alas,
> no quantitative testing results for workloads which we may suspect will
> be harmed by the changes(?).  Even things as simple as impact upon
> single-threaded pagefault-intensive workloads and its effect upon
> CONFIG_SMP=n .text size?

 I forgot to mention in my previous email the impact on the .text section.

 Here are the metrics I got :

 .text size   UP       SMP      Delta
 4.13-mmotm   8444201  8964137  6.16%
 '' +spf      8452041  8971929  6.15%
 Delta        0.09%    0.09%

 No major impact as you could see.
>>>
>>> 8k text increase seems rather a lot actually.  That's a lot more
>>> userspace cacheclines that get evicted during a fault...
>>>
>>> Is the feature actually beneficial on uniprocessor?
>>
>> This is useless on uniprocessor, and I will disable it on x86 when !SMP 
>> by not defining __HAVE_ARCH_CALL_SPF.
>> So the speculative page fault handler will not be built, but the vm 
>> sequence counter and the SRCU stuff will still be there. I may also make 
>> it disabled through a macro when __HAVE_ARCH_CALL_SPF is not defined, but 
>> this may obfuscate the code a bit...
>>
>> On ppc64, as this feature requires book3s, it can't be built without SMP 
>> support.
> 
> Book3S doesn't force SMP, eg. PMAC is Book3S but can be built with SMP=n.
> 
> It's true that POWERNV and PSERIES both force SMP, and those are the
> platforms used on modern Book3S CPUs.

Thanks Michael,

I'll add a check on CONFIG_SMP on ppc too.

Laurent.
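
For the record, the macro-based disable discussed above could look
something like this (a hypothetical sketch; everything except
__HAVE_ARCH_CALL_SPF is illustrative):

	#ifdef __HAVE_ARCH_CALL_SPF
	extern int handle_speculative_fault(struct mm_struct *mm,
					    unsigned long address,
					    unsigned int flags);
	#else
	/* no arch opt-in: compile the speculative path out entirely */
	static inline int handle_speculative_fault(struct mm_struct *mm,
						   unsigned long address,
						   unsigned int flags)
	{
		return VM_FAULT_RETRY;	/* fall back to the classic fault path */
	}
	#endif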



Re: [patch V2 22/29] lockup_detector: Make watchdog_nmi_reconfigure() two stage

2017-10-03 Thread Thomas Gleixner
On Tue, 3 Oct 2017, Thomas Gleixner wrote:
> On Tue, 3 Oct 2017, Michael Ellerman wrote:
> > Hi Thomas,
> > Unfortunately this is hitting the WARN_ON in start_wd_cpu() on powerpc
> > because we're calling it multiple times for the boot CPU.
> > 
> > The first call is via:
> > 
> >   start_wd_on_cpu+0x80/0x2f0
> >   watchdog_nmi_reconfigure+0x124/0x170
> >   softlockup_reconfigure_threads+0x110/0x130
> >   lockup_detector_init+0xbc/0xe0
> >   kernel_init_freeable+0x18c/0x37c
> >   kernel_init+0x2c/0x160
> >   ret_from_kernel_thread+0x5c/0xbc
> > 
> > And then again via the CPU hotplug registration:
> > 
> >   start_wd_on_cpu+0x80/0x2f0
> >   cpuhp_invoke_callback+0x194/0x620
> >   cpuhp_thread_fun+0x7c/0x1b0
> >   smpboot_thread_fn+0x290/0x2a0
> >   kthread+0x168/0x1b0
> >   ret_from_kernel_thread+0x5c/0xbc
> > 
> > 
> > The first call is new because previously watchdog_nmi_reconfigure()
> > wasn't called from softlockup_reconfigure_threads().
> 
> Hmm, don't you have the same problem with CPU hotplug or do you just get
> lucky because the hotplug callback in your code is ordered vs. the
> softlockup thread hotplug callback in a way that this does not hit?

Which leads me to the question why you need the hotplug state at all if the
softlockup detector is enabled. Wouldn't it make more sense to only
register the state if softlockup detector is turned off in Kconfig and
actually move it to the core code?

Thanks,

tglx




Re: [PATCH 0/2] powerpc/xive: fix CPU hot unplug

2017-10-03 Thread David Gibson
On Tue, Oct 03, 2017 at 08:24:07AM +0200, Cédric Le Goater wrote:
> On 10/03/2017 05:36 AM, David Gibson wrote:
> > On Mon, Oct 02, 2017 at 06:27:20PM +0200, Cédric Le Goater wrote:
> >> On 09/23/2017 10:26 AM, Cédric Le Goater wrote:
> >>> Hi,
> >>>
> >>> Here are a couple of small fixes to support CPU hot unplug. There are
> >>> still some issues to be investigated as, in some occasions, after a
> >>> couple of plug and unplug, the cpu which was removed receives a 'lost'
> >>> interrupt. This showed to be the decrementer under QEMU.
> >>
> >> So this seems to be a QEMU issue only which can be solved by 
> >> removing the DEE bit from the LPCR on P9 processor when the CPU 
> >> is stopped in rtas. PECE3 bit on P8 processors. 
> >>
> >> I think these patches are valuable fixes for 4.14. The first 
> >> is trivial and the second touches the common xive part but it
> >> is only called on the pseries platform.  
> >>
> >> Could you please take a look ?
> > 
> > Sorry, I think I've missed something here.
> > 
> > Is there a qemu bug involved in this?  Has there been a patch sent
> > that I didn't spot?
> 
> 
> No, not yet, but I probably will today. Something like below to stop
> the decrementer when a CPU is stopped:
> 
>   --- qemu.git.orig/hw/ppc/spapr_rtas.c
>   +++ qemu.git/hw/ppc/spapr_rtas.c
>   @@ -174,6 +174,15 @@ static void rtas_start_cpu(PowerPCCPU *c
>kvm_cpu_synchronize_state(cs);
>
>env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
>   +
>   +/* Enable DECR interrupt */
>   +if (env->mmu_model == POWERPC_MMU_3_00) {
>   +env->spr[SPR_LPCR] |= LPCR_DEE;
>   +} else {
>   +/* P7 and P8 both have same bit for DECR */
>   +env->spr[SPR_LPCR] |= LPCR_P8_PECE3;
>   +}
>   +
>env->nip = start;
>env->gpr[3] = r3;
>cs->halted = 0;
>   @@ -210,6 +219,13 @@ static void rtas_stop_self(PowerPCCPU *c
> * no need to bother with specific bits, we just clear it.
> */
>env->msr = 0;
>   +
>   +if (env->mmu_model == POWERPC_MMU_3_00) {
>   +env->spr[SPR_LPCR] &= ~LPCR_DEE;
>   +} else {
>   +/* P7 and P8 both have same bit for DECR */
>   +env->spr[SPR_LPCR] &= ~LPCR_P8_PECE3;
>   +}
>}
>
>static inline int sysparm_st(target_ulong addr, target_ulong len,
>   
> I haven't yet because I fail to understand why the decrementer is not 
> interrupting the dying CPU under xics, as is the case under XIVE.

Oh.. ok.  This sounds very similar to the problem Nikunj hit under TCG
with decrementer interrupts waking up a supposedly dead CPU.  He had a
couple of proposed fixes, but we got bogged down trying to work out
why  (with TCG at least) it only seemed to bite after a system_reset,
and not on initial boot up.

> Also I am not sure this hack is of any use :
> 
> /*
>  * While stopping a CPU, the guest calls H_CPPR which
>  * effectively disables interrupts on XICS level.
>  * However decrementer interrupts in TCG can still
>  * wake the CPU up so here we disable interrupts in MSR
>  * as well.
>  * As rtas_start_cpu() resets the whole MSR anyway, there is
>  * no need to bother with specific bits, we just clear it.
>  */
> env->msr = 0;

Ok.. why do you think this isn't of use?  I'm pretty sure this is
necessary for the TCG case, since MSR is checked in cpu_has_work(),
which could otherwise wake up the "dead" cpu.
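
For the TCG case, the check in question is roughly this (abridged from
memory of QEMU's target/ppc code; the POWER7/8/9 variants additionally
consult the LPCR PECE/DEE bits when the vCPU is halted, which is what the
rtas patch above relies on):

    static bool ppc_cpu_has_work(CPUState *cs)
    {
        PowerPCCPU *cpu = POWERPC_CPU(cs);
        CPUPPCState *env = &cpu->env;

        /* msr_ee reads MSR[EE] from env, hence a cleared MSR cannot wake us */
        return msr_ee && (cs->interrupt_request & CPU_INTERRUPT_HARD);
    }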

> and the different CPU states are confusing. Nikunj already took a look
> at this when trying to fix the TCG reboot. Anyway, the QEMU patch 
> should (re)start a thread of its own. This is not the place to discuss it.
> 
> Thanks,
> 
> C.  
> 
> 

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [patch V2 22/29] lockup_detector: Make watchdog_nmi_reconfigure() two stage

2017-10-03 Thread Thomas Gleixner
On Tue, 3 Oct 2017, Michael Ellerman wrote:
> Hi Thomas,
> Unfortunately this is hitting the WARN_ON in start_wd_cpu() on powerpc
> because we're calling it multiple times for the boot CPU.
> 
> The first call is via:
> 
>   start_wd_on_cpu+0x80/0x2f0
>   watchdog_nmi_reconfigure+0x124/0x170
>   softlockup_reconfigure_threads+0x110/0x130
>   lockup_detector_init+0xbc/0xe0
>   kernel_init_freeable+0x18c/0x37c
>   kernel_init+0x2c/0x160
>   ret_from_kernel_thread+0x5c/0xbc
> 
> And then again via the CPU hotplug registration:
> 
>   start_wd_on_cpu+0x80/0x2f0
>   cpuhp_invoke_callback+0x194/0x620
>   cpuhp_thread_fun+0x7c/0x1b0
>   smpboot_thread_fn+0x290/0x2a0
>   kthread+0x168/0x1b0
>   ret_from_kernel_thread+0x5c/0xbc
> 
> 
> The first call is new because previously watchdog_nmi_reconfigure()
> wasn't called from softlockup_reconfigure_threads().

Hmm, don't you have the same problem with CPU hotplug, or do you just get
lucky because the hotplug callback in your code is ordered vs. the
softlockup thread hotplug callback in a way that this does not hit?

> I'm not sure what the easiest fix is. One option would be to just drop
> the WARN_ON; it's just there for paranoia AFAICS.

The straightforward way is to make use of the new probe function. Patch
below.

Thanks,

tglx

8<--
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -375,20 +375,18 @@ void watchdog_nmi_start(void)
 /*
  * This runs after lockup_detector_init() which sets up watchdog_cpumask.
  */
-static int __init powerpc_watchdog_init(void)
+int __init watchdog_nmi_probe(void)
 {
int err;
 
-   watchdog_calc_timeouts();
-
-   err = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "powerpc/watchdog:online",
-   start_wd_on_cpu, stop_wd_on_cpu);
+   err = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+   "powerpc/watchdog:online",
+   start_wd_on_cpu, stop_wd_on_cpu);
if (err < 0)
pr_warn("Watchdog could not be initialized");
 
return 0;
 }
-arch_initcall(powerpc_watchdog_init);
 
 static void handle_backtrace_ipi(struct pt_regs *regs)
 {
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -608,7 +608,6 @@ static inline int watchdog_park_threads(
 static inline void watchdog_unpark_threads(void) { }
 static inline int watchdog_enable_all_cpus(void) { return 0; }
 static inline void watchdog_disable_all_cpus(void) { }
-static inline void softlockup_init_threads(void) { }
 static void softlockup_reconfigure_threads(void)
 {
cpus_read_lock();
@@ -617,6 +616,10 @@ static void softlockup_reconfigure_threa
watchdog_nmi_start();
cpus_read_unlock();
 }
+static inline void softlockup_init_threads(void)
+{
+   softlockup_reconfigure_threads();
+}
 #endif /* !CONFIG_SOFTLOCKUP_DETECTOR */
 
 static void __lockup_detector_cleanup(void)
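
For completeness: cpuhp_setup_state() immediately invokes the startup
callback for every CPU that is already online at registration time, which
is what produced the second call on the boot CPU, while
cpuhp_setup_state_nocalls() only installs the callbacks for later hotplug
transitions. On the core side this relies on watchdog_nmi_probe() being a
weak hook that the powerpc override above replaces; a minimal sketch of
that hook and its call site follows, with the exact bodies being
assumptions based on this thread rather than the final mainline code:

  /* kernel/watchdog.c (sketch; bodies assumed from this thread):
   * the weak default probes the perf based hard lockup detector,
   * and architectures such as powerpc override it, as above. */
  int __init __weak watchdog_nmi_probe(void)
  {
      return hardlockup_detector_perf_init();
  }

  void __init lockup_detector_init(void)
  {
      /* Only mark the NMI watchdog usable if the probe succeeded. */
      if (!watchdog_nmi_probe())
          nmi_watchdog_available = true;
      softlockup_init_threads();
  }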


Re: [PATCH 0/2] powerpc/xive: fix CPU hot unplug

2017-10-03 Thread Cédric Le Goater
On 10/03/2017 05:36 AM, David Gibson wrote:
> On Mon, Oct 02, 2017 at 06:27:20PM +0200, Cédric Le Goater wrote:
>> On 09/23/2017 10:26 AM, Cédric Le Goater wrote:
>>> Hi,
>>>
>>> Here are a couple of small fixes to support CPU hot unplug. There are
>>> still some issues to be investigated, as on some occasions, after a
>>> couple of plug/unplug cycles, the CPU which was removed receives a
>>> 'lost' interrupt. This turned out to be the decrementer under QEMU.
>>
>> So this seems to be a QEMU-only issue, which can be solved by clearing
>> the DEE bit in the LPCR on P9 processors when the CPU is stopped in
>> RTAS (the PECE3 bit on P8 processors).
>>
>> I think these patches are valuable fixes for 4.14. The first
>> is trivial, and the second touches the common xive part, but it
>> is only called on the pseries platform.
>>
>> Could you please take a look ?
> 
> Sorry, I think I've missed something here.
> 
> Is there a qemu bug involved in this?  Has there been a patch sent
> that I didn't spot?


No, not yet, but I probably will today. Something like the below to stop
the decrementer when a CPU is stopped:

--- qemu.git.orig/hw/ppc/spapr_rtas.c
+++ qemu.git/hw/ppc/spapr_rtas.c
@@ -174,6 +174,15 @@ static void rtas_start_cpu(PowerPCCPU *c
 kvm_cpu_synchronize_state(cs);
 
 env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
+
+/* Enable DECR interrupt */
+if (env->mmu_model == POWERPC_MMU_3_00) {
+env->spr[SPR_LPCR] |= LPCR_DEE;
+} else {
+/* P7 and P8 both have same bit for DECR */
+env->spr[SPR_LPCR] |= LPCR_P8_PECE3;
+}
+
 env->nip = start;
 env->gpr[3] = r3;
 cs->halted = 0;
@@ -210,6 +219,13 @@ static void rtas_stop_self(PowerPCCPU *c
  * no need to bother with specific bits, we just clear it.
  */
 env->msr = 0;
+
+if (env->mmu_model == POWERPC_MMU_3_00) {
+env->spr[SPR_LPCR] &= ~LPCR_DEE;
+} else {
+/* P7 and P8 both have same bit for DECR */
+env->spr[SPR_LPCR] &= ~LPCR_P8_PECE3;
+}
 }
 
 static inline int sysparm_st(target_ulong addr, target_ulong len,

I haven't yet because I fail to understand why the decrementer is not
interrupting the dying CPU under XICS, as is the case under XIVE.

Also I am not sure this hack is of any use:

/*
 * While stopping a CPU, the guest calls H_CPPR which
 * effectively disables interrupts on XICS level.
 * However decrementer interrupts in TCG can still
 * wake the CPU up so here we disable interrupts in MSR
 * as well.
 * As rtas_start_cpu() resets the whole MSR anyway, there is
 * no need to bother with specific bits, we just clear it.
 */
env->msr = 0;

and the different CPU states are confusing. Nikunj already took a look
at this when trying to fix the TCG reboot. Anyway, the QEMU patch
should (re)start its own thread; this is not the place to discuss it.

Thanks,

C.