date:20161027

[PATCH -v4 RESEND 5/9] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page

2016-10-27 Thread Huang, Ying

From: Huang Ying 

__swapcache_free() is added to support clearing the SWAP_HAS_CACHE flag
for a huge page.  For now it simply frees the specified swap cluster,
because the function is called only in the error path, to free a swap
cluster that was just allocated.  In that case every corresponding
swap_map[i] == SWAP_HAS_CACHE, that is, the swap count is 0.  This makes
the implementation simpler than that of the ordinary swap entry.

This will be used to delay splitting a THP (Transparent Huge Page)
during swap-out.  To swap out one THP, we allocate a swap cluster, add
the THP to the swap cache, then split the THP.  If anything fails after
the swap cluster is allocated and before the THP is split successfully,
swapcache_free_trans_huge() is used to free the allocated swap space.
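
The error-path cleanup described above can be modeled in user space.
This is only a minimal sketch, not the kernel code: the toy_ names, the
0x40 flag value and the 512-entry cluster size are assumptions made for
illustration, and the memcg uncharge and cluster free-list steps of the
patch are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only; the kernel defines its own. */
#define TOY_SWAP_HAS_CACHE 0x40u
#define TOY_CLUSTER_SIZE   512u

/*
 * Clear the cache flag in every slot of one cluster of a toy swap map.
 * Returns 0 on success, or -1 if any slot holds something other than the
 * bare flag (the patch treats that case as a bug via VM_BUG_ON()).
 */
static int toy_cluster_clear_has_cache(unsigned char *swap_map, size_t offset)
{
	unsigned char *map = swap_map + offset;
	size_t i;

	/* Assumption of this sketch: the entry starts on a cluster boundary. */
	assert(offset % TOY_CLUSTER_SIZE == 0);

	/* Validate first so that a failure leaves the map untouched. */
	for (i = 0; i < TOY_CLUSTER_SIZE; i++)
		if (map[i] != TOY_SWAP_HAS_CACHE)
			return -1;

	/* Swap count is 0 everywhere, so dropping the flag empties the slot. */
	for (i = 0; i < TOY_CLUSTER_SIZE; i++)
		map[i] &= (unsigned char)~TOY_SWAP_HAS_CACHE;

	return 0;
}
```

Because every slot is known to hold only the flag, no per-entry
reference counting is needed, which is the simplification the commit
message refers to.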

Cc: Andrea Arcangeli 
Cc: Kirill A. Shutemov 
Cc: Hugh Dickins 
Cc: Shaohua Li 
Cc: Minchan Kim 
Cc: Rik van Riel 
Signed-off-by: "Huang, Ying" 
---
 include/linux/swap.h |  9 +++++++--
 mm/swapfile.c        | 33 +++++++++++++++++++++++++++++++--
 2 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index cb8c1b0..b185e39 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -408,7 +408,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
+extern void __swapcache_free(swp_entry_t, bool);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -480,7 +480,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp)
+static inline void __swapcache_free(swp_entry_t swp, bool huge)
 {
 }
 
@@ -551,6 +551,11 @@ static inline swp_entry_t get_huge_swap_page(void)
 
 #endif /* CONFIG_SWAP */
 
+static inline void swapcache_free(swp_entry_t entry)
+{
+   __swapcache_free(entry, false);
+}
+
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8224150..126c789 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -732,6 +732,27 @@ static void swap_free_huge_cluster(struct swap_info_struct *si,
 	__swap_entry_free(si, offset, true);
 }
 
+/*
+ * Caller should hold si->lock.
+ */
+static void swapcache_free_trans_huge(struct swap_info_struct *si,
+				      swp_entry_t entry)
+{
+	unsigned long offset = swp_offset(entry);
+	unsigned long idx = offset / SWAPFILE_CLUSTER;
+	unsigned char *map;
+	unsigned int i;
+
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
+		map[i] &= ~SWAP_HAS_CACHE;
+	}
+	/* Cluster size is same as huge page size */
+	mem_cgroup_uncharge_swap(entry, HPAGE_PMD_NR);
+	swap_free_huge_cluster(si, idx);
+}
+
 static unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
 {
 	unsigned long idx;
@@ -758,6 +779,11 @@ static inline unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
 {
 	return 0;
 }
+
+static inline void swapcache_free_trans_huge(struct swap_info_struct *si,
+					     swp_entry_t entry)
+{
+}
 #endif
 
 swp_entry_t __get_swap_page(bool huge)
@@ -949,13 +975,16 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry)
+void __swapcache_free(swp_entry_t entry, bool huge)
 {
 	struct swap_info_struct *p;
 
 	p = swap_info_get(entry);
 	if (p) {
-		swap_entry_free(p, entry, SWAP_HAS_CACHE);
+		if (unlikely(huge))
+			swapcache_free_trans_huge(p, entry);
+		else
+			swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
-- 
2.9.3

[v14, 0/8] Fix eSDHC host version register bug

2016-10-27 Thread Yangbo Lu

This patchset fixes a host version register bug in the T4240-R1.0-R2.0
eSDHC controller. To match the SoC version and revision, the 10 previous
versions of this patchset tried several methods, all of which were
rejected by reviewers:
- dts compatible method
- syscon method
- ifdef PPC method
- GUTS driver getting SVR method
Arnd suggested a soc_device_match method in v10, and this is the only
available method left now. This v11 patchset introduces the
soc_device_match interface in the soc driver.

The first six patches from Yangbo add the GUTS driver, which registers a
soc device that contains the SoC version and revision information. The
other two patches introduce the soc_device_match method in the soc
driver and apply it to the esdhc driver to fix this bug.
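
The matching idea can be illustrated with a small user-space model. This
is a hedged sketch only: the toy_ types and functions are hypothetical,
the family and revision strings are made up for the example, and plain
string equality stands in for whatever matching the kernel interface
really performs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical description of a SoC; a NULL field in a pattern matches anything. */
struct toy_soc_attr {
	const char *family;
	const char *revision;
};

/* Return 1 if every non-NULL field of the pattern equals the SoC's field. */
static int toy_attr_matches(const struct toy_soc_attr *pattern,
			    const struct toy_soc_attr *soc)
{
	if (pattern->family && strcmp(pattern->family, soc->family) != 0)
		return 0;
	if (pattern->revision && strcmp(pattern->revision, soc->revision) != 0)
		return 0;
	return 1;
}

/*
 * Walk a table terminated by an all-NULL entry and return the first
 * pattern that matches the running SoC, or NULL if none does.
 */
static const struct toy_soc_attr *
toy_soc_match(const struct toy_soc_attr *table, const struct toy_soc_attr *soc)
{
	for (; table->family || table->revision; table++)
		if (toy_attr_matches(table, soc))
			return table;
	return NULL;
}
```

A driver could keep a short table of affected revisions and apply its
workaround only when the lookup returns a non-NULL entry, which avoids
encoding the quirk in the device tree or behind an #ifdef.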

Arnd Bergmann (1):
  base: soc: introduce soc_device_match() interface

Yangbo Lu (7):
  dt: bindings: update Freescale DCFG compatible
  ARM64: dts: ls2080a: add device configuration node
  dt: bindings: move guts devicetree doc out of powerpc directory
  powerpc/fsl: move mpc85xx.h to include/linux/fsl
  soc: fsl: add GUTS driver for QorIQ platforms
  MAINTAINERS: add entry for Freescale SoC drivers
  mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

 Documentation/devicetree/bindings/arm/fsl.txt  |   6 +-
 .../bindings/{powerpc => soc}/fsl/guts.txt     |   3 +
 MAINTAINERS                                    |  11 +-
 arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi |   6 +
 arch/powerpc/kernel/cpu_setup_fsl_booke.S      |   2 +-
 arch/powerpc/sysdev/fsl_pci.c                  |   2 +-
 drivers/base/Kconfig                           |   1 +
 drivers/base/soc.c                             |  66 ++
 drivers/clk/clk-qoriq.c                        |   3 +-
 drivers/i2c/busses/i2c-mpc.c                   |   2 +-
 drivers/iommu/fsl_pamu.c                       |   3 +-
 drivers/mmc/host/Kconfig                       |   1 +
 drivers/mmc/host/sdhci-of-esdhc.c              |  20 ++
 drivers/net/ethernet/freescale/gianfar.c       |   2 +-
 drivers/soc/Kconfig                            |   3 +-
 drivers/soc/fsl/Kconfig                        |  18 ++
 drivers/soc/fsl/Makefile                       |   1 +
 drivers/soc/fsl/guts.c                         | 238 +
 include/linux/fsl/guts.h                       | 125 ++-
 .../asm/mpc85xx.h => include/linux/fsl/svr.h   |   4 +-
 include/linux/sys_soc.h                        |   3 +
 21 files changed, 458 insertions(+), 62 deletions(-)
 rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%)
 create mode 100644 drivers/soc/fsl/Kconfig
 create mode 100644 drivers/soc/fsl/guts.c
 rename arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h (97%)

-- 
2.1.0.27.g96db324

[PATCH -v4 RESEND 8/9] mm, THP, swap: Support to split THP in swap cache

2016-10-27 Thread Huang, Ying

From: Huang Ying 

This patch enhances split_huge_page_to_list() to work properly for a
THP (Transparent Huge Page) in the swap cache during swap-out.

This is used to delay splitting the THP during swap-out.  To swap out a
THP, we allocate a swap cluster, add the THP to the swap cache, then
split the THP.  The page lock is held during this whole process, so in
any code path other than swap-out, PageSwapCache(THP) is always false
when the THP needs to be split.
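
The reference accounting behind this change can be sketched in user
space.  The following is an illustrative model, not the kernel code:
the toy_thp structure and the value 512 for the number of sub-pages are
assumptions, while the arithmetic mirrors the can_split_huge_page()
hunk in the diff below.

```c
#include <stdbool.h>

/* Assumed number of base pages per huge page, for illustration only. */
#define TOY_HPAGE_PMD_NR 512

/* Hypothetical snapshot of the state that the split check looks at. */
struct toy_thp {
	bool anon;          /* anonymous (not file-backed) page */
	bool in_swap_cache; /* already added to the swap cache */
	int mapcount;       /* total number of mappings */
	int refcount;       /* total number of references */
};

/* References held by the radix tree: one per sub-page once the THP is cached. */
static int toy_extra_pins(const struct toy_thp *page)
{
	if (page->anon)
		return page->in_swap_cache ? TOY_HPAGE_PMD_NR : 0;
	return TOY_HPAGE_PMD_NR;
}

/* The page can be split only if nobody else holds an unexpected reference. */
static bool toy_can_split(const struct toy_thp *page)
{
	return page->mapcount == page->refcount - toy_extra_pins(page) - 1;
}
```

Before this patch an anonymous THP never had radix tree pins, so the
extra term was always 0 for it; once the THP sits in the swap cache, the
check has to discount one pin per sub-page.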

Cc: Andrea Arcangeli 
Cc: Kirill A. Shutemov 
Cc: Ebru Akagunduz 
Signed-off-by: "Huang, Ying" 
---
 mm/huge_memory.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 326b145..199eaba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1831,7 +1831,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * atomic_set() here would be safe on all archs (and not only on x86),
 	 * it's safer to use atomic_inc()/atomic_add().
 	 */
-	if (PageAnon(head)) {
+	if (PageAnon(head) && !PageSwapCache(head)) {
 		page_ref_inc(page_tail);
 	} else {
 		/* Additional pin to radix tree */
@@ -1842,6 +1842,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	page_tail->flags |= (head->flags &
 			((1L << PG_referenced) |
 			 (1L << PG_swapbacked) |
+			 (1L << PG_swapcache) |
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
@@ -1904,7 +1905,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	ClearPageCompound(head);
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
-		page_ref_inc(head);
+		/* Additional pin to radix tree of swap cache */
+		if (PageSwapCache(head))
+			page_ref_add(head, 2);
+		else
+			page_ref_inc(head);
 	} else {
 		/* Additional pin to radix tree */
 		page_ref_add(head, 2);
@@ -2016,10 +2021,12 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 /* Racy check whether the huge page can be split */
 bool can_split_huge_page(struct page *page)
 {
-	int extra_pins = 0;
+	int extra_pins;
 
 	/* Additional pins from radix tree */
-	if (!PageAnon(page))
+	if (PageAnon(page))
+		extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
+	else
 		extra_pins = HPAGE_PMD_NR;
 	return total_mapcount(page) == page_count(page) - extra_pins - 1;
 }
@@ -2072,7 +2079,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			ret = -EBUSY;
 			goto out;
 		}
-		extra_pins = 0;
+		extra_pins = PageSwapCache(head) ? HPAGE_PMD_NR : 0;
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
-- 
2.9.3

[v14, 1/8] dt: bindings: update Freescale DCFG compatible

2016-10-27 Thread Yangbo Lu

Update Freescale DCFG compatible with 'fsl,<chip>-dcfg' instead
of 'fsl,ls1021a-dcfg' to include more chips such as ls1021a,
ls1043a, and ls2080a.

Signed-off-by: Yangbo Lu 
Acked-by: Rob Herring 
Signed-off-by: Scott Wood 
---
Changes for v8:
- Added this patch
Changes for v9:
- Added a list for the possible compatibles
Changes for v10:
- None
Changes for v11:
- Added 'Acked-by: Rob Herring'
- Updated commit message by Scott
Changes for v12:
- None
Changes for v13:
- None
Changes for v14:
- None
---
 Documentation/devicetree/bindings/arm/fsl.txt | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/arm/fsl.txt b/Documentation/devicetree/bindings/arm/fsl.txt
index dbbc095..713c1ae 100644
--- a/Documentation/devicetree/bindings/arm/fsl.txt
+++ b/Documentation/devicetree/bindings/arm/fsl.txt
@@ -119,7 +119,11 @@ Freescale DCFG
 configuration and status for the device. Such as setting the secondary
 core start address and release the secondary core from holdoff and startup.
   Required properties:
-  - compatible: should be "fsl,ls1021a-dcfg"
+  - compatible: should be "fsl,<chip>-dcfg"
+Possible compatibles:
+   "fsl,ls1021a-dcfg"
+   "fsl,ls1043a-dcfg"
+   "fsl,ls2080a-dcfg"
   - reg : should contain base address and length of DCFG memory-mapped registers
 
 Example:
-- 
2.1.0.27.g96db324

[PATCH -v4 RESEND 3/9] mm, THP, swap: Add swap cluster allocate/free functions

2016-10-27 Thread Huang, Ying

From: Huang Ying 

The swap cluster allocation/free functions are added on top of the
existing swap cluster management mechanism for SSDs.  These functions
don't work for rotating hard disks, because the existing swap cluster
management mechanism doesn't work for them.  Hard disk support may be
added if someone really needs it, but it needn't be included in this
patchset.

This will be used for THP (Transparent Huge Page) swap support, where
one swap cluster holds the contents of each THP swapped out.
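
The allocate/free life cycle of a cluster, including the deferred free
on discardable devices that the diff below factors into free_cluster(),
can be sketched as a user-space model.  Everything named toy_ here is
hypothetical, and a plain array scan stands in for the kernel's free
cluster list, so this illustrates the state transitions only.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CLUSTERS 4u
#define TOY_NO_CLUSTER  TOY_NR_CLUSTERS	/* sentinel: nothing free */

/* Hypothetical per-device cluster state. */
struct toy_swap_dev {
	bool is_free[TOY_NR_CLUSTERS];         /* available for allocation */
	bool discard_pending[TOY_NR_CLUSTERS]; /* waiting for a discard */
	bool discardable;                      /* discard before reuse */
};

/* Take the lowest-numbered free cluster, or TOY_NO_CLUSTER if none is left. */
static unsigned int toy_alloc_cluster(struct toy_swap_dev *dev)
{
	unsigned int idx;

	for (idx = 0; idx < TOY_NR_CLUSTERS; idx++) {
		if (dev->is_free[idx]) {
			dev->is_free[idx] = false;
			return idx;
		}
	}
	return TOY_NO_CLUSTER;
}

/* Release a cluster: queue it for discard if needed, else free it at once. */
static void toy_free_cluster(struct toy_swap_dev *dev, unsigned int idx)
{
	assert(idx < TOY_NR_CLUSTERS);
	assert(!dev->is_free[idx] && !dev->discard_pending[idx]);

	if (dev->discardable) {
		dev->discard_pending[idx] = true;
		return;
	}
	dev->is_free[idx] = true;
}

/* Finish all pending discards and return those clusters to the free pool. */
static void toy_finish_discard(struct toy_swap_dev *dev)
{
	unsigned int idx;

	for (idx = 0; idx < TOY_NR_CLUSTERS; idx++) {
		if (dev->discard_pending[idx]) {
			dev->discard_pending[idx] = false;
			dev->is_free[idx] = true;
		}
	}
}
```

A cluster released on a discardable device therefore stays unavailable
until the discard has run, which is why an allocation right after a
free may return a different cluster.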

Cc: Andrea Arcangeli 
Cc: Kirill A. Shutemov 
Cc: Hugh Dickins 
Cc: Shaohua Li 
Cc: Minchan Kim 
Cc: Rik van Riel 
Signed-off-by: "Huang, Ying" 
---
 mm/swapfile.c | 203 +-
 1 file changed, 146 insertions(+), 57 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index f3fc83f..3643049 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -326,6 +326,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	schedule_work(&si->discard_work);
 }
 
+static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
+	cluster_list_add_tail(&si->free_clusters, ci, idx);
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
  * will be added to free cluster list. caller should hold si->lock.
@@ -345,8 +353,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 				SWAPFILE_CLUSTER);
 
 		spin_lock(&si->lock);
-		cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&si->free_clusters, info, idx);
+		__free_cluster(si, idx);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 	}
@@ -363,6 +370,34 @@ static void swap_discard_work(struct work_struct *work)
 	spin_unlock(&si->lock);
 }
 
+static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
+	cluster_list_del_first(&si->free_clusters, ci);
+	cluster_set_count_flag(ci + idx, 0, 0);
+}
+
+static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+
+	VM_BUG_ON(cluster_count(ci) != 0);
+	/*
+	 * If the swap is discardable, prepare discard the cluster
+	 * instead of free it immediately. The cluster will be freed
+	 * after discard.
+	 */
+	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
+	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
+		swap_cluster_schedule_discard(si, idx);
+		return;
+	}
+
+	__free_cluster(si, idx);
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
  * removed from free cluster list and its usage counter will be increased.
@@ -374,11 +409,8 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 
 	if (!cluster_info)
 		return;
-	if (cluster_is_free(&cluster_info[idx])) {
-		VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx);
-		cluster_list_del_first(&p->free_clusters, cluster_info);
-		cluster_set_count_flag(&cluster_info[idx], 0, 0);
-	}
+	if (cluster_is_free(&cluster_info[idx]))
+		alloc_cluster(p, idx);
 
 	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
@@ -402,21 +434,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 	cluster_set_count(&cluster_info[idx],
 		cluster_count(&cluster_info[idx]) - 1);
 
-	if (cluster_count(&cluster_info[idx]) == 0) {
-		/*
-		 * If the swap is discardable, prepare discard the cluster
-		 * instead of free it immediately. The cluster will be freed
-		 * after discard.
-		 */
-		if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
-				 (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
-			swap_cluster_schedule_discard(p, idx);
-			return;
-		}
-
-		cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
-	}
+	if (cluster_count(&cluster_info[idx]) == 0)
+		free_cluster(p, idx);
 }
 
 /*
@@ -497,6 +516,69 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	*scan_base = tmp;
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static inline unsigned int huge_cluster_nr_entries(bool huge)
+{
+	return huge ? SWAPFILE_CLUSTER : 1;
+}
+#else
+#define huge_cluster_nr_entries(huge)	1
+#endif
+
+static void __swap_entry_alloc(struct

[PATCH -v4 RESEND 6/9] mm, THP, swap: Support to add/delete THP to/from swap cache

2016-10-27 Thread Huang, Ying

From: Huang Ying 

With this patch, a THP (Transparent Huge Page) can be added to or
deleted from the swap cache as a set of HPAGE_PMD_NR sub-pages.

This will be used for THP swap support, where a whole THP may be added
to or deleted from the swap cache at once.  It also batches the swap
cache operations, which reduces the number of lock acquire/release
cycles for THP swap.
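
The batched insert with rollback on failure can be sketched in user
space.  This is a minimal model under stated assumptions: a flat array
of integer ids (0 meaning empty) stands in for the radix tree, the
toy_cache_add_batch() name is hypothetical, and locking and reference
counting are left out.

```c
#include <stddef.h>

/*
 * Insert nr consecutive page ids starting at slot 'first'.
 * Returns 0 on success.  If any slot is out of range or already
 * occupied, every slot filled by this call is emptied again and -1 is
 * returned, so a failed batch leaves the cache exactly as it was.
 */
static int toy_cache_add_batch(int *slots, size_t nr_slots, size_t first,
			       const int *page_ids, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (first + i >= nr_slots || slots[first + i] != 0)
			break;
		slots[first + i] = page_ids[i];
	}
	if (i == nr)
		return 0;

	/* Undo the i insertions that succeeded, newest first. */
	while (i--)
		slots[first + i] = 0;
	return -1;
}
```

Doing all sub-page insertions in one call is what lets the real code
take the tree lock once per THP instead of once per sub-page.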

Cc: Hugh Dickins 
Cc: Shaohua Li 
Cc: Minchan Kim 
Cc: Rik van Riel 
Cc: Andrea Arcangeli 
Cc: Kirill A. Shutemov 
Signed-off-by: "Huang, Ying" 
---
 include/linux/page-flags.h |  2 +-
 mm/swap_state.c| 58 --
 2 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda..f5bcbea 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -314,7 +314,7 @@ PAGEFLAG_FALSE(HighMem)
 #endif
 
 #ifdef CONFIG_SWAP
-PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
+PAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
 #else
 PAGEFLAG_FALSE(SwapCache)
 #endif
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d3f047b..3115762 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -43,6 +43,7 @@ struct address_space swapper_spaces[MAX_SWAPFILES] = {
 };
 
 #define INC_CACHE_INFO(x)  do { swap_cache_info.x++; } while (0)
+#define ADD_CACHE_INFO(x, nr)  do { swap_cache_info.x += (nr); } while (0)
 
 static struct {
unsigned long add_total;
@@ -80,25 +81,33 @@ void show_swap_cache_info(void)
  */
 int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
-   int error;
+   int error, i, nr = hpage_nr_pages(page);
struct address_space *address_space;
+   struct page *cur_page;
+   swp_entry_t cur_entry;
 
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageSwapCache(page), page);
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
-   get_page(page);
+   page_ref_add(page, nr);
SetPageSwapCache(page);
-   set_page_private(page, entry.val);
 
address_space = swap_address_space(entry);
+   cur_page = page;
+   cur_entry.val = entry.val;
spin_lock_irq(&address_space->tree_lock);
-   error = radix_tree_insert(&address_space->page_tree,
- swp_offset(entry), page);
+   for (i = 0; i < nr; i++, cur_page++, cur_entry.val++) {
+   set_page_private(cur_page, cur_entry.val);
+   error = radix_tree_insert(&address_space->page_tree,
+ swp_offset(cur_entry), cur_page);
+   if (unlikely(error))
+   break;
+   }
if (likely(!error)) {
-   address_space->nrpages++;
-   __inc_node_page_state(page, NR_FILE_PAGES);
-   INC_CACHE_INFO(add_total);
+   address_space->nrpages += nr;
+   __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
+   ADD_CACHE_INFO(add_total, nr);
}
spin_unlock_irq(&address_space->tree_lock);
 
@@ -109,9 +118,16 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 * So add_to_swap_cache() doesn't returns -EEXIST.
 */
VM_BUG_ON(error == -EEXIST);
-   set_page_private(page, 0UL);
ClearPageSwapCache(page);
-   put_page(page);
+   set_page_private(cur_page, 0UL);
+   while (i--) {
+   cur_page--;
+   cur_entry.val--;
+   set_page_private(cur_page, 0UL);
+   radix_tree_delete(&address_space->page_tree,
+ swp_offset(cur_entry));
+   }
+   page_ref_sub(page, nr);
}
 
return error;
@@ -122,7 +138,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
 {
int error;
 
-   error = radix_tree_maybe_preload(gfp_mask);
+   error = radix_tree_maybe_preload_order(gfp_mask, compound_order(page));
if (!error) {
error = __add_to_swap_cache(page, entry);
radix_tree_preload_end();
@@ -138,6 +154,7 @@ void __delete_from_swap_cache(struct page *page)
 {
swp_entry_t entry;
struct address_space *address_space;
+   int i, nr = hpage_nr_pages(page);
 
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
@@ -145,12 +162,17 @@ void __delete_from_swap_cache(struct page *page)
 
entry.val = page_private(page);
address_space = swap_address_space(entry);
-   radix_tree_delete(&address_space->page_tree, swp_offset(entry));
-   set_page_private(page, 0);
ClearPageSwapCache(page);
-   address_space->nrpages--;
-   __dec_node_page_state(page, NR_FILE_PAGES);
-   INC_CACHE_INFO(del_total);
+   for (i = 0; i < nr; i++, entry.val++) {
+   struct page *cur_page

[PATCH -v4 RESEND 4/9] mm, THP, swap: Add get_huge_swap_page()

2016-10-27 Thread Huang, Ying

From: Huang Ying 

A variation of get_swap_page(), get_huge_swap_page(), is added to
allocate a swap cluster (HPAGE_PMD_NR swap slots) based on the swap
cluster allocation function.  A fairly simple algorithm is used: only
the first swap device in the priority list is tried when allocating the
swap cluster.  The function fails if that attempt is not successful, and
the caller falls back to allocating a single swap slot instead.  This
works well enough for normal cases.

This will be used for the THP (Transparent Huge Page) swap support,
where get_huge_swap_page() will be used to allocate one swap cluster for
each THP swapped out.

Because of the algorithm adopted, if the number of free swap clusters
differs significantly among multiple swap devices, some THPs may be
split earlier than necessary.  For example, this could be caused by a
big size difference among the swap devices.
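
The reserve-then-allocate accounting and the "first device only" rule
can be modelled in user space roughly as below.  This is a simplified,
single-threaded sketch under stated assumptions: the toy_* names, the
split of a device into free_clusters and loose_slots, and the plain
long counter (the kernel uses an atomic counter and per-device locks)
are all invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CLUSTER 512L	/* slots per cluster, HPAGE_PMD_NR on x86_64 */

/* Toy swap device: whole free clusters plus loose free slots. */
struct toy_dev {
	long free_clusters;
	long loose_slots;
};

static long toy_nr_swap_pages;	/* global count of free slots */

/* Returns the 1-based index of the device used, or 0 on failure. */
static int toy_get_swap_page(struct toy_dev *devs, int ndevs, bool huge)
{
	long nr = huge ? TOY_CLUSTER : 1;
	int i;

	if (toy_nr_swap_pages < nr)
		return 0;
	toy_nr_swap_pages -= nr;	/* reserve up front */

	for (i = 0; i < ndevs; i++) {
		struct toy_dev *d = &devs[i];

		if (huge) {
			if (d->free_clusters > 0) {
				d->free_clusters--;
				return i + 1;
			}
			break;	/* only the first device is tried */
		}
		if (d->loose_slots == 0 && d->free_clusters > 0) {
			/* break up a free cluster into loose slots */
			d->free_clusters--;
			d->loose_slots = TOY_CLUSTER;
		}
		if (d->loose_slots > 0) {
			d->loose_slots--;
			return i + 1;
		}
	}
	toy_nr_swap_pages += nr;	/* give the reservation back */
	return 0;
}
```

With a first device that has only loose slots and a second device that
still has free clusters, the huge request fails and the counter is
restored, so the caller would split the THP even though a cluster exists
elsewhere.  That is the limitation the last paragraph above describes.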

Cc: Andrea Arcangeli 
Cc: Kirill A. Shutemov 
Cc: Hugh Dickins 
Cc: Shaohua Li 
Cc: Minchan Kim 
Cc: Rik van Riel 
Signed-off-by: "Huang, Ying" 
---
 include/linux/swap.h | 24 +++-
 mm/swapfile.c| 18 --
 2 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 001b506..cb8c1b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -401,7 +401,7 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
+extern swp_entry_t __get_swap_page(bool huge);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
@@ -421,6 +421,23 @@ extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
 
+static inline swp_entry_t get_swap_page(void)
+{
+   return __get_swap_page(false);
+}
+
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static inline swp_entry_t get_huge_swap_page(void)
+{
+   return __get_swap_page(true);
+}
+#else
+static inline swp_entry_t get_huge_swap_page(void)
+{
+   return (swp_entry_t) {0};
+}
+#endif
+
 #else /* CONFIG_SWAP */
 
 #define swap_address_space(entry)  (NULL)
@@ -527,6 +544,11 @@ static inline swp_entry_t get_swap_page(void)
return entry;
 }
 
+static inline swp_entry_t get_huge_swap_page(void)
+{
+   return (swp_entry_t) {0};
+}
+
 #endif /* CONFIG_SWAP */
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3643049..8224150 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -760,14 +760,15 @@ static inline unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
 }
 #endif
 
-swp_entry_t get_swap_page(void)
+swp_entry_t __get_swap_page(bool huge)
 {
struct swap_info_struct *si, *next;
pgoff_t offset;
+   int nr_pages = huge_cluster_nr_entries(huge);
 
-   if (atomic_long_read(&nr_swap_pages) <= 0)
+   if (atomic_long_read(&nr_swap_pages) < nr_pages)
goto noswap;
-   atomic_long_dec(&nr_swap_pages);
+   atomic_long_sub(nr_pages, &nr_swap_pages);
 
spin_lock(&swap_avail_lock);
 
@@ -795,10 +796,15 @@ swp_entry_t get_swap_page(void)
}
 
/* This is called for allocating swap entry for cache */
-   offset = scan_swap_map(si, SWAP_HAS_CACHE);
+   if (likely(nr_pages == 1))
+   offset = scan_swap_map(si, SWAP_HAS_CACHE);
+   else
+   offset = swap_alloc_huge_cluster(si);
spin_unlock(&si->lock);
if (offset)
return swp_entry(si->type, offset);
+   else if (unlikely(nr_pages != 1))
+   goto fail_alloc;
pr_debug("scan_swap_map of si %d failed to find offset\n",
   si->type);
spin_lock(&swap_avail_lock);
@@ -818,8 +824,8 @@ swp_entry_t get_swap_page(void)
}
 
spin_unlock(&swap_avail_lock);
-
-   atomic_long_inc(&nr_swap_pages);
+fail_alloc:
+   atomic_long_add(nr_pages, &nr_swap_pages);
 noswap:
return (swp_entry_t) {0};
 }
-- 
2.9.3

[PATCH -v4 RESEND 7/9] mm, THP: Add can_split_huge_page()

2016-10-27 Thread Huang, Ying

From: Huang Ying 

Separate the check for whether a huge page can be split out of
split_huge_page_to_list() into its own function.  This makes it possible
to perform the check before actually splitting the THP (Transparent Huge
Page).

This will be used for delaying splitting the THP during swapping out,
where for a THP we will allocate a swap cluster, add the THP into the
swap cache, then split the THP.  To avoid unnecessary operations on an
un-splittable THP, we will check that first.
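
The check that is factored out can be illustrated with a small user
space model.  The struct and the toy_* names are hypothetical; only the
comparison itself is taken from the diff below.  Reading the final
"- 1" as the caller's own reference is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_HPAGE_PMD_NR 512	/* sub-pages per THP on x86_64 */

/* Hypothetical snapshot of the counters the check reads. */
struct toy_thp {
	int total_mapcount;	/* number of page table mappings */
	int page_count;		/* reference count of the head page */
	bool anon;		/* anonymous (not page cache) page */
};

static bool toy_can_split(const struct toy_thp *p)
{
	/* Non-anonymous THPs carry one extra pin per sub-page from the tree. */
	int extra_pins = p->anon ? 0 : TOY_HPAGE_PMD_NR;

	/* Assumed: the remaining "- 1" is the reference held by the caller. */
	return p->total_mapcount == p->page_count - extra_pins - 1;
}
```

An anonymous THP mapped once and referenced only by its mapping and the
caller passes the check; any additional reference makes it fail.  As in
the kernel, the result is only a snapshot and can change right after it
is computed.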

There is no functional change in this patch.

Cc: Andrea Arcangeli 
Cc: Kirill A. Shutemov 
Cc: Ebru Akagunduz 
Signed-off-by: "Huang, Ying" 
---
 include/linux/huge_mm.h |  7 +++
 mm/huge_memory.c| 13 -
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9b9f65d..14ffa3f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -94,6 +94,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
 extern void prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 
+bool can_split_huge_page(struct page *page);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
@@ -176,6 +177,12 @@ static inline void prep_transhuge_page(struct page *page) {}
 
 #define thp_get_unmapped_area  NULL
 
+static inline bool
+can_split_huge_page(struct page *page)
+{
+   BUILD_BUG();
+   return false;
+}
 static inline int
 split_huge_page_to_list(struct page *page, struct list_head *list)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdcd25c..326b145 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2013,6 +2013,17 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
return ret;
 }
 
+/* Racy check whether the huge page can be split */
+bool can_split_huge_page(struct page *page)
+{
+   int extra_pins = 0;
+
+   /* Additional pins from radix tree */
+   if (!PageAnon(page))
+   extra_pins = HPAGE_PMD_NR;
+   return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
 /*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
@@ -2083,7 +2094,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 * Racy check if we can split the page, before freeze_page() will
 * split PMDs
 */
-   if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
+   if (!can_split_huge_page(head)) {
ret = -EBUSY;
goto out_unlock;
}
-- 
2.9.3

[PATCH -v4 RESEND 9/9] mm, THP, swap: Delay splitting THP during swap out

2016-10-27 Thread Huang, Ying

From: Huang Ying 

In this patch, splitting the huge page is delayed from almost the first
step of swapping out to after allocating the swap space for the
THP (Transparent Huge Page) and adding the THP into the swap cache.
This reduces the acquiring/releasing of the locks used for swap cache
management.
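
The order of the steps, and the return values used by
add_to_swap_trans_huge() in the diff further down, can be summarised
with the following user space sketch.  The toy_outcome struct is a
hypothetical stand-in that records whether each kernel call would have
succeeded; it is not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome of each step, for illustration only. */
struct toy_outcome {
	bool can_split;		/* can_split_huge_page() */
	bool pmd_mapped;	/* compound_mapcount() != 0 */
	bool got_cluster;	/* get_huge_swap_page() returned an entry */
	bool charge_ok;		/* mem_cgroup_try_charge_swap() succeeded */
	bool cache_ok;		/* add_to_swap_cache() succeeded */
	bool split_ok;		/* split_huge_page_to_list() succeeded */
};

/*
 * Returns 1 if the THP went through the batched path, 0 if the caller
 * should fall back to the ordinary path, or a negative errno on failure.
 */
static int toy_add_to_swap_trans_huge(const struct toy_outcome *o)
{
	if (!o->can_split)
		return -EBUSY;
	if (!o->pmd_mapped)
		return 0;
	if (!o->got_cluster)
		return 0;
	if (!o->charge_ok)
		return -EOVERFLOW;	/* the swap cluster is freed first */
	if (!o->cache_ok)
		return 0;		/* the swap cluster is freed first */
	if (!o->split_ok)
		return -EBUSY;		/* the THP leaves the swap cache again */
	return 1;
}
```

The point of the ordering is that the split happens last, after the
cluster allocation and the single batched swap cache insertion.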

This is the first step for the THP swap support.  The plan is to delay
splitting the THP step by step and, eventually, to avoid splitting the
THP at all.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help to improve the THP swap performance.

- The THP swap space read/write will be 2M sequential IO.  This is
  particularly helpful for swap reads, which are usually 4k random
  IO.  This will help to improve the THP swap performance too.

- It will help to reduce memory fragmentation, especially when the THP
  is heavily used by the applications.  The 2M of contiguous pages will
  be freed up after the THP is swapped out.

- It will improve the THP utilization on systems with swap turned
  on, because khugepaged is quite slow to collapse normal pages into
  a THP.  After the THP is split during swapping out, it will take
  quite a long time for the normal pages to collapse back into a THP
  after being swapped in.  High THP utilization helps the efficiency of
  page based memory management too.

There are some concerns regarding THP swap in, mainly because the
possibly enlarged read/write IO size (for swap in/out) may put more
overhead on the storage device.  To deal with that, THP swap in should
be turned on only when necessary.  For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMAs with MADV_HUGEPAGE, etc.

With the patchset, the swap out throughput improved 12.1% (from 1.12GB/s
to 1.25GB/s) in the vm-scalability swap-w-seq test case with 16
processes.  The test was done on a Xeon E5 v3 system.  A RAM simulated
PMEM (persistent memory) device was used as the swap device.  To test
sequential swapping out, the test case uses 16 processes that
sequentially allocate and write to anonymous pages until the RAM and
part of the swap device are used up.
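
As a quick check of the headline figure, the relative change can be
recomputed from the raw vmstat.swap.so values in the comparison table
(1118821 and 1254241, which match 1.12GB/s and 1.25GB/s if the metric is
reported in KB/s; that unit is an assumption).  The percent_change()
helper is hypothetical and exists only for this example.

```c
#include <assert.h>

/* Relative change from base to patched, in percent. */
static double percent_change(double base, double patched)
{
	assert(base != 0.0);
	return (patched - base) / base * 100.0;
}
```

percent_change(1118821.0, 1254241.0) evaluates to roughly 12.1, which
agrees with the figure quoted above, and percent_change(308.79, 284.53)
gives roughly -7.9 for the elapsed time.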

The detailed comparison result is as follows:

            base             base+patchset
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   1118821 ±  0%     +12.1%    1254241 ±  1%  vmstat.swap.so
   2460636 ±  1%     +10.6%    2720983 ±  1%  vm-scalability.throughput
    308.79 ±  1%      -7.9%     284.53 ±  1%  vm-scalability.time.elapsed_time
      1639 ±  4%    +232.3%       5446 ±  1%  meminfo.SwapCached
      0.70 ±  3%      +8.7%       0.77 ±  5%  perf-stat.ipc
      9.82 ±  8%     -31.6%       6.72 ±  2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list

Signed-off-by: "Huang, Ying" 
---
 mm/swap_state.c | 65 ++---
 1 file changed, 62 insertions(+), 3 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3115762..b338523 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -175,12 +176,53 @@ void __delete_from_swap_cache(struct page *page)
ADD_CACHE_INFO(del_total, nr);
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+int add_to_swap_trans_huge(struct page *page, struct list_head *list)
+{
+   swp_entry_t entry;
+   int ret = 0;
+
+   /* cannot split, which may be needed during swap in, skip it */
+   if (!can_split_huge_page(page))
+   return -EBUSY;
+   /* fallback to split huge page firstly if no PMD map */
+   if (!compound_mapcount(page))
+   return 0;
+   entry = get_huge_swap_page();
+   if (!entry.val)
+   return 0;
+   if (mem_cgroup_try_charge_swap(page, entry, HPAGE_PMD_NR)) {
+   __swapcache_free(entry, true);
+   return -EOVERFLOW;
+   }
+   ret = add_to_swap_cache(page, entry,
+   __GFP_HIGH | __GFP_NOMEMALLOC|__GFP_NOWARN);
+   /* -ENOMEM radix-tree allocation failure */
+   if (ret) {
+   __swapcache_free(entry, true);
+   return 0;
+   }
+   ret = split_huge_page_to_list(page, list);
+   if (ret) {
+   delete_from_swap_cache(page);
+   return -EBUSY;
+   }
+   return 1;
+}
+#else
+static inline int add_to_swap_trans_huge(struct page *page,
+struct list_head *list)
+{
+   return 0;
+}
+#endif
+
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
  *
  * Allocate

[PATCH -v4 RESEND 1/9] mm, swap: Make swap cluster size same of THP size on x86_64

2016-10-27 Thread Huang, Ying

From: Huang Ying 

In this patch, the size of the swap cluster is changed to that of the
THP (Transparent Huge Page) on the x86_64 architecture (512).  This is
for the THP swap support on x86_64, where one swap cluster will be used
to hold the contents of each THP swapped out.  Some information about
the swapped out THP (such as the compound map count) will be recorded in
the swap_cluster_info data structure.

For other architectures which want THP swap support,
ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
the architecture.

In effect, this doubles the swap cluster size on x86_64 (from 256 to 512
swap slots), which may make it harder to find a free cluster when the
swap space becomes fragmented.  In theory, this may reduce contiguous
swap space allocation and sequential writes.  The performance tests in
0day show no regressions caused by this.
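
The size arithmetic behind this change can be written out as a small
sketch.  The shift values are the usual x86_64 ones (4 KiB base pages,
2 MiB PMD-mapped huge pages) and are assumptions of this example, as
are all TOY_* and toy_* names.

```c
#include <assert.h>

/* Assumed x86_64 values: 4 KiB base pages, 2 MiB PMD-mapped huge pages. */
#define TOY_PAGE_SHIFT	12
#define TOY_PMD_SHIFT	21

#define TOY_OLD_CLUSTER	256UL	/* SWAPFILE_CLUSTER without the patch */

/* Number of base pages in one THP, as HPAGE_PMD_NR would be. */
static unsigned long toy_hpage_pmd_nr(void)
{
	return 1UL << (TOY_PMD_SHIFT - TOY_PAGE_SHIFT);
}

/* Bytes of swap space covered by a cluster of the given slot count. */
static unsigned long toy_cluster_bytes(unsigned long slots)
{
	return slots << TOY_PAGE_SHIFT;
}

/* Growth factor of the cluster size under the patch. */
static unsigned long toy_cluster_growth(void)
{
	assert(toy_hpage_pmd_nr() % TOY_OLD_CLUSTER == 0);
	return toy_hpage_pmd_nr() / TOY_OLD_CLUSTER;
}
```

Under these assumptions a cluster grows from 256 slots (1 MiB) to 512
slots (2 MiB), so a free cluster now requires twice as much contiguous
free swap space.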

Cc: Hugh Dickins 
Cc: Shaohua Li 
Cc: Minchan Kim 
Cc: Rik van Riel 
Suggested-by: Andrew Morton 
Signed-off-by: "Huang, Ying" 
---
 arch/x86/Kconfig |  1 +
 mm/Kconfig   | 13 +
 mm/swapfile.c|  4 
 3 files changed, 18 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bada636..a8446bc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -165,6 +165,7 @@ config X86
select HAVE_STACK_VALIDATION if X86_64
select ARCH_USES_HIGH_VMA_FLAGS if X86_INTEL_MEMORY_PROTECTION_KEYS
select ARCH_HAS_PKEYS if X86_INTEL_MEMORY_PROTECTION_KEYS
+   select ARCH_USES_THP_SWAP_CLUSTER if X86_64
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/mm/Kconfig b/mm/Kconfig
index be0ee11..2da8128 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -503,6 +503,19 @@ config FRONTSWAP
 
  If unsure, say Y to enable frontswap.
 
+config ARCH_USES_THP_SWAP_CLUSTER
+   bool
+   default n
+
+config THP_SWAP_CLUSTER
+   bool
+   depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
+   default y
+   help
+ Use one swap cluster to hold the contents of the THP
+ (Transparent Huge Page) swapped out.  The size of the swap
+ cluster will be same as that of THP.
+
 config CMA
bool "Contiguous Memory Allocator"
depends on HAVE_MEMBLOCK && MMU
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2210de2..18e247b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -196,7 +196,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
}
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+#define SWAPFILE_CLUSTER   HPAGE_PMD_NR
+#else
 #define SWAPFILE_CLUSTER   256
+#endif
 #define LATENCY_LIMIT  256
 
 static inline void cluster_set_flag(struct swap_cluster_info *info,
-- 
2.9.3

[PATCH -v4 RESEND 0/9] THP swap: Delay splitting THP during swapping out

2016-10-27 Thread Huang, Ying

From: Huang Ying 

This patchset is to optimize the performance of Transparent Huge Page
(THP) swap.

Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset?  Especially [1/9], [3/9], [4/9], [5/9],
[6/9], [9/9].

Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset?  Especially [2/9], [7/9] and [8/9].

Hi, Johannes, Michal and Vladimir, I am not very confident about the
memory cgroup part, especially [2/9].  Could you help me to review it?

And for everyone, any comments are welcome!


Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
swapping pages out, even on a high-end server machine, because storage
device performance has improved faster than that of a single logical
CPU.  It seems that this trend will not change in the near future.  On
the other hand, THP is becoming more and more popular because of
increasing memory sizes.  So it becomes necessary to optimize THP swap
performance.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help improve the performance of the THP swap.

- The THP swap space read/write will be 2M sequential IO.  This is
  particularly helpful for swap reads, which are usually 4k random
  IO.  This will improve the performance of THP swap too.

- It will help reduce memory fragmentation, especially when THP is
  heavily used by applications.  The 2M of contiguous pages will be
  freed up after the THP is swapped out.

- It will improve THP utilization on systems with swap turned on,
  because khugepaged is quite slow to collapse normal pages into a
  THP.  If the THP is split during swap out, it will take quite a
  long time for the normal pages to collapse back into a THP after
  being swapped in.  High THP utilization also helps the efficiency
  of page-based memory management.

There are some concerns regarding THP swap in, mainly because the
possibly enlarged read/write IO size (for swap in/out) may put more
overhead on the storage device.  To deal with that, THP swap in should
be turned on only when necessary.  For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMAs with MADV_HUGEPAGE, etc.

This patchset is based on 10/11 head of mmotm/master.

This patchset is the first step for the THP swap support.  The plan is
to delay splitting the THP step by step, and finally to avoid splitting
the THP during swap out altogether and swap the THP out/in as a whole.

As the first step, in this patchset, splitting the huge page is
delayed from almost the first step of swapping out to after allocating
the swap space for the THP and adding the THP into the swap cache.
This will reduce lock acquiring/releasing for the locks used for the
swap cache management.
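
The ordering described above (allocate the swap cluster, add the THP to the
swap cache, split last, and undo everything if a step fails) can be sketched
as a small userspace model.  This is only an illustration: the sketch_* names
and the boolean "did this step succeed" parameters are hypothetical and do
not correspond to functions in the patchset.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_result { SKETCH_OK, SKETCH_FAILED };

/* Hypothetical bookkeeping for one huge page being swapped out. */
struct sketch_thp {
	bool has_swap_cluster;	/* a whole swap cluster is reserved */
	bool in_swap_cache;	/* the huge page is in the swap cache */
	bool split;		/* the huge page has been split */
};

/* Each *_ok flag lets the caller simulate success or failure of a step. */
static enum sketch_result sketch_add_thp_to_swap(struct sketch_thp *thp,
						 bool alloc_ok, bool cache_ok,
						 bool split_ok)
{
	assert(thp);

	if (!alloc_ok)
		return SKETCH_FAILED;		/* no free swap cluster */
	thp->has_swap_cluster = true;

	if (!cache_ok)
		goto free_cluster;
	thp->in_swap_cache = true;

	if (!split_ok)
		goto del_from_cache;
	thp->split = true;			/* the split now happens last */
	return SKETCH_OK;

del_from_cache:
	thp->in_swap_cache = false;
free_cluster:
	thp->has_swap_cluster = false;		/* error path: release the cluster */
	return SKETCH_FAILED;
}
```

In this model a failure at any step leaves neither a reserved cluster nor a
swap cache entry behind; what the caller does after a failure is left out on
purpose.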

With the patchset, the swap out throughput improves by 12.1% (from
about 1.12GB/s to about 1.25GB/s) in the vm-scalability swap-w-seq test
case with 16 processes.  The test is done on a Xeon E5 v3 system.  The
swap device used is a RAM-simulated PMEM (persistent memory) device.
To test sequential swap out, the test case uses 16 processes, which
sequentially allocate and write to anonymous pages until the RAM and
part of the swap device are used up.

The detailed comparison results are as follows:

           base             base+patchset
---------------- ---------------------------
         %stddev    %change          %stddev
             \         |                 \
   1118821 ±  0%     +12.1%    1254241 ±  1%  vmstat.swap.so
   2460636 ±  1%     +10.6%    2720983 ±  1%  vm-scalability.throughput
    308.79 ±  1%      -7.9%     284.53 ±  1%  vm-scalability.time.elapsed_time
      1639 ±  4%    +232.3%       5446 ±  1%  meminfo.SwapCached
      0.70 ±  3%      +8.7%       0.77 ±  5%  perf-stat.ipc
      9.82 ±  8%     -31.6%       6.72 ±  2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
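
For reference, the %change column is the relative difference between the two
means.  A minimal sketch of that arithmetic follows; the helper name is
hypothetical and this is not code from the 0day tooling.

```c
#include <assert.h>

/* Relative change of "patched" over "base", in percent. */
static double sketch_percent_change(double base, double patched)
{
	assert(base != 0.0);
	return (patched - base) / base * 100.0;
}
```

With the vmstat.swap.so means, (1254241 - 1118821) / 1118821 is about 0.121,
which matches the +12.1% in the table.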


From the swap out throughput numbers, we can see that, even when tested
on a RAM-simulated PMEM (persistent memory) device, the swap out
throughput can reach only about 1.1GB/s, while in the file IO test, the
sequential write throughput of an Intel P3700 SSD can reach about
1.8GB/s steadily.  And according to the following URL,

https://www-ssl.intel.com/content/www/us/en/solid-state-drives/intel-ssd-dc-family-for-pcie.html

The sequential write throughput of Intel P3608 SSD can reach about
3.0GB/s, while the random read IOPS can reach about 850k.  It is clear
that the bottleneck has moved from the disk to the kernel swap
component

Re: [RFC 0/8] Define coherent device memory node

2016-10-27 Thread Anshuman Khandual

On 10/27/2016 08:35 PM, Jerome Glisse wrote:
> On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote:
>> On 10/27/2016 10:08 AM, Anshuman Khandual wrote:
>>> On 10/26/2016 09:32 PM, Jerome Glisse wrote:
>>>> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
>>>>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
>>>>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
>>>>>>> Jerome Glisse writes:
>>>>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>>>>>>>>> Jerome Glisse writes:
>>>>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> 
> [...]
> 
>>>> In my patchset there is no policy, it is all under device driver control which
>>>> decide what range of memory is migrated and when. I think only device driver has
>>>> proper knowledge to make such decision. By coalescing data from GPU counters and
>>>> request from application made through the upper level programming API like
>>>> Cuda.
>>>>
>>>
>>> Right, I understand that. But what I pointed out here is that there are problems
>>> now migrating user mapped pages back and forth between LRU system RAM memory and
>>> non LRU device memory which is yet to be solved. Because you are proposing a non
>>> LRU based design with ZONE_DEVICE, how we are solving/working around these
>>> problems for bi-directional migration ?
>>
>> Let me elaborate on this bit more. Before non LRU migration support patch series
>> from Minchan, it was not possible to migrate non LRU pages which are generally
>> driver managed through migrate_pages interface. This was affecting the ability
>> to do compaction on platforms which have a large share of non LRU pages. That series
>> actually solved the migration problem and allowed compaction. But it still did not
>> solve the migration problem for non LRU *user mapped* pages. So if the non LRU pages
>> are mapped into a process's page table and being accessed from user space, it can
>> not be moved using migrate_pages interface.
>>
>> Minchan had a draft solution for that problem which is still hosted here. On his
>> suggestion I had tried this solution but still faced some other problems during
>> mapped pages migration. (NOTE: IIRC this was not posted in the community)
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with the following
>> branch (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53)
>>
>> As I had mentioned earlier, we intend to support all possible migrations between
>> system RAM (LRU) and device memory (Non LRU) for user space mapped pages.
>>
>> (1) System RAM (Anon mapping) --> Device memory, back and forth many times
>> (2) System RAM (File mapping) --> Device memory, back and forth many times
>
> I achieve these two objectives in HMM; I sent you the additional patches for
> file-backed page migration. I am not done working on them but they are small.

Sure, will go through them. Thanks !

>
>
>> This is not happening now with non LRU pages. Here are some of reasons but before
>> that some notes.
>>
>> * Driver initiates all the migrations
>> * Driver does the isolation of pages
>> * Driver puts the isolated pages in a linked list
>> * Driver passes the linked list to migrate_pages interface for migration
>> * IIRC isolation of non LRU pages happens through page->as->aops->isolate_page call
>> * If migration fails, call page->as->aops->putback_page to give the page back to the
>>   device driver
>>
>> 1. queue_pages_range() currently does not work with non LRU pages, needs to be fixed
>>
>> 2. After a successful migration from non LRU device memory to LRU system RAM, the non
>>    LRU will be freed back. Right now migrate_pages releases these pages to buddy, but
>>    in this situation we need the pages to be given back to the driver instead. Hence
>>    migrate_pages needs to be changed to accommodate this.
>>
>> 3. After LRU system RAM to non LRU device migration for a mapped page, will the new
>>    page (which came from device memory) be part of core MM LRU either for Anon
>>    or File mapping ?
>>
>> 4. After LRU (Anon mapped) system RAM to non LRU device migration for a mapped page,
>>    how are we going to store "address_space->address_space_operations" and "Anon VMA
>>    Chain" reverse mapping information both on the page->mapping element ?
>>
>> 5. After LRU (File mapped) system RAM to non LRU device migration for a mapped page,
>>    how are we going to store "address_space->address_space_operations" of the device
>>    driver and radix tree based reverse mapping information for the existing file
>>    mapping both on the same page->mapping element ?
>>
>> 6. IIRC, it was not possible to retain the non LRU identity (page->as->aops which will
>>    be defined inside the device driver) and the reverse mapping information (either anon
>>    or file mapping)

Re: [PATCH] arm64: defconfig: Enable DRM DU and V4L2 FCP + VSP modules

2016-10-27 Thread Simon Horman

On Thu, Oct 27, 2016 at 04:37:53PM +0900, Magnus Damm wrote:
> Hi Simon,
> 
> On Thu, Oct 27, 2016 at 4:15 PM, Simon Horman  wrote:
> > On Thu, Oct 27, 2016 at 09:08:01AM +0200, Simon Horman wrote:
> >> On Wed, Oct 26, 2016 at 02:24:22PM +0900, Magnus Damm wrote:
> >> > From: Magnus Damm 
> >> >
> >> > Extend the ARM64 defconfig to enable the DU DRM device as module
> >> > together with required dependencies of V4L2 FCP and VSP modules.
> >> >
> >> > This enables VGA output on the r8a7795 Salvator-X board.
> >> >
> >> > Signed-off-by: Magnus Damm 
> >>
> >> Thanks, I have queued this up.
> >
> > Given discussion elsewhere on enabling DU I am holding off on this for a
> > little; it is not queued up for now.
> 
> Sure, thanks for holding off the DT integration patches for r8a7796.
> Please note that as of mainline v4.9-rc2 the r8a7795 Salvator-X board
> has thanks to DU, FCP and VSP a working VGA port. So enabling those
> devices in the defconfig from now on makes sense to me.

Understood, I have queued this up.

Re: [PATCH v4 0/3] nvme power saving

2016-10-27 Thread Christoph Hellwig

On Thu, Oct 27, 2016 at 05:06:16PM -0700, Andy Lutomirski wrote:
> It looks like there is at least one NVMe disk in existence (a
> different Samsung device) that sporadically dies when APST is on.
> This device appears to also sporadically die when APST is off, but it
> lasts considerably longer before dying with APST off.

Judy, can you help Andy to find someone in Samsung to report this
to?

> So here's what I'm tempted to do:
> 
>  - For devices that report NVMe version 1.2 support, APST is on by
> default.  I hope this is safe.

It should be safe.  That being said NVMe is being driven more and more
into consumer markets so eventually we will find some device we need
to work around inevitably, but that's life.

>  - For devices that don't report NVMe 1.2 or higher but do report
> APSTA (which implies NVMe 1.1), then we can have a blacklist or a
> whitelist.  A blacklist is nicer, but a whitelist is safer.

We just had a discussion in the NVMe technical working group about
advertising features before claiming conformance to the version where
they appear.  The general consensus was that it should be safe.  I'm
thus tempted to start out with the blacklist.

>  - A sysfs and/or module control allows overriding this.
> 
>  - Implement dev_pm_qos latency control.  The chosen latency (if APST
> is enabled) will be the lesser of the dev_pm_qos setting and a module
> parameter.
> 
> How does that sound?

Great!
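
The policy agreed on above can be written down as a small decision function.
This is only a sketch of the proposal as discussed in this thread, not the
nvme driver's code: every sketch_* name is hypothetical, and it assumes that
the override wins over the defaults and that the blacklist is consulted only
for pre-1.2 controllers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sysfs/module override. */
enum sketch_apst_override {
	SKETCH_APST_AUTO,
	SKETCH_APST_FORCE_ON,
	SKETCH_APST_FORCE_OFF,
};

/* Hypothetical summary of what the controller reports. */
struct sketch_nvme_ctrl {
	unsigned int ver_major;
	unsigned int ver_minor;
	bool apsta;		/* controller advertises APST support */
	bool blacklisted;	/* hit in a (hypothetical) quirk table */
};

static bool sketch_ver_at_least(const struct sketch_nvme_ctrl *c,
				unsigned int major, unsigned int minor)
{
	return c->ver_major > major ||
	       (c->ver_major == major && c->ver_minor >= minor);
}

/* Decide whether APST should be enabled for this controller. */
static bool sketch_apst_enabled(const struct sketch_nvme_ctrl *c,
				enum sketch_apst_override ov)
{
	assert(c);

	if (!c->apsta)
		return false;			/* nothing to enable */
	if (ov != SKETCH_APST_AUTO)
		return ov == SKETCH_APST_FORCE_ON;
	if (sketch_ver_at_least(c, 1, 2))
		return true;			/* NVMe 1.2+: on by default */
	return !c->blacklisted;			/* pre-1.2 with APSTA: blacklist */
}

/* Chosen latency: the lesser of the dev_pm_qos value and the module parameter. */
static uint64_t sketch_apst_max_latency_us(uint64_t qos_us, uint64_t param_us)
{
	return qos_us < param_us ? qos_us : param_us;
}
```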

Re: [RFC PATCH] usb: core: correct usb_get_dev() documentation

2016-10-27 Thread Peter Chen

On Thu, Oct 27, 2016 at 04:49:18PM -0700, Dmitry Torokhov wrote:
> On Thu, Oct 27, 2016 at 03:02:30PM -0700, Brian Norris wrote:
> > In reading through a USB interface driver, I noticed that it called
> > usb_{get,put}_dev() in its probe() and disconnect() methods. This seemed
> > unnecessary, but a look at the comments here matched the usage.
> > 
> > USB interface devices seem to be well covered by the parent/child
> > relationship of the device model, and so it should be unnecessary for a
> > child device to grab a refcount on its parent device.
> > 
> > Signed-off-by: Brian Norris 
> 
> Yes, usb_device is parent of usb_interface and device core does "parent
> = get_device(dev->parent);" as part of device_add() when registering new
> interfaces.
> 
> Reviewed-by: Dmitry Torokhov 
> 

Yes, the current code seems a little messy with regard to {get,put}_device.
E.g., for a USB device, it calls get_device again in usb_set_configuration
when creating its child device (the interface device).
For a USB interface device, {get,put}_device is handled in message.c for
the common interface code, so it seems unnecessary to call
usb_{get,put}_dev again in each individual interface driver.

Peter
> > ---
> > This reflects my understanding (and testing), as well as the majority of usage
> > -- there are *very* few interface drivers that actually call usb_get_dev(). If
> > I'm wrong, please feel free to tell me so! But I thought patching the
> > documentation would be the best way to solicit a response :)
> > 
> >  drivers/usb/core/usb.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/usb/core/usb.c b/drivers/usb/core/usb.c
> > index 592151461017..0ba7e070f04e 100644
> > --- a/drivers/usb/core/usb.c
> > +++ b/drivers/usb/core/usb.c
> > @@ -539,9 +539,9 @@ EXPORT_SYMBOL_GPL(usb_alloc_dev);
> >   *
> >   * Each live reference to a device should be refcounted.
> >   *
> > - * Drivers for USB interfaces should normally record such references in
> > - * their probe() methods, when they bind to an interface, and release
> > - * them by calling usb_put_dev(), in their disconnect() methods.
> > + * The device driver core automatically handles this refcounting for USB
> > + * interface drivers, but this API can be used for non-parent/child
> > + * relationships.
> >   *
> >   * Return: A pointer to the device with the incremented reference counter.
> >   */
> > -- 
> > 2.8.0.rc3.226.g39d4020
> > 
> 
> -- 
> Dmitry

-- 

Best Regards,
Peter Chen
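
The parent/child refcounting discussed in this thread can be modelled in a
few lines of userspace C.  This is a simplified sketch, not the driver core:
the sketch_* names are hypothetical, and it only models the point quoted
above, that registering a child takes a reference on its parent and
unregistering the child drops it again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct device. */
struct sketch_device {
	int refcount;
	struct sketch_device *parent;
};

static struct sketch_device *sketch_get_device(struct sketch_device *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

static void sketch_put_device(struct sketch_device *dev)
{
	if (dev) {
		assert(dev->refcount > 0);
		dev->refcount--;
	}
}

/* Models device_add(): registering a child pins its parent. */
static void sketch_device_add(struct sketch_device *child,
			      struct sketch_device *parent)
{
	child->refcount = 1;
	child->parent = sketch_get_device(parent);
}

/* Models device_del(): the reference on the parent is dropped again. */
static void sketch_device_del(struct sketch_device *child)
{
	sketch_put_device(child->parent);
	child->parent = NULL;
}
```

In this model the USB device stays pinned for as long as the interface is
registered, so an interface driver that took an extra reference in probe()
and released it in disconnect() would only add a redundant get/put pair.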

Re: drivers/base/power/opp/of.c:181:6: error: redefinition of 'dev_pm_opp_of_remove_table'

2016-10-27 Thread Viresh Kumar

On 28-10-16, 12:07, Fengguang Wu wrote:
> On Fri, Oct 28, 2016 at 09:27:53AM +0530, Viresh Kumar wrote:
> >On 28-10-16, 07:22, kbuild test robot wrote:
> >>tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >>head:   e3300ffef0653774f1099cab153d25d24bd773ce
> >>commit: f47b72a15a9679dd4dc1af681d4d2f1ca2815552 PM / OPP: Move CONFIG_OF dependent code in a separate file
> >>date:   6 months ago
> >
> >Why are we picking it up now ?
> 
> Sorry, due to problems in the 0day infrastructure, a few errors were
> missed in May. Now we catch them when the commit goes mainline.
> 
> https://lists.01.org/pipermail/kbuild-all/
> 
> June 2016:... [ Gzip'd Text 853 KB ]
> May 2016: ... [ Gzip'd Text 294 KB ]
> April 2016:   ... [ Gzip'd Text 599 KB ]
> 
> As you can see, the report volumes are noticeably lower in "May 2016".

No issues :)

So I will just ignore this email now as things are probably stable
right now.

-- 
viresh

Re: [RFC PATCH] usb: core: correct usb_get_dev() documentation

2016-10-27 Thread Peter Chen

On Thu, Oct 27, 2016 at 04:49:18PM -0700, Dmitry Torokhov wrote:
> On Thu, Oct 27, 2016 at 03:02:30PM -0700, Brian Norris wrote:
> > In reading through a USB interface driver, I noticed that it called
> > usb_{get,put}_dev() in its probe() and disconnect() methods. This seemed
> > unnecessary, but a look at the comments here matched the usage.
> > 
> > USB interface devices seem to be well covered by the parent/child
> > relationship of the device model, and so it should be unnecessary for a
> > child device to grab a refcount on its parent device.
> > 
> > Signed-off-by: Brian Norris 
> 
> Yes, usb_device is parent of usb_interface and device core does "parent
> = get_device(dev->parent);" as part of device_add() when registering new
> interfaces.
> 
> Reviewed-by: Dmitry Torokhov 
> 

Yes, the current code seems a little messy with respect to get_device()/put_device().
E.g., for a USB device, usb_set_configuration() calls get_device() again
when it creates the child (interface) devices.
For a USB interface device, get_device()/put_device() are already handled in
message.c for all interfaces, so it does not seem necessary to call
usb_get_dev()/usb_put_dev() again in an individual interface driver.
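
The parent/child refcounting described above can be sketched in plain
userspace C. This is only a toy model, not the driver core: struct
toy_device and the toy_*() helpers are hypothetical names, and the only
behaviour taken from the thread is that registering a child takes a
reference on its parent (as device_add() does via get_device(dev->parent)).

```c
#include <stddef.h>

/* Hypothetical stand-in for struct device: a parent pointer and a refcount. */
struct toy_device {
	struct toy_device *parent;
	int refcount;
};

/* Take a reference; returns the device so calls can be chained. */
static struct toy_device *toy_get(struct toy_device *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

/* Drop a reference. */
static void toy_put(struct toy_device *dev)
{
	if (dev)
		dev->refcount--;
}

/* Registering a child pins its parent for as long as the child exists. */
static void toy_add(struct toy_device *child, struct toy_device *parent)
{
	child->parent = toy_get(parent);
	child->refcount = 1;
}

/* Removing the child drops the reference taken in toy_add(). */
static void toy_del(struct toy_device *child)
{
	toy_put(child->parent);
	child->parent = NULL;
	child->refcount = 0;
}
```

In this model a driver bound to the child never needs its own toy_get()
on the parent, because the parent cannot reach a refcount of zero while
the child is registered.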

Peter
> > ---
> > This reflects my understanding (and testing), as well as the majority of
> > usage -- there are *very* few interface drivers that actually call
> > usb_get_dev(). If I'm wrong, please feel free to tell me so! But I
> > thought patching the documentation would be the best way to solicit a
> > response :)
> > 
> >  drivers/usb/core/usb.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/usb/core/usb.c b/drivers/usb/core/usb.c
> > index 592151461017..0ba7e070f04e 100644
> > --- a/drivers/usb/core/usb.c
> > +++ b/drivers/usb/core/usb.c
> > @@ -539,9 +539,9 @@ EXPORT_SYMBOL_GPL(usb_alloc_dev);
> >   *
> >   * Each live reference to a device should be refcounted.
> >   *
> > - * Drivers for USB interfaces should normally record such references in
> > - * their probe() methods, when they bind to an interface, and release
> > - * them by calling usb_put_dev(), in their disconnect() methods.
> > + * The device driver core automatically handles this refcounting for USB
> > + * interface drivers, but this API can be used for non-parent/child
> > + * relationships.
> >   *
> >   * Return: A pointer to the device with the incremented reference counter.
> >   */
> > -- 
> > 2.8.0.rc3.226.g39d4020
> > 
> 
> -- 
> Dmitry
> --
> To unsubscribe from this list: send the line "unsubscribe linux-usb" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 

Best Regards,
Peter Chen

Re: [REVIEW][PATCH v2] mm: Add a user_ns owner to mm_struct and fix ptrace permission checks

2016-10-27 Thread Eric W. Biederman

ebied...@xmission.com (Eric W. Biederman) writes:

> Cyrill Gorcunov <gorcu...@gmail.com> writes:
>
>> On Fri, Oct 28, 2016 at 12:39:18AM +0300, Cyrill Gorcunov wrote:
>>> On Thu, Oct 27, 2016 at 10:54:34AM -0500, Eric W. Biederman wrote:
>>> > 
>>> > 
>>> > I can't imagine either of these changes making a practical difference
>>> > to anyone but I am calling them out in case someone can.
>>> > 
>>> >  include/linux/mm_types.h |  1 +
>>> >  kernel/fork.c            |  9 ++---
>>> >  kernel/ptrace.c          | 26 +++---
>>> >  mm/init-mm.c             |  2 ++
>>> >  4 files changed, 20 insertions(+), 18 deletions(-)
>>> 
>>> Thanks a lot, Eric! And really sorry for the delay in responding,
>>> I managed to miss this mail, which is quite important to me, in a mail
>>> storm. Going to test it and will write you the results. Overall it looks
>>> great, but better to be sure and run the tests.
>>> 
>>> Reviewed-by: Cyrill Gorcunov <gorcu...@openvz.org>
>>
>> Eric, which kernel is the patch on top of?
>> It doesn't apply on linux-next for some reason.
>>
>>  | Date:   Thu Oct 27 14:21:59 2016 +1100
>>  | 
>>  | Add linux-next specific files for 20161027
>>  | 
>>  | Signed-off-by: Stephen Rothwell <s...@canb.auug.org.au>
>>
>> I applied it on Linus' master and tests passed fine
>> (but they were passing fine even without the patch,
>>  only linux-next failed).
>
> Odd.  I don't think I have taken the old version out of
> linux-next yet.   So you can probably revert the old version out of
> linux-next and apply this one.  All of my development at this point is
> against v4.9-rc1.
>
> I suspect you will find my last version on top of v4.9-rc1 will
> pass.  Since my tree is only one deep and I don't think anyone except
> linux-next is based on it, I plan to drop and re-add this patch.
> Especially since it is a candidate for backporting.

Mind if I add your tested-by?

To see Linus's tree fail with my patch you can apply the patch below.
That is the essence of what I changed to fix things.  Just ignoring
dumpable when an mm exists.

Eric

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 44a25a1e6e83..b53983ee3f03 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -272,7 +272,7 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 ok:
 	rcu_read_unlock();
 	mm = task->mm;
-	if (mm &&
+	if (!mm ||
 	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
 	     !ptrace_has_cap(mm->user_ns, mode)))
 		return -EPERM;
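
To make the effect of the one-line diff above easier to see, here is a
userspace sketch of the two predicates. It is an illustration only:
struct toy_mm, its has_cap field (standing in for the result of
ptrace_has_cap(mm->user_ns, mode)) and both function names are
hypothetical; TOY_SUID_DUMP_USER is assumed to mirror SUID_DUMP_USER.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SUID_DUMP_USER 1	/* assumed to mirror SUID_DUMP_USER */

/* Hypothetical stand-in for the parts of mm_struct the check looks at. */
struct toy_mm {
	int dumpable;	/* what get_dumpable(mm) would return */
	bool has_cap;	/* what ptrace_has_cap(mm->user_ns, mode) would return */
};

/* The "-" side of the diff: a task without an mm skips the dumpable test. */
static int check_mm_and(const struct toy_mm *mm)
{
	if (mm && (mm->dumpable != TOY_SUID_DUMP_USER && !mm->has_cap))
		return -EPERM;
	return 0;
}

/* The "+" side of the diff: a task without an mm is refused outright. */
static int check_not_mm_or(const struct toy_mm *mm)
{
	if (!mm || (mm->dumpable != TOY_SUID_DUMP_USER && !mm->has_cap))
		return -EPERM;
	return 0;
}
```

The two versions agree whenever an mm exists and differ only for a NULL
mm, which is the case the test patch is meant to exercise.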

Re: [PATCH v2 3/4] input: Deprecate real timestamps beyond year 2106

2016-10-27 Thread Peter Hutterer

On Thu, Oct 27, 2016 at 03:24:55PM -0700, Deepa Dinamani wrote:
> On Wed, Oct 26, 2016 at 7:56 PM, Peter Hutterer wrote:
> > On Mon, Oct 17, 2016 at 08:27:32PM -0700, Deepa Dinamani wrote:
> >> struct timeval is not y2038 safe.
> >> All usage of timeval in the kernel will be replaced by
> >> y2038 safe structures.
> >>
> >> struct input_event maintains time for each input event.
> >> Real time timestamps are not ideal for input as this
> >> time can go backwards as noted in the patch a80b83b7b8
> >> by John Stultz. Hence, having the input_event.time fields
> >> only big enough for monotonic and boot times is
> >> sufficient.
> >>
> >> Leave the original input_event as is. This is to maintain
> >> backward compatibility with existing userspace interfaces
> >> that use input_event.
> >> Introduce a new replacement struct raw_input_event.
> >
> > general comment here - please don't name it "raw_input_event".
> > First, when you grep for input_event you want the new ones to show up too,
> > so a struct input_event_raw would be better here. That also has better
> > namespacing in general. Second though: the event isn't any more "raw" than
> > the previous we had.
> >
> > I can't think of anything better than struct input_event_v2 though.
> 
> The general idea was to leave the original struct input_event as a
> common interface for userspace (as it cannot be deleted).
> So reading raw data unformatted by the userspace will have the new
> struct raw_input_event format.
> This was the reason for the "raw" in the name.
> 
> struct input_event_v2 is fine too, if that is preferred.
> 
> >> This replaces timeval with struct input_timeval. This structure
> >> maintains time in __kernel_ulong_t or compat_ulong_t to allow
> >> for architectures to override types as in the case of x32.
> >>
> >> The change requires any userspace utilities reading or writing
> >> from event nodes to update their reading format to match
> >> raw_input_event. The changes to the popular libraries will be
> >> posted along with the kernel changes.
> >> The driver version is also updated to reflect the change in
> >> event format.
> >
> > Doesn't this break *all* of userspace then? I don't see anything to
> > negotiate the type of input event the kernel gives me. And nothing right now
> > checks for EVDEV_VERSION, so they all just assume it's a struct
> > input_event. Best case, if the available events aren't a multiple of
> > sizeof(struct input_event) userspace will bomb out, but unless that happens,
> > everyone will just happily read old-style events.
> >
> > So we need some negotiation what is acceptable. Which also needs to address
> > the race conditions we're going to get when events start coming in before
> > the client has announced that it supports the new-style events.
> 
> No, this does not break any userspace right now.
> Both struct input_event and struct raw_input_event are exactly the same today.

oh, right, the ABI is the same. I see that now, thanks.

> This will be the case until a 2038-safe glibc is used with a 64-bit
> time_t flag.
> 
> So these are the scenarios:
> 1. old kernel driver + new userspace
>   -- should still be ok until 2038. Version checks could help discover these
> 2. new kernel driver + old userspace (not recompiled with a new 2038-safe glibc)
>   -- works because the format is really the same.
> 
> The patch I posted to libevdev checks this driver version.

btw, where did you post the libevdev patch? I haven't seen it anywhere I'm
subscribed to.

> And, hence any library that results in a call to libevdev_set_fd()
> will fail if it is not this updated driver.

without having seen the libevdev patch - that sounds like a bad idea. there
are plenty of use cases where libevdev_set_fd() is called but timestamps in
events just don't matter. So we may need some more negotiation between
the library user, libevdev and the kernel.

Cheers,
   Peter

> We could just do a similar check in every library also.
> I think the latter would be better.
> 
> So, the kernel patches can go in as a no-op right now and then I can
> add version checks to respective user space libraries.

Re: [v13, 5/8] soc: fsl: add GUTS driver for QorIQ platforms

2016-10-27 Thread Scott Wood

On Fri, 2016-10-28 at 11:32 +0800, Yangbo Lu wrote:
> +	guts->regs = of_iomap(np, 0);
> +	if (!guts->regs)
> +		return -ENOMEM;
> +
> +	/* Register soc device */
> +	machine = of_flat_dt_get_machine_name();
> +	if (machine)
> +		soc_dev_attr.machine = devm_kstrdup(dev, machine, GFP_KERNEL);
> +
> +	svr = fsl_guts_get_svr();
> +	soc_die = fsl_soc_die_match(svr, fsl_soc_die);
> +	if (soc_die) {
> +		soc_dev_attr.family = devm_kasprintf(dev, GFP_KERNEL,
> +						     "QorIQ %s", soc_die->die);
> +	} else {
> +		soc_dev_attr.family = devm_kasprintf(dev, GFP_KERNEL, "QorIQ");
> +	}
> +	soc_dev_attr.soc_id = devm_kasprintf(dev, GFP_KERNEL,
> +					     "svr:0x%08x", svr);
> +	soc_dev_attr.revision = devm_kasprintf(dev, GFP_KERNEL, "%d.%d",
> +					       SVR_MAJ(svr), SVR_MIN(svr));
> +
> +	soc_dev = soc_device_register(&soc_dev_attr);
> +	if (IS_ERR(soc_dev))
> +		return PTR_ERR(soc_dev);

ioremap leaks on this error path.  Use devm_ioremap_resource().
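
The idea behind the devm_*() suggestion can be illustrated with a small
userspace sketch: each managed resource registers its own release
action, so an early return on error cannot leak it. Everything here is
a hypothetical model (struct toy_dev, toy_add_action(),
toy_release_all(), toy_managed_alloc()), with malloc()/free() standing
in for the mapping; it is not how the kernel's devres code is written.

```c
#include <stdlib.h>

#define TOY_MAX_ACTIONS 8

/* Hypothetical stand-in for a device that tracks managed resources. */
struct toy_dev {
	void (*release[TOY_MAX_ACTIONS])(void *);
	void *data[TOY_MAX_ACTIONS];
	int count;
};

/* Record a cleanup action; returns 0 on success, -1 if the table is full. */
static int toy_add_action(struct toy_dev *dev, void (*fn)(void *), void *data)
{
	if (dev->count == TOY_MAX_ACTIONS)
		return -1;
	dev->release[dev->count] = fn;
	dev->data[dev->count] = data;
	dev->count++;
	return 0;
}

/* Run the recorded actions in reverse order, e.g. after a failed probe. */
static void toy_release_all(struct toy_dev *dev)
{
	while (dev->count > 0) {
		dev->count--;
		dev->release[dev->count](dev->data[dev->count]);
	}
}

/* Managed allocation: the caller never frees the result itself. */
static void *toy_managed_alloc(struct toy_dev *dev, size_t size)
{
	void *p = malloc(size);

	if (p && toy_add_action(dev, free, p) != 0) {
		free(p);
		return NULL;
	}
	return p;
}
```

A probe routine written against this model can return an error at any
point and rely on a single toy_release_all() call to undo whatever was
acquired so far.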

-Scott

[PATCH RESEND] mpt3sas: Fix for block device of raid exists even after deleting raid disk

2016-10-27 Thread Sreekanth Reddy

While merging the mpt3sas and mpt2sas code, we posted the patch below for
WarpDrive support:

mpt3sas: Ported WarpDrive product SSS6200 support
commit id is 7786ab6aff

In the hunk below from that patch, we added the is_warpdrive check
on the wrong line:
---
 scsih_target_alloc(struct scsi_target *starget)
 			sas_target_priv_data->handle = raid_device->handle;
 			sas_target_priv_data->sas_address = raid_device->wwid;
 			sas_target_priv_data->flags |= MPT_TARGET_FLAGS_VOLUME;
-			raid_device->starget = starget;
+			sas_target_priv_data->raid_device = raid_device;
+			if (ioc->is_warpdrive)
+				raid_device->starget = starget;
 		}
 		spin_unlock_irqrestore(&ioc->raid_device_lock, flags);
 		return 0;
--

The check should instead guard the line below:
 sas_target_priv_data->raid_device = raid_device;

Because of the above hunk, we do not initialize raid_device's starget for
RAID volumes, so during RAID disk deletion the driver does not call
scsi_remove_target(), because it sees the starget field of the raid_device
structure as NULL.

Signed-off-by: Sreekanth Reddy 
Cc: 
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 981be7b..618c9df8 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -1279,9 +1279,9 @@ scsih_target_alloc(struct scsi_target *starget)
 			sas_target_priv_data->handle = raid_device->handle;
 			sas_target_priv_data->sas_address = raid_device->wwid;
 			sas_target_priv_data->flags |= MPT_TARGET_FLAGS_VOLUME;
-			sas_target_priv_data->raid_device = raid_device;
 			if (ioc->is_warpdrive)
-				raid_device->starget = starget;
+				sas_target_priv_data->raid_device = raid_device;
+			raid_device->starget = starget;
 		}
 		spin_unlock_irqrestore(&ioc->raid_device_lock, flags);
 		return 0;
-- 
2.4.3
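
The behavioural difference between the two hunks in this patch can be
modelled in a few lines of userspace C. This is a sketch under stated
assumptions: the toy_* structs and the three functions are hypothetical
names, and would_remove_target() only stands in for the driver deciding
whether to call scsi_remove_target() based on a non-NULL starget.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures involved. */
struct toy_target { int id; };
struct toy_raid_device { struct toy_target *starget; };
struct toy_priv { struct toy_raid_device *raid_device; };

/* Assignment order before the fix: starget is set only for WarpDrive. */
static void alloc_before_fix(bool is_warpdrive, struct toy_priv *priv,
			     struct toy_raid_device *raid,
			     struct toy_target *starget)
{
	priv->raid_device = raid;
	if (is_warpdrive)
		raid->starget = starget;
}

/* Assignment order after the fix: starget is always recorded. */
static void alloc_after_fix(bool is_warpdrive, struct toy_priv *priv,
			    struct toy_raid_device *raid,
			    struct toy_target *starget)
{
	if (is_warpdrive)
		priv->raid_device = raid;
	raid->starget = starget;
}

/* Deletion path: the target is removed only if starget was recorded. */
static bool would_remove_target(const struct toy_raid_device *raid)
{
	return raid->starget != NULL;
}
```

For a non-WarpDrive volume, the first variant leaves starget NULL and
the removal is skipped, which matches the stale block device described
in the commit message.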

[v13, 8/8] mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

2016-10-27 Thread Yangbo Lu

The eSDHC of T4240-R1.0-R2.0 has an incorrect vendor version and spec version.
Actually the right version numbers should be VVN=0x13 and SVN=0x1.
This patch adds GUTS driver support to the eSDHC driver so that it can match
the SoC, and fixes the host version so that the incorrect version numbers do
not break ADMA data transfer.

Signed-off-by: Yangbo Lu 
Acked-by: Ulf Hansson 
Acked-by: Scott Wood 
---
Changes for v2:
- Got SVR through iomap instead of dts
Changes for v3:
- Managed GUTS through syscon instead of iomap in eSDHC driver
Changes for v4:
- Got SVR by GUTS driver instead of SYSCON
Changes for v5:
- Changed to get SVR through API fsl_guts_get_svr()
- Combined patch 4, patch 5 and patch 6 into one
Changes for v6:
- Added 'Acked-by: Ulf Hansson'
Changes for v7:
- None
Changes for v8:
- Added 'Acked-by: Scott Wood'
Changes for v9:
- None
Changes for v10:
- None
Changes for v11:
- Changed to use soc_device_match
Changes for v12:
- Matched soc through .family field instead of .soc_id
Changes for v13:
- None
---
 drivers/mmc/host/Kconfig  |  1 +
 drivers/mmc/host/sdhci-of-esdhc.c | 20 
 2 files changed, 21 insertions(+)

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 5274f50..a1135a9 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -144,6 +144,7 @@ config MMC_SDHCI_OF_ESDHC
 	depends on MMC_SDHCI_PLTFM
 	depends on PPC || ARCH_MXC || ARCH_LAYERSCAPE
 	select MMC_SDHCI_IO_ACCESSORS
+	select FSL_GUTS
 	help
 	  This selects the Freescale eSDHC controller support.
 
diff --git a/drivers/mmc/host/sdhci-of-esdhc.c b/drivers/mmc/host/sdhci-of-esdhc.c
index fb71c86..57bdb9e 100644
--- a/drivers/mmc/host/sdhci-of-esdhc.c
+++ b/drivers/mmc/host/sdhci-of-esdhc.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "sdhci-pltfm.h"
 #include "sdhci-esdhc.h"
@@ -28,6 +29,7 @@
 struct sdhci_esdhc {
 	u8 vendor_ver;
 	u8 spec_ver;
+	bool quirk_incorrect_hostver;
 };
 
 /**
@@ -73,6 +75,8 @@ static u32 esdhc_readl_fixup(struct sdhci_host *host,
 static u16 esdhc_readw_fixup(struct sdhci_host *host,
 				     int spec_reg, u32 value)
 {
+	struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host);
+	struct sdhci_esdhc *esdhc = sdhci_pltfm_priv(pltfm_host);
 	u16 ret;
 	int shift = (spec_reg & 0x2) * 8;
 
@@ -80,6 +84,12 @@ static u16 esdhc_readw_fixup(struct sdhci_host *host,
 		ret = value & 0xffff;
 	else
 		ret = (value >> shift) & 0xffff;
+	/* Workaround for T4240-R1.0-R2.0 eSDHC which has incorrect
+	 * vendor version and spec version information.
+	 */
+	if ((spec_reg == SDHCI_HOST_VERSION) &&
+	    (esdhc->quirk_incorrect_hostver))
+		ret = (VENDOR_V_23 << SDHCI_VENDOR_VER_SHIFT) | SDHCI_SPEC_200;
 	return ret;
 }
 
@@ -558,6 +568,12 @@ static const struct sdhci_pltfm_data sdhci_esdhc_le_pdata = {
 	.ops = &sdhci_esdhc_le_ops,
 };
 
+static struct soc_device_attribute soc_incorrect_hostver[] = {
+	{ .family = "QorIQ T4240", .revision = "1.0", },
+	{ .family = "QorIQ T4240", .revision = "2.0", },
+	{ },
+};
+
 static void esdhc_init(struct platform_device *pdev, struct sdhci_host *host)
 {
 	struct sdhci_pltfm_host *pltfm_host;
@@ -571,6 +587,10 @@ static void esdhc_init(struct platform_device *pdev, struct sdhci_host *host)
 	esdhc->vendor_ver = (host_ver & SDHCI_VENDOR_VER_MASK) >>
 			     SDHCI_VENDOR_VER_SHIFT;
 	esdhc->spec_ver = host_ver & SDHCI_SPEC_VER_MASK;
+	if (soc_device_match(soc_incorrect_hostver))
+		esdhc->quirk_incorrect_hostver = true;
+	else
+		esdhc->quirk_incorrect_hostver = false;
 }
 
 static int sdhci_esdhc_probe(struct platform_device *pdev)
-- 
2.1.0.27.g96db324
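
As a worked example of the workaround in esdhc_readw_fixup(), the
userspace sketch below recomputes the value the quirk returns. The TOY_*
constants are assumptions meant to mirror the kernel headers (host
version register at 0xFE, vendor version in the upper byte, VVN=0x13 and
SVN=0x1 as stated in the commit message), the 16-bit masks are assumed
to be 0xffff, and toy_readw_fixup() is a hypothetical stand-in that
takes the quirk flag as a parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror SDHCI_HOST_VERSION and the version field layout. */
#define TOY_HOST_VERSION	0xFE
#define TOY_VENDOR_VER_SHIFT	8
#define TOY_SPEC_200		1	/* SVN = 0x1 */
#define TOY_VENDOR_V_23		0x13	/* VVN = 0x13 */

/*
 * Pick the 16-bit register out of a 32-bit read, then override the
 * host version when the quirk is set.
 */
static uint16_t toy_readw_fixup(int spec_reg, uint32_t value, bool quirk)
{
	int shift = (spec_reg & 0x2) * 8;
	uint16_t ret;

	if (spec_reg == TOY_HOST_VERSION)
		ret = value & 0xffff;
	else
		ret = (value >> shift) & 0xffff;

	if (spec_reg == TOY_HOST_VERSION && quirk)
		ret = (TOY_VENDOR_V_23 << TOY_VENDOR_VER_SHIFT) | TOY_SPEC_200;
	return ret;
}
```

With the quirk set, a read of the host version register yields 0x1301
regardless of what the hardware reported; other registers are passed
through unchanged.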

Re: [PATCH v2 1/4] uinput: Add ioctl for using monotonic/ boot times

2016-10-27 Thread Peter Hutterer

On Thu, Oct 27, 2016 at 01:39:30PM -0700, Deepa Dinamani wrote:
> > hmm, I'm a bit confused here. This is an in-kernel bit only (passing the
> > time through uinput events has no effect). So why do we need an ioctl here?
> > it's an in-kernel decision only anyway and the time in the events sent to
> > the evdev client should be dictated by what that client sets for the clock
> > type, right?
> 
> This is for input events queued by the uinput driver for the virtual
> input device.

oh, right. I thought this was in the path for uinput_write(). sorry about
that.

> This can be read through uinput_read() fops.
> I don't think anybody is doing a read on uinput nodes, so another
> option (Arnd and I considered this) could be not supporting reads on
> these nodes at all.
> 
> This is not related to evdev events in the kernel.
> Currently, this timestamp could be the same format as the evdev
> timestamps or not.

I can say I've never done a read from the uinput device, it never even
occurred to me. a quick skim of the code suggests this only matters for
force_feedback stuff. can't really comment on that too much.

Cheers,
   Peter

[v13, 4/8] powerpc/fsl: move mpc85xx.h to include/linux/fsl

2016-10-27 Thread Yangbo Lu

Move mpc85xx.h to include/linux/fsl and rename it to svr.h as a common
header file.  This SVR numberspace is used on some ARM chips as well as
PPC, and even to check for a PPC SVR, multi-arch drivers would otherwise
need to ifdef the header inclusion and all references to the SVR symbols.

Signed-off-by: Yangbo Lu 
Acked-by: Wolfram Sang 
Acked-by: Stephen Boyd 
Acked-by: Joerg Roedel 
[scottwood: update description]
Signed-off-by: Scott Wood 
---
Changes for v2:
- None
Changes for v3:
- None
Changes for v4:
- None
Changes for v5:
- Changed to Move mpc85xx.h to include/linux/fsl/
- Adjusted '#include ' position in file
Changes for v6:
- None
Changes for v7:
- Added 'Acked-by: Wolfram Sang' for I2C part
- Also applied to arch/powerpc/kernel/cpu_setup_fsl_booke.S
Changes for v8:
- Added 'Acked-by: Stephen Boyd' for clk part
- Added 'Acked-by: Scott Wood'
- Added 'Acked-by: Joerg Roedel' for iommu part
Changes for v9:
- None
Changes for v10:
- None
Changes for v11:
- Updated description by Scott
Changes for v12:
- None
Changes for v13:
- None
---
 arch/powerpc/kernel/cpu_setup_fsl_booke.S                     | 2 +-
 arch/powerpc/sysdev/fsl_pci.c                                 | 2 +-
 drivers/clk/clk-qoriq.c                                       | 3 +--
 drivers/i2c/busses/i2c-mpc.c                                  | 2 +-
 drivers/iommu/fsl_pamu.c                                      | 3 +--
 drivers/net/ethernet/freescale/gianfar.c                      | 2 +-
 arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h | 4 ++--
 7 files changed, 8 insertions(+), 10 deletions(-)
 rename arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h (97%)

diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
index 462aed9..2b0284e 100644
--- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
+++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
@@ -13,13 +13,13 @@
  *
  */
 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
-#include 
 
 _GLOBAL(__e500_icache_setup)
mfspr   r0, SPRN_L1CSR1
diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index d3a5974..cb0efea 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -37,7 +38,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/drivers/clk/clk-qoriq.c b/drivers/clk/clk-qoriq.c
index 20b1055..dc778e8 100644
--- a/drivers/clk/clk-qoriq.c
+++ b/drivers/clk/clk-qoriq.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1153,8 +1154,6 @@ static struct clk *clockgen_clk_get(struct of_phandle_args *clkspec, void *data)
 }
 
 #ifdef CONFIG_PPC
-#include 
-
 static const u32 a4510_svrs[] __initconst = {
(SVR_P2040 << 8) | 0x10,/* P2040 1.0 */
(SVR_P2040 << 8) | 0x11,/* P2040 1.1 */
diff --git a/drivers/i2c/busses/i2c-mpc.c b/drivers/i2c/busses/i2c-mpc.c
index 565a49a..e791c51 100644
--- a/drivers/i2c/busses/i2c-mpc.c
+++ b/drivers/i2c/busses/i2c-mpc.c
@@ -27,9 +27,9 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
-#include 
 #include 
 
 #define DRV_NAME "mpc-i2c"
diff --git a/drivers/iommu/fsl_pamu.c b/drivers/iommu/fsl_pamu.c
index a34355f..af8fb27 100644
--- a/drivers/iommu/fsl_pamu.c
+++ b/drivers/iommu/fsl_pamu.c
@@ -21,11 +21,10 @@
 #include "fsl_pamu.h"
 
 #include 
+#include 
 #include 
 #include 
 
-#include 
-
 /* define indexes for each operation mapping scenario */
 #define OMI_QMAN        0x00
 #define OMI_FMAN        0x01
diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 4b4f5bc..55be5ce 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -86,11 +86,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #ifdef CONFIG_PPC
 #include 
-#include 
 #endif
 #include 
 #include 
diff --git a/arch/powerpc/include/asm/mpc85xx.h b/include/linux/fsl/svr.h
similarity index 97%
rename from arch/powerpc/include/asm/mpc85xx.h
rename to include/linux/fsl/svr.h
index 213f3a8..8d13836 100644
--- a/arch/powerpc/include/asm/mpc85xx.h
+++ b/include/linux/fsl/svr.h
@@ -9,8 +9,8 @@
  * (at your option) any later version.
  */
 
-#ifndef __ASM_PPC_MPC85XX_H
-#define __ASM_PPC_MPC85XX_H
+#ifndef FSL_SVR_H
+#define FSL_SVR_H
 
 #define SVR_REV(svr)   ((svr) & 0xFF)  /* SOC design resision */
 #define SVR_MAJ(svr)   (((svr) >>  4) & 0xF)   /* Major revision field*/
-- 
2.1.0.27.g96db324
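
As a rough illustration of how the SVR helpers in the renamed header are meant to be used, here is a minimal userspace sketch. The two macros are copied from the context lines of the svr.h hunk above; DEMO_SOC_ID and demo_make_svr() are hypothetical names invented for this example and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the context lines of the svr.h hunk. */
#define SVR_REV(svr)	((svr) & 0xFF)		/* SoC design revision */
#define SVR_MAJ(svr)	(((svr) >> 4) & 0xF)	/* major revision field */

/* Hypothetical 24-bit SoC identifier, not a real part number. */
#define DEMO_SOC_ID	0x123456u

/*
 * Build a full SVR value the same way the a4510_svrs[] table in
 * clk-qoriq.c does: SoC identifier shifted up by 8 bits, revision
 * byte in the low 8 bits.
 */
static uint32_t demo_make_svr(uint32_t soc_id, uint32_t rev)
{
	assert(soc_id <= 0xFFFFFFu && rev <= 0xFFu);
	return (soc_id << 8) | rev;
}
```

With these definitions, a driver built for either architecture could compare SVR_REV() or SVR_MAJ() of a value read from the hardware against a table entry, which is the kind of use the common header is intended to allow without an ifdef.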

[PATCH v6 05/11] s390/spinlock: Provide vcpu_is_preempted

2016-10-27 Thread Pan Xinhui

From: Christian Borntraeger 

This implements the s390 backend for commit
"kernel/sched: introduce vcpu preempted check interface"
by reworking the existing smp_vcpu_scheduled into
arch_vcpu_is_preempted. We can then also get rid of the
local cpu_is_preempted function by moving the
CIF_ENABLED_WAIT test into arch_vcpu_is_preempted.

Signed-off-by: Christian Borntraeger 
Acked-by: Heiko Carstens 
---
 arch/s390/include/asm/spinlock.h |  8 
 arch/s390/kernel/smp.c           |  9 +++--
 arch/s390/lib/spinlock.c         | 25 -
 3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h
index 7e9e09f..7ecd890 100644
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -23,6 +23,14 @@ _raw_compare_and_swap(unsigned int *lock, unsigned int old, unsigned int new)
return __sync_bool_compare_and_swap(lock, old, new);
 }
 
+#ifndef CONFIG_SMP
+static inline bool arch_vcpu_is_preempted(int cpu) { return false; }
+#else
+bool arch_vcpu_is_preempted(int cpu);
+#endif
+
+#define vcpu_is_preempted arch_vcpu_is_preempted
+
 /*
  * Simple spin lock operations.  There are two variants, one clears IRQ's
  * on the local processor, one does not.
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 35531fe..b988ed1 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -368,10 +368,15 @@ int smp_find_processor_id(u16 address)
return -1;
 }
 
-int smp_vcpu_scheduled(int cpu)
+bool arch_vcpu_is_preempted(int cpu)
 {
-	return pcpu_running(pcpu_devices + cpu);
+	if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
+		return false;
+	if (pcpu_running(pcpu_devices + cpu))
+		return false;
+	return true;
 }
+EXPORT_SYMBOL(arch_vcpu_is_preempted);
 
 void smp_yield_cpu(int cpu)
 {
diff --git a/arch/s390/lib/spinlock.c b/arch/s390/lib/spinlock.c
index e5f50a7..e48a48e 100644
--- a/arch/s390/lib/spinlock.c
+++ b/arch/s390/lib/spinlock.c
@@ -37,15 +37,6 @@ static inline void _raw_compare_and_delay(unsigned int *lock, unsigned int old)
asm(".insn rsy,0xeb22,%0,0,%1" : : "d" (old), "Q" (*lock));
 }
 
-static inline int cpu_is_preempted(int cpu)
-{
-	if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
-		return 0;
-	if (smp_vcpu_scheduled(cpu))
-		return 0;
-	return 1;
-}
-
 void arch_spin_lock_wait(arch_spinlock_t *lp)
 {
unsigned int cpu = SPINLOCK_LOCKVAL;
@@ -62,7 +53,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
continue;
}
/* First iteration: check if the lock owner is running. */
-   if (first_diag && cpu_is_preempted(~owner)) {
+   if (first_diag && arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
continue;
@@ -81,7 +72,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
 * yield the CPU unconditionally. For LPAR rely on the
 * sense running status.
 */
-   if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+   if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
}
@@ -108,7 +99,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags)
continue;
}
/* Check if the lock owner is running. */
-   if (first_diag && cpu_is_preempted(~owner)) {
+   if (first_diag && arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
continue;
@@ -127,7 +118,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags)
 * yield the CPU unconditionally. For LPAR rely on the
 * sense running status.
 */
-   if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+   if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
}
@@ -165,7 +156,7 @@ void _raw_read_lock_wait(arch_rwlock_t *rw)
owner = 0;
while (1) {
if (count-- <= 0) {
-   if (owner && cpu_is_preempted(~owner))
+   if (owner && arch_vcpu_is_preempted(~owner))
smp_yield_cpu(~owner);
count = spin_retry;
}
@@ -211,7 +202,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw, unsigned int prev)
owner = 0;
while (1) {
if (count-- <= 0) {
-   if (owner && cpu_is_preempted(~owner))
+

[PATCH v6 07/11] KVM: Introduce kvm_write_guest_offset_cached

2016-10-27 Thread Pan Xinhui

This allows us to update part of a struct, such as a single status field.

It also saves one kvm_read_guest_cached call when we just update one field
of the struct regardless of its current value.

Signed-off-by: Pan Xinhui 
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 20 ++--
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01c0b9c..6f00237 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -645,6 +645,8 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
unsigned long len);
 int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
   void *data, unsigned long len);
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+				  void *data, int offset, unsigned long len);
 int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
  gpa_t gpa, unsigned long len);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2907b7b..95308ee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1972,30 +1972,38 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);
 
-int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
-			   void *data, unsigned long len)
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+				  void *data, int offset, unsigned long len)
 {
 	struct kvm_memslots *slots = kvm_memslots(kvm);
 	int r;
+	gpa_t gpa = ghc->gpa + offset;
 
-	BUG_ON(len > ghc->len);
+	BUG_ON(len + offset > ghc->len);
 
 	if (slots->generation != ghc->generation)
 		kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa, ghc->len);
 
 	if (unlikely(!ghc->memslot))
-		return kvm_write_guest(kvm, ghc->gpa, data, len);
+		return kvm_write_guest(kvm, gpa, data, len);
 
 	if (kvm_is_error_hva(ghc->hva))
 		return -EFAULT;
 
-	r = __copy_to_user((void __user *)ghc->hva, data, len);
+	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_offset_cached);
+
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len)
+{
+	return kvm_write_guest_offset_cached(kvm, ghc, data, 0, len);
+}
 EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
 
 int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
-- 
2.4.11
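
The shape of this change (an offset-taking writer plus a thin wrapper that passes offset 0) can be sketched in plain userspace C. This is only a model under stated assumptions: struct demo_cache, demo_write_offset_cached() and demo_write_cached() are hypothetical names, memcpy() stands in for __copy_to_user(), and an out-of-range request returns -1 where the patch uses BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for a cached guest mapping. */
struct demo_cache {
	unsigned char *hva;	/* start of the cached region */
	size_t len;		/* size of the cached region  */
};

/* Write len bytes of data at byte offset inside the cached region. */
static int demo_write_offset_cached(struct demo_cache *c, const void *data,
				    size_t offset, size_t len)
{
	assert(c != NULL && data != NULL);

	/* Range check, written so that offset + len cannot overflow. */
	if (offset > c->len || len > c->len - offset)
		return -1;

	memcpy(c->hva + offset, data, len);
	return 0;
}

/* The whole-region writer becomes the offset == 0 special case. */
static int demo_write_cached(struct demo_cache *c, const void *data,
			     size_t len)
{
	return demo_write_offset_cached(c, data, 0, len);
}
```

A caller that only wants to flip one field can pass offsetof() of that field and the size of the field, instead of reading the whole struct, modifying it, and writing it back.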

[PATCH v6 03/11] kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner

2016-10-27 Thread Pan Xinhui

An over-committed guest with more vCPUs than pCPUs spends a lot of time in
the two spin_on_owner functions. This is caused by the lock holder
preemption issue.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see whether
a vCPU is currently running. So break out of the spin loops when it returns
true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown] [H] 0xc00768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Signed-off-by: Pan Xinhui 
Acked-by: Christian Borntraeger 
Tested-by: Juergen Gross 
---
 kernel/locking/mutex.c      | 15 +--
 kernel/locking/rwsem-xadd.c | 16 +---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..82108f5 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 */
barrier();
 
-   if (!owner->on_cpu || need_resched()) {
+   /*
+* Use vcpu_is_preempted to detect lock holder preemption
+* and break. vcpu_is_preempted is a macro defined as false if
+* the arch does not support the vcpu preempted check.
+*/
+   if (!owner->on_cpu || need_resched() ||
+   vcpu_is_preempted(task_cpu(owner))) {
ret = false;
break;
}
@@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)
 
rcu_read_lock();
owner = READ_ONCE(lock->owner);
+
+   /*
+* Because of the lock holder preemption issue, skip spinning if the
+* task is not on a cpu or its cpu is preempted.
+*/
if (owner)
-   retval = owner->on_cpu;
+   retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
rcu_read_unlock();
/*
 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..0897179 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
goto done;
}
 
-   ret = owner->on_cpu;
+   /*
+* Because of the lock holder preemption issue, skip spinning if the
+* task is not on a cpu or its cpu is preempted.
+*/
+   ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
rcu_read_unlock();
return ret;
@@ -362,8 +366,14 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 */
barrier();
 
-   /* abort spinning when need_resched or owner is not running */
-   if (!owner->on_cpu || need_resched()) {
+   /*
+* Abort spinning when need_resched is set, the owner is not running,
+* or the owner's cpu is preempted. vcpu_is_preempted is a macro
+* defined as false if the arch does not support the vcpu preempted
+* check.
+*/
+   if (!owner->on_cpu || need_resched() ||
+   vcpu_is_preempted(task_cpu(owner))) {
rcu_read_unlock();
return false;
}
-- 
2.4.11
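
The spin condition added by this patch can be modelled outside the kernel as follows. This is a sketch under assumptions, not the kernel code: struct demo_task, demo_preempted[], demo_vcpu_is_preempted() and demo_keep_spinning() are hypothetical names, and the #ifndef fallback is assumed from the comment in the patch ("a macro defined as false if the arch does not support the check") since the generic definition is not part of this message.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 4

/* Hypothetical per-cpu "preempted by the hypervisor" state. */
static bool demo_preempted[DEMO_NR_CPUS];

/* Hypothetical arch hook; a real arch would ask the hypervisor. */
static bool demo_vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return demo_preempted[cpu];
}
/* An arch that supports the check points the generic name at its hook. */
#define vcpu_is_preempted demo_vcpu_is_preempted

/* Assumed generic fallback: without arch support the check is false. */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu) false
#endif

/* Hypothetical reduced task: only what the spin condition looks at. */
struct demo_task {
	bool on_cpu;	/* owner is currently scheduled in the guest */
	int cpu;	/* vCPU the owner last ran on */
};

/* Mirrors the condition used in {mutex,rwsem}_spin_on_owner. */
static bool demo_keep_spinning(const struct demo_task *owner,
			       bool need_resched)
{
	if (!owner->on_cpu || need_resched || vcpu_is_preempted(owner->cpu))
		return false;
	return true;
}
```

The point of the extra test is that owner->on_cpu only says the guest scheduler has the owner running; if the hypervisor has descheduled that vCPU, spinning cannot make progress, so the waiter should stop spinning.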

[PATCH v6 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check

2016-10-27 Thread Pan Xinhui

This is to fix some lock holder preemption issues. Some lock
implementations do a spin loop before acquiring the lock itself.
The kernel currently has an interface, bool vcpu_is_preempted(int cpu). It
takes the cpu as a parameter and returns true if the cpu is preempted.
The kernel can then break out of the spin loops based on the return value
of vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

To connect the kernel with kvm/xen, add vcpu_is_preempted to struct
pv_lock_ops.

kvm or xen can then provide their own implementation of
vcpu_is_preempted.

Signed-off-by: Pan Xinhui 
---
 arch/x86/include/asm/paravirt_types.h | 2 ++
 arch/x86/include/asm/spinlock.h       | 8 
 arch/x86/kernel/paravirt-spinlocks.c  | 6 ++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..38c3bb7 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -310,6 +310,8 @@ struct pv_lock_ops {
 
void (*wait)(u8 *ptr, u8 val);
void (*kick)(int cpu);
+
+   bool (*vcpu_is_preempted)(int cpu);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..0526f59 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+   return pv_lock_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
 #include 
 
 /*
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 2c55a00..2f204dd 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -21,12 +21,18 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
 }
 
+static bool native_vcpu_is_preempted(int cpu)
+{
+   return 0;
+}
+
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
.wait = paravirt_nop,
.kick = paravirt_nop,
+   .vcpu_is_preempted = native_vcpu_is_preempted,
 #endif /* SMP */
 };
 EXPORT_SYMBOL(pv_lock_ops);
-- 
2.4.11

[PATCH v6 10/11] x86, xen: support vcpu preempted check

2016-10-27 Thread Pan Xinhui

From: Juergen Gross 

Support the vcpu_is_preempted() functionality under Xen. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

A quick test (4 vcpus on 1 physical cpu doing a parallel build job
with "make -j 8") reduced system time by about 5% with this patch.

Signed-off-by: Juergen Gross 
Signed-off-by: Pan Xinhui 
---
 arch/x86/xen/spinlock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3d6e006..74756bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_lock_ops.wait = xen_qlock_wait;
pv_lock_ops.kick = xen_qlock_kick;
+
+   pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
 }
 
 /*
-- 
2.4.11

[PATCH v6 08/11] x86, kvm/x86.c: support vcpu preempted check

2016-10-27 Thread Pan Xinhui

Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

Use a new field, struct kvm_steal_time::preempted, to indicate whether a
vcpu is running or not.
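
As a rough sketch (not the patch itself), the host-side lifecycle of the
flag can be modelled in userspace C: it is set when the vcpu is put and
cleared when steal time is recorded before the vcpu runs again. struct
demo_vcpu and both functions are hypothetical simplifications; the real
code also writes the value into guest memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down vCPU state; not the kernel's struct kvm_vcpu. */
struct demo_vcpu {
	bool steal_time_enabled;	/* stands in for the KVM_MSR_ENABLED test */
	uint8_t preempted;		/* stands in for kvm_steal_time.preempted */
};

/* Host is about to deschedule the vCPU: publish "preempted". */
static void demo_vcpu_put(struct demo_vcpu *vcpu)
{
	assert(vcpu != NULL);
	if (!vcpu->steal_time_enabled)
		return;
	vcpu->preempted = 1;
}

/* Host is about to run the vCPU again: clear the flag. */
static void demo_record_steal_time(struct demo_vcpu *vcpu)
{
	assert(vcpu != NULL);
	if (!vcpu->steal_time_enabled)
		return;
	vcpu->preempted = 0;
}
```

A guest that never enabled steal time keeps seeing zero in this model, which
matches the "not preempted" reading of an unsupported field.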

Signed-off-by: Pan Xinhui 
---
 arch/x86/include/uapi/asm/kvm_para.h |  4 +++-
 arch/x86/kvm/x86.c                   | 16 ++++++++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..1421a65 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,9 @@ struct kvm_steal_time {
__u64 steal;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u8  preempted;
+   __u8  u8_pad[3];
+   __u32 pad[11];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e375235..f06e115 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
return;
 
+   vcpu->arch.st.steal.preempted = 0;
+
if (vcpu->arch.st.steal.version & 1)
vcpu->arch.st.steal.version += 1;  /* first time write, random junk */
 
@@ -2810,8 +2812,22 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 }
 
+static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
+{
+   if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+   return;
+
+   vcpu->arch.st.steal.preempted = 1;
+
+   kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
+   &vcpu->arch.st.steal.preempted,
+   offsetof(struct kvm_steal_time, preempted),
+   sizeof(vcpu->arch.st.steal.preempted));
+}
+
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+   kvm_steal_time_set_preempted(vcpu);
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
-- 
2.4.11

[PATCH v6 04/11] powerpc/spinlock: support vcpu preempted check

2016-10-27 Thread Pan Xinhui

This fixes some lock holder preemption issues. Some lock
implementations spin before acquiring the lock itself. The kernel
now has an interface, bool vcpu_is_preempted(int cpu), which takes a
cpu number and returns true if that cpu is preempted. The kernel can
then break out of its spin loops based on the return value of
vcpu_is_preempted().

As the kernel already uses this interface, let's support it on powerpc.

Only pSeries needs to support it. However, powerNV is built into the
same kernel image as pSeries, so we would need to return false when
running as powerNV. Since lppaca->yield_count stays zero on powerNV, we
can simply skip the machine type check.
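
For illustration, the test this patch performs (low bit of the big-endian
yield_count) can be tried in portable userspace C. The demo_* helpers are
hypothetical stand-ins for be32_to_cpu() and the lppaca lookup; that an odd
count means "preempted" is read off the patch's expression, not stated
elsewhere in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit big-endian value from raw bytes on any host endianness. */
static uint32_t demo_be32_to_cpu(const uint8_t b[4])
{
	assert(b != NULL);
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Mirrors the patch: report "preempted" when the yield count is odd. */
static bool demo_vcpu_is_preempted(const uint8_t yield_count_be[4])
{
	return (demo_be32_to_cpu(yield_count_be) & 1) != 0;
}
```

An all-zero count, which the commit message says is what powerNV keeps,
yields false, so no machine type check is needed. The byte-order conversion
matters because the low bit lives in the last byte of the stored value.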

Suggested-by: Boqun Feng 
Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Pan Xinhui 
---
 arch/powerpc/include/asm/spinlock.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index fa37fe9..8c1b913 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif
 
+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+   return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
return lock.slock == 0;
-- 
2.4.11

[PATCH v6 11/11] Documentation: virtual: kvm: Support vcpu preempted check

2016-10-27 Thread Pan Xinhui

Commit ("x86, kvm: support vcpu preempted check") adds one field, "__u8
preempted", to struct kvm_steal_time. This field tells whether a vcpu is
running or not.

It is zero if 1) an old KVM does not support this field, or 2) the vcpu
is not preempted. Any other value means the vcpu has been preempted.
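
As a sanity check on the layout change (a sketch, not part of the patch),
the old and new structures can be compared in userspace C with fixed-width
types. The demo_* names are hypothetical; the field lists are copied from
the diff. The assertions assume a typical ABI where uint64_t is 8-byte
aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_steal_time_old {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

struct demo_steal_time_new {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/* The new field is carved out of the first padding word, so size is kept. */
_Static_assert(sizeof(struct demo_steal_time_old) == 64, "old size");
_Static_assert(sizeof(struct demo_steal_time_new) == 64, "new size");
_Static_assert(offsetof(struct demo_steal_time_new, preempted) ==
	       offsetof(struct demo_steal_time_old, pad), "same offset");

/* Interpret a raw 64-byte record the way a new guest would. */
static int demo_preempted_from_raw(const unsigned char raw[64])
{
	struct demo_steal_time_new st;

	assert(raw != NULL);
	memcpy(&st, raw, sizeof(st));
	return st.preempted != 0;
}
```

Because the field overlays what used to be padding, a record whose padding
was left at zero reads as "not preempted", consistent with case 1) above.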

Signed-off-by: Pan Xinhui 
Acked-by: Radim Krčmář 
---
 Documentation/virtual/kvm/msr.txt | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..ab2ab76 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,9 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
__u64 steal;
__u32 version;
__u32 flags;
-   __u32 pad[12];
+   __u8  preempted;
+   __u8  u8_pad[3];
+   __u32 pad[11];
}
 
whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +234,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
nanoseconds. Time during which the vcpu is idle, will not be
reported as steal time.
 
+   preempted: indicates whether the VCPU that owns this struct is
+   running or not. Non-zero values mean the VCPU has been preempted.
+   Zero means the VCPU is not preempted. NOTE: it is always zero if
+   the hypervisor doesn't support this field.
+
 MSR_KVM_EOI_EN: 0x4b564d04
data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
when disabled.  Bit 1 is reserved and must be zero.  When PV end of
-- 
2.4.11

[PATCH v6 02/11] locking/osq: Drop the overload of osq_lock()

2016-10-27 Thread Pan Xinhui

An over-committed guest with more vCPUs than pCPUs suffers heavy overhead
in osq_lock().

This is because vCPU A holds the osq lock and is scheduled out, while
vCPU B waits for its per-cpu node->locked to be set. IOW, vCPU B waits for
vCPU A to run and unlock the osq lock.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see
whether a vCPU is currently running or not. So break out of the spin loop
when it returns true.
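
A simplified userspace model of the loop (a sketch, not the kernel code)
shows the new exit condition. All demo_* names are hypothetical, the
preemption flags are a plain array, and the spin is bounded so that the
example terminates; the cpu + 1 encoding follows encode_cpu() in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4

/* Hypothetical per-vCPU flags standing in for the hypervisor's data. */
static bool demo_preempted[DEMO_NR_CPUS];

static bool demo_vcpu_is_preempted(int cpu)
{
	return demo_preempted[cpu];
}

/* Queue node; cpu is stored as cpu number + 1, as encode_cpu() does. */
struct demo_node {
	struct demo_node *prev;
	int cpu;
	bool locked;
};

static int demo_encode_cpu(int cpu_nr)
{
	return cpu_nr + 1;
}

static int demo_node_cpu(const struct demo_node *node)
{
	return node->cpu - 1;
}

enum demo_result { DEMO_ACQUIRED, DEMO_BAILED, DEMO_SPUN_OUT };

/* Spin on node->locked, but give up early if the predecessor's vCPU is out. */
static enum demo_result demo_spin(const struct demo_node *node, int max_spins)
{
	assert(node != NULL && node->prev != NULL);
	for (int i = 0; i < max_spins; i++) {
		if (node->locked)
			return DEMO_ACQUIRED;
		if (demo_vcpu_is_preempted(demo_node_cpu(node->prev)))
			return DEMO_BAILED;	/* the "goto unqueue" case */
	}
	return DEMO_SPUN_OUT;
}
```

Without the preemption check, the waiter in this model would burn all of
max_spins while its predecessor cannot make progress.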

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng 
Signed-off-by: Pan Xinhui 
Acked-by: Christian Borntraeger 
Tested-by: Juergen Gross 
---
 kernel/locking/osq_lock.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..39d1385 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
 }
 
+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+   return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
while (!READ_ONCE(node->locked)) {
/*
 * If we need to reschedule bail... so we can block.
+* Use vcpu_is_preempted() to detect lock holder preemption and
+* break. vcpu_is_preempted() is a macro defined as false if the
+* arch does not support the vcpu preempted check.
 */
-   if (need_resched())
+   if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
goto unqueue;
 
cpu_relax_lowlatency();
-- 
2.4.11

[PATCH v6 09/11] x86, kernel/kvm.c: support vcpu preempted check

2016-10-27 Thread Pan Xinhui

Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

Since commit ("x86, kvm/x86.c: support vcpu preempted check"), struct
kvm_steal_time::preempted indicates whether a vcpu is running or not.

unix benchmark result:
host:  kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case                                after-patch       before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
Process Creation                       |    29881.2 lps  |    28572.8 lps
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps

Signed-off-by: Pan Xinhui 
---
 arch/x86/kernel/kvm.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0b48dd2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+static bool kvm_vcpu_is_preempted(int cpu)
+{
+   struct kvm_steal_time *src;
+
+   src = &per_cpu(steal_time, cpu);
+
+   return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -471,6 +480,9 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
has_steal_clock = 1;
pv_time_ops.steal_clock = kvm_steal_clock;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+   pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+#endif
}
 
if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
-- 
2.4.11

[PATCH v6 01/11] kernel/sched: introduce vcpu preempted check interface

2016-10-27 Thread Pan Xinhui

This patch helps fix the lock holder preemption issue.

Kernel users can call bool vcpu_is_preempted(int cpu) to detect whether
a vcpu is preempted or not.

The default implementation is a macro defined as false, so the compiler
can optimize it away if the arch does not support such a vcpu preempted
check.
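
The override pattern can be sketched in userspace C (illustrative only).
DEMO_ARCH_HAS_PREEMPT_CHECK and the demo_* functions are hypothetical; the
#ifndef fallback and the self-referential #define are the two idioms used in
this series.

```c
#include <assert.h>
#include <stdbool.h>

/* What an architecture header might provide (hypothetical and optional). */
#ifdef DEMO_ARCH_HAS_PREEMPT_CHECK
#define vcpu_is_preempted vcpu_is_preempted
static inline bool vcpu_is_preempted(int cpu)
{
	return cpu == 0;	/* arbitrary demo behaviour */
}
#endif

/* Generic fallback, as in the patch: the check is a constant false. */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu) false
#endif

/* Returns 1 if spinning on owner_cpu is still worthwhile, 0 otherwise. */
static int demo_spins_allowed(int owner_cpu)
{
	/* With the fallback, this condition is constant and can be dropped. */
	if (vcpu_is_preempted(owner_cpu))
		return 0;
	return 1;
}

/* Small self-check for the fallback configuration. */
static void demo_self_check(void)
{
	assert(demo_spins_allowed(0) == 1);
}
```

Defining the macro to its own name lets the generic header detect the arch
version with a plain #ifndef while the arch version stays a real inline
function.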

Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Pan Xinhui 
Acked-by: Christian Borntraeger 
Tested-by: Juergen Gross 
---
 include/linux/sched.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with various lock holder preemption issues, provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu) false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
-- 
2.4.11

[PATCH v6 00/11] implement vcpu preempted check

2016-10-27 Thread Pan Xinhui

change from v5:
    split x86/kvm patch into guest/host parts.
    introduce kvm_write_guest_offset_cached.
    fix some typos.
    rebase patch onto 4.9.2
change from v4:
    split x86 kvm vcpu preempted check into two patches.
    add documentation patch.
    add x86 vcpu preempted check patch under xen
    add s390 vcpu preempted check patch
change from v3:
    add x86 vcpu preempted check patch
change from v2:
    no code change, fix typos, update some comments
change from v1:
    a simpler definition of default vcpu_is_preempted
    skip machine type check on ppc, and add config. remove dedicated macro.
    add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
    add more comments
    thanks to Boqun and Peter for their suggestions.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in the
spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused RCU stalls before we applied this
patch set.

We have also observed some performance improvements in unix benchmark tests.

PPC test result:
1 copy - 0.94%
2 copy - 7.17%
4 copy - 11.9%
8 copy -  3.04%
16 copy - 15.11%

details below:
Without patch:

1 copy - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps  (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps  (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps  (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps  (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps  (30.1 s, 1 samples)

With patch: 

1 copy - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps  (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps  (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps  (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps  (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps  (30.0 s, 1 samples)
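
The PPC summary percentages can be reproduced from the two lists. The
sketch below assumes they are gains relative to the patched throughput;
that convention is inferred from the numbers and is not stated in this
message. demo_gain_percent() is a hypothetical helper.

```c
#include <assert.h>

/*
 * Gain in percent, relative to the patched throughput. Inputs are KBps
 * figures from the "Without patch" and "With patch" lists.
 */
static double demo_gain_percent(double without_patch, double with_patch)
{
	assert(with_patch > 0.0);
	return (with_patch - without_patch) / with_patch * 100.0;
}
```

For example, the 2 copy row gives (1943816.0 - 1804433.0) / 1943816.0,
about 7.17%, and the 16 copy row gives about 15.11%. The 1 copy and 4 copy
rows come out slightly above the listed 0.94% and 11.9%, which suggests
those two figures were truncated rather than rounded.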

X86 test result:
test-case                                after-patch       before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
Process Creation                       |    29881.2 lps  |    28572.8 lps
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps

Christian Borntraeger (1):
  s390/spinlock: Provide vcpu_is_preempted

Juergen Gross (1):
  x86, xen: support vcpu preempted check

Pan Xinhui (9):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, paravirt: Add interface to support kvm/xen vcpu preempted check
  KVM: Introduce kvm_write_guest_offset_cached
  x86, kvm/x86.c: support vcpu preempted check
  x86, kernel/kvm.c: support vcpu preempted check
  Documentation: virtual: kvm: Support vcpu preempted check

 Documentation/virtual/kvm/msr.txt |  9 -
 arch/powerpc/include/asm/spinlock.h   |  8 
 arch/s390/include/asm/spinlock.h  |  8 
 arch/s390/kernel/smp.c|  9 +++--
 arch/s390/lib/spinlock.c  | 25 -
 arch/x86/include/asm/paravirt_types.h |  2 ++
 arch/x86/include/asm/spinlock.h   |  8 
 arch/x86/include/uapi/asm/kvm_para.h  |  4 +++-
 arch/x86/kernel/kvm.c | 12 
 arch/x86/kernel/paravirt-spinlocks.c  |  6 ++
 arch/x86/kvm/x86.c| 16 
 arch/x86/xen/spinlock.c   |  3 ++-
 include/linux/kvm_host.h  |  2 ++
 include/linux/sched.h | 12

[PATCH v6 01/11] kernel/sched: introduce vcpu preempted check interface

2016-10-27 Thread Pan Xinhui

This patch adds an interface to help fix lock holder preemption issues.

Kernel users can call bool vcpu_is_preempted(int cpu) to detect whether a
vCPU is preempted or not.

The default implementation is a macro defined as false, so the compiler can
optimize it out if the arch does not support such a vCPU preempted check.
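
As a rough illustration of how a spin loop might consume this interface, here
is a userspace sketch. Everything prefixed toy_ or TOY_ is hypothetical and
not part of the patch; only the #ifndef fallback mirrors the hunk below. The
toy "arch" override is defined first so that the behaviour can be observed;
without it the fallback evaluates to false and the preemption branch can be
compiled away.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Hypothetical hypervisor state: true means the vCPU is scheduled out. */
static bool toy_preempted[TOY_NR_CPUS];

static inline bool toy_arch_vcpu_is_preempted(int cpu)
{
	return toy_preempted[cpu];
}

/* A toy "arch" that can answer the question overrides the macro first. */
#define vcpu_is_preempted(cpu) toy_arch_vcpu_is_preempted(cpu)

/* Generic fallback, as in the patch: constant false when no arch hook exists. */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu) false
#endif

enum toy_spin_result {
	TOY_SPIN_LOCK_FREE,        /* lock was released while spinning */
	TOY_SPIN_OWNER_PREEMPTED,  /* owner's vCPU is not running: stop spinning */
	TOY_SPIN_BUDGET_SPENT,     /* gave up after max_spins iterations */
};

/* Spin while the lock is held, but bail out early if the owner is preempted. */
static enum toy_spin_result toy_spin_on_owner(const volatile int *locked,
					      int owner_cpu, int max_spins)
{
	assert(owner_cpu >= 0 && owner_cpu < TOY_NR_CPUS);

	for (int i = 0; i < max_spins; i++) {
		if (!*locked)
			return TOY_SPIN_LOCK_FREE;
		if (vcpu_is_preempted(owner_cpu))
			return TOY_SPIN_OWNER_PREEMPTED;
	}
	return TOY_SPIN_BUDGET_SPENT;
}
```

A caller that gets TOY_SPIN_OWNER_PREEMPTED would be expected to block rather
than keep burning cycles on a lock whose holder cannot make progress.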

Suggested-by: Peter Zijlstra (Intel) 
Signed-off-by: Pan Xinhui 
Acked-by: Christian Borntraeger 
Tested-by: Juergen Gross 
---
 include/linux/sched.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with various lock holder preemption issues, provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu) false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
-- 
2.4.11

[v13, 7/8] base: soc: introduce soc_device_match() interface

2016-10-27 Thread Yangbo Lu

From: Arnd Bergmann 

We keep running into cases where device drivers want to know the exact
version of the SoC they are currently running on. In the past, this has
usually been done through a vendor specific API that can be called by a
driver, or by directly accessing some kind of version register that is
not part of the device itself but that belongs to a global register area
of the chip.

Common reasons for doing this include:

- A machine is not using devicetree or similar for passing data about
  on-chip devices, but just announces their presence using boot-time
  platform devices, and the machine code itself does not care about the
  revision.

- There is existing firmware or boot loaders with existing DT binaries
  with generic compatible strings that do not identify the particular
  revision of each device, but the driver knows which SoC revisions
  include which part.

- A prerelease version of a chip has some quirks and we are using the same
  version of the bootloader and the DT blob on both the prerelease and the
  final version. An update of the DT binding seems inappropriate because
  that would involve maintaining multiple copies of the dts and/or
  bootloader.

This patch introduces the soc_device_match() interface that is meant to
work like of_match_node() but instead of identifying the version of a
device, it identifies the SoC itself using a vendor-agnostic interface.

Unlike of_match_node(), we do not do an exact string compare but instead
use glob_match() to allow wildcards in strings.
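
To make the matching rules concrete, here is a userspace model of the lookup.
It is only a sketch: the toy_ names are hypothetical, POSIX fnmatch() stands in
for the kernel's glob_match(), and the SoC being matched is passed in directly
instead of being found by walking the soc bus. A NULL field in a table entry
means "don't care", and the table ends with an all-NULL sentinel entry.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* Userspace model of the four matchable fields plus the .data cookie. */
struct toy_soc_attr {
	const char *machine;
	const char *family;
	const char *revision;
	const char *soc_id;
	const void *data;
};

/* A NULL pattern matches anything; otherwise the value must glob-match. */
static int toy_field_ok(const char *pattern, const char *value)
{
	if (!pattern)
		return 1;
	return value && fnmatch(pattern, value, 0) == 0;
}

/* All four fields of one table entry must be satisfied by the SoC. */
static int toy_match_one(const struct toy_soc_attr *m,
			 const struct toy_soc_attr *soc)
{
	return toy_field_ok(m->machine, soc->machine) &&
	       toy_field_ok(m->family, soc->family) &&
	       toy_field_ok(m->revision, soc->revision) &&
	       toy_field_ok(m->soc_id, soc->soc_id);
}

/* Return the first matching entry, or NULL if none matches. */
static const struct toy_soc_attr *
toy_soc_match(const struct toy_soc_attr *matches,
	      const struct toy_soc_attr *soc)
{
	assert(soc != NULL);

	if (!matches)
		return NULL;

	for (; matches->machine || matches->family ||
	       matches->revision || matches->soc_id; matches++) {
		if (toy_match_one(matches, soc))
			return matches;
	}
	return NULL;
}
```

A driver-side quirk table would then list entries such as a family string plus
a revision pattern like "1.*", and the driver would apply its workaround only
when the lookup returns a non-NULL entry.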

Signed-off-by: Arnd Bergmann 
Signed-off-by: Yangbo Lu 
Acked-by: Greg Kroah-Hartman 
---
Changes for v11:
- Added this patch for soc match
Changes for v12:
- Corrected the author
- Rewrote soc_device_match with a while loop
Changes for v13:
- Added ack from Greg
---
 drivers/base/Kconfig|  1 +
 drivers/base/soc.c  | 66 +
 include/linux/sys_soc.h |  3 +++
 3 files changed, 70 insertions(+)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index fdf44ca..991b21e 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -235,6 +235,7 @@ config GENERIC_CPU_AUTOPROBE
 
 config SOC_BUS
bool
+   select GLOB
 
 source "drivers/base/regmap/Kconfig"
 
diff --git a/drivers/base/soc.c b/drivers/base/soc.c
index b63f23e..0c5cf87 100644
--- a/drivers/base/soc.c
+++ b/drivers/base/soc.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include <linux/glob.h>
 
 static DEFINE_IDA(soc_ida);
 
@@ -159,3 +160,68 @@ static int __init soc_bus_register(void)
    return bus_register(&soc_bus_type);
 }
 core_initcall(soc_bus_register);
+
+static int soc_device_match_one(struct device *dev, void *arg)
+{
+   struct soc_device *soc_dev = container_of(dev, struct soc_device, dev);
+   const struct soc_device_attribute *match = arg;
+
+   if (match->machine &&
+   !glob_match(match->machine, soc_dev->attr->machine))
+   return 0;
+
+   if (match->family &&
+   !glob_match(match->family, soc_dev->attr->family))
+   return 0;
+
+   if (match->revision &&
+   !glob_match(match->revision, soc_dev->attr->revision))
+   return 0;
+
+   if (match->soc_id &&
+   !glob_match(match->soc_id, soc_dev->attr->soc_id))
+   return 0;
+
+   return 1;
+}
+
+/*
+ * soc_device_match - identify the SoC in the machine
+ * @matches: zero-terminated array of possible matches
+ *
+ * returns the first matching entry of the argument array, or NULL
+ * if none of them match.
+ *
+ * This function is meant as a helper in place of of_match_node()
+ * in cases where either no device tree is available or the information
+ * in a device node is insufficient to identify a particular variant
+ * by its compatible strings or other properties. For new devices,
+ * the DT binding should always provide unique compatible strings
+ * that allow the use of of_match_node() instead.
+ *
+ * The calling function can use the .data entry of the
+ * soc_device_attribute to pass a structure or function pointer for
+ * each entry.
+ */
+const struct soc_device_attribute *soc_device_match(
+   const struct soc_device_attribute *matches)
+{
+   int ret = 0;
+
+   if (!matches)
+   return NULL;
+
+   while (!ret) {
+   if (!(matches->machine || matches->family ||
+ matches->revision || matches->soc_id))
+   break;
+   ret = bus_for_each_dev(&soc_bus_type, NULL, (void *)matches,
+  soc_device_match_one);
+   if (!ret)
+   matches++;
+   else
+   return matches;
+   }
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(soc_device_match);
diff --git a/include/linux/sys_soc.h b/include/linux/sys_soc.h
index 2739ccb..9f5eb06 100644
--- a/include/linux/sys_soc.h
+++ b/include/linux/sys_soc.h
@@

[v13, 3/8] dt: bindings: move guts devicetree doc out of powerpc directory

2016-10-27 Thread Yangbo Lu

Move guts devicetree doc to Documentation/devicetree/bindings/soc/fsl/
since it is used not only by PowerPC but also by ARM. Also add a
specification for the 'little-endian' property.

Signed-off-by: Yangbo Lu 
Acked-by: Rob Herring 
Acked-by: Scott Wood 
---
Changes for v4:
- Added this patch
Changes for v5:
- Modified the description for little-endian property
Changes for v6:
- None
Changes for v7:
- None
Changes for v8:
- Added 'Acked-by: Scott Wood'
- Added 'Acked-by: Rob Herring'
Changes for v9:
- None
Changes for v10:
- None
Changes for v11:
- None
Changes for v12:
- None
Changes for v13:
- None
---
 Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt | 3 +++
 1 file changed, 3 insertions(+)
 rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%)

diff --git a/Documentation/devicetree/bindings/powerpc/fsl/guts.txt b/Documentation/devicetree/bindings/soc/fsl/guts.txt
similarity index 91%
rename from Documentation/devicetree/bindings/powerpc/fsl/guts.txt
rename to Documentation/devicetree/bindings/soc/fsl/guts.txt
index b71b203..07adca9 100644
--- a/Documentation/devicetree/bindings/powerpc/fsl/guts.txt
+++ b/Documentation/devicetree/bindings/soc/fsl/guts.txt
@@ -25,6 +25,9 @@ Recommended properties:
  - fsl,liodn-bits : Indicates the number of defined bits in the LIODN
registers, for those SOCs that have a PAMU device.
 
+ - little-endian : Indicates that the global utilities block is little
+   endian. The default is big endian.
+
 Examples:
global-utilities@e {/* global utilities block */
compatible = "fsl,mpc8548-guts";
-- 
2.1.0.27.g96db324

Re: [PATCH 1/4] printk/NMI: Handle continuous lines and missing newline

2016-10-27 Thread Sergey Senozhatsky

On (10/27/16 09:35), Joe Perches wrote:
[..]
> > -   printk_nmi_flush_line(buf, (end - start) + 1);
> > +   /* Handle continuous lines or missing new line. */
> > +   if ((c + 1 < end) && printk_get_level(c)) {
> > +   if (header) {
> > +   c += 2;
> 
> printk_skip_level

agree, printk_skip_level() probably would look better here.
other than that, looks good to me. nice that you found it, Petr!

Reviewed-by: Sergey Senozhatsky 

-ss

Re: drivers/base/power/opp/of.c:181:6: error: redefinition of 'dev_pm_opp_of_remove_table'

2016-10-27 Thread Fengguang Wu


On Fri, Oct 28, 2016 at 09:27:53AM +0530, Viresh Kumar wrote:
> On 28-10-16, 07:22, kbuild test robot wrote:
> > tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> > head:   e3300ffef0653774f1099cab153d25d24bd773ce
> > commit: f47b72a15a9679dd4dc1af681d4d2f1ca2815552 PM / OPP: Move CONFIG_OF dependent code in a separate file
> > date:   6 months ago
>
> Why are we picking it up now ?

Sorry, due to problems in the 0day infrastructure a few errors were
missed in May. We are catching this one now that the commit has gone mainline.

https://lists.01.org/pipermail/kbuild-all/

June 2016:  ... [ Gzip'd Text 853 KB ]
May 2016:   ... [ Gzip'd Text 294 KB ]
April 2016: ... [ Gzip'd Text 599 KB ]

As you can see, the report volumes are noticeably lower in "May 2016".

Thanks,
Fengguang

Re: [RFC][PATCHv4 0/6] printk: use printk_safe to handle printk() recursive calls

2016-10-27 Thread Sergey Senozhatsky

Hello,

On (10/27/16 20:30), Linus Torvalds wrote:
> On Thu, Oct 27, 2016 at 8:49 AM, Sergey Senozhatsky
>  wrote:
> >
> > RFC
> >
> > This patch set extends a lock-less NMI per-cpu buffers idea to
> > handle recursive printk() calls. The basic mechanism is pretty much the
> > same -- at the beginning of a deadlock-prone section we switch to lock-less
> > printk callback, and return back to a default printk implementation at the
> > end; the messages are getting flushed to a logbuf buffer from a safer
> > context.
> 
> This looks very reasonable to me.
> 
> Does this also obviate the need for "printk_deferred()" that the
> scheduler and the clock code uses?  Because that would be a lovely
> thing to look at if it doesn't..

I wish I could say that we can retire printk_deferred(), but no, we still
need it. it's rather simple to fix printk recursion (that's what the patch
set is doing), but printk deadlocks are much harder to handle. anything that
starts somewhere else but is somehow related to printk can deadlock (in the
worst case). I use this backtrace as an example:

 SyS_ioctl
  do_vfs_ioctl
   tty_ioctl
    n_tty_ioctl
     tty_mode_ioctl
      set_termios
       tty_set_termios
        uart_set_termios
         uart_change_speed
          FOO_serial_set_termios
           spin_lock_irqsave(&port->lock) // lock the output port

           !! WARN() or pr_err() or printk()
            vprintk_emit()
             /* console_trylock() */
             console_unlock()
              call_console_drivers()
               FOO_write()
                spin_lock_irqsave(&port->lock) // already locked

with the current printk we can't tell for sure how many locks will
be acquired -- printk() can succeed in locking the console_sem and
start invoking console drivers (if any) from console_unlock(), or
it can fail, in which case we acquire only the logbuf spin_lock and
the console_sem spin_lock.

things can change *a bit* once we switch to async_printk, because
instead of doing console_unlock()->call_console_drivers(), printk()
will just wake_up() the printk_kthread. but still, it won't be enough
to remove printk_deferred()   :(

 vprintk_emit()
  wake_up()
   spin_lock rq lock
    printk

will be safe. but

 wake_up()
  spin_lock rq lock
   printk
    vprintk_emit()
     wake_up()
      spin_lock rq lock

will deadlock.

we can't even tell for sure what locks are "important" to printk().
a small and reasonable code refactoring somewhere in clock code/etc.
can accidentally change the whole picture by introducing "unsafe"
WARN_ON() or adding yet another lock to the printing path.
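
for reference, the buffering idea from the quoted RFC text can be modelled in
userspace roughly as below. this is only a sketch with hypothetical toy_ names:
a depth counter stands in for the per-cpu "safe" state, and two plain arrays
stand in for the lock-less buffer and the logbuf. it covers the recursion case
only; it does nothing for the port->lock style deadlock shown above.

```c
#include <assert.h>
#include <string.h>

#define TOY_BUF_SIZE 256

static int toy_safe_depth;              /* > 0 inside a deadlock-prone section */
static char toy_safe_buf[TOY_BUF_SIZE]; /* stand-in for the lock-less buffer */
static size_t toy_safe_len;
static char toy_logbuf[TOY_BUF_SIZE];   /* stand-in for the main log buffer */
static size_t toy_log_len;

/* Append msg to buf, truncating instead of overflowing. */
static void toy_append(char *buf, size_t *len, const char *msg)
{
	size_t n = strlen(msg);

	assert(*len < TOY_BUF_SIZE);
	if (n > TOY_BUF_SIZE - 1 - *len)
		n = TOY_BUF_SIZE - 1 - *len;
	memcpy(buf + *len, msg, n);
	*len += n;
	buf[*len] = '\0';
}

static void toy_safe_enter(void)
{
	toy_safe_depth++;
}

static void toy_safe_exit(void)
{
	assert(toy_safe_depth > 0);
	toy_safe_depth--;
}

/* Inside a safe section, messages go to the side buffer and take no locks. */
static void toy_printk(const char *msg)
{
	if (toy_safe_depth > 0)
		toy_append(toy_safe_buf, &toy_safe_len, msg);
	else
		toy_append(toy_logbuf, &toy_log_len, msg);
}

/* Later, from a safer context, move the buffered messages to the logbuf. */
static void toy_flush(void)
{
	assert(toy_safe_depth == 0);
	toy_append(toy_logbuf, &toy_log_len, toy_safe_buf);
	toy_safe_len = 0;
	toy_safe_buf[0] = '\0';
}
```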

need to think more.

p.s.
we are planning to discuss printk related issues next week in Santa Fe.

-ss

Re: [PATCH 7/7] mfd: tps65217: Fix mismatched interrupt number

2016-10-27 Thread Milo Kim

On 10/26/2016 10:56 PM, Lee Jones wrote:
> > diff --git a/include/linux/mfd/tps65217.h b/include/linux/mfd/tps65217.h
> > index 4ccda89..75a3a5f 100644
> > --- a/include/linux/mfd/tps65217.h
> > +++ b/include/linux/mfd/tps65217.h
> > @@ -235,9 +235,9 @@ struct tps65217_bl_pdata {
> >  };
> >
> >  enum tps65217_irq_type {
> > -  TPS65217_IRQ_PB,
> > -  TPS65217_IRQ_AC,
> >    TPS65217_IRQ_USB,
> > +  TPS65217_IRQ_AC,
> > +  TPS65217_IRQ_PB,
> >    TPS65217_NUM_IRQ
> >  };
>
> This is why using enum for these types of assignments is sometimes
> dangerous.  It's probably best to be explicit.

I agree with you. Let me fix in v2 - use #define instead of enum type.
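
A possible shape for the explicit version, shown only as a sketch: the values
follow the corrected order in the quoted diff (USB, AC, PB), and the mask
helper is a hypothetical addition that assumes interrupt N maps to bit N. The
actual v2 may differ.

```c
#include <assert.h>

/* Explicit values: reordering these lines cannot silently renumber them. */
#define TPS65217_IRQ_USB	0
#define TPS65217_IRQ_AC		1
#define TPS65217_IRQ_PB		2
#define TPS65217_NUM_IRQ	3

/* Hypothetical helper: bit mask for one interrupt, assuming bit N <-> IRQ N. */
static unsigned int toy_tps65217_irq_mask(int irq)
{
	assert(irq >= 0 && irq < TPS65217_NUM_IRQ);
	return 1u << irq;
}
```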

Best regards,
Milo

[v13, 1/8] dt: bindings: update Freescale DCFG compatible

2016-10-27 Thread Yangbo Lu

Update Freescale DCFG compatible with 'fsl,<chip>-dcfg' instead
of 'fsl,ls1021a-dcfg' to include more chips such as ls1021a,
ls1043a, and ls2080a.

Signed-off-by: Yangbo Lu 
Acked-by: Rob Herring 
Signed-off-by: Scott Wood 
---
Changes for v8:
- Added this patch
Changes for v9:
- Added a list for the possible compatibles
Changes for v10:
- None
Changes for v11:
- Added 'Acked-by: Rob Herring'
- Updated commit message by Scott
Changes for v12:
- None
Changes for v13:
- None
---
 Documentation/devicetree/bindings/arm/fsl.txt | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/arm/fsl.txt b/Documentation/devicetree/bindings/arm/fsl.txt
index dbbc095..713c1ae 100644
--- a/Documentation/devicetree/bindings/arm/fsl.txt
+++ b/Documentation/devicetree/bindings/arm/fsl.txt
@@ -119,7 +119,11 @@ Freescale DCFG
 configuration and status for the device. Such as setting the secondary
 core start address and release the secondary core from holdoff and startup.
   Required properties:
-  - compatible: should be "fsl,ls1021a-dcfg"
+  - compatible: should be "fsl,<chip>-dcfg"
+Possible compatibles:
+   "fsl,ls1021a-dcfg"
+   "fsl,ls1043a-dcfg"
+   "fsl,ls2080a-dcfg"
   - reg : should contain base address and length of DCFG memory-mapped registers
 
 Example:
-- 
2.1.0.27.g96db324

Re: [PATCH 5/7] ARM: dts: am335x: Add the charger interrupt

2016-10-27 Thread Milo Kim


On 10/22/2016 05:47 AM, Robert Nelson wrote:
> > +#include 
>
> ^ this hasn't been posted nor pushed to mainline yet.. ;)

Oops! I created this file, but it was captured neither in my git tree
nor in my head! Thanks for your review.


Best regards,
Milo

[v13, 0/8] Fix eSDHC host version register bug

2016-10-27 Thread Yangbo Lu

This patchset fixes a host version register bug in the T4240-R1.0-R2.0
eSDHC controller. To match the SoC version and revision, the ten previous
versions of this patchset tried many methods, all of which were rejected by
reviewers, such as:
- dts compatible method
- syscon method
- ifdef PPC method
- GUTS driver getting SVR method
Arnd suggested a soc_device_match method in v10, and this is the only available
method left now. This v11 patchset introduces the soc_device_match interface in
the soc driver.

The first six of Yangbo's patches add the GUTS driver, which registers a soc
device that contains the SoC version and revision information.
The other two patches introduce the soc_device_match method in the soc driver
and apply it to the esdhc driver to fix this bug.
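
As a rough illustration of the table-matching idea behind soc_device_match,
here is a user-space sketch. It is not the kernel implementation: struct
soc_attr, field_matches(), soc_match() and the family/revision strings are
hypothetical stand-ins, and fields are compared with exact string equality
as a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space stand-in for a SoC attribute record; only two fields are
 * modelled here (hypothetical, for illustration only). */
struct soc_attr {
	const char *family;
	const char *revision;
};

/* A NULL field in the pattern acts as a wildcard. */
static int field_matches(const char *pattern, const char *value)
{
	return pattern == NULL ||
	       (value != NULL && strcmp(pattern, value) == 0);
}

/*
 * Walk a table terminated by an all-NULL entry and return the first entry
 * that matches the running SoC, or NULL if none does.
 * Assumption: exact string comparison, no pattern matching.
 */
static const struct soc_attr *soc_match(const struct soc_attr *soc,
					const struct soc_attr *table)
{
	assert(soc != NULL && table != NULL);
	for (; table->family || table->revision; table++) {
		if (field_matches(table->family, soc->family) &&
		    field_matches(table->revision, soc->revision))
			return table;
	}
	return NULL;
}

/* Illustrative quirk table for the affected revisions (strings assumed). */
static const struct soc_attr esdhc_quirk_socs[] = {
	{ .family = "QorIQ T4240", .revision = "1.0" },
	{ .family = "QorIQ T4240", .revision = "2.0" },
	{ NULL, NULL },
};
```

A driver would apply its workaround only when the lookup returns a non-NULL
entry, which keeps the list of affected SoC revisions as data rather than as
ifdefs or extra device tree compatibles.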

Arnd Bergmann (1):
  base: soc: introduce soc_device_match() interface

Yangbo Lu (7):
  dt: bindings: update Freescale DCFG compatible
  ARM64: dts: ls2080a: add device configuration node
  dt: bindings: move guts devicetree doc out of powerpc directory
  powerpc/fsl: move mpc85xx.h to include/linux/fsl
  soc: fsl: add GUTS driver for QorIQ platforms
  MAINTAINERS: add entry for Freescale SoC drivers
  mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

 Documentation/devicetree/bindings/arm/fsl.txt  |   6 +-
 .../bindings/{powerpc => soc}/fsl/guts.txt |   3 +
 MAINTAINERS|  11 +-
 arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi |   6 +
 arch/powerpc/kernel/cpu_setup_fsl_booke.S  |   2 +-
 arch/powerpc/sysdev/fsl_pci.c  |   2 +-
 drivers/base/Kconfig   |   1 +
 drivers/base/soc.c |  66 ++
 drivers/clk/clk-qoriq.c|   3 +-
 drivers/i2c/busses/i2c-mpc.c   |   2 +-
 drivers/iommu/fsl_pamu.c   |   3 +-
 drivers/mmc/host/Kconfig   |   1 +
 drivers/mmc/host/sdhci-of-esdhc.c  |  20 ++
 drivers/net/ethernet/freescale/gianfar.c   |   2 +-
 drivers/soc/Kconfig|   3 +-
 drivers/soc/fsl/Kconfig|  18 ++
 drivers/soc/fsl/Makefile   |   1 +
 drivers/soc/fsl/guts.c | 236 +
 include/linux/fsl/guts.h   | 125 ++-
 .../asm/mpc85xx.h => include/linux/fsl/svr.h   |   4 +-
 include/linux/sys_soc.h|   3 +
 21 files changed, 456 insertions(+), 62 deletions(-)
 rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%)
 create mode 100644 drivers/soc/fsl/Kconfig
 create mode 100644 drivers/soc/fsl/guts.c
 rename arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h (97%)

-- 
2.1.0.27.g96db324

[v13, 5/8] soc: fsl: add GUTS driver for QorIQ platforms

2016-10-27 Thread Yangbo Lu

The global utilities block controls power management, I/O device
enabling, power-on reset (POR) configuration monitoring, alternate
function selection for multiplexed signals, and clock control.

This patch adds a driver to manage and access the global utilities block.
Initially only reading the SVR and registering a soc device are supported.
Other guts accesses, such as reading the RCW, should eventually be moved
into this driver as well.
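
To make the SVR lookup concrete, the following user-space sketch shows how
a table of (die, svr, mask) entries like the fsl_soc_die_attr struct in this
patch could be searched, and how a revision string might be derived. The
table values, the names die_match() and svr_revision(), and the assumed
revision layout (major in bits 7..4, minor in bits 3..0) are placeholders
for illustration, not values taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors the shape of the die attribute struct: name, SVR value, mask. */
struct die_attr {
	const char *die;
	uint32_t svr;
	uint32_t mask;
};

/* Placeholder entries; real SVR values are NOT reproduced here. */
static const struct die_attr example_dies[] = {
	{ "DIE_A", 0x12300000u, 0xfff00000u },
	{ "DIE_B", 0x45600000u, 0xfff00000u },
	{ NULL, 0, 0 },	/* terminator */
};

/* Return the first entry whose masked SVR equals the entry's SVR value. */
static const struct die_attr *die_match(uint32_t svr,
					const struct die_attr *table)
{
	assert(table != NULL);
	for (; table->die; table++) {
		if ((svr & table->mask) == table->svr)
			return table;
	}
	return NULL;
}

/* Format "major.minor", assuming the revision sits in the low byte. */
static void svr_revision(uint32_t svr, char *buf, size_t len)
{
	snprintf(buf, len, "%u.%u",
		 (unsigned int)((svr >> 4) & 0xf),
		 (unsigned int)(svr & 0xf));
}
```

Masking before comparing lets one entry cover every variant and revision
built on the same die, which is the reason each entry carries its own mask.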

Signed-off-by: Yangbo Lu 
---
Changes for v4:
- Added this patch
Changes for v5:
- Modified copyright info
- Changed MODULE_LICENSE to GPL
- Changed EXPORT_SYMBOL_GPL to EXPORT_SYMBOL
- Made FSL_GUTS user-invisible
- Added a complete compatible list for GUTS
- Stored guts info in file-scope variable
- Added mfspr() getting SVR
- Redefined GUTS APIs
- Called fsl_guts_init rather than using platform driver
- Removed useless parentheses
- Removed useless 'extern' keywords
Changes for v6:
- Made guts thread safe in fsl_guts_init
Changes for v7:
- Removed 'ifdef' for function declaration in guts.h
Changes for v8:
- Fixed the checkpatch issue of lines longer than 80 characters
- Added 'Acked-by: Scott Wood'
Changes for v9:
- None
Changes for v10:
- None
Changes for v11:
- Changed to platform driver
Changes for v12:
- Removed "signed-off-by: Scott"
- Defined fsl_soc_die_attr struct array instead of
  soc_device_attribute
- Re-designed soc_device_attribute for QorIQ SoC
- Other minor fixes
Changes for v13:
- Rebased
- Removed text after 'bool' in Kconfig
- Removed ARCH ifdefs
- Added more bits for ls1021a mask
- Used devm
---
 drivers/soc/Kconfig  |   3 +-
 drivers/soc/fsl/Kconfig  |  18 
 drivers/soc/fsl/Makefile |   1 +
 drivers/soc/fsl/guts.c   | 236 +++
 include/linux/fsl/guts.h | 125 +++--
 5 files changed, 333 insertions(+), 50 deletions(-)
 create mode 100644 drivers/soc/fsl/Kconfig
 create mode 100644 drivers/soc/fsl/guts.c

diff --git a/drivers/soc/Kconfig b/drivers/soc/Kconfig
index e6e90e8..f31bceb 100644
--- a/drivers/soc/Kconfig
+++ b/drivers/soc/Kconfig
@@ -1,8 +1,7 @@
 menu "SOC (System On Chip) specific Drivers"
 
 source "drivers/soc/bcm/Kconfig"
-source "drivers/soc/fsl/qbman/Kconfig"
-source "drivers/soc/fsl/qe/Kconfig"
+source "drivers/soc/fsl/Kconfig"
 source "drivers/soc/mediatek/Kconfig"
 source "drivers/soc/qcom/Kconfig"
 source "drivers/soc/rockchip/Kconfig"
diff --git a/drivers/soc/fsl/Kconfig b/drivers/soc/fsl/Kconfig
new file mode 100644
index 000..7a9fb9b
--- /dev/null
+++ b/drivers/soc/fsl/Kconfig
@@ -0,0 +1,18 @@
+#
+# Freescale SOC drivers
+#
+
+source "drivers/soc/fsl/qbman/Kconfig"
+source "drivers/soc/fsl/qe/Kconfig"
+
+config FSL_GUTS
+   bool
+   select SOC_BUS
+   help
+ The global utilities block controls power management, I/O device
+ enabling, power-onreset(POR) configuration monitoring, alternate
+ function selection for multiplexed signals,and clock control.
+ This driver is to manage and access global utilities block.
+ Initially only reading SVR and registering soc device are supported.
+ Other guts accesses, such as reading RCW, should eventually be moved
+ into this driver as well.
diff --git a/drivers/soc/fsl/Makefile b/drivers/soc/fsl/Makefile
index 75e1f53..44b3beb 100644
--- a/drivers/soc/fsl/Makefile
+++ b/drivers/soc/fsl/Makefile
@@ -5,3 +5,4 @@
 obj-$(CONFIG_FSL_DPAA) += qbman/
 obj-$(CONFIG_QUICC_ENGINE) += qe/
 obj-$(CONFIG_CPM)  += qe/
+obj-$(CONFIG_FSL_GUTS) += guts.o
diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c
new file mode 100644
index 000..1f356ed
--- /dev/null
+++ b/drivers/soc/fsl/guts.c
@@ -0,0 +1,236 @@
+/*
+ * Freescale QorIQ Platforms GUTS Driver
+ *
+ * Copyright (C) 2016 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct guts {
+   struct ccsr_guts __iomem *regs;
+   bool little_endian;
+};
+
+struct fsl_soc_die_attr {
+   char*die;
+   u32 svr;
+   u32 mask;
+};
+
+static struct guts *guts;
+static struct soc_device_attribute soc_dev_attr;
+static struct soc_device *soc_dev;
+
+
+/* SoC die attribute definition for QorIQ platform */
+static const struct fsl_soc_die_attr fsl_soc_die[] = {
+   /*
+* Power Architecture-based SoCs T Series
+*/
+
+   /* Die:

Re: [PATCH 5/7] ARM: dts: am335x: Add the charger interrupt

2016-10-27 Thread Milo Kim


On 10/22/2016 05:47 AM, Robert Nelson wrote:

+#include 

^ this hasn't been posted nor pushed to mainline yet.. ;)



Oops! I've created this file but not captured not only in my git tree 
but also in my head! Thanks for your review.


Best regards,
Milo

[v13, 0/8] Fix eSDHC host version register bug

2016-10-27 Thread Yangbo Lu

This patchset is used to fix a host version register bug in the T4240-R1.0-R2.0
eSDHC controller. To match the SoC version and revision, 10 previous version
patchsets had tried many methods but all of them were rejected by reviewers.
Such as
- dts compatible method
- syscon method
- ifdef PPC method
- GUTS driver getting SVR method
Anrd suggested a soc_device_match method in v10, and this is the only available
method left now. This v11 patchset introduces the soc_device_match interface in
soc driver.

The first six patches of Yangbo are to add the GUTS driver. This is used to
register a soc device which contain soc version and revision information.
The other two patches introduce the soc_device_match method in soc driver
and apply it on esdhc driver to fix this bug.

Arnd Bergmann (1):
  base: soc: introduce soc_device_match() interface

Yangbo Lu (7):
  dt: bindings: update Freescale DCFG compatible
  ARM64: dts: ls2080a: add device configuration node
  dt: bindings: move guts devicetree doc out of powerpc directory
  powerpc/fsl: move mpc85xx.h to include/linux/fsl
  soc: fsl: add GUTS driver for QorIQ platforms
  MAINTAINERS: add entry for Freescale SoC drivers
  mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

 Documentation/devicetree/bindings/arm/fsl.txt  |   6 +-
 .../bindings/{powerpc => soc}/fsl/guts.txt |   3 +
 MAINTAINERS|  11 +-
 arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi |   6 +
 arch/powerpc/kernel/cpu_setup_fsl_booke.S  |   2 +-
 arch/powerpc/sysdev/fsl_pci.c  |   2 +-
 drivers/base/Kconfig   |   1 +
 drivers/base/soc.c |  66 ++
 drivers/clk/clk-qoriq.c|   3 +-
 drivers/i2c/busses/i2c-mpc.c   |   2 +-
 drivers/iommu/fsl_pamu.c   |   3 +-
 drivers/mmc/host/Kconfig   |   1 +
 drivers/mmc/host/sdhci-of-esdhc.c  |  20 ++
 drivers/net/ethernet/freescale/gianfar.c   |   2 +-
 drivers/soc/Kconfig|   3 +-
 drivers/soc/fsl/Kconfig|  18 ++
 drivers/soc/fsl/Makefile   |   1 +
 drivers/soc/fsl/guts.c | 236 +
 include/linux/fsl/guts.h   | 125 ++-
 .../asm/mpc85xx.h => include/linux/fsl/svr.h   |   4 +-
 include/linux/sys_soc.h|   3 +
 21 files changed, 456 insertions(+), 62 deletions(-)
 rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%)
 create mode 100644 drivers/soc/fsl/Kconfig
 create mode 100644 drivers/soc/fsl/guts.c
 rename arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h (97%)

-- 
2.1.0.27.g96db324

[v13, 5/8] soc: fsl: add GUTS driver for QorIQ platforms

2016-10-27 Thread Yangbo Lu

The global utilities block controls power management, I/O device
enabling, power-onreset(POR) configuration monitoring, alternate
function selection for multiplexed signals,and clock control.

This patch adds a driver to manage and access global utilities block.
Initially only reading SVR and registering soc device are supported.
Other guts accesses, such as reading RCW, should eventually be moved
into this driver as well.

Signed-off-by: Yangbo Lu 
---
Changes for v4:
- Added this patch
Changes for v5:
- Modified copyright info
- Changed MODULE_LICENSE to GPL
- Changed EXPORT_SYMBOL_GPL to EXPORT_SYMBOL
- Made FSL_GUTS user-invisible
- Added a complete compatible list for GUTS
- Stored guts info in file-scope variable
- Added mfspr() getting SVR
- Redefined GUTS APIs
- Called fsl_guts_init rather than using platform driver
- Removed useless parentheses
- Removed useless 'extern' key words
Changes for v6:
- Made guts thread safe in fsl_guts_init
Changes for v7:
- Removed 'ifdef' for function declaration in guts.h
Changes for v8:
- Fixes lines longer than 80 characters checkpatch issue
- Added 'Acked-by: Scott Wood'
Changes for v9:
- None
Changes for v10:
- None
Changes for v11:
- Changed to platform driver
Changes for v12:
- Removed "signed-off-by: Scott"
- Defined fsl_soc_die_attr struct array instead of
  soc_device_attribute
- Re-designed soc_device_attribute for QorIQ SoC
- Other minor fixes
Changes for v13:
- Rebased
- Removed text after 'bool' in Kconfig
- Removed ARCH ifdefs
- Added more bits for ls1021a mask
- Used devm
---
 drivers/soc/Kconfig  |   3 +-
 drivers/soc/fsl/Kconfig  |  18 
 drivers/soc/fsl/Makefile |   1 +
 drivers/soc/fsl/guts.c   | 236 +++
 include/linux/fsl/guts.h | 125 +++--
 5 files changed, 333 insertions(+), 50 deletions(-)
 create mode 100644 drivers/soc/fsl/Kconfig
 create mode 100644 drivers/soc/fsl/guts.c

diff --git a/drivers/soc/Kconfig b/drivers/soc/Kconfig
index e6e90e8..f31bceb 100644
--- a/drivers/soc/Kconfig
+++ b/drivers/soc/Kconfig
@@ -1,8 +1,7 @@
 menu "SOC (System On Chip) specific Drivers"
 
 source "drivers/soc/bcm/Kconfig"
-source "drivers/soc/fsl/qbman/Kconfig"
-source "drivers/soc/fsl/qe/Kconfig"
+source "drivers/soc/fsl/Kconfig"
 source "drivers/soc/mediatek/Kconfig"
 source "drivers/soc/qcom/Kconfig"
 source "drivers/soc/rockchip/Kconfig"
diff --git a/drivers/soc/fsl/Kconfig b/drivers/soc/fsl/Kconfig
new file mode 100644
index 000..7a9fb9b
--- /dev/null
+++ b/drivers/soc/fsl/Kconfig
@@ -0,0 +1,18 @@
+#
+# Freescale SOC drivers
+#
+
+source "drivers/soc/fsl/qbman/Kconfig"
+source "drivers/soc/fsl/qe/Kconfig"
+
+config FSL_GUTS
+   bool
+   select SOC_BUS
+   help
+ The global utilities block controls power management, I/O device
+ enabling, power-onreset(POR) configuration monitoring, alternate
+ function selection for multiplexed signals,and clock control.
+ This driver is to manage and access global utilities block.
+ Initially only reading SVR and registering soc device are supported.
+ Other guts accesses, such as reading RCW, should eventually be moved
+ into this driver as well.
diff --git a/drivers/soc/fsl/Makefile b/drivers/soc/fsl/Makefile
index 75e1f53..44b3beb 100644
--- a/drivers/soc/fsl/Makefile
+++ b/drivers/soc/fsl/Makefile
@@ -5,3 +5,4 @@
 obj-$(CONFIG_FSL_DPAA) += qbman/
 obj-$(CONFIG_QUICC_ENGINE) += qe/
 obj-$(CONFIG_CPM)  += qe/
+obj-$(CONFIG_FSL_GUTS) += guts.o
diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c
new file mode 100644
index 000..1f356ed
--- /dev/null
+++ b/drivers/soc/fsl/guts.c
@@ -0,0 +1,236 @@
+/*
+ * Freescale QorIQ Platforms GUTS Driver
+ *
+ * Copyright (C) 2016 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct guts {
+   struct ccsr_guts __iomem *regs;
+   bool little_endian;
+};
+
+struct fsl_soc_die_attr {
+   char*die;
+   u32 svr;
+   u32 mask;
+};
+
+static struct guts *guts;
+static struct soc_device_attribute soc_dev_attr;
+static struct soc_device *soc_dev;
+
+
+/* SoC die attribute definition for QorIQ platform */
+static const struct fsl_soc_die_attr fsl_soc_die[] = {
+   /*
+* Power Architecture-based SoCs T Series
+*/
+
+   /* Die: T4240, SoC:

[v13, 2/8] ARM64: dts: ls2080a: add device configuration node

2016-10-27 Thread Yangbo Lu

Add the dts node for device configuration unit that provides
general purpose configuration and status for the device.
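
The node added below carries a little-endian property. As a hedged,
user-space sketch of why such a flag matters, the helper here assembles a
32-bit register value from raw bytes in either byte order. reg_read32() is a
hypothetical name; a real driver would use the kernel's MMIO accessors
instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Interpret four raw bytes as a 32-bit register value.
 * little_endian selects the byte order of the register block.
 */
static uint32_t reg_read32(const uint8_t *reg, bool little_endian)
{
	assert(reg != NULL);
	if (little_endian)
		return (uint32_t)reg[0] |
		       ((uint32_t)reg[1] << 8) |
		       ((uint32_t)reg[2] << 16) |
		       ((uint32_t)reg[3] << 24);
	return (uint32_t)reg[3] |
	       ((uint32_t)reg[2] << 8) |
	       ((uint32_t)reg[1] << 16) |
	       ((uint32_t)reg[0] << 24);
}
```

Reading the same four bytes with the wrong byte order yields a different
value, so the flag has to come from the hardware description rather than be
guessed by the driver.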

Signed-off-by: Yangbo Lu 
Acked-by: Scott Wood 
---
Changes for v5:
- Added this patch
Changes for v6:
- None
Changes for v7:
- None
Changes for v8:
- Added 'Acked-by: Scott Wood'
Changes for v9:
- None
Changes for v10:
- None
Changes for v11:
- None
Changes for v12:
- None
Changes for v13:
- None
---
 arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi b/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi
index 337da90..c03b099 100644
--- a/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi
+++ b/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi
@@ -215,6 +215,12 @@
clocks = <>;
};
 
+   dcfg: dcfg@1e0 {
+   compatible = "fsl,ls2080a-dcfg", "syscon";
+   reg = <0x0 0x1e0 0x0 0x1>;
+   little-endian;
+   };
+
serial0: serial@21c0500 {
compatible = "fsl,ns16550", "ns16550a";
reg = <0x0 0x21c0500 0x0 0x100>;
-- 
2.1.0.27.g96db324

Re: drivers/base/power/opp/of.c:181:6: error: redefinition of 'dev_pm_opp_of_remove_table'

2016-10-27 Thread Viresh Kumar

On 28-10-16, 07:22, kbuild test robot wrote:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> head:   e3300ffef0653774f1099cab153d25d24bd773ce
> commit: f47b72a15a9679dd4dc1af681d4d2f1ca2815552 PM / OPP: Move CONFIG_OF dependent code in a separate file
> date:   6 months ago

Why are we picking it up now?

-- 
viresh

[v13, 6/8] MAINTAINERS: add entry for Freescale SoC drivers

2016-10-27 Thread Yangbo Lu

Add maintainer entry for Freescale SoC drivers including
the QE library and the GUTS driver now. Also add maintainer
for QE library.

Signed-off-by: Yangbo Lu 
Acked-by: Scott Wood 
Acked-by: Qiang Zhao 
---
Changes for v8:
- Added this patch
Changes for v9:
- Added linux-arm mail list
- Removed GUTS driver entry
Changes for v10:
- Changed 'DRIVER' to 'DRIVERS'
- Added 'Acked-by' of Scott and Qiang
Changes for v11:
- None
Changes for v12:
- None
Changes for v13:
- None
---
 MAINTAINERS | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index c72fa18..cf3aaee 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5037,9 +5037,18 @@ S:   Maintained
 F: drivers/net/ethernet/freescale/fman
 F: Documentation/devicetree/bindings/powerpc/fsl/fman.txt
 
+FREESCALE SOC DRIVERS
+M: Scott Wood 
+L: linuxppc-...@lists.ozlabs.org
+L: linux-arm-ker...@lists.infradead.org
+S: Maintained
+F: drivers/soc/fsl/
+F: include/linux/fsl/
+
 FREESCALE QUICC ENGINE LIBRARY
+M: Qiang Zhao 
 L: linuxppc-...@lists.ozlabs.org
-S: Orphan
+S: Maintained
 F: drivers/soc/fsl/qe/
 F: include/soc/fsl/*qe*.h
 F: include/soc/fsl/*ucc*.h
-- 
2.1.0.27.g96db324

linux-next: Tree for Oct 28

2016-10-27 Thread Stephen Rothwell

Hi all,

There will probably be no linux-next releases next week while I attend
the Kernel Summit.

Changes since 20161027:

The akpm-current tree lost its build failures.

Non-merge commits (relative to Linus' tree): 3098
 3842 files changed, 227213 insertions(+), 59787 deletions(-)



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc and an allmodconfig (with
CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a
native build of tools/perf. After the final fixups (if any), I do an
x86_64 modules_install followed by builds for x86_64 allnoconfig,
powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig
(this fails its final link) and pseries_le_defconfig and i386, sparc
and sparc64 defconfig.

Below is a summary of the state of the merge.

I am currently merging 245 trees (counting Linus' and 35 trees of patches
pending for Linus' tree).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (e3300ffef065 Merge tag 'for-linus-4.9-rc2-ofs-1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux)
Merging fixes/master (30066ce675d3 Merge branch 'linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6)
Merging kbuild-current/rc-fixes (989cea5c14be kbuild: prevent lib-ksyms.o 
rebuilds)
Merging arc-current/for-curr (e2192b253de8 ARC: module: print pretty section 
names)
Merging arm-current/fixes (6127d124ee4e ARM: wire up new pkey syscalls)
Merging m68k-current/for-linus (6736e65effc3 m68k: Migrate exception table 
users off module.h and onto extable.h)
Merging metag-fixes/fixes (35d04077ad96 metag: Only define 
atomic_dec_if_positive conditionally)
Merging powerpc-fixes/fixes (fb479e44a9e2 powerpc/64s: relocation, register 
save fixes for system reset interrupt)
Merging sparc/master (a74ad5e660a9 sparc64: Handle extremely large kernel TLB 
range flushes more gracefully.)
Merging net/master (9ee7837449b3 net sched filters: fix notification of filter 
delete with proper handle)
CONFLICT (content): Merge conflict in drivers/net/ethernet/qlogic/Kconfig
Applying: qed*: merge fix for CONFIG_INFINIBAND_QEDR Kconfig move
Merging ipsec/master (7f92083eb58f vti6: flush x-netns xfrm cache when vti 
interface is removed)
Merging netfilter/master (7034b566a4e7 netfilter: fix nf_queue handling)
Merging ipvs/master (ea43f860d984 Merge branch 'ethoc-fixes')
Merging wireless-drivers/master (d3532ea6ce4e brcmfmac: avoid 
maybe-uninitialized warning in brcmf_cfg80211_start_ap)
Merging mac80211/master (b4f7f4ad425a mac80211: fix some sphinx warnings)
Merging sound-current/for-linus (bdc3478f90cd ALSA: usb-audio: Add quirk for 
Syntek STK1160)
Merging pci-current/for-linus (349d941e1ff1 PCI: qcom: Fix pp->dev usage before 
assignment)
Merging driver-core.current/driver-core-linus (248ff0216543 driver core: Make 
Kconfig text for DEBUG_TEST_DRIVER_REMOVE stronger)
Merging tty.current/tty-linus (009e39ae44f4 vt: clear selection before resizing)
Merging usb.current/usb-linus (c1aa67729a1d Merge tag 'usb-ci-v4.9-rc2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/peter.chen/usb into usb-linus)
Merging usb-gadget-fixes/fixes (a1aa8cf6471b Revert "Documentation: devicetree: 
dwc2: Deprecate g-tx-fifo-size")
Merging usb-serial-fixes/usb-linus (07d9a380680d Linux 4.9-rc2)
Merging usb-chipidea-fixes/ci-for-usb-stable (991d5add50a5 usb: chipidea: host: 
fix NULL ptr dereference during shutdown)
Merging phy/fixes (1001354ca341 Linux 4.9-rc1)
Merging staging.current/staging-linus (e866dd8aab76 greybus: fix a leak on 
error in gb_module_create())
Merging char-misc.current/char-misc-linus (cfcc1456e4a2 Merge tag 
'extcon-fixes-for-4.9-rc3' of 
git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/extcon into 
char-misc-linus)
Merging input-current/for-linus (324ae0958cab Input: psmouse - cleanup 
Focaltech code)
Merging crypto-current/master (6d4

Re: [RFC][PATCHv4 0/6] printk: use printk_safe to handle printk() recursive calls

2016-10-27 Thread Linus Torvalds

On Thu, Oct 27, 2016 at 8:49 AM, Sergey Senozhatsky
 wrote:
>
> RFC
>
> This patch set extends a lock-less NMI per-cpu buffers idea to
> handle recursive printk() calls. The basic mechanism is pretty much the
> same -- at the beginning of a deadlock-prone section we switch to lock-less
> printk callback, and return back to a default printk implementation at the
> end; the messages are getting flushed to a logbuf buffer from a safer
> context.

This looks very reasonable to me.

Does this also obviate the need for "printk_deferred()" that the
scheduler and the clock code uses?  Because that would be a lovely
thing to look at if it doesn't..

 Linus
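
The mechanism quoted above (switch to a lock-less callback on entering a
deadlock-prone section, flush to the main log later) can be modelled very
roughly in user space as follows. This is a single-threaded sketch, not the
patch set's code: all names are hypothetical, one buffer stands in for the
per-CPU buffers, and flushing on leaving the outermost section is a
simplification of "flushed from a safer context".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 256

static char main_log[BUF_SIZE];	/* stand-in for the logbuf */
static char safe_buf[BUF_SIZE];	/* stand-in for one per-CPU buffer */
static int safe_depth;		/* > 0 inside a deadlock-prone section */

/* Append src to dst without overflowing; excess text is dropped. */
static void append(char *dst, const char *src)
{
	size_t used = strlen(dst);

	snprintf(dst + used, BUF_SIZE - used, "%s", src);
}

/* Enter a section where the normal logging path must not be used. */
static void safe_enter(void)
{
	safe_depth++;
}

/* Move buffered messages into the main log and empty the buffer. */
static void flush_safe_buf(void)
{
	append(main_log, safe_buf);
	safe_buf[0] = '\0';
}

/* Leave the section; flush once the outermost section is left. */
static void safe_exit(void)
{
	assert(safe_depth > 0);
	if (--safe_depth == 0)
		flush_safe_buf();
}

/* Route a message to the buffer or the main log based on context. */
static void log_msg(const char *msg)
{
	if (safe_depth > 0)
		append(safe_buf, msg);
	else
		append(main_log, msg);
}
```

The depth counter lets sections nest: messages logged at any nesting level
stay in the side buffer until the outermost section ends.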
