[PATCH -v4 RESEND 5/9] mm, THP, swap: Support to clear SWAP_HAS_CACHE for huge page
From: Huang Ying

__swapcache_free() is added to support clearing the SWAP_HAS_CACHE flag
for a huge page. For now this frees the whole specified swap cluster,
because the function is called only in the error path, to free a swap
cluster that was just allocated. So the corresponding
swap_map[i] == SWAP_HAS_CACHE, that is, the swap count is 0. This makes
the implementation simpler than that for an ordinary swap entry.

This will be used for delaying the splitting of a THP (Transparent Huge
Page) during swap-out: to swap out one THP, we will allocate a swap
cluster, add the THP into the swap cache, then split the THP. If anything
fails after allocating the swap cluster and before splitting the THP
successfully, swapcache_free_trans_huge() will be used to free the
allocated swap space.

Cc: Andrea Arcangeli
Cc: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Shaohua Li
Cc: Minchan Kim
Cc: Rik van Riel
Signed-off-by: "Huang, Ying"
---
 include/linux/swap.h |  9 +++++++--
 mm/swapfile.c        | 33 +++++++++++++++++++++++++++++++--
 2 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index cb8c1b0..b185e39 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -408,7 +408,7 @@ extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
-extern void swapcache_free(swp_entry_t);
+extern void __swapcache_free(swp_entry_t, bool);
 extern int free_swap_and_cache(swp_entry_t);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
@@ -480,7 +480,7 @@ static inline void swap_free(swp_entry_t swp)
 {
 }
 
-static inline void swapcache_free(swp_entry_t swp)
+static inline void __swapcache_free(swp_entry_t swp, bool huge)
 {
 }
 
@@ -551,6 +551,11 @@ static inline swp_entry_t get_huge_swap_page(void)
 
 #endif /* CONFIG_SWAP */
 
+static inline void swapcache_free(swp_entry_t entry)
+{
+	__swapcache_free(entry, false);
+}
+
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8224150..126c789 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -732,6 +732,27 @@ static void swap_free_huge_cluster(struct swap_info_struct *si,
 	__swap_entry_free(si, offset, true);
 }
 
+/*
+ * Caller should hold si->lock.
+ */
+static void swapcache_free_trans_huge(struct swap_info_struct *si,
+				      swp_entry_t entry)
+{
+	unsigned long offset = swp_offset(entry);
+	unsigned long idx = offset / SWAPFILE_CLUSTER;
+	unsigned char *map;
+	unsigned int i;
+
+	map = si->swap_map + offset;
+	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+		VM_BUG_ON(map[i] != SWAP_HAS_CACHE);
+		map[i] &= ~SWAP_HAS_CACHE;
+	}
+	/* Cluster size is same as huge page size */
+	mem_cgroup_uncharge_swap(entry, HPAGE_PMD_NR);
+	swap_free_huge_cluster(si, idx);
+}
+
 static unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
 {
 	unsigned long idx;
@@ -758,6 +779,11 @@ static inline unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
 {
 	return 0;
 }
+
+static inline void swapcache_free_trans_huge(struct swap_info_struct *si,
+					     swp_entry_t entry)
+{
+}
 #endif
 
 swp_entry_t __get_swap_page(bool huge)
@@ -949,13 +975,16 @@ void swap_free(swp_entry_t entry)
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
-void swapcache_free(swp_entry_t entry)
+void __swapcache_free(swp_entry_t entry, bool huge)
 {
 	struct swap_info_struct *p;
 
 	p = swap_info_get(entry);
 	if (p) {
-		swap_entry_free(p, entry, SWAP_HAS_CACHE);
+		if (unlikely(huge))
+			swapcache_free_trans_huge(p, entry);
+		else
+			swap_entry_free(p, entry, SWAP_HAS_CACHE);
 		spin_unlock(&p->lock);
 	}
 }
-- 
2.9.3
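For readers who want to try the idea outside the kernel, the cluster-wide flag clearing described in the 5/9 commit message can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: `struct toy_swap_info` and `toy_free_huge_cluster()` are hypothetical names, the constants are demo values rather than the kernel's, and the kernel's `VM_BUG_ON()` is replaced by a plain check that refuses to touch the cluster.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only; the kernel defines its own constants. */
#define SWAP_HAS_CACHE   0x40u
#define SWAPFILE_CLUSTER 8u	/* kept tiny for the demo */

/* Hypothetical stand-in for struct swap_info_struct. */
struct toy_swap_info {
	unsigned char *swap_map;	/* one byte of state per swap slot */
	size_t nr_slots;
};

/*
 * Clear SWAP_HAS_CACHE on every slot of the cluster starting at 'offset'.
 * The error-path assumption from the commit message is checked first:
 * every slot must be exactly SWAP_HAS_CACHE (swap count 0). If that does
 * not hold, nothing is modified and false is returned.
 */
static bool toy_free_huge_cluster(struct toy_swap_info *si, size_t offset)
{
	unsigned char *map;
	size_t i;

	if (offset % SWAPFILE_CLUSTER != 0 ||
	    offset + SWAPFILE_CLUSTER > si->nr_slots)
		return false;

	map = si->swap_map + offset;
	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
		if (map[i] != SWAP_HAS_CACHE)
			return false;
	}
	for (i = 0; i < SWAPFILE_CLUSTER; i++)
		map[i] &= (unsigned char)~SWAP_HAS_CACHE;

	return true;
}
```

Because every slot is known to hold only the cache flag, no per-slot reference counting is needed, which is the simplification the commit message points out.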
[v14, 0/8] Fix eSDHC host version register bug
This patchset fixes a host version register bug in the T4240-R1.0-R2.0
eSDHC controller. To match the SoC version and revision, the 10 previous
versions of the patchset tried several methods, all of which were
rejected by reviewers, such as:
- dts compatible method
- syscon method
- ifdef PPC method
- GUTS driver getting SVR method

Arnd suggested a soc_device_match method in v10, and this is the only
available method left now. This v11 patchset introduces the
soc_device_match interface in soc driver.

The first six patches of Yangbo are to add the GUTS driver. This is used
to register a soc device which contains soc version and revision
information. The other two patches introduce the soc_device_match method
in soc driver and apply it on esdhc driver to fix this bug.

Arnd Bergmann (1):
  base: soc: introduce soc_device_match() interface

Yangbo Lu (7):
  dt: bindings: update Freescale DCFG compatible
  ARM64: dts: ls2080a: add device configuration node
  dt: bindings: move guts devicetree doc out of powerpc directory
  powerpc/fsl: move mpc85xx.h to include/linux/fsl
  soc: fsl: add GUTS driver for QorIQ platforms
  MAINTAINERS: add entry for Freescale SoC drivers
  mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

 Documentation/devicetree/bindings/arm/fsl.txt      |   6 +-
 .../bindings/{powerpc => soc}/fsl/guts.txt         |   3 +
 MAINTAINERS                                        |  11 +-
 arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi     |   6 +
 arch/powerpc/kernel/cpu_setup_fsl_booke.S          |   2 +-
 arch/powerpc/sysdev/fsl_pci.c                      |   2 +-
 drivers/base/Kconfig                               |   1 +
 drivers/base/soc.c                                 |  66 ++
 drivers/clk/clk-qoriq.c                            |   3 +-
 drivers/i2c/busses/i2c-mpc.c                       |   2 +-
 drivers/iommu/fsl_pamu.c                           |   3 +-
 drivers/mmc/host/Kconfig                           |   1 +
 drivers/mmc/host/sdhci-of-esdhc.c                  |  20 ++
 drivers/net/ethernet/freescale/gianfar.c           |   2 +-
 drivers/soc/Kconfig                                |   3 +-
 drivers/soc/fsl/Kconfig                            |  18 ++
 drivers/soc/fsl/Makefile                           |   1 +
 drivers/soc/fsl/guts.c                             | 238 +
 include/linux/fsl/guts.h                           | 125 ++-
 .../asm/mpc85xx.h => include/linux/fsl/svr.h       |   4 +-
 include/linux/sys_soc.h                            |   3 +
 21 files changed, 458 insertions(+), 62 deletions(-)
 rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%)
 create mode 100644 drivers/soc/fsl/Kconfig
 create mode 100644 drivers/soc/fsl/guts.c
 rename arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h (97%)
-- 
2.1.0.27.g96db324
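The general shape of matching a running SoC against a table of affected versions and revisions, as the cover letter describes, might look something like the user-space sketch below. It is not the interface added by this series: `struct toy_soc_attr`, `toy_field_matches()` and `toy_soc_match()` are hypothetical names, the field layout is an assumption, and exact string comparison is used here purely for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute record; the field names are assumptions. */
struct toy_soc_attr {
	const char *soc_id;	/* NULL acts as a wildcard in a table entry */
	const char *revision;	/* NULL acts as a wildcard in a table entry */
};

/* A table field matches if it is a wildcard or equals the running value. */
static int toy_field_matches(const char *want, const char *have)
{
	return want == NULL || (have != NULL && strcmp(want, have) == 0);
}

/*
 * Walk a table terminated by an all-NULL sentinel entry and return the
 * first entry that matches the running SoC, or NULL if none does.
 */
static const struct toy_soc_attr *
toy_soc_match(const struct toy_soc_attr *table,
	      const struct toy_soc_attr *running)
{
	for (; table->soc_id != NULL || table->revision != NULL; table++) {
		if (toy_field_matches(table->soc_id, running->soc_id) &&
		    toy_field_matches(table->revision, running->revision))
			return table;
	}
	return NULL;
}
```

A driver-side quirk table would then list only the affected parts (here, T4240 at revisions 1.0 and 2.0) and apply the workaround when the lookup returns a non-NULL entry.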
[PATCH -v4 RESEND 3/9] mm, THP, swap: Add swap cluster allocate/free functions
From: Huang Ying

The swap cluster allocation/free functions are added based on the
existing swap cluster management mechanism for SSD. These functions don't
work for rotating hard disks because the existing swap cluster management
mechanism doesn't work for them. Hard disk support may be added if
someone really needs it, but that needn't be included in this patchset.

This will be used for the THP (Transparent Huge Page) swap support, where
one swap cluster will hold the contents of each THP swapped out.

Cc: Andrea Arcangeli
Cc: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Shaohua Li
Cc: Minchan Kim
Cc: Rik van Riel
Signed-off-by: "Huang, Ying"
---
 mm/swapfile.c | 203 +-
 1 file changed, 146 insertions(+), 57 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index f3fc83f..3643049 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -326,6 +326,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	schedule_work(&si->discard_work);
 }
 
+static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
+	cluster_list_add_tail(&si->free_clusters, ci, idx);
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
  * will be added to free cluster list. caller should hold si->lock.
@@ -345,8 +353,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 				SWAPFILE_CLUSTER);
 
 		spin_lock(&si->lock);
-		cluster_set_flag(&info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&si->free_clusters, info, idx);
+		__free_cluster(si, idx);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 	}
@@ -363,6 +370,34 @@ static void swap_discard_work(struct work_struct *work)
 	spin_unlock(&si->lock);
 }
 
+static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info;
+
+	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
+	cluster_list_del_first(&si->free_clusters, ci);
+	cluster_set_count_flag(ci + idx, 0, 0);
+}
+
+static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+{
+	struct swap_cluster_info *ci = si->cluster_info + idx;
+
+	VM_BUG_ON(cluster_count(ci) != 0);
+	/*
+	 * If the swap is discardable, prepare discard the cluster
+	 * instead of free it immediately. The cluster will be freed
+	 * after discard.
+	 */
+	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
+	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
+		swap_cluster_schedule_discard(si, idx);
+		return;
+	}
+
+	__free_cluster(si, idx);
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
  * removed from free cluster list and its usage counter will be increased.
@@ -374,11 +409,8 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 	if (!cluster_info)
 		return;
-	if (cluster_is_free(&cluster_info[idx])) {
-		VM_BUG_ON(cluster_list_first(&p->free_clusters) != idx);
-		cluster_list_del_first(&p->free_clusters, cluster_info);
-		cluster_set_count_flag(&cluster_info[idx], 0, 0);
-	}
+	if (cluster_is_free(&cluster_info[idx]))
+		alloc_cluster(p, idx);
 
 	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
@@ -402,21 +434,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 	cluster_set_count(&cluster_info[idx],
 		cluster_count(&cluster_info[idx]) - 1);
 
-	if (cluster_count(&cluster_info[idx]) == 0) {
-		/*
-		 * If the swap is discardable, prepare discard the cluster
-		 * instead of free it immediately. The cluster will be freed
-		 * after discard.
-		 */
-		if ((p->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
-				 (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
-			swap_cluster_schedule_discard(p, idx);
-			return;
-		}
-
-		cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-		cluster_list_add_tail(&p->free_clusters, cluster_info, idx);
-	}
+	if (cluster_count(&cluster_info[idx]) == 0)
+		free_cluster(p, idx);
 }
 
 /*
@@ -497,6 +516,69 @@ static void scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	*scan_base = tmp;
 }
 
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static inline unsigned int huge_cluster_nr_entries(bool huge)
+{
+	return huge ? SWAPFILE_CLUSTER : 1;
+}
+#else
+#define huge_cluster_nr_entries(huge)	1
+#endif
+
+static void __swap_entry_alloc(struct
[PATCH -v4 RESEND 8/9] mm, THP, swap: Support to split THP in swap cache
From: Huang Ying

This patch enhances split_huge_page_to_list() to work properly for a THP
(Transparent Huge Page) in the swap cache during swap-out.

This is used for delaying the splitting of the THP during swap-out: for a
THP to be swapped out, we will allocate a swap cluster, add the THP into
the swap cache, then split the THP. The page lock will be held during
this process. So in any code path other than swap-out, if the THP needs
to be split, PageSwapCache(THP) will always be false.

Cc: Andrea Arcangeli
Cc: Kirill A. Shutemov
Cc: Ebru Akagunduz
Signed-off-by: "Huang, Ying"
---
 mm/huge_memory.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 326b145..199eaba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1831,7 +1831,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * atomic_set() here would be safe on all archs (and not only on x86),
 	 * it's safer to use atomic_inc()/atomic_add().
 	 */
-	if (PageAnon(head)) {
+	if (PageAnon(head) && !PageSwapCache(head)) {
 		page_ref_inc(page_tail);
 	} else {
 		/* Additional pin to radix tree */
@@ -1842,6 +1842,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	page_tail->flags |= (head->flags &
 			((1L << PG_referenced) |
 			 (1L << PG_swapbacked) |
+			 (1L << PG_swapcache) |
 			 (1L << PG_mlocked) |
 			 (1L << PG_uptodate) |
 			 (1L << PG_active) |
@@ -1904,7 +1905,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	ClearPageCompound(head);
 	/* See comment in __split_huge_page_tail() */
 	if (PageAnon(head)) {
-		page_ref_inc(head);
+		/* Additional pin to radix tree of swap cache */
+		if (PageSwapCache(head))
+			page_ref_add(head, 2);
+		else
+			page_ref_inc(head);
 	} else {
 		/* Additional pin to radix tree */
 		page_ref_add(head, 2);
@@ -2016,10 +2021,12 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 /* Racy check whether the huge page can be split */
 bool can_split_huge_page(struct page *page)
 {
-	int extra_pins = 0;
+	int extra_pins;
 
 	/* Additional pins from radix tree */
-	if (!PageAnon(page))
+	if (PageAnon(page))
+		extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
+	else
 		extra_pins = HPAGE_PMD_NR;
 	return total_mapcount(page) == page_count(page) - extra_pins - 1;
 }
@@ -2072,7 +2079,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			ret = -EBUSY;
 			goto out;
 		}
-		extra_pins = 0;
+		extra_pins = PageSwapCache(head) ? HPAGE_PMD_NR : 0;
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
-- 
2.9.3
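The reference-count arithmetic behind the updated can_split_huge_page() check can be tried out in user space with a small model such as the one below. It is a sketch under assumptions, not kernel code: `struct toy_page` and its fields are hypothetical, `HPAGE_PMD_NR` is set to an illustrative 512, and the trailing `- 1` is read here as the reference held by the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative value: a 2 MiB huge page made of 4 KiB base pages. */
#define HPAGE_PMD_NR 512

/* Hypothetical snapshot of the state the check looks at. */
struct toy_page {
	bool anon;		/* stands in for PageAnon()       */
	bool swapcache;		/* stands in for PageSwapCache()  */
	int page_count;		/* stands in for page_count()     */
	int total_mapcount;	/* stands in for total_mapcount() */
};

/*
 * Pins held by the radix tree: none for a plain anonymous THP, one per
 * sub-page once the THP sits in the swap cache or is file-backed.
 */
static int toy_extra_pins(const struct toy_page *p)
{
	if (p->anon)
		return p->swapcache ? HPAGE_PMD_NR : 0;
	return HPAGE_PMD_NR;
}

/* Mirrors the comparison in the patched can_split_huge_page(). */
static bool toy_can_split(const struct toy_page *p)
{
	return p->total_mapcount == p->page_count - toy_extra_pins(p) - 1;
}
```

With one mapping and one caller reference, an anonymous THP outside the swap cache passes the check at a page count of 2, while the same THP inside the swap cache needs a page count of 2 + HPAGE_PMD_NR to pass.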
[v14, 1/8] dt: bindings: update Freescale DCFG compatible
Update Freescale DCFG compatible with 'fsl,<chip>-dcfg' instead of
'fsl,ls1021a-dcfg' to include more chips such as ls1021a, ls1043a, and
ls2080a.

Signed-off-by: Yangbo Lu
Acked-by: Rob Herring
Signed-off-by: Scott Wood
---
Changes for v8:
	- Added this patch
Changes for v9:
	- Added a list for the possible compatibles
Changes for v10:
	- None
Changes for v11:
	- Added 'Acked-by: Rob Herring'
	- Updated commit message by Scott
Changes for v12:
	- None
Changes for v13:
	- None
Changes for v14:
	- None
---
 Documentation/devicetree/bindings/arm/fsl.txt | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/arm/fsl.txt b/Documentation/devicetree/bindings/arm/fsl.txt
index dbbc095..713c1ae 100644
--- a/Documentation/devicetree/bindings/arm/fsl.txt
+++ b/Documentation/devicetree/bindings/arm/fsl.txt
@@ -119,7 +119,11 @@ Freescale DCFG
 configuration and status for the device. Such as setting the secondary
 core start address and release the secondary core from holdoff and startup.
   Required properties:
-  - compatible: should be "fsl,ls1021a-dcfg"
+  - compatible: should be "fsl,<chip>-dcfg"
+    Possible compatibles:
+	"fsl,ls1021a-dcfg"
+	"fsl,ls1043a-dcfg"
+	"fsl,ls2080a-dcfg"
   - reg : should contain base address and length of DCFG memory-mapped registers
 
 Example:
-- 
2.1.0.27.g96db324
[PATCH -v4 RESEND 6/9] mm, THP, swap: Support to add/delete THP to/from swap cache
From: Huang Ying

With this patch, a THP (Transparent Huge Page) can be added/deleted
to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages.

This will be used for the THP (Transparent Huge Page) swap support, where
one THP may be added to or deleted from the swap cache. This also batches
the swap cache operations, which reduces the number of lock
acquire/release operations for THP swap.

Cc: Hugh Dickins
Cc: Shaohua Li
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Andrea Arcangeli
Cc: Kirill A. Shutemov
Signed-off-by: "Huang, Ying"
---
 include/linux/page-flags.h |  2 +-
 mm/swap_state.c            | 58 --
 2 files changed, 41 insertions(+), 19 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda..f5bcbea 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -314,7 +314,7 @@ PAGEFLAG_FALSE(HighMem)
 #endif
 
 #ifdef CONFIG_SWAP
-PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
+PAGEFLAG(SwapCache, swapcache, PF_NO_TAIL)
 #else
 PAGEFLAG_FALSE(SwapCache)
 #endif
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d3f047b..3115762 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -43,6 +43,7 @@ struct address_space swapper_spaces[MAX_SWAPFILES] = {
 };
 
 #define INC_CACHE_INFO(x)	do { swap_cache_info.x++; } while (0)
+#define ADD_CACHE_INFO(x, nr)	do { swap_cache_info.x += (nr); } while (0)
 
 static struct {
 	unsigned long add_total;
@@ -80,25 +81,33 @@ void show_swap_cache_info(void)
  */
 int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 {
-	int error;
+	int error, i, nr = hpage_nr_pages(page);
 	struct address_space *address_space;
+	struct page *cur_page;
+	swp_entry_t cur_entry;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageSwapCache(page), page);
 	VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
 
-	get_page(page);
+	page_ref_add(page, nr);
 	SetPageSwapCache(page);
-	set_page_private(page, entry.val);
 
 	address_space = swap_address_space(entry);
+	cur_page = page;
+	cur_entry.val = entry.val;
 	spin_lock_irq(&address_space->tree_lock);
-	error = radix_tree_insert(&address_space->page_tree,
-					swp_offset(entry), page);
+	for (i = 0; i < nr; i++, cur_page++, cur_entry.val++) {
+		set_page_private(cur_page, cur_entry.val);
+		error = radix_tree_insert(&address_space->page_tree,
+					  swp_offset(cur_entry), cur_page);
+		if (unlikely(error))
+			break;
+	}
 	if (likely(!error)) {
-		address_space->nrpages++;
-		__inc_node_page_state(page, NR_FILE_PAGES);
-		INC_CACHE_INFO(add_total);
+		address_space->nrpages += nr;
+		__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
+		ADD_CACHE_INFO(add_total, nr);
 	}
 	spin_unlock_irq(&address_space->tree_lock);
 
@@ -109,9 +118,16 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
 		 * So add_to_swap_cache() doesn't returns -EEXIST.
 		 */
 		VM_BUG_ON(error == -EEXIST);
-		set_page_private(page, 0UL);
 		ClearPageSwapCache(page);
-		put_page(page);
+		set_page_private(cur_page, 0UL);
+		while (i--) {
+			cur_page--;
+			cur_entry.val--;
+			set_page_private(cur_page, 0UL);
+			radix_tree_delete(&address_space->page_tree,
+					  swp_offset(cur_entry));
+		}
+		page_ref_sub(page, nr);
 	}
 
 	return error;
@@ -122,7 +138,7 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
 {
 	int error;
 
-	error = radix_tree_maybe_preload(gfp_mask);
+	error = radix_tree_maybe_preload_order(gfp_mask, compound_order(page));
 	if (!error) {
 		error = __add_to_swap_cache(page, entry);
 		radix_tree_preload_end();
@@ -138,6 +154,7 @@ void __delete_from_swap_cache(struct page *page)
 {
 	swp_entry_t entry;
 	struct address_space *address_space;
+	int i, nr = hpage_nr_pages(page);
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
@@ -145,12 +162,17 @@ void __delete_from_swap_cache(struct page *page)
 
 	entry.val = page_private(page);
 	address_space = swap_address_space(entry);
-	radix_tree_delete(&address_space->page_tree, swp_offset(entry));
-	set_page_private(page, 0);
 	ClearPageSwapCache(page);
-	address_space->nrpages--;
-
[PATCH -v4 RESEND 9/9] mm, THP, swap: Delay splitting THP during swap out
From: Huang Ying

In this patch, splitting huge page is delayed from almost the first
step of swapping out to after allocating the swap space for the THP
(Transparent Huge Page) and adding the THP into the swap cache.  This
will reduce lock acquiring/releasing for the locks used for the swap
cache management.

This is the first step for the THP swap support.  The plan is to delay
splitting the THP step by step and avoid splitting the THP finally.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help to improve the THP swap performance.

- The THP swap space read/write will be 2M sequential IO.  It is
  particularly helpful for the swap read, which is usually 4k random
  IO.  This will help to improve the THP swap performance too.

- It will help with memory fragmentation, especially when the THP is
  heavily used by the applications.  The 2M continuous pages will be
  freed up after the THP swapping out.

- It will improve the THP utilization on the system with the swap
  turned on, because the speed for khugepaged to collapse the normal
  pages into the THP is quite slow.  After the THP is split during the
  swapping out, it will take quite a long time for the normal pages to
  collapse back into the THP after being swapped in.  The high THP
  utilization helps the efficiency of the page based memory management
  too.

There are some concerns regarding THP swap in, mainly because the
possibly enlarged read/write IO size (for swap in/out) may put more
overhead on the storage device.  To deal with that, the THP swap in
should be turned on only when necessary.  For example, it can be
selected via "always/never/madvise" logic, to be turned on globally,
turned off globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

With the patchset, the swap out throughput improved 12.1% (from
1.12GB/s to 1.25GB/s) in the vm-scalability swap-w-seq test case with
16 processes.  The test is done on a Xeon E5 v3 system.  The RAM
simulated PMEM (persistent memory) device is used as the swap device.
To test sequential swapping out, the test case uses 16 processes that
sequentially allocate and write to the anonymous pages until the RAM
and part of the swap device is used up.

The detailed compare result is as follows,

            base             base+patchset
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   1118821 ±  0%     +12.1%    1254241 ±  1%  vmstat.swap.so
   2460636 ±  1%     +10.6%    2720983 ±  1%  vm-scalability.throughput
    308.79 ±  1%      -7.9%     284.53 ±  1%  vm-scalability.time.elapsed_time
      1639 ±  4%    +232.3%       5446 ±  1%  meminfo.SwapCached
      0.70 ±  3%      +8.7%       0.77 ±  5%  perf-stat.ipc
      9.82 ±  8%     -31.6%       6.72 ±  2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list

Signed-off-by: "Huang, Ying"
---
 mm/swap_state.c | 65 ++---
 1 file changed, 62 insertions(+), 3 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3115762..b338523 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include

@@ -175,12 +176,53 @@ void __delete_from_swap_cache(struct page *page)
 	ADD_CACHE_INFO(del_total, nr);
 }

+#ifdef CONFIG_THP_SWAP_CLUSTER
+int add_to_swap_trans_huge(struct page *page, struct list_head *list)
+{
+	swp_entry_t entry;
+	int ret = 0;
+
+	/* cannot split, which may be needed during swap in, skip it */
+	if (!can_split_huge_page(page))
+		return -EBUSY;
+	/* fallback to split huge page firstly if no PMD map */
+	if (!compound_mapcount(page))
+		return 0;
+	entry = get_huge_swap_page();
+	if (!entry.val)
+		return 0;
+	if (mem_cgroup_try_charge_swap(page, entry, HPAGE_PMD_NR)) {
+		__swapcache_free(entry, true);
+		return -EOVERFLOW;
+	}
+	ret = add_to_swap_cache(page, entry,
+				__GFP_HIGH | __GFP_NOMEMALLOC|__GFP_NOWARN);
+	/* -ENOMEM radix-tree allocation failure */
+	if (ret) {
+		__swapcache_free(entry, true);
+		return 0;
+	}
+	ret = split_huge_page_to_list(page, list);
+	if (ret) {
+		delete_from_swap_cache(page);
+		return -EBUSY;
+	}
+	return 1;
+}
+#else
+static inline int add_to_swap_trans_huge(struct page *page,
+					 struct list_head *list)
+{
+	return 0;
+}
+#endif
+
 /**
  * add_to_swap - allocate swap space for a page
  * @page: page we want to move to swap
  *
  * Allocate
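As far as can be read from the hunk above, add_to_swap_trans_huge() returns a negative errno when swapping out the THP should be given up, 0 when the caller should fall back to the non-huge path, and 1 when the THP was added to the swap cache and then split. The userspace sketch below restates that decision sequence as a pure function. The struct `thp_swap_case` and the function `thp_swap_result()` are hypothetical names introduced only for illustration; how the caller uses the three outcomes is not shown in the quoted part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flags saying which step of the sequence succeeds. */
struct thp_swap_case {
	bool can_split;		/* can_split_huge_page() returned true */
	bool pmd_mapped;	/* compound_mapcount() was non-zero */
	bool got_cluster;	/* get_huge_swap_page() returned an entry */
	bool charge_ok;		/* mem_cgroup_try_charge_swap() succeeded */
	bool cache_add_ok;	/* add_to_swap_cache() succeeded */
	bool split_ok;		/* split_huge_page_to_list() succeeded */
};

/* Mirror the order of checks and the return codes in the patch. */
static int thp_swap_result(const struct thp_swap_case *c)
{
	if (!c->can_split)
		return -EBUSY;		/* cannot split: skip this page */
	if (!c->pmd_mapped)
		return 0;		/* no PMD map: split first */
	if (!c->got_cluster)
		return 0;		/* no free swap cluster: fall back */
	if (!c->charge_ok)
		return -EOVERFLOW;	/* memcg swap charge failed */
	if (!c->cache_add_ok)
		return 0;		/* radix-tree allocation failure */
	if (!c->split_ok)
		return -EBUSY;		/* split failed after caching */
	return 1;			/* cached as a whole, then split */
}
```

The sketch drops the cleanup calls (__swapcache_free(), delete_from_swap_cache()) that the real function issues on the failure branches.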
[PATCH -v4 RESEND 4/9] mm, THP, swap: Add get_huge_swap_page()
From: Huang Ying

A variation of get_swap_page(), get_huge_swap_page(), is added to
allocate a swap cluster (HPAGE_PMD_NR swap slots) based on the swap
cluster allocation function.  A fairly simple algorithm is used, that
is, only the first swap device in the priority list will be tried to
allocate the swap cluster.  The function will fail if the attempt is
not successful, and the caller will fall back to allocating a single
swap slot instead.  This works well enough for normal cases.

This will be used for the THP (Transparent Huge Page) swap support,
where get_huge_swap_page() will be used to allocate one swap cluster
for each THP swapped out.

Because of the algorithm adopted, if the difference in the number of
free swap clusters among multiple swap devices is significant, it is
possible that some THPs are split earlier than necessary.  For example,
this could be caused by a big size difference among multiple swap
devices.

Cc: Andrea Arcangeli
Cc: Kirill A. Shutemov
Cc: Hugh Dickins
Cc: Shaohua Li
Cc: Minchan Kim
Cc: Rik van Riel
Signed-off-by: "Huang, Ying"
---
 include/linux/swap.h | 24 +++-
 mm/swapfile.c        | 18 --
 2 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 001b506..cb8c1b0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -401,7 +401,7 @@ static inline long get_nr_swap_pages(void)
 }

 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
+extern swp_entry_t __get_swap_page(bool huge);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
@@ -421,6 +421,23 @@ extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;

+static inline swp_entry_t get_swap_page(void)
+{
+	return __get_swap_page(false);
+}
+
+#ifdef CONFIG_THP_SWAP_CLUSTER
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return __get_swap_page(true);
+}
+#else
+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return (swp_entry_t) {0};
+}
+#endif
+
 #else /* CONFIG_SWAP */

 #define swap_address_space(entry)		(NULL)
@@ -527,6 +544,11 @@ static inline swp_entry_t get_swap_page(void)
 	return entry;
 }

+static inline swp_entry_t get_huge_swap_page(void)
+{
+	return (swp_entry_t) {0};
+}
+
 #endif /* CONFIG_SWAP */

 #ifdef CONFIG_MEMCG
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3643049..8224150 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -760,14 +760,15 @@ static inline unsigned long swap_alloc_huge_cluster(struct swap_info_struct *si)
 }
 #endif

-swp_entry_t get_swap_page(void)
+swp_entry_t __get_swap_page(bool huge)
 {
 	struct swap_info_struct *si, *next;
 	pgoff_t offset;
+	int nr_pages = huge_cluster_nr_entries(huge);

-	if (atomic_long_read(&nr_swap_pages) <= 0)
+	if (atomic_long_read(&nr_swap_pages) < nr_pages)
 		goto noswap;
-	atomic_long_dec(&nr_swap_pages);
+	atomic_long_sub(nr_pages, &nr_swap_pages);

 	spin_lock(&swap_avail_lock);

@@ -795,10 +796,15 @@ swp_entry_t get_swap_page(void)
 		}

 		/* This is called for allocating swap entry for cache */
-		offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		if (likely(nr_pages == 1))
+			offset = scan_swap_map(si, SWAP_HAS_CACHE);
+		else
+			offset = swap_alloc_huge_cluster(si);
 		spin_unlock(&si->lock);
 		if (offset)
 			return swp_entry(si->type, offset);
+		else if (unlikely(nr_pages != 1))
+			goto fail_alloc;
 		pr_debug("scan_swap_map of si %d failed to find offset\n",
 		       si->type);
 		spin_lock(&swap_avail_lock);
@@ -818,8 +824,8 @@ swp_entry_t get_swap_page(void)
 	}

 	spin_unlock(&swap_avail_lock);
-
-	atomic_long_inc(&nr_swap_pages);
+fail_alloc:
+	atomic_long_add(nr_pages, &nr_swap_pages);
 noswap:
 	return (swp_entry_t) {0};
 }
-- 
2.9.3
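The accounting change in __get_swap_page() above reserves nr_pages from the free-page counter before trying to allocate and gives the same amount back on the fail_alloc path. A small userspace sketch of that reserve-then-undo accounting follows. It is single-threaded and uses a plain long where the kernel uses atomic operations; `toy_free_pages`, `toy_reserve()` and `TOY_CLUSTER_PAGES` are hypothetical names, and the value 512 is taken from the x86_64 cluster size stated elsewhere in this series.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CLUSTER_PAGES 512L	/* assumed: HPAGE_PMD_NR on x86_64 */

static long toy_free_pages;	/* stands in for nr_swap_pages */

/*
 * Reserve nr pages up front, then "allocate".  If the allocation step
 * fails, return the reservation so the counter is unchanged.
 */
static bool toy_reserve(long nr, bool alloc_succeeds)
{
	if (toy_free_pages < nr)
		return false;		/* "noswap": nothing was reserved */
	toy_free_pages -= nr;
	if (!alloc_succeeds) {
		toy_free_pages += nr;	/* "fail_alloc": undo the reservation */
		return false;
	}
	return true;
}
```

Comparing against nr rather than against zero is what keeps a huge request from driving the counter negative when fewer than a full cluster of pages remain.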
[PATCH -v4 RESEND 7/9] mm, THP: Add can_split_huge_page()
From: Huang Ying

Separate the check of whether we can split the huge page from
split_huge_page_to_list() into its own function.  This makes it
possible to do the check before actually splitting the THP
(Transparent Huge Page).

This will be used for delaying splitting THP during swapping out,
where for a THP, we will allocate a swap cluster, add the THP into the
swap cache, then split the THP.  To avoid the unnecessary operations
for the un-splittable THP, we will check that first.

There is no functionality change in this patch.

Cc: Andrea Arcangeli
Cc: Kirill A. Shutemov
Cc: Ebru Akagunduz
Signed-off-by: "Huang, Ying"
---
 include/linux/huge_mm.h |  7 +++
 mm/huge_memory.c        | 13 -
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9b9f65d..14ffa3f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -94,6 +94,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
 extern void prep_transhuge_page(struct page *page);
 extern void free_transhuge_page(struct page *page);

+bool can_split_huge_page(struct page *page);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
@@ -176,6 +177,12 @@ static inline void prep_transhuge_page(struct page *page) {}

 #define thp_get_unmapped_area	NULL

+static inline bool
+can_split_huge_page(struct page *page)
+{
+	BUILD_BUG();
+	return false;
+}
 static inline int
 split_huge_page_to_list(struct page *page, struct list_head *list)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cdcd25c..326b145 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2013,6 +2013,17 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount)
 	return ret;
 }

+/* Racy check whether the huge page can be split */
+bool can_split_huge_page(struct page *page)
+{
+	int extra_pins = 0;
+
+	/* Additional pins from radix tree */
+	if (!PageAnon(page))
+		extra_pins = HPAGE_PMD_NR;
+	return total_mapcount(page) == page_count(page) - extra_pins - 1;
+}
+
 /*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
@@ -2083,7 +2094,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * Racy check if we can split the page, before freeze_page() will
 	 * split PMDs
 	 */
-	if (total_mapcount(head) != page_count(head) - extra_pins - 1) {
+	if (!can_split_huge_page(head)) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
-- 
2.9.3
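The check factored out above compares the number of mappings with the number of references, after discounting the references that are expected anyway. A userspace restatement of just that arithmetic follows. `toy_can_split()` and `TOY_HPAGE_PMD_NR` are hypothetical names, 512 is the x86_64 value quoted in this series, and reading the trailing "- 1" as the reference held by the caller is an assumption not spelled out in the patch text.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_HPAGE_PMD_NR 512	/* assumed x86_64 value with 4k base pages */

/*
 * A page is considered splittable when every reference is accounted for
 * by a mapping, by the radix tree (non-anonymous pages only), or by the
 * one extra reference assumed to belong to the caller.
 */
static bool toy_can_split(int total_mapcount, int page_count, bool anon)
{
	int extra_pins = anon ? 0 : TOY_HPAGE_PMD_NR;

	return total_mapcount == page_count - extra_pins - 1;
}
```

For an anonymous THP mapped once and referenced twice the function returns true; one more unexplained reference makes it return false, which corresponds to the -EBUSY path in split_huge_page_to_list().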
[PATCH -v4 RESEND 1/9] mm, swap: Make swap cluster size same of THP size on x86_64
From: Huang Ying

In this patch, the size of the swap cluster is changed to that of the
THP (Transparent Huge Page) on x86_64 architecture (512).  This is for
the THP swap support on x86_64, where one swap cluster will be used to
hold the contents of each THP swapped out.  Some information about the
swapped out THP (such as compound map count) will be recorded in the
swap_cluster_info data structure.

For other architectures which want THP swap support,
ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
the architecture.

In effect, this will double the swap cluster size on x86_64 (from 256
to 512 pages), which may make it harder to find a free cluster when the
swap space becomes fragmented.  In theory, this may reduce the
continuous swap space allocation and sequential write.  The performance
test in 0day shows no regressions caused by this.

Cc: Hugh Dickins
Cc: Shaohua Li
Cc: Minchan Kim
Cc: Rik van Riel
Suggested-by: Andrew Morton
Signed-off-by: "Huang, Ying"
---
 arch/x86/Kconfig |  1 +
 mm/Kconfig       | 13 +
 mm/swapfile.c    |  4 
 3 files changed, 18 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index bada636..a8446bc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -165,6 +165,7 @@ config X86
 	select HAVE_STACK_VALIDATION		if X86_64
 	select ARCH_USES_HIGH_VMA_FLAGS		if X86_INTEL_MEMORY_PROTECTION_KEYS
 	select ARCH_HAS_PKEYS			if X86_INTEL_MEMORY_PROTECTION_KEYS
+	select ARCH_USES_THP_SWAP_CLUSTER	if X86_64

 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/mm/Kconfig b/mm/Kconfig
index be0ee11..2da8128 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -503,6 +503,19 @@ config FRONTSWAP

 	  If unsure, say Y to enable frontswap.

+config ARCH_USES_THP_SWAP_CLUSTER
+	bool
+	default n
+
+config THP_SWAP_CLUSTER
+	bool
+	depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
+	default y
+	help
+	  Use one swap cluster to hold the contents of the THP
+	  (Transparent Huge Page) swapped out.  The size of the swap
+	  cluster will be same as that of THP.
+
 config CMA
 	bool "Contiguous Memory Allocator"
 	depends on HAVE_MEMBLOCK && MMU
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2210de2..18e247b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -196,7 +196,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 	}
 }

+#ifdef CONFIG_THP_SWAP_CLUSTER
+#define SWAPFILE_CLUSTER	HPAGE_PMD_NR
+#else
 #define SWAPFILE_CLUSTER	256
+#endif
 #define LATENCY_LIMIT		256

 static inline void cluster_set_flag(struct swap_cluster_info *info,
-- 
2.9.3
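The numbers in the commit message above (512 slots per cluster, twice the previous 256) follow from the page sizes. The arithmetic can be written out as a short sketch, assuming 4 KiB base pages and 2 MiB PMD-sized huge pages on x86_64; the `TOY_*` constants and helper functions are hypothetical names used only here.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT	12	/* 4 KiB base page, assumed for x86_64 */
#define TOY_PMD_SHIFT	21	/* 2 MiB PMD-sized huge page, assumed */
#define TOY_OLD_CLUSTER	256UL	/* SWAPFILE_CLUSTER before the patch */

/* Number of base pages (and so swap slots) covered by one THP. */
static unsigned long toy_hpage_pmd_nr(void)
{
	return 1UL << (TOY_PMD_SHIFT - TOY_PAGE_SHIFT);
}

/* Size in bytes of a swap cluster holding nr slots. */
static unsigned long toy_cluster_bytes(unsigned long nr)
{
	return nr << TOY_PAGE_SHIFT;
}
```

Under these assumptions a cluster grows from 1 MiB to 2 MiB, so one cluster holds exactly one THP.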
[PATCH -v4 RESEND 0/9] THP swap: Delay splitting THP during swapping out
From: Huang Ying

This patchset is to optimize the performance of Transparent Huge Page
(THP) swap.

Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset?  Especially [1/9], [3/9], [4/9], [5/9],
[6/9], [9/9].

Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset?  Especially [2/9], [7/9] and [8/9].

Hi, Johannes, Michal and Vladimir, I am not very confident about the
memory cgroup part, especially [2/9].  Could you help me to review it?

And for all, any comment is welcome!

Recently, the performance of the storage devices improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
doing page swap out, even on a high-end server machine, because the
performance of the storage device improved faster than that of a single
logical CPU.  And it seems that the trend will not change in the near
future.  On the other hand, the THP becomes more and more popular
because of increased memory size.  So it becomes necessary to optimize
THP swap performance.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc.  This will help improve the performance of the THP swap.

- The THP swap space read/write will be 2M sequential IO.  It is
  particularly helpful for the swap read, which is usually 4k random
  IO.  This will improve the performance of the THP swap too.

- It will help with memory fragmentation, especially when the THP is
  heavily used by the applications.  The 2M continuous pages will be
  freed up after THP swapping out.

- It will improve the THP utilization on the system with the swap
  turned on, because the speed for khugepaged to collapse the normal
  pages into the THP is quite slow.  After the THP is split during the
  swapping out, it will take quite a long time for the normal pages to
  collapse back into the THP after being swapped in.  The high THP
  utilization helps the efficiency of the page based memory management
  too.

There are some concerns regarding THP swap in, mainly because the
possibly enlarged read/write IO size (for swap in/out) may put more
overhead on the storage device.  To deal with that, the THP swap in
should be turned on only when necessary.  For example, it can be
selected via "always/never/madvise" logic, to be turned on globally,
turned off globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

This patchset is based on 10/11 head of mmotm/master.

This patchset is the first step for the THP swap support.  The plan is
to delay splitting THP step by step, finally avoid splitting THP during
the THP swapping out and swap out/in the THP as a whole.

As the first step, in this patchset, the splitting huge page is delayed
from almost the first step of swapping out to after allocating the swap
space for the THP and adding the THP into the swap cache.  This will
reduce lock acquiring/releasing for the locks used for the swap cache
management.

With the patchset, the swap out throughput improves 12.1% (from about
1.12GB/s to about 1.25GB/s) in the vm-scalability swap-w-seq test case
with 16 processes.  The test is done on a Xeon E5 v3 system.  The swap
device used is a RAM simulated PMEM (persistent memory) device.  To
test the sequential swapping out, the test case uses 16 processes,
which sequentially allocate and write to the anonymous pages until the
RAM and part of the swap device is used up.

The detailed compare result is as follows,

            base             base+patchset
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   1118821 ±  0%     +12.1%    1254241 ±  1%  vmstat.swap.so
   2460636 ±  1%     +10.6%    2720983 ±  1%  vm-scalability.throughput
    308.79 ±  1%      -7.9%     284.53 ±  1%  vm-scalability.time.elapsed_time
      1639 ±  4%    +232.3%       5446 ±  1%  meminfo.SwapCached
      0.70 ±  3%      +8.7%       0.77 ±  5%  perf-stat.ipc
      9.82 ±  8%     -31.6%       6.72 ±  2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list

From the swap out throughput number, we can find that, even tested on a
RAM simulated PMEM (Persistent Memory) device, the swap out throughput
can reach only about 1.1GB/s, while in the file IO test the sequential
write throughput of an Intel P3700 SSD can reach about 1.8GB/s
steadily.  And according to the following URL,

https://www-ssl.intel.com/content/www/us/en/solid-state-drives/intel-ssd-dc-family-for-pcie.html

the sequential write throughput of the Intel P3608 SSD can reach about
3.0GB/s, while the random read IOPS can reach about 850k.  It is clear
that the bottleneck has moved from the disk to the
[PATCH -v4 RESEND 0/9] THP swap: Delay splitting THP during swapping out
From: Huang Ying

This patchset is to optimize the performance of Transparent Huge Page
(THP) swap.

Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset? Especially [1/9], [3/9], [4/9], [5/9],
[6/9], [9/9].

Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset? Especially [2/9], [7/9] and [8/9].

Hi, Johannes, Michal and Vladimir, I am not very confident about the
memory cgroup part, especially [2/9]. Could you help me to review it?

And for all, any comment is welcome!

Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
doing page swap out, even on a high-end server machine, because the
performance of storage devices has improved faster than that of a
single logical CPU. And it seems that the trend will not change in the
near future. On the other hand, THP becomes more and more popular
because of increased memory size. So it becomes necessary to optimize
THP swap performance.

The advantages of the THP swap support include:

- Batch the swap operations for the THP to reduce lock
  acquiring/releasing, including allocating/freeing the swap space,
  adding/deleting to/from the swap cache, and writing/reading the swap
  space, etc. This will help improve the performance of the THP swap.

- The THP swap space read/write will be 2M sequential IO. It is
  particularly helpful for the swap read, which is usually 4k random
  IO. This will improve the performance of the THP swap too.

- It will help with memory fragmentation, especially when the THP is
  heavily used by the applications. The 2M of contiguous pages will be
  freed up after THP swapping out.

- It will improve the THP utilization on a system with swap turned
  on, because the speed for khugepaged to collapse the normal pages
  into the THP is quite slow. After the THP is split during the
  swapping out, it will take quite a long time for the normal pages to
  collapse back into the THP after being swapped in. The high THP
  utilization helps the efficiency of the page based memory management
  too.

There are some concerns regarding THP swap in, mainly because the
possibly enlarged read/write IO size (for swap in/out) may put more
overhead on the storage device. To deal with that, the THP swap in
should be turned on only when necessary. For example, it can be
selected via "always/never/madvise" logic, to be turned on globally,
turned off globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

This patchset is based on the 10/11 head of mmotm/master.

This patchset is the first step for the THP swap support. The plan is
to delay splitting THP step by step, finally avoid splitting THP during
the THP swapping out and swap out/in the THP as a whole.

As the first step, in this patchset, the splitting of the huge page is
delayed from almost the first step of swapping out to after allocating
the swap space for the THP and adding the THP into the swap cache. This
will reduce lock acquiring/releasing for the locks used for the swap
cache management.

With the patchset, the swap out throughput improves 12.1% (from about
1.12GB/s to about 1.25GB/s) in the vm-scalability swap-w-seq test case
with 16 processes. The test is done on a Xeon E5 v3 system. The swap
device used is a RAM simulated PMEM (persistent memory) device. To test
the sequential swapping out, the test case uses 16 processes, which
sequentially allocate and write to the anonymous pages until the RAM
and part of the swap device is used up.

The detailed comparison result is as follows,

      base               base+patchset
----------------  --------------------------
       %stddev      %change        %stddev
           \           |               \
   1118821 ±  0%     +12.1%    1254241 ±  1%  vmstat.swap.so
   2460636 ±  1%     +10.6%    2720983 ±  1%  vm-scalability.throughput
    308.79 ±  1%      -7.9%     284.53 ±  1%  vm-scalability.time.elapsed_time
      1639 ±  4%    +232.3%       5446 ±  1%  meminfo.SwapCached
      0.70 ±  3%      +8.7%       0.77 ±  5%  perf-stat.ipc
      9.82 ±  8%     -31.6%       6.72 ±  2%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list

From the swap out throughput number, we can find that, even tested on a
RAM simulated PMEM (Persistent Memory) device, the swap out throughput
can reach only about 1.1GB/s. Meanwhile, in the file IO test, the
sequential write throughput of an Intel P3700 SSD can reach about
1.8GB/s steadily. And according to the following URL,

https://www-ssl.intel.com/content/www/us/en/solid-state-drives/intel-ssd-dc-family-for-pcie.html

the sequential write throughput of the Intel P3608 SSD can reach about
3.0GB/s, while the random read IOPS can reach about 850k. It is clear
that the bottleneck has moved from the disk to the kernel swap component
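
The %change column in the comparison above can be re-derived from the base and base+patchset columns. The following is a minimal sketch of that arithmetic only; the helper name pct_change() is made up for illustration and is not part of the patchset or of the test tooling.

```c
#include <assert.h>

/*
 * Relative change from the "base" value to the "base+patchset" value,
 * in percent.  pct_change() is a hypothetical helper used only to
 * re-derive the %change column of the comparison table.
 */
static double pct_change(double base, double patched)
{
	assert(base != 0.0);	/* a zero baseline has no relative change */
	return (patched - base) / base * 100.0;
}
```

For example, pct_change(1118821, 1254241) for vmstat.swap.so comes out a little above 12.1, and pct_change(308.79, 284.53) for the elapsed time comes out close to -7.9. Rows whose inputs are already heavily rounded in the table (perf-stat.ipc, for instance) will not reproduce exactly from the printed values.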
Re: [RFC 0/8] Define coherent device memory node
On 10/27/2016 08:35 PM, Jerome Glisse wrote:
> On Thu, Oct 27, 2016 at 12:33:05PM +0530, Anshuman Khandual wrote:
>> On 10/27/2016 10:08 AM, Anshuman Khandual wrote:
>>> On 10/26/2016 09:32 PM, Jerome Glisse wrote:
>>>> On Wed, Oct 26, 2016 at 04:43:10PM +0530, Anshuman Khandual wrote:
>>>>> On 10/26/2016 12:22 AM, Jerome Glisse wrote:
>>>>>> On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
>>>>>>> Jerome Glisse writes:
>>>>>>>> On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
>>>>>>>>> Jerome Glisse writes:
>>>>>>>>>> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
> [...]
>
>>>> In my patchset there is no policy, it is all under device driver
>>>> control which decide what range of memory is migrated and when. I
>>>> think only device driver as proper knowledge to make such decision.
>>>> By coalescing data from GPU counters and request from application
>>>> made through the uppler level programming API like Cuda.
>>>
>>> Right, I understand that. But what I pointed out here is that there
>>> are problems now migrating user mapped pages back and forth between
>>> LRU system RAM memory and non LRU device memory which is yet to be
>>> solved. Because you are proposing a non LRU based design with
>>> ZONE_DEVICE, how we are solving/working around these problems for
>>> bi-directional migration ?
>>
>> Let me elaborate on this bit more. Before non LRU migration support
>> patch series from Minchan, it was not possible to migrate non LRU
>> pages which are generally driver managed through migrate_pages
>> interface. This was affecting the ability to do compaction on
>> platforms which has a large share of non LRU pages. That series
>> actually solved the migration problem and allowed compaction. But it
>> still did not solve the migration problem for non LRU *user mapped*
>> pages. So if the non LRU pages are mapped into a process's page table
>> and being accessed from user space, it can not be moved using
>> migrate_pages interface.
>>
>> Minchan had a draft solution for that problem which is still hosted
>> here. On his suggestion I had tried this solution but still faced
>> some other problems during mapped pages migration. (NOTE: IIRC this
>> was not posted in the community)
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git with
>> the following branch
>> (non-lru-mapped-v1r2-v4.7-rc4-mmotm-2016-06-24-15-53)
>>
>> As I had mentioned earlier, we intend to support all possible
>> migrations between system RAM (LRU) and device memory (Non LRU) for
>> user space mapped pages.
>>
>> (1) System RAM (Anon mapping) --> Device memory, back and forth many times
>> (2) System RAM (File mapping) --> Device memory, back and forth many times
>
> I achieve this 2 objective in HMM, i sent you the additional patches
> for file back page migration. I am not done working on them but they
> are small.

Sure, will go through them. Thanks !

>
>
>> This is not happening now with non LRU pages. Here are some of
>> reasons but before that some notes.
>>
>> * Driver initiates all the migrations
>> * Driver does the isolation of pages
>> * Driver puts the isolated pages in a linked list
>> * Driver passes the linked list to migrate_pages interface for migration
>> * IIRC isolation of non LRU pages happens through
>>   page->as->aops->isolate_page call
>> * If migration fails, call page->as->aops->putback_page to give the
>>   page back to the device driver
>>
>> 1. queue_pages_range() currently does not work with non LRU pages,
>>    needs to be fixed
>>
>> 2. After a successful migration from non LRU device memory to LRU
>>    system RAM, the non LRU will be freed back. Right now
>>    migrate_pages releases these pages to buddy, but in this situation
>>    we need the pages to be given back to the driver instead. Hence
>>    migrate_pages needs to be changed to accommodate this.
>>
>> 3. After LRU system RAM to non LRU device migration for a mapped
>>    page, does the new page (which came from device memory) will be
>>    part of core MM LRU either for Anon or File mapping ?
>>
>> 4. After LRU (Anon mapped) system RAM to non LRU device migration for
>>    a mapped page, how we are going to store
>>    "address_space->address_space_operations" and "Anon VMA Chain"
>>    reverse mapping information both on the page->mapping element ?
>>
>> 5. After LRU (File mapped) system RAM to non LRU device migration for
>>    a mapped page, how we are going to store
>>    "address_space->address_space_operations" of the device driver and
>>    radix tree based reverse mapping information for the existing file
>>    mapping both on the same page->mapping element ?
>>
>> 6. IIRC, it was not possible to retain the non LRU identify
>>    (page->as->aops which will defined inside the device driver) and
>>    the reverse mapping information (either anon or file mapping)
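
Points 4 to 6 above come down to page->mapping being a single pointer-sized slot whose low bits act as a type tag. The following userspace sketch illustrates that constraint only; it is not kernel code. The two tag values are given as they are defined, to the best of my knowledge, in include/linux/page-flags.h, while struct toy_page and the toy_* helpers are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Low-bit tags; values assumed to match include/linux/page-flags.h. */
#define PAGE_MAPPING_ANON	0x1UL
#define PAGE_MAPPING_MOVABLE	0x2UL
#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

/* Hypothetical stand-in for struct page: one tagged pointer slot. */
struct toy_page {
	uintptr_t mapping;
};

/* Store a pointer plus a type tag in the single mapping slot. */
static void toy_set_mapping(struct toy_page *page, void *ptr, uintptr_t tag)
{
	/* The object must be aligned so that its low bits are free. */
	assert(((uintptr_t)ptr & PAGE_MAPPING_FLAGS) == 0);
	page->mapping = (uintptr_t)ptr | tag;
}

/* Recover the pointer by masking off the tag bits. */
static void *toy_mapping_ptr(const struct toy_page *page)
{
	return (void *)(page->mapping & ~PAGE_MAPPING_FLAGS);
}

/* Recover the tag bits. */
static uintptr_t toy_mapping_tag(const struct toy_page *page)
{
	return page->mapping & PAGE_MAPPING_FLAGS;
}
```

In this model, storing the driver's object with the movable tag overwrites whatever anon_vma (or file address_space) pointer was there before, so the reverse mapping information and the driver's operations cannot both live in the slot at the same time, which is the difficulty the numbered points describe.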
Re: [PATCH] arm64: defconfig: Enable DRM DU and V4L2 FCP + VSP modules
On Thu, Oct 27, 2016 at 04:37:53PM +0900, Magnus Damm wrote:
> Hi Simon,
>
> On Thu, Oct 27, 2016 at 4:15 PM, Simon Horman wrote:
> > On Thu, Oct 27, 2016 at 09:08:01AM +0200, Simon Horman wrote:
> >> On Wed, Oct 26, 2016 at 02:24:22PM +0900, Magnus Damm wrote:
> >> > From: Magnus Damm
> >> >
> >> > Extend the ARM64 defconfig to enable the DU DRM device as module
> >> > together with required dependencies of V4L2 FCP and VSP modules.
> >> >
> >> > This enables VGA output on the r8a7795 Salvator-X board.
> >> >
> >> > Signed-off-by: Magnus Damm
> >>
> >> Thanks, I have queued this up.
> >
> > Given discussion elsewhere on enabling DU I am holding off on this for a
> > little; it is not queued up for now.
>
> Sure, thanks for holding off the DT integration patches for r8a7796.
> Please note that as of mainline v4.9-rc2 the r8a7795 Salvator-X board
> has thanks to DU, FCP and VSP a working VGA port. So enabling those
> devices in the defconfig from now on makes sense to me.

Understood, I have queued this up.
Re: [PATCH v4 0/3] nvme power saving
On Thu, Oct 27, 2016 at 05:06:16PM -0700, Andy Lutomirski wrote:
> It looks like there is at least one NVMe disk in existence (a
> different Samsung device) that sporadically dies when APST is on.
> This device appears to also sporadically die when APST is off, but it
> lasts considerably longer before dying with APST off.

Judy, can you help Andy to find someone in Samsung to report this to?

> So here's what I'm tempted to do:
>
> - For devices that report NVMe version 1.2 support, APST is on by
> default. I hope this is safe.

It should be safe. That being said NVMe is being driven more and more
into consumer markets so eventually we will find some device we need
to work around inevitably, but that's life.

> - For devices that don't report NVMe 1.2 or higher but do report
> APSTA (which implies NVMe 1.1), then we can have a blacklist or a
> whitelist. A blacklist is nicer, but a whitelist is safer.

We just had a discussion in the NVMe technical working group about
advertising features before claiming conformance to the version where
they appear. The general consensus was that it should be safe. I'm
thus tempted to start out with the blacklist.

> - A sysfs and/or module control allows overriding this.
>
> - Implement dev_pm_qos latency control. The chosen latency (if APST
> is enabled) will be the lesser of the dev_pm_qos setting and a module
> parameter.
>
> How does that sound?

Great!
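
The policy discussed in this exchange can be sketched as two small decision functions. This is only an illustration of the proposal with the blacklist option preferred in the reply, not code from the nvme driver: struct toy_ctrl, its fields and both function names are hypothetical, and requiring the APSTA bit even for 1.2 devices is an assumption on my part.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of what a controller reports. */
struct toy_ctrl {
	unsigned int ver_major;	/* NVMe version, major part */
	unsigned int ver_minor;	/* NVMe version, minor part */
	bool apsta;		/* controller advertises APST support */
	bool blacklisted;	/* device is on a known-bad list */
};

/* Should APST be enabled by default for this controller? */
static bool toy_apst_default_on(const struct toy_ctrl *c)
{
	if (!c->apsta)
		return false;	/* assumption: no APSTA, nothing to enable */
	if (c->ver_major > 1 || (c->ver_major == 1 && c->ver_minor >= 2))
		return true;	/* NVMe 1.2 or later: on by default */
	return !c->blacklisted;	/* older devices: on unless blacklisted */
}

/* Latency bound: the lesser of the dev_pm_qos value and the parameter. */
static uint64_t toy_apst_latency_us(uint64_t qos_us, uint64_t param_us)
{
	return qos_us < param_us ? qos_us : param_us;
}
```

A sysfs or module override, as mentioned in the proposal, would sit on top of toy_apst_default_on() and is left out of the sketch.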
Re: drivers/base/power/opp/of.c:181:6: error: redefinition of 'dev_pm_opp_of_remove_table'
On 28-10-16, 12:07, Fengguang Wu wrote:
> On Fri, Oct 28, 2016 at 09:27:53AM +0530, Viresh Kumar wrote:
> >On 28-10-16, 07:22, kbuild test robot wrote:
> >>tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> >>head:   e3300ffef0653774f1099cab153d25d24bd773ce
> >>commit: f47b72a15a9679dd4dc1af681d4d2f1ca2815552 PM / OPP: Move CONFIG_OF dependent code in a separate file
> >>date:   6 months ago
> >
> >Why are we picking it up now ?
>
> Sorry due to problems in the 0day infrastructure some few errors are
> missed in May. Now we catch it when the commit goes mainline.
>
> https://lists.01.org/pipermail/kbuild-all/
>
> June 2016:  ... [ Gzip'd Text 853 KB ]
> May 2016:   ... [ Gzip'd Text 294 KB ]
> April 2016: ... [ Gzip'd Text 599 KB ]
>
> As you can see, the report volumes are noticeably lower in "May 2016".

No issues :)

So I will just ignore this email now as things are probably stable
right now.

--
viresh
Re: [RFC PATCH] usb: core: correct usb_get_dev() documentation
On Thu, Oct 27, 2016 at 04:49:18PM -0700, Dmitry Torokhov wrote:
> On Thu, Oct 27, 2016 at 03:02:30PM -0700, Brian Norris wrote:
> > In reading through a USB interface driver, I noticed that it called
> > usb_{get,put}_dev() in its probe() and disconnect() methods. This seemed
> > unnecessary, but a look at the comments here matched the usage.
> >
> > USB interface devices seem to be well covered by the parent/child
> > relationship of the device model, and so it should be unnecessary for a
> > child device to grab a refcount on its parent device.
> >
> > Signed-off-by: Brian Norris
>
> Yes, usb_device is parent of usb_interface and device core does "parent
> = get_device(dev->parent);" as part of device_add() when registering new
> interfaces.
>
> Reviewed-by: Dmitry Torokhov
>

Yes, the current code seems a little messy for get{put}_device. E.g., for
a USB device, it calls get_device again in usb_set_configuration when
creating its child device (the interface device).

For a USB interface device, get{put}_device is handled in message.c for
the common interface code, so it seems unnecessary to call
usb_get{put}_dev again in an individual interface driver.

Peter

> > ---
> > This reflects my understanding (and testing), as well as the majority of usage
> > -- there are *very* few interface drivers that actually call usb_get_dev(). If
> > I'm wrong, please feel free to tell me so! But I thought patching the
> > documentation would be the best way to solicit a response :)
> >
> >  drivers/usb/core/usb.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/usb/core/usb.c b/drivers/usb/core/usb.c
> > index 592151461017..0ba7e070f04e 100644
> > --- a/drivers/usb/core/usb.c
> > +++ b/drivers/usb/core/usb.c
> > @@ -539,9 +539,9 @@ EXPORT_SYMBOL_GPL(usb_alloc_dev);
> >   *
> >   * Each live reference to a device should be refcounted.
> >   *
> > - * Drivers for USB interfaces should normally record such references in
> > - * their probe() methods, when they bind to an interface, and release
> > - * them by calling usb_put_dev(), in their disconnect() methods.
> > + * The device driver core automatically handles this refcounting for USB
> > + * interface drivers, but this API can be used for non-parent/child
> > + * relationships.
> >   *
> >   * Return: A pointer to the device with the incremented reference counter.
> >   */
> > --
> > 2.8.0.rc3.226.g39d4020
> >
>
> --
> Dmitry
> --
> To unsubscribe from this list: send the line "unsubscribe linux-usb" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--

Best Regards,
Peter Chen
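
The argument in this thread is that registering a child device already pins its parent, so an interface driver need not take a further reference on the usb_device. The toy model below illustrates that idea in userspace; struct toy_device and the toy_* functions are hypothetical and only mimic the "parent = get_device(dev->parent)" behaviour quoted from device_add() above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of a refcounted device with a parent link. */
struct toy_device {
	int refcount;
	struct toy_device *parent;
};

static struct toy_device *toy_get(struct toy_device *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

static void toy_put(struct toy_device *dev)
{
	if (dev) {
		assert(dev->refcount > 0);
		dev->refcount--;
	}
}

/* Registering a child takes a reference on its parent. */
static void toy_device_add(struct toy_device *child, struct toy_device *parent)
{
	child->parent = toy_get(parent);
}

/* Unregistering the child drops that reference again. */
static void toy_device_del(struct toy_device *child)
{
	toy_put(child->parent);
	child->parent = NULL;
}
```

In this model the parent's count stays above zero for as long as the child is registered, which is the property that makes an extra get/put pair in probe() and disconnect() redundant for the parent/child case.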
Re: [REVIEW][PATCH v2] mm: Add a user_ns owner to mm_struct and fix ptrace permission checks
ebied...@xmission.com (Eric W. Biederman) writes:

> Cyrill Gorcunov <gorcu...@gmail.com> writes:
>
>> On Fri, Oct 28, 2016 at 12:39:18AM +0300, Cyrill Gorcunov wrote:
>>> On Thu, Oct 27, 2016 at 10:54:34AM -0500, Eric W. Biederman wrote:
>>> >
>>> >
>>> > I can't imagine either of these changes making a practical difference
>>> > to anyone but I am calling them out in case someone can.
>>> >
>>> >  include/linux/mm_types.h |  1 +
>>> >  kernel/fork.c            |  9 ++---
>>> >  kernel/ptrace.c          | 26 +++---
>>> >  mm/init-mm.c             |  2 ++
>>> >  4 files changed, 20 insertions(+), 18 deletions(-)
>>>
>>> Thanks a huge, Eric! And really sorry for delay in response,
>>> I managed to miss this quite important mail for me in mail
>>> storm. Gonna test it and will write you the results. Overall looks
>>> great, but better be sure and run the tests.
>>>
>>> Reviewed-by: Cyrill Gorcunov <gorcu...@openvz.org>
>>
>> Eric, on which kernel the patch is on top of?
>> It doesn't apply on linux-next for some reason.
>>
>> | Date: Thu Oct 27 14:21:59 2016 +1100
>> |
>> | Add linux-next specific files for 20161027
>> |
>> | Signed-off-by: Stephen Rothwell <s...@canb.auug.org.au>
>>
>> I applied it on Linus' master and tests passed fine
>> (but they were passing fine even without the patch,
>> only linux-next failed).
>
> Odd. I don't think I have taken the old version out of
> linux-next yet. So you can probably revert the old version out of
> linux-next and apply this one. All of my development at this point is
> against v4.9-rc1.
>
> I suspect you will find my last version on top of against v4.9-rc1 will
> pass. Since my tree is only one deep and I don't think anyone except
> linux-next is based on it, I plan to drop and readd this patch.
> Especially since it is candidate for backporting.

Mind if I add your tested-by?

To see Linus's tree fail with my patch you can apply the patch below.
That is the essence of what I changed to fix things. Just ignoring
dumpable when an mm exists.

Eric

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 44a25a1e6e83..b53983ee3f03 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -272,7 +272,7 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode)
 ok:
 	rcu_read_unlock();
 	mm = task->mm;
-	if (mm &&
+	if (!mm ||
 	    ((get_dumpable(mm) != SUID_DUMP_USER) &&
 	     !ptrace_has_cap(mm->user_ns, mode)))
 		return -EPERM;
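
The one-line change in the diff above alters when -EPERM is returned. As a reading aid only, the two conditions can be modelled as boolean predicates; the function and parameter names below are hypothetical, with has_mm standing for task->mm being non-NULL, dumpable_user for get_dumpable(mm) == SUID_DUMP_USER, and has_cap for ptrace_has_cap(mm->user_ns, mode).

```c
#include <stdbool.h>

/* Condition on the '-' line: deny only when an mm exists and fails. */
static bool toy_denied_before(bool has_mm, bool dumpable_user, bool has_cap)
{
	return has_mm && (!dumpable_user && !has_cap);
}

/* Condition on the '+' line: a missing mm is denied as well. */
static bool toy_denied_after(bool has_mm, bool dumpable_user, bool has_cap)
{
	return !has_mm || (!dumpable_user && !has_cap);
}
```

In this model the two predicates agree whenever an mm is present and differ only for a task without one, which the first form lets through and the second rejects.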
Re: [PATCH v2 3/4] input: Deprecate real timestamps beyond year 2106
On Thu, Oct 27, 2016 at 03:24:55PM -0700, Deepa Dinamani wrote:
> On Wed, Oct 26, 2016 at 7:56 PM, Peter Hutterer wrote:
> > On Mon, Oct 17, 2016 at 08:27:32PM -0700, Deepa Dinamani wrote:
> >> struct timeval is not y2038 safe.
> >> All usage of timeval in the kernel will be replaced by
> >> y2038 safe structures.
> >>
> >> struct input_event maintains time for each input event.
> >> Real time timestamps are not ideal for input as this
> >> time can go backwards as noted in the patch a80b83b7b8
> >> by John Stultz. Hence, having the input_event.time fields
> >> only big enough for monotonic and boot times are
> >> sufficient.
> >>
> >> Leave the original input_event as is. This is to maintain
> >> backward compatibility with existing userspace interfaces
> >> that use input_event.
> >> Introduce a new replacement struct raw_input_event.
> >
> > general comment here - please don't name it "raw_input_event".
> > First, when you grep for input_event you want the new ones to show up too,
> > so a struct input_event_raw would be better here. That also has better
> > namespacing in general. Second though: the event isn't any more "raw" than
> > the previous we had.
> >
> > I can't think of anything better than struct input_event_v2 though.
>
> The general idea was to leave the original struct input_event as a
> common interface for userspace (as it cannot be deleted).
> So reading raw data unformatted by the userspace will have the new
> struct raw_input_event format.
> This was the reason for the "raw" in the name.
>
> struct input_event_v2 is fine too, if this is more preferred.
>
> >> This replaces timeval with struct input_timeval. This structure
> >> maintains time in __kernel_ulong_t or compat_ulong_t to allow
> >> for architectures to override types as in the case of x32.
> >>
> >> The change requires any userspace utilities reading or writing
> >> from event nodes to update their reading format to match
> >> raw_input_event. The changes to the popular libraries will be
> >> posted along with the kernel changes.
> >> The driver version is also updated to reflect the change in
> >> event format.
> >
> > Doesn't this break *all* of userspace then? I don't see anything to
> > negotiate the type of input event the kernel gives me. And nothing right now
> > checks for EVDEV_VERSION, so they all just assume it's a struct
> > input_event. Best case, if the available events aren't a multiple of
> > sizeof(struct input_event) userspace will bomb out, but unless that happens,
> > everyone will just happily read old-style events.
> >
> > So we need some negotiation what is acceptable. Which also needs to address
> > the race conditions we're going to get when events start coming in before
> > the client has announced that it supports the new-style events.
>
> No, this does not break any userspace right now.
> Both struct input_event and struct raw_input_event are exactly the same today.

oh, right, the ABI is the same. I see that now, thanks.

> This will be the case until a 2038-safe glibc is used with a 64 bit time_t
> flag.
>
> So these are the scenarios:
> 1. old kernel driver + new userspace
> -- should still be ok until 2038. Version checks could help discover these
> 2. new kernel driver + old userspace (without recompiled with new 2038 gblic)
> -- works because the format is really the same.
>
> The patch I posted to libevdev checks this driver version.

btw, where did you post the libevdev patch? I haven't seen it anywhere I'm
subscribed to.

> And, hence any library that results in a call to libevdev_set_fd()
> will fail if it is not this updated driver.

without having seen the libevdev patch - that sounds like a bad idea.
there are plenty of usecases where libevdev_set_fd() is called but
timestamps in events just don't matter. So we may need some more
negotiation between the library user, libevdev and the kernel.

Cheers,
   Peter

> We could just do a similar check in every library also.
> I think the latter would be better.
>
> So, the kernel patches can go in as a no-op right now and then I can
> add version checks to respective user space libraries.
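For readers wondering where the years 2038 and 2106 in this thread come from, the arithmetic can be checked with a small standalone sketch. It assumes only that timestamps count seconds since 1970-01-01 in a 32-bit field, signed for the 2038 limit and unsigned for the 2106 limit; year_of_epoch_seconds() is a hypothetical helper and not part of any kernel or libevdev API.

```c
#include <stdint.h>

/* Gregorian leap-year rule. */
static int is_leap_year(int year)
{
	return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

/*
 * Return the calendar year (UTC) that contains the given number of
 * seconds since 1970-01-01 00:00:00, ignoring leap seconds.
 */
static int year_of_epoch_seconds(uint64_t secs)
{
	uint64_t days = secs / 86400;
	int year = 1970;

	for (;;) {
		uint64_t len = is_leap_year(year) ? 366 : 365;

		if (days < len)
			return year;
		days -= len;
		year++;
	}
}
```

Feeding it INT32_MAX gives the last year a signed 32-bit seconds counter can represent, and UINT32_MAX gives the last year an unsigned one can.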
Re: [v13, 5/8] soc: fsl: add GUTS driver for QorIQ platforms
On Fri, 2016-10-28 at 11:32 +0800, Yangbo Lu wrote:
> +	guts->regs = of_iomap(np, 0);
> +	if (!guts->regs)
> +		return -ENOMEM;
> +
> +	/* Register soc device */
> +	machine = of_flat_dt_get_machine_name();
> +	if (machine)
> +		soc_dev_attr.machine = devm_kstrdup(dev, machine, GFP_KERNEL);
> +
> +	svr = fsl_guts_get_svr();
> +	soc_die = fsl_soc_die_match(svr, fsl_soc_die);
> +	if (soc_die) {
> +		soc_dev_attr.family = devm_kasprintf(dev, GFP_KERNEL,
> +						     "QorIQ %s", soc_die->die);
> +	} else {
> +		soc_dev_attr.family = devm_kasprintf(dev, GFP_KERNEL, "QorIQ");
> +	}
> +	soc_dev_attr.soc_id = devm_kasprintf(dev, GFP_KERNEL,
> +					     "svr:0x%08x", svr);
> +	soc_dev_attr.revision = devm_kasprintf(dev, GFP_KERNEL, "%d.%d",
> +					       SVR_MAJ(svr), SVR_MIN(svr));
> +
> +	soc_dev = soc_device_register(&soc_dev_attr);
> +	if (IS_ERR(soc_dev))
> +		return PTR_ERR(soc_dev);

ioremap leaks on this error path. Use devm_ioremap_resource().

-Scott
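The leak Scott points out has a familiar shape: a resource is acquired, a later step fails, and the function returns without releasing it. The following is a userspace model of that shape only, under stated assumptions: fake_map(), fake_unmap() and fake_register() are hypothetical stand-ins for the mapping and registration calls, and a counter replaces real MMIO mappings. In the kernel, a device-managed (devm_*) mapping is released automatically when probe fails, which is what the manual unwind below imitates.

```c
#include <errno.h>
#include <stdlib.h>

static int live_mappings;	/* mappings acquired and not yet released */

/* Hypothetical stand-in for mapping device registers. */
static void *fake_map(void)
{
	void *p = malloc(16);

	if (p)
		live_mappings++;
	return p;
}

/* Hypothetical stand-in for unmapping device registers. */
static void fake_unmap(void *p)
{
	if (p) {
		free(p);
		live_mappings--;
	}
}

/* Hypothetical registration step that can be made to fail. */
static int fake_register(int should_fail)
{
	return should_fail ? -EINVAL : 0;
}

/* Same shape as the quoted code: the early return keeps the mapping. */
static int probe_leaky(int should_fail, void **regs_out)
{
	void *regs = fake_map();
	int ret;

	if (!regs)
		return -ENOMEM;
	ret = fake_register(should_fail);
	if (ret)
		return ret;	/* mapping is still live here */
	*regs_out = regs;
	return 0;
}

/* Releases the mapping on the error path before returning. */
static int probe_unwinding(int should_fail, void **regs_out)
{
	void *regs = fake_map();
	int ret;

	if (!regs)
		return -ENOMEM;
	ret = fake_register(should_fail);
	if (ret) {
		fake_unmap(regs);
		return ret;
	}
	*regs_out = regs;
	return 0;
}
```

After a failed probe_leaky() the counter stays at one; after a failed probe_unwinding() it is back at zero.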
[PATCH RESEND] mpt3sas: Fix for block device of raid exists even after deleting raid disk
While merging mpt3sas & mpt2sas code, we posted below patch for
WarpDrive support,

mpt3sas: Ported WarpDrive product SSS6200 support
commit id is 7786ab6aff

In this patch and in the below hunk, we have added is_warpdrive check
condition on the wrong line

---
scsih_target_alloc(struct scsi_target *starget)
 			sas_target_priv_data->handle = raid_device->handle;
 			sas_target_priv_data->sas_address = raid_device->wwid;
 			sas_target_priv_data->flags |= MPT_TARGET_FLAGS_VOLUME;
-			raid_device->starget = starget;
+			sas_target_priv_data->raid_device = raid_device;
+			if (ioc->is_warpdrive)
+				raid_device->starget = starget;
 		}
 		spin_unlock_irqrestore(&ioc->raid_device_lock, flags);
 		return 0;
--

Actually that check should be for below line
sas_target_priv_data->raid_device = raid_device;

Due to above hunk, we are not initializing raid_device's starget for
raid volumes, and so during raid disk deletion driver is not calling
scsi_remove_target() API as driver observes starget field of
raid_device's structure as NULL.

Signed-off-by: Sreekanth Reddy
Cc:
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 981be7b..618c9df8 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -1279,9 +1279,9 @@ scsih_target_alloc(struct scsi_target *starget)
 			sas_target_priv_data->handle = raid_device->handle;
 			sas_target_priv_data->sas_address = raid_device->wwid;
 			sas_target_priv_data->flags |= MPT_TARGET_FLAGS_VOLUME;
-			sas_target_priv_data->raid_device = raid_device;
 			if (ioc->is_warpdrive)
-				raid_device->starget = starget;
+				sas_target_priv_data->raid_device = raid_device;
+			raid_device->starget = starget;
 		}
 		spin_unlock_irqrestore(&ioc->raid_device_lock, flags);
 		return 0;
-- 
2.4.3
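The bug described here comes down to which single statement an unbraced if governs. A small userspace sketch can make the before/after behaviour concrete. The struct and function names below are hypothetical simplifications, not the driver's real types; only the placement of the is_warpdrive check follows the two hunks quoted in the message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the driver structures. */
struct fake_starget { int id; };
struct fake_raid_device { struct fake_starget *starget; };
struct fake_target_priv { struct fake_raid_device *raid_device; };

/* Placement from the WarpDrive port: starget is set only for WarpDrive. */
static void target_alloc_before(bool is_warpdrive,
				struct fake_target_priv *priv,
				struct fake_raid_device *rd,
				struct fake_starget *st)
{
	priv->raid_device = rd;
	if (is_warpdrive)
		rd->starget = st;
}

/* Placement after this patch: starget is always recorded. */
static void target_alloc_after(bool is_warpdrive,
			       struct fake_target_priv *priv,
			       struct fake_raid_device *rd,
			       struct fake_starget *st)
{
	if (is_warpdrive)
		priv->raid_device = rd;
	rd->starget = st;
}
```

With is_warpdrive false, the first variant leaves rd->starget NULL, which matches the symptom in the commit message: a later removal path that looks at that field finds nothing to remove.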
[v13, 8/8] mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0
The eSDHC of T4240-R1.0-R2.0 has incorrect vendor version and spec
version. Actually the right version numbers should be VVN=0x13 and
SVN=0x1. This patch adds GUTS driver support to the eSDHC driver so
that it can match the SoC, and fixes the host version so that the
incorrect version numbers do not break ADMA data transfers.

Signed-off-by: Yangbo Lu
Acked-by: Ulf Hansson
Acked-by: Scott Wood
---
Changes for v2:
	- Got SVR through iomap instead of dts
Changes for v3:
	- Managed GUTS through syscon instead of iomap in eSDHC driver
Changes for v4:
	- Got SVR by GUTS driver instead of SYSCON
Changes for v5:
	- Changed to get SVR through API fsl_guts_get_svr()
	- Combined patch 4, patch 5 and patch 6 into one
Changes for v6:
	- Added 'Acked-by: Ulf Hansson'
Changes for v7:
	- None
Changes for v8:
	- Added 'Acked-by: Scott Wood'
Changes for v9:
	- None
Changes for v10:
	- None
Changes for v11:
	- Changed to use soc_device_match
Changes for v12:
	- Matched soc through .family field instead of .soc_id
Changes for v13:
	- None
---
 drivers/mmc/host/Kconfig          |  1 +
 drivers/mmc/host/sdhci-of-esdhc.c | 20
 2 files changed, 21 insertions(+)

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 5274f50..a1135a9 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -144,6 +144,7 @@ config MMC_SDHCI_OF_ESDHC
 	depends on MMC_SDHCI_PLTFM
 	depends on PPC || ARCH_MXC || ARCH_LAYERSCAPE
 	select MMC_SDHCI_IO_ACCESSORS
+	select FSL_GUTS
 	help
 	  This selects the Freescale eSDHC controller support.
diff --git a/drivers/mmc/host/sdhci-of-esdhc.c b/drivers/mmc/host/sdhci-of-esdhc.c
index fb71c86..57bdb9e 100644
--- a/drivers/mmc/host/sdhci-of-esdhc.c
+++ b/drivers/mmc/host/sdhci-of-esdhc.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "sdhci-pltfm.h"
 #include "sdhci-esdhc.h"
@@ -28,6 +29,7 @@
 struct sdhci_esdhc {
 	u8 vendor_ver;
 	u8 spec_ver;
+	bool quirk_incorrect_hostver;
 };
 /**
@@ -73,6 +75,8 @@ static u32 esdhc_readl_fixup(struct sdhci_host *host,
 static u16 esdhc_readw_fixup(struct sdhci_host *host,
 				     int spec_reg, u32 value)
 {
+	struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host);
+	struct sdhci_esdhc *esdhc = sdhci_pltfm_priv(pltfm_host);
 	u16 ret;
 	int shift = (spec_reg & 0x2) * 8;
@@ -80,6 +84,12 @@ static u16 esdhc_readw_fixup(struct sdhci_host *host,
 		ret = value & 0xffff;
 	else
 		ret = (value >> shift) & 0xffff;
+	/* Workaround for T4240-R1.0-R2.0 eSDHC which has incorrect
+	 * vendor version and spec version information.
+	 */
+	if ((spec_reg == SDHCI_HOST_VERSION) &&
+	    (esdhc->quirk_incorrect_hostver))
+		ret = (VENDOR_V_23 << SDHCI_VENDOR_VER_SHIFT) | SDHCI_SPEC_200;
 	return ret;
 }
@@ -558,6 +568,12 @@ static const struct sdhci_pltfm_data sdhci_esdhc_le_pdata = {
 	.ops = &sdhci_esdhc_le_ops,
 };
+static struct soc_device_attribute soc_incorrect_hostver[] = {
+	{ .family = "QorIQ T4240", .revision = "1.0", },
+	{ .family = "QorIQ T4240", .revision = "2.0", },
+	{ },
+};
+
 static void esdhc_init(struct platform_device *pdev, struct sdhci_host *host)
 {
 	struct sdhci_pltfm_host *pltfm_host;
@@ -571,6 +587,10 @@ static void esdhc_init(struct platform_device *pdev, struct sdhci_host *host)
 	esdhc->vendor_ver = (host_ver & SDHCI_VENDOR_VER_MASK) >>
 			     SDHCI_VENDOR_VER_SHIFT;
 	esdhc->spec_ver = host_ver & SDHCI_SPEC_VER_MASK;
+	if (soc_device_match(soc_incorrect_hostver))
+		esdhc->quirk_incorrect_hostver = true;
+	else
+		esdhc->quirk_incorrect_hostver = false;
 }
 static int sdhci_esdhc_probe(struct platform_device *pdev)
-- 
2.1.0.27.g96db324
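To see what value the workaround actually substitutes, the override can be reproduced in a few lines of standalone C. This is a sketch under assumptions, not the driver code: the macro values are my understanding of the kernel's SDHCI headers and should be checked there, with VENDOR_V_23 and SDHCI_SPEC_200 chosen to agree with the VVN=0x13 and SVN=0x1 stated in the commit message. fixup_host_version(), vendor_ver() and spec_ver() are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel's definitions; verify before relying on them. */
#define SDHCI_VENDOR_VER_MASK	0xFF00
#define SDHCI_VENDOR_VER_SHIFT	8
#define SDHCI_SPEC_VER_MASK	0x00FF
#define SDHCI_SPEC_200		1	/* SVN = 0x1 in the commit message */
#define VENDOR_V_23		0x13	/* VVN = 0x13 in the commit message */

/* Return the host version word, overridden when the quirk flag is set. */
static uint16_t fixup_host_version(uint16_t raw, bool quirk_incorrect_hostver)
{
	if (quirk_incorrect_hostver)
		return (VENDOR_V_23 << SDHCI_VENDOR_VER_SHIFT) | SDHCI_SPEC_200;
	return raw;
}

/* Split a host version word the same way esdhc_init() does. */
static uint8_t vendor_ver(uint16_t host_ver)
{
	return (host_ver & SDHCI_VENDOR_VER_MASK) >> SDHCI_VENDOR_VER_SHIFT;
}

static uint8_t spec_ver(uint16_t host_ver)
{
	return host_ver & SDHCI_SPEC_VER_MASK;
}
```

Under these assumptions the quirk always yields the word 0x1301, whatever the hardware register reported.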
Re: [PATCH v2 1/4] uinput: Add ioctl for using monotonic/ boot times
On Thu, Oct 27, 2016 at 01:39:30PM -0700, Deepa Dinamani wrote:
> > hmm, I'm a bit confused here. This is an in-kernel bit only (passing the
> > time through uinput events has no effect). So why do we need an ioctl here?
> > it's an in-kernel decision only anyway and the time in the events sent to
> > the evdev client should be dictated by what that client sets for the clock
> > type, right?
>
> This is for input events queued by the uinput driver for the virtual
> input device.

oh, right. I thought this was in the path for uinput_write(). sorry about
that.

> This can be read through uinput_read() fops.
> I don't think anybody is doing a read on uinput nodes, so another
> option(Arnd and I considered this) could be not supporting reads on
> these nodes at all.
>
> This is not related to evdev events in the kernel.
> Currently, this timestamp could be the same format as the evdev
> timestamps or not.

I can say I've never done the read from the uinput device, never even
occurred to me. quick skim of the code looks like this only matters for
force_feedback stuff. can't really comment on that too much.

Cheers,
   Peter
[v13, 4/8] powerpc/fsl: move mpc85xx.h to include/linux/fsl
Move mpc85xx.h to include/linux/fsl and rename it to svr.h as a common
header file. This SVR numberspace is used on some ARM chips as well as
PPC, and even to check for a PPC SVR multi-arch drivers would otherwise
need to ifdef the header inclusion and all references to the SVR symbols.

Signed-off-by: Yangbo Lu
Acked-by: Wolfram Sang
Acked-by: Stephen Boyd
Acked-by: Joerg Roedel
[scottwood: update description]
Signed-off-by: Scott Wood
---
Changes for v2:
	- None
Changes for v3:
	- None
Changes for v4:
	- None
Changes for v5:
	- Changed to Move mpc85xx.h to include/linux/fsl/
	- Adjusted '#include ' position in file
Changes for v6:
	- None
Changes for v7:
	- Added 'Acked-by: Wolfram Sang' for I2C part
	- Also applied to arch/powerpc/kernel/cpu_setup_fsl_booke.S
Changes for v8:
	- Added 'Acked-by: Stephen Boyd' for clk part
	- Added 'Acked-by: Scott Wood'
	- Added 'Acked-by: Joerg Roedel' for iommu part
Changes for v9:
	- None
Changes for v10:
	- None
Changes for v11:
	- Updated description by Scott
Changes for v12:
	- None
Changes for v13:
	- None
---
 arch/powerpc/kernel/cpu_setup_fsl_booke.S                     | 2 +-
 arch/powerpc/sysdev/fsl_pci.c                                 | 2 +-
 drivers/clk/clk-qoriq.c                                       | 3 +--
 drivers/i2c/busses/i2c-mpc.c                                  | 2 +-
 drivers/iommu/fsl_pamu.c                                      | 3 +--
 drivers/net/ethernet/freescale/gianfar.c                      | 2 +-
 arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h | 4 ++--
 7 files changed, 8 insertions(+), 10 deletions(-)
 rename arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h (97%)

diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
index 462aed9..2b0284e 100644
--- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
+++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
@@ -13,13 +13,13 @@
  *
  */
+#include
 #include
 #include
 #include
 #include
 #include
 #include
-#include
 _GLOBAL(__e500_icache_setup)
 	mfspr	r0, SPRN_L1CSR1
diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index d3a5974..cb0efea 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -37,7 +38,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
diff --git a/drivers/clk/clk-qoriq.c b/drivers/clk/clk-qoriq.c
index 20b1055..dc778e8 100644
--- a/drivers/clk/clk-qoriq.c
+++ b/drivers/clk/clk-qoriq.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1153,8 +1154,6 @@ static struct clk *clockgen_clk_get(struct of_phandle_args *clkspec, void *data)
 }
 #ifdef CONFIG_PPC
-#include
-
 static const u32 a4510_svrs[] __initconst = {
 	(SVR_P2040 << 8) | 0x10,	/* P2040 1.0 */
 	(SVR_P2040 << 8) | 0x11,	/* P2040 1.1 */
diff --git a/drivers/i2c/busses/i2c-mpc.c b/drivers/i2c/busses/i2c-mpc.c
index 565a49a..e791c51 100644
--- a/drivers/i2c/busses/i2c-mpc.c
+++ b/drivers/i2c/busses/i2c-mpc.c
@@ -27,9 +27,9 @@
 #include
 #include
 #include
+#include
 #include
-#include
 #include
 #define DRV_NAME "mpc-i2c"
diff --git a/drivers/iommu/fsl_pamu.c b/drivers/iommu/fsl_pamu.c
index a34355f..af8fb27 100644
--- a/drivers/iommu/fsl_pamu.c
+++ b/drivers/iommu/fsl_pamu.c
@@ -21,11 +21,10 @@
 #include "fsl_pamu.h"
 #include
+#include
 #include
 #include
-#include
-
 /* define indexes for each operation mapping scenario */
 #define OMI_QMAN	0x00
 #define OMI_FMAN	0x01
diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 4b4f5bc..55be5ce 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -86,11 +86,11 @@
 #include
 #include
 #include
+#include
 #include
 #ifdef CONFIG_PPC
 #include
-#include
 #endif
 #include
 #include
diff --git a/arch/powerpc/include/asm/mpc85xx.h b/include/linux/fsl/svr.h
similarity index 97%
rename from arch/powerpc/include/asm/mpc85xx.h
rename to include/linux/fsl/svr.h
index 213f3a8..8d13836 100644
--- a/arch/powerpc/include/asm/mpc85xx.h
+++ b/include/linux/fsl/svr.h
@@ -9,8 +9,8 @@
  * (at your option) any later version.
  */
-#ifndef __ASM_PPC_MPC85XX_H
-#define __ASM_PPC_MPC85XX_H
+#ifndef FSL_SVR_H
+#define FSL_SVR_H
 #define SVR_REV(svr)	((svr) & 0xFF)		/* SOC design resision */
 #define SVR_MAJ(svr)	(((svr) >> 4) & 0xF)	/* Major revision field*/
-- 
2.1.0.27.g96db324
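As a rough illustration of how the SVR helpers in the renamed header are used, the sketch below decodes the revision byte of an SVR value. SVR_REV and SVR_MAJ are copied from the hunk above; SVR_MIN is not visible in that hunk and is assumed here to be the low nibble; EXAMPLE_SOC_ID is a made-up value, not a real SoC identifier; svr_revision_string() is a hypothetical helper that follows the "%d.%d" convention used elsewhere in this series.

```c
#include <stdint.h>
#include <stdio.h>

#define SVR_REV(svr)	((svr) & 0xFF)		/* from the patch */
#define SVR_MAJ(svr)	(((svr) >> 4) & 0xF)	/* from the patch */
#define SVR_MIN(svr)	(((svr) >> 0) & 0xF)	/* assumed, not shown in the hunk */

#define EXAMPLE_SOC_ID	0x123456u		/* hypothetical SoC identifier */

/* Format the revision byte of an SVR as "major.minor". */
static void svr_revision_string(uint32_t svr, char *buf, size_t len)
{
	snprintf(buf, len, "%d.%d", (int)SVR_MAJ(svr), (int)SVR_MIN(svr));
}
```

This follows the table in clk-qoriq.c above, where a SoC identifier shifted left by 8 and OR-ed with 0x10 or 0x11 is commented as revision 1.0 or 1.1.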
[v13, 4/8] powerpc/fsl: move mpc85xx.h to include/linux/fsl
Move mpc85xx.h to include/linux/fsl and rename it to svr.h as a common
header file. This SVR numberspace is used on some ARM chips as well as
PPC, and even to check for a PPC SVR multi-arch drivers would otherwise
need to ifdef the header inclusion and all references to the SVR symbols.

Signed-off-by: Yangbo Lu
Acked-by: Wolfram Sang
Acked-by: Stephen Boyd
Acked-by: Joerg Roedel
[scottwood: update description]
Signed-off-by: Scott Wood
---
Changes for v2:
	- None
Changes for v3:
	- None
Changes for v4:
	- None
Changes for v5:
	- Changed to Move mpc85xx.h to include/linux/fsl/
	- Adjusted '#include ' position in file
Changes for v6:
	- None
Changes for v7:
	- Added 'Acked-by: Wolfram Sang' for I2C part
	- Also applied to arch/powerpc/kernel/cpu_setup_fsl_booke.S
Changes for v8:
	- Added 'Acked-by: Stephen Boyd' for clk part
	- Added 'Acked-by: Scott Wood'
	- Added 'Acked-by: Joerg Roedel' for iommu part
Changes for v9:
	- None
Changes for v10:
	- None
Changes for v11:
	- Updated description by Scott
Changes for v12:
	- None
Changes for v13:
	- None
---
 arch/powerpc/kernel/cpu_setup_fsl_booke.S                     | 2 +-
 arch/powerpc/sysdev/fsl_pci.c                                 | 2 +-
 drivers/clk/clk-qoriq.c                                       | 3 +--
 drivers/i2c/busses/i2c-mpc.c                                  | 2 +-
 drivers/iommu/fsl_pamu.c                                      | 3 +--
 drivers/net/ethernet/freescale/gianfar.c                      | 2 +-
 arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h | 4 ++--
 7 files changed, 8 insertions(+), 10 deletions(-)
 rename arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h (97%)

diff --git a/arch/powerpc/kernel/cpu_setup_fsl_booke.S b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
index 462aed9..2b0284e 100644
--- a/arch/powerpc/kernel/cpu_setup_fsl_booke.S
+++ b/arch/powerpc/kernel/cpu_setup_fsl_booke.S
@@ -13,13 +13,13 @@
  *
  */

+#include
 #include
 #include
 #include
 #include
 #include
 #include
-#include

 _GLOBAL(__e500_icache_setup)
 	mfspr	r0, SPRN_L1CSR1
diff --git a/arch/powerpc/sysdev/fsl_pci.c b/arch/powerpc/sysdev/fsl_pci.c
index d3a5974..cb0efea 100644
--- a/arch/powerpc/sysdev/fsl_pci.c
+++ b/arch/powerpc/sysdev/fsl_pci.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -37,7 +38,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
diff --git a/drivers/clk/clk-qoriq.c b/drivers/clk/clk-qoriq.c
index 20b1055..dc778e8 100644
--- a/drivers/clk/clk-qoriq.c
+++ b/drivers/clk/clk-qoriq.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1153,8 +1154,6 @@ static struct clk *clockgen_clk_get(struct of_phandle_args *clkspec, void *data)
 }

 #ifdef CONFIG_PPC
-#include
-
 static const u32 a4510_svrs[] __initconst = {
 	(SVR_P2040 << 8) | 0x10,	/* P2040 1.0 */
 	(SVR_P2040 << 8) | 0x11,	/* P2040 1.1 */
diff --git a/drivers/i2c/busses/i2c-mpc.c b/drivers/i2c/busses/i2c-mpc.c
index 565a49a..e791c51 100644
--- a/drivers/i2c/busses/i2c-mpc.c
+++ b/drivers/i2c/busses/i2c-mpc.c
@@ -27,9 +27,9 @@
 #include
 #include
 #include
+#include

 #include
-#include
 #include

 #define DRV_NAME "mpc-i2c"
diff --git a/drivers/iommu/fsl_pamu.c b/drivers/iommu/fsl_pamu.c
index a34355f..af8fb27 100644
--- a/drivers/iommu/fsl_pamu.c
+++ b/drivers/iommu/fsl_pamu.c
@@ -21,11 +21,10 @@
 #include "fsl_pamu.h"

 #include
+#include
 #include
 #include

-#include
-
 /* define indexes for each operation mapping scenario */
 #define OMI_QMAN	0x00
 #define OMI_FMAN	0x01
diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 4b4f5bc..55be5ce 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -86,11 +86,11 @@
 #include
 #include
 #include
+#include

 #include
 #ifdef CONFIG_PPC
 #include
-#include
 #endif
 #include
 #include
diff --git a/arch/powerpc/include/asm/mpc85xx.h b/include/linux/fsl/svr.h
similarity index 97%
rename from arch/powerpc/include/asm/mpc85xx.h
rename to include/linux/fsl/svr.h
index 213f3a8..8d13836 100644
--- a/arch/powerpc/include/asm/mpc85xx.h
+++ b/include/linux/fsl/svr.h
@@ -9,8 +9,8 @@
  * (at your option) any later version.
  */

-#ifndef __ASM_PPC_MPC85XX_H
-#define __ASM_PPC_MPC85XX_H
+#ifndef FSL_SVR_H
+#define FSL_SVR_H

 #define SVR_REV(svr)	((svr) & 0xFF)		/* SOC design resision */
 #define SVR_MAJ(svr)	(((svr) >> 4) & 0xF)	/* Major revision field*/
-- 
2.1.0.27.g96db324
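For readers unfamiliar with the SVR macros that this header carries, a minimal user-space sketch of how a revision byte might be decoded follows. SVR_REV and SVR_MAJ are copied from the hunk above; SVR_MIN, the helper svr_is_rev() and the sample value are assumptions made for illustration and are not shown in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the context lines of the rename hunk above. */
#define SVR_REV(svr)	((svr) & 0xFF)		/* SOC design revision */
#define SVR_MAJ(svr)	(((svr) >> 4) & 0xF)	/* Major revision field */
/* Assumed for this sketch; not visible in the patch context. */
#define SVR_MIN(svr)	(((svr) >> 0) & 0xF)	/* Minor revision field */

/* Hypothetical helper: does this SVR describe silicon revision maj.min? */
static int svr_is_rev(uint32_t svr, unsigned int maj, unsigned int min)
{
	return SVR_MAJ(svr) == maj && SVR_MIN(svr) == min;
}

/* Self-check with a made-up SVR whose low byte is 0x11 (revision 1.1). */
static void svr_selftest(void)
{
	uint32_t svr = 0xABCDEF11u;

	assert(SVR_REV(svr) == 0x11);
	assert(SVR_MAJ(svr) == 1);
	assert(SVR_MIN(svr) == 1);
	assert(svr_is_rev(svr, 1, 1));
}
```

The a4510_svrs[] table in clk-qoriq.c builds its entries the other way round, by shifting an SoC identifier left by 8 and OR-ing in the revision byte.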
[PATCH v6 05/11] s390/spinlock: Provide vcpu_is_preempted
From: Christian Borntraeger

This implements the s390 backend for commit
"kernel/sched: introduce vcpu preempted check interface"
by reworking the existing smp_vcpu_scheduled into
arch_vcpu_is_preempted. We can then also get rid of the
local cpu_is_preempted function by moving the
CIF_ENABLED_WAIT test into arch_vcpu_is_preempted.

Signed-off-by: Christian Borntraeger
Acked-by: Heiko Carstens
---
 arch/s390/include/asm/spinlock.h | 8
 arch/s390/kernel/smp.c           | 9 +++--
 arch/s390/lib/spinlock.c         | 25 -
 3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h
index 7e9e09f..7ecd890 100644
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -23,6 +23,14 @@ _raw_compare_and_swap(unsigned int *lock, unsigned int old, unsigned int new)
 	return __sync_bool_compare_and_swap(lock, old, new);
 }

+#ifndef CONFIG_SMP
+static inline bool arch_vcpu_is_preempted(int cpu) { return false; }
+#else
+bool arch_vcpu_is_preempted(int cpu);
+#endif
+
+#define vcpu_is_preempted arch_vcpu_is_preempted
+
 /*
  * Simple spin lock operations. There are two variants, one clears IRQ's
  * on the local processor, one does not.
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 35531fe..b988ed1 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -368,10 +368,15 @@ int smp_find_processor_id(u16 address)
 	return -1;
 }

-int smp_vcpu_scheduled(int cpu)
+bool arch_vcpu_is_preempted(int cpu)
 {
-	return pcpu_running(pcpu_devices + cpu);
+	if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
+		return false;
+	if (pcpu_running(pcpu_devices + cpu))
+		return false;
+	return true;
 }
+EXPORT_SYMBOL(arch_vcpu_is_preempted);

 void smp_yield_cpu(int cpu)
 {
diff --git a/arch/s390/lib/spinlock.c b/arch/s390/lib/spinlock.c
index e5f50a7..e48a48e 100644
--- a/arch/s390/lib/spinlock.c
+++ b/arch/s390/lib/spinlock.c
@@ -37,15 +37,6 @@ static inline void _raw_compare_and_delay(unsigned int *lock, unsigned int old)
 	asm(".insn rsy,0xeb22,%0,0,%1" : : "d" (old), "Q" (*lock));
 }

-static inline int cpu_is_preempted(int cpu)
-{
-	if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
-		return 0;
-	if (smp_vcpu_scheduled(cpu))
-		return 0;
-	return 1;
-}
-
 void arch_spin_lock_wait(arch_spinlock_t *lp)
 {
 	unsigned int cpu = SPINLOCK_LOCKVAL;
@@ -62,7 +53,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
 			continue;
 		}
 		/* First iteration: check if the lock owner is running. */
-		if (first_diag && cpu_is_preempted(~owner)) {
+		if (first_diag && arch_vcpu_is_preempted(~owner)) {
 			smp_yield_cpu(~owner);
 			first_diag = 0;
 			continue;
@@ -81,7 +72,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
 		 * yield the CPU unconditionally. For LPAR rely on the
 		 * sense running status.
 		 */
-		if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+		if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
 			smp_yield_cpu(~owner);
 			first_diag = 0;
 		}
@@ -108,7 +99,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags)
 			continue;
 		}
 		/* Check if the lock owner is running. */
-		if (first_diag && cpu_is_preempted(~owner)) {
+		if (first_diag && arch_vcpu_is_preempted(~owner)) {
 			smp_yield_cpu(~owner);
 			first_diag = 0;
 			continue;
@@ -127,7 +118,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags)
 		 * yield the CPU unconditionally. For LPAR rely on the
 		 * sense running status.
 		 */
-		if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+		if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
 			smp_yield_cpu(~owner);
 			first_diag = 0;
 		}
@@ -165,7 +156,7 @@ void _raw_read_lock_wait(arch_rwlock_t *rw)
 	owner = 0;
 	while (1) {
 		if (count-- <= 0) {
-			if (owner && cpu_is_preempted(~owner))
+			if (owner && arch_vcpu_is_preempted(~owner))
 				smp_yield_cpu(~owner);
 			count = spin_retry;
 		}
@@ -211,7 +202,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw, unsigned int prev)
 	owner = 0;
 	while (1) {
 		if (count-- <= 0) {
-			if (owner && cpu_is_preempted(~owner))
+
[PATCH v6 07/11] KVM: Introduce kvm_write_guest_offset_cached
It allows us to update some status or field of one struct partially.

We can also save one kvm_read_guest_cached if we just update one field
of the struct regardless of its current value.

Signed-off-by: Pan Xinhui
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 20 ++--
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01c0b9c..6f00237 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -645,6 +645,8 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
 		    unsigned long len);
 int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			   void *data, unsigned long len);
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, int offset, unsigned long len);
 int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 			      gpa_t gpa, unsigned long len);
 int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2907b7b..95308ee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1972,30 +1972,38 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
 }
 EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);

-int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
-			   void *data, unsigned long len)
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, int offset, unsigned long len)
 {
 	struct kvm_memslots *slots = kvm_memslots(kvm);
 	int r;
+	gpa_t gpa = ghc->gpa + offset;

-	BUG_ON(len > ghc->len);
+	BUG_ON(len + offset > ghc->len);

 	if (slots->generation != ghc->generation)
 		kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa, ghc->len);

 	if (unlikely(!ghc->memslot))
-		return kvm_write_guest(kvm, ghc->gpa, data, len);
+		return kvm_write_guest(kvm, gpa, data, len);

 	if (kvm_is_error_hva(ghc->hva))
 		return -EFAULT;

-	r = __copy_to_user((void __user *)ghc->hva, data, len);
+	r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
 	if (r)
 		return -EFAULT;
-	mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+	mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);

 	return 0;
 }
+EXPORT_SYMBOL_GPL(kvm_write_guest_offset_cached);
+
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+			   void *data, unsigned long len)
+{
+	return kvm_write_guest_offset_cached(kvm, ghc, data, 0, len);
+}
 EXPORT_SYMBOL_GPL(kvm_write_guest_cached);

 int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
-- 
2.4.11
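The shape of the change above (a general offset variant, with the old entry point kept as a thin wrapper passing offset 0) can be modelled in user space roughly as follows. This is only a sketch: struct region_cache and both function names are hypothetical stand-ins, a plain buffer replaces guest memory, and the out-of-range case returns -1 where the patch uses BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct gfn_to_hva_cache: a plain buffer. */
struct region_cache {
	unsigned char *base;	/* plays the role of ghc->hva */
	size_t len;		/* plays the role of ghc->len */
};

/* Write len bytes at base + offset; reject writes past the cached region. */
static int write_offset_cached(struct region_cache *rc, const void *data,
			       size_t offset, size_t len)
{
	assert(rc && rc->base && data);

	/* The patch uses BUG_ON(len + offset > ghc->len) for this case. */
	if (offset > rc->len || len > rc->len - offset)
		return -1;

	memcpy(rc->base + offset, data, len);
	return 0;
}

/* The pre-existing entry point becomes a wrapper with offset 0. */
static int write_cached(struct region_cache *rc, const void *data, size_t len)
{
	return write_offset_cached(rc, data, 0, len);
}
```

Updating a single byte in the middle of the region then touches only that byte, which is what lets a caller skip the read-modify-write of the whole struct.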
[PATCH v6 03/11] kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
An over-committed guest with more vCPUs than pCPUs suffers heavy overhead
in the two spin_on_owner functions. This is caused by the lock holder
preemption issue.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see
whether a vCPU is currently running or not. So break the spin loops when
it returns true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

after patch:
 9.99%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 5.28%  sched-messaging  [unknown]         [H] 0xc00768e0
 4.27%  sched-messaging  [kernel.vmlinux]  [k] __copy_tofrom_user_power7
 3.77%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 3.24%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.02%  sched-messaging  [kernel.vmlinux]  [k] system_call
 2.69%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task

Signed-off-by: Pan Xinhui
Acked-by: Christian Borntraeger
Tested-by: Juergen Gross
---
 kernel/locking/mutex.c      | 15 +--
 kernel/locking/rwsem-xadd.c | 16 +---
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..82108f5 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,13 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
 		 */
 		barrier();

-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * Use vcpu_is_preempted to detech lock holder preemption issue
+		 * and break. vcpu_is_preempted is a macro defined by false if
+		 * arch does not support vcpu preempted check,
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			ret = false;
 			break;
 		}
@@ -261,8 +267,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)

 	rcu_read_lock();
 	owner = READ_ONCE(lock->owner);
+
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
 	if (owner)
-		retval = owner->on_cpu;
+		retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 	rcu_read_unlock();
 	/*
 	 * if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..0897179 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 		goto done;
 	}

-	ret = owner->on_cpu;
+	/*
+	 * As lock holder preemption issue, we both skip spinning if task is not
+	 * on cpu or its cpu is preempted
+	 */
+	ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
 done:
 	rcu_read_unlock();
 	return ret;
@@ -362,8 +366,14 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 		 */
 		barrier();

-		/* abort spinning when need_resched or owner is not running */
-		if (!owner->on_cpu || need_resched()) {
+		/*
+		 * abort spinning when need_resched or owner is not running or
+		 * owner's cpu is preempted. vcpu_is_preempted is a macro
+		 * defined by false if arch does not support vcpu preempted
+		 * check
+		 */
+		if (!owner->on_cpu || need_resched() ||
+				vcpu_is_preempted(task_cpu(owner))) {
 			rcu_read_unlock();
 			return false;
 		}
-- 
2.4.11
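The two conditions this patch changes can be restated as plain predicates, which may make the hunks above easier to review. The sketch below is a user-space model only: struct owner_state and both function names are hypothetical, and the boolean fields stand for the values the kernel samples through owner->on_cpu and vcpu_is_preempted(task_cpu(owner)).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of what the spin loop samples on each iteration. */
struct owner_state {
	bool on_cpu;		/* owner->on_cpu */
	bool vcpu_preempted;	/* vcpu_is_preempted(task_cpu(owner)) */
};

/* Mirrors the patched loop condition: true means "stop spinning". */
static bool should_stop_spinning(const struct owner_state *o, bool need_resched)
{
	assert(o);
	return !o->on_cpu || need_resched || o->vcpu_preempted;
}

/* Mirrors the patched *_can_spin_on_owner() result. */
static bool can_spin_on_owner(const struct owner_state *o)
{
	assert(o);
	return o->on_cpu && !o->vcpu_preempted;
}
```

The case that changes behaviour is an owner that is still marked on_cpu while its vCPU has been preempted: before the patch the waiter kept spinning, and with the extra term it gives up.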
[PATCH v6 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself.
Currently the kernel has an interface, bool vcpu_is_preempted(int cpu).
It takes the cpu as a parameter and returns true if the cpu is preempted.
The kernel can then break the spin loops based on the return value of
vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

To deal with kernel and kvm/xen, add vcpu_is_preempted into struct
pv_lock_ops.

Then kvm or xen could provide their own implementation to support
vcpu_is_preempted.

Signed-off-by: Pan Xinhui
---
 arch/x86/include/asm/paravirt_types.h | 2 ++
 arch/x86/include/asm/spinlock.h       | 8
 arch/x86/kernel/paravirt-spinlocks.c  | 6 ++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..38c3bb7 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -310,6 +310,8 @@ struct pv_lock_ops {

 	void (*wait)(u8 *ptr, u8 val);
 	void (*kick)(int cpu);
+
+	bool (*vcpu_is_preempted)(int cpu);
 };

 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..0526f59 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);

+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return pv_lock_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
 #include

 /*
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 2c55a00..2f204dd 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -21,12 +21,18 @@ bool pv_is_native_spin_unlock(void)
 		__raw_callee_save___native_queued_spin_unlock;
 }

+static bool native_vcpu_is_preempted(int cpu)
+{
+	return 0;
+}
+
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
 	.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
 	.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
 	.wait = paravirt_nop,
 	.kick = paravirt_nop,
+	.vcpu_is_preempted = native_vcpu_is_preempted,
 #endif /* SMP */
 };
 EXPORT_SYMBOL(pv_lock_ops);
-- 
2.4.11
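The dispatch pattern used here (an ops table with a native default that a hypervisor backend may overwrite at init time) can be sketched outside the kernel as below. struct lock_ops is a trimmed-down, hypothetical stand-in for struct pv_lock_ops, and demo_vcpu_is_preempted() is a made-up backend that exists only to show the override.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for struct pv_lock_ops. */
struct lock_ops {
	bool (*vcpu_is_preempted)(int cpu);
};

/* Bare-metal default: a CPU is never reported as preempted. */
static bool native_vcpu_is_preempted(int cpu)
{
	(void)cpu;
	return false;
}

static struct lock_ops lock_ops = {
	.vcpu_is_preempted = native_vcpu_is_preempted,
};

/* Made-up backend for the demo: pretend odd-numbered CPUs are preempted. */
static bool demo_vcpu_is_preempted(int cpu)
{
	return (cpu & 1) != 0;
}

/* Callers go through the table, as the inline wrapper in the patch does. */
static bool vcpu_is_preempted(int cpu)
{
	assert(lock_ops.vcpu_is_preempted);
	return lock_ops.vcpu_is_preempted(cpu);
}
```

A backend would install its hook once during initialisation, in the same way the Xen patch in this series assigns pv_lock_ops.vcpu_is_preempted.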
[PATCH v6 08/11] x86, kvm/x86.c: support vcpu preempted check
Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

Use one field of struct kvm_steal_time, ::preempted, to indicate whether
a vcpu is running or not.

Signed-off-by: Pan Xinhui
---
 arch/x86/include/uapi/asm/kvm_para.h |  4 +++-
 arch/x86/kvm/x86.c                   | 16
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..1421a65 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,9 @@ struct kvm_steal_time {
 	__u64 steal;
 	__u32 version;
 	__u32 flags;
-	__u32 pad[12];
+	__u8 preempted;
+	__u8 u8_pad[3];
+	__u32 pad[11];
 };

 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e375235..f06e115 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
 		return;

+	vcpu->arch.st.steal.preempted = 0;
+
 	if (vcpu->arch.st.steal.version & 1)
 		vcpu->arch.st.steal.version += 1; /* first time write, random junk */

@@ -2810,8 +2812,22 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
 }

+static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
+{
+	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+		return;
+
+	vcpu->arch.st.steal.preempted = 1;
+
+	kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
+			&vcpu->arch.st.steal.preempted,
+			offsetof(struct kvm_steal_time, preempted),
+			sizeof(vcpu->arch.st.steal.preempted));
+}
+
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+	kvm_steal_time_set_preempted(vcpu);
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();
-- 
2.4.11
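Because struct kvm_steal_time is shared with the guest, the new fields have to fit inside one of the old pad words without moving anything else. A quick way to convince oneself of that is a user-space mirror of the layout with compile-time checks, as sketched below. struct steal_time and steal_time_set_preempted() are hypothetical names, and the stdint types stand in for the kernel's __u64/__u32/__u8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of the patched struct, using fixed-width stdint types. */
struct steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/* The new fields must consume exactly one of the former 12 pad words. */
_Static_assert(sizeof(struct steal_time) == 64, "size must stay 64 bytes");
_Static_assert(offsetof(struct steal_time, preempted) == 16,
	       "preempted directly follows flags");
_Static_assert(offsetof(struct steal_time, pad) == 20,
	       "pad starts right after u8_pad");

/* Model of the host-side updates: 1 on vcpu put, 0 when steal time is recorded. */
static void steal_time_set_preempted(struct steal_time *st, int preempted)
{
	assert(st);
	st->preempted = preempted ? 1 : 0;
}
```

The offset checked here is the value that offsetof(struct kvm_steal_time, preempted) hands to kvm_write_guest_offset_cached() in the hunk above, so only that one byte is written back to the guest.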
[PATCH v6 04/11] powerpc/spinlock: support vcpu preempted check
This is to fix some lock holder preemption issues. Some other lock
implementations do a spin loop before acquiring the lock itself.
Currently the kernel has an interface, bool vcpu_is_preempted(int cpu).
It takes the cpu as a parameter and returns true if the cpu is preempted.
The kernel can then break the spin loops based on the return value of
vcpu_is_preempted.

As the kernel already uses this interface, let's support it.

Only pSeries needs to support it. powerNV is built into the same kernel
image as pSeries, so we need to return false if we are running as
powerNV. Another fact is that lppaca->yield_count stays zero on powerNV,
so we can just skip the machine type check.

Suggested-by: Boqun Feng
Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
---
 arch/powerpc/include/asm/spinlock.h | 8
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index fa37fe9..8c1b913 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
 #define SYNC_IO
 #endif

+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
 static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 	return lock.slock == 0;
-- 
2.4.11
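The check in this patch boils down to "convert the big-endian yield count and test its low bit". A portable sketch of that test is given below; load_be32() is a hypothetical helper standing in for be32_to_cpu() applied to the raw field, and the byte arrays are made-up sample values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Portable big-endian load, standing in for be32_to_cpu() on the raw field. */
static uint32_t load_be32(const unsigned char b[4])
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Odd yield count: the vCPU is reported as preempted, as in the patch. */
static bool yield_count_is_preempted(const unsigned char yield_count_be[4])
{
	assert(yield_count_be);
	return (load_be32(yield_count_be) & 1) != 0;
}
```

A count that stays at zero, which the commit message says is the case on powerNV, is even and therefore never reports preemption. That is why the patch can skip the machine type check.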
[PATCH v6 11/11] Documentation: virtual: kvm: Support vcpu preempted check
Commit ("x86, kvm: support vcpu preempted check") adds one field,
"__u8 preempted", to struct kvm_steal_time. This field tells whether a
vcpu is running or not.

It is zero if 1) an old KVM does not support this field, or 2) the vcpu
is not preempted. Any other value means the vcpu has been preempted.

Signed-off-by: Pan Xinhui
Acked-by: Radim Krčmář
---
 Documentation/virtual/kvm/msr.txt | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..ab2ab76 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,9 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		__u64 steal;
 		__u32 version;
 		__u32 flags;
-		__u32 pad[12];
+		__u8 preempted;
+		__u8 u8_pad[3];
+		__u32 pad[11];
 	}
 
 	whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +234,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
 		nanoseconds. Time during which the vcpu is idle, will not be
 		reported as steal time.
 
+		preempted: indicate the VCPU who owns this struct is running or
+		not. Non-zero values mean the VCPU has been preempted. Zero
+		means the VCPU is not preempted. NOTE, it is always zero if the
+		the hypervisor doesn't support this field.
+
 MSR_KVM_EOI_EN: 0x4b564d04
 	data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
 	when disabled. Bit 1 is reserved and must be zero. When PV end of
-- 
2.4.11
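The layout change documented above carves the new byte out of the old padding, so the structure size should not change. A minimal userspace sketch of that check follows; struct steal_time_old and struct steal_time_new are hypothetical copies using <stdint.h> types in place of __u8/__u32/__u64, not the kernel's uapi definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout before the patch: 12 words of padding. */
struct steal_time_old {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

/* Layout after the patch: one padding word becomes preempted + u8_pad. */
struct steal_time_new {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/* Compile-time checks: same size, and the new field sits where pad[0] was. */
_Static_assert(sizeof(struct steal_time_old) == sizeof(struct steal_time_new),
	       "size must not change");
_Static_assert(offsetof(struct steal_time_new, preempted) ==
	       offsetof(struct steal_time_old, pad),
	       "preempted must reuse the first padding word");
```

A hypervisor that never writes the field leaves what used to be padding untouched, which is consistent with the documented rule that zero also covers hypervisors without support.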
[PATCH v6 10/11] x86, xen: support vcpu preempted check
From: Juergen Gross

Support the vcpu_is_preempted() functionality under Xen. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

A quick test (4 vcpus on 1 physical cpu doing a parallel build job
with "make -j 8") reduced system time by about 5% with this patch.

Signed-off-by: Juergen Gross
Signed-off-by: Pan Xinhui
---
 arch/x86/xen/spinlock.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3d6e006..74756bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
 	per_cpu(irq_name, cpu) = NULL;
 }
 
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
 	pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
 	pv_lock_ops.wait = xen_qlock_wait;
 	pv_lock_ops.kick = xen_qlock_kick;
+
+	pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
 }
 
 /*
-- 
2.4.11
[PATCH v6 02/11] locking/osq: Drop the overload of osq_lock()
An over-committed guest with more vCPUs than pCPUs has heavy overhead in
osq_lock().

This is because vCPU A holds the osq lock and yields out, while vCPU B
waits for the per_cpu node->locked to be set. IOW, vCPU B waits for
vCPU A to run and unlock the osq lock.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see
whether a vCPU is currently running or not. So break out of the spin
loop when it returns true.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

after patch:
20.68%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner
 8.45%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 4.12%  sched-messaging  [kernel.vmlinux]  [k] system_call
 3.01%  sched-messaging  [kernel.vmlinux]  [k] system_call_common
 2.83%  sched-messaging  [kernel.vmlinux]  [k] copypage_power7
 2.64%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 2.00%  sched-messaging  [kernel.vmlinux]  [k] osq_lock

Suggested-by: Boqun Feng
Signed-off-by: Pan Xinhui
Acked-by: Christian Borntraeger
Tested-by: Juergen Gross
---
 kernel/locking/osq_lock.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..39d1385 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
 	return cpu_nr + 1;
 }
 
+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+	return node->cpu - 1;
+}
+
 static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
 {
 	int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 	while (!READ_ONCE(node->locked)) {
 		/*
 		 * If we need to reschedule bail... so we can block.
+		 * Use vcpu_is_preempted to detech lock holder preemption issue
+		 * and break. vcpu_is_preempted is a macro defined by false if
+		 * arch does not support vcpu preempted check,
 		 */
-		if (need_resched())
+		if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
 			goto unqueue;
 
 		cpu_relax_lowlatency();
-- 
2.4.11
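To make the new bail-out condition in the osq_lock() patch above easier to follow, here is a small userspace sketch. It is not the kernel code: struct spin_node_sketch, preempted_sketch[] and should_unqueue() are hypothetical stand-ins, and the preemption state comes from a plain array instead of a hypervisor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4

/* Hypothetical per-CPU preemption flags; a real guest gets this from the hypervisor. */
static bool preempted_sketch[NR_CPUS_SKETCH];

struct spin_node_sketch {
	struct spin_node_sketch *prev;	/* waiter queued ahead of us */
	int cpu;			/* encoded CPU number: cpu_nr + 1 */
};

/* Same arithmetic as encode_cpu() in the patch context. */
static int encode_cpu(int cpu_nr)
{
	return cpu_nr + 1;
}

/* Same arithmetic as the node_cpu() helper added by the patch. */
static int node_cpu(const struct spin_node_sketch *node)
{
	return node->cpu - 1;
}

/*
 * Mirrors the new loop condition: stop spinning when a reschedule is due
 * or when the CPU of the previous waiter is currently preempted.
 */
static bool should_unqueue(const struct spin_node_sketch *node, bool need_resched)
{
	assert(node->prev != NULL);
	return need_resched || preempted_sketch[node_cpu(node->prev)];
}
```

The design point is that the waiter checks the CPU of the node ahead of it in the queue, since that is the vCPU it is ultimately waiting on.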
[PATCH v6 09/11] x86, kernel/kvm.c: support vcpu preempted check
Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

After commit ("x86, kvm/x86.c: support vcpu preempted check"),
struct kvm_steal_time::preempted indicates whether a vcpu is running
or not.

unix benchmark result:
host:  kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

	test-case                              | after-patch     | before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
Process Creation                       |    29881.2 lps  |    28572.8 lps
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps

Signed-off-by: Pan Xinhui
---
 arch/x86/kernel/kvm.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0b48dd2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
 	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
 }
 
+static bool kvm_vcpu_is_preempted(int cpu)
+{
+	struct kvm_steal_time *src;
+
+	src = &per_cpu(steal_time, cpu);
+
+	return !!src->preempted;
+}
+
 #ifdef CONFIG_SMP
 static void __init kvm_smp_prepare_boot_cpu(void)
 {
@@ -471,6 +480,9 @@ void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
 		has_steal_clock = 1;
 		pv_time_ops.steal_clock = kvm_steal_clock;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+		pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+#endif
 	}
 
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
-- 
2.4.11
[PATCH v6 01/11] kernel/sched: introduce vcpu preempted check interface
This patch helps fix the lock holder preemption issue.

Kernel users can call bool vcpu_is_preempted(int cpu) to detect whether
a vcpu is preempted or not.

The default implementation is a macro defined as false, so the compiler
can optimize it out if the arch does not support such a vcpu preempted
check.

Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Pan Xinhui
Acked-by: Christian Borntraeger
Tested-by: Juergen Gross
---
 include/linux/sched.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 
 #endif /* CONFIG_SMP */
 
+/*
+ * In order to deal with a various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu)	false
+#endif
+
 extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
 extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
 
-- 
2.4.11
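The fallback pattern in the sched.h patch above can be illustrated in isolation. This is only a sketch compiled in userspace with no architecture override present; should_stop_spinning() is a hypothetical caller, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Generic fallback, as in the patch: if no architecture header has defined
 * vcpu_is_preempted, every call collapses to the constant false.
 */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu) false
#endif

/* Hypothetical caller: decides whether an optimistic spin should give up. */
static bool should_stop_spinning(int owner_cpu, bool need_resched)
{
	(void)owner_cpu; /* unused when the fallback macro is active */
	return need_resched || vcpu_is_preempted(owner_cpu);
}
```

With the fallback in effect the second operand is a compile-time constant, so the check costs nothing on architectures without support. An architecture that can answer the question defines both a real function and a same-named macro before this point (the powerpc patch in this series does so), and the #ifndef then skips the fallback.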
[PATCH v6 00/11] implement vcpu preempted check
change from v5:
	split x86/kvm patch into guest/host part.
	introduce kvm_write_guest_offset_cached.
	fix some typos.
	rebase patch onto 4.9.2
change from v4:
	split x86 kvm vcpu preempted check into two patches.
	add documentation patch.
	add x86 vcpu preempted check patch under xen
	add s390 vcpu preempted check patch
change from v3:
	add x86 vcpu preempted check patch
change from v2:
	no code change, fix typos, update some comments
change from v1:
	a simpler definition of default vcpu_is_preempted
	skip machine type check on ppc, and add config. remove dedicated macro.
	add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
	add more comments
	thanks boqun and Peter's suggestion.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
 5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
 3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
 3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
 3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
 2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in
some spin loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused RCU stalls before this patch set
was applied.

We have also observed some performance improvements in unix benchmark tests.

PPC test result:
1 copy  -  0.94%
2 copy  -  7.17%
4 copy  - 11.9%
8 copy  -  3.04%
16 copy - 15.11%

details below:
Without patch:
1 copy  - File Write 4096 bufsize 8000 maxblocks  2188223.0 KBps (30.0 s, 1 samples)
2 copy  - File Write 4096 bufsize 8000 maxblocks  1804433.0 KBps (30.0 s, 1 samples)
4 copy  - File Write 4096 bufsize 8000 maxblocks  1237257.0 KBps (30.0 s, 1 samples)
8 copy  - File Write 4096 bufsize 8000 maxblocks  1032658.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   768000.0 KBps (30.1 s, 1 samples)

With patch:
1 copy  - File Write 4096 bufsize 8000 maxblocks  2209189.0 KBps (30.0 s, 1 samples)
2 copy  - File Write 4096 bufsize 8000 maxblocks  1943816.0 KBps (30.0 s, 1 samples)
4 copy  - File Write 4096 bufsize 8000 maxblocks  1405591.0 KBps (30.0 s, 1 samples)
8 copy  - File Write 4096 bufsize 8000 maxblocks  1065080.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks   904762.0 KBps (30.0 s, 1 samples)

X86 test result:
	test-case                              | after-patch     | before-patch
Execl Throughput                       |    18307.9 lps  |    11701.6 lps
File Copy 1024 bufsize 2000 maxblocks  |  1352407.3 KBps |   790418.9 KBps
File Copy 256 bufsize 500 maxblocks    |   367555.6 KBps |   222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks  |  3675649.7 KBps |  1780614.4 KBps
Pipe Throughput                        | 11872208.7 lps  | 11855628.9 lps
Pipe-based Context Switching           |  1495126.5 lps  |  1490533.9 lps
Process Creation                       |    29881.2 lps  |    28572.8 lps
Shell Scripts (1 concurrent)           |    23224.3 lpm  |    22607.4 lpm
Shell Scripts (8 concurrent)           |     3531.4 lpm  |     3211.9 lpm
System Call Overhead                   | 10385653.0 lps  | 10419979.0 lps

Christian Borntraeger (1):
  s390/spinlock: Provide vcpu_is_preempted

Juergen Gross (1):
  x86, xen: support vcpu preempted check

Pan Xinhui (9):
  kernel/sched: introduce vcpu preempted check interface
  locking/osq: Drop the overload of osq_lock()
  kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
  powerpc/spinlock: support vcpu preempted check
  x86, paravirt: Add interface to support kvm/xen vcpu preempted check
  KVM: Introduce kvm_write_guest_offset_cached
  x86, kvm/x86.c: support vcpu preempted check
  x86, kernel/kvm.c: support vcpu preempted check
  Documentation: virtual: kvm: Support vcpu preempted check

 Documentation/virtual/kvm/msr.txt     |  9
 arch/powerpc/include/asm/spinlock.h   |  8
 arch/s390/include/asm/spinlock.h      |  8
 arch/s390/kernel/smp.c                |  9
 arch/s390/lib/spinlock.c              | 25
 arch/x86/include/asm/paravirt_types.h |  2
 arch/x86/include/asm/spinlock.h       |  8
 arch/x86/include/uapi/asm/kvm_para.h  |  4
 arch/x86/kernel/kvm.c                 | 12
 arch/x86/kernel/paravirt-spinlocks.c  |  6
 arch/x86/kvm/x86.c                    | 16
 arch/x86/xen/spinlock.c               |  3
 include/linux/kvm_host.h              |  2
 include/linux/sched.h                 | 12
[v13, 7/8] base: soc: introduce soc_device_match() interface
From: Arnd Bergmann

We keep running into cases where device drivers want to know the exact
version of the SoC they are currently running on. In the past, this has
usually been done through a vendor specific API that can be called by a
driver, or by directly accessing some kind of version register that is
not part of the device itself but that belongs to a global register area
of the chip.

Common reasons for doing this include:

- A machine is not using devicetree or similar for passing data about
  on-chip devices, but just announces their presence using boot-time
  platform devices, and the machine code itself does not care about the
  revision.

- There is existing firmware or boot loaders with existing DT binaries
  with generic compatible strings that do not identify the particular
  revision of each device, but the driver knows which SoC revisions
  include which part.

- A prerelease version of a chip has some quirks and we are using the
  same version of the bootloader and the DT blob on both the prerelease
  and the final version. An update of the DT binding seems inappropriate
  because that would involve maintaining multiple copies of the dts
  and/or bootloader.

This patch introduces the soc_device_match() interface that is meant to
work like of_match_node() but instead of identifying the version of a
device, it identifies the SoC itself using a vendor-agnostic interface.

Unlike of_match_node(), we do not do an exact string compare but instead
use glob_match() to allow wildcards in strings.

Signed-off-by: Arnd Bergmann
Signed-off-by: Yangbo Lu
Acked-by: Greg Kroah-Hartman
---
Changes for v11:
	- Added this patch for soc match
Changes for v12:
	- Corrected the author
	- Rewrote soc_device_match with while loop
Changes for v13:
	- Added ack from Greg
---
 drivers/base/Kconfig    |  1 +
 drivers/base/soc.c      | 66 +
 include/linux/sys_soc.h |  3 +++
 3 files changed, 70 insertions(+)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index fdf44ca..991b21e 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -235,6 +235,7 @@ config GENERIC_CPU_AUTOPROBE
 
 config SOC_BUS
 	bool
+	select GLOB
 
 source "drivers/base/regmap/Kconfig"
 
diff --git a/drivers/base/soc.c b/drivers/base/soc.c
index b63f23e..0c5cf87 100644
--- a/drivers/base/soc.c
+++ b/drivers/base/soc.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include <linux/glob.h>
 
 static DEFINE_IDA(soc_ida);
 
@@ -159,3 +160,68 @@ static int __init soc_bus_register(void)
 	return bus_register(&soc_bus_type);
 }
 core_initcall(soc_bus_register);
+
+static int soc_device_match_one(struct device *dev, void *arg)
+{
+	struct soc_device *soc_dev = container_of(dev, struct soc_device, dev);
+	const struct soc_device_attribute *match = arg;
+
+	if (match->machine &&
+	    !glob_match(match->machine, soc_dev->attr->machine))
+		return 0;
+
+	if (match->family &&
+	    !glob_match(match->family, soc_dev->attr->family))
+		return 0;
+
+	if (match->revision &&
+	    !glob_match(match->revision, soc_dev->attr->revision))
+		return 0;
+
+	if (match->soc_id &&
+	    !glob_match(match->soc_id, soc_dev->attr->soc_id))
+		return 0;
+
+	return 1;
+}
+
+/*
+ * soc_device_match - identify the SoC in the machine
+ * @matches: zero-terminated array of possible matches
+ *
+ * returns the first matching entry of the argument array, or NULL
+ * if none of them match.
+ *
+ * This function is meant as a helper in place of of_match_node()
+ * in cases where either no device tree is available or the information
+ * in a device node is insufficient to identify a particular variant
+ * by its compatible strings or other properties. For new devices,
+ * the DT binding should always provide unique compatible strings
+ * that allow the use of of_match_node() instead.
+ *
+ * The calling function can use the .data entry of the
+ * soc_device_attribute to pass a structure or function pointer for
+ * each entry.
+ */
+const struct soc_device_attribute *soc_device_match(
+	const struct soc_device_attribute *matches)
+{
+	int ret = 0;
+
+	if (!matches)
+		return NULL;
+
+	while (!ret) {
+		if (!(matches->machine || matches->family ||
+		      matches->revision || matches->soc_id))
+			break;
+		ret = bus_for_each_dev(&soc_bus_type, NULL, (void *)matches,
+				       soc_device_match_one);
+		if (!ret)
+			matches++;
+		else
+			return matches;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(soc_device_match);
diff --git a/include/linux/sys_soc.h b/include/linux/sys_soc.h
index
[v13, 7/8] base: soc: introduce soc_device_match() interface
From: Arnd Bergmann

We keep running into cases where device drivers want to know the exact
version of the SoC they are currently running on. In the past, this has
usually been done through a vendor specific API that can be called by a
driver, or by directly accessing some kind of version register that is
not part of the device itself but that belongs to a global register area
of the chip.

Common reasons for doing this include:

- A machine is not using devicetree or similar for passing data about
  on-chip devices, but just announces their presence using boot-time
  platform devices, and the machine code itself does not care about the
  revision.

- There is existing firmware or boot loaders with existing DT binaries
  with generic compatible strings that do not identify the particular
  revision of each device, but the driver knows which SoC revisions
  include which part.

- A prerelease version of a chip has some quirks and we are using the
  same version of the bootloader and the DT blob on both the prerelease
  and the final version. An update of the DT binding seems inappropriate
  because that would involve maintaining multiple copies of the dts
  and/or bootloader.

This patch introduces the soc_device_match() interface that is meant to
work like of_match_node() but instead of identifying the version of a
device, it identifies the SoC itself using a vendor-agnostic interface.

Unlike of_match_node(), we do not do an exact string compare but instead
use glob_match() to allow wildcards in strings.

Signed-off-by: Arnd Bergmann
Signed-off-by: Yangbo Lu
Acked-by: Greg Kroah-Hartman
---
Changes for v11:
	- Added this patch for soc match
Changes for v12:
	- Corrected the author
	- Rewrote soc_device_match with while loop
Changes for v13:
	- Added ack from Greg
---
 drivers/base/Kconfig    |  1 +
 drivers/base/soc.c      | 66 +
 include/linux/sys_soc.h |  3 +++
 3 files changed, 70 insertions(+)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index fdf44ca..991b21e 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -235,6 +235,7 @@ config GENERIC_CPU_AUTOPROBE
 
 config SOC_BUS
 	bool
+	select GLOB
 
 source "drivers/base/regmap/Kconfig"
 
diff --git a/drivers/base/soc.c b/drivers/base/soc.c
index b63f23e..0c5cf87 100644
--- a/drivers/base/soc.c
+++ b/drivers/base/soc.c
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include <linux/glob.h>
 
 static DEFINE_IDA(soc_ida);
 
@@ -159,3 +160,68 @@ static int __init soc_bus_register(void)
 	return bus_register(&soc_bus_type);
 }
 core_initcall(soc_bus_register);
+
+static int soc_device_match_one(struct device *dev, void *arg)
+{
+	struct soc_device *soc_dev = container_of(dev, struct soc_device, dev);
+	const struct soc_device_attribute *match = arg;
+
+	if (match->machine &&
+	    !glob_match(match->machine, soc_dev->attr->machine))
+		return 0;
+
+	if (match->family &&
+	    !glob_match(match->family, soc_dev->attr->family))
+		return 0;
+
+	if (match->revision &&
+	    !glob_match(match->revision, soc_dev->attr->revision))
+		return 0;
+
+	if (match->soc_id &&
+	    !glob_match(match->soc_id, soc_dev->attr->soc_id))
+		return 0;
+
+	return 1;
+}
+
+/*
+ * soc_device_match - identify the SoC in the machine
+ * @matches: zero-terminated array of possible matches
+ *
+ * returns the first matching entry of the argument array, or NULL
+ * if none of them match.
+ *
+ * This function is meant as a helper in place of of_match_node()
+ * in cases where either no device tree is available or the information
+ * in a device node is insufficient to identify a particular variant
+ * by its compatible strings or other properties. For new devices,
+ * the DT binding should always provide unique compatible strings
+ * that allow the use of of_match_node() instead.
+ *
+ * The calling function can use the .data entry of the
+ * soc_device_attribute to pass a structure or function pointer for
+ * each entry.
+ */
+const struct soc_device_attribute *soc_device_match(
+	const struct soc_device_attribute *matches)
+{
+	int ret = 0;
+
+	if (!matches)
+		return NULL;
+
+	while (!ret) {
+		if (!(matches->machine || matches->family ||
+		      matches->revision || matches->soc_id))
+			break;
+		ret = bus_for_each_dev(&soc_bus_type, NULL, (void *)matches,
+				       soc_device_match_one);
+		if (!ret)
+			matches++;
+		else
+			return matches;
+	}
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(soc_device_match);
diff --git a/include/linux/sys_soc.h b/include/linux/sys_soc.h
index 2739ccb..9f5eb06 100644
--- a/include/linux/sys_soc.h
+++ b/include/linux/sys_soc.h
@@
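The matching rules in this patch can be tried out in userspace. What follows is only a rough model, not the kernel code: `struct soc_attr`, `field_matches()`, `soc_attr_match_one()` and `soc_attr_match()` are hypothetical names, POSIX `fnmatch()` stands in for the kernel's `glob_match()`, and a single `soc` argument replaces the walk over registered SoC devices done with `bus_for_each_dev()`. Unlike the patch, the sketch also treats a missing attribute on the SoC side as a non-match.

```c
#include <fnmatch.h>
#include <stddef.h>

/* Userspace stand-in for struct soc_device_attribute (hypothetical name). */
struct soc_attr {
	const char *machine;
	const char *family;
	const char *revision;
	const char *soc_id;
	const void *data;
};

/* A NULL pattern is "don't care"; otherwise the glob pattern must match. */
static int field_matches(const char *pattern, const char *value)
{
	if (!pattern)
		return 1;
	return value && fnmatch(pattern, value, 0) == 0;
}

/* All four fields must match for one table entry to match the SoC. */
static int soc_attr_match_one(const struct soc_attr *match,
			      const struct soc_attr *soc)
{
	return field_matches(match->machine, soc->machine) &&
	       field_matches(match->family, soc->family) &&
	       field_matches(match->revision, soc->revision) &&
	       field_matches(match->soc_id, soc->soc_id);
}

/*
 * Walk the table up to the sentinel entry (all four strings NULL) and
 * return the first entry that matches, or NULL if none does.
 */
static const struct soc_attr *soc_attr_match(const struct soc_attr *matches,
					     const struct soc_attr *soc)
{
	if (!matches)
		return NULL;

	for (; matches->machine || matches->family ||
	       matches->revision || matches->soc_id; matches++)
		if (soc_attr_match_one(matches, soc))
			return matches;

	return NULL;
}
```

A caller would build a table ending in an all-NULL entry and use the `data` pointer of the returned entry to select a quirk, much as the comment on `soc_device_match()` describes for `.data`.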
[v13, 3/8] dt: bindings: move guts devicetree doc out of powerpc directory
Move guts devicetree doc to Documentation/devicetree/bindings/soc/fsl/
since it is used not only by PowerPC but also by ARM. Also add a
specification for the 'little-endian' property.

Signed-off-by: Yangbo Lu
Acked-by: Rob Herring
Acked-by: Scott Wood
---
Changes for v4:
	- Added this patch
Changes for v5:
	- Modified the description for little-endian property
Changes for v6:
	- None
Changes for v7:
	- None
Changes for v8:
	- Added 'Acked-by: Scott Wood'
	- Added 'Acked-by: Rob Herring'
Changes for v9:
	- None
Changes for v10:
	- None
Changes for v11:
	- None
Changes for v12:
	- None
Changes for v13:
	- None
---
 Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt | 3 +++
 1 file changed, 3 insertions(+)
 rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%)

diff --git a/Documentation/devicetree/bindings/powerpc/fsl/guts.txt b/Documentation/devicetree/bindings/soc/fsl/guts.txt
similarity index 91%
rename from Documentation/devicetree/bindings/powerpc/fsl/guts.txt
rename to Documentation/devicetree/bindings/soc/fsl/guts.txt
index b71b203..07adca9 100644
--- a/Documentation/devicetree/bindings/powerpc/fsl/guts.txt
+++ b/Documentation/devicetree/bindings/soc/fsl/guts.txt
@@ -25,6 +25,9 @@ Recommended properties:
  - fsl,liodn-bits : Indicates the number of defined bits in the LIODN
    registers, for those SOCs that have a PAMU device.
 
+ - little-endian : Indicates that the global utilities block is little
+   endian. The default is big endian.
+
 Examples:
 	global-utilities@e {	/* global utilities block */
 		compatible = "fsl,mpc8548-guts";
-- 
2.1.0.27.g96db324
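To show what the 'little-endian' property means for a consumer of the binding, here is a small userspace sketch. It assumes the register block is available as raw bytes; `guts_read32()` is a hypothetical helper, not a function from the patch series, and a real driver would use the kernel's MMIO accessors instead.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assemble a 32-bit register value from four raw bytes.
 * little_endian mirrors the presence of the 'little-endian' property;
 * when it is false the block is read as big endian, the binding's default.
 */
static uint32_t guts_read32(const uint8_t *reg, bool little_endian)
{
	if (little_endian)
		return (uint32_t)reg[0] | (uint32_t)reg[1] << 8 |
		       (uint32_t)reg[2] << 16 | (uint32_t)reg[3] << 24;

	return (uint32_t)reg[0] << 24 | (uint32_t)reg[1] << 16 |
	       (uint32_t)reg[2] << 8 | (uint32_t)reg[3];
}
```

The same four bytes give different values depending on the flag, which is why the property has to be stated in the device tree rather than guessed by the driver.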
Re: [PATCH 1/4] printk/NMI: Handle continuous lines and missing newline
On (10/27/16 09:35), Joe Perches wrote:
[..]
> > -	printk_nmi_flush_line(buf, (end - start) + 1);
> > +	/* Handle continuous lines or missing new line. */
> > +	if ((c + 1 < end) && printk_get_level(c)) {
> > +		if (header) {
> > +			c += 2;
>
> printk_skip_level

agree, printk_skip_level() probably would look better here.
other than that, looks good to me. nice that you found it, Petr!

Reviewed-by: Sergey Senozhatsky

	-ss
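For readers unfamiliar with the helpers named above, here is a userspace sketch of the idea. It is modelled loosely on the kernel's printk_get_level() and printk_skip_level(), assuming a log-level header made of an SOH byte followed by one level character; `get_level()` and `skip_level()` are stand-in names and the accepted level characters are an assumption of this sketch.

```c
#include <stddef.h>

#define SOH '\001'	/* start-of-header byte that introduces a level prefix */

/* Return the level character if buf starts with a level header, else 0. */
static int get_level(const char *buf)
{
	if (buf[0] == SOH && buf[1]) {
		if ((buf[1] >= '0' && buf[1] <= '7') || buf[1] == 'd')
			return buf[1];
	}
	return 0;
}

/* Step over the two-byte header when one is present. */
static const char *skip_level(const char *buf)
{
	return get_level(buf) ? buf + 2 : buf;
}
```

Using the skip helper instead of an open-coded `c += 2` keeps the header length in one place, which seems to be the point of the review comment.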
Re: drivers/base/power/opp/of.c:181:6: error: redefinition of 'dev_pm_opp_of_remove_table'
On Fri, Oct 28, 2016 at 09:27:53AM +0530, Viresh Kumar wrote:
> On 28-10-16, 07:22, kbuild test robot wrote:
> > tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> > head:   e3300ffef0653774f1099cab153d25d24bd773ce
> > commit: f47b72a15a9679dd4dc1af681d4d2f1ca2815552 PM / OPP: Move CONFIG_OF dependent code in a separate file
> > date:   6 months ago
>
> Why are we picking it up now ?

Sorry, due to problems in the 0day infrastructure a few errors were
missed in May. Now we catch it when the commit goes mainline.

https://lists.01.org/pipermail/kbuild-all/

June 2016:  ... [ Gzip'd Text 853 KB ]
May 2016:   ... [ Gzip'd Text 294 KB ]
April 2016: ... [ Gzip'd Text 599 KB ]

As you can see, the report volumes are noticeably lower in "May 2016".

Thanks,
Fengguang
Re: [RFC][PATCHv4 0/6] printk: use printk_safe to handle printk() recursive calls
Hello,

On (10/27/16 20:30), Linus Torvalds wrote:
> On Thu, Oct 27, 2016 at 8:49 AM, Sergey Senozhatsky wrote:
> >
> > RFC
> >
> > This patch set extends a lock-less NMI per-cpu buffers idea to
> > handle recursive printk() calls. The basic mechanism is pretty much the
> > same -- at the beginning of a deadlock-prone section we switch to lock-less
> > printk callback, and return back to a default printk implementation at the
> > end; the messages are getting flushed to a logbuf buffer from a safer
> > context.
>
> This looks very reasonable to me.
>
> Does this also obviate the need for "printk_deferred()" that the
> scheduler and the clock code uses? Because that would be a lovely
> thing to look at if it doesn't..

I wish I could say that we can retire printk_deferred(), but no, we
still need it. it's rather simple to fix printk recursion (that's what
the patch set is doing), but printk deadlocks are much harder to handle.
anything that starts somewhere else but somehow is related to printk
will deadlock (in the worst case).

I use this backtrace as an example:

	SyS_ioctl
	 do_vfs_ioctl
	  tty_ioctl
	   n_tty_ioctl
	    tty_mode_ioctl
	     set_termios
	      tty_set_termios
	       uart_set_termios
	        uart_change_speed
	         FOO_serial_set_termios
	          spin_lock_irqsave(&port->lock)     // lock the output port
	          !! WARN() or pr_err() or printk()
	              vprintk_emit()
	               /* console_trylock() */
	               console_unlock()
	                call_console_drivers()
	                 FOO_write()
	                  spin_lock_irqsave(&port->lock)     // already locked

with the current printk we can't tell for sure how many locks will be
acquired -- printk() can succeed in locking the console_sem and start
invoking console drivers (if any) from console_unlock(), or it can fail,
thus we will acquire only logbuf spin_lock and console_sem spin_lock.

the things can change *a bit* once we switch to async_printk. because
instead of doing console_unlock()->call_console_drivers(), printk() will
just wake_up() the printk_kthread.

but still, it won't be enough to remove printk_deferred() :(

	vprintk_emit()
	 wake_up()
	  spin_lock rq lock
	   printk

will be safe. but

	wake_up()
	 spin_lock rq lock
	  printk
	   vprintk_emit()
	    wake_up()
	     spin_lock rq lock

will deadlock.

we can't even tell for sure what locks are "important" to printk().
a small and reasonable code refactoring somewhere in clock code/etc.
can accidentally change the whole picture by introducing "unsafe"
WARN_ON() or adding yet another lock to the printing path.

need to think more.

p.s.
we are planning to discuss printk related issues next week in Santa Fe.

	-ss
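The recursion half of the problem (the part the patch set does address) can be pictured with a single-threaded toy model. Everything here is hypothetical and much simpler than the real patches: `model_printk()`, `main_log`, `safe_buf` and `in_lock_hook` are invented for the sketch, there are no real locks or per-CPU data, and the flush happens right after the section ends rather than from a separate safer context.

```c
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 256

static char main_log[BUF_SIZE];		/* stands in for the shared logbuf */
static char safe_buf[BUF_SIZE];		/* stands in for the per-CPU buffer */
static int safe_depth;			/* > 0 inside a deadlock-prone section */
static void (*in_lock_hook)(void);	/* simulates code running "under the lock" */

/* Append msg to dst, truncating if the buffer is full. */
static void append(char *dst, const char *msg)
{
	size_t used = strlen(dst);

	snprintf(dst + used, BUF_SIZE - used, "%s", msg);
}

static void model_printk(const char *msg)
{
	/* Recursive call: divert to the lock-less buffer and return. */
	if (safe_depth > 0) {
		append(safe_buf, msg);
		return;
	}

	safe_depth++;			/* enter the deadlock-prone section */
	append(main_log, msg);
	if (in_lock_hook)
		in_lock_hook();		/* e.g. a WARN() firing inside printk */
	safe_depth--;			/* leave the section */

	/* Flush diverted messages once it is safe to touch main_log again. */
	append(main_log, safe_buf);
	safe_buf[0] = '\0';
}

/* Example hook: a nested message emitted while the "lock" is held. */
static void nested_warn(void)
{
	model_printk("nested\n");
}
```

The model says nothing about the second backtrace in the mail: a lock taken outside printk (such as the rq lock) is invisible to the `safe_depth` check, which is the reason given above for keeping printk_deferred().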
Re: [PATCH 7/7] mfd: tps65217: Fix mismatched interrupt number
On 10/26/2016 10:56 PM, Lee Jones wrote:
>> diff --git a/include/linux/mfd/tps65217.h b/include/linux/mfd/tps65217.h
>> index 4ccda89..75a3a5f 100644
>> --- a/include/linux/mfd/tps65217.h
>> +++ b/include/linux/mfd/tps65217.h
>> @@ -235,9 +235,9 @@ struct tps65217_bl_pdata {
>>  };
>>
>>  enum tps65217_irq_type {
>> -	TPS65217_IRQ_PB,
>> -	TPS65217_IRQ_AC,
>>  	TPS65217_IRQ_USB,
>> +	TPS65217_IRQ_AC,
>> +	TPS65217_IRQ_PB,
>>  	TPS65217_NUM_IRQ
>>  };
>
> This is why using enum for these types of assignments is sometimes
> dangerous. It's probably best to be explicit.

I agree with you. Let me fix in v2 - use #define instead of enum type.

Best regards,
Milo
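One possible shape for the explicit numbering discussed here is sketched below. The values 0, 1 and 2 simply follow the order in the quoted diff and are an assumption, as is the `tps65217_irq_name` table, which exists only to show that a table indexed by explicit constants no longer depends on declaration order.

```c
/* Explicit numbering (assumed values, following the order in the diff). */
#define TPS65217_IRQ_USB	0
#define TPS65217_IRQ_AC		1
#define TPS65217_IRQ_PB		2
#define TPS65217_NUM_IRQ	3

/*
 * Hypothetical lookup table: designated initializers tie each entry to
 * its constant, so reordering the lines cannot change the mapping.
 */
static const char *const tps65217_irq_name[TPS65217_NUM_IRQ] = {
	[TPS65217_IRQ_PB]  = "pb",
	[TPS65217_IRQ_USB] = "usb",
	[TPS65217_IRQ_AC]  = "ac",
};
```

With a plain enum, swapping two enumerators silently renumbers them, which is the hazard the review comment points at.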
[v13, 1/8] dt: bindings: update Freescale DCFG compatible
Update Freescale DCFG compatible with 'fsl,<chip>-dcfg' instead of
'fsl,ls1021a-dcfg' to include more chips such as ls1021a, ls1043a, and
ls2080a.

Signed-off-by: Yangbo Lu
Acked-by: Rob Herring
Signed-off-by: Scott Wood
---
Changes for v8:
	- Added this patch
Changes for v9:
	- Added a list for the possible compatibles
Changes for v10:
	- None
Changes for v11:
	- Added 'Acked-by: Rob Herring'
	- Updated commit message by Scott
Changes for v12:
	- None
Changes for v13:
	- None
---
 Documentation/devicetree/bindings/arm/fsl.txt | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/arm/fsl.txt b/Documentation/devicetree/bindings/arm/fsl.txt
index dbbc095..713c1ae 100644
--- a/Documentation/devicetree/bindings/arm/fsl.txt
+++ b/Documentation/devicetree/bindings/arm/fsl.txt
@@ -119,7 +119,11 @@ Freescale DCFG
 configuration and status for the device. Such as setting the secondary
 core start address and release the secondary core from holdoff and startup.
 
 Required properties:
-  - compatible: should be "fsl,ls1021a-dcfg"
+  - compatible: should be "fsl,<chip>-dcfg"
+    Possible compatibles:
+	"fsl,ls1021a-dcfg"
+	"fsl,ls1043a-dcfg"
+	"fsl,ls2080a-dcfg"
   - reg : should contain base address and length of DCFG memory-mapped registers
 
 Example:
-- 
2.1.0.27.g96db324
Re: [PATCH 5/7] ARM: dts: am335x: Add the charger interrupt
On 10/22/2016 05:47 AM, Robert Nelson wrote:
>> +#include
>
> ^ this hasn't been posted nor pushed to mainline yet.. ;)

Oops! I created this file but failed to capture it, not only in my git
tree but also in my head!

Thanks for your review.

Best regards,
Milo
[v13, 0/8] Fix eSDHC host version register bug
This patchset is used to fix a host version register bug in the
T4240-R1.0-R2.0 eSDHC controller. To match the SoC version and revision,
10 previous versions of the patchset tried many methods, but all of them
were rejected by reviewers, such as:
	- dts compatible method
	- syscon method
	- ifdef PPC method
	- GUTS driver getting SVR method

Arnd suggested a soc_device_match method in v10, and this is the only
available method left now. This v11 patchset introduces the
soc_device_match interface in soc driver.

The first six patches of Yangbo are to add the GUTS driver. This is used
to register a soc device which contains soc version and revision
information. The other two patches introduce the soc_device_match method
in soc driver and apply it on esdhc driver to fix this bug.

Arnd Bergmann (1):
  base: soc: introduce soc_device_match() interface

Yangbo Lu (7):
  dt: bindings: update Freescale DCFG compatible
  ARM64: dts: ls2080a: add device configuration node
  dt: bindings: move guts devicetree doc out of powerpc directory
  powerpc/fsl: move mpc85xx.h to include/linux/fsl
  soc: fsl: add GUTS driver for QorIQ platforms
  MAINTAINERS: add entry for Freescale SoC drivers
  mmc: sdhci-of-esdhc: fix host version for T4240-R1.0-R2.0

 Documentation/devicetree/bindings/arm/fsl.txt      |   6 +-
 .../bindings/{powerpc => soc}/fsl/guts.txt         |   3 +
 MAINTAINERS                                        |  11 +-
 arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi     |   6 +
 arch/powerpc/kernel/cpu_setup_fsl_booke.S          |   2 +-
 arch/powerpc/sysdev/fsl_pci.c                      |   2 +-
 drivers/base/Kconfig                               |   1 +
 drivers/base/soc.c                                 |  66 ++
 drivers/clk/clk-qoriq.c                            |   3 +-
 drivers/i2c/busses/i2c-mpc.c                       |   2 +-
 drivers/iommu/fsl_pamu.c                           |   3 +-
 drivers/mmc/host/Kconfig                           |   1 +
 drivers/mmc/host/sdhci-of-esdhc.c                  |  20 ++
 drivers/net/ethernet/freescale/gianfar.c           |   2 +-
 drivers/soc/Kconfig                                |   3 +-
 drivers/soc/fsl/Kconfig                            |  18 ++
 drivers/soc/fsl/Makefile                           |   1 +
 drivers/soc/fsl/guts.c                             | 236 +
 include/linux/fsl/guts.h                           | 125 ++-
 .../asm/mpc85xx.h => include/linux/fsl/svr.h       |   4 +-
 include/linux/sys_soc.h                            |   3 +
 21 files changed, 456 insertions(+), 62 deletions(-)
 rename Documentation/devicetree/bindings/{powerpc => soc}/fsl/guts.txt (91%)
 create mode 100644 drivers/soc/fsl/Kconfig
 create mode 100644 drivers/soc/fsl/guts.c
 rename arch/powerpc/include/asm/mpc85xx.h => include/linux/fsl/svr.h (97%)

-- 
2.1.0.27.g96db324
[v13, 5/8] soc: fsl: add GUTS driver for QorIQ platforms
The global utilities block controls power management, I/O device enabling, power-onreset(POR) configuration monitoring, alternate function selection for multiplexed signals,and clock control. This patch adds a driver to manage and access global utilities block. Initially only reading SVR and registering soc device are supported. Other guts accesses, such as reading RCW, should eventually be moved into this driver as well. Signed-off-by: Yangbo Lu --- Changes for v4: - Added this patch Changes for v5: - Modified copyright info - Changed MODULE_LICENSE to GPL - Changed EXPORT_SYMBOL_GPL to EXPORT_SYMBOL - Made FSL_GUTS user-invisible - Added a complete compatible list for GUTS - Stored guts info in file-scope variable - Added mfspr() getting SVR - Redefined GUTS APIs - Called fsl_guts_init rather than using platform driver - Removed useless parentheses - Removed useless 'extern' key words Changes for v6: - Made guts thread safe in fsl_guts_init Changes for v7: - Removed 'ifdef' for function declaration in guts.h Changes for v8: - Fixes lines longer than 80 characters checkpatch issue - Added 'Acked-by: Scott Wood' Changes for v9: - None Changes for v10: - None Changes for v11: - Changed to platform driver Changes for v12: - Removed "signed-off-by: Scott" - Defined fsl_soc_die_attr struct array instead of soc_device_attribute - Re-designed soc_device_attribute for QorIQ SoC - Other minor fixes Changes for v13: - Rebased - Removed text after 'bool' in Kconfig - Removed ARCH ifdefs - Added more bits for ls1021a mask - Used devm --- drivers/soc/Kconfig | 3 +- drivers/soc/fsl/Kconfig | 18 drivers/soc/fsl/Makefile | 1 + drivers/soc/fsl/guts.c | 236 +++ include/linux/fsl/guts.h | 125 +++-- 5 files changed, 333 insertions(+), 50 deletions(-) create mode 100644 drivers/soc/fsl/Kconfig create mode 100644 drivers/soc/fsl/guts.c diff --git a/drivers/soc/Kconfig b/drivers/soc/Kconfig index e6e90e8..f31bceb 100644 --- a/drivers/soc/Kconfig +++ b/drivers/soc/Kconfig @@ -1,8 +1,7 @@ 
 menu "SOC (System On Chip) specific Drivers"

 source "drivers/soc/bcm/Kconfig"
-source "drivers/soc/fsl/qbman/Kconfig"
-source "drivers/soc/fsl/qe/Kconfig"
+source "drivers/soc/fsl/Kconfig"
 source "drivers/soc/mediatek/Kconfig"
 source "drivers/soc/qcom/Kconfig"
 source "drivers/soc/rockchip/Kconfig"
diff --git a/drivers/soc/fsl/Kconfig b/drivers/soc/fsl/Kconfig
new file mode 100644
index 000..7a9fb9b
--- /dev/null
+++ b/drivers/soc/fsl/Kconfig
@@ -0,0 +1,18 @@
+#
+# Freescale SOC drivers
+#
+
+source "drivers/soc/fsl/qbman/Kconfig"
+source "drivers/soc/fsl/qe/Kconfig"
+
+config FSL_GUTS
+	bool
+	select SOC_BUS
+	help
+	  The global utilities block controls power management, I/O device
+	  enabling, power-onreset(POR) configuration monitoring, alternate
+	  function selection for multiplexed signals,and clock control.
+	  This driver is to manage and access global utilities block.
+	  Initially only reading SVR and registering soc device are supported.
+	  Other guts accesses, such as reading RCW, should eventually be moved
+	  into this driver as well.
diff --git a/drivers/soc/fsl/Makefile b/drivers/soc/fsl/Makefile
index 75e1f53..44b3beb 100644
--- a/drivers/soc/fsl/Makefile
+++ b/drivers/soc/fsl/Makefile
@@ -5,3 +5,4 @@
 obj-$(CONFIG_FSL_DPAA) += qbman/
 obj-$(CONFIG_QUICC_ENGINE) += qe/
 obj-$(CONFIG_CPM) += qe/
+obj-$(CONFIG_FSL_GUTS) += guts.o
diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c
new file mode 100644
index 000..1f356ed
--- /dev/null
+++ b/drivers/soc/fsl/guts.c
@@ -0,0 +1,236 @@
+/*
+ * Freescale QorIQ Platforms GUTS Driver
+ *
+ * Copyright (C) 2016 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+struct guts {
+	struct ccsr_guts __iomem *regs;
+	bool little_endian;
+};
+
+struct fsl_soc_die_attr {
+	char *die;
+	u32 svr;
+	u32 mask;
+};
+
+static struct guts *guts;
+static struct soc_device_attribute soc_dev_attr;
+static struct soc_device *soc_dev;
+
+
+/* SoC die attribute definition for QorIQ platform */
+static const struct fsl_soc_die_attr fsl_soc_die[] = {
+	/*
+	 * Power Architecture-based SoCs T Series
+	 */
+
+	/* Die: T4240, SoC:
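The patch text is cut off here, before any table entries or the lookup code. As a rough illustration of how a table with `svr` and `mask` fields is typically consulted, the user-space sketch below masks a System Version Register value and compares it against each entry. It is not taken from the patch: `soc_die_attr`, `example_dies` and `soc_die_match` are hypothetical names, and the SVR and mask values are illustrative assumptions only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as struct fsl_soc_die_attr in the patch (user-space types). */
struct soc_die_attr {
	const char *die;
	uint32_t svr;
	uint32_t mask;
};

/* Illustrative entries only; the svr/mask values are assumptions. */
const struct soc_die_attr example_dies[] = {
	{ .die = "T4240", .svr = 0x82400000, .mask = 0xfff00000 },
	{ .die = "T1040", .svr = 0x85200000, .mask = 0xfff00000 },
	{ .die = NULL },	/* sentinel terminates the table */
};

/*
 * Return the first entry whose svr equals the masked input,
 * or NULL when no die in the table matches.
 */
const struct soc_die_attr *
soc_die_match(uint32_t svr, const struct soc_die_attr *table)
{
	assert(table != NULL);

	for (; table->die != NULL; table++) {
		if (table->svr == (svr & table->mask))
			return table;
	}
	return NULL;
}
```

With these assumed values, `soc_die_match(0x82400010, example_dies)` would return the "T4240" entry, because the low bits that distinguish SoC variants and revisions are masked off before the comparison.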
[v13, 2/8] ARM64: dts: ls2080a: add device configuration node
Add the dts node for the device configuration unit that provides general purpose configuration and status for the device.

Signed-off-by: Yangbo Lu
Acked-by: Scott Wood
---
Changes for v5:
	- Added this patch
Changes for v6:
	- None
Changes for v7:
	- None
Changes for v8:
	- Added 'Acked-by: Scott Wood'
Changes for v9:
	- None
Changes for v10:
	- None
Changes for v11:
	- None
Changes for v12:
	- None
Changes for v13:
	- None
---
 arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi b/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi
index 337da90..c03b099 100644
--- a/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi
+++ b/arch/arm64/boot/dts/freescale/fsl-ls2080a.dtsi
@@ -215,6 +215,12 @@
 			clocks = <>;
 		};

+		dcfg: dcfg@1e0 {
+			compatible = "fsl,ls2080a-dcfg", "syscon";
+			reg = <0x0 0x1e0 0x0 0x1>;
+			little-endian;
+		};
+
 		serial0: serial@21c0500 {
 			compatible = "fsl,ns16550", "ns16550a";
 			reg = <0x0 0x21c0500 0x0 0x100>;
--
2.1.0.27.g96db324
Re: drivers/base/power/opp/of.c:181:6: error: redefinition of 'dev_pm_opp_of_remove_table'
On 28-10-16, 07:22, kbuild test robot wrote:
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> head: e3300ffef0653774f1099cab153d25d24bd773ce
> commit: f47b72a15a9679dd4dc1af681d4d2f1ca2815552 PM / OPP: Move CONFIG_OF dependent code in a separate file
> date: 6 months ago

Why are we picking it up now?

--
viresh
[v13, 6/8] MAINTAINERS: add entry for Freescale SoC drivers
Add maintainer entry for Freescale SoC drivers including the QE library and the GUTS driver now. Also add maintainer for QE library.

Signed-off-by: Yangbo Lu
Acked-by: Scott Wood
Acked-by: Qiang Zhao
---
Changes for v8:
	- Added this patch
Changes for v9:
	- Added linux-arm mail list
	- Removed GUTS driver entry
Changes for v10:
	- Changed 'DRIVER' to 'DRIVERS'
	- Added 'Acked-by' of Scott and Qiang
Changes for v11:
	- None
Changes for v12:
	- None
Changes for v13:
	- None
---
 MAINTAINERS | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index c72fa18..cf3aaee 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5037,9 +5037,18 @@ S: Maintained
 F: drivers/net/ethernet/freescale/fman
 F: Documentation/devicetree/bindings/powerpc/fsl/fman.txt

+FREESCALE SOC DRIVERS
+M: Scott Wood
+L: linuxppc-...@lists.ozlabs.org
+L: linux-arm-ker...@lists.infradead.org
+S: Maintained
+F: drivers/soc/fsl/
+F: include/linux/fsl/
+
 FREESCALE QUICC ENGINE LIBRARY
+M: Qiang Zhao
 L: linuxppc-...@lists.ozlabs.org
-S: Orphan
+S: Maintained
 F: drivers/soc/fsl/qe/
 F: include/soc/fsl/*qe*.h
 F: include/soc/fsl/*ucc*.h
--
2.1.0.27.g96db324
linux-next: Tree for Oct 28
Hi all,

There will probably be no linux-next releases next week while I attend the Kernel Summit.

Changes since 20161027:

The akpm-current tree lost its build failures.

Non-merge commits (relative to Linus' tree): 3098
 3842 files changed, 227213 insertions(+), 59787 deletions(-)

I have created today's linux-next tree at git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you are tracking the linux-next tree using git, you should not use "git pull" to do so as that will try to merge the new linux-next release with the old one. You should use "git fetch" and checkout or reset to the new master.

You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory. Between each merge, the tree was built with a ppc64_defconfig for powerpc and an allmodconfig (with CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a native build of tools/perf. After the final fixups (if any), I do an x86_64 modules_install followed by builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig (this fails its final link) and pseries_le_defconfig and i386, sparc and sparc64 defconfig.

Below is a summary of the state of the merge.

I am currently merging 245 trees (counting Linus' and 35 trees of patches pending for Linus' tree).

Stats about the size of the tree over time can be seen at http://neuling.org/linux-next-size.html .

Status of my local build tests will be at http://kisskb.ellerman.id.au/linux-next . If maintainers want to give advice about cross compilers/configs that work, we are always open to add more builds.

Thanks to Randy Dunlap for doing many randconfig builds. And to Paul Gortmaker for triage and bug fixes.

--
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (e3300ffef065 Merge tag 'for-linus-4.9-rc2-ofs-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux)
Merging fixes/master (30066ce675d3 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6)
Merging kbuild-current/rc-fixes (989cea5c14be kbuild: prevent lib-ksyms.o rebuilds)
Merging arc-current/for-curr (e2192b253de8 ARC: module: print pretty section names)
Merging arm-current/fixes (6127d124ee4e ARM: wire up new pkey syscalls)
Merging m68k-current/for-linus (6736e65effc3 m68k: Migrate exception table users off module.h and onto extable.h)
Merging metag-fixes/fixes (35d04077ad96 metag: Only define atomic_dec_if_positive conditionally)
Merging powerpc-fixes/fixes (fb479e44a9e2 powerpc/64s: relocation, register save fixes for system reset interrupt)
Merging sparc/master (a74ad5e660a9 sparc64: Handle extremely large kernel TLB range flushes more gracefully.)
Merging net/master (9ee7837449b3 net sched filters: fix notification of filter delete with proper handle)
CONFLICT (content): Merge conflict in drivers/net/ethernet/qlogic/Kconfig
Applying: qed*: merge fix for CONFIG_INFINIBAND_QEDR Kconfig move
Merging ipsec/master (7f92083eb58f vti6: flush x-netns xfrm cache when vti interface is removed)
Merging netfilter/master (7034b566a4e7 netfilter: fix nf_queue handling)
Merging ipvs/master (ea43f860d984 Merge branch 'ethoc-fixes')
Merging wireless-drivers/master (d3532ea6ce4e brcmfmac: avoid maybe-uninitialized warning in brcmf_cfg80211_start_ap)
Merging mac80211/master (b4f7f4ad425a mac80211: fix some sphinx warnings)
Merging sound-current/for-linus (bdc3478f90cd ALSA: usb-audio: Add quirk for Syntek STK1160)
Merging pci-current/for-linus (349d941e1ff1 PCI: qcom: Fix pp->dev usage before assignment)
Merging driver-core.current/driver-core-linus (248ff0216543 driver core: Make Kconfig text for DEBUG_TEST_DRIVER_REMOVE stronger)
Merging tty.current/tty-linus (009e39ae44f4 vt: clear selection before resizing)
Merging usb.current/usb-linus (c1aa67729a1d Merge tag 'usb-ci-v4.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/peter.chen/usb into usb-linus)
Merging usb-gadget-fixes/fixes (a1aa8cf6471b Revert "Documentation: devicetree: dwc2: Deprecate g-tx-fifo-size")
Merging usb-serial-fixes/usb-linus (07d9a380680d Linux 4.9-rc2)
Merging usb-chipidea-fixes/ci-for-usb-stable (991d5add50a5 usb: chipidea: host: fix NULL ptr dereference during shutdown)
Merging phy/fixes (1001354ca341 Linux 4.9-rc1)
Merging staging.current/staging-linus (e866dd8aab76 greybus: fix a leak on error in gb_module_create())
Merging char-misc.current/char-misc-linus (cfcc1456e4a2 Merge tag 'extcon-fixes-for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/extcon into char-misc-linus)
Merging input-current/for-linus (324ae0958cab Input: psmouse - cleanup Focaltech code)
Merging crypto-current/master (6d4
Re: [RFC][PATCHv4 0/6] printk: use printk_safe to handle printk() recursive calls
On Thu, Oct 27, 2016 at 8:49 AM, Sergey Senozhatsky wrote:
>
> RFC
>
> This patch set extends a lock-less NMI per-cpu buffers idea to
> handle recursive printk() calls. The basic mechanism is pretty much the
> same -- at the beginning of a deadlock-prone section we switch to lock-less
> printk callback, and return back to a default printk implementation at the
> end; the messages are getting flushed to a logbuf buffer from a safer
> context.

This looks very reasonable to me.

Does this also obviate the need for "printk_deferred()" that the scheduler and the clock code uses? Because that would be a lovely thing to look at if it doesn't..

Linus
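The mechanism described in the quoted cover letter can be modelled in a few lines of user-space C. This is only a single-threaded sketch of the idea, not the patch set's code: every name here (`model_printk`, `model_safe_enter`, `model_safe_exit`, `model_safe_flush`, `model_log`) is hypothetical, one static buffer stands in for the per-CPU buffers, and there is no real locking or NMI handling.

```c
#include <assert.h>
#include <string.h>

static char logbuf[1024];	/* stands in for the main log buffer */
static size_t log_len;
static char safe_buf[256];	/* stands in for one per-CPU buffer */
static size_t safe_len;
static int safe_depth;		/* nesting depth of "safe" sections */

/* Append msg to buf, truncating so the terminator always fits. */
static void buf_append(char *buf, size_t cap, size_t *len, const char *msg)
{
	size_t room = cap - 1 - *len;
	size_t n = strlen(msg);

	if (n > room)
		n = room;	/* drop what does not fit */
	memcpy(buf + *len, msg, n);
	*len += n;
	buf[*len] = '\0';
}

/* Default path; in a kernel this is where the log lock would be taken. */
static void log_store(const char *msg)
{
	buf_append(logbuf, sizeof(logbuf), &log_len, msg);
}

/* Lock-less path used inside a deadlock-prone section. */
static void safe_store(const char *msg)
{
	buf_append(safe_buf, sizeof(safe_buf), &safe_len, msg);
}

/* The currently active callback; switched on section entry and exit. */
static void (*printk_func)(const char *) = log_store;

void model_printk(const char *msg)
{
	printk_func(msg);
}

void model_safe_enter(void)
{
	if (safe_depth++ == 0)
		printk_func = safe_store;
}

void model_safe_exit(void)
{
	assert(safe_depth > 0);
	if (--safe_depth == 0)
		printk_func = log_store;
}

/* Called later from a safer context: move buffered text to the log. */
void model_safe_flush(void)
{
	assert(safe_depth == 0);
	log_store(safe_buf);
	safe_len = 0;
	safe_buf[0] = '\0';
}

const char *model_log(void)
{
	return logbuf;
}
```

In this model, a message printed between `model_safe_enter()` and `model_safe_exit()` never touches `logbuf` directly; it only appears there after `model_safe_flush()` runs. The depth counter is a design choice of the sketch so that nested sections restore the default callback only when the outermost one ends.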