[RFC PATCH 5/5] mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list
On a system with heterogeneous memory, the reasonable fallback lists would be:

a. No fallback, stick to the current running node.
b. Fall back to other nodes of the same type, then of a different type, e.g.
   DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3
c. Fall back to other nodes of the same type only, e.g.
   DRAM node 0 -> DRAM node 1

a. is already in place; the previous patch implements b., providing a way to
satisfy memory requests as best effort by default. This patch builds c., which
falls back only to nodes of the same type, and only when the user specifies
GFP_SAME_NODE_TYPE.

Signed-off-by: Fan Du

---
 include/linux/gfp.h    |  7 +++
 include/linux/mmzone.h |  1 +
 mm/page_alloc.c        | 15 +++
 3 files changed, 23 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index fdab7de..ca5fdfc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -44,6 +44,8 @@
 #else
 #define ___GFP_NOLOCKDEP 0
 #endif
+#define ___GFP_SAME_NODE_TYPE 0x100u
+
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -215,6 +217,7 @@
 /* Disable lockdep for GFP context tracking */
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+#define __GFP_SAME_NODE_TYPE ((__force gfp_t)___GFP_SAME_NODE_TYPE)
 
 /* Room for N __GFP_FOO bits */
 #define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
@@ -301,6 +304,8 @@
 			__GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 #define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
 
+#define GFP_SAME_NODE_TYPE	(__GFP_SAME_NODE_TYPE)
+
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
 #define GFP_MOVABLE_SHIFT 3
@@ -438,6 +443,8 @@ static inline int gfp_zonelist(gfp_t flags)
 #ifdef CONFIG_NUMA
 	if (unlikely(flags & __GFP_THISNODE))
 		return ZONELIST_NOFALLBACK;
+	if (unlikely(flags & __GFP_SAME_NODE_TYPE))
+		return ZONELIST_FALLBACK_SAME_TYPE;
 #endif
 	return ZONELIST_FALLBACK;
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8c37e1c..2f8603e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -583,6 +583,7 @@ static inline bool zone_intersects(struct zone *zone,
 enum {
 	ZONELIST_FALLBACK,	/* zonelist with fallback */
+	ZONELIST_FALLBACK_SAME_TYPE,	/* zonelist with fallback to the same type node */
 #ifdef CONFIG_NUMA
 	/*
 	 * The NUMA zonelists are doubled because we need zonelists that
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a408a91..de797921 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5448,6 +5448,21 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
 	}
 	zonerefs->zone = NULL;
 	zonerefs->zone_idx = 0;
+
+	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK_SAME_TYPE]._zonerefs;
+
+	for (i = 0; i < nr_nodes; i++) {
+		int nr_zones;
+
+		pg_data_t *node = NODE_DATA(node_order[i]);
+
+		if (!is_node_same_type(node->node_id, pgdat->node_id))
+			continue;
+		nr_zones = build_zonerefs_node(node, zonerefs);
+		zonerefs += nr_zones;
+	}
+	zonerefs->zone = NULL;
+	zonerefs->zone_idx = 0;
 }
 
 /*
-- 
1.8.3.1
[RFC PATCH 0/5] New fallback workflow for heterogeneous memory system
This is another approach to building the zonelist, based on patch #10 of
patchset [1]. For systems with heterogeneous DRAM and PMEM (persistent
memory):

1) change ZONELIST_FALLBACK to first fall back to nodes of the same type,
   then to the other types
2) add ZONELIST_FALLBACK_SAME_TYPE to fall back only to nodes of the same
   type, to be explicitly selected by __GFP_SAME_NODE_TYPE.

For example, a 2S DRAM+PMEM system may have NUMA distances:
node   0   1   2   3
  0:  10  21  17  28
  1:  21  10  28  17
  2:  17  28  10  28
  3:  28  17  28  10

Nodes 0, 1 are DRAM nodes; nodes 2, 3 are PMEM nodes.

ZONELIST_FALLBACK
=================
The current zone fallback lists are based on NUMA distance only, which means
a page allocation request from node 0 will iterate the zones in this order:

DRAM node 0 -> PMEM node 2 -> DRAM node 1 -> PMEM node 3

However, PMEM has different characteristics from DRAM; the more reasonable
and desirable fallback order would be:

DRAM node 0 -> DRAM node 1 -> PMEM node 2 -> PMEM node 3

When DRAM is exhausted, try PMEM then.

ZONELIST_FALLBACK_SAME_TYPE
===========================
Some cases are better suited to PMEM characteristics, e.g. a page that is
read much more frequently than it is written; other cases may prefer DRAM
only. In both cases it doesn't matter whether the page comes from the local
node or a remote one.

Create __GFP_SAME_NODE_TYPE to request a page of the same node type, so we
get either DRAM (from node 0 or 1) or PMEM (from node 2 or 3). It is a kind
of extension of the nofallback list, restricted to the same node type.

This patchset is self-contained and based on Linux 5.1-rc6.

[1]: https://lkml.org/lkml/2018/12/26/138

Fan Du (5):
  acpi/numa: memorize NUMA node type from SRAT table
  mmzone: new pgdat flags for DRAM and PMEM
  x86,numa: update numa node type
  mm, page alloc: build fallback list on per node type basis
  mm, page_alloc: Introduce ZONELIST_FALLBACK_SAME_TYPE fallback list

 arch/x86/include/asm/numa.h | 2 ++
 arch/x86/mm/numa.c          | 3 +++
 drivers/acpi/numa.c         | 5
 include/linux/gfp.h         | 7 ++
 include/linux/mmzone.h      | 35
 mm/page_alloc.c             | 57 -
 6 files changed, 93 insertions(+), 16 deletions(-)
-- 
1.8.3.1
[RFC PATCH 4/5] mm, page alloc: build fallback list on per node type basis
On a box with both DRAM and PMEM managed by the mm system, usually nodes 0
and 1 are DRAM nodes and nodes 2 and 3 are PMEM nodes. The nofallback list
is the same as before; the fallback lists are now redesigned to be arranged
on a per node type basis. In other words, an allocation request for a DRAM
page starting from node 0 will go through the node0->node1->node2->node3
zonelists.

Signed-off-by: Fan Du

---
 include/linux/mmzone.h | 8
 mm/page_alloc.c        | 42 ++
 2 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d3ee9f9..8c37e1c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -939,6 +939,14 @@ static inline int is_node_dram(int nid)
 	return test_bit(PGDAT_DRAM, &pgdat->flags);
 }
 
+static inline int is_node_same_type(int nida, int nidb)
+{
+	if (node_isset(nida, numa_nodes_pmem))
+		return node_isset(nidb, numa_nodes_pmem);
+	else
+		return node_isset(nidb, numa_nodes_dram);
+}
+
 static inline void set_node_type(int nid)
 {
 	pg_data_t *pgdat = NODE_DATA(nid);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6ce20a..a408a91 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5372,7 +5372,7 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
  *
  * Return: node id of the found node or %NUMA_NO_NODE if no node is found.
  */
-static int find_next_best_node(int node, nodemask_t *used_node_mask)
+static int find_next_best_node(int node, nodemask_t *used_node_mask, int need_same_type)
 {
 	int n, val;
 	int min_val = INT_MAX;
@@ -5380,7 +5380,7 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 	const struct cpumask *tmp = cpumask_of_node(0);
 
 	/* Use the local node if we haven't already */
-	if (!node_isset(node, *used_node_mask)) {
+	if (need_same_type && !node_isset(node, *used_node_mask)) {
 		node_set(node, *used_node_mask);
 		return node;
 	}
@@ -5391,6 +5391,12 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 		if (node_isset(n, *used_node_mask))
 			continue;
 
+		if (need_same_type && !is_node_same_type(node, n))
+			continue;
+
+		if (!need_same_type && is_node_same_type(node, n))
+			continue;
+
 		/* Use the distance array to find the distance */
 		val = node_distance(node, n);
 
@@ -5472,31 +5478,35 @@ static void build_zonelists(pg_data_t *pgdat)
 	int node, load, nr_nodes = 0;
 	nodemask_t used_mask;
 	int local_node, prev_node;
+	int need_same_type;
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
 	load = nr_online_nodes;
 	prev_node = local_node;
-	nodes_clear(used_mask);
 
 	memset(node_order, 0, sizeof(node_order));
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-		/*
-		 * We don't want to pressure a particular node.
-		 * So adding penalty to the first node in same
-		 * distance group to make it round-robin.
-		 */
-		if (node_distance(local_node, node) !=
-		    node_distance(local_node, prev_node))
-			node_load[node] = load;
+	for (need_same_type = 1; need_same_type >= 0; need_same_type--) {
+		nodes_clear(used_mask);
+		while ((node = find_next_best_node(local_node, &used_mask,
+						   need_same_type)) >= 0) {
+			/*
+			 * We don't want to pressure a particular node.
+			 * So adding penalty to the first node in same
+			 * distance group to make it round-robin.
+			 */
+			if (node_distance(local_node, node) !=
+			    node_distance(local_node, prev_node))
+				node_load[node] = load;
 
-		node_order[nr_nodes++] = node;
-		prev_node = node;
-		load--;
+			node_order[nr_nodes++] = node;
+			prev_node = node;
+			load--;
+		}
 	}
-
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
 	build_thisnode_zonelists(pgdat);
+	}
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
-- 
1.8.3.1
[RFC PATCH 1/5] acpi/numa: memorize NUMA node type from SRAT table
Mark a NUMA node as DRAM or PMEM. This can happen at boot (see the e820 pmem
type override patch) or on the fly when binding a devdax device with the kmem
driver. It depends on the BIOS supplying the PMEM NUMA proximity in the SRAT
table, which current production BIOSes do.

Signed-off-by: Fan Du

---
 arch/x86/include/asm/numa.h | 2 ++
 arch/x86/mm/numa.c          | 2 ++
 drivers/acpi/numa.c         | 5 +
 3 files changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index bbfde3d..5191198 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -30,6 +30,8 @@
  */
 extern s16 __apicid_to_node[MAX_LOCAL_APIC];
 extern nodemask_t numa_nodes_parsed __initdata;
+extern nodemask_t numa_nodes_pmem;
+extern nodemask_t numa_nodes_dram;
 
 extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
 extern void __init numa_set_distance(int from, int to, int distance);
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index dfb6c4d..3c3a1f5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -20,6 +20,8 @@
 int numa_off;
 
 nodemask_t numa_nodes_parsed __initdata;
+nodemask_t numa_nodes_pmem;
+nodemask_t numa_nodes_dram;
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 867f6e3..ec4b7a7e 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -298,6 +298,11 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
 	node_set(node, numa_nodes_parsed);
 
+	if (ma->flags & ACPI_SRAT_MEM_NON_VOLATILE)
+		node_set(node, numa_nodes_pmem);
+	else
+		node_set(node, numa_nodes_dram);
+
 	pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s%s\n",
 		node, pxm,
 		(unsigned long long) start,
 		(unsigned long long) end - 1,
-- 
1.8.3.1
[RFC PATCH 2/5] mmzone: new pgdat flags for DRAM and PMEM
On a system with DRAM and PMEM, we need new flags to tag whether a pgdat is
made of DRAM or persistent memory. This patch serves as preparation for the
follow-up patches.

Signed-off-by: Fan Du

---
 include/linux/mmzone.h | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fba7741..d3ee9f9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -520,6 +520,8 @@ enum pgdat_flags {
 					 * many pages under writeback
 					 */
 	PGDAT_RECLAIM_LOCKED,		/* prevents concurrent reclaim */
+	PGDAT_DRAM,			/* Volatile DRAM memory node */
+	PGDAT_PMEM,			/* Persistent memory node */
 };
 
 enum zone_flags {
@@ -923,6 +925,30 @@ extern int numa_zonelist_order_handler(struct ctl_table *, int,
 
 #endif /* !CONFIG_NEED_MULTIPLE_NODES */
 
+static inline int is_node_pmem(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	return test_bit(PGDAT_PMEM, &pgdat->flags);
+}
+
+static inline int is_node_dram(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	return test_bit(PGDAT_DRAM, &pgdat->flags);
+}
+
+static inline void set_node_type(int nid)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+
+	if (node_isset(nid, numa_nodes_pmem))
+		set_bit(PGDAT_PMEM, &pgdat->flags);
+	else
+		set_bit(PGDAT_DRAM, &pgdat->flags);
+}
+
 extern struct pglist_data *first_online_pgdat(void);
 extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat);
 extern struct zone *next_zone(struct zone *zone);
-- 
1.8.3.1
[RFC PATCH 3/5] x86,numa: update numa node type
Give the newly created node a type according to its SRAT attribute.

Signed-off-by: Fan Du

---
 arch/x86/mm/numa.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 3c3a1f5..ff8ad63 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -590,6 +590,7 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 			continue;
 
 		alloc_node_data(nid);
+		set_node_type(nid);
 	}
 
 	/* Dump memblock with node info and return. */
-- 
1.8.3.1
[PATCH] memory hotplug: fix comments when add section
Here, pfn_to_node should be page_to_nid.

Signed-off-by: Fan Du

---
 mm/memory_hotplug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d4b5f29..c3c1c57 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -265,7 +265,7 @@ static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
 	/*
 	 * Make all the pages reserved so that nobody will stumble over half
 	 * initialized state.
-	 * FIXME: We also have to associate it with a node because pfn_to_node
+	 * FIXME: We also have to associate it with a node because page_to_nid
 	 * relies on having page with the proper node.
 	 */
 	for (i = 0; i < PAGES_PER_SECTION; i++) {
-- 
1.8.3.1
[PATCH v4] Add /proc/PID/smaps support for DAX
Memory behind device DAX is not attached to the normal memory management
system; when a user mmaps /dev/dax, the smaps part is currently missing, so
users have no way to check how much device DAX memory is actually used in
practice.

Whether a vma is backed by normal pages, huge pages, or both at the same
time makes no difference for the device DAX user so far. The existing smaps
structure is enough to do the job, so this patch reuses the existing RSS/PSS
accounting for the statistics.

An example reading looks like this:

7f30fe20-7f310220 rw-s 00:06 19567  /dev/dax12.0
Size:              65536 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:               65536 kB
Pss:               65536 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:     65536 kB
Referenced:        65536 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:            65536 kB
ProtectionKey:         0
VmFlags: rd wr sh mr mw me ms mm hg

Signed-off-by: Fan Du

---
v4:
* Merge device DAX readings into existing smaps counters for simplicity.
v3:
* Elaborate more about the usage, suggested by Michal Hocko.
v2:
* Use pte_devmap to check for a valid pfn page structure,
  pointed out by Dan. Thx!

 fs/proc/task_mmu.c | 74 --
 1 file changed, 72 insertions(+), 2 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5589b4b..9b2d3e6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -507,6 +507,55 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page,
 	}
 }
 
+/* page structure behind DAX mappings is NOT compound page
+ * when it's a huge page mappings, so introduce new API to
+ * account for both PMD and PUD mapping.
+ */
+static void smaps_account_dax_huge(struct mem_size_stats *mss,
+		struct page *page, unsigned long size, bool young, bool dirty)
+{
+	int mapcount = page_mapcount(page);
+
+	if (PageAnon(page)) {
+		mss->anonymous += size;
+		if (!PageSwapBacked(page) && !dirty && !PageDirty(page))
+			mss->lazyfree += size;
+	}
+
+	mss->resident += size;
+	/* Accumulate the size in pages that have been accessed. */
+	if (young || page_is_young(page) || PageReferenced(page))
+		mss->referenced += size;
+
+	/*
+	 * page_count(page) == 1 guarantees the page is mapped exactly once.
+	 * If any subpage of the compound page mapped with PTE it would elevate
+	 * page_count().
+	 */
+	if (page_count(page) == 1) {
+		if (dirty || PageDirty(page))
+			mss->private_dirty += size;
+		else
+			mss->private_clean += size;
+		mss->pss += (u64)size << PSS_SHIFT;
+		return;
+	}
+
+	if (mapcount >= 2) {
+		if (dirty || PageDirty(page))
+			mss->shared_dirty += size;
+		else
+			mss->shared_clean += size;
+		mss->pss += (size << PSS_SHIFT) / mapcount;
+	} else {
+		if (dirty || PageDirty(page))
+			mss->private_dirty += size;
+		else
+			mss->private_clean += size;
+		mss->pss += size << PSS_SHIFT;
+	}
+}
+
 #ifdef CONFIG_SHMEM
 static int smaps_pte_hole(unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
@@ -528,7 +577,16 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	struct page *page = NULL;
 
 	if (pte_present(*pte)) {
-		page = vm_normal_page(vma, addr, *pte);
+		if (!vma_is_dax(vma))
+			page = vm_normal_page(vma, addr, *pte);
+		else if (pte_devmap(*pte)) {
+			struct dev_pagemap *pgmap;
+
+			pgmap = get_dev_pagemap(pte_pfn(*pte), NULL);
+			if (!pgmap)
+				return;
+			page = pte_page(*pte);
+		}
 	} else if (is_swap_pte(*pte)) {
 		swp_entry_t swpent = pte_to_swp_entry(*pte);
 
@@ -579,7 +637,19 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 
 	/* FOLL_DUMP will return -EFAULT on huge zero page */
-	page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+	if (!vma_is_dax(vma))
+		page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+	else if (pmd_devmap(*pmd)) {
+		struct
[PATCH 2/2] Add /proc/PID/{smaps, numa_maps} support for DAX
So users can check those interfaces for more detailed information about how
much DAX mapping is currently created. Here we use the vma_is_dax method to
find the specific page structure for DAX {huge, normal} page mappings; the
vm_normal_page routine works as before, without any impact on the existing
logic where _vm_normal_page is called.

With this patch, users can check device DAX usage in the Ptes@ readings.
For example, statistics from a device DAX aligned to 2MB:

7f6c1780-7f6c17e0 rw-s 00:06 20559  /dev/dax12.0
Size:               6144 kB
.
.
.
Ptes@2MB:           6144 kB

Signed-off-by: Fan Du

---
v3:
* Elaborate more about the usage, suggested by Michal Hocko.
v2:
* Use pte_devmap to check for a valid pfn page structure,
  pointed out by Dan. Thx!
---
 fs/proc/task_mmu.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 30dbf37..5c4535c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -531,7 +531,10 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	struct page *page = NULL;
 
 	if (pte_present(*pte)) {
-		page = vm_normal_page(vma, addr, *pte);
+		if (!vma_is_dax(vma))
+			page = vm_normal_page(vma, addr, *pte);
+		else if (pte_devmap(*pte))
+			page = pte_page(*pte);
 		mss->rss_pte += PAGE_SIZE;
 	} else if (is_swap_pte(*pte)) {
 		swp_entry_t swpent = pte_to_swp_entry(*pte);
@@ -583,7 +586,11 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 
 	/* FOLL_DUMP will return -EFAULT on huge zero page */
-	page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+	if (!vma_is_dax(vma))
+		page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+	else if (pmd_devmap(*pmd))
+		page = pmd_page(*pmd);
+
 	if (IS_ERR_OR_NULL(page))
 		return;
 	if (PageAnon(page))
@@ -1770,13 +1777,15 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *orig_pte;
 	pte_t *pte;
+	struct page *page;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		struct page *page;
-
-		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+		if (!vma_is_dax(vma))
+			page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+		else if (pmd_devmap(*pmd))
+			page = pmd_page(*pmd);
 		if (page)
 			gather_stats(page, md, pmd_dirty(*pmd),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1789,7 +1798,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 #endif
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
-		struct page *page = can_gather_numa_stats(*pte, vma, addr);
+		if (!vma_is_dax(vma))
+			page = can_gather_numa_stats(*pte, vma, addr);
+		else if (pte_devmap(*pte))
+			page = pte_page(*pte);
 		if (!page)
 			continue;
 		gather_stats(page, md, pte_dirty(*pte), 1);
-- 
1.8.3.1
[PATCHv3 1/2] proc: mm: export PTE sizes directly in smaps
From: Dave Hansen /proc/$pid/smaps has a number of fields that are intended to imply the kinds of PTEs used to map memory. "AnonHugePages" obviously tells you how many PMDs are being used. "MMUPageSize" along with the "Hugetlb" fields tells you how many PTEs you have for a huge page. The current mechanisms work fine when we have one or two page sizes. But, they start to get a bit muddled when we mix page sizes inside one VMA. For instance, the DAX folks were proposing adding a set of fields like: DevicePages: DeviceHugePages: DeviceGiganticPages: DeviceGinormousPages: to unmuddle things when page sizes get mixed. That's fine, but it does require userspace know the mapping from our various arbitrary names to hardware page sizes on each architecture and kernel configuration. That seems rather suboptimal. What folks really want is to know how much memory is mapped with each page size. How about we just do *that* instead? Patch attached. Seems harmless enough. Seems to compile on a bunch of random architectures. Makes smaps look like this: Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB KernelPageSize:4 kB MMUPageSize: 4 kB Locked:0 kB Ptes@4kB: 32 kB Ptes@2MB: 2048 kB The format I used here should be unlikely to break smaps parsers unless they're looking for "kB" and now match the 'Ptes@4kB' instead of the one at the end of the line. Note: hugetlbfs PTEs are unusual. We can have more than one "pte_t" for each hugetlbfs "page". arm64 has this configuration, and probably others. The code should now handle when an hstate's size is not equal to one of the page table entry sizes. For instance, it assumes that hstates between PMD_SIZE and PUD_SIZE are made up of multiple PMDs and prints them as such. I've tested this on x86 with normal 4k ptes, anonymous huge pages, 1G hugetlbfs and 2M hugetlbfs pages. 1. I'd like to thank Dan Williams for showing me a mirror as I complained about the bozo that introduced 'AnonHugePages'. 
[Fan] Rebase the original patch from Dave Hansen by fixing a couple of compile issues. Signed-off-by: Fan Du Signed-off-by: Dave Hansen --- Documentation/filesystems/proc.txt | 6 +++ fs/proc/task_mmu.c | 106 - mm/hugetlb.c | 11 3 files changed, 121 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index adba21b..a11ab80 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -423,6 +423,9 @@ SwapPss: 0 kB KernelPageSize:4 kB MMUPageSize: 4 kB Locked:0 kB +Ptes@4kB: 4 kB +Ptes@2MB: 8192 kB + VmFlags: rd ex mr mw me dw the first of these lines shows the same information as is displayed for the @@ -460,6 +463,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap. "SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this does not take into account swapped out page of underlying shmem objects. "Locked" indicates whether the mapping is locked in memory or not. +"Ptes@..." lines show how many page table entries are currently in place and +pointing to memory. There is an entry for each size present in the hardware +page tables for this mapping. "VmFlags" field deserves a separate description. 
This member represents the kernel flags associated with the particular virtual memory area in two letter encoded diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 5589b4b..30dbf37 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -451,6 +451,9 @@ struct mem_size_stats { unsigned long shared_hugetlb; unsigned long private_hugetlb; unsigned long first_vma_start; + unsigned long rss_pte; + unsigned long rss_pmd; + unsigned long rss_pud; u64 pss; u64 pss_locked; u64 swap_pss; @@ -529,6 +532,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, if (pte_present(*pte)) { page = vm_normal_page(vma, addr, *pte); + mss->rss_pte += PAGE_SIZE; } else if (is_swap_pte(*pte)) { swp_entry_t swpent = pte_to_swp_entry(*pte); @@ -590,6 +594,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr, /* pass */; else VM_BUG_ON_PAGE(1, page); + mss->rss_pmd += PMD_SIZE; smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd)); } #else @@ -699,6 +704,30 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) } #ifdef CONFIG_HUGETLB_PAGE +/* + * Most archi
[PATCHv2 2/2] Add /proc/PID/{smaps, numa_maps} support for DAX
So users can check those interfaces for more detailed information about how
much DAX mapping is currently created. Here we use the vma_is_dax method to
find the specific page structure for DAX {huge, normal} page mappings; the
vm_normal_page routine works as before, without any impact on the existing
logic where _vm_normal_page is called.

Signed-off-by: Fan Du

---
v2:
* Use pte_devmap to check for a valid pfn page structure,
  pointed out by Dan. Thx!
---
 fs/proc/task_mmu.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 30dbf37..5c4535c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -531,7 +531,10 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	struct page *page = NULL;
 
 	if (pte_present(*pte)) {
-		page = vm_normal_page(vma, addr, *pte);
+		if (!vma_is_dax(vma))
+			page = vm_normal_page(vma, addr, *pte);
+		else if (pte_devmap(*pte))
+			page = pte_page(*pte);
 		mss->rss_pte += PAGE_SIZE;
 	} else if (is_swap_pte(*pte)) {
 		swp_entry_t swpent = pte_to_swp_entry(*pte);
@@ -583,7 +586,11 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 
 	/* FOLL_DUMP will return -EFAULT on huge zero page */
-	page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+	if (!vma_is_dax(vma))
+		page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+	else if (pmd_devmap(*pmd))
+		page = pmd_page(*pmd);
+
 	if (IS_ERR_OR_NULL(page))
 		return;
 	if (PageAnon(page))
@@ -1770,13 +1777,15 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *orig_pte;
 	pte_t *pte;
+	struct page *page;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		struct page *page;
-
-		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+		if (!vma_is_dax(vma))
+			page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+		else if (pmd_devmap(*pmd))
+			page = pmd_page(*pmd);
 		if (page)
 			gather_stats(page, md, pmd_dirty(*pmd),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1789,7 +1798,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 #endif
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
-		struct page *page = can_gather_numa_stats(*pte, vma, addr);
+		if (!vma_is_dax(vma))
+			page = can_gather_numa_stats(*pte, vma, addr);
+		else if (pte_devmap(*pte))
+			page = pte_page(*pte);
 		if (!page)
 			continue;
 		gather_stats(page, md, pte_dirty(*pte), 1);
-- 
1.8.3.1
[PATCHv2 1/2] proc: mm: export PTE sizes directly in smaps
/proc/$pid/smaps has a number of fields that are intended to imply the kinds of PTEs used to map memory. "AnonHugePages" obviously tells you how many PMDs are being used. "MMUPageSize" along with the "Hugetlb" fields tells you how many PTEs you have for a huge page. The current mechanisms work fine when we have one or two page sizes. But, they start to get a bit muddled when we mix page sizes inside one VMA. For instance, the DAX folks were proposing adding a set of fields like: DevicePages: DeviceHugePages: DeviceGiganticPages: DeviceGinormousPages: to unmuddle things when page sizes get mixed. That's fine, but it does require userspace know the mapping from our various arbitrary names to hardware page sizes on each architecture and kernel configuration. That seems rather suboptimal. What folks really want is to know how much memory is mapped with each page size. How about we just do *that* instead? Patch attached. Seems harmless enough. Seems to compile on a bunch of random architectures. Makes smaps look like this: Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB KernelPageSize:4 kB MMUPageSize: 4 kB Locked:0 kB Ptes@4kB: 32 kB Ptes@2MB: 2048 kB The format I used here should be unlikely to break smaps parsers unless they're looking for "kB" and now match the 'Ptes@4kB' instead of the one at the end of the line. Note: hugetlbfs PTEs are unusual. We can have more than one "pte_t" for each hugetlbfs "page". arm64 has this configuration, and probably others. The code should now handle when an hstate's size is not equal to one of the page table entry sizes. For instance, it assumes that hstates between PMD_SIZE and PUD_SIZE are made up of multiple PMDs and prints them as such. I've tested this on x86 with normal 4k ptes, anonymous huge pages, 1G hugetlbfs and 2M hugetlbfs pages. 1. I'd like to thank Dan Williams for showing me a mirror as I complained about the bozo that introduced 'AnonHugePages'. 
[Fan] Rebase the original patch from Dave Hansen by fixing a couple of compile issues. Signed-off-by: Fan Du Signed-off-by: Dave Hansen --- Documentation/filesystems/proc.txt | 6 +++ fs/proc/task_mmu.c | 106 - mm/hugetlb.c | 11 3 files changed, 121 insertions(+), 2 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index adba21b..a11ab80 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -423,6 +423,9 @@ SwapPss: 0 kB KernelPageSize:4 kB MMUPageSize: 4 kB Locked:0 kB +Ptes@4kB: 4 kB +Ptes@2MB: 8192 kB + VmFlags: rd ex mr mw me dw the first of these lines shows the same information as is displayed for the @@ -460,6 +463,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap. "SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this does not take into account swapped out page of underlying shmem objects. "Locked" indicates whether the mapping is locked in memory or not. +"Ptes@..." lines show how many page table entries are currently in place and +pointing to memory. There is an entry for each size present in the hardware +page tables for this mapping. "VmFlags" field deserves a separate description. 
This member represents the kernel flags associated with the particular virtual memory area in two letter encoded diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 5589b4b..30dbf37 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -451,6 +451,9 @@ struct mem_size_stats { unsigned long shared_hugetlb; unsigned long private_hugetlb; unsigned long first_vma_start; + unsigned long rss_pte; + unsigned long rss_pmd; + unsigned long rss_pud; u64 pss; u64 pss_locked; u64 swap_pss; @@ -529,6 +532,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr, if (pte_present(*pte)) { page = vm_normal_page(vma, addr, *pte); + mss->rss_pte += PAGE_SIZE; } else if (is_swap_pte(*pte)) { swp_entry_t swpent = pte_to_swp_entry(*pte); @@ -590,6 +594,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr, /* pass */; else VM_BUG_ON_PAGE(1, page); + mss->rss_pmd += PMD_SIZE; smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd)); } #else @@ -699,6 +704,30 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma) } #ifdef CONFIG_HUGETLB_PAGE +/* + * Most architectures h
[PATCH] Add /proc/PID/{smaps, numa_maps} support for DAX
So users can check those interfaces for more detailed information about how
much DAX mapping is currently created. Here we use the vma_is_dax method to
find the specific page structure for DAX {huge, normal} page mappings; the
vm_normal_page routine works as before, without any impact on the existing
logic where _vm_normal_page is called.

Signed-off-by: Fan Du

---
 fs/proc/task_mmu.c | 25 +++++++++++++++++--------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5589b4b..ba2e58c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -528,7 +528,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	struct page *page = NULL;
 
 	if (pte_present(*pte)) {
-		page = vm_normal_page(vma, addr, *pte);
+		if (!vma_is_dax(vma))
+			page = vm_normal_page(vma, addr, *pte);
+		else
+			page = pte_page(*pte);
+
 	} else if (is_swap_pte(*pte)) {
 		swp_entry_t swpent = pte_to_swp_entry(*pte);
 
@@ -579,7 +583,11 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	struct page *page;
 
 	/* FOLL_DUMP will return -EFAULT on huge zero page */
-	page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+	if (!vma_is_dax(vma))
+		page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
+	else
+		page = pmd_page(*pmd);
+
 	if (IS_ERR_OR_NULL(page))
 		return;
 	if (PageAnon(page))
@@ -1668,13 +1676,15 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *orig_pte;
 	pte_t *pte;
+	struct page *page;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		struct page *page;
-
-		page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+		if (!vma_is_dax(vma))
+			page = can_gather_numa_stats_pmd(*pmd, vma, addr);
+		else
+			page = pmd_page(*pmd);
 		if (page)
 			gather_stats(page, md, pmd_dirty(*pmd),
 				     HPAGE_PMD_SIZE/PAGE_SIZE);
@@ -1687,7 +1697,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 #endif
 	orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
 	do {
-		struct page *page = can_gather_numa_stats(*pte, vma, addr);
+		if (!vma_is_dax(vma))
+			page = can_gather_numa_stats(*pte, vma, addr);
+		else
+			page = pte_page(*pte);
 		if (!page)
 			continue;
 		gather_stats(page, md, pte_dirty(*pte), 1);
-- 
1.8.3.1
Re: [PATCH v3 net-next] xfrm: Add check to prevent un-complete key manager
On Nov 11, 2013 at 14:39, baker.ker...@gmail.com wrote:
> From: Baker Zhang
>
> "acquire" and "compile_policy" callbacks are necessary for a key manager.
>
> Signed-off-by: Baker Zhang
> ---
> Thanks for all the replies.
>
> V1: For the current kernel source, there is no problem. In our VPN product, we need an xfrm_km in a kernel module to monitor xfrm state changes; thus, the 'acquire' and 'compile_policy' may be NULL. So I think we should do the check before using them.
>
> V2: Align the continuation line according to the networking coding style.
>
> V3: Add the check to prevent an incomplete key manager at register time.
>
>  net/xfrm/xfrm_state.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> index b9c3f9e..178283e 100644
> --- a/net/xfrm/xfrm_state.c
> +++ b/net/xfrm/xfrm_state.c
> @@ -1806,6 +1806,9 @@ static DEFINE_SPINLOCK(xfrm_km_lock);
>  int xfrm_register_km(struct xfrm_mgr *km)
>  {
> +	if (km->acquire == NULL || km->compile_policy == NULL)

"acquire" is a MUST, but "compile_policy" is not a necessity.

From the first commit log, you probably add functionality providing SA state change notifications in your private key manager, which the current implementation does not. Maybe it's worthwhile to elaborate on the missing functionality rather than add this checking, because both key managers in use (pfkeyv2/netlink) have "acquire" and "compile_policy" at the same time.

--
浮沉随浪只记今朝笑 --fan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called
On Sep 16, 2013 at 17:01, Thomas Gleixner wrote:

On Mon, 16 Sep 2013, Fan Du wrote:

On Sep 13, 2013 at 22:32, Thomas Gleixner wrote:

On Fri, 13 Sep 2013, Fan Du wrote:

(2) What I have been bugging you about here for this long time is really the second problem; I'm sorry I didn't make it clear to you and others. It is this: why use wall clock time to calculate soft/hard IPsec events in the xfrm_state timeout handler when the timeout happens? Because even if xfrm_state uses CLOCK_BOOTTIME, a system wall clock time change will surely disturb soft/hard IPsec events, which you raised your concern about in (*a*).

No. CLOCK_BOOTTIME is not affected by wall clock time changes. It's basically CLOCK_MONOTONIC; it just keeps counting the suspend time as well. So without a suspend/resume cycle, CLOCK_MONOTONIC and CLOCK_BOOTTIME are the same. After a suspend/resume cycle, CLOCK_BOOTTIME will be N seconds ahead of CLOCK_MONOTONIC, where N is the number of seconds spent in suspend.

Sorry for the ambiguity. Yes, CLOCK_BOOTTIME is monotonic *and* counts suspend/resume time, as well as being unaffected by wall clock time changes. Using CLOCK_BOOTTIME guarantees b- will time out in the right manner, i.e., unaffected by wall clock time changes, and also keeps the timer valid across suspend/resume.

A classic way of using a timer is:
a- Arm a timer with a specified interval
b- Wait for the timer to time out
c- After the timeout, notify the event to other places in the timer handler.

The IPsec key lifetime timer does it this way:
a- Arm a timer with a specified interval
b- Wait for the timer to time out
c- After the timeout, in the timer handler, use wall clock time to calculate which kind of event to notify user space of (soft or hard, for both key use time and key creation time). So the problem arises at this stage if the wall clock time changed.

And why on earth must it use wall clock time for selecting the event type?

/*shivering... sorry to bother you again.*/

Yep, it's a bit twisted. This part of the code comes a long way, from v2.5.67. Beyond that fact, I barely know its original design goal in doing so. The first version of this patch was my attempt to remove the wall clock time dependency; it's a pity that the feedback was not very welcoming in the end. So later on, adding a notifier at clock_was_set was suggested by David.

Thanks, tglx
Re: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called
Hi, Thomas

On Sep 13, 2013 at 22:32, Thomas Gleixner wrote:

On Fri, 13 Sep 2013, Fan Du wrote:

(2) What I have been bugging you about here for this long time is really the second problem; I'm sorry I didn't make it clear to you and others. It is this: why use wall clock time to calculate soft/hard IPsec events in the xfrm_state timeout handler when the timeout happens? Because even if xfrm_state uses CLOCK_BOOTTIME, a system wall clock time change will surely disturb soft/hard IPsec events, which you raised your concern about in (*a*).

No. CLOCK_BOOTTIME is not affected by wall clock time changes. It's basically CLOCK_MONOTONIC; it just keeps counting the suspend time as well. So without a suspend/resume cycle, CLOCK_MONOTONIC and CLOCK_BOOTTIME are the same. After a suspend/resume cycle, CLOCK_BOOTTIME will be N seconds ahead of CLOCK_MONOTONIC, where N is the number of seconds spent in suspend.

Sorry for the ambiguity. Yes, CLOCK_BOOTTIME is monotonic *and* counts suspend/resume time, as well as being unaffected by wall clock time changes. Using CLOCK_BOOTTIME guarantees b- will time out in the right manner, i.e., unaffected by wall clock time changes, and also keeps the timer valid across suspend/resume.

A classic way of using a timer is:
a- Arm a timer with a specified interval
b- Wait for the timer to time out
c- After the timeout, notify the event to other places in the timer handler.

The IPsec key lifetime timer does it this way:
a- Arm a timer with a specified interval
b- Wait for the timer to time out
c- After the timeout, in the timer handler, use wall clock time to calculate which kind of event to notify user space of (soft or hard, for both key use time and key creation time). So the problem arises at this stage if the wall clock time changed.

Thanks.

Thanks, tglx
Re: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called
Hi Dave/Thomas

On Sep 13, 2013 at 01:32, David Miller wrote:

From: Thomas Gleixner
Date: Thu, 12 Sep 2013 16:43:37 +0200 (CEST)

So what about going back to timer_list timers and simply utilizing register_pm_notifier(), which will tell you that the system resumed?

The thing to understand is that there are two timeouts for an IPSEC rule, a soft and a hard timeout. There is a gap between these two exactly so that we can negotiate a new encapsulation with the IPSEC gateway before communication ceases to be possible over the IPSEC protected path. So the idea is that the soft timeout triggers the re-negotiation, and after a hard timeout the IPSEC path is no longer usable and all communication will fail. Simply triggering a re-negotiation after every suspend/resume makes no sense at all. Spurious re-negotiations are undesirable. ^^ (*a*)

What's the difference between this and re-negotiation after every system wall clock change by using a clock_was_set notifier?

> On Aug 2, 2013 at 06:35, David Miller wrote:
>
> I suspect the thing to do is to have system time changes generate a notifier when clock_was_set() is called.
>
> The XFRM code would walk the rules and pretend that we hit the soft timeout for every rule that we haven't hit the soft timeout on yet already.
>
> If a rule hit the soft timeout, force a hard timeout.
>
> When forcing a soft timeout, adjust the hard timeout to be (hard_timeout - soft_timeout) into the future.

What we want are real timers. We want that rather than a "we suspended so just assume all timers expired" event, which is not very useful for this kind of application.

Here we are facing two problems :)

(1) What kind of timer should xfrm_state employ? Two requirements here: First, KEY lifetime should include suspend/resume time. Second, system wall clock time changes (backward/forward) should *not* impact the *timer* timeout event (not the soft/hard IPsec events fired to user space!).

net-next commit 99565a6c471cbb66caa68347c195133017559943 ("xfrm: Make xfrm_state timer monotonic"), by utilizing *CLOCK_BOOTTIME*, has solved this problem.

(2) What I have been bugging you about here for this long time is really the second problem; I'm sorry I didn't make it clear to you and others. It is this: why use wall clock time to calculate soft/hard IPsec events in the xfrm_state timeout handler when the timeout happens? Because even if xfrm_state uses CLOCK_BOOTTIME, a system wall clock time change will surely disturb soft/hard IPsec events, which you raised your concern about in (*a*).

The initial approach (http://marc.info/?l=linux-netdev&m=137534280429187&w=2) tried to solve this second problem by eliminating the dependency on the system wall clock in the xfrm_state timer timeout handler.

I think this time I have made the situation crystal clear.
Re: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called
On Aug 18, 2013 at 17:05, Thomas Gleixner wrote:

On Wed, 14 Aug 2013, Fan Du wrote:

From e3929d4fdfad5b40fd8cad0e217597670d1aef54 Mon Sep 17 00:00:00 2001
From: Fan Du
Date: Wed, 14 Aug 2013 16:39:23 +0800
Subject: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called

When clock_was_set is called in case of a system wall time change, or a host resuming from suspend state, use this notifier for places interested in this action, e.g. IPsec SA lifetime management.

Sigh. These notifiers have been proposed in the past and we always rejected them. Why do you need this? There should be nothing except the core timekeeping code which needs to know about clock_was_set. Can you please explain what kind of users you have in mind and WHY they need to know about it.

Hi, Thomas

Thanks for your patience. Please let me take a few seconds to try to explain this.

The current xfrm layer has *one* hrtimer to guard IPsec key timeouts. The timeout can be measured in either of the two ways below:

(1) The timer is started once the key is created, but the key is not necessarily actually used right away. In detail: record get_seconds() when the key is created.

(2) The timer is started when the key is actually used, e.g. when an IP packet needs to be encrypted. In detail: record get_seconds() when the key is first used.

So in the hrtimer handler, the code gets the current get_seconds() and checks it against what was saved in (1) or (2), then notifies the timeout up to user land.

So the pitfall is using one hrtimer for two timeout events and, most importantly, using get_seconds to check for timeout: once the system clock is changed by the user intentionally, the key timeout can misbehave wildly.

A refactor was proposed to get rid of the dependency on the system wall clock by cleaning up the hrtimer handler. Unfortunately, David frowned on it in (3), and suggested that once the system clock is changed, the timeout of the key be adjusted.

(3): http://www.spinics.net/lists/netdev/msg245169.html

Thanks, tglx
[PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called
From e3929d4fdfad5b40fd8cad0e217597670d1aef54 Mon Sep 17 00:00:00 2001
From: Fan Du
Date: Wed, 14 Aug 2013 16:39:23 +0800
Subject: [PATCHv3 linux-next] hrtimer: Add notifier when clock_was_set was called

When clock_was_set is called in case of a system wall time change, or a host resuming from suspend state, use this notifier for places interested in this action, e.g. IPsec SA lifetime management.

Signed-off-by: Fan Du

v3:
- Beautify the notifier with register/unregister APIs exported for other subsystems.
---
 include/linux/hrtimer.h |  3 +++
 kernel/hrtimer.c        | 19 +++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index d19a5c2..f0404e4 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -461,4 +461,7 @@ extern u64 ktime_divns(const ktime_t kt, s64 div);
 /* Show pending timers: */
 extern void sysrq_timer_list_show(void);

+extern int register_clock_change_notifier(struct notifier_block *nb);
+extern int unregister_clock_change_notifier(struct notifier_block *nb);
+
 #endif
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 383319b..c6e6405 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -755,6 +755,24 @@ static inline void retrigger_next_event(void *arg) { }
 #endif /* CONFIG_HIGH_RES_TIMERS */

+static ATOMIC_NOTIFIER_HEAD(clock_change_notifier_list);
+static int call_clock_change_notifiers(void)
+{
+	return atomic_notifier_call_chain(&clock_change_notifier_list, 0, 0);
+}
+
+int register_clock_change_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_register(&clock_change_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(register_clock_change_notifier);
+
+int unregister_clock_change_notifier(struct notifier_block *nb)
+{
+	return atomic_notifier_chain_unregister(&clock_change_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_clock_change_notifier);
+
 /*
  * Clock realtime was set
  *
@@ -773,6 +791,7 @@ void clock_was_set(void)
 	on_each_cpu(retrigger_next_event, NULL, 1);
 #endif
 	timerfd_clock_was_set();
+	call_clock_change_notifiers();
 }

 /*
--
1.7.9.5
Re: [BUG] ipv6, rawv6_close(): BUG: unable to handle kernel paging request
Hallo Srivatsa On 2013年07月22日 02:28, Srivatsa S. Bhat wrote: Hi, I'm seeing this on every boot. Version: Latest mainline (commit ea45ea70b) I tested with this commit using your updated IPv6 config, this incident didn't show up after several times of reboot. Could you please elaborate your testing details if possible? A wild guess, it dereference mrt->mroute6_sk, indicating mrt is invalid. Regards, Srivatsa S. Bhat --- BUG: unable to handle kernel paging request at 882018552020 IP: [] ip6mr_sk_done+0x32/0xb0 [ipv6] PGD 290a067 PUD 207ffe0067 PMD 207ff1d067 PTE 802018552060 Oops: [#1] SMP DEBUG_PAGEALLOC Modules linked in: ebtable_nat ebtables nfs fscache nf_conntrack_ipv4 nf_defrag_ipv4 ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables nfsd lockd nfs_acl exportfs auth_rpcgss autofs4 sunrpc 8021q garp bridge stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vfat fat vhost_net macvtap macvlan vhost tun kvm_intel kvm uinput iTCO_wdt iTCO_vendor_support cdc_ether usbnet mii microcode i2c_i801 i2c_core lpc_ich mfd_core shpchp ioatdma dca mlx4_core be2net wmi acpi_cpufreq mperf ext4 jbd2 mbcache dm_mirror dm_region_hash dm_log dm_mod CPU: 0 PID: 7 Comm: kworker/u33:0 Not tainted 3.11.0-rc1-ea45e-a #4 Hardware name: IBM -[8737R2A]-/00Y2738, BIOS -[B2E120RUS-1.20]- 11/30/2012 Workqueue: netns cleanup_net task: 8810393641c0 ti: 881039366000 task.ti: 881039366000 RIP: 0010:[] [] ip6mr_sk_done+0x32/0xb0 [ipv6] RSP: 0018:881039367bd8 EFLAGS: 00010286 RAX: 881039367fd8 RBX: 882018552000 RCX: dead00200200 RDX: RSI: 881039367b68 RDI: 881039367b68 RBP: 881039367bf8 R08: 881039367b68 R09: R10: R11: R12: 882015a7a040 R13: 882014eb89c0 R14: 8820289e2800 R15: FS: () GS:88103fc0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 882018552020 CR3: 01c0b000 CR4: 000407f0 Stack: 881039367c18 882014eb89c0 882015e28c00 881039367c18 a034d9d1 8820289e2800 882014eb89c0 881039367c58 815bdecb 815bddf2 882014eb89c0 Call Trace: [] 
rawv6_close+0x21/0x40 [ipv6] [] inet_release+0xfb/0x220 [] ? inet_release+0x22/0x220 [] inet6_release+0x3f/0x50 [ipv6] [] sock_release+0x29/0xa0 [] sk_release_kernel+0x30/0x70 [] icmpv6_sk_exit+0x3b/0x80 [ipv6] [] ops_exit_list+0x39/0x60 [] cleanup_net+0xfb/0x1a0 [] process_one_work+0x1da/0x610 [] ? process_one_work+0x169/0x610 [] worker_thread+0x120/0x3a0 [] ? process_one_work+0x610/0x610 [] kthread+0xee/0x100 [] ? __init_kthread_worker+0x70/0x70 [] ret_from_fork+0x7c/0xb0 [] ? __init_kthread_worker+0x70/0x70 Code: 20 48 89 5d e8 4c 89 65 f0 4c 89 6d f8 66 66 66 66 90 4c 8b 67 30 49 89 fd e8 db 3c 1e e1 49 8b 9c 24 90 08 00 00 48 85 db 74 06<4c> 39 6b 20 74 20 bb f3 ff ff ff e8 8e 3c 1e e1 89 d8 4c 8b 65 RIP [] ip6mr_sk_done+0x32/0xb0 [ipv6] RSP CR2: 882018552020 ---[ end trace e8367f5addd58b5f ]--- BUG: sleeping function called from invalid context at kernel/rwsem.c:20 in_atomic(): 0, irqs_disabled(): 1, pid: 7, name: kworker/u33:0 INFO: lockdep is turned off. irq event stamp: 7804 hardirqs last enabled at (7803): [] _raw_spin_unlock_irq+0x30/0x50 hardirqs last disabled at (7804): [] _raw_spin_lock_irq+0x17/0x60 softirqs last enabled at (7122): [] __do_softirq+0x1e6/0x400 softirqs last disabled at (7113): [] irq_exit+0xed/0x100 CPU: 0 PID: 7 Comm: kworker/u33:0 Tainted: G D 3.11.0-rc1-ea45e-a #4 Hardware name: IBM -[8737R2A]-/00Y2738, BIOS -[B2E120RUS-1.20]- 11/30/2012 Workqueue: netns cleanup_net 819f4a61 881039367828 8161ab9c 881039367828 8810393641c0 881039367858 8108cbee 881039367898 881039357ec8 0009 0009 881039367888 Call Trace: [] dump_stack+0x59/0x7d [] __might_sleep+0x17e/0x230 [] down_read+0x24/0x70 [] exit_signals+0x24/0x140 [] ? blocking_notifier_call_chain+0x16/0x20 [] do_exit+0xb2/0x4c0 [] oops_end+0xa9/0xf0 [] no_context+0x11e/0x1f0 [] __bad_area_nosemaphore+0x12d/0x230 [] bad_area_nosemaphore+0x13/0x20 [] __do_page_fault+0x133/0x4e0 [] ? __change_page_attr+0x6b/0x2b0 [] ? __change_page_attr_set_clr+0x4d/0xb0 [] do_page_fault+0x37/0x70 [] ? 
restore_args+0x30/0x30 [] page_fault+0x22/0x30 [] ? ip6mr_sk_done+0x32/0xb0 [ipv6] [] ? ip6mr_sk_done+0x25/0xb0 [ipv6] [] rawv6_close+0x21/0x40 [ipv6] [] inet_release+0xfb/0x220 [] ? inet_release+0x22/0x220 [] inet6_release+0x3f/0x50 [ipv6] [] sock_release+0x29/0xa0 [] sk_release_kernel+0x30/0x70 [] icmpv6_sk_exit+0x3b/0x80 [ipv6] [] ops_exit_list+0x39/0x60 [] cleanup_net+0xfb/0x1a0 [] process_one_work+0x
[PATCH] percpu_counter: __this_cpu_write doesn't need to be protected by spinlock
__this_cpu_write doesn't need to be protected by the spinlock, as we are doing a per-cpu write with preemption disabled. Another reason to move __this_cpu_write outside of the spinlock: __percpu_counter_sum is not an accurate counter anyway.

Signed-off-by: Fan Du
---
 lib/percpu_counter.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index ba6085d..1fc23a3 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -80,8 +80,8 @@ void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
 	if (count >= batch || count <= -batch) {
 		raw_spin_lock(&fbc->lock);
 		fbc->count += count;
-		__this_cpu_write(*fbc->counters, 0);
 		raw_spin_unlock(&fbc->lock);
+		__this_cpu_write(*fbc->counters, 0);
 	} else {
 		__this_cpu_write(*fbc->counters, count);
 	}
--
1.7.1
Re: PANIC at net/xfrm/xfrm_output.c:125 (3.9.4)
Hello Chris/Jean This issue might have already been fixed by this: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit/net/xfrm/xfrm_output.c?id=497574c72c9922cf20c12aed15313c389f722fa0 Hope it helps. On 2013年06月06日 09:04, Jean Sacren wrote: From: Chris Boot Date: Wed, 05 Jun 2013 22:47:48 +0100 Hi folks, I have a re-purposed Watchguard Firebox running Debian GNU/Linux with a self-built vanilla 3.9.4 kernel. I have an IPsec tunnel up to a remote router through which I was passing a fair bit of traffic when I hit the following panic: [486832.949560] BUG: unable to handle kernel NULL pointer dereference at 0010 [486832.953431] IP: [] xfrm_output_resume+0x61/0x29f [486832.953431] *pde = [486832.953431] Oops: [#1] [486832.953431] Modules linked in: xt_realm xt_nat authenc esp4 xfrm4_mode_tunnel tun ip6table_nat nf_nat_ipv6 sch_fq_codel xt_statistic xt_CT xt_LOG xt_connlimit xt_recent xt_time xt_TCPMSS xt_sctp ip6t_REJECT pppoe deflate zlib_deflate pppox ctr twofish_generic twofish_i586 twofish_common camellia_generic serpent_sse2_i586 xts serpent_generic lrw gf128mul glue_helper ablk_helper cryptd blowfish_generic blowfish_common cast5_generic cast_common des_generic cbc xcbc rmd160 sha512_generic sha256_generic sha1_generic hmac crypto_null af_key xfrm_algo xt_comment xt_addrtype xt_policy ip_set_hash_ip ipt_ULOG ipt_REJECT ipt_MASQUERADE ipt_ECN ipt_CLUSTERIP ipt_ah act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_set ip_set nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp xt_TPROXY nf_tproxy_core xt_tcpmss xt_pkttype 
xt_physdev xt_owner xt_NFQUEUE xt_NFLOG nfnetlink_log xt_multiport xt_mark xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_connmark xt_CLASSIFY xt_AUDIT xt_state nfnetlink bridge 8021q garp stp mrp llc ppp_generic slhc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_mangle ip6table_raw ip6table_filter ip6_tables xt_tcpudp xt_conntrack iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_filter ip_tables x_tables w83627hf hwmon_vid loop iTCO_wdt iTCO_vendor_support evdev snd_pcm snd_page_alloc snd_timer snd soundcore acpi_cpufreq mperf processor pcspkr serio_raw drm_kms_helper lpc_ich i2c_i801 of_i2c drm rng_core thermal_sys i2c_algo_bit ehci_pci i2c_core ext4 crc16 jbd2 mbcache dm_mod sg sd_mod crc_t10dif ata_generic ata_piix uhci_hcd ehci_hcd libata microcode scsi_mod skge sky2 usbcore usb_common [486832.953431] Pid: 0, comm: swapper Not tainted 3.9.4-1-bootc #1 [486832.953431] EIP: 0060:[] EFLAGS: 00210246 CPU: 0 [486832.953431] EIP is at xfrm_output_resume+0x61/0x29f [486832.953431] EAX: EBX: f3fbc100 ECX: f77f1288 EDX: f6130200 [486832.953431] ESI: 0016 EDI: EBP: f70b3c00 ESP: c1407c44 [486832.953431] DS: 007b ES: 007b FS: GS: 00e0 SS: 0068 [486832.953431] CR0: 8005003b CR2: 0010 CR3: 37247000 CR4: 07d0 [486832.953431] DR0: DR1: DR2: DR3: [486832.953431] DR6: 0ff0 DR7: 0400 [486832.953431] Process swapper (pid: 0, ti=c1406000 task=c1413490 task.ti=c1406000) [486832.953431] Stack: [486832.953431] c129d44f 8000 0002 c1457254 f3fbc100 c129d44f 0008 [486832.953431] c129d49e f4524000 c129d44f 8000 f3fbc100 c1268b49 [486832.953431] f3fbc100 f127604e c12678c3 f7123000 c1267685 c1457f8c c1456b80 [486832.953431] Call Trace: [486832.953431] [] ? xfrm4_extract_output+0x94/0x94 [486832.953431] [] ? xfrm4_extract_output+0x94/0x94 [486832.953431] [] ? xfrm4_output+0x2c/0x6a [486832.953431] [] ? xfrm4_extract_output+0x94/0x94 [486832.953431] [] ? 
ip_forward_finish+0x59/0x5c [486832.953431] [] ? ip_rcv_finish+0x23e/0x274 [486832.953431] [] ? pskb_may_pull+0x2d/0x2d [486832.953431] [] ? __netif_receive_skb_core+0x39d/0x406 [486832.953431] [] ? br_handle_frame_finish+0x22c/0x264 [bridge] [486832.953431] [] ? process_backlog+0xd0/0xd0 [486832.953431] [] ? br_handle_local_finish+0x4d/0x4d [bridge] [486832.953431] [] ? NF_HOOK_THRESH+0x1d/0x4c [bridge] [486832.953431] [] ? br_handle_local_finish+0x4d/0x4d [bridge] [486832.953431] [] ? br_nf_pre_routing_finish+0x1c8/0x1d2 [bridge] [486832.953431] [] ? br_handle_local_finish+0x4d/0x4d [bridge] [486832.953431] [] ? nf_hook_slow+0x52/0xed [486832.953431] [] ? nf_bridge_alloc.isra.18+0x32/0x32 [bridge] [486832.953431] [] ? nf_bridge_alloc.isra.18+0x32/0x32 [bridge] [486832.953431] [] ? NF_HOOK_THRESH+0x1d/0x4c [bridge] [486832.953431] [] ? nf_bridge_alloc
Re: [PATCH] fs: Disable preempt when acquire i_size_seqcount write lock
On Jan 11, 2013 at 06:38, Andrew Morton wrote:

On Wed, 9 Jan 2013 11:34:19 +0800 Fan Du wrote:

Two rt tasks are bound to one CPU core. The higher-priority rt task preempts a lower-priority rt task which has already taken the write seqcount lock, and then the higher-priority rt task tries to acquire the read seqcount lock; it's doomed to lock up.

rt task A (lower priority):                     rt task B (higher priority):
  call i_size_write
    write_seqcount_begin(&inode->i_size_seqcount);
                                                  call sync, and preempt task A
                                                  i_size_read
                                                    read_seqcount_begin  <-- lockup here...
    inode->i_size = i_size;  (never reached)

Ouch. And even if the preempting task is preemptible, it will spend an entire timeslice pointlessly spinning, which isn't very good.

So disabling preemption when acquiring every i_size_seqcount *write* lock will cure the problem.

...

--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -758,9 +758,11 @@ static inline loff_t i_size_read(const struct inode *inode)
 static inline void i_size_write(struct inode *inode, loff_t i_size)
 {
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+	preempt_disable();
 	write_seqcount_begin(&inode->i_size_seqcount);
 	inode->i_size = i_size;
 	write_seqcount_end(&inode->i_size_seqcount);
+	preempt_enable();
 #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
 	preempt_disable();
 	inode->i_size = i_size;

afaict all write_seqcount_begin()/read_seqretry() sites are vulnerable to this problem. Would it not be better to do the preempt_disable() in write_seqcount_begin()?

IMHO, write_seqcount_begin/write_seqcount_end are often wrapped by a mutex; this gives the higher-priority task a chance to sleep, and then the lower-priority task gets the CPU to unlock, avoiding the problematic scenario this patch describes. But in the i_size_write case, I can only see disabling preemption as a good choice, before someone else has a better idea :)

Possible problems:

- mm/filemap_xip.c does disk I/O under write_seqcount_begin().
- dev_change_name() does GFP_KERNEL allocations under write_seqcount_begin().
- I didn't review u64_stats_update_begin() callers.

But I think calling schedule() under preempt_disable() is OK anyway?
[PATCH] fs: Disable preempt when acquire i_size_seqcount write lock
Two rt tasks are bound to one CPU core. The higher-priority rt task preempts a lower-priority rt task which has already taken the write seqcount lock, and then the higher-priority rt task tries to acquire the read seqcount lock; it's doomed to lock up.

rt task A (lower priority):                     rt task B (higher priority):
  call i_size_write
    write_seqcount_begin(&inode->i_size_seqcount);
                                                  call sync, and preempt task A
                                                  i_size_read
                                                    read_seqcount_begin  <-- lockup here...
    inode->i_size = i_size;  (never reached)

So disabling preemption when acquiring every i_size_seqcount *write* lock will cure the problem.

Signed-off-by: Fan Du
---
 include/linux/fs.h | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index db84f77..1b69e87 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -758,9 +758,11 @@ static inline loff_t i_size_read(const struct inode *inode)
 static inline void i_size_write(struct inode *inode, loff_t i_size)
 {
 #if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+	preempt_disable();
 	write_seqcount_begin(&inode->i_size_seqcount);
 	inode->i_size = i_size;
 	write_seqcount_end(&inode->i_size_seqcount);
+	preempt_enable();
 #elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
 	preempt_disable();
 	inode->i_size = i_size;
--
1.7.0.5
Re: suspicious RCU usage in xfrm_net_init()
On Aug 17, 2012 at 09:34, Paul Gortmaker wrote:

On Thu, Aug 16, 2012 at 9:07 PM, Fan Du wrote:

On Aug 16, 2012 at 23:19, Fengguang Wu wrote:

Hi Fan,

On Thu, Aug 16, 2012 at 05:36:35PM +0800, Fan Du wrote:

Hi, Fengguang

Could you please try the patch below and see if the spewing is still there? Thanks.

Yes, it worked, thank you very much!

Hi, Dave

Could you please pick up this patch?

Please do not make extra work for maintainers by sending attachments, or requests for status/merge, etc. Your 1st patch had to be manually set to an RFC, and now you add another patch less than 24h later. Please see:

http://patchwork.ozlabs.org/patch/177934/
http://patchwork.ozlabs.org/patch/178132/

Also, a patch should describe the problem it solves (i.e. the symptom the end user sees), how the problem originated, and why the fix in the patch is the _right_ fix. The worst description a commit log can have is one that just describes the C change in words, since most people can read C on their own. Here you add "_bh" in the code and then repeat exactly that in the commit log. Your commit log does not tell me when it broke, or why it broke, or who had their use case broken. Can you see why this is not acceptable?

Please take the time to look at the traffic in netdev, and read the feedback given by maintainers on other patches, so that the common errors are understood by you, and not repeated. It will be time well spent!

Rick Jones has already informed me of the etiquette off-list, which I just broke. Anyway, thanks for your time writing those suggestions.

Thanks, Paul.

---

thanks

btw, your email client wraps long lines..

Oh, I will definitely fix this.
thanks feng guang for the testing :) Thanks, Fengguang From a3f86ecc3ee16ff81d49416bbf791780422988b3 Mon Sep 17 00:00:00 2001 From: Fan Du Date: Thu, 16 Aug 2012 17:31:25 +0800 Subject: [PATCH] Use rcu_dereference_bh to deference pointer protected by rcu_read_lock_bh Signed-off-by: Fan Du --- net/xfrm/xfrm_policy.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 5ad4d2c..75a9d6a 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -2501,7 +2501,7 @@ static void __net_init xfrm_dst_ops_init(struct net *net) struct xfrm_policy_afinfo *afinfo; rcu_read_lock_bh(); - afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET]); + afinfo = rcu_dereference_bh(xfrm_policy_afinfo[AF_INET]); if (afinfo) net->xfrm.xfrm4_dst_ops = *afinfo->dst_ops; #if IS_ENABLED(CONFIG_IPV6) -- 1.7.1 On 2012年08月16日 15:37, Fengguang Wu wrote: Hi Priyanka, The below warning shows up, probably related to this commit: 418a99ac6ad487dc9c42e6b0e85f941af56330f2 Replace rwlock on xfrm_policy_afinfo with rcu [0.921216] [0.921645] === [0.922766] [ INFO: suspicious RCU usage. ] [0.923887] 3.5.0-01540-g1669891 #64 Not tainted [0.925123] --- [0.932860] /c/kernel-tests/src/tip/net/xfrm/xfrm_policy.c:2504 suspicious rcu_dereference_check() usage! [0.935361] [0.935361] other info that might help us debug this: [0.935361] [0.937472] [0.937472] rcu_scheduler_active = 1, debug_locks = 0 [0.939182] 2 locks held by swapper/1: [0.940171] #0: (net_mutex){+.+.+.}, at: [] register_pernet_subsys+0x21/0x57 [0.942705] #1: (rcu_read_lock_bh){..}, at: [] xfrm_net_init+0x1e4/0x437 [0.951507] [0.951507] stack backtrace: [0.952660] Pid: 1, comm: swapper Not tainted 3.5.0-01540-g1669891 #64 [0.954364] Call Trace: [0.955074] [] lockdep_rcu_suspicious+0x174/0x187 [0.956736] [] xfrm_net_init+0x30e/0x437 [0.958205] [] ? xfrm_net_init+0x1e4/0x437 [0.959712] [] ops_init+0x1bb/0x1ff [0.961067] [] ? 
trace_hardirqs_on+0x1b/0x24 [0.962644] [] register_pernet_operations.isra.5+0x9d/0xfe [0.971376] [] register_pernet_subsys+0x30/0x57 [0.972992] [] xfrm_init+0x17/0x2c [0.974316] [] ip_rt_init+0x82/0xe7 [0.975668] [] ip_init+0x10/0x25 [0.976952] [] inet_init+0x235/0x360 [0.978352] [] ? devinet_init+0xf2/0xf2 [0.979808] [] do_one_initcall+0xb4/0x203 [0.981313] [] kernel_init+0x1a9/0x29a [0.982732] [] ? loglevel+0x46/0x46 [0.990889] [] kernel_thread_helper+0x4/0x10 [0.992472] [] ? retint_restore_args+0x13/0x13 [0.994076] [] ? do_one_initcall+0x203/0x203 [0.995636] [] ? gs_change+0x13/0x13 [0.997197] TCP established hash table entries: 8192 (order: 5, 131072 bytes) [1.74] TCP bind hash table entries: 8192 (order: 7, 655360 bytes) Thanks, Fengguang -- Love each day! --fan -- Love each day! --fan -- Love each day! --fan -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org Mo
Re: suspicious RCU usage in xfrm_net_init()
On Aug 16, 2012, at 23:19, Fengguang Wu wrote:
> Hi Fan,
>
> On Thu, Aug 16, 2012 at 05:36:35PM +0800, Fan Du wrote:
>> Hi, Fengguang
>>
>> Could you please try the below patch, see if the spewing is still
>> there? thanks
>
> Yes, it worked, thank you very much!

Hi, Dave

Could you please pick up this patch?

> thanks btw, your email client wraps long lines..

Oh, I will definitely fix this.
thanks Fengguang for the testing :)

> Thanks,
> Fengguang
>
>> [single-hunk patch and lockdep splat snipped; both are quoted in
>>  full elsewhere in this thread]
>
> Thanks,
> Fengguang

--
Love each day!
	--fan

From f3b4d84f7ca2ed235c8c8ae5186f45e20b9b80db Mon Sep 17 00:00:00 2001
From: Fan Du
Date: Thu, 16 Aug 2012 17:51:25 +0800
Subject: [PATCH] Use rcu_dereference_bh to dereference a pointer protected
 by rcu_read_lock_bh

Signed-off-by: Fan Du
---
 net/xfrm/xfrm_policy.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5ad4d2c..6405764 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2501,11 +2501,11 @@ static void __net_init xfrm_dst_ops_init(struct net *net)
 	struct xfrm_policy_afinfo *afinfo;
 
 	rcu_read_lock_bh();
-	afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET]);
+	afinfo = rcu_dereference_bh(xfrm_policy_afinfo[AF_INET]);
 	if (afinfo)
 		net->xfrm.xfrm4_dst_ops = *afinfo->dst_ops;
 #if IS_ENABLED(CONFIG_IPV6)
-	afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET6]);
+	afinfo = rcu_dereference_bh(xfrm_policy_afinfo[AF_INET6]);
 	if (afinfo)
 		net->xfrm.xfrm6_dst_ops = *afinfo->dst_ops;
 #endif
-- 
1.7.1
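[Editor's note] For context on why lockdep complained: rcu_dereference() asserts that the matching flavor of RCU read-side lock is held, so a pointer read inside rcu_read_lock_bh() must use rcu_dereference_bh(). A toy userspace model of that flavor check — the `toy_` names are made up; the real checks live in the kernel's RCU/lockdep code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of lockdep's RCU flavor tracking.  Plain rcu_dereference()
 * is only considered safe under rcu_read_lock(); a section entered
 * with rcu_read_lock_bh() must pair with rcu_dereference_bh(), or
 * lockdep prints "suspicious rcu_dereference_check() usage". */
static bool rcu_held;     /* models rcu_read_lock()    */
static bool rcu_bh_held;  /* models rcu_read_lock_bh() */

static void toy_rcu_read_lock_bh(void)   { rcu_bh_held = true; }
static void toy_rcu_read_unlock_bh(void) { rcu_bh_held = false; }

/* Each returns true when the dereference would pass the flavor check. */
static bool toy_rcu_dereference_ok(void)    { return rcu_held; }
static bool toy_rcu_dereference_bh_ok(void) { return rcu_bh_held; }
```

In xfrm_dst_ops_init() the read section is opened with rcu_read_lock_bh(), so the plain rcu_dereference() calls tripped exactly this mismatch; switching them to rcu_dereference_bh() satisfies the check.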
Re: suspicious RCU usage in xfrm_net_init()
Hi, Fengguang

Could you please try the below patch, see if the spewing is still there?
thanks

From a3f86ecc3ee16ff81d49416bbf791780422988b3 Mon Sep 17 00:00:00 2001
From: Fan Du
Date: Thu, 16 Aug 2012 17:31:25 +0800
Subject: [PATCH] Use rcu_dereference_bh to dereference a pointer protected
 by rcu_read_lock_bh

Signed-off-by: Fan Du
---
 net/xfrm/xfrm_policy.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5ad4d2c..75a9d6a 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2501,7 +2501,7 @@ static void __net_init xfrm_dst_ops_init(struct net *net)
 	struct xfrm_policy_afinfo *afinfo;
 
 	rcu_read_lock_bh();
-	afinfo = rcu_dereference(xfrm_policy_afinfo[AF_INET]);
+	afinfo = rcu_dereference_bh(xfrm_policy_afinfo[AF_INET]);
 	if (afinfo)
 		net->xfrm.xfrm4_dst_ops = *afinfo->dst_ops;
 #if IS_ENABLED(CONFIG_IPV6)
-- 
1.7.1

On Aug 16, 2012, at 15:37, Fengguang Wu wrote:
> Hi Priyanka,
>
> The below warning shows up, probably related to this commit:
> 418a99ac6ad487dc9c42e6b0e85f941af56330f2
> Replace rwlock on xfrm_policy_afinfo with rcu
>
> [lockdep splat snipped; quoted in full elsewhere in this thread]
>
> Thanks,
> Fengguang

--
Love each day!
	--fan
Re: [PATCH] MIPS: oops when show backtrace of all active cpu
On Aug 2, 2012, at 12:18, Kevin Cernekee wrote:
> On Wed, Aug 1, 2012 at 8:31 PM, Fan Du wrote:
>> show_backtrace must have a valid task when calling unwind_stack, so
>> fix it by checking first.
> [...]
>> --- a/arch/mips/kernel/traps.c
>> +++ b/arch/mips/kernel/traps.c
>> @@ -151,6 +151,10 @@ static void show_backtrace(struct task_struct *task,
>>  		show_raw_backtrace(sp);
>>  		return;
>>  	}
>> +
>> +	if (task == NULL)
>> +		task = current;
>> +
>>  	printk("Call Trace:\n");
>>  	do {
>>  		print_ip_sym(pc);
>
> FYI, a slightly different version of this change was accepted:
>
> https://patchwork.linux-mips.org/patch/3524/

Oh, looks like I'm late :) thanks anyway.

--
Love each day!
	--fan
[PATCH] MIPS: oops when show backtrace of all active cpu
show_backtrace must have a valid task when calling unwind_stack, so
fix it by checking first.

root@romashell:/root> echo l > /proc/sysrq-trigger
SysRq : Show backtrace of all active CPUs
Stack : 81bf 81ceroot@octeon:/root> 81bf 0036 81bf 0003
	a8041b60b850 811dfa8c 4000 a8041fc0c060 a8041b60b870 8124c42c
	0038 811b8efc a80003abbc00 81be0980 81be4200 0038
	81ce 81b92c40 a8041b60b8c0 81b92c40 8157b0d8 4000
	4000 81be0990 a8041b60b810 a8041b60b910 8157b0e4 a8041b60b958
	a80003b9b358 0001 81d7 811a1ac8 a8041b60b990 8157b0d8
	...
Call Trace:
[] show_stack+0xd8/0x100

CPU 14 Unable to handle kernel paging request at virtual address 0008,
epc == 8119c6a4, ra == 811a0664
Oops[#1]:
Cpu 14
$ 0 : 811a0664 002e 00020003
$ 4 : a8041b60b770 811a1ac8 a8041b60b778
$ 8 : 3fff 282d 0001
$12 : 811a19a8 101f a8041fe68000
$16 : 81b5ae70 a8041b60b770 811a1ac8
$20 : a8041b60b810 81b5ae98 00200200
$24 : 2b6e8c94
$28 : a8041b608000 a8041b60b700 a8041b60b700 811a0664
Hi : 0002
Lo : bf7889a5a000
epc : 8119c6a4 unwind_stack+0x2c/0x180
      Not tainted
ra  : 811a0664 show_backtrace+0x15c/0x178
Status: 10008ce2 KX SX UX KERNEL EXL
Cause : 40808008
BadVA : 0008
PrId  : 000d030b (Cavium Octeon+)
Modules linked in:
Process klogd (pid: 1163, threadinfo=a8041b608000, task=a8041c271600, tls=2aed3f70)
Stack : 81b5ae70 a8041b60b778 a8041b60b770 811a0650 a8041b60b740 811a1ac8
	811a1ac8 81b5ae70 a8041b60b778 a8041b60b770 811a0664 a8041b60b810
	8157b0e4 a8041b60b950 0028 81b5abe0 a8041b60b810 a8041b60b7c0
	811a075c a8041b60b7d0 4000 4000 81be0990 a8041b60b810
	811a1a30 81bf 81ce 81bf 0036 81bf 0003
	...
Call Trace:
[] unwind_stack+0x2c/0x180
[] show_backtrace+0x15c/0x178
[] show_stacktrace+0xdc/0x138
[] show_stack+0x40/0x100
[] showacpu+0x84/0x98
[] generic_smp_call_function_interrupt+0x174/0x250
[] smp_call_function_interrupt+0x34/0x50
[] mailbox_interrupt+0x78/0x80
[] handle_IRQ_event+0x9c/0x368
[] handle_percpu_irq+0x70/0xd0
[] do_IRQ+0x40/0x60
[] native_plat_irq_dispatch+0xf8/0x188
[] plat_irq_dispatch+0x24/0x38
[] ret_from_irq+0x0/0x4
[] _raw_spin_unlock_irq+0x38/0x70
[] do_syslog+0x408/0x590
[] kmsg_read+0x48/0x98
[] proc_reg_read+0xa0/0xf0
[] vfs_read+0xcc/0x160
[] SyS_read+0x6c/0x168
[] handle_sysn32+0x44/0x84

Code: 00a0902d 122d 00c0882d 3c02811a 64423e60 10c20033
Disabling lock debugging due to kernel taint
Kernel panic - not syncing: Fatal exception in interrupt

Signed-off-by: Fan Du
---
 arch/mips/kernel/traps.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index b25b921..99c6611 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -151,6 +151,10 @@ static void show_backtrace(struct task_struct *task,
 		show_raw_backtrace(sp);
 		return;
 	}
+
+	if (task == NULL)
+		task = current;
+
 	printk("Call Trace:\n");
 	do {
 		print_ip_sym(pc);
-- 
1.7.0.5
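[Editor's note] The defensive pattern in the hunk above — substituting the running task when the caller passes none — can be sketched in isolation. This is a userspace model with stand-in types; in the kernel, `current` is the per-CPU running task, not a global as here:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch of the NULL-task fallback added to show_backtrace().
 * The sysrq 'l' path (showacpu -> show_stack) can reach the unwinder
 * with task == NULL; unwind_stack() dereferences task, hence the oops.
 * Defaulting to the currently running task avoids the bad pointer. */
struct task_struct { int pid; };

/* Stand-in for the kernel's 'current'; pid taken from the oops above. */
static struct task_struct current_task = { .pid = 1163 };

/* Models the fixed show_backtrace(): never hands a NULL task onward. */
static int backtrace_task_pid(struct task_struct *task)
{
	if (task == NULL)            /* the added check */
		task = &current_task;
	return task->pid;            /* stands in for unwind_stack(task, ...) */
}
```

The accepted upstream variant (linked in the follow-up reply) applies the same idea with a slightly different placement of the check.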