Re: [PATCH v2 3/3] drivers core: allow id match override when manually binding driver
On 1 July 2016 at 17:00, Mark Brown wrote:
> On Fri, Jul 01, 2016 at 10:58:34AM +0200, Michal Suchanek wrote:
>> On 1 July 2016 at 10:25, Mark Brown wrote:
>
>> > It's been repeatedly suggested to you that the tooling for this stuff
>> > could use some work. Please go and put some effort into that rather
>> > than continuing this thread which is accomplishing nothing.
>
>> You completely miss the point. No tooling will make people reconfigure
>> the kernel when the configuration in fact stays the same.
>
>> Sure the tooling does need work. And it would help getting the cases
>> when the tooling is NOT needed out of the way.
>
> I understand the problem perfectly, no amount of repeating yourself is
> going to change the problems that the bodge you are trying to force in
> creates for other users and the maintainability of the system.

Can you please specify what problems this patch creates for other users
and for maintainability? Without stating the problems of this solution
clearly, or proposing an alternate workable solution, there is not much
that can be done.

Thanks

Michal
[PATCH 01/31] mm, vmstat: add infrastructure for per-node vmstats
Series: "Move LRU page reclaim from zones to nodes v7"

This is the latest version of a series that moves LRUs from the zones to
the node. It is based upon 4.6-rc3 plus the page allocator optimisation
series. Conceptually, this is simple but there are a lot of details. Some
of the broad motivations for this are:

1. The residency of a page partially depends on what zone the page was
   allocated from. This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly differently to reclaim
   on node 1. For example, direct reclaim scans in zonelist order and
   reclaims even if the zone is over the high watermark, regardless of the
   age of pages in that LRU. Kswapd, on the other hand, starts reclaim on
   the highest unbalanced zone. A difference in the distribution of
   file/anon pages due to when they were allocated can result in a
   difference in aging. While the fair zone allocation policy mitigates
   some of the problems here, the page reclaim results on a multi-zone
   node will always be different to those on a single-zone node.

3. kswapd and the page allocator scan zones in opposite orders to avoid
   interfering with each other. In the ideal case this stops the page
   allocator from using pages that were allocated very recently, but it is
   sensitive to timing. When kswapd is allocating from lower zones it
   works well, but during rebalancing of the highest zone the page
   allocator and kswapd interfere with each other. It's worse if the
   highest zone is small and difficult to balance.

4. slab shrinkers are node-based, which makes it harder to identify the
   exact relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have large
highmem zones in common configurations and it was necessary to quickly
find ZONE_NORMAL pages for reclaim.
Today, this is much less of a concern as machines with lots of memory
will (or should) use 64-bit kernels. Combinations of 32-bit hardware and
64-bit hardware are rare. Machines that do use highmem should have lower
highmem:lowmem ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the
NUMA machine behaved similarly.

pagealloc
This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested up to order-4 but only orders 0 and 1
are shown as the other orders were comparable.

                                   4.7.0-rc3            4.7.0-rc3
                              mmotm-20160615        nodelru-v7r17
Min      total-odr0-1       485.00 (  0.00%)     462.00 (  4.74%)
Min      total-odr0-2       354.00 (  0.00%)     341.00 (  3.67%)
Min      total-odr0-4       285.00 (  0.00%)     277.00 (  2.81%)
Min      total-odr0-8       249.00 (  0.00%)     240.00 (  3.61%)
Min      total-odr0-16      230.00 (  0.00%)     224.00 (  2.61%)
Min      total-odr0-32      222.00 (  0.00%)     215.00 (  3.15%)
Min      total-odr0-64      216.00 (  0.00%)     210.00 (  2.78%)
Min      total-odr0-128     214.00 (  0.00%)     208.00 (  2.80%)
Min      total-odr0-256     248.00 (  0.00%)     233.00 (  6.05%)
Min      total-odr0-512     277.00 (  0.00%)     270.00 (  2.53%)
Min      total-odr0-1024    294.00 (  0.00%)     284.00 (  3.40%)
Min      total-odr0-2048    308.00 (  0.00%)     298.00 (  3.25%)
Min      total-odr0-4096    318.00 (  0.00%)     307.00 (  3.46%)
Min      total-odr0-8192    322.00 (  0.00%)     308.00 (  4.35%)
Min      total-odr0-16384   324.00 (  0.00%)     309.00 (  4.63%)
Min      total-odr1-1       729.00 (  0.00%)     686.00 (  5.90%)
Min      total-odr1-2       533.00 (  0.00%)     520.00 (  2.44%)
Min      total-odr1-4       434.00 (  0.00%)     415.00 (  4.38%)
Min      total-odr1-8       390.00 (  0.00%)     364.00 (  6.67%)
Min      total-odr1-16      359.00 (  0.00%)     335.00 (  6.69%)
Min      total-odr1-32      356.00 (  0.00%)     327.00 (  8.15%)
Min      total-odr1-64      356.00 (  0.00%)     321.00 (  9.83%)
Min      total-odr1-128     356.00 (  0.00%)     333.00 (  6.46%)
Min      total-odr1-256     354.00 (  0.00%)     337.00 (  4.80%)
Min      total-odr1-512     366.00 (  0.00%)
[PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis
This patch makes reclaim decisions on a per-node basis. A reclaimer knows
what zone is required by the allocation request and skips pages from
higher zones. In many cases this will be ok because it's a GFP_HIGHMEM
request of some description. On 64-bit, ZONE_DMA32 requests will cause
some problems but 32-bit devices on 64-bit platforms are increasingly
rare. Historically it would have been a major problem on 32-bit with big
Highmem:Lowmem ratios but such configurations are also now rare and even
where they exist, they are not encouraged. If it really becomes a
problem, it'll manifest as very low reclaim efficiencies.

Link: http://lkml.kernel.org/r/1466518566-30034-5-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman
Cc: Johannes Weiner
Cc: Vlastimil Babka
Cc: Rik van Riel
Signed-off-by: Andrew Morton
---
 mm/vmscan.c | 79 ++---
 1 file changed, 55 insertions(+), 24 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 86a523a761c9..766b36bec829 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -84,6 +84,9 @@ struct scan_control {
 	/* Scan (total_size >> priority) pages at once */
 	int priority;
 
+	/* The highest zone to isolate pages for reclaim from */
+	enum zone_type reclaim_idx;
+
 	unsigned int may_writepage:1;
 
 	/* Can mapped pages be reclaimed? */
@@ -1392,6 +1395,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 	unsigned long nr_taken = 0;
 	unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
 	unsigned long scan, nr_pages;
+	LIST_HEAD(pages_skipped);
 
 	for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
 					!list_empty(src); scan++) {
@@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
 		VM_BUG_ON_PAGE(!PageLRU(page), page);
 
+		if (page_zonenum(page) > sc->reclaim_idx) {
+			list_move(&page->lru, &pages_skipped);
+			continue;
+		}
+
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
 			nr_pages = hpage_nr_pages(page);
@@ -1420,6 +1429,15 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 		}
 	}
 
+	/*
+	 * Splice any skipped pages to the start of the LRU list. Note that
+	 * this disrupts the LRU order when reclaiming for lower zones but
+	 * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
+	 * scanning would soon rescan the same pages to skip and put the
+	 * system at risk of premature OOM.
+	 */
+	if (!list_empty(&pages_skipped))
+		list_splice(&pages_skipped, src);
 	*nr_scanned = scan;
 	trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
 				    nr_taken, mode, is_file_lru(lru));
@@ -1589,7 +1607,7 @@ static int current_may_throttle(void)
 }
 
 /*
- * shrink_inactive_list() is a helper for shrink_zone(). It returns the number
+ * shrink_inactive_list() is a helper for shrink_node(). It returns the number
  * of reclaimed pages
  */
 static noinline_for_stack unsigned long
@@ -2401,12 +2419,13 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	}
 }
 
-static bool shrink_zone(struct zone *zone, struct scan_control *sc,
-			bool is_classzone)
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
+			enum zone_type classzone_idx)
 {
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
+	struct zone *zone = &pgdat->node_zones[classzone_idx];
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2438,7 +2457,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 			shrink_zone_memcg(zone, memcg, sc, &lru_pages);
 			zone_lru_pages += lru_pages;
 
-			if (memcg && is_classzone)
+			if (!global_reclaim(sc))
 				shrink_slab(sc->gfp_mask, zone_to_nid(zone),
 					    memcg, sc->nr_scanned - scanned,
 					    lru_pages);
@@ -2469,7 +2488,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 		 * Shrink the slab caches in the same proportion that
 		 * the eligible LRU pages were scanned.
 		 */
-		if (global_reclaim(sc) && is_classzone)
+		if (global_reclaim(sc))
 			shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
 				    sc->nr_scanned - nr_scanned,
 				    zone_lru_pages);
@@ -2553,7 +2572,7 @@ static void
[PATCH 11/31] mm: vmscan: do not reclaim from kswapd if there is any eligible zone
kswapd scans from highest to lowest for a zone that requires balancing.
This was necessary when reclaim was per-zone to fairly age pages on
lower zones. Now that we are reclaiming on a per-node basis, any
eligible zone can be used and pages will still be aged fairly. This
patch avoids reclaiming excessively unless buffer_heads are over the
limit and it's necessary to reclaim from a higher zone than requested
by the waker of kswapd to relieve low memory pressure.

[hillf...@alibaba-inc.com: Force kswapd reclaim no more than needed]
Link: http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman
Signed-off-by: Hillf Danton
Acked-by: Vlastimil Babka
---
 mm/vmscan.c | 56
 1 file changed, 24 insertions(+), 32 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 911142d25de2..2f898ba2ee2e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3141,31 +3141,36 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		sc.nr_reclaimed = 0;
 
-		/* Scan from the highest requested zone to dma */
-		for (i = classzone_idx; i >= 0; i--) {
-			zone = pgdat->node_zones + i;
-			if (!populated_zone(zone))
-				continue;
-
-			/*
-			 * If the number of buffer_heads in the machine
-			 * exceeds the maximum allowed level and this node
-			 * has a highmem zone, force kswapd to reclaim from
-			 * it to relieve lowmem pressure.
-			 */
-			if (buffer_heads_over_limit && is_highmem_idx(i)) {
-				classzone_idx = i;
-				break;
-			}
+		/*
+		 * If the number of buffer_heads in the machine exceeds the
+		 * maximum allowed level then reclaim from all zones. This is
+		 * not specific to highmem as highmem may not exist but it is
+		 * expected that buffer_heads are stripped in writeback.
+		 */
+		if (buffer_heads_over_limit) {
+			for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
+				zone = pgdat->node_zones + i;
+				if (!populated_zone(zone))
+					continue;
 
-			if (!zone_balanced(zone, order, 0)) {
 				classzone_idx = i;
 				break;
 			}
 		}
 
-		if (i < 0)
-			goto out;
+		/*
+		 * Only reclaim if there are no eligible zones. Check from
+		 * high to low zone to avoid prematurely clearing pgdat
+		 * congested state.
+		 */
+		for (i = classzone_idx; i >= 0; i--) {
+			zone = pgdat->node_zones + i;
+			if (!populated_zone(zone))
+				continue;
+
+			if (zone_balanced(zone, sc.order, classzone_idx))
+				goto out;
+		}
 
 		/*
 		 * Do some background aging of the anon list, to give
@@ -3211,19 +3216,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			break;
 
 		/*
-		 * Stop reclaiming if any eligible zone is balanced and clear
-		 * node writeback or congested.
-		 */
-		for (i = 0; i <= classzone_idx; i++) {
-			zone = pgdat->node_zones + i;
-			if (!populated_zone(zone))
-				continue;
-
-			if (zone_balanced(zone, sc.order, classzone_idx))
-				goto out;
-		}
-
-		/*
+		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
-- 
2.6.4
[PATCH 06/31] mm, vmscan: make kswapd reclaim in terms of nodes
Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
thinking of reclaim in terms of nodes but kswapd is still zone-centric.
This patch gets rid of many of the node-based versus zone-based decisions.

o A node is considered balanced when any eligible lower zone is balanced.
  This eliminates one class of age-inversion problem because we avoid
  reclaiming a newer page just because it's in the wrong zone
o pgdat_balanced disappears because we now only care about one zone being
  balanced.
o Some anomalies related to writeback and congestion tracking being based
  on zones disappear.
o kswapd no longer has to take care to reclaim zones in the reverse order
  that the page allocator uses.
o Most importantly of all, reclaim from node 0 with multiple zones will
  have similar aging and reclaiming characteristics as every other node.

Link: http://lkml.kernel.org/r/1466518566-30034-7-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
Cc: Rik van Riel
Signed-off-by: Andrew Morton
---
 mm/vmscan.c | 292 +---
 1 file changed, 101 insertions(+), 191 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6e61dae382b..7b382b90b145 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2980,7 +2980,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 }
 #endif
 
-static void age_active_anon(struct zone *zone, struct scan_control *sc)
+static void age_active_anon(struct pglist_data *pgdat,
+				struct zone *zone, struct scan_control *sc)
 {
 	struct mem_cgroup *memcg;
 
@@ -2999,85 +3000,15 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order, bool highorder,
+static bool zone_balanced(struct zone *zone, int order,
 			unsigned long balance_gap, int classzone_idx)
 {
 	unsigned long mark = high_wmark_pages(zone) + balance_gap;
 
-	/*
-	 * When checking from pgdat_balanced(), kswapd should stop and sleep
-	 * when it reaches the high order-0 watermark and let kcompactd take
-	 * over. Other callers such as wakeup_kswapd() want to determine the
-	 * true high-order watermark.
-	 */
-	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
-		mark += (1UL << order);
-		order = 0;
-	}
-
 	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
 
 /*
- * pgdat_balanced() is used when checking if a node is balanced.
- *
- * For order-0, all zones must be balanced!
- *
- * For high-order allocations only zones that meet watermarks and are in a
- * zone allowed by the callers classzone_idx are added to balanced_pages. The
- * total of balanced pages must be at least 25% of the zones allowed by
- * classzone_idx for the node to be considered balanced. Forcing all zones to
- * be balanced for high orders can cause excessive reclaim when there are
- * imbalanced zones.
- * The choice of 25% is due to
- *   o a 16M DMA zone that is balanced will not balance a zone on any
- *     reasonable sized machine
- *   o On all other machines, the top zone must be at least a reasonable
- *     percentage of the middle zones. For example, on 32-bit x86, highmem
- *     would need to be at least 256M for it to be balance a whole node.
- *     Similarly, on x86-64 the Normal zone would need to be at least 1G
- *     to balance a node on its own. These seemed like reasonable ratios.
- */
-static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
-{
-	unsigned long managed_pages = 0;
-	unsigned long balanced_pages = 0;
-	int i;
-
-	/* Check the watermark levels */
-	for (i = 0; i <= classzone_idx; i++) {
-		struct zone *zone = pgdat->node_zones + i;
-
-		if (!populated_zone(zone))
-			continue;
-
-		managed_pages += zone->managed_pages;
-
-		/*
-		 * A special case here:
-		 *
-		 * balance_pgdat() skips over all_unreclaimable after
-		 * DEF_PRIORITY. Effectively, it considers them balanced so
-		 * they must be considered balanced here as well!
-		 */
-		if (!pgdat_reclaimable(zone->zone_pgdat)) {
-			balanced_pages += zone->managed_pages;
-			continue;
-		}
-
-		if (zone_balanced(zone, order, false, 0, i))
-			balanced_pages += zone->managed_pages;
-		else if (!order)
-			return false;
-	}
-
-	if (order)
-		return balanced_pages >= (managed_pages >> 2);
-	else
-		return true;
-}
-
-/*
  * Prepare kswapd for sleeping. This verifies that there are no processes
[PATCH 12/31] mm, vmscan: make shrink_node decisions more node-centric
Earlier patches focused on having direct reclaim and kswapd use data
that is node-centric for reclaiming but shrink_node() itself still uses
too much zone information. This patch removes unnecessary zone-based
information with the most important decision being whether to continue
reclaim or not. Some memcg APIs are adjusted as a result even though
memcg itself still uses some zone information.

Signed-off-by: Mel Gorman
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
---
 include/linux/memcontrol.h | 19
 include/linux/mmzone.h     |  4 ++--
 include/linux/swap.h       |  2 +-
 mm/memcontrol.c            |  4 ++--
 mm/page_alloc.c            |  2 +-
 mm/vmscan.c                | 57 ++
 mm/workingset.c            |  6 ++---
 7 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1927dcb6921e..48b43c709ed7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -325,22 +325,23 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
 }
 
 /**
- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
+ * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
+ * @node: node of the wanted lruvec
  * @zone: zone of the wanted lruvec
  * @memcg: memcg of the wanted lruvec
  *
- * Returns the lru list vector holding pages for the given @zone and
- * @mem. This can be the global zone lruvec, if the memory controller
+ * Returns the lru list vector holding pages for a given @node or a given
+ * @memcg and @zone. This can be the node lruvec, if the memory controller
  * is disabled.
  */
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-						    struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+				struct zone *zone, struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_per_zone *mz;
 	struct lruvec *lruvec;
 
 	if (mem_cgroup_disabled()) {
-		lruvec = zone_lruvec(zone);
+		lruvec = node_lruvec(pgdat);
 		goto out;
 	}
 
@@ -610,10 +611,10 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 {
 }
 
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-						    struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+				struct zone *zone, struct mem_cgroup *memcg)
 {
-	return zone_lruvec(zone);
+	return node_lruvec(pgdat);
 }
 
 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index eb74e63df5cf..f88cbbb476c8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -739,9 +739,9 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
 	return &zone->zone_pgdat->lru_lock;
 }
 
-static inline struct lruvec *zone_lruvec(struct zone *zone)
+static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
-	return &zone->zone_pgdat->lruvec;
+	return &pgdat->lruvec;
 }
 
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 916e2eddecd6..0ad616d7c381 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,7 +316,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
 						  bool may_swap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
 						unsigned long *nr_scanned);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50c86ad121bc..c9ebec98e92a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1432,8 +1432,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 			}
 			continue;
 		}
-		total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-						zone, &nr_scanned);
+		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
+						zone, &nr_scanned);
 		*total_scanned += nr_scanned;
 		if (!soft_limit_excess(root_memcg))
 			break;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index
[PATCH 07/31] mm, vmscan: remove balance gap
The balance gap was introduced to apply equal pressure to all zones when
reclaiming for a higher zone. With node-based LRU, the need for the
balance gap is removed and the code is dead, so remove it.

[vba...@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
Link: http://lkml.kernel.org/r/1466518566-30034-8-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Johannes Weiner
Cc: Rik van Riel
Signed-off-by: Andrew Morton
---
 include/linux/swap.h |  9 -
 mm/vmscan.c          | 19 ---
 2 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c82f916008b7..916e2eddecd6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -157,15 +157,6 @@ enum {
 #define SWAP_CLUSTER_MAX 32UL
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
-/*
- * Ratio between zone->managed_pages and the "gap" that above the per-zone
- * "high_wmark". While balancing nodes, We allow kswapd to shrink zones that
- * do not meet the (high_wmark + gap) watermark, even which already met the
- * high_wmark, in order to provide better per-zone lru behavior. We are ok to
- * spend not more than 1% of the memory for this zone balancing "gap".
- */
-#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
-
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
 #define SWAP_HAS_CACHE	0x40	/* Flag page is cached, in first swap_map */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b382b90b145..a52167eabc96 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2518,7 +2518,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
  */
 static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
 {
-	unsigned long balance_gap, watermark;
+	unsigned long watermark;
 	bool watermark_ok;
 
 	/*
@@ -2527,9 +2527,7 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_
 	 * there is a buffer of free pages available to give compaction
 	 * a reasonable chance of completing and allocating the page
 	 */
-	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
-	watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
+	watermark = high_wmark_pages(zone) + (2UL << order);
 	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
 
 	/*
@@ -3000,10 +2998,9 @@ static void age_active_anon(struct pglist_data *pgdat,
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order,
-			unsigned long balance_gap, int classzone_idx)
+static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 {
-	unsigned long mark = high_wmark_pages(zone) + balance_gap;
+	unsigned long mark = high_wmark_pages(zone);
 
 	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
@@ -3045,7 +3042,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
 		if (!populated_zone(zone))
 			continue;
 
-		if (zone_balanced(zone, order, 0, classzone_idx))
+		if (zone_balanced(zone, order, classzone_idx))
 			return true;
 	}
 
@@ -3148,7 +3145,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 				break;
 			}
 
-			if (!zone_balanced(zone, order, 0, 0)) {
+			if (!zone_balanced(zone, order, 0)) {
 				classzone_idx = i;
 				break;
 			} else {
@@ -3216,7 +3213,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 			if (!populated_zone(zone))
 				continue;
 
-			if (zone_balanced(zone, sc.order, 0, classzone_idx)) {
+			if (zone_balanced(zone, sc.order, classzone_idx)) {
 				clear_bit(PGDAT_CONGESTED, &pgdat->flags);
 				clear_bit(PGDAT_DIRTY, &pgdat->flags);
 				goto out;
@@ -3427,7 +3424,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0, 0))
+	if (zone_balanced(zone, order, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
2.6.4
[PATCH 09/31] mm, vmscan: by default have direct reclaim only shrink once per node
Direct reclaim iterates over all zones in the zonelist, shrinking each in turn, but this conflicts with node-based reclaim. In the default case, only shrink once per node. Link: http://lkml.kernel.org/r/1466518566-30034-10-git-send-email-mgor...@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Rik van Riel Signed-off-by: Andrew Morton --- mm/vmscan.c | 19 +++ 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index b524d3b72527..34656173a670 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2552,14 +2552,6 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_ * try to reclaim pages from zones which will satisfy the caller's allocation * request. * - * We reclaim from a zone even if that zone is over high_wmark_pages(zone). - * Because: - * a) The caller may be trying to free *extra* pages to satisfy a higher-order - *allocation or - * b) The target zone may be at high_wmark_pages(zone) but the lower zones - *must go *over* high_wmark_pages(zone) to satisfy the `incremental min' - *zone defense algorithm. - * * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ @@ -2571,6 +2563,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) unsigned long nr_soft_scanned; gfp_t orig_mask; enum zone_type classzone_idx; + pg_data_t *last_pgdat = NULL; /* * If the number of buffer_heads in the machine exceeds the maximum @@ -2600,6 +2593,16 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc) classzone_idx--; /* +* Shrink each node in the zonelist once. 
If the zonelist is +* ordered by zone (not the default) then a node may be +* shrunk multiple times but in that case the user prefers +* lower zones being preserved +*/ + if (zone->zone_pgdat == last_pgdat) + continue; + last_pgdat = zone->zone_pgdat; + + /* * Take care memory controller reclaiming has small influence * to global LRU. */ -- 2.6.4
[PATCH 12/31] mm, vmscan: make shrink_node decisions more node-centric
Earlier patches focused on having direct reclaim and kswapd use data that is node-centric for reclaiming but shrink_node() itself still uses too much zone information. This patch removes unnecessary zone-based information with the most important decision being whether to continue reclaim or not. Some memcg APIs are adjusted as a result even though memcg itself still uses some zone information. Signed-off-by: Mel Gorman Acked-by: Michal Hocko Acked-by: Vlastimil Babka --- include/linux/memcontrol.h | 19 include/linux/mmzone.h | 4 ++-- include/linux/swap.h | 2 +- mm/memcontrol.c| 4 ++-- mm/page_alloc.c| 2 +- mm/vmscan.c| 57 ++ mm/workingset.c| 6 ++--- 7 files changed, 51 insertions(+), 43 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1927dcb6921e..48b43c709ed7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -325,22 +325,23 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone) } /** - * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg + * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone + * @node: node of the wanted lruvec * @zone: zone of the wanted lruvec * @memcg: memcg of the wanted lruvec * - * Returns the lru list vector holding pages for the given @zone and - * @mem. This can be the global zone lruvec, if the memory controller + * Returns the lru list vector holding pages for a given @node or a given + * @memcg and @zone. This can be the node lruvec, if the memory controller * is disabled. 
*/ -static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, - struct mem_cgroup *memcg) +static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat, + struct zone *zone, struct mem_cgroup *memcg) { struct mem_cgroup_per_zone *mz; struct lruvec *lruvec; if (mem_cgroup_disabled()) { - lruvec = zone_lruvec(zone); + lruvec = node_lruvec(pgdat); goto out; } @@ -610,10 +611,10 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new) { } -static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone, - struct mem_cgroup *memcg) +static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat, + struct zone *zone, struct mem_cgroup *memcg) { - return zone_lruvec(zone); + return node_lruvec(pgdat); } static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page, diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index eb74e63df5cf..f88cbbb476c8 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -739,9 +739,9 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone) return >zone_pgdat->lru_lock; } -static inline struct lruvec *zone_lruvec(struct zone *zone) +static inline struct lruvec *node_lruvec(struct pglist_data *pgdat) { - return >zone_pgdat->lruvec; + return >lruvec; } static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat) diff --git a/include/linux/swap.h b/include/linux/swap.h index 916e2eddecd6..0ad616d7c381 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -316,7 +316,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, unsigned long nr_pages, gfp_t gfp_mask, bool may_swap); -extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem, +extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem, gfp_t gfp_mask, bool noswap, struct zone *zone, unsigned long *nr_scanned); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 50c86ad121bc..c9ebec98e92a 100644 --- a/mm/memcontrol.c +++ 
b/mm/memcontrol.c @@ -1432,8 +1432,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg, } continue; } - total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false, -zone, _scanned); + total += mem_cgroup_shrink_node(victim, gfp_mask, false, + zone, _scanned); *total_scanned += nr_scanned; if (!soft_limit_excess(root_memcg)) break; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f58548139bf2..b76ea2527c09 100644 --- a/mm/page_alloc.c +++
[PATCH 07/31] mm, vmscan: remove balance gap
The balance gap was introduced to apply equal pressure to all zones when reclaiming for a higher zone. With node-based LRU, the need for the balance gap is removed and the code is dead so remove it. [vba...@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO] Link: http://lkml.kernel.org/r/1466518566-30034-8-git-send-email-mgor...@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Johannes Weiner Cc: Rik van Riel Signed-off-by: Andrew Morton --- include/linux/swap.h | 9 - mm/vmscan.c | 19 --- 2 files changed, 8 insertions(+), 20 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index c82f916008b7..916e2eddecd6 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -157,15 +157,6 @@ enum { #define SWAP_CLUSTER_MAX 32UL #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX -/* - * Ratio between zone->managed_pages and the "gap" that above the per-zone - * "high_wmark". While balancing nodes, We allow kswapd to shrink zones that - * do not meet the (high_wmark + gap) watermark, even which already met the - * high_wmark, in order to provide better per-zone lru behavior. We are ok to - * spend not more than 1% of the memory for this zone balancing "gap". 
- */ -#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100 - #define SWAP_MAP_MAX 0x3e/* Max duplication count, in first swap_map */ #define SWAP_MAP_BAD 0x3f/* Note pageblock is bad, in first swap_map */ #define SWAP_HAS_CACHE 0x40/* Flag page is cached, in first swap_map */ diff --git a/mm/vmscan.c b/mm/vmscan.c index 7b382b90b145..a52167eabc96 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2518,7 +2518,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc, */ static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx) { - unsigned long balance_gap, watermark; + unsigned long watermark; bool watermark_ok; /* @@ -2527,9 +2527,7 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_ * there is a buffer of free pages available to give compaction * a reasonable chance of completing and allocating the page */ - balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP( - zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO)); - watermark = high_wmark_pages(zone) + balance_gap + (2UL << order); + watermark = high_wmark_pages(zone) + (2UL << order); watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx); /* @@ -3000,10 +2998,9 @@ static void age_active_anon(struct pglist_data *pgdat, } while (memcg); } -static bool zone_balanced(struct zone *zone, int order, - unsigned long balance_gap, int classzone_idx) +static bool zone_balanced(struct zone *zone, int order, int classzone_idx) { - unsigned long mark = high_wmark_pages(zone) + balance_gap; + unsigned long mark = high_wmark_pages(zone); return zone_watermark_ok_safe(zone, order, mark, classzone_idx); } @@ -3045,7 +3042,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, if (!populated_zone(zone)) continue; - if (zone_balanced(zone, order, 0, classzone_idx)) + if (zone_balanced(zone, order, classzone_idx)) return true; } @@ -3148,7 +3145,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) 
break; } - if (!zone_balanced(zone, order, 0, 0)) { + if (!zone_balanced(zone, order, 0)) { classzone_idx = i; break; } else { @@ -3216,7 +3213,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) if (!populated_zone(zone)) continue; - if (zone_balanced(zone, sc.order, 0, classzone_idx)) { + if (zone_balanced(zone, sc.order, classzone_idx)) { clear_bit(PGDAT_CONGESTED, >flags); clear_bit(PGDAT_DIRTY, >flags); goto out; @@ -3427,7 +3424,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx) } if (!waitqueue_active(>kswapd_wait)) return; - if (zone_balanced(zone, order, 0, 0)) + if (zone_balanced(zone, order, 0)) return; trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); -- 2.6.4
[PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps
kswapd goes through some complex steps trying to figure out if it should stay awake based on the classzone_idx and the requested order. It is unnecessarily complex and passes in an invalid classzone_idx to balance_pgdat(). What matters most of all is whether a larger order has been requested and whether kswapd successfully reclaimed at the previous order. This patch irons out the logic to check just that and the end result is less headache inducing. Link: http://lkml.kernel.org/r/1466518566-30034-9-git-send-email-mgor...@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Rik van Riel Signed-off-by: Andrew Morton --- include/linux/mmzone.h | 5 ++- mm/memory_hotplug.c| 5 ++- mm/page_alloc.c| 2 +- mm/vmscan.c| 102 ++--- 4 files changed, 62 insertions(+), 52 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 258c20758e80..eb74e63df5cf 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -667,8 +667,9 @@ typedef struct pglist_data { wait_queue_head_t pfmemalloc_wait; struct task_struct *kswapd; /* Protected by mem_hotplug_begin/end() */ - int kswapd_max_order; - enum zone_type classzone_idx; + int kswapd_order; + enum zone_type kswapd_classzone_idx; + #ifdef CONFIG_COMPACTION int kcompactd_max_order; enum zone_type kcompactd_classzone_idx; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index c5278360ca66..065140ecd081 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1209,9 +1209,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start) arch_refresh_nodedata(nid, pgdat); } else { - /* Reset the nr_zones and classzone_idx to 0 before reuse */ + /* Reset the nr_zones, order and classzone_idx before reuse */ pgdat->nr_zones = 0; - pgdat->classzone_idx = 0; + pgdat->kswapd_order = 0; + pgdat->kswapd_classzone_idx = 0; } /* we can use NODE_DATA(nid) from here */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 59e4463e5dce..f58548139bf2 100644 --- 
a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6084,7 +6084,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size, unsigned long end_pfn = 0; /* pg_data_t should be reset to zero when it's allocated */ - WARN_ON(pgdat->nr_zones || pgdat->classzone_idx); + WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx); reset_deferred_meminit(pgdat); pgdat->node_id = nid; diff --git a/mm/vmscan.c b/mm/vmscan.c index a52167eabc96..b524d3b72527 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2762,7 +2762,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) /* kswapd must be awake if processes are being throttled */ if (!wmark_ok && waitqueue_active(>kswapd_wait)) { - pgdat->classzone_idx = min(pgdat->classzone_idx, + pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx, (enum zone_type)ZONE_NORMAL); wake_up_interruptible(>kswapd_wait); } @@ -3238,8 +3238,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) return sc.order; } -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, - int classzone_idx, int balanced_classzone_idx) +static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int reclaim_order, + int classzone_idx) { long remaining = 0; DEFINE_WAIT(wait); @@ -3249,9 +3249,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, prepare_to_wait(>kswapd_wait, , TASK_INTERRUPTIBLE); + /* +* If kswapd has not been woken recently, then kswapd goes fully +* to sleep. kcompactd may still need to wake if the original +* request was high-order. 
+*/ + if (classzone_idx == -1) { + wakeup_kcompactd(pgdat, alloc_order, classzone_idx); + classzone_idx = MAX_NR_ZONES - 1; + goto full_sleep; + } + /* Try to sleep for a short interval */ - if (prepare_kswapd_sleep(pgdat, order, remaining, - balanced_classzone_idx)) { + if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, classzone_idx)) { /* * Compaction records what page blocks it recently failed to * isolate pages from and skips them in the future scanning. @@ -3264,19 +3274,19 @@ static void kswapd_try_to_sleep(pg_data_t
[PATCH 05/31] mm, vmscan: have kswapd only scan based on the highest requested zone
kswapd checks all eligible zones to see if they need balancing even if it was woken for a lower zone. This made sense when we reclaimed on a per-zone basis because we wanted to shrink zones fairly to avoid age-inversion problems. With per-node reclaim, this is largely unnecessary. In theory, there may still be anomalies when all requests are for lower zones and very old pages are preserved in higher zones but this should be the exceptional case. Link: http://lkml.kernel.org/r/1466518566-30034-6-git-send-email-mgor...@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Johannes Weiner Cc: Rik van Riel Signed-off-by: Andrew Morton --- mm/vmscan.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 766b36bec829..c6e61dae382b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3209,11 +3209,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) sc.nr_reclaimed = 0; - /* -* Scan in the highmem->dma direction for the highest -* zone which needs scanning -*/ - for (i = pgdat->nr_zones - 1; i >= 0; i--) { + /* Scan from the highest requested zone to dma */ + for (i = classzone_idx; i >= 0; i--) { struct zone *zone = pgdat->node_zones + i; if (!populated_zone(zone)) -- 2.6.4
[PATCH 02/31] mm, vmscan: move lru_lock to the node
Node-based reclaim requires node-based LRUs and locking. This is a preparation patch that just moves the lru_lock to the node so later patches are easier to review. It is a mechanical change but note this patch makes contention worse because the LRU lock is hotter and direct reclaim and kswapd can contend on the same lock even when reclaiming from different zones. Link: http://lkml.kernel.org/r/1466518566-30034-3-git-send-email-mgor...@techsingularity.net Signed-off-by: Mel Gorman Acked-by: Johannes Weiner Acked-by: Vlastimil Babka Cc: Rik van Riel Signed-off-by: Andrew Morton --- Documentation/cgroup-v1/memcg_test.txt | 4 +-- Documentation/cgroup-v1/memory.txt | 4 +-- include/linux/mm_types.h | 2 +- include/linux/mmzone.h | 10 +-- mm/compaction.c| 10 +++ mm/filemap.c | 4 +-- mm/huge_memory.c | 6 ++--- mm/memcontrol.c| 6 ++--- mm/mlock.c | 10 +++ mm/page_alloc.c| 4 +-- mm/page_idle.c | 4 +-- mm/rmap.c | 2 +- mm/swap.c | 30 ++--- mm/vmscan.c| 48 +- 14 files changed, 75 insertions(+), 69 deletions(-) diff --git a/Documentation/cgroup-v1/memcg_test.txt b/Documentation/cgroup-v1/memcg_test.txt index 8870b0212150..78a8c2963b38 100644 --- a/Documentation/cgroup-v1/memcg_test.txt +++ b/Documentation/cgroup-v1/memcg_test.txt @@ -107,9 +107,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. 8. LRU Each memcg has its own private LRU. Now, its handling is under global - VM's control (means that it's handled under global zone->lru_lock). + VM's control (means that it's handled under global zone_lru_lock). Almost all routines around memcg's LRU is called by global LRU's - list management functions under zone->lru_lock(). + list management functions under zone_lru_lock(). A special function is mem_cgroup_isolate_pages(). 
This scans memcg's private LRU and call __isolate_lru_page() to extract a page diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt index b14abf217239..946e69103cdd 100644 --- a/Documentation/cgroup-v1/memory.txt +++ b/Documentation/cgroup-v1/memory.txt @@ -267,11 +267,11 @@ When oom event notifier is registered, event will be delivered. Other lock order is following: PG_locked. mm->page_table_lock - zone->lru_lock + zone_lru_lock lock_page_cgroup. In many cases, just lock_page_cgroup() is called. per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by - zone->lru_lock, it has no lock of its own. + zone_lru_lock, it has no lock of its own. 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index e093e1d3285b..ca2ed9a6c8d8 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -118,7 +118,7 @@ struct page { */ union { struct list_head lru; /* Pageout list, eg. active_list -* protected by zone->lru_lock ! +* protected by zone_lru_lock ! * Can be used as a generic list * by the page owner. */ diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 078ecb81e209..2d5087e3c034 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -93,7 +93,7 @@ struct free_area { struct pglist_data; /* - * zone->lock and zone->lru_lock are two of the hottest locks in the kernel. + * zone->lock and the zone lru_lock are two of the hottest locks in the kernel. * So add a wild amount of padding here to ensure that they fall into separate * cachelines. There are very few zone structures in the machine, so space * consumption is not a concern here. 
@@ -496,7 +496,6 @@ struct zone { /* Write-intensive fields used by page reclaim */ /* Fields commonly accessed by the page reclaim scanner */ - spinlock_t lru_lock; struct lruvec lruvec; /* @@ -690,6 +689,9 @@ typedef struct pglist_data { /* Number of pages migrated during the rate limiting time interval */ unsigned long numabalancing_migrate_nr_pages; #endif + /* Write-intensive fields used from the page allocator */ + ZONE_PADDING(_pad1_) + spinlock_t lru_lock; #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* @@ -721,6 +723,10 @@ typedef struct pglist_data {
[PATCH 10/31] mm, vmscan: remove duplicate logic clearing node congestion and dirty state
Reclaim may stall if there is too much dirty or congested data on a node. This was previously based on zone flags and the logic for clearing the flags was in two places. As congestion/dirty state is now tracked on a per-node basis, we can remove some duplicate logic. Link: http://lkml.kernel.org/r/1466518566-30034-11-git-send-email-mgor...@techsingularity.net Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Rik van Riel Signed-off-by: Andrew Morton --- mm/vmscan.c | 24 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 34656173a670..911142d25de2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3005,7 +3005,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx) { unsigned long mark = high_wmark_pages(zone); - return zone_watermark_ok_safe(zone, order, mark, classzone_idx); + if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx)) + return false; + + /* +* If any eligible zone is balanced then the node is not considered +* to be congested or dirty +*/ + clear_bit(PGDAT_CONGESTED, >zone_pgdat->flags); + clear_bit(PGDAT_DIRTY, >zone_pgdat->flags); + + return true; } /* @@ -3151,13 +3161,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) if (!zone_balanced(zone, order, 0)) { classzone_idx = i; break; - } else { - /* -* If any eligible zone is balanced then the -* node is not considered congested or dirty. -*/ - clear_bit(PGDAT_CONGESTED, >zone_pgdat->flags); - clear_bit(PGDAT_DIRTY, >zone_pgdat->flags); } } @@ -3216,11 +3219,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) if (!populated_zone(zone)) continue; - if (zone_balanced(zone, sc.order, classzone_idx)) { - clear_bit(PGDAT_CONGESTED, >flags); - clear_bit(PGDAT_DIRTY, >flags); + if (zone_balanced(zone, sc.order, classzone_idx)) goto out; - } } /* -- 2.6.4
[PATCH 10/31] mm, vmscan: remove duplicate logic clearing node congestion and dirty state
Reclaim may stall if there is too much dirty or congested data on a node. This was previously based on zone flags and the logic for clearing the flags was duplicated in two places. As congestion/dirty tracking is now done on a per-node basis, we can remove the duplicate logic. Link: http://lkml.kernel.org/r/1466518566-30034-11-git-send-email-mgor...@techsingularity.net Signed-off-by: Mel Gorman Cc: Johannes Weiner Cc: Vlastimil Babka Cc: Rik van Riel Signed-off-by: Andrew Morton --- mm/vmscan.c | 24 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 34656173a670..911142d25de2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3005,7 +3005,17 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx) { unsigned long mark = high_wmark_pages(zone); - return zone_watermark_ok_safe(zone, order, mark, classzone_idx); + if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx)) + return false; + + /* +* If any eligible zone is balanced then the node is not considered +* to be congested or dirty +*/ + clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags); + clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags); + + return true; } /* @@ -3151,13 +3161,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) if (!zone_balanced(zone, order, 0)) { classzone_idx = i; break; - } else { - /* -* If any eligible zone is balanced then the -* node is not considered congested or dirty. -*/ - clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags); - clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags); } } @@ -3216,11 +3219,8 @@ if (!populated_zone(zone)) continue; - if (zone_balanced(zone, sc.order, classzone_idx)) { - clear_bit(PGDAT_CONGESTED, &pgdat->flags); - clear_bit(PGDAT_DIRTY, &pgdat->flags); + if (zone_balanced(zone, sc.order, classzone_idx)) goto out; - } } /* -- 2.6.4
[PATCH 00/31] Move LRU page reclaim from zones to nodes v8
Previous releases double accounted LRU stats on the zone and the node because it was required by should_reclaim_retry. The last patch in the series removes the double accounting. It's not integrated with the series as reviewers may not like the solution. If not, it can be safely dropped without a major impact to the results. Changelog since v7 o Rebase onto current mmots o Avoid double accounting of stats in node and zone o Kswapd will avoid more reclaim if an eligible zone is available o Remove some duplications of sc->reclaim_idx and classzone_idx o Print per-node stats in zoneinfo Changelog since v6 o Correct reclaim_idx when direct reclaiming for memcg o Also account LRU pages per zone for compaction/reclaim o Add page_pgdat helper with more efficient lookup o Init pgdat LRU lock only once o Slight optimisation to wake_all_kswapds o Always wake kcompactd when kswapd is going to sleep o Rebase to mmotm as of June 15th, 2016 Changelog since v5 o Rebase and adjust to changes Changelog since v4 o Rebase on top of v3 of page allocator optimisation series Changelog since v3 o Rebase on top of the page allocator optimisation series o Remove RFC tag This is the latest version of a series that moves LRUs from the zones to the node that is based upon 4.7-rc4 with Andrew's tree applied. While this is a current rebase, the test results were based on mmotm as of June 23rd. Conceptually, this series is simple but there are a lot of details. Some of the broad motivations for this are; 1. The residency of a page partially depends on what zone the page was allocated from. This is partially combatted by the fair zone allocation policy but that is a partial solution that introduces overhead in the page allocator paths. 2. Currently, reclaim on node 0 behaves slightly different to node 1. For example, direct reclaim scans in zonelist order and reclaims even if the zone is over the high watermark regardless of the age of pages in that LRU. 
Kswapd on the other hand starts reclaim on the highest unbalanced zone. A difference in the distribution of file/anon pages due to when they were allocated can result in a difference in aging. While the fair zone allocation policy mitigates some of the problems here, the page reclaim results on a multi-zone node will always be different to a single-zone node; an application can see different reclaim behaviour depending on which node it was scheduled on as a result.

3. kswapd and the page allocator scan zones in the opposite order to avoid interfering with each other. In the ideal case this stops the page allocator using pages that were allocated very recently, but it's sensitive to timing. When kswapd is reclaiming from the lower zones it works well, but during the rebalancing of the highest zone, the page allocator and kswapd interfere with each other. It's worse if the highest zone is small and difficult to balance.

4. slab shrinkers are node-based, which makes it harder to identify the exact relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have large highmem zones in common configurations and it was necessary to quickly find ZONE_NORMAL pages for reclaim. Today, this is much less of a concern as machines with lots of memory will (or should) use 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are rare. Machines that do use highmem should have a relatively lower highmem:lowmem ratio than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The page allocator plays fewer tricks to game reclaim and reclaim behaves similarly on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48 core NUMA machine. The UMA results are presented in most cases as the NUMA machine behaved similarly.

pagealloc - This is a microbenchmark that shows the benefit of removing the fair zone allocation policy.
It was tested up to order-4 but only orders 0 and 1 are shown as the other orders were comparable.

                            4.7.0-rc4             4.7.0-rc4
                       mmotm-20160623            nodelru-v8
Min total-odr0-1     490.00 (  0.00%)     463.00 (  5.51%)
Min total-odr0-2     349.00 (  0.00%)     325.00 (  6.88%)
Min total-odr0-4     288.00 (  0.00%)     272.00 (  5.56%)
Min total-odr0-8     250.00 (  0.00%)     235.00 (  6.00%)
Min total-odr0-16    234.00 (  0.00%)     222.00 (  5.13%)
Min total-odr0-32    223.00 (  0.00%)     205.00 (  8.07%)
Min total-odr0-64    217.00 (  0.00%)     202.00 (  6.91%)
Min total-odr0-128   214.00 (  0.00%)     207.00 (  3.27%)
Min total-odr0-256   242.00 (  0.00%)     242.00 (  0.00%)
Min total-odr0-512   272.00 (  0.00%)
[git pull] IOMMU Fixes for Linux v4.7-rc5
Hi Linus, The following changes since commit a4c34ff1c029e90e7d5f8dd8d29b0a93b31c3cb2: iommu/vt-d: Enable QI on all IOMMUs before setting root entry (2016-06-17 11:29:48 +0200) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git tags/iommu-fixes-v4.7-rc5 for you to fetch changes up to 6082ee72e9d89e80a664418be06f47d728243e85: iommu/amd: Initialize devid variable before using it (2016-06-27 13:24:46 +0200) IOMMU Fixes for Linux v4.7-rc5 Three fixes: * Fix use of smp_processor_id() in preemptible code in the IOVA allocation code. This got introduced with the scalability improvements in this release cycle. * A VT-d fix for out-of-bounds access of the iommu->domains array. The bug showed during suspend/resume. * AMD IOMMU fix to print the correct device id in the ACPI parsing code. Chris Wilson (1): iommu/iova: Disable preemption around use of this_cpu_ptr() Jan Niehusmann (1): iommu/vt-d: Fix overflow of iommu->domains array Nicolas Iooss (1): iommu/amd: Initialize devid variable before using it drivers/iommu/amd_iommu_init.c | 2 +- drivers/iommu/intel-iommu.c| 2 +- drivers/iommu/iova.c | 8 ++-- 3 files changed, 8 insertions(+), 4 deletions(-) Please pull. Thanks, Joerg signature.asc Description: Digital signature
Re: [PATCH v1 03/11] KVM: x86: dynamic kvm_apic_map
2016-07-01 17:06+0200, Paolo Bonzini: > On 01/07/2016 16:38, Radim Krčmář wrote: > On the > other hand, I suspect you need to bump KVM_MAX_VCPU_ID beyond its > current default setting (which is equal to KVM_MAX_VCPUS), up to 511 or > 1023. Yes, thanks for pointing it out. APIC ID 1023 is reasonably low and I'm pretty sure that it cannot be surpassed with any topology of 288 VCPUs. (The limit should be 543, with 257 cores per package.)
Re: [PATCH] [media] cec: add missing inline stubs
On Friday, July 1, 2016 5:22:32 PM CEST Hans Verkuil wrote: > > diff --git a/drivers/media/platform/vivid/Kconfig > > b/drivers/media/platform/vivid/Kconfig > > index 8e6918c5c87c..8e31146d079a 100644 > > --- a/drivers/media/platform/vivid/Kconfig > > +++ b/drivers/media/platform/vivid/Kconfig > > @@ -26,6 +26,7 @@ config VIDEO_VIVID > > config VIDEO_VIVID_CEC > > bool "Enable CEC emulation support" > > depends on VIDEO_VIVID && MEDIA_CEC > > + depends on VIDEO_VIVID=m || MEDIA_CEC=y > > ---help--- > > When selected the vivid module will emulate the optional > > HDMI CEC feature. > > > > which is still not overly nice, but it manages to avoid the > > IS_REACHABLE() check and it lets MEDIA_CEC be a module. > > The only IS_REACHABLE is for the RC_CORE check, and that should remain. I believe that is already taken care of by my earlier "[media] cec: add RC_CORE dependency" patch, https://patchwork.linuxtv.org/patch/34892/ which seems to handle the dependency more gracefully (preventing nonsense configurations rather than just not using RC_CORE). > With my patch MEDIA_CEC can remain a module provided MEDIA_SUPPORT is also > a module. All drivers depending on MEDIA_CEC also depend on MEDIA_SUPPORT, > so that works. To clarify, the problem with the option above is that VIDEO_VIVID_CEC is a 'bool' option, and Kconfig lets that be turned on if both VIDEO_VIVID and MEDIA_CEC are enabled, including the case where MEDIA_CEC is a module and VIDEO_VIVID is not. Your patch avoids that problem by making MEDIA_CEC a 'bool', my patch above is an alternative by ensuring that VIDEO_VIVID_CEC cannot be enabled if MEDIA_CEC is a module and VIDEO_VIVID is not. Arnd
Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs
Linus Torvalds writes: > On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen wrote: >> >> I think what you suggest will work if we don't consider A/D in >> pte_none(). I think there are a bunch of code paths where we assume that >> !pte_present() && !pte_none() means swap. > > Yeah, we would need to change pte_none() to mask off D/A, but I think > that might be the only real change needed (other than making sure that > we don't use the bits in the swap entries, I didn't look at that part > at all) It looks like __pte_to_swp_entry also needs to be changed to mask out those bits when the swap code reads pte entries, for all of the same reasons as pte_none. Eric
Re: [PATCH v3 1/4] drm/rockchip: vop: export line flag function
On Fri, Jul 1, 2016 at 5:19 AM, Yakir Yang wrote: > VOP have integrated a hardware counter which indicate the exact display > line that vop is scanning. And if we're interested in a specific line, > we can set the line number to vop line_flag register, and then vop would > generate a line_flag interrupt for it. > > For example eDP PSR function is interested in the vertical blanking > period, then driver could set the line number to zero. > > This patch have exported a symbol that allow other driver to listen the > line flag event with given timeout limit: > - rockchip_drm_wait_line_flag() > > Signed-off-by: Yakir Yang > --- > Changes in v3: > - Export the 'rockchip_drm_wait_line_flag' symbol, and document it. > - Add 'line_flag_num_0' for RK3288/RK3036 > - Remove the notify for waiting line_flag event (Daniel) > > Changes in v2: > - Introduce in v2, split VOP line flag changes out > > drivers/gpu/drm/rockchip/rockchip_drm_drv.h | 3 + > drivers/gpu/drm/rockchip/rockchip_drm_vop.c | 103 > > drivers/gpu/drm/rockchip/rockchip_drm_vop.h | 3 + > drivers/gpu/drm/rockchip/rockchip_vop_reg.c | 4 ++ > 4 files changed, 113 insertions(+) > > diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_drv.h > b/drivers/gpu/drm/rockchip/rockchip_drm_drv.h > index ea39329..239b830 100644 > --- a/drivers/gpu/drm/rockchip/rockchip_drm_drv.h > +++ b/drivers/gpu/drm/rockchip/rockchip_drm_drv.h > @@ -70,4 +70,7 @@ int rockchip_drm_dma_attach_device(struct drm_device > *drm_dev, >struct device *dev); > void rockchip_drm_dma_detach_device(struct drm_device *drm_dev, > struct device *dev); > +int rockchip_drm_wait_line_flag(struct drm_crtc *crtc, unsigned int line_num, > + unsigned int mstimeout); > + > #endif /* _ROCKCHIP_DRM_DRV_H_ */ > diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c > b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c > index c8a62a8..cd3cac5 100644 > --- a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c > +++ b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c > @@ -121,6 +121,8 @@ 
struct vop { > /* protected by dev->event_lock */ > struct drm_pending_vblank_event *event; > > + struct completion line_flag_completion; > + > const struct vop_data *data; > > uint32_t *regsbak; > @@ -431,6 +433,59 @@ static void vop_dsp_hold_valid_irq_disable(struct vop > *vop) > spin_unlock_irqrestore(&vop->irq_lock, flags); > } > > +/* > + * (1) each frame starts at the start of the Vsync pulse which is signaled by > + * the "FRAME_SYNC" interrupt. > + * (2) the active data region of each frame ends at dsp_vact_end > + * (3) we should program this same number (dsp_vact_end) into > dsp_line_frag_num, > + * to get "LINE_FLAG" interrupt at the end of the active on screen data. > + * > + * VOP_INTR_CTRL0.dsp_line_frag_num = VOP_DSP_VACT_ST_END.dsp_vact_end > + * Interrupts > + * LINE_FLAG ---+ > + * FRAME_SYNC + | > + *| | > + *v v > + *| Vsync | Vbp | Vactive | Vfp | > + *^ ^ ^ ^ > + *| | | | > + *| | | | > + * dsp_vs_end + | | | VOP_DSP_VTOTAL_VS_END > + * dsp_vact_start --+ | | VOP_DSP_VACT_ST_END > + * dsp_vact_end + | VOP_DSP_VACT_ST_END > + * dsp_total -+ VOP_DSP_VTOTAL_VS_END > + */ > + > +static void vop_line_flag_irq_enable(struct vop *vop, int line_num) > +{ > + unsigned long flags; > + > + if (WARN_ON(!vop->is_enabled)) > + return; > + > + spin_lock_irqsave(&vop->irq_lock, flags); > + > + VOP_CTRL_SET(vop, line_flag_num_0, line_num); > + VOP_INTR_SET_TYPE(vop, enable, LINE_FLAG_INTR, 1); > + vop_cfg_done(vop); > + > + spin_unlock_irqrestore(&vop->irq_lock, flags); > +} > + > +static void vop_line_flag_irq_disable(struct vop *vop) > +{ > + unsigned long flags; > + > + if (WARN_ON(!vop->is_enabled)) > + return; > + > + spin_lock_irqsave(&vop->irq_lock, flags); > + > + VOP_INTR_SET_TYPE(vop, enable, LINE_FLAG_INTR, 0); > + > + spin_unlock_irqrestore(&vop->irq_lock, flags); > +} > + > static void vop_enable(struct drm_crtc *crtc) > { > struct vop *vop = to_vop(crtc); > @@ -1157,6 +1212,13 @@ static irqreturn_t vop_isr(int irq, void *data) > ret = IRQ_HANDLED; > } > > + if
(active_irqs & LINE_FLAG_INTR) { > + if (!completion_done(&vop->line_flag_completion)) > + complete(&vop->line_flag_completion); I think there's potential to miss flags here if the timing is just right because the
RE: [RFC PATCH 0/2] net: ethernet: Add support for gmii2rgmii converter
Hi Andrew, > On Fri, Jul 01, 2016 at 11:50:10AM +0530, Kedareswara rao Appana wrote: > > This patch series does the following > > ---> Add support for gmii2rgmii converter. > > How generic is this gmii2rgmii IP block? Could it be used with any GMII and > RGMII device? This converter does GMII2RGMII conversion. It can be used with any MAC which has a shared MDIO bus with the external PHY and this converter. The converter IP is validated for MACB, but it should work with any MAC which has a shared MDIO bus (I mean a single MDIO bus with multiple PHYs)... This converter works like below:

MACB <==> GMII2RGMII <==> RGMII_PHY

with the MACB MDIO bus connected to both the GMII2RGMII converter and the RGMII PHY. Using the MACB MDIO bus we can access both the converter and the external PHY. We need to program the line speed of the converter during run time based on the external PHY's negotiated speed. The MDIO interface is used to set the operating speed (line speed). The converter has only one register (0x10) that we need to program to set the operating speed. The composition of this register is similar to the IEEE standard 802.3 MDIO control register 0x0. Please let me know if you are still not clear about how this converter works. > > Should it be placed in drivers/net/phy, so making it easier to reuse? Ok will move it to the drivers/net/phy folder in the next version... > > Also, Russell King's phylink might be a more natural structure for this. It is > hard to > know when those patches will land, but it might be worth looking at. Ok sure will take a look at this series once posted... Regards, Kedar. > >Andrew
Re: [PATCH v6 00/10] acpi, clocksource: add GTDT driver and GTDT support in arm_arch_timer
On Thu, Jun 30, 2016 at 09:48:02PM +0800, Hanjun Guo wrote: > On 2016/6/30 21:27, Rafael J. Wysocki wrote: > >On Thursday, June 30, 2016 10:10:02 AM Hanjun Guo wrote: > >>GTDT is part of ACPI spec, drivers/acpi/ is for driver code of > >>ACPI spec, I think it can stay in drivers/acpi/ from this point > >>of view, am I right? > > > >The question is not "Can it?", but "Does it need to?". > > > >It is in the spec, but still there's only one architecture needing it. > > > >There is no way to test it on any other architecture and no reason to build > >it > >for any other architecture, so why does it need to be located in > >drivers/acpi/ ? > > I'm fine to move it to other places such as arch/arm64/kernel/, but I > would like to ask ARM64 maintainer's suggestion for this. > > Will, Catalin, what's your opinion on this? We don't have any device-tree code for the architected timer under arch/arm64, so I don't see why we should need anything for ACPI either. Will
Re: [PATCH] [media] cec: add missing inline stubs
On 07/01/2016 05:13 PM, Arnd Bergmann wrote:
> On Friday, July 1, 2016 4:35:09 PM CEST Hans Verkuil wrote:
>> On 07/01/2016 01:19 PM, Arnd Bergmann wrote:
>>> The linux/cec.h header file contains empty inline definitions for
>>> a subset of the API for the case in which CEC is not enabled,
>>> however we have drivers that call other functions as well:
>>>
>>> drivers/media/i2c/adv7604.c: In function 'adv76xx_cec_tx_raw_status':
>>> drivers/media/i2c/adv7604.c:1956:3: error: implicit declaration of function 'cec_transmit_done' [-Werror=implicit-function-declaration]
>>> drivers/media/i2c/adv7604.c: In function 'adv76xx_cec_isr':
>>> drivers/media/i2c/adv7604.c:2012:4: error: implicit declaration of function 'cec_received_msg' [-Werror=implicit-function-declaration]
>>> drivers/media/i2c/adv7604.c: In function 'adv76xx_probe':
>>> drivers/media/i2c/adv7604.c:3482:20: error: implicit declaration of function 'cec_allocate_adapter' [-Werror=implicit-function-declaration]
>>
>> I don't understand this. These calls are under
>> #if IS_ENABLED(CONFIG_VIDEO_ADV7842_CEC),
>> and that should be 0 if MEDIA_CEC is not selected.
>>
>> Am I missing some weird config combination?
>
> This must have been a build error I ran into before your patch, when I
> had this one applied locally instead:
>
> diff --git a/include/media/cec.h b/include/media/cec.h
> index c462f9b44074..564a6a06bed7 100644
> --- a/include/media/cec.h
> +++ b/include/media/cec.h
> @@ -187,7 +187,7 @@ static inline bool cec_is_sink(const struct cec_adapter *adap)
>  	return adap->phys_addr == 0;
>  }
>  
> -#if IS_ENABLED(CONFIG_MEDIA_CEC)
> +#if IS_REACHABLE(CONFIG_MEDIA_CEC)
>  struct cec_adapter *cec_allocate_adapter(const struct cec_adap_ops *ops,
>  	void *priv, const char *name, u32 caps, u8 available_las,
>  	struct device *parent);
>
> because that was hiding the declarations when the code could not
> reach it. With your newer patch that is not possible any more.
>
> I also wasn't aware that each of these already had their own Kconfig
> symbols. Could we just do this instead of your patch then:
>
> diff --git a/drivers/media/i2c/Kconfig b/drivers/media/i2c/Kconfig
> index ce9006e10a30..73e047220905 100644
> --- a/drivers/media/i2c/Kconfig
> +++ b/drivers/media/i2c/Kconfig
> @@ -222,6 +222,7 @@ config VIDEO_ADV7604
>  config VIDEO_ADV7604_CEC
>  	bool "Enable Analog Devices ADV7604 CEC support"
>  	depends on VIDEO_ADV7604 && MEDIA_CEC
> +	depends on VIDEO_ADV7604=m || MEDIA_CEC=y
>  	---help---
>  	  When selected the adv7604 will support the optional
>  	  HDMI CEC feature.
> @@ -243,6 +244,7 @@ config VIDEO_ADV7842
>  config VIDEO_ADV7842_CEC
>  	bool "Enable Analog Devices ADV7842 CEC support"
>  	depends on VIDEO_ADV7842 && MEDIA_CEC
> +	depends on VIDEO_ADV7842=m || MEDIA_CEC=y
>  	---help---
>  	  When selected the adv7842 will support the optional
>  	  HDMI CEC feature.
> @@ -475,6 +477,7 @@ config VIDEO_ADV7511
>  config VIDEO_ADV7511_CEC
>  	bool "Enable Analog Devices ADV7511 CEC support"
>  	depends on VIDEO_ADV7511 && MEDIA_CEC
> +	depends on VIDEO_ADV7511=m || MEDIA_CEC=y
>  	---help---
>  	  When selected the adv7511 will support the optional
>  	  HDMI CEC feature.
> diff --git a/drivers/media/platform/vivid/Kconfig b/drivers/media/platform/vivid/Kconfig
> index 8e6918c5c87c..8e31146d079a 100644
> --- a/drivers/media/platform/vivid/Kconfig
> +++ b/drivers/media/platform/vivid/Kconfig
> @@ -26,6 +26,7 @@ config VIDEO_VIVID
>  config VIDEO_VIVID_CEC
>  	bool "Enable CEC emulation support"
>  	depends on VIDEO_VIVID && MEDIA_CEC
> +	depends on VIDEO_VIVID=m || MEDIA_CEC=y
>  	---help---
>  	  When selected the vivid module will emulate the optional
>  	  HDMI CEC feature.
>
> which is still not overly nice, but it manages to avoid the
> IS_REACHABLE() check and it lets MEDIA_CEC be a module.

The only IS_REACHABLE is for the RC_CORE check, and that should remain.

With my patch MEDIA_CEC can remain a module provided MEDIA_SUPPORT is also a module.
All drivers depending on MEDIA_CEC also depend on MEDIA_SUPPORT, so that works. As I mentioned in a previous post, this will change in the future but while it is in staging I prefer to keep it like this. Regards, Hans
RE: [RFC PATCH 1/2] net: ethernet: xilinx: Add gmii2rgmii converter support
Hi Florian,

Thanks for the review.

> > +#include
> > +#include
> > +#include
> > +#include
> > +#include
> > +#include
> > +#include
> > +#include
> > +#include
> > +
> > +static void xgmii2rgmii_fix_mac_speed(void *priv, unsigned int speed)
> > +{
> > +	struct gmii2rgmii *xphy = (struct xphy *)priv;
>
> Why not pass struct xphy pointer directly?

Ok, will fix in v2...

> > +	struct phy_device *gmii2rgmii_phydev = xphy->gmii2rgmii_phy_dev;
> > +	u16 gmii2rgmii_reg = 0;
> > +
> > +	switch (speed) {
> > +	case 1000:
> > +		gmii2rgmii_reg |= XILINX_GMII2RGMII_SPEED1000;
> > +		break;
> > +	case 100:
> > +		gmii2rgmii_reg |= XILINX_GMII2RGMII_SPEED100;
> > +		break;
> > +	default:
> > +		return;
> > +	}
> > +
> > +	xphy->mdio_write(xphy->mii_bus, gmii2rgmii_phydev->mdio.addr,
> > +			 XILINX_GMII2RGMII_REG_NUM,
> > +			 gmii2rgmii_reg);
> > +}
> > +
> > +int gmii2rgmii_phyprobe(struct gmii2rgmii *xphy) {
> > +	struct device_node *phy_node;
> > +	struct phy_device *phydev;
> > +	struct device_node *np = (struct device_node *)xphy->platform_data;
> > +
> > +	phy_node = of_parse_phandle(np, "gmii2rgmii-phy-handle", 0);
>
> Is that property documented in a binding document?

Will document it. Will fix in v2...

> > +	if (phy_node) {
>
> Should not there be an else clause which does not assign
> xphy->fix_mac_speed in case this property lookup fails?

Will fix in v2...

> > +		phydev = of_phy_attach(xphy->dev, phy_node, 0, 0);
> > +		if (!phydev) {
> > +			netdev_err(xphy->dev,
> > +				   "%s: no gmii to rgmii converter found\n",
> > +				   xphy->dev->name);
> > +			return -1;
> > +		}
> > +		xphy->gmii2rgmii_phy_dev = phydev;
> > +	}
> > +	xphy->fix_mac_speed = xgmii2rgmii_fix_mac_speed;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(gmii2rgmii_phyprobe);
> > +
> > +MODULE_DESCRIPTION("Xilinx GMII2RGMII converter driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/include/linux/xilinx_gmii2rgmii.h
> > b/include/linux/xilinx_gmii2rgmii.h
> > new file mode 100644
> > index 000..64e1659
> > --- /dev/null
> > +++ b/include/linux/xilinx_gmii2rgmii.h
> > @@ -0,0 +1,24 @@
> > +#ifndef _GMII2RGMII_H
> > +#define _GMII2RGMII_H
> > +
> > +#include
> > +#include
> > +#include
> > +
> > +#define XILINX_GMII2RGMII_FULLDPLX	BMCR_FULLDPLX
> > +#define XILINX_GMII2RGMII_SPEED1000	BMCR_SPEED1000
> > +#define XILINX_GMII2RGMII_SPEED100	BMCR_SPEED100
> > +#define XILINX_GMII2RGMII_REG_NUM	0x10
>
> So the register semantics are fairly standard but not the register
> location, have you considered writing a small PHY driver for this block?

I tried, but this PHY doesn't have any vendor/device IDs. The converter
doesn't fit the PHY framework, as we need to program its control register
with the speed negotiated by the external PHY, as explained in the other
mail thread...

Regards,
Kedar.
RE: [RFC PATCH 0/2] net: ethernet: Add support for gmii2rgmii converter
Hi Florian,

Thanks for the review...

> Le 30/06/2016 23:20, Kedareswara rao Appana a écrit :
> > This patch series does the following
> > ---> Add support for gmii2rgmii converter.
> > ---> Add support for gmii2rgmii converter in the macb driver.
> >
> > Earlier sent one RFC patch https://patchwork.kernel.org/patch/9186615/
> > which includes converter related stuff also in macb driver.
> > This patch series fixes this issue.
>
> This still seems very adhoc and not completely explained. Can you clarify
> how the gmii2rgmii converter gets used?

Sorry, I should have explained it clearly in the patch. Will fix it in v2.

The converter is wired up like this:

MACB <==GMII==> GMII2RGMII <==RGMII==> RGMII_PHY
MACB MDIO <=====> GMII2RGMII and RGMII_PHY

Using the MACB MDIO bus we can access both the converter and the external
PHY. We need to program the line speed of the converter at run time, based on
the speed negotiated by the external PHY; the MDIO interface is used to set
this operating (line) speed. The converter has only one register (0x10) that
we need to program to set the operating speed. The layout of this register is
similar to the IEEE standard 802.3 MDIO control register 0x0.

Please let me know if it is still not clear how this converter works. The IP
data sheet is available at
http://www.xilinx.com/support/documentation/ip_documentation/gmii_to_rgmii/v3_0/pg160-gmii-to-rgmii.pdf

> Is the expectation that a MACB Ethernet adapter will be connected to a
> RGMII PHY like this:
>
> MACB <=> GMII2RGMII <=> RGMII PHY
> MACB MDIO <===> RGMII_PHY
>
> and still have the ability to control both the PHY device's MDIO registers
> and the GMII2RGMII converter, and we need to make sure both have matching
> settings, or is it something like this:
>
> MACB <=> GMII2RGMII <=> RGMII PHY
> MACB MDIO unconnected
>
> and we lose the ability to query the PHY via MDIO?
> > As Nicolas pointed out, providing a binding documentation update may help > reviewers understand what is being accomplished here. Sure will fix in v2. Regards, Kedar.
Re: [PATCH v4] vmlinux.lds: account for destructor sections
On Fri, Jul 1, 2016 at 5:19 PM, Dmitry Vyukov wrote:
> If CONFIG_KASAN is enabled and gcc is configured with
> --disable-initfini-array and/or gold linker is used,
> gcc emits .ctors/.dtors and .text.startup/.text.exit
> sections instead of .init_array/.fini_array.
> .dtors section is not explicitly accounted in the linker
> script and messes vvar/percpu layout. Want:
>
> 822bfd80 D _edata
> 822c D __vvar_beginning_hack
> 822c A __vvar_page
> 822c0080 0098 D vsyscall_gtod_data
> 822c1000 A __init_begin
> 822c1000 D init_per_cpu__irq_stack_union
> 822c1000 A __per_cpu_load
> 822d3000 D init_per_cpu__gdt_page
>
> Got:
>
> 8279a600 D _edata
> 8279b000 A __vvar_page
> 8279c000 A __init_begin
> 8279c000 D init_per_cpu__irq_stack_union
> 8279c000 A __per_cpu_load
> 8279e000 D __vvar_beginning_hack
> 8279e080 0098 D vsyscall_gtod_data
> 827ae000 D init_per_cpu__gdt_page
>
> This happens because __vvar_page and .vvar get different
> addresses in arch/x86/kernel/vmlinux.lds.S:
>
> 	. = ALIGN(PAGE_SIZE);
> 	__vvar_page = .;
>
> 	.vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
> 		/* work around gold bug 13023 */
> 		__vvar_beginning_hack = .;
>
> Discard .dtors/.fini_array/.text.exit, since we don't call dtors.
> Merge .text.startup into init text.
>
> Cc: # v4.0+
> Reviewed-by: Andrey Ryabinin
> Signed-off-by: Dmitry Vyukov
>
> ---
>
> Changes since v1:
> - discard .dtors
> - don't define .mem sections
>
> Changes since v2:
> - use 'vmlinux.lds' subsystem prefix instead of 'kasan'
>
> Changes since v3:
> - add 'Cc: # v4.0+' tag
> - CC linux-kernel@vger.kernel.org, a...@linux-foundation.org

Andrew, can you please take this to the mm tree?
strange Mac OSX RST behavior
I'm wondering if anybody else has run into this...

On Mac OSX 10.11.5 (latest version), we have found that when tcp connections
are abruptly terminated (via ^C), a FIN is sent followed by an RST packet.
The RST is sent with the same sequence number as the FIN, and thus dropped
since the stack only accepts RST packets matching rcv_nxt (RFC 5961). This
could also be resolved if Mac OSX replied with an RST on the closed socket,
but it appears that it does not.

The workaround here is then to reset the connection if the RST sequence
number is equal to rcv_nxt - 1 and we have already received a FIN. The RST
attack surface is limited b/c we only accept the RST after we've accepted a
FIN and have not previously sent a FIN and received back the corresponding
ACK. In other words, the RST is only accepted in the tcp states:
TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING.

I'm interested if anybody else has run into this issue. It's problematic
since it takes up server resources for sockets sitting in TCP_CLOSE_WAIT. We
are also in the process of contacting Apple to see what can be done
here...workaround patch is below.
Here is the sequence from wireshark, mac osx is client sending the fin:

84581 14.752908 -> TCP 66 49896 > http [FIN, ACK] Seq=673257230 Ack=924722210 Win=131072 Len=0 TSval=622455547 TSecr=346246436
84984 14.788056 -> TCP 60 49896 > http [RST] Seq=673257230 Win=0 Len=0
84985 14.788061 -> TCP 66 http > 49896 [ACK] Seq=924739994 Ack=673257231 Win=28960 Len=0 TSval=346246723 TSecr=622455547

followed by a bunch of retransmits from server:

85138 14.994217 -> TCP 1054 [TCP segment of a reassembled PDU]
85237 15.348217 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85337 16.056224 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85436 17.472225 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85540 20.304222 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85644 25.968218 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85745 37.280230 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]
85845 59.904235 -> TCP 1054 [TCP Retransmission] [TCP segment of a reassembled PDU]

Thanks,
-Jason

---
 net/ipv4/tcp_input.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff97523..b3c55b91140c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5155,6 +5155,25 @@ static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
 	return err;
 }
 
+/*
+ * Mac OSX 10.11.5 can send a FIN followed by a RST where the RST
+ * has the same sequence number as the FIN. This is not compliant
+ * with RFC 5961, but ends up in a number of sockets tied up mostly
+ * in TCP_CLOSE_WAIT. The rst attack surface is limited b/c we only
+ * accept the RST after we've accepted a FIN and have not previously
+ * sent a FIN and received back the corresponding ACK.
+ */
+static bool tcp_fin_rst_check(struct sock *sk, struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	return unlikely((TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1)) &&
+			(TCP_SKB_CB(skb)->end_seq == (tp->rcv_nxt - 1)) &&
+			(sk->sk_state == TCP_CLOSE_WAIT ||
+			 sk->sk_state == TCP_LAST_ACK ||
+			 sk->sk_state == TCP_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5193,7 +5212,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 						  LINUX_MIB_TCPACKSKIPPEDSEQ,
 						  &tp->last_oow_ack_time))
 				tcp_send_dupack(sk, skb);
-		}
+		} else if (tcp_fin_rst_check(sk, skb))
+			tcp_reset(sk);
 		goto discard;
 	}
 
@@ -5206,7 +5226,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 	 * else
 	 *     Send a challenge ACK
 	 */
-	if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+	if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+	    tcp_fin_rst_check(sk, skb)) {
 		rst_seq_match = true;
 	} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
 		struct tcp_sack_block *sp = &tp->selective_acks[0];
-- 
2.6.1
Re: [PATCH v3] vmlinux.lds: account for destructor sections
On Fri, Jul 1, 2016 at 4:58 PM, Andrey Ryabininwrote: > 2016-06-29 20:28 GMT+03:00 Dmitry Vyukov : >> On Fri, Jun 24, 2016 at 6:57 PM, Andrey Ryabinin >> wrote: >>> 2016-06-24 18:39 GMT+03:00 Dmitry Vyukov : If CONFIG_KASAN is enabled and gcc is configured with --disable-initfini-array and/or gold linker is used, gcc emits .ctors/.dtors and .text.startup/.text.exit sections instead of .init_array/.fini_array. .dtors section is not explicitly accounted in the linker script and messes vvar/percpu layout. Want: 822bfd80 D _edata 822c D __vvar_beginning_hack 822c A __vvar_page 822c0080 0098 D vsyscall_gtod_data 822c1000 A __init_begin 822c1000 D init_per_cpu__irq_stack_union 822c1000 A __per_cpu_load 822d3000 D init_per_cpu__gdt_page Got: 8279a600 D _edata 8279b000 A __vvar_page 8279c000 A __init_begin 8279c000 D init_per_cpu__irq_stack_union 8279c000 A __per_cpu_load 8279e000 D __vvar_beginning_hack 8279e080 0098 D vsyscall_gtod_data 827ae000 D init_per_cpu__gdt_page This happens because __vvar_page and .vvar get different addresses in arch/x86/kernel/vmlinux.lds.S: . = ALIGN(PAGE_SIZE); __vvar_page = .; .vvar : AT(ADDR(.vvar) - LOAD_OFFSET) { /* work around gold bug 13023 */ __vvar_beginning_hack = .; Discard .dtors/.fini_array/.text.exit, since we don't call dtors. Merge .text.startup into init text. Signed-off-by: Dmitry Vyukov >>> >>> Reviewed-by: Andrey Ryabinin >> >> >> Who can take it to some tree? > > akpm tends to be the maintainer of last resort. > But you probably need to resend the patch, because LKML was not in CC list. > Also, please add stable tag: Cc: # v4.0+ Remailed. Thanks.
[PATCH v4] vmlinux.lds: account for destructor sections
If CONFIG_KASAN is enabled and gcc is configured with --disable-initfini-array and/or gold linker is used, gcc emits .ctors/.dtors and .text.startup/.text.exit sections instead of .init_array/.fini_array. .dtors section is not explicitly accounted in the linker script and messes vvar/percpu layout. Want: 822bfd80 D _edata 822c D __vvar_beginning_hack 822c A __vvar_page 822c0080 0098 D vsyscall_gtod_data 822c1000 A __init_begin 822c1000 D init_per_cpu__irq_stack_union 822c1000 A __per_cpu_load 822d3000 D init_per_cpu__gdt_page Got: 8279a600 D _edata 8279b000 A __vvar_page 8279c000 A __init_begin 8279c000 D init_per_cpu__irq_stack_union 8279c000 A __per_cpu_load 8279e000 D __vvar_beginning_hack 8279e080 0098 D vsyscall_gtod_data 827ae000 D init_per_cpu__gdt_page This happens because __vvar_page and .vvar get different addresses in arch/x86/kernel/vmlinux.lds.S: . = ALIGN(PAGE_SIZE); __vvar_page = .; .vvar : AT(ADDR(.vvar) - LOAD_OFFSET) { /* work around gold bug 13023 */ __vvar_beginning_hack = .; Discard .dtors/.fini_array/.text.exit, since we don't call dtors. Merge .text.startup into init text. 
Cc: # v4.0+
Reviewed-by: Andrey Ryabinin
Signed-off-by: Dmitry Vyukov
---
Changes since v1:
- discard .dtors
- don't define .mem sections

Changes since v2:
- use 'vmlinux.lds' subsystem prefix instead of 'kasan'

Changes since v3:
- add 'Cc: # v4.0+' tag
- CC linux-kernel@vger.kernel.org, a...@linux-foundation.org
---
 include/asm-generic/vmlinux.lds.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 6a67ab9..081d0f2 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -542,15 +542,19 @@
 
 #define INIT_TEXT					\
 	*(.init.text)					\
+	*(.text.startup)				\
 	MEM_DISCARD(init.text)
 
 #define EXIT_DATA					\
 	*(.exit.data)					\
+	*(.fini_array)					\
+	*(.dtors)					\
 	MEM_DISCARD(exit.data)				\
 	MEM_DISCARD(exit.rodata)
 
 #define EXIT_TEXT					\
 	*(.exit.text)					\
+	*(.text.exit)					\
 	MEM_DISCARD(exit.text)
 
 #define EXIT_CALL					\
-- 
2.8.0.rc3.226.g39d4020
Re: [PATCH] iio: dac: mcp4725: Remove unneeded conversions to bool
On 07/01/2016 10:03 AM, Peter Meerwald-Stadler wrote: > >> Found with scripts/coccinelle/misc/boolconv.cocci. > > I'd argue that > > pd = (inbuf[0] >> 1) & 0x3; > if (pd) { > data->powerdown = true; > data->powerdown_mode = pd - 1; > } else { > data->powerdown = false; > data->powerdown_mode = 2; /* largest register to gnd */ > } > > is less compact but probably easier to read/understand; > I agree, this is a much better fix. Would you like to submit this version as a patch? Andrew >> Signed-off-by: Andrew F. Davis> > but I'm fine with the proposed patch as well to make cocci happy > > regards, p. > >> --- >> drivers/iio/dac/mcp4725.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/iio/dac/mcp4725.c b/drivers/iio/dac/mcp4725.c >> index cca935c..c3bb3fd 100644 >> --- a/drivers/iio/dac/mcp4725.c >> +++ b/drivers/iio/dac/mcp4725.c >> @@ -361,7 +361,7 @@ static int mcp4725_probe(struct i2c_client *client, >> return err; >> } >> pd = (inbuf[0] >> 1) & 0x3; >> -data->powerdown = pd > 0 ? true : false; >> +data->powerdown = pd > 0; >> data->powerdown_mode = pd ? pd - 1 : 2; /* largest register to gnd */ >> data->dac_value = (inbuf[1] << 4) | (inbuf[2] >> 4); >> >> >
Re: [PATCH 6/7] dt-bindings: net: bgmac: add bindings documentation for bgmac
On Fri, Jul 1, 2016 at 5:46 AM, Arnd Bergmann wrote: > On Thursday, June 30, 2016 6:59:13 PM CEST Jon Mason wrote: >> + >> +Required properties: >> + - compatible: "brcm,bgmac-nsp" >> + - reg: Address and length of the GMAC registers, >> + Address and length of the GMAC IDM registers >> + - reg-names: Names of the registers. Must have both "gmac_base" and >> + "idm_base" >> + - interrupts: Interrupt number >> + > > > "brcm,bgmac-nsp" sounds a bit too general. As I understand, this is a family > of SoCs that might not all have the exact same implementation of this > ethernet device, as we can see from the long lookup table in bgmac_probe(). The Broadcom iProc family of SoCs contains: Northstar, Northstar Plus, Cygnus, Northstar 2, a few SoCs that are under development, and a number of ethernet switches (which might never be officially supported). Each one of these SoCs could have a different revision of the gmac IP block, but they should be uniform within each SoC (though there might be an A0/B0 change necessary). The Northstar Plus product family has a number of different implementations, but the SoC is unchanged. So, I think this might be too specific, when we really need a general compat string. Broadcom has a history of sharing IP blocks amongst the different divisions, so this driver might be used on other SoC families (as has apparently been done in the past, based on the code you reference). I do not know of any way to tell which legacy, non-iProc chips have used this IP block. I can make this "brcm,iproc-bgmac", and add "brcm,iproc-nsp-bgmac" as an alternative compatible string in this file (which I believe you are suggesting), but there might be non-iProc SoCs that use this driver. Is this acceptable? Thanks, Jon > Please document the specific product numbers here that are publically > known already. Having the driver match just on "brcm,bgmac-nsp" as a fallback > is fine, so you can document that one as required for all users. > > Arnd
Re: [PATCH v1 03/11] KVM: x86: dynamic kvm_apic_map
On 01/07/2016 17:06, Paolo Bonzini wrote: >>> >> > Should it? >> Yes, x2APIC ID cannot be changed in hardware and is initialized to the >> initial APIC ID. >> Letting LAPIC_SET change x2APIC ID would allow scenarios where userspace >> reuses old VMs instead of building new ones after reconfiguration. >> I don't think it's a sensible use case, and it is currently broken, >> because we don't exit to userspace when changing APIC mode, so KVM would >> just set APIC ID to VCPU ID on any transition and userspace couldn't >> amend it. Forgot to reply about this: letting SET_LAPIC change x2APIC IDs is nonsense. In x2APIC mode + new capability disabled, SET_LAPIC should ignore the id register altogether for backwards compatibility. In x2APIC mode + new capability enabled, it should either ignore it, or fail if the x2APIC ID doesn't match the VCPU id. I suspect the latter is better because it would help catch the case where userspace is erroneously shifting the id left to bits 31-24. Paolo
Re: [RFC PATCH 0/2] net: ethernet: Add support for gmii2rgmii converter
On Fri, Jul 01, 2016 at 11:50:10AM +0530, Kedareswara rao Appana wrote: > This patch series does the following > ---> Add support for gmii2rgmii converter. How generic is this gmii2rgmii IP block? Could it be used with any GMII and RGMII device? Should it be placed in drivers/net/phy, making it easier to reuse? Also, Russell King's phylink might be a more natural structure for this. It is hard to know when those patches will land, but it might be worth looking at. Andrew
Re: [PATCH] [media] cec: add missing inline stubs
On Friday, July 1, 2016 4:35:09 PM CEST Hans Verkuil wrote: > On 07/01/2016 01:19 PM, Arnd Bergmann wrote: > > The linux/cec.h header file contains empty inline definitions for > > a subset of the API for the case in which CEC is not enabled, > > however we have driver that call other functions as well: > > > > drivers/media/i2c/adv7604.c: In function 'adv76xx_cec_tx_raw_status': > > drivers/media/i2c/adv7604.c:1956:3: error: implicit declaration of function > > 'cec_transmit_done' [-Werror=implicit-function-declaration] > > drivers/media/i2c/adv7604.c: In function 'adv76xx_cec_isr': > > drivers/media/i2c/adv7604.c:2012:4: error: implicit declaration of function > > 'cec_received_msg' [-Werror=implicit-function-declaration] > > drivers/media/i2c/adv7604.c: In function 'adv76xx_probe': > > drivers/media/i2c/adv7604.c:3482:20: error: implicit declaration of > > function 'cec_allocate_adapter' [-Werror=implicit-function-declaration] > > I don't understand this. These calls are under #if > IS_ENABLED(CONFIG_VIDEO_ADV7842_CEC), > and that should be 0 if MEDIA_CEC is not selected. > > Am I missing some weird config combination? This must have been a build error I ran into before your patch, when I had this one applied locally instead: diff --git a/include/media/cec.h b/include/media/cec.h index c462f9b44074..564a6a06bed7 100644 --- a/include/media/cec.h +++ b/include/media/cec.h @@ -187,7 +187,7 @@ static inline bool cec_is_sink(const struct cec_adapter *adap) return adap->phys_addr == 0; } -#if IS_ENABLED(CONFIG_MEDIA_CEC) +#if IS_REACHABLE(CONFIG_MEDIA_CEC) struct cec_adapter *cec_allocate_adapter(const struct cec_adap_ops *ops, void *priv, const char *name, u32 caps, u8 available_las, struct device *parent); because that was hiding the declarations when the code could not reach it. With your newer patch that is not possible any more. I also wasn't aware that each of these already had their own Kconfig symbols. 
Could we just do this instead of your patch then:

diff --git a/drivers/media/i2c/Kconfig b/drivers/media/i2c/Kconfig
index ce9006e10a30..73e047220905 100644
--- a/drivers/media/i2c/Kconfig
+++ b/drivers/media/i2c/Kconfig
@@ -222,6 +222,7 @@ config VIDEO_ADV7604
 config VIDEO_ADV7604_CEC
 	bool "Enable Analog Devices ADV7604 CEC support"
 	depends on VIDEO_ADV7604 && MEDIA_CEC
+	depends on VIDEO_ADV7604=m || MEDIA_CEC=y
 	---help---
 	  When selected the adv7604 will support the optional
 	  HDMI CEC feature.
@@ -243,6 +244,7 @@ config VIDEO_ADV7842
 config VIDEO_ADV7842_CEC
 	bool "Enable Analog Devices ADV7842 CEC support"
 	depends on VIDEO_ADV7842 && MEDIA_CEC
+	depends on VIDEO_ADV7842=m || MEDIA_CEC=y
 	---help---
 	  When selected the adv7842 will support the optional
 	  HDMI CEC feature.
@@ -475,6 +477,7 @@ config VIDEO_ADV7511
 config VIDEO_ADV7511_CEC
 	bool "Enable Analog Devices ADV7511 CEC support"
 	depends on VIDEO_ADV7511 && MEDIA_CEC
+	depends on VIDEO_ADV7511=m || MEDIA_CEC=y
 	---help---
 	  When selected the adv7511 will support the optional
 	  HDMI CEC feature.

diff --git a/drivers/media/platform/vivid/Kconfig b/drivers/media/platform/vivid/Kconfig
index 8e6918c5c87c..8e31146d079a 100644
--- a/drivers/media/platform/vivid/Kconfig
+++ b/drivers/media/platform/vivid/Kconfig
@@ -26,6 +26,7 @@ config VIDEO_VIVID
 config VIDEO_VIVID_CEC
 	bool "Enable CEC emulation support"
 	depends on VIDEO_VIVID && MEDIA_CEC
+	depends on VIDEO_VIVID=m || MEDIA_CEC=y
 	---help---
 	  When selected the vivid module will emulate
 	  the optional HDMI CEC feature.

which is still not overly nice, but it manages to avoid the IS_REACHABLE() check and it lets MEDIA_CEC be a module.

	Arnd
Re: [PATCH] arm64: defconfig: Enable QDF2432 config options
Christopher Covington wrote: Due to distribution differences [1][2], I see =y built-in as the default on mobile platforms and =m modular as the default on server platforms. I don't think we should mix "server" defconfig entries with "mobile" defconfig entries. It's an arm64 defconfig, neither server nor mobile. The other entries are =y, so these should be =y. No one uses this defconfig as-is for any platform. But thinking about it, we could get a poweroff, reset, over-temperature, or other interrupt before we get to the initramfs. So the pin controller driver should be =y built-in. One might make rootfs media =y built-in to make testing easier, but I prefer to put my rootfs on SATA, as SD/MMC is much slower and smaller by comparison. Let's not overthink things, and just make *everything* =y by default. -- Sent by an employee of the Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation.
Re: [docs-next PATCH] Documentation/sphinx: skip build if user requested specific DOCBOOKS
On Fri, 01 Jul 2016 16:31:14 +0300, Jani Nikula wrote: > On Fri, 01 Jul 2016, Mauro Carvalho Chehab wrote: > > On Fri, 1 Jul 2016 15:24:44 +0300 > > Jani Nikula wrote: > > > >> If the user requested specific DocBooks to be built using 'make > >> DOCBOOKS=foo.xml htmldocs', assume no Sphinx build is desired. This > >> check is transitional, and can be removed once we drop the DocBook > >> build. > > > > I guess this is actually a wrong assumption. I mean, it is valid to > > build multiple documents at the same time. Sometimes I do things > > like: > > make DOCBOOKS="media_api.xml device-drivers.xml" htmldocs > > > > When I want both docs to be compiled. > > > > What I would be expecting is that Sphinx would be looking into > > the DOCBOOKS targets and see if (some) of them belong to it. > > > > Alternatively, we could add a separate makefile var for the > > Sphinx targets, but the logic would be more complex, as it > > should: > > Please let's not conflate DOCBOOKS to mean something other than > DocBooks. I think it'll be easier that way. OK. Works for me. > So I guess we'll need a way to build just a subset of the Sphinx > documentation. I would like that to be a somewhat generic thing, not > requiring a separate conf file for each subset. Is the granularity of a > directory enough? I've been meaning to look into passing different > and to sphinx-build for this, but I don't have > the time now. In the case of the media doc, it is actually a matter of building one rst file (linux_tv/index.rst). For that, I'm equally happy if we specify it via or via . I suspect we'll end up needing filename granularity at some point, but, if you find it easier to just handle for now, that's OK. Regards, Mauro
Re: [PATCH v1 06/11] KVM: x86: use hardware-compatible format for APIC ID register
On 01/07/2016 16:54, Radim Krčmář wrote: >> >(Hint: we >> > want kvm-unit-tests for this). > :) Something like this? > > enable_xapic() > id = apic_id() > set_apic_id(id+1) // ? > enable_x2apic() > id == apic_id() & 0xff > disable_apic() > enable_xapic() > id == apic_id() > Yes, plus checking that it "moves" appropriately between low and high bits. Thanks, Paolo
Re: [PATCH v1 03/11] KVM: x86: dynamic kvm_apic_map
On 01/07/2016 16:38, Radim Krčmář wrote: >> > Should it? > Yes, x2APIC ID cannot be changed in hardware and is initialized to the > initial APIC ID. > Letting LAPIC_SET change x2APIC ID would allow scenarios where userspace > reuses old VMs instead of building new ones after reconfiguration. > I don't think it's a sensible use case and it is currently broken, > because we don't exit to userspace when changing APIC mode, so KVM would > just set APIC ID to VCPU ID on any transition and userspace couldn't > amend it. > >> > According to QEMU, if you have e.g. 3 cores per socket, one >> > socket takes 4 APIC IDs. For Knights Landing the "worst" prime factor in >> > 288 is 3^2, so you need APIC IDs up to 288 * (4/3)^2 = 512. > The topology can result in sparse APIC IDs, and APIC ID is initialized > from VCPU ID, so userspace has to pick VCPU IDs accordingly. Right, I was confusing KVM_MAX_VCPUS and KVM_MAX_VCPU_ID. So the overflow case cannot happen and neither can the 32GB allocation. On the other hand, I suspect you need to bump KVM_MAX_VCPU_ID beyond its current default setting (which is equal to KVM_MAX_VCPUS), up to 511 or 1023. Thanks, Paolo
Re: [PATCH] arm64: defconfig: Enable QDF2432 config options
On 07/01/2016 10:14 AM, Timur Tabi wrote: > Christopher Covington wrote: >> arch/arm64/configs/defconfig | 4 >> scripts/patch-details.sh | 21 ++--- > > I don't think these two files should be combined. Oops, sorry. Fixed in v2. >> CONFIG_PINCTRL_SINGLE=y >> CONFIG_PINCTRL_MSM8916=y >> +CONFIG_PINCTRL_QDF2XXX=m >> CONFIG_PINCTRL_QCOM_SPMI_PMIC=y >> CONFIG_GPIO_SYSFS=y >> CONFIG_GPIO_DWAPB=y >> @@ -245,6 +246,7 @@ CONFIG_MMC=y >> CONFIG_MMC_BLOCK_MINORS=32 >> CONFIG_MMC_ARMMMCI=y >> CONFIG_MMC_SDHCI=y >> +CONFIG_MMC_SDHCI_ACPI=m >> CONFIG_MMC_SDHCI_PLTFM=y > > Why are these =m when all the others are =y? Due to distribution differences [1][2], I see =y built-in as the default on mobile platforms and =m modular as the default on server platforms. But thinking about it, we could get a poweroff, reset, over-temperature, or other interrupt before we get to the initramfs. So the pin controller driver should be =y built-in. One might make rootfs media =y built-in to make testing easier, but I prefer to put my rootfs on SATA, as SD/MMC is much slower and smaller by comparison. Thanks, Cov Examples I found to spot check my recollection: 1. https://github.com/CyanogenMod/android_kernel_lge_hammerhead/blob/cm-13.0/arch/arm/configs/cyanogenmod_hammerhead_defconfig 2. https://git.centos.org/blob/sig-altarch!kernel.git/c3cb6cc6edb5400321fd9d83413add4cf61d942e/SOURCES!config-arm64 -- Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH] iio: dac: mcp4725: Remove unneeded conversions to bool
> Found with scripts/coccinelle/misc/boolconv.cocci. I'd argue that pd = (inbuf[0] >> 1) & 0x3; if (pd) { data->powerdown = true; data->powerdown_mode = pd - 1; } else { data->powerdown = false; data->powerdown_mode = 2; /* largest register to gnd */ } is less compact but probably easier to read/understand; > Signed-off-by: Andrew F. Davis but I'm fine with the proposed patch as well to make cocci happy regards, p. > --- > drivers/iio/dac/mcp4725.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/iio/dac/mcp4725.c b/drivers/iio/dac/mcp4725.c > index cca935c..c3bb3fd 100644 > --- a/drivers/iio/dac/mcp4725.c > +++ b/drivers/iio/dac/mcp4725.c > @@ -361,7 +361,7 @@ static int mcp4725_probe(struct i2c_client *client, > return err; > } > pd = (inbuf[0] >> 1) & 0x3; > - data->powerdown = pd > 0 ? true : false; > + data->powerdown = pd > 0; > data->powerdown_mode = pd ? pd - 1 : 2; /* largest register to gnd */ > data->dac_value = (inbuf[1] << 4) | (inbuf[2] >> 4); > > -- Peter Meerwald-Stadler +43-664-218 (mobile)
Re: [RFC PATCH 0/2] net: ethernet: Add support for gmii2rgmii converter
On 30/06/2016 23:20, Kedareswara rao Appana wrote: > This patch series does the following > ---> Add support for gmii2rgmii converter. > ---> Add support for gmii2rgmii converter in the macb driver. > > Earlier sent one RFC patch https://patchwork.kernel.org/patch/9186615/ > which includes converter related stuff also in macb driver. > This patch series fixes this issue. This still seems very ad hoc and not completely explained. Can you clarify how the gmii2rgmii converter gets used? Is the expectation that a MACB Ethernet adapter will be connected to a RGMII PHY like this: MACB <=> GMII2RGMII <=> RGMII PHY MACB MDIO <===> RGMII_PHY and still have the ability to control both the PHY device's MDIO registers and the GMII2RGMII converter and we need to make sure both have matching settings, or is it something like this: MACB <=> GMII2RGMII <=> RGMII PHY MACB MDIO unconnected and we lose the ability to query the PHY via MDIO? As Nicolas pointed out, providing a binding documentation update may help reviewers understand what is being accomplished here. Thanks! > > Kedareswara rao Appana (2): > net: ethernet: xilinx: Add gmii2rgmii converter support > net: macb: Add gmii2rgmii phy converter support > > drivers/net/ethernet/cadence/macb.c | 21 ++- > drivers/net/ethernet/cadence/macb.h |3 + > drivers/net/ethernet/xilinx/Kconfig |7 ++ > drivers/net/ethernet/xilinx/Makefile|1 + > drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c | 76 > +++ > include/linux/xilinx_gmii2rgmii.h | 24 +++ > 6 files changed, 131 insertions(+), 1 deletions(-) > create mode 100644 drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c > create mode 100644 include/linux/xilinx_gmii2rgmii.h > -- Florian
Re: [PATCH] doc: flat-table directive
On Fri, 1 Jul 2016 16:07:47 +0200, Markus Heiser wrote: > On 01.07.2016 at 15:09, Jani Nikula wrote: > > On Fri, 01 Jul 2016, Markus Heiser wrote: > >> In Sphinx, the code-block directive is a literal block; no refs or markup > >> will be interpreted. This is different to what you know from DocBook. > >> I will look for a solution that matches your requirement. > > > > Parsed literal block solves the problem, but brings other problems like > > requiring escaping for a lot of stuff. And you lose syntax > > highlighting. It would be great to have a middle ground, like a > > "semi-parsed code block" where the references are parsed but nothing > > else. > > > > http://docutils.sourceforge.net/docs/ref/rst/directives.html#parsed-literal-block > > > > Yes, "parsed literal" blocks are not the solution and I have none > yet ... and you are right, we need something "semi". I doubt whether we > will eventually find a solution for this, but I will > think about it ... I don't know how, but it must be a solution that is > transparent to the pygments highlighter and does not distort the > node tree. I have to study the sphinx-writer and pygments first. Yeah, a "semi" solution would be perfect. In the specific case of the header files, a parsed literal should work. The code that generates the *.h.xml files right now already does escaping where it needs to make docbook happy. OK, the characters that need to be escaped will be different, but it shouldn't be hard to do, I guess. At the media Makefile, it currently does escaping for <, > and &: ESCAPE = \ -e "s/&/\\&amp;/g" \ -e "s/</\\&lt;/g" \ -e "s/>/\\&gt;/g" > > > >> OK, checking dead internal links might be one requirement more. Normally > >> sphinx reports internal refs which are broken with a WARNING, but I have to > >> analyze what xmllint checks and how we could realize something similar > >> in sphinx / or if it is enough what sphinx reports.
> > > > When we turn function() and structure references to Sphinx > > references, we pretty much rely on *not* being strict about all the > > targets existing. At least for now. In an ideal world we'd aim for the -n > > and -W sphinx-build options, but we're far from that, and I don't know > > if it's really possible ever. I guess that will depend on how much care the subsystem maintainer takes with that. When building things on media, we have zero warnings when building the Kernel with W=1 on x86. We also check both sparse and smatch regularly, trying to zero the warnings there too (currently, there are just a few of them, all false positives and not trivial to remove). We also have zero warnings on media documentation using xmllint's pedantic mode. > > Is it possible to set -n and/or -W on a per-rst-file basis, either in > > the top config file or from within the rst file itself? Then we could > > gradually improve this, and subsystems that really care about this could > > be in the forefront. > > There is a nitpick_ignore config, but this will not help. > As far as I can see, if you want something similar on a per-file basis, > you need to implement an (HTML) builder which checks > on which tree level the *current* node is, and if this node > is in a doctree of one of the files in your *configured file list*, > turns the warning into an error ... if we see more > requirements coming in which need a custom HTML builder, > we can implement one. I have implemented a man-page builder > for the kernel-doc comments, because it was inevitable, > but I think it is not a good idea to reimplement the > HTML builder in this first usage of sphinx. > > Maybe this is totally the wrong way; maybe it is better > to implement a *lint* builder from scratch (should not > be hard). That would work. > > >> But before that I will send out some small patches which are needed > >> first (IMHO). E.g.
customizing the RTD theme for rendering large > >> tables in HTML well and activation of useful extensions like todolist. > >> I have this in my "chaotic bulk" patch :-) ... I will separate it out > >> and send it to Jon one by one. > > > > Btw I don't think we are really attached to the RTD theme. It's just > > something I picked that was prettier than the default theme, and was > > widely available and packaged in distros. > > IMHO it is not perfect but the most elaborate you will find on the net. > > > Ideally, we should probably > > keep the customization at a level where it's possible for people to > > easily use different themes. > > Layout is done in the theme; we have no chance to influence > the layout outside of / before the theme. > > > That said, rendering big tables in the RTD > > theme is definitely an issue. > > > > I'd also aim to be fairly conservative at first in terms of the rst > > features and Sphinx extensions we use. Keep it simple. It's really easy > > to go overboard in the beginning. See how things pan out and gradually > > extend from there. > Yes, KIS ... I will send the theme patch and you will see
Re: EXT: Re: [PATCH RFC 1/4] dma: imx-sdma - reduce transfer latency for DMA cyclic clients
On 28/06/16 17:34, Vinod Koul wrote: On Thu, Jun 09, 2016 at 03:16:30PM +0300, Nandor Han wrote: Having the SDMA driver use a tasklet for running the clients' callbacks introduces some issues: - probability of desynchronized data because of the race condition created, since the DMA transaction status is retrieved only when the callback is executed, leaving plenty of time for the transaction status to get altered. - inter-transfer latency which can leave channels idle. Move the callback execution, for cyclic channels, to the SDMA interrupt (as advised in `Documentation/dmaengine/provider.txt`) to (a) reduce the inter-transfer latency and (b) eliminate the race condition possibility where the DMA transaction status might be changed by the time it is read. The responsibility for the SDMA interrupt latency is moved to the SDMA clients, which case by case should defer the work to bottom halves when needed. Both of these look fine. Please change the patch titles to dmaengine: Are these going to be merged thru the dmaengine tree or the serial one? I will soon send a v2 where I will fix the titles. If you are OK with the whole patchset it can be merged through the dmaengine tree, otherwise it probably goes through the serial one. Let me know which is the best option.
Thanks, Nandor Signed-off-by: Nandor Han--- drivers/dma/imx-sdma.c | 36 1 file changed, 16 insertions(+), 20 deletions(-) diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c index 0f6fd42..e497847 100644 --- a/drivers/dma/imx-sdma.c +++ b/drivers/dma/imx-sdma.c @@ -654,12 +654,6 @@ static void sdma_event_disable(struct sdma_channel *sdmac, unsigned int event) writel_relaxed(val, sdma->regs + chnenbl); } -static void sdma_handle_channel_loop(struct sdma_channel *sdmac) -{ - if (sdmac->desc.callback) - sdmac->desc.callback(sdmac->desc.callback_param); -} - static void sdma_update_channel_loop(struct sdma_channel *sdmac) { struct sdma_buffer_descriptor *bd; @@ -678,13 +672,25 @@ static void sdma_update_channel_loop(struct sdma_channel *sdmac) sdmac->status = DMA_ERROR; bd->mode.status |= BD_DONE; + + /* +* The callback is called from the interrupt context in order +* to reduce latency and to avoid the risk of altering the +* SDMA transaction status by the time the client tasklet is +* executed. 
+*/ + + if (sdmac->desc.callback) + sdmac->desc.callback(sdmac->desc.callback_param); + sdmac->buf_tail++; sdmac->buf_tail %= sdmac->num_bd; } } -static void mxc_sdma_handle_channel_normal(struct sdma_channel *sdmac) +static void mxc_sdma_handle_channel_normal(unsigned long data) { + struct sdma_channel *sdmac = (struct sdma_channel *) data; struct sdma_buffer_descriptor *bd; int i, error = 0; @@ -711,16 +717,6 @@ static void mxc_sdma_handle_channel_normal(struct sdma_channel *sdmac) sdmac->desc.callback(sdmac->desc.callback_param); } -static void sdma_tasklet(unsigned long data) -{ - struct sdma_channel *sdmac = (struct sdma_channel *) data; - - if (sdmac->flags & IMX_DMA_SG_LOOP) - sdma_handle_channel_loop(sdmac); - else - mxc_sdma_handle_channel_normal(sdmac); -} - static irqreturn_t sdma_int_handler(int irq, void *dev_id) { struct sdma_engine *sdma = dev_id; @@ -737,8 +733,8 @@ static irqreturn_t sdma_int_handler(int irq, void *dev_id) if (sdmac->flags & IMX_DMA_SG_LOOP) sdma_update_channel_loop(sdmac); - - tasklet_schedule(>tasklet); + else + tasklet_schedule(>tasklet); __clear_bit(channel, ); } @@ -1739,7 +1735,7 @@ static int sdma_probe(struct platform_device *pdev) dma_cookie_init(>chan); sdmac->channel = i; - tasklet_init(>tasklet, sdma_tasklet, + tasklet_init(>tasklet, mxc_sdma_handle_channel_normal, (unsigned long) sdmac); /* * Add the channel to the DMAC list. Do not add channel 0 though -- 2.8.3
Re: [PATCH] spi: imx: wait_for_completion_timeout(..) for PIO transfers
On Fri, Jul 01, 2016 at 02:32:58PM +0200, Christian Gmeiner wrote: > In some rare cases I see the following 'task blocked' information. It > looks like the PIO transfer has some problems and never succeeds. Make > use of wait_for_completion_timeout(..) to detect this case and > return -ETIMEDOUT. > > v2: remove backtrace from commit message Please do not submit new versions of already applied patches, please submit incremental updates to the existing code. Modifying existing commits creates problems for other users building on top of those commits so it's best practice to only change pubished git commits if absolutely essential. If you're including an inter-version changelog please keep it after the --- as covered in SubmittingPatches. signature.asc Description: PGP signature
Re: [PATCH] drivers/perf: arm-pmu: Handle per-interrupt affinity mask
Hi Marc, On 2016年07月01日 21:21, Marc Zyngier wrote: On a big-little system, PMUs can be wired to CPUs using per CPU interrups (PPI). In this case, it is important to make sure that the enable/disable do happen on the right set of CPUs. So instead of relying on the interrupt-affinity property, we can use the actual percpu affinity that DT exposes as part of the interrupt specifier. The DT binding is also updated to reflect the fact that the interrupt-affinity property shouldn't be used in that case. Signed-off-by: Marc ZyngierTested-by: Caesar Wang I pick this up on my local branch. 8bc671a FROMLIST: drivers/perf: arm-pmu: Handle per-interrupt affinity mask 3d723f4 FROMLIST: arm64: dts: rockchip: support the pmu node for rk3399 1359b92 FIXUP: FROMLIST: arm64: dts: rockchip: change all interrupts cells for 4 on rk3399 SoCs ... Tested on rk3399 board. localhost / # perf list List of pre-defined events (to be used in -e): cpu-cycles OR cycles [Hardware event] instructions [Hardware event] cache-references [Hardware event] cache-misses [Hardware event] branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] ... perf stat --cpu 0/1/2/3. to minitor e.g. 
cpu0; localhost / # perf stat --cpu 0 ^C Performance counter stats for 'CPU(s) 0': 3374.917571 task-clock (msec) # 1.001 CPUs utilized [100.00%] 20 context-switches # 0.006 K/sec [100.00%] 2 cpu-migrations # 0.001 K/sec [100.00%] 55 page-faults # 0.016 K/sec 7151843 cycles # 0.002 GHz [100.00%] stalled-cycles-frontend stalled-cycles-backend 4272536 instructions # 0.60 insns per cycle [100.00%] 568406 branches # 0.168 M/sec [100.00%] 65652 branch-misses # 11.55% of all branches Also, 'perf top' to monitor the PMU interrupts from cpus --- --- Documentation/devicetree/bindings/arm/pmu.txt | 4 +++- drivers/perf/arm_pmu.c| 22 +- 2 files changed, 20 insertions(+), 6 deletions(-) diff --git a/Documentation/devicetree/bindings/arm/pmu.txt b/Documentation/devicetree/bindings/arm/pmu.txt index 74d5417..61c8b46 100644 --- a/Documentation/devicetree/bindings/arm/pmu.txt +++ b/Documentation/devicetree/bindings/arm/pmu.txt @@ -39,7 +39,9 @@ Optional properties: When using a PPI, specifies a list of phandles to CPU nodes corresponding to the set of CPUs which have a PMU of this type signalling the PPI listed in the - interrupts property. + interrupts property, unless this is already specified + by the PPI interrupt specifier itself (in which case + the interrupt-affinity property shouldn't be present). This property should be present when there is more than a single SPI. 
diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c index 140436a..065ccec 100644 --- a/drivers/perf/arm_pmu.c +++ b/drivers/perf/arm_pmu.c @@ -603,7 +603,8 @@ static void cpu_pmu_free_irq(struct arm_pmu *cpu_pmu) irq = platform_get_irq(pmu_device, 0); if (irq >= 0 && irq_is_percpu(irq)) { - on_each_cpu(cpu_pmu_disable_percpu_irq, , 1); + on_each_cpu_mask(_pmu->supported_cpus, +cpu_pmu_disable_percpu_irq, , 1); free_percpu_irq(irq, _events->percpu_pmu); } else { for (i = 0; i < irqs; ++i) { @@ -645,7 +646,9 @@ static int cpu_pmu_request_irq(struct arm_pmu *cpu_pmu, irq_handler_t handler) irq); return err; } - on_each_cpu(cpu_pmu_enable_percpu_irq, , 1); + + on_each_cpu_mask(_pmu->supported_cpus, +cpu_pmu_enable_percpu_irq, , 1); } else { for (i = 0; i < irqs; ++i) { int cpu = i; @@ -961,9 +964,18 @@ static int of_pmu_irq_cfg(struct arm_pmu *pmu) i++; } while (1); - /* If we didn't manage to parse anything, claim to support all CPUs */ - if (cpumask_weight(>supported_cpus) == 0) - cpumask_setall(>supported_cpus); + /* If we didn't manage to parse anything, try the interrupt affinity */ + if (cpumask_weight(>supported_cpus) == 0) { + if (!using_spi) { + /* If using PPIs, check the affinity of the partition */ + int irq = platform_get_irq(pdev, 0); + irq_get_percpu_devid_partition(irq, + >supported_cpus); + } else { + /* Otherwise default to all CPUs */ + cpumask_setall(>supported_cpus); + } + } /* If we matched up the IRQ affinities, use them to route the SPIs */
Re: EXT: Re: [PATCH RFC 1/4] dma: imx-sdma - reduce transfer latency for DMA cyclic clients
On 28/06/16 17:34, Vinod Koul wrote: On Thu, Jun 09, 2016 at 03:16:30PM +0300, Nandor Han wrote: Having the SDMA driver use a tasklet for running the clients' callbacks introduces some issues: - probability to have desynchronized data because of the race condition created since the DMA transaction status is retrieved only when the callback is executed, leaving plenty of time for the transaction status to get altered. - inter-transfer latency which can leave channels idle. Move the callback execution, for cyclic channels, to the SDMA interrupt (as advised in `Documentation/dmaengine/provider.txt`) to (a) reduce the inter-transfer latency and (b) eliminate the race condition possibility where the DMA transaction status might be changed by the time it is read. The responsibility of the SDMA interrupt latency is moved to the SDMA clients which, case by case, should defer the work to bottom halves when needed. Both of these look fine. Please change the patch titles to dmaengine: Are these going to be merged through the dmaengine tree or the serial one? I will send a V2 soon where I will fix the titles. If you are OK with all of the patchset it can be merged through the dmaengine tree, otherwise it probably goes through the serial one. Let me know which is the best option. 
Thanks, Nandor Signed-off-by: Nandor Han --- drivers/dma/imx-sdma.c | 36 1 file changed, 16 insertions(+), 20 deletions(-) diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c index 0f6fd42..e497847 100644 --- a/drivers/dma/imx-sdma.c +++ b/drivers/dma/imx-sdma.c @@ -654,12 +654,6 @@ static void sdma_event_disable(struct sdma_channel *sdmac, unsigned int event) writel_relaxed(val, sdma->regs + chnenbl); } -static void sdma_handle_channel_loop(struct sdma_channel *sdmac) -{ - if (sdmac->desc.callback) - sdmac->desc.callback(sdmac->desc.callback_param); -} - static void sdma_update_channel_loop(struct sdma_channel *sdmac) { struct sdma_buffer_descriptor *bd; @@ -678,13 +672,25 @@ static void sdma_update_channel_loop(struct sdma_channel *sdmac) sdmac->status = DMA_ERROR; bd->mode.status |= BD_DONE; + + /* +* The callback is called from the interrupt context in order +* to reduce latency and to avoid the risk of altering the +* SDMA transaction status by the time the client tasklet is +* executed. 
+*/ + + if (sdmac->desc.callback) + sdmac->desc.callback(sdmac->desc.callback_param); + sdmac->buf_tail++; sdmac->buf_tail %= sdmac->num_bd; } } -static void mxc_sdma_handle_channel_normal(struct sdma_channel *sdmac) +static void mxc_sdma_handle_channel_normal(unsigned long data) { + struct sdma_channel *sdmac = (struct sdma_channel *) data; struct sdma_buffer_descriptor *bd; int i, error = 0; @@ -711,16 +717,6 @@ static void mxc_sdma_handle_channel_normal(struct sdma_channel *sdmac) sdmac->desc.callback(sdmac->desc.callback_param); } -static void sdma_tasklet(unsigned long data) -{ - struct sdma_channel *sdmac = (struct sdma_channel *) data; - - if (sdmac->flags & IMX_DMA_SG_LOOP) - sdma_handle_channel_loop(sdmac); - else - mxc_sdma_handle_channel_normal(sdmac); -} - static irqreturn_t sdma_int_handler(int irq, void *dev_id) { struct sdma_engine *sdma = dev_id; @@ -737,8 +733,8 @@ static irqreturn_t sdma_int_handler(int irq, void *dev_id) if (sdmac->flags & IMX_DMA_SG_LOOP) sdma_update_channel_loop(sdmac); - - tasklet_schedule(&sdmac->tasklet); + else + tasklet_schedule(&sdmac->tasklet); __clear_bit(channel, &stat); } @@ -1739,7 +1735,7 @@ static int sdma_probe(struct platform_device *pdev) dma_cookie_init(&sdmac->chan); sdmac->channel = i; - tasklet_init(&sdmac->tasklet, sdma_tasklet, + tasklet_init(&sdmac->tasklet, mxc_sdma_handle_channel_normal, (unsigned long) sdmac); /* * Add the channel to the DMAC list. Do not add channel 0 though -- 2.8.3
Re: [PATCH] spi: imx: wait_for_completion_timeout(..) for PIO transfers
On Fri, Jul 01, 2016 at 02:32:58PM +0200, Christian Gmeiner wrote: > In some rare cases I see the following 'task blocked' information. It > looks like the PIO transfer has some problems and never succeeds. Make > use of wait_for_completion_timeout(..) to detect this case and > return -ETIMEDOUT. > > v2: remove backtrace from commit message Please do not submit new versions of already applied patches; please submit incremental updates to the existing code. Modifying existing commits creates problems for other users building on top of those commits, so it's best practice to only change published git commits if absolutely essential. If you're including an inter-version changelog, please keep it after the --- as covered in SubmittingPatches. signature.asc Description: PGP signature
Re: [PATCH] drivers/perf: arm-pmu: Handle per-interrupt affinity mask
Hi Marc, On 2016/07/01 21:21, Marc Zyngier wrote: On a big-little system, PMUs can be wired to CPUs using per-CPU interrupts (PPI). In this case, it is important to make sure that the enable/disable operations happen on the right set of CPUs. So instead of relying on the interrupt-affinity property, we can use the actual percpu affinity that DT exposes as part of the interrupt specifier. The DT binding is also updated to reflect the fact that the interrupt-affinity property shouldn't be used in that case. Signed-off-by: Marc Zyngier Tested-by: Caesar Wang I picked this up on my local branch: 8bc671a FROMLIST: drivers/perf: arm-pmu: Handle per-interrupt affinity mask 3d723f4 FROMLIST: arm64: dts: rockchip: support the pmu node for rk3399 1359b92 FIXUP: FROMLIST: arm64: dts: rockchip: change all interrupts cells for 4 on rk3399 SoCs ... Tested on an rk3399 board. localhost / # perf list List of pre-defined events (to be used in -e): cpu-cycles OR cycles [Hardware event] instructions [Hardware event] cache-references [Hardware event] cache-misses [Hardware event] branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] ... perf stat --cpu 0/1/2/3 to monitor, e.g. 
cpu0; localhost / # perf stat --cpu 0 ^C Performance counter stats for 'CPU(s) 0': 3374.917571 task-clock (msec) # 1.001 CPUs utilized [100.00%] 20 context-switches # 0.006 K/sec [100.00%] 2 cpu-migrations # 0.001 K/sec [100.00%] 55 page-faults # 0.016 K/sec 7151843 cycles # 0.002 GHz [100.00%] stalled-cycles-frontend stalled-cycles-backend 4272536 instructions # 0.60 insns per cycle [100.00%] 568406 branches # 0.168 M/sec [100.00%] 65652 branch-misses # 11.55% of all branches Also, 'perf top' to monitor the PMU interrupts from cpus --- --- Documentation/devicetree/bindings/arm/pmu.txt | 4 +++- drivers/perf/arm_pmu.c| 22 +- 2 files changed, 20 insertions(+), 6 deletions(-) diff --git a/Documentation/devicetree/bindings/arm/pmu.txt b/Documentation/devicetree/bindings/arm/pmu.txt index 74d5417..61c8b46 100644 --- a/Documentation/devicetree/bindings/arm/pmu.txt +++ b/Documentation/devicetree/bindings/arm/pmu.txt @@ -39,7 +39,9 @@ Optional properties: When using a PPI, specifies a list of phandles to CPU nodes corresponding to the set of CPUs which have a PMU of this type signalling the PPI listed in the - interrupts property. + interrupts property, unless this is already specified + by the PPI interrupt specifier itself (in which case + the interrupt-affinity property shouldn't be present). This property should be present when there is more than a single SPI. 
diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c index 140436a..065ccec 100644 --- a/drivers/perf/arm_pmu.c +++ b/drivers/perf/arm_pmu.c @@ -603,7 +603,8 @@ static void cpu_pmu_free_irq(struct arm_pmu *cpu_pmu) irq = platform_get_irq(pmu_device, 0); if (irq >= 0 && irq_is_percpu(irq)) { - on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1); + on_each_cpu_mask(&cpu_pmu->supported_cpus, +cpu_pmu_disable_percpu_irq, &irq, 1); free_percpu_irq(irq, &hw_events->percpu_pmu); } else { for (i = 0; i < irqs; ++i) { @@ -645,7 +646,9 @@ static int cpu_pmu_request_irq(struct arm_pmu *cpu_pmu, irq_handler_t handler) irq); return err; } - on_each_cpu(cpu_pmu_enable_percpu_irq, &irq, 1); + + on_each_cpu_mask(&cpu_pmu->supported_cpus, +cpu_pmu_enable_percpu_irq, &irq, 1); } else { for (i = 0; i < irqs; ++i) { int cpu = i; @@ -961,9 +964,18 @@ static int of_pmu_irq_cfg(struct arm_pmu *pmu) i++; } while (1); - /* If we didn't manage to parse anything, claim to support all CPUs */ - if (cpumask_weight(&pmu->supported_cpus) == 0) - cpumask_setall(&pmu->supported_cpus); + /* If we didn't manage to parse anything, try the interrupt affinity */ + if (cpumask_weight(&pmu->supported_cpus) == 0) { + if (!using_spi) { + /* If using PPIs, check the affinity of the partition */ + int irq = platform_get_irq(pdev, 0); + irq_get_percpu_devid_partition(irq, + &pmu->supported_cpus); + } else { + /* Otherwise default to all CPUs */ + cpumask_setall(&pmu->supported_cpus); + } + } /* If we matched up the IRQ affinities, use them to route the SPIs */ if (using_spi && i ==
Re: [RFC PATCH 1/2] net: ethernet: xilinx: Add gmii2rgmii converter support
On 30/06/2016 23:20, Kedareswara rao Appana wrote: > This patch adds support for gmii2rgmii converter. > > The GMII to RGMII IP core provides the Reduced Gigabit Media > Independent Interface (RGMII) between Ethernet physical media > Devices and the Gigabit Ethernet controller. This core can > switch dynamically between the three different speed modes of > Operation. > MDIO interface is used to set operating speed of Ethernet MAC > > Signed-off-by: Kedareswara rao Appana > --- > drivers/net/ethernet/xilinx/Kconfig |7 ++ > drivers/net/ethernet/xilinx/Makefile|1 + > drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c | 76 > +++ > include/linux/xilinx_gmii2rgmii.h | 24 +++ > 4 files changed, 108 insertions(+), 0 deletions(-) > create mode 100644 drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c > create mode 100644 include/linux/xilinx_gmii2rgmii.h > > diff --git a/drivers/net/ethernet/xilinx/Kconfig > b/drivers/net/ethernet/xilinx/Kconfig > index 4f5c024..d7df70a 100644 > --- a/drivers/net/ethernet/xilinx/Kconfig > +++ b/drivers/net/ethernet/xilinx/Kconfig > @@ -39,4 +39,11 @@ config XILINX_LL_TEMAC > This driver supports the Xilinx 10/100/1000 LocalLink TEMAC > core used in Xilinx Spartan and Virtex FPGAs > > +config XILINX_GMII2RGMII > + tristate "Xilinx GMII2RGMII converter driver" > + ---help--- > + This driver supports the Xilinx GMII to RGMII IP core. It provides > + the Reduced Gigabit Media Independent Interface(RGMII) between > + Ethernet physical media devices and the Gigabit Ethernet controller. 
> + > endif # NET_VENDOR_XILINX > diff --git a/drivers/net/ethernet/xilinx/Makefile > b/drivers/net/ethernet/xilinx/Makefile > index 214205e..bca0da0 100644 > --- a/drivers/net/ethernet/xilinx/Makefile > +++ b/drivers/net/ethernet/xilinx/Makefile > @@ -7,3 +7,4 @@ obj-$(CONFIG_XILINX_LL_TEMAC) += ll_temac.o > obj-$(CONFIG_XILINX_EMACLITE) += xilinx_emaclite.o > xilinx_emac-objs := xilinx_axienet_main.o xilinx_axienet_mdio.o > obj-$(CONFIG_XILINX_AXI_EMAC) += xilinx_emac.o > +obj-$(CONFIG_XILINX_GMII2RGMII) += xilinx_gmii2rgmii.o > diff --git a/drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c > b/drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c > new file mode 100644 > index 000..ca9f1ad > --- /dev/null > +++ b/drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c > @@ -0,0 +1,76 @@ > +/* Xilinx GMII2RGMII Converter driver > + * > + * Copyright (C) 2016 Xilinx, Inc. > + * > + * Author: Kedareswara rao Appana > + * > + * Description: > + * This driver is developed for Xilinx GMII2RGMII Converter > + * > + * This program is free software: you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation, either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +static void xgmii2rgmii_fix_mac_speed(void *priv, unsigned int speed) > +{ > + struct gmii2rgmii *xphy = (struct xphy *)priv; Why not pass struct xphy pointer directly? 
> + struct phy_device *gmii2rgmii_phydev = xphy->gmii2rgmii_phy_dev; > + u16 gmii2rgmii_reg = 0; > + > + switch (speed) { > + case 1000: > + gmii2rgmii_reg |= XILINX_GMII2RGMII_SPEED1000; > + break; > + case 100: > + gmii2rgmii_reg |= XILINX_GMII2RGMII_SPEED100; > + break; > + default: > + return; > + } > + > + xphy->mdio_write(xphy->mii_bus, gmii2rgmii_phydev->mdio.addr, > + XILINX_GMII2RGMII_REG_NUM, > + gmii2rgmii_reg); > +} > + > +int gmii2rgmii_phyprobe(struct gmii2rgmii *xphy) > +{ > + struct device_node *phy_node; > + struct phy_device *phydev; > + struct device_node *np = (struct device_node *)xphy->platform_data; > + > + phy_node = of_parse_phandle(np, "gmii2rgmii-phy-handle", 0); Is that property documented in a binding document? > + if (phy_node) { Should not there be an else clause which does not assign xphy->fix_mac_speed in case this property lookup fails? > + phydev = of_phy_attach(xphy->dev, phy_node, 0, 0); > + if (!phydev) { > + netdev_err(xphy->dev, > +"%s: no gmii to rgmii converter found\n", > +xphy->dev->name); > + return -1; > + } > + xphy->gmii2rgmii_phy_dev = phydev; > + } > + xphy->fix_mac_speed = xgmii2rgmii_fix_mac_speed; > + > +
Re: [PATCH v2 3/3] drivers core: allow id match override when manually binding driver
On Fri, Jul 01, 2016 at 10:58:34AM +0200, Michal Suchanek wrote: > On 1 July 2016 at 10:25, Mark Brown wrote: > > It's been repeatedly suggested to you that the tooling for this stuff > > could use some work. Please go and put some effort into that rather > > than continuing this thread which is accomplishing nothing. > You completely miss the point. No tooling will make people reconfigure > the kernel when the configuration in fact stays the same. > Sure the tooling does need work. And it would help getting the cases > when the tooling is NOT needed out of the way. I understand the problem perfectly, no amount of repeating yourself is going to change the problems that the bodge you are trying to force in creates for other users and the maintainability of the system. signature.asc Description: PGP signature
Re: [PATCH v3] vmlinux.lds: account for destructor sections
2016-06-29 20:28 GMT+03:00 Dmitry Vyukov: > On Fri, Jun 24, 2016 at 6:57 PM, Andrey Ryabinin > wrote: >> 2016-06-24 18:39 GMT+03:00 Dmitry Vyukov : >>> If CONFIG_KASAN is enabled and gcc is configured with >>> --disable-initfini-array and/or gold linker is used, >>> gcc emits .ctors/.dtors and .text.startup/.text.exit >>> sections instead of .init_array/.fini_array. >>> .dtors section is not explicitly accounted in the linker >>> script and messes vvar/percpu layout. Want: >>> >>> 822bfd80 D _edata >>> 822c D __vvar_beginning_hack >>> 822c A __vvar_page >>> 822c0080 0098 D vsyscall_gtod_data >>> 822c1000 A __init_begin >>> 822c1000 D init_per_cpu__irq_stack_union >>> 822c1000 A __per_cpu_load >>> 822d3000 D init_per_cpu__gdt_page >>> >>> Got: >>> >>> 8279a600 D _edata >>> 8279b000 A __vvar_page >>> 8279c000 A __init_begin >>> 8279c000 D init_per_cpu__irq_stack_union >>> 8279c000 A __per_cpu_load >>> 8279e000 D __vvar_beginning_hack >>> 8279e080 0098 D vsyscall_gtod_data >>> 827ae000 D init_per_cpu__gdt_page >>> >>> This happens because __vvar_page and .vvar get different >>> addresses in arch/x86/kernel/vmlinux.lds.S: >>> >>> . = ALIGN(PAGE_SIZE); >>> __vvar_page = .; >>> >>> .vvar : AT(ADDR(.vvar) - LOAD_OFFSET) { >>> /* work around gold bug 13023 */ >>> __vvar_beginning_hack = .; >>> >>> Discard .dtors/.fini_array/.text.exit, since we don't call dtors. >>> Merge .text.startup into init text. >>> >>> Signed-off-by: Dmitry Vyukov >> >> Reviewed-by: Andrey Ryabinin > > > Who can take it to some tree? akpm tends to be the maintainer of last resort. But you probably need to resend the patch, because LKML was not in CC list. Also, please add stable tag: Cc: # v4.0+
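A sketch of the kind of `vmlinux.lds` change the patch description implies: discarding the destructor-ish sections so `.dtors` cannot land between `_edata` and the vvar page and shift the layout. The exact wildcards and placement here are illustrative, not the applied patch:

```ld
/* Illustrative only: discard sections we never run, per the patch
 * description ("Discard .dtors/.fini_array/.text.exit, since we don't
 * call dtors. Merge .text.startup into init text."). */
/DISCARD/ : {
        *(.dtors)
        *(.fini_array .fini_array.*)
        *(.text.exit .text.exit.*)
}
```

With these sections discarded (and `.text.startup` folded into init text), the `__vvar_page` symbol and the `.vvar` output section get the same address again, matching the "Want:" layout in the commit message.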
Re: [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code
On Fri, Jul 01, 2016 at 09:15:55AM -0500, James Hartsock wrote: > On Fri, Jul 1, 2016 at 3:24 AM, Peter Zijlstra wrote: > > > On Fri, Jul 01, 2016 at 09:35:46AM +0200, Jiri Olsa wrote: > > > well this is an issue our partner met in the setup, > > > and I'm not sure what was their motivation for that, > > > perhaps James could clarify in here.. > > > > > > I tried to make the 'scratch that itch' solution as > > > mentioned in earlier discussion. > > > > > > So IIUIC what you say is that it needs to be a more generic > > > solution..? I'll continue staring at it then ;-) > > > > I just want to know what problem we're trying to solve.. > > > > Because it appears this is running 1 task each on a 'weird' subset of > > cpus and things going badly. If this really is the case, then teaching > > active balance to only move tasks to idle cpus or something should also > > cure things. > > > > Also, I'm curious why people set such weird masks. > > > > I think the original issue was reported/seen when using a straight range > of CPUs, but in that range they crossed numa nodes. Then, not knowing > what was triggering the issue and trying to reproduce it, we started > getting some crazy masks. > > The work-around the customer has been using is to use SCHED_RR as it doesn't > have this balance across NUMA issue. But it is also the fact that > SCHED_RR doesn't have this issue that makes it "seem" like a defect in > SCHED_OTHER. I have shared that this is triggered by the across-numa-node > taskset with the customer so they are aware of that. But if this is something > that is seen as a limitation of SCHED_OTHER and not reasonable to be > addressed upstream, I think it is at least something we should get > documented. But what exact usecase? A single task per cpu, or something else? Note that RR has different constraints than OTHER, but in both cases having skewed masks across a topology divide is unlikely to be good for performance. Esp. 
Re: [PATCH v3] vmlinux.lds: account for destructor sections
2016-06-29 20:28 GMT+03:00 Dmitry Vyukov : > On Fri, Jun 24, 2016 at 6:57 PM, Andrey Ryabinin > wrote: >> 2016-06-24 18:39 GMT+03:00 Dmitry Vyukov : >>> If CONFIG_KASAN is enabled and gcc is configured with >>> --disable-initfini-array and/or gold linker is used, >>> gcc emits .ctors/.dtors and .text.startup/.text.exit >>> sections instead of .init_array/.fini_array. >>> .dtors section is not explicitly accounted in the linker >>> script and messes vvar/percpu layout. Want: >>> >>> 822bfd80 D _edata >>> 822c D __vvar_beginning_hack >>> 822c A __vvar_page >>> 822c0080 0098 D vsyscall_gtod_data >>> 822c1000 A __init_begin >>> 822c1000 D init_per_cpu__irq_stack_union >>> 822c1000 A __per_cpu_load >>> 822d3000 D init_per_cpu__gdt_page >>> >>> Got: >>> >>> 8279a600 D _edata >>> 8279b000 A __vvar_page >>> 8279c000 A __init_begin >>> 8279c000 D init_per_cpu__irq_stack_union >>> 8279c000 A __per_cpu_load >>> 8279e000 D __vvar_beginning_hack >>> 8279e080 0098 D vsyscall_gtod_data >>> 827ae000 D init_per_cpu__gdt_page >>> >>> This happens because __vvar_page and .vvar get different >>> addresses in arch/x86/kernel/vmlinux.lds.S: >>> >>> . = ALIGN(PAGE_SIZE); >>> __vvar_page = .; >>> >>> .vvar : AT(ADDR(.vvar) - LOAD_OFFSET) { >>> /* work around gold bug 13023 */ >>> __vvar_beginning_hack = .; >>> >>> Discard .dtors/.fini_array/.text.exit, since we don't call dtors. >>> Merge .text.startup into init text. >>> >>> Signed-off-by: Dmitry Vyukov >> >> Reviewed-by: Andrey Ryabinin > > > Who can take it to some tree? akpm tends to be the maintainer of last resort. But you probably need to resend the patch, because LKML was not in CC list. Also, please add stable tag: Cc: # v4.0+
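The fix described in the commit message can be sketched as a linker-script fragment. This is a simplified illustration of the idea (discard the destructor/exit sections the kernel never runs, and fold .text.startup in with the init text), not the exact change to include/asm-generic/vmlinux.lds.h:

```
/* Sketch: dtors are never called, so discard them instead of letting
 * them land after _edata and shift the vvar/percpu layout. */
/DISCARD/ : {
        *(.fini_array)
        *(.dtors)
        *(.text.exit)
}

/* .text.startup is emitted by gcc with --disable-initfini-array;
 * merge it with the rest of the init text. */
.init.text : {
        *(.text.startup)
        *(.init.text)
}
```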
Re: [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code
On Fri, Jul 01, 2016 at 09:15:55AM -0500, James Hartsock wrote: > On Fri, Jul 1, 2016 at 3:24 AM, Peter Zijlstra wrote: > > > On Fri, Jul 01, 2016 at 09:35:46AM +0200, Jiri Olsa wrote: > > > well this is issue our partner met in the setup, > > > and I'm not sure what was their motivation for that, > > > perhaps James could clarify in here.. > > > > > > I tried to make the 'scratch that itch' solution as > > > mentioned in earlier discussion. > > > > > > So IIUIC what you say is that it needs to be more generic > > > solution..? I'll continue staring at it then ;-) > > > > I just want to know what problem we're trying to solve.. > > > > Because it appears this is running 1 task each on a 'weird' subset of > > cpus and things going badly. If this really is the case, then teaching > > active balance to only move tasks to idle cpus or something should also > > cure things. > > > > Also, I'm curious why people set such weird masks. > > > > I think the original issue was reported/seen when using a straight range > of CPUs, but in that range they crossed numa nodes. Then, not knowing > what was triggering the issue and trying to reproduce it, we started > getting some crazy masks. > > The work-around the customer has been using is SCHED_RR, as it doesn't > have this balance-across-NUMA issue. But it is also the fact that > SCHED_RR doesn't have this issue that makes it "seem" like a defect in > SCHED_OTHER. I have shared with the customer that this is triggered by the > taskset spanning NUMA nodes, so they are aware of that. But if this is something > that is seen as a limitation of SCHED_OTHER and not reasonable to be > addressed upstream, I think it is at least something we should get > documented. But what exact usecase? A single task per cpu, or something else? Note that RR has different constraints than OTHER, but in both cases having skewed masks across a topology divide is unlikely to be good for performance. Esp. 
in the extreme case reported here, where one task is on an entirely different node than the rest of them, that task will cause cacheline transfers between the nodes slowing down itself as well as all the other tasks that have to pull it back in.
Re: [PATCH v4 09/29] fork: Add generic vmalloced stack support
On Sun, Jun 26, 2016 at 02:55:31PM -0700, Andy Lutomirski wrote: > If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with > vmalloc_node. > > grsecurity has had a similar feature (called > GRKERNSEC_KSTACKOVERFLOW) for a long time. > > Cc: Oleg Nesterov > Signed-off-by: Andy Lutomirski > --- > arch/Kconfig| 29 + > arch/ia64/include/asm/thread_info.h | 2 +- > include/linux/sched.h | 15 +++ > kernel/fork.c | 87 > + > 4 files changed, 113 insertions(+), 20 deletions(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index 15996290fed4..18a2c3a7b460 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -661,4 +661,33 @@ config ARCH_NO_COHERENT_DMA_MMAP > config CPU_NO_EFFICIENT_FFS > def_bool n > > +config HAVE_ARCH_VMAP_STACK > + def_bool n > + help > + An arch should select this symbol if it can support kernel stacks > + in vmalloc space. This means: > + > + - vmalloc space must be large enough to hold many kernel stacks. > + This may rule out many 32-bit architectures. > + > + - Stacks in vmalloc space need to work reliably. For example, if > + vmap page tables are created on demand, either this mechanism > + needs to work while the stack points to a virtual address with > + unpopulated page tables or arch code (switch_to and switch_mm, > + most likely) needs to ensure that the stack's page table entries > + are populated before running on a possibly unpopulated stack. > + > + - If the stack overflows into a guard page, something reasonable > + should happen. The definition of "reasonable" is flexible, but > + instantly rebooting without logging anything would be unfriendly. Nice, I wish more people would actually *explain* their Kconfig options properly. ... 
> diff --git a/kernel/fork.c b/kernel/fork.c > index 146c9840c079..06761de69360 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -158,19 +158,37 @@ void __weak arch_release_thread_stack(unsigned long > *stack) > * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a > * kmemcache based allocator. > */ > -# if THREAD_SIZE >= PAGE_SIZE > -static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, > - int node) > +# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK) > +static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int > node) > { > +#ifdef CONFIG_VMAP_STACK > + void *stack = __vmalloc_node_range( > + THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END, > + THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL, > + 0, node, __builtin_return_address(0)); Reformat: void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END, THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL, 0, node, __builtin_return_address(0)); > + > + /* > + * We can't call find_vm_area() in interrupt context, and > + * free_thread_info can be called in interrupt context, so cache free_thread_stack() ? > + * the vm_struct. > + */ > + if (stack) > + tsk->stack_vm_area = find_vm_area(stack); > + return stack; > +#else > struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP, > THREAD_SIZE_ORDER); > > return page ? 
page_address(page) : NULL; > +#endif > } > > -static inline void free_thread_stack(unsigned long *stack) > +static inline void free_thread_stack(struct task_struct *tsk) > { > - free_kmem_pages((unsigned long)stack, THREAD_SIZE_ORDER); > + if (task_stack_vm_area(tsk)) > + vfree(tsk->stack); > + else > + free_kmem_pages((unsigned long)tsk->stack, THREAD_SIZE_ORDER); > } > # else > static struct kmem_cache *thread_stack_cache; > @@ -181,9 +199,9 @@ static unsigned long *alloc_thread_stack_node(struct > task_struct *tsk, > return kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node); > } > > -static void free_thread_stack(unsigned long *stack) > +static void free_thread_stack(struct task_struct *tsk) > { > - kmem_cache_free(thread_stack_cache, stack); > + kmem_cache_free(thread_stack_cache, tsk->stack); > } > > void thread_stack_cache_init(void) > @@ -213,24 +231,49 @@ struct kmem_cache *vm_area_cachep; > /* SLAB cache for mm_struct structures (tsk->mm) */ > static struct kmem_cache *mm_cachep; > > -static void account_kernel_stack(unsigned long *stack, int account) > +static void account_kernel_stack(struct task_struct *tsk, int account) > { > - /* All stack pages are in the same zone and
Re: mfd: dm355evm_msp: Refactoring for add_child()
> FYI, code looks fine. Thanks for your acknowledgement. > ... but please take this opportunity to set up your submission > environment, i.e. using `git format-patch` and `git send-email`. Would you like to see any special settings mentioned in a section like "15) Explicit In-Reply-To headers" from the document "SubmittingPatches"? > Once you've done that, please re-submit this patch with my: Does the association of this patch with a somewhat relevant discussion really hinder the desired commit? Regards, Markus
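The workflow being requested above can be sketched as follows. This is a generic illustration (the repository, file name, and commit subject are made up for the example), not the exact submission:

```shell
# Create a throwaway repo so the example is self-contained.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email "dev@example.com"   # hypothetical identity
git config user.name "Dev"

echo "example change" > file.txt
git add file.txt
git commit -qm "mfd: dm355evm_msp: example commit"

# Produce a properly formatted patch for the last commit.
git format-patch -1 --stdout > example.patch

# git send-email would then mail it, e.g.:
#   git send-email --to=maintainer@example.com example.patch
grep -q "Subject: \[PATCH\]" example.patch && echo OK
```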
Re: [PATCH v1 06/11] KVM: x86: use hardware-compatible format for APIC ID register
2016-07-01 16:12+0200, Paolo Bonzini: > On 01/07/2016 15:11, Radim Krčmář wrote: >> +static void __kvm_apic_state_fixup(struct kvm_vcpu *vcpu, >> + struct kvm_lapic_state *s, bool set) >> +{ >> + if (apic_x2apic_mode(vcpu->arch.apic)) { >> + u32 *id = (u32 *)(s->regs + APIC_ID); >> + if (set) >> + *id >>= 24; >> + else >> + *id <<= 24; >> + } >>> > >>> > When setting, this should read from the apic_base being set. So I think >>> > you cannot use apic_x2apic_mode. >> >> apic_x2apic_mode uses apic_base MSR, so its value does not depend on >> LAPIC_SET/GET. I don't like the dependency much, but all combinations >> of values/orders should work well. > > You're right of course. However it should be documented in the > KVM_GET_LAPIC/KVM_SET_LAPIC ioctl docs that KVM_SET_MSR for apic_base > should be performed first. Ok, that will make our life easier. > Should kvm_lapic_set_base change the value of the ID register when > x2apic mode is left just like we do for entering x2apic mode? Yes, the patch does it. The only possible change from x2APIC mode is to hardware disabled mode and re-enabled xAPIC starts with initial APIC ID in upper 8 bits. >(Hint: we > want kvm-unit-tests for this). :) Something like this? enable_xapic() id = apic_id() set_apic_id(id+1) // ? enable_x2apic() id == apic_id() & 0xff disable_apic() enable_xapic() id == apic_id()