Re: [PATCH v2 3/3] drivers core: allow id match override when manually binding driver

2016-07-01 Thread Michal Suchanek
On 1 July 2016 at 17:00, Mark Brown  wrote:
> On Fri, Jul 01, 2016 at 10:58:34AM +0200, Michal Suchanek wrote:
>> On 1 July 2016 at 10:25, Mark Brown  wrote:
>
>> > It's been repeatedly suggested to you that the tooling for this stuff
>> > could use some work.  Please go and put some effort into that rather
>> > than continuing this thread which is accomplishing nothing.
>
>> You completely miss the point. No tooling will make people reconfigure
>> the kernel when the configuration in fact stays the same.
>
>> Sure the tooling does need work. And it would help getting the cases
>> when the tooling is NOT needed out of the way.
>
> I understand the problem perfectly, no amount of repeating yourself is
> going to change the problems that the bodge you are trying to force in
> creates for other users and the maintainability of the system.

Can you please specify what problems this patch creates for other users
and for maintainability?

Without stating the problems of this solution clearly or proposing an
alternative workable solution, there is not much that can be done.

Thanks

Michal


[PATCH 01/31] mm, vmstat: add infrastructure for per-node vmstats

2016-07-01 Thread Mel Gorman
Series: "Move LRU page reclaim from zones to nodes v7"

This is the latest version of a series that moves LRUs from the zones to
the node. It is based upon 4.6-rc3 plus the page allocator optimisation
series. Conceptually, this is simple but there are a lot of details. Some
of the broad motivations for this are:

1. The residency of a page partially depends on what zone the page was
   allocated from. The fair zone allocation policy combats this, but it is
   only a partial solution that introduces overhead in the page allocator
   paths.

2. Currently, reclaim on node 0 behaves slightly differently to reclaim on
   node 1. For example, direct reclaim scans in zonelist order and reclaims
   even if the zone is over the high watermark, regardless of the age of
   pages in that LRU, while kswapd starts reclaim on the highest unbalanced
   zone. A difference in the distribution of file/anon pages due to when
   they were allocated can therefore result in a difference in aging. While
   the fair zone allocation policy mitigates some of the problems here, the
   page reclaim results on a multi-zone node will always be different to a
   single-zone node.

3. kswapd and the page allocator scan zones in opposite orders to avoid
   interfering with each other. In the ideal case this stops the page
   allocator from immediately reusing pages that were freed very recently,
   but it is sensitive to timing. When kswapd is reclaiming from the lower
   zones all is well, but while the highest zone is being rebalanced, the
   page allocator and kswapd interfere with each other. It is worse if the
   highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact
   relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have lower highmem:lowmem
ratios than the ones we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48 core
NUMA machine. The UMA results are presented in most cases as the NUMA machine
behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested up to order-4, but only orders 0 and 1 are
shown as the other orders were comparable.

   4.7.0-rc3  4.7.0-rc3
  mmotm-20160615  nodelru-v7r17
Min  total-odr0-1   485.00 (  0.00%)   462.00 (  4.74%)
Min  total-odr0-2   354.00 (  0.00%)   341.00 (  3.67%)
Min  total-odr0-4   285.00 (  0.00%)   277.00 (  2.81%)
Min  total-odr0-8   249.00 (  0.00%)   240.00 (  3.61%)
Min  total-odr0-16  230.00 (  0.00%)   224.00 (  2.61%)
Min  total-odr0-32  222.00 (  0.00%)   215.00 (  3.15%)
Min  total-odr0-64  216.00 (  0.00%)   210.00 (  2.78%)
Min  total-odr0-128 214.00 (  0.00%)   208.00 (  2.80%)
Min  total-odr0-256 248.00 (  0.00%)   233.00 (  6.05%)
Min  total-odr0-512 277.00 (  0.00%)   270.00 (  2.53%)
Min  total-odr0-1024    294.00 (  0.00%)   284.00 (  3.40%)
Min  total-odr0-2048    308.00 (  0.00%)   298.00 (  3.25%)
Min  total-odr0-4096    318.00 (  0.00%)   307.00 (  3.46%)
Min  total-odr0-8192    322.00 (  0.00%)   308.00 (  4.35%)
Min  total-odr0-16384   324.00 (  0.00%)   309.00 (  4.63%)
Min  total-odr1-1   729.00 (  0.00%)   686.00 (  5.90%)
Min  total-odr1-2   533.00 (  0.00%)   520.00 (  2.44%)
Min  total-odr1-4   434.00 (  0.00%)   415.00 (  4.38%)
Min  total-odr1-8   390.00 (  0.00%)   364.00 (  6.67%)
Min  total-odr1-16  359.00 (  0.00%)   335.00 (  6.69%)
Min  total-odr1-32  356.00 (  0.00%)   327.00 (  8.15%)
Min  total-odr1-64  356.00 (  0.00%)   321.00 (  9.83%)
Min  total-odr1-128 356.00 (  0.00%)   333.00 (  6.46%)
Min  total-odr1-256 354.00 (  0.00%)   337.00 (  4.80%)
Min  total-odr1-512 366.00 (  0.00%) 

[PATCH 04/31] mm, vmscan: begin reclaiming pages on a per-node basis

2016-07-01 Thread Mel Gorman
This patch makes reclaim decisions on a per-node basis.  A reclaimer knows
what zone is required by the allocation request and skips pages from
higher zones.  In many cases this will be ok because it's a GFP_HIGHMEM
request of some description.  On 64-bit, ZONE_DMA32 requests will cause
some problems but 32-bit devices on 64-bit platforms are increasingly
rare.  Historically it would have been a major problem on 32-bit with big
Highmem:Lowmem ratios but such configurations are also now rare and even
where they exist, they are not encouraged.  If it really becomes a
problem, it'll manifest as very low reclaim efficiencies.

Link: 
http://lkml.kernel.org/r/1466518566-30034-5-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Cc: Johannes Weiner 
Cc: Vlastimil Babka 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
---
 mm/vmscan.c | 79 ++---
 1 file changed, 55 insertions(+), 24 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 86a523a761c9..766b36bec829 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -84,6 +84,9 @@ struct scan_control {
/* Scan (total_size >> priority) pages at once */
int priority;
 
+   /* The highest zone to isolate pages for reclaim from */
+   enum zone_type reclaim_idx;
+
unsigned int may_writepage:1;
 
/* Can mapped pages be reclaimed? */
@@ -1392,6 +1395,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
unsigned long nr_taken = 0;
unsigned long nr_zone_taken[MAX_NR_ZONES] = { 0 };
unsigned long scan, nr_pages;
+   LIST_HEAD(pages_skipped);
 
for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
!list_empty(src); scan++) {
@@ -1402,6 +1406,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 
VM_BUG_ON_PAGE(!PageLRU(page), page);
 
+   if (page_zonenum(page) > sc->reclaim_idx) {
+   list_move(&page->lru, &pages_skipped);
+   continue;
+   }
+
switch (__isolate_lru_page(page, mode)) {
case 0:
nr_pages = hpage_nr_pages(page);
@@ -1420,6 +1429,15 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
}
}
 
+   /*
+* Splice any skipped pages to the start of the LRU list. Note that
+* this disrupts the LRU order when reclaiming for lower zones but
+* we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX
+* scanning would soon rescan the same pages to skip and put the
+* system at risk of premature OOM.
+*/
+   if (!list_empty(&pages_skipped))
+   list_splice(&pages_skipped, src);
*nr_scanned = scan;
trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
nr_taken, mode, is_file_lru(lru));
@@ -1589,7 +1607,7 @@ static int current_may_throttle(void)
 }
 
 /*
- * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
+ * shrink_inactive_list() is a helper for shrink_node().  It returns the number
  * of reclaimed pages
  */
 static noinline_for_stack unsigned long
@@ -2401,12 +2419,13 @@ static inline bool should_continue_reclaim(struct zone *zone,
}
 }
 
-static bool shrink_zone(struct zone *zone, struct scan_control *sc,
-   bool is_classzone)
+static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
+   enum zone_type classzone_idx)
 {
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_reclaimed, nr_scanned;
bool reclaimable = false;
+   struct zone *zone = &pgdat->node_zones[classzone_idx];
 
do {
struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2438,7 +2457,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
shrink_zone_memcg(zone, memcg, sc, &lru_pages);
zone_lru_pages += lru_pages;
 
-   if (memcg && is_classzone)
+   if (!global_reclaim(sc))
shrink_slab(sc->gfp_mask, zone_to_nid(zone),
memcg, sc->nr_scanned - scanned,
lru_pages);
@@ -2469,7 +2488,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 * Shrink the slab caches in the same proportion that
 * the eligible LRU pages were scanned.
 */
-   if (global_reclaim(sc) && is_classzone)
+   if (global_reclaim(sc))
shrink_slab(sc->gfp_mask, zone_to_nid(zone), NULL,
sc->nr_scanned - nr_scanned,
zone_lru_pages);
@@ -2553,7 +2572,7 @@ static void 


[PATCH 11/31] mm: vmscan: do not reclaim from kswapd if there is any eligible zone

2016-07-01 Thread Mel Gorman
kswapd scans from highest to lowest for a zone that requires balancing.
This was necessary when reclaim was per-zone to fairly age pages on lower
zones.  Now that we are reclaiming on a per-node basis, any eligible zone
can be used and pages will still be aged fairly.  This patch avoids
reclaiming excessively unless buffer_heads are over the limit and it's
necessary to reclaim from a higher zone than requested by the waker of
kswapd to relieve low memory pressure.

[hillf...@alibaba-inc.com: Force kswapd reclaim no more than needed]
Link: 
http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Signed-off-by: Hillf Danton 
Acked-by: Vlastimil Babka 
---
 mm/vmscan.c | 56 
 1 file changed, 24 insertions(+), 32 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 911142d25de2..2f898ba2ee2e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3141,31 +3141,36 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
sc.nr_reclaimed = 0;
 
-   /* Scan from the highest requested zone to dma */
-   for (i = classzone_idx; i >= 0; i--) {
-   zone = pgdat->node_zones + i;
-   if (!populated_zone(zone))
-   continue;
-
-   /*
-* If the number of buffer_heads in the machine
-* exceeds the maximum allowed level and this node
-* has a highmem zone, force kswapd to reclaim from
-* it to relieve lowmem pressure.
-*/
-   if (buffer_heads_over_limit && is_highmem_idx(i)) {
-   classzone_idx = i;
-   break;
-   }
+   /*
+* If the number of buffer_heads in the machine exceeds the
+* maximum allowed level then reclaim from all zones. This is
+* not specific to highmem as highmem may not exist but it is
+* expected that buffer_heads are stripped in writeback.
+*/
+   if (buffer_heads_over_limit) {
+   for (i = MAX_NR_ZONES - 1; i >= 0; i--) {
+   zone = pgdat->node_zones + i;
+   if (!populated_zone(zone))
+   continue;
 
-   if (!zone_balanced(zone, order, 0)) {
classzone_idx = i;
break;
}
}
 
-   if (i < 0)
-   goto out;
+   /*
+* Only reclaim if there are no eligible zones. Check from
+* high to low zone to avoid prematurely clearing pgdat
+* congested state.
+*/
+   for (i = classzone_idx; i >= 0; i--) {
+   zone = pgdat->node_zones + i;
+   if (!populated_zone(zone))
+   continue;
+
+   if (zone_balanced(zone, sc.order, classzone_idx))
+   goto out;
+   }
 
/*
 * Do some background aging of the anon list, to give
@@ -3211,19 +3216,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
break;
 
/*
-* Stop reclaiming if any eligible zone is balanced and clear
-* node writeback or congested.
-*/
-   for (i = 0; i <= classzone_idx; i++) {
-   zone = pgdat->node_zones + i;
-   if (!populated_zone(zone))
-   continue;
-
-   if (zone_balanced(zone, sc.order, classzone_idx))
-   goto out;
-   }
-
-   /*
 * Raise priority if scanning rate is too low or there was no
 * progress in reclaiming pages
 */
-- 
2.6.4



[PATCH 06/31] mm, vmscan: make kswapd reclaim in terms of nodes

2016-07-01 Thread Mel Gorman
Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
thinking of reclaim in terms of nodes but kswapd is still zone-centric. This
patch gets rid of many of the node-based versus zone-based decisions.

o A node is considered balanced when any eligible lower zone is balanced.
  This eliminates one class of age-inversion problem, because we avoid
  reclaiming a newer page just because it's in the wrong zone.
o pgdat_balanced disappears because we now only care about one zone being
  balanced.
o Some anomalies related to writeback and congestion tracking being based on
  zones disappear.
o kswapd no longer has to take care to reclaim zones in the reverse order
  that the page allocator uses.
o Most importantly of all, reclaim from node 0 with multiple zones will
  have similar aging and reclaiming characteristics as every
  other node.

Link: 
http://lkml.kernel.org/r/1466518566-30034-7-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
---
 mm/vmscan.c | 292 +---
 1 file changed, 101 insertions(+), 191 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c6e61dae382b..7b382b90b145 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2980,7 +2980,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 }
 #endif
 
-static void age_active_anon(struct zone *zone, struct scan_control *sc)
+static void age_active_anon(struct pglist_data *pgdat,
+   struct zone *zone, struct scan_control *sc)
 {
struct mem_cgroup *memcg;
 
@@ -2999,85 +3000,15 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order, bool highorder,
+static bool zone_balanced(struct zone *zone, int order,
unsigned long balance_gap, int classzone_idx)
 {
unsigned long mark = high_wmark_pages(zone) + balance_gap;
 
-   /*
-* When checking from pgdat_balanced(), kswapd should stop and sleep
-* when it reaches the high order-0 watermark and let kcompactd take
-* over. Other callers such as wakeup_kswapd() want to determine the
-* true high-order watermark.
-*/
-   if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
-   mark += (1UL << order);
-   order = 0;
-   }
-
return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
 
 /*
- * pgdat_balanced() is used when checking if a node is balanced.
- *
- * For order-0, all zones must be balanced!
- *
- * For high-order allocations only zones that meet watermarks and are in a
- * zone allowed by the callers classzone_idx are added to balanced_pages. The
- * total of balanced pages must be at least 25% of the zones allowed by
- * classzone_idx for the node to be considered balanced. Forcing all zones to
- * be balanced for high orders can cause excessive reclaim when there are
- * imbalanced zones.
- * The choice of 25% is due to
- *   o a 16M DMA zone that is balanced will not balance a zone on any
- * reasonable sized machine
- *   o On all other machines, the top zone must be at least a reasonable
- * percentage of the middle zones. For example, on 32-bit x86, highmem
- * would need to be at least 256M for it to balance a whole node.
- * Similarly, on x86-64 the Normal zone would need to be at least 1G
- * to balance a node on its own. These seemed like reasonable ratios.
- */
-static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
-{
-   unsigned long managed_pages = 0;
-   unsigned long balanced_pages = 0;
-   int i;
-
-   /* Check the watermark levels */
-   for (i = 0; i <= classzone_idx; i++) {
-   struct zone *zone = pgdat->node_zones + i;
-
-   if (!populated_zone(zone))
-   continue;
-
-   managed_pages += zone->managed_pages;
-
-   /*
-* A special case here:
-*
-* balance_pgdat() skips over all_unreclaimable after
-* DEF_PRIORITY. Effectively, it considers them balanced so
-* they must be considered balanced here as well!
-*/
-   if (!pgdat_reclaimable(zone->zone_pgdat)) {
-   balanced_pages += zone->managed_pages;
-   continue;
-   }
-
-   if (zone_balanced(zone, order, false, 0, i))
-   balanced_pages += zone->managed_pages;
-   else if (!order)
-   return false;
-   }
-
-   if (order)
-   return balanced_pages >= (managed_pages >> 2);
-   else
-  



[PATCH 12/31] mm, vmscan: make shrink_node decisions more node-centric

2016-07-01 Thread Mel Gorman
Earlier patches focused on having direct reclaim and kswapd use data that
is node-centric for reclaiming but shrink_node() itself still uses too
much zone information.  This patch removes unnecessary zone-based
information with the most important decision being whether to continue
reclaim or not.  Some memcg APIs are adjusted as a result even though
memcg itself still uses some zone information.

Signed-off-by: Mel Gorman 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
---
 include/linux/memcontrol.h | 19 
 include/linux/mmzone.h |  4 ++--
 include/linux/swap.h   |  2 +-
 mm/memcontrol.c|  4 ++--
 mm/page_alloc.c|  2 +-
 mm/vmscan.c| 57 ++
 mm/workingset.c|  6 ++---
 7 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1927dcb6921e..48b43c709ed7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -325,22 +325,23 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
 }
 
 /**
- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
+ * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
+ * @node: node of the wanted lruvec
  * @zone: zone of the wanted lruvec
  * @memcg: memcg of the wanted lruvec
  *
- * Returns the lru list vector holding pages for the given @zone and
- * @mem.  This can be the global zone lruvec, if the memory controller
+ * Returns the lru list vector holding pages for a given @node or a given
+ * @memcg and @zone. This can be the node lruvec, if the memory controller
  * is disabled.
  */
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-   struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+   struct zone *zone, struct mem_cgroup *memcg)
 {
struct mem_cgroup_per_zone *mz;
struct lruvec *lruvec;
 
if (mem_cgroup_disabled()) {
-   lruvec = zone_lruvec(zone);
+   lruvec = node_lruvec(pgdat);
goto out;
}
 
@@ -610,10 +611,10 @@ static inline void mem_cgroup_migrate(struct page *old, struct page *new)
 {
 }
 
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-   struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+   struct zone *zone, struct mem_cgroup *memcg)
 {
-   return zone_lruvec(zone);
+   return node_lruvec(pgdat);
 }
 
 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index eb74e63df5cf..f88cbbb476c8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -739,9 +739,9 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
return &zone->zone_pgdat->lru_lock;
 }
 
-static inline struct lruvec *zone_lruvec(struct zone *zone)
+static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
-   return &zone->zone_pgdat->lruvec;
+   return &pgdat->lruvec;
 }
 
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 916e2eddecd6..0ad616d7c381 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,7 +316,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
  unsigned long nr_pages,
  gfp_t gfp_mask,
  bool may_swap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
struct zone *zone,
unsigned long *nr_scanned);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50c86ad121bc..c9ebec98e92a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1432,8 +1432,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
}
continue;
}
-   total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-zone, &nr_scanned);
+   total += mem_cgroup_shrink_node(victim, gfp_mask, false,
+   zone, &nr_scanned);
*total_scanned += nr_scanned;
if (!soft_limit_excess(root_memcg))
break;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 

[PATCH 07/31] mm, vmscan: remove balance gap

2016-07-01 Thread Mel Gorman
The balance gap was introduced to apply equal pressure to all zones when
reclaiming for a higher zone.  With node-based LRU, the need for the
balance gap is removed and the code is dead so remove it.

[vba...@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
Link: 
http://lkml.kernel.org/r/1466518566-30034-8-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
---
 include/linux/swap.h |  9 -
 mm/vmscan.c  | 19 ---
 2 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c82f916008b7..916e2eddecd6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -157,15 +157,6 @@ enum {
 #define SWAP_CLUSTER_MAX 32UL
 #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
-/*
- * Ratio between zone->managed_pages and the "gap" that above the per-zone
- * "high_wmark". While balancing nodes, We allow kswapd to shrink zones that
- * do not meet the (high_wmark + gap) watermark, even which already met the
- * high_wmark, in order to provide better per-zone lru behavior. We are ok to
- * spend not more than 1% of the memory for this zone balancing "gap".
- */
-#define KSWAPD_ZONE_BALANCE_GAP_RATIO 100
-
 #define SWAP_MAP_MAX   0x3e/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD   0x3f/* Note pageblock is bad, in first swap_map */
 #define SWAP_HAS_CACHE 0x40/* Flag page is cached, in first swap_map */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b382b90b145..a52167eabc96 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2518,7 +2518,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc,
  */
 static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
 {
-   unsigned long balance_gap, watermark;
+   unsigned long watermark;
bool watermark_ok;
 
/*
@@ -2527,9 +2527,7 @@ static inline bool compaction_ready(struct zone *zone, 
int order, int classzone_
 * there is a buffer of free pages available to give compaction
 * a reasonable chance of completing and allocating the page
 */
-   balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
-   zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
-   watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
+   watermark = high_wmark_pages(zone) + (2UL << order);
watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 
classzone_idx);
 
/*
@@ -3000,10 +2998,9 @@ static void age_active_anon(struct pglist_data *pgdat,
} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order,
-   unsigned long balance_gap, int classzone_idx)
+static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 {
-   unsigned long mark = high_wmark_pages(zone) + balance_gap;
+   unsigned long mark = high_wmark_pages(zone);
 
return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
@@ -3045,7 +3042,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int 
order, long remaining,
if (!populated_zone(zone))
continue;
 
-   if (zone_balanced(zone, order, 0, classzone_idx))
+   if (zone_balanced(zone, order, classzone_idx))
return true;
}
 
@@ -3148,7 +3145,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
break;
}
 
-   if (!zone_balanced(zone, order, 0, 0)) {
+   if (!zone_balanced(zone, order, 0)) {
classzone_idx = i;
break;
} else {
@@ -3216,7 +3213,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
if (!populated_zone(zone))
continue;
 
-   if (zone_balanced(zone, sc.order, 0, classzone_idx)) {
+   if (zone_balanced(zone, sc.order, classzone_idx)) {
clear_bit(PGDAT_CONGESTED, &pgdat->flags);
clear_bit(PGDAT_DIRTY, &pgdat->flags);
goto out;
@@ -3427,7 +3424,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum 
zone_type classzone_idx)
}
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
-   if (zone_balanced(zone, order, 0, 0))
+   if (zone_balanced(zone, order, 0))
return;
 
trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
-- 
2.6.4



[PATCH 09/31] mm, vmscan: by default have direct reclaim only shrink once per node

2016-07-01 Thread Mel Gorman
Direct reclaim iterates over all zones in the zonelist and shrinks them
but this is in conflict with node-based reclaim.  In the default case,
only shrink once per node.

Link: 
http://lkml.kernel.org/r/1466518566-30034-10-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
---
 mm/vmscan.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b524d3b72527..34656173a670 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2552,14 +2552,6 @@ static inline bool compaction_ready(struct zone *zone, 
int order, int classzone_
  * try to reclaim pages from zones which will satisfy the caller's allocation
  * request.
  *
- * We reclaim from a zone even if that zone is over high_wmark_pages(zone).
- * Because:
- * a) The caller may be trying to free *extra* pages to satisfy a higher-order
- *allocation or
- * b) The target zone may be at high_wmark_pages(zone) but the lower zones
- *must go *over* high_wmark_pages(zone) to satisfy the `incremental min'
- *zone defense algorithm.
- *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
@@ -2571,6 +2563,7 @@ static void shrink_zones(struct zonelist *zonelist, 
struct scan_control *sc)
unsigned long nr_soft_scanned;
gfp_t orig_mask;
enum zone_type classzone_idx;
+   pg_data_t *last_pgdat = NULL;
 
/*
 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2600,6 +2593,16 @@ static void shrink_zones(struct zonelist *zonelist, 
struct scan_control *sc)
classzone_idx--;
 
/*
+* Shrink each node in the zonelist once. If the zonelist is
+* ordered by zone (not the default) then a node may be
+* shrunk multiple times but in that case the user prefers
+* lower zones being preserved
+*/
+   if (zone->zone_pgdat == last_pgdat)
+   continue;
+   last_pgdat = zone->zone_pgdat;
+
+   /*
 * Take care memory controller reclaiming has small influence
 * to global LRU.
 */
-- 
2.6.4



[PATCH 12/31] mm, vmscan: make shrink_node decisions more node-centric

2016-07-01 Thread Mel Gorman
Earlier patches focused on having direct reclaim and kswapd use data that
is node-centric for reclaiming but shrink_node() itself still uses too
much zone information.  This patch removes unnecessary zone-based
information with the most important decision being whether to continue
reclaim or not.  Some memcg APIs are adjusted as a result even though
memcg itself still uses some zone information.

Signed-off-by: Mel Gorman 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
---
 include/linux/memcontrol.h | 19 
 include/linux/mmzone.h |  4 ++--
 include/linux/swap.h   |  2 +-
 mm/memcontrol.c|  4 ++--
 mm/page_alloc.c|  2 +-
 mm/vmscan.c| 57 ++
 mm/workingset.c|  6 ++---
 7 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1927dcb6921e..48b43c709ed7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -325,22 +325,23 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct 
zone *zone)
 }
 
 /**
- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
+ * mem_cgroup_lruvec - get the lru list vector for a node or a memcg zone
+ * @node: node of the wanted lruvec
  * @zone: zone of the wanted lruvec
  * @memcg: memcg of the wanted lruvec
  *
- * Returns the lru list vector holding pages for the given @zone and
- * @mem.  This can be the global zone lruvec, if the memory controller
+ * Returns the lru list vector holding pages for a given @node or a given
+ * @memcg and @zone. This can be the node lruvec, if the memory controller
  * is disabled.
  */
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-   struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+   struct zone *zone, struct mem_cgroup *memcg)
 {
struct mem_cgroup_per_zone *mz;
struct lruvec *lruvec;
 
if (mem_cgroup_disabled()) {
-   lruvec = zone_lruvec(zone);
+   lruvec = node_lruvec(pgdat);
goto out;
}
 
@@ -610,10 +611,10 @@ static inline void mem_cgroup_migrate(struct page *old, 
struct page *new)
 {
 }
 
-static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
-   struct mem_cgroup *memcg)
+static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
+   struct zone *zone, struct mem_cgroup *memcg)
 {
-   return zone_lruvec(zone);
+   return node_lruvec(pgdat);
 }
 
 static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index eb74e63df5cf..f88cbbb476c8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -739,9 +739,9 @@ static inline spinlock_t *zone_lru_lock(struct zone *zone)
return &zone->zone_pgdat->lru_lock;
 }
 
-static inline struct lruvec *zone_lruvec(struct zone *zone)
+static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
-   return &zone->zone_pgdat->lruvec;
+   return &pgdat->lruvec;
 }
 
 static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 916e2eddecd6..0ad616d7c381 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,7 +316,7 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct 
mem_cgroup *memcg,
  unsigned long nr_pages,
  gfp_t gfp_mask,
  bool may_swap);
-extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
+extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
struct zone *zone,
unsigned long *nr_scanned);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50c86ad121bc..c9ebec98e92a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1432,8 +1432,8 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup 
*root_memcg,
}
continue;
}
-   total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
-        zone, &nr_scanned);
+   total += mem_cgroup_shrink_node(victim, gfp_mask, false,
+        zone, &nr_scanned);
*total_scanned += nr_scanned;
if (!soft_limit_excess(root_memcg))
break;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f58548139bf2..b76ea2527c09 100644
--- a/mm/page_alloc.c
+++ 


[PATCH 08/31] mm, vmscan: simplify the logic deciding whether kswapd sleeps

2016-07-01 Thread Mel Gorman
kswapd goes through some complex steps trying to figure out if it should
stay awake based on the classzone_idx and the requested order.  It is
unnecessarily complex and passes in an invalid classzone_idx to
balance_pgdat().  What matters most of all is whether a larger order has
been requested and whether kswapd successfully reclaimed at the previous
order.  This patch irons out the logic to check just that and the end
result is less headache inducing.

Link: 
http://lkml.kernel.org/r/1466518566-30034-9-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
---
 include/linux/mmzone.h |   5 ++-
 mm/memory_hotplug.c|   5 ++-
 mm/page_alloc.c|   2 +-
 mm/vmscan.c| 102 ++---
 4 files changed, 62 insertions(+), 52 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 258c20758e80..eb74e63df5cf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -667,8 +667,9 @@ typedef struct pglist_data {
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by
   mem_hotplug_begin/end() */
-   int kswapd_max_order;
-   enum zone_type classzone_idx;
+   int kswapd_order;
+   enum zone_type kswapd_classzone_idx;
+
 #ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_classzone_idx;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c5278360ca66..065140ecd081 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1209,9 +1209,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 
start)
 
arch_refresh_nodedata(nid, pgdat);
} else {
-   /* Reset the nr_zones and classzone_idx to 0 before reuse */
+   /* Reset the nr_zones, order and classzone_idx before reuse */
pgdat->nr_zones = 0;
-   pgdat->classzone_idx = 0;
+   pgdat->kswapd_order = 0;
+   pgdat->kswapd_classzone_idx = 0;
}
 
/* we can use NODE_DATA(nid) from here */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59e4463e5dce..f58548139bf2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6084,7 +6084,7 @@ void __paginginit free_area_init_node(int nid, unsigned 
long *zones_size,
unsigned long end_pfn = 0;
 
/* pg_data_t should be reset to zero when it's allocated */
-   WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);
+   WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
 
reset_deferred_meminit(pgdat);
pgdat->node_id = nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a52167eabc96..b524d3b72527 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2762,7 +2762,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
 
/* kswapd must be awake if processes are being throttled */
if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
-   pgdat->classzone_idx = min(pgdat->classzone_idx,
+   pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
(enum zone_type)ZONE_NORMAL);
wake_up_interruptible(>kswapd_wait);
}
@@ -3238,8 +3238,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int 
classzone_idx)
return sc.order;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
-   int classzone_idx, int balanced_classzone_idx)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int alloc_order, int 
reclaim_order,
+   int classzone_idx)
 {
long remaining = 0;
DEFINE_WAIT(wait);
@@ -3249,9 +3249,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
order,
 
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
+   /*
+* If kswapd has not been woken recently, then kswapd goes fully
+* to sleep. kcompactd may still need to wake if the original
+* request was high-order.
+*/
+   if (classzone_idx == -1) {
+   wakeup_kcompactd(pgdat, alloc_order, classzone_idx);
+   classzone_idx = MAX_NR_ZONES - 1;
+   goto full_sleep;
+   }
+
/* Try to sleep for a short interval */
-   if (prepare_kswapd_sleep(pgdat, order, remaining,
-   balanced_classzone_idx)) {
+   if (prepare_kswapd_sleep(pgdat, reclaim_order, remaining, 
classzone_idx)) {
/*
 * Compaction records what page blocks it recently failed to
 * isolate pages from and skips them in the future scanning.
@@ -3264,19 +3274,19 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int 
order,
 * We have freed the memory, now we should compact it to make
 

[PATCH 05/31] mm, vmscan: have kswapd only scan based on the highest requested zone

2016-07-01 Thread Mel Gorman
kswapd checks all eligible zones to see if they need balancing even if it
was woken for a lower zone.  This made sense when we reclaimed on a
per-zone basis because we wanted to shrink zones fairly to avoid
age-inversion problems.  Ideally this is completely unnecessary when
reclaiming on a per-node basis.  In theory, there may still be anomalies
when all requests are for lower zones and very old pages are preserved in
higher zones but this should be the exceptional case.

Link: 
http://lkml.kernel.org/r/1466518566-30034-6-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
---
 mm/vmscan.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 766b36bec829..c6e61dae382b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3209,11 +3209,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
 
sc.nr_reclaimed = 0;
 
-   /*
-* Scan in the highmem->dma direction for the highest
-* zone which needs scanning
-*/
-   for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+   /* Scan from the highest requested zone to dma */
+   for (i = classzone_idx; i >= 0; i--) {
struct zone *zone = pgdat->node_zones + i;
 
if (!populated_zone(zone))
-- 
2.6.4



[PATCH 02/31] mm, vmscan: move lru_lock to the node

2016-07-01 Thread Mel Gorman
Node-based reclaim requires node-based LRUs and locking.  This is a
preparation patch that just moves the lru_lock to the node so later
patches are easier to review.  It is a mechanical change but note this
patch makes contention worse because the LRU lock is hotter and direct
reclaim and kswapd can contend on the same lock even when reclaiming from
different zones.

Link: 
http://lkml.kernel.org/r/1466518566-30034-3-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
---
 Documentation/cgroup-v1/memcg_test.txt |  4 +--
 Documentation/cgroup-v1/memory.txt |  4 +--
 include/linux/mm_types.h   |  2 +-
 include/linux/mmzone.h | 10 +--
 mm/compaction.c| 10 +++
 mm/filemap.c   |  4 +--
 mm/huge_memory.c   |  6 ++---
 mm/memcontrol.c|  6 ++---
 mm/mlock.c | 10 +++
 mm/page_alloc.c|  4 +--
 mm/page_idle.c |  4 +--
 mm/rmap.c  |  2 +-
 mm/swap.c  | 30 ++---
 mm/vmscan.c| 48 +-
 14 files changed, 75 insertions(+), 69 deletions(-)

diff --git a/Documentation/cgroup-v1/memcg_test.txt 
b/Documentation/cgroup-v1/memcg_test.txt
index 8870b0212150..78a8c2963b38 100644
--- a/Documentation/cgroup-v1/memcg_test.txt
+++ b/Documentation/cgroup-v1/memcg_test.txt
@@ -107,9 +107,9 @@ Under below explanation, we assume 
CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 Each memcg has its own private LRU. Now, its handling is under global
-   VM's control (means that it's handled under global zone->lru_lock).
+   VM's control (means that it's handled under global zone_lru_lock).
Almost all routines around memcg's LRU is called by global LRU's
-   list management functions under zone->lru_lock().
+   list management functions under zone_lru_lock().
 
A special function is mem_cgroup_isolate_pages(). This scans
memcg's private LRU and call __isolate_lru_page() to extract a page
diff --git a/Documentation/cgroup-v1/memory.txt 
b/Documentation/cgroup-v1/memory.txt
index b14abf217239..946e69103cdd 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -267,11 +267,11 @@ When oom event notifier is registered, event will be 
delivered.
Other lock order is following:
PG_locked.
mm->page_table_lock
-   zone->lru_lock
+   zone_lru_lock
  lock_page_cgroup.
   In many cases, just lock_page_cgroup() is called.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  zone->lru_lock, it has no lock of its own.
+  zone_lru_lock, it has no lock of its own.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e093e1d3285b..ca2ed9a6c8d8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -118,7 +118,7 @@ struct page {
 */
union {
struct list_head lru;   /* Pageout list, eg. active_list
-* protected by zone->lru_lock !
+* protected by zone_lru_lock !
 * Can be used as a generic list
 * by the page owner.
 */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 078ecb81e209..2d5087e3c034 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -93,7 +93,7 @@ struct free_area {
 struct pglist_data;
 
 /*
- * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
@@ -496,7 +496,6 @@ struct zone {
/* Write-intensive fields used by page reclaim */
 
/* Fields commonly accessed by the page reclaim scanner */
-   spinlock_t  lru_lock;
struct lruvec   lruvec;
 
/*
@@ -690,6 +689,9 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
 #endif
+   /* Write-intensive fields used from the page allocator */
+   ZONE_PADDING(_pad1_)
+   spinlock_t  lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
@@ -721,6 +723,10 @@ typedef struct pglist_data {
 
 

[PATCH 10/31] mm, vmscan: remove duplicate logic clearing node congestion and dirty state

2016-07-01 Thread Mel Gorman
Reclaim may stall if there is too much dirty or congested data on a node.
This was previously based on zone flags and the logic for clearing the
flags is in two places.  As congestion/dirty tracking is now tracked on a
per-node basis, we can remove some duplicate logic.

Link: 
http://lkml.kernel.org/r/1466518566-30034-11-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Cc: Johannes Weiner 
Cc: Vlastimil Babka 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
---
 mm/vmscan.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 34656173a670..911142d25de2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3005,7 +3005,17 @@ static bool zone_balanced(struct zone *zone, int order, 
int classzone_idx)
 {
unsigned long mark = high_wmark_pages(zone);
 
-   return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
+   if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
+   return false;
+
+   /*
+* If any eligible zone is balanced then the node is not considered
+* to be congested or dirty
+*/
+   clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
+   clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
+
+   return true;
 }
 
 /*
@@ -3151,13 +3161,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
if (!zone_balanced(zone, order, 0)) {
classzone_idx = i;
break;
-   } else {
-   /*
-* If any eligible zone is balanced then the
-* node is not considered congested or dirty.
-*/
-   clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
-   clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
}
}
 
@@ -3216,11 +3219,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
if (!populated_zone(zone))
continue;
 
-   if (zone_balanced(zone, sc.order, classzone_idx)) {
-   clear_bit(PGDAT_CONGESTED, &pgdat->flags);
-   clear_bit(PGDAT_DIRTY, &pgdat->flags);
+   if (zone_balanced(zone, sc.order, classzone_idx))
goto out;
-   }
}
 
/*
-- 
2.6.4



b/Documentation/cgroup-v1/memcg_test.txt
index 8870b0212150..78a8c2963b38 100644
--- a/Documentation/cgroup-v1/memcg_test.txt
+++ b/Documentation/cgroup-v1/memcg_test.txt
@@ -107,9 +107,9 @@ Under below explanation, we assume 
CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 Each memcg has its own private LRU. Now, its handling is under global
-   VM's control (means that it's handled under global zone->lru_lock).
+   VM's control (means that it's handled under global zone_lru_lock).
Almost all routines around memcg's LRU is called by global LRU's
-   list management functions under zone->lru_lock().
+   list management functions under zone_lru_lock().
 
A special function is mem_cgroup_isolate_pages(). This scans
memcg's private LRU and call __isolate_lru_page() to extract a page
diff --git a/Documentation/cgroup-v1/memory.txt 
b/Documentation/cgroup-v1/memory.txt
index b14abf217239..946e69103cdd 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -267,11 +267,11 @@ When oom event notifier is registered, event will be 
delivered.
Other lock order is following:
PG_locked.
mm->page_table_lock
-   zone->lru_lock
+   zone_lru_lock
  lock_page_cgroup.
   In many cases, just lock_page_cgroup() is called.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  zone->lru_lock, it has no lock of its own.
+  zone_lru_lock, it has no lock of its own.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e093e1d3285b..ca2ed9a6c8d8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -118,7 +118,7 @@ struct page {
 */
union {
struct list_head lru;   /* Pageout list, eg. active_list
-* protected by zone->lru_lock !
+* protected by zone_lru_lock !
 * Can be used as a generic list
 * by the page owner.
 */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 078ecb81e209..2d5087e3c034 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -93,7 +93,7 @@ struct free_area {
 struct pglist_data;
 
 /*
- * zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
  * So add a wild amount of padding here to ensure that they fall into separate
  * cachelines.  There are very few zone structures in the machine, so space
  * consumption is not a concern here.
@@ -496,7 +496,6 @@ struct zone {
/* Write-intensive fields used by page reclaim */
 
/* Fields commonly accessed by the page reclaim scanner */
-   spinlock_t  lru_lock;
struct lruvec   lruvec;
 
/*
@@ -690,6 +689,9 @@ typedef struct pglist_data {
/* Number of pages migrated during the rate limiting time interval */
unsigned long numabalancing_migrate_nr_pages;
 #endif
+   /* Write-intensive fields used from the page allocator */
+   ZONE_PADDING(_pad1_)
+   spinlock_t  lru_lock;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
@@ -721,6 +723,10 @@ typedef struct pglist_data {
 
 #define node_start_pfn(nid)(NODE_DATA(nid)->node_start_pfn)
 #define node_end_pfn(nid) 

[PATCH 10/31] mm, vmscan: remove duplicate logic clearing node congestion and dirty state

2016-07-01 Thread Mel Gorman
Reclaim may stall if there is too much dirty or congested data on a node.
This was previously based on zone flags and the logic for clearing the
flags is in two places.  As congestion/dirty tracking is now tracked on a
per-node basis, we can remove some duplicate logic.

Link: 
http://lkml.kernel.org/r/1466518566-30034-11-git-send-email-mgor...@techsingularity.net
Signed-off-by: Mel Gorman 
Cc: Johannes Weiner 
Cc: Vlastimil Babka 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
---
 mm/vmscan.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 34656173a670..911142d25de2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3005,7 +3005,17 @@ static bool zone_balanced(struct zone *zone, int order, 
int classzone_idx)
 {
unsigned long mark = high_wmark_pages(zone);
 
-   return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
+   if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
+   return false;
+
+   /*
+* If any eligible zone is balanced then the node is not considered
+* to be congested or dirty
+*/
+   clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
+   clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
+
+   return true;
 }
 
 /*
@@ -3151,13 +3161,6 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
if (!zone_balanced(zone, order, 0)) {
classzone_idx = i;
break;
-   } else {
-   /*
-* If any eligible zone is balanced then the
-* node is not considered congested or dirty.
-*/
-   clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
-   clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
}
}
 
@@ -3216,11 +3219,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, 
int classzone_idx)
if (!populated_zone(zone))
continue;
 
-   if (zone_balanced(zone, sc.order, classzone_idx)) {
-   clear_bit(PGDAT_CONGESTED, &pgdat->flags);
-   clear_bit(PGDAT_DIRTY, &pgdat->flags);
+   if (zone_balanced(zone, sc.order, classzone_idx))
goto out;
-   }
}
 
/*
-- 
2.6.4



[PATCH 00/31] Move LRU page reclaim from zones to nodes v8

2016-07-01 Thread Mel Gorman
Previous releases double accounted LRU stats on the zone and the node
because it was required by should_reclaim_retry. The last patch in the
series removes the double accounting. It's not integrated with the series
as reviewers may not like the solution. If not, it can be safely dropped
without a major impact to the results.

Changelog since v7
o Rebase onto current mmots
o Avoid double accounting of stats in node and zone
o Kswapd will avoid more reclaim if an eligible zone is available
o Remove some duplications of sc->reclaim_idx and classzone_idx
o Print per-node stats in zoneinfo

Changelog since v6
o Correct reclaim_idx when direct reclaiming for memcg
o Also account LRU pages per zone for compaction/reclaim
o Add page_pgdat helper with more efficient lookup
o Init pgdat LRU lock only once
o Slight optimisation to wake_all_kswapds
o Always wake kcompactd when kswapd is going to sleep
o Rebase to mmotm as of June 15th, 2016

Changelog since v5
o Rebase and adjust to changes

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node that is based upon 4.7-rc4 with Andrew's tree applied. While this
is a current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details. Some
of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from.  This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages
   in that LRU. Kswapd on the other hand starts reclaim on the highest
   unbalanced zone. A difference in the distribution of file/anon pages,
   due to when they were allocated, can result in a difference in aging.
   While the fair zone allocation policy mitigates some of the problems
   here, the page reclaim results on a multi-zone node will always be
   different to a single-zone node.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other but it's sensitive to timing.  This
   mitigates the page allocator using pages that were allocated very recently
   in the ideal case but it's sensitive to timing. When kswapd is allocating
   from lower zones then it's great but during the rebalancing of the highest
   zone, the page allocator and kswapd interfere with each other. It's worse
   if the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact
   relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have lower highmem:lowmem
ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes. 

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.

pagealloc
-

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested up to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

   4.7.0-rc4  4.7.0-rc4
  mmotm-20160623 nodelru-v8
Min  total-odr0-1   490.00 (  0.00%)   463.00 (  5.51%)
Min  total-odr0-2   349.00 (  0.00%)   325.00 (  6.88%)
Min  total-odr0-4   288.00 (  0.00%)   272.00 (  5.56%)
Min  total-odr0-8   250.00 (  0.00%)   235.00 (  6.00%)
Min  total-odr0-16  234.00 (  0.00%)   222.00 (  5.13%)
Min  total-odr0-32  223.00 (  0.00%)   205.00 (  8.07%)
Min  total-odr0-64  217.00 (  0.00%)   202.00 (  6.91%)
Min  total-odr0-128 214.00 (  0.00%)   207.00 (  3.27%)
Min  total-odr0-256 242.00 (  0.00%)   242.00 (  0.00%)
Min  total-odr0-512 272.00 (  0.00%) 

[git pull] IOMMU Fixes for Linux v4.7-rc5

2016-07-01 Thread Joerg Roedel
Hi Linus,

The following changes since commit a4c34ff1c029e90e7d5f8dd8d29b0a93b31c3cb2:

  iommu/vt-d: Enable QI on all IOMMUs before setting root entry (2016-06-17 
11:29:48 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git 
tags/iommu-fixes-v4.7-rc5

for you to fetch changes up to 6082ee72e9d89e80a664418be06f47d728243e85:

  iommu/amd: Initialize devid variable before using it (2016-06-27 13:24:46 
+0200)


IOMMU Fixes for Linux v4.7-rc5

Three fixes:

* Fix use of smp_processor_id() in preemptible code in the IOVA
  allocation code. This got introduced with the scalability
  improvements in this release cycle.

* A VT-d fix for out-of-bounds access of the iommu->domains
  array. The bug showed during suspend/resume.

* AMD IOMMU fix to print the correct device id in the ACPI
  parsing code.


Chris Wilson (1):
  iommu/iova: Disable preemption around use of this_cpu_ptr()

Jan Niehusmann (1):
  iommu/vt-d: Fix overflow of iommu->domains array

Nicolas Iooss (1):
  iommu/amd: Initialize devid variable before using it

 drivers/iommu/amd_iommu_init.c | 2 +-
 drivers/iommu/intel-iommu.c| 2 +-
 drivers/iommu/iova.c   | 8 ++--
 3 files changed, 8 insertions(+), 4 deletions(-)

Please pull.

Thanks,

Joerg


signature.asc
Description: Digital signature


Re: [PATCH v1 03/11] KVM: x86: dynamic kvm_apic_map

2016-07-01 Thread Radim Krčmář
2016-07-01 17:06+0200, Paolo Bonzini:
> On 01/07/2016 16:38, Radim Krčmář wrote:
>   On the
> other hand, I suspect you need to bump KVM_MAX_VCPU_ID beyond its
> current default setting (which is equal to KVM_MAX_VCPUS), up to 511 or
> 1023.

Yes, thanks for pointing it out.  APIC ID 1023 is reasonably low and I'm
pretty sure that it cannot be surpassed with any topology of 288 VCPUs.
(The limit should be 543, with 257 cores per package.)


Re: [PATCH] [media] cec: add missing inline stubs

2016-07-01 Thread Arnd Bergmann
On Friday, July 1, 2016 5:22:32 PM CEST Hans Verkuil wrote:
> > diff --git a/drivers/media/platform/vivid/Kconfig 
> > b/drivers/media/platform/vivid/Kconfig
> > index 8e6918c5c87c..8e31146d079a 100644
> > --- a/drivers/media/platform/vivid/Kconfig
> > +++ b/drivers/media/platform/vivid/Kconfig
> > @@ -26,6 +26,7 @@ config VIDEO_VIVID
> >  config VIDEO_VIVID_CEC
> >   bool "Enable CEC emulation support"
> >   depends on VIDEO_VIVID && MEDIA_CEC
> > + depends on VIDEO_VIVID=m || MEDIA_CEC=y
> >   ---help---
> > When selected the vivid module will emulate the optional
> > HDMI CEC feature.
> > 
> > which is still not overly nice, but it manages to avoid the
> > IS_REACHABLE() check and it lets MEDIA_CEC be a module.
> 
> The only IS_REACHABLE is for the RC_CORE check, and that should remain.

I believe that is already taken care of by my earlier "[media] cec: add
RC_CORE dependency" patch, https://patchwork.linuxtv.org/patch/34892/
which seems to handle the dependency more gracefully (preventing nonsense
configurations rather than just not using RC_CORE).

> With my patch MEDIA_CEC can remain a module provided MEDIA_SUPPORT is also
> a module. All drivers depending on MEDIA_CEC also depend on MEDIA_SUPPORT,
> so that works.

To clarify, the problem with the option above is that VIDEO_VIVID_CEC
is a 'bool' option, and Kconfig lets that be turned on if both
VIDEO_VIVID and MEDIA_CEC are enabled, including the case where MEDIA_CEC
is a module and VIDEO_VIVID is not.

Your patch avoids that problem by making MEDIA_CEC a 'bool', my patch
above is an alternative by ensuring that VIDEO_VIVID_CEC cannot be
enabled if MEDIA_CEC is a module and VIDEO_VIVID is not.
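
Spelled out, the fragment from the patch reads (sketch; the comment marks the
case the extra line excludes):

```kconfig
config VIDEO_VIVID_CEC
	bool "Enable CEC emulation support"
	depends on VIDEO_VIVID && MEDIA_CEC
	# The line below rules out exactly the bad combination: VIDEO_VIVID
	# built in (=y) while MEDIA_CEC is a module, which would let built-in
	# code call into modular code.
	depends on VIDEO_VIVID=m || MEDIA_CEC=y
```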

Arnd


Re: [PATCH 6/6] x86: Fix stray A/D bit setting into non-present PTEs

2016-07-01 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Thu, Jun 30, 2016 at 9:39 PM, Dave Hansen  wrote:
>>
>> I think what you suggest will work if we don't consider A/D in
>> pte_none().  I think there are a bunch of code paths where we assume that
>> !pte_present() && !pte_none() means swap.
>
> Yeah, we would need to change pte_none() to mask off D/A, but I think
> that might be the only real change needed (other than making sure that
> we don't use the bits in the swap entries, I didn't look at that part
> at all)

It looks like __pte_to_swp_entry also needs to be changed to mask out
those bits when the swap code reads pte entries.  For all of the same
reasons as pte_none.

Eric


Re: [PATCH v3 1/4] drm/rockchip: vop: export line flag function

2016-07-01 Thread Sean Paul
On Fri, Jul 1, 2016 at 5:19 AM, Yakir Yang  wrote:
> VOP has an integrated hardware counter which indicates the exact display
> line that VOP is scanning. If we're interested in a specific line, we
> can write that line number to the VOP line_flag register, and VOP will
> then generate a line_flag interrupt for it.
>
> For example, the eDP PSR function is interested in the vertical blanking
> period, so the driver could set the line number to zero.
>
> This patch exports a symbol that allows other drivers to listen for the
> line flag event with a given timeout limit:
> -  rockchip_drm_wait_line_flag()
>
> Signed-off-by: Yakir Yang 
> ---
> Changes in v3:
> - Export the 'rockchip_drm_wait_line_flag' symbol, and document it.
> - Add 'line_flag_num_0' for RK3288/RK3036
> - Remove the notify for waiting line_flag event (Daniel)
>
> Changes in v2:
> - Introduce in v2, split VOP line flag changes out
>
>  drivers/gpu/drm/rockchip/rockchip_drm_drv.h |   3 +
>  drivers/gpu/drm/rockchip/rockchip_drm_vop.c | 103 
> 
>  drivers/gpu/drm/rockchip/rockchip_drm_vop.h |   3 +
>  drivers/gpu/drm/rockchip/rockchip_vop_reg.c |   4 ++
>  4 files changed, 113 insertions(+)
>
> diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_drv.h 
> b/drivers/gpu/drm/rockchip/rockchip_drm_drv.h
> index ea39329..239b830 100644
> --- a/drivers/gpu/drm/rockchip/rockchip_drm_drv.h
> +++ b/drivers/gpu/drm/rockchip/rockchip_drm_drv.h
> @@ -70,4 +70,7 @@ int rockchip_drm_dma_attach_device(struct drm_device 
> *drm_dev,
>struct device *dev);
>  void rockchip_drm_dma_detach_device(struct drm_device *drm_dev,
> struct device *dev);
> +int rockchip_drm_wait_line_flag(struct drm_crtc *crtc, unsigned int line_num,
> +   unsigned int mstimeout);
> +
>  #endif /* _ROCKCHIP_DRM_DRV_H_ */
> diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c 
> b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
> index c8a62a8..cd3cac5 100644
> --- a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
> +++ b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
> @@ -121,6 +121,8 @@ struct vop {
> /* protected by dev->event_lock */
> struct drm_pending_vblank_event *event;
>
> +   struct completion line_flag_completion;
> +
> const struct vop_data *data;
>
> uint32_t *regsbak;
> @@ -431,6 +433,59 @@ static void vop_dsp_hold_valid_irq_disable(struct vop 
> *vop)
> spin_unlock_irqrestore(&vop->irq_lock, flags);
>  }
>
> +/*
> + * (1) each frame starts at the start of the Vsync pulse which is signaled by
> + * the "FRAME_SYNC" interrupt.
> + * (2) the active data region of each frame ends at dsp_vact_end
> + * (3) we should program this same number (dsp_vact_end) into 
> dsp_line_frag_num,
> + *  to get "LINE_FLAG" interrupt at the end of the active on screen data.
> + *
> + * VOP_INTR_CTRL0.dsp_line_frag_num = VOP_DSP_VACT_ST_END.dsp_vact_end
> + * Interrupts
> + * LINE_FLAG ---+
> + * FRAME_SYNC + |
> + *| |
> + *v v
> + *| Vsync | Vbp |  Vactive  | Vfp |
> + *^ ^   ^ ^
> + *| |   | |
> + *| |   | |
> + * dsp_vs_end + |   | |   VOP_DSP_VTOTAL_VS_END
> + * dsp_vact_start --+   | |   VOP_DSP_VACT_ST_END
> + * dsp_vact_end + |   VOP_DSP_VACT_ST_END
> + * dsp_total -+   VOP_DSP_VTOTAL_VS_END
> + */
> +
> +static void vop_line_flag_irq_enable(struct vop *vop, int line_num)
> +{
> +   unsigned long flags;
> +
> +   if (WARN_ON(!vop->is_enabled))
> +   return;
> +
> +   spin_lock_irqsave(&vop->irq_lock, flags);
> +
> +   VOP_CTRL_SET(vop, line_flag_num_0, line_num);
> +   VOP_INTR_SET_TYPE(vop, enable, LINE_FLAG_INTR, 1);
> +   vop_cfg_done(vop);
> +
> +   spin_unlock_irqrestore(&vop->irq_lock, flags);
> +}
> +
> +static void vop_line_flag_irq_disable(struct vop *vop)
> +{
> +   unsigned long flags;
> +
> +   if (WARN_ON(!vop->is_enabled))
> +   return;
> +
> +   spin_lock_irqsave(&vop->irq_lock, flags);
> +
> +   VOP_INTR_SET_TYPE(vop, enable, LINE_FLAG_INTR, 0);
> +
> +   spin_unlock_irqrestore(&vop->irq_lock, flags);
> +}
> +
>  static void vop_enable(struct drm_crtc *crtc)
>  {
> struct vop *vop = to_vop(crtc);
> @@ -1157,6 +1212,13 @@ static irqreturn_t vop_isr(int irq, void *data)
> ret = IRQ_HANDLED;
> }
>
> +   if (active_irqs & LINE_FLAG_INTR) {
> +   if (!completion_done(&vop->line_flag_completion))
> +   complete(&vop->line_flag_completion);

I think there's potential to miss flags here if 

Re: [PATCH v3 1/4] drm/rockchip: vop: export line flag function

2016-07-01 Thread Sean Paul
On Fri, Jul 1, 2016 at 5:19 AM, Yakir Yang  wrote:
> VOP have integrated a hardware counter which indicate the exact display
> line that vop is scanning. And if we're interested in a specific line,
> we can set the line number to vop line_flag register, and then vop would
> generate a line_flag interrupt for it.
>
> For example eDP PSR function is interested in the vertical blanking
> period, then driver could set the line number to zero.
>
> This patch have exported a symbol that allow other driver to listen the
> line flag event with given timeout limit:
> -  rockchip_drm_wait_line_flag()
>
> Signed-off-by: Yakir Yang 
> ---
> Changes in v3:
> - Export the 'rockchip_drm_wait_line_flag' symbol, and document it.
> - Add 'line_flag_num_0' for RK3288/RK3036
> - Remove the notify for waiting line_flag event (Daniel)
>
> Changes in v2:
> - Introduce in v2, split VOP line flag changes out
>
>  drivers/gpu/drm/rockchip/rockchip_drm_drv.h |   3 +
>  drivers/gpu/drm/rockchip/rockchip_drm_vop.c | 103 
> 
>  drivers/gpu/drm/rockchip/rockchip_drm_vop.h |   3 +
>  drivers/gpu/drm/rockchip/rockchip_vop_reg.c |   4 ++
>  4 files changed, 113 insertions(+)
>
> diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_drv.h 
> b/drivers/gpu/drm/rockchip/rockchip_drm_drv.h
> index ea39329..239b830 100644
> --- a/drivers/gpu/drm/rockchip/rockchip_drm_drv.h
> +++ b/drivers/gpu/drm/rockchip/rockchip_drm_drv.h
> @@ -70,4 +70,7 @@ int rockchip_drm_dma_attach_device(struct drm_device 
> *drm_dev,
>struct device *dev);
>  void rockchip_drm_dma_detach_device(struct drm_device *drm_dev,
> struct device *dev);
> +int rockchip_drm_wait_line_flag(struct drm_crtc *crtc, unsigned int line_num,
> +   unsigned int mstimeout);
> +
>  #endif /* _ROCKCHIP_DRM_DRV_H_ */
> diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c 
> b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
> index c8a62a8..cd3cac5 100644
> --- a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
> +++ b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
> @@ -121,6 +121,8 @@ struct vop {
> /* protected by dev->event_lock */
> struct drm_pending_vblank_event *event;
>
> +   struct completion line_flag_completion;
> +
> const struct vop_data *data;
>
> uint32_t *regsbak;
> @@ -431,6 +433,59 @@ static void vop_dsp_hold_valid_irq_disable(struct vop 
> *vop)
> spin_unlock_irqrestore(&vop->irq_lock, flags);
>  }
>
> +/*
> + * (1) each frame starts at the start of the Vsync pulse which is signaled by
> + * the "FRAME_SYNC" interrupt.
> + * (2) the active data region of each frame ends at dsp_vact_end
> + * (3) we should program this same number (dsp_vact_end) into 
> dsp_line_frag_num,
> + *  to get "LINE_FLAG" interrupt at the end of the active on screen data.
> + *
> + * VOP_INTR_CTRL0.dsp_line_frag_num = VOP_DSP_VACT_ST_END.dsp_vact_end
> + * Interrupts
> + * LINE_FLAG ---+
> + * FRAME_SYNC + |
> + *| |
> + *v v
> + *| Vsync | Vbp |  Vactive  | Vfp |
> + *^ ^   ^ ^
> + *| |   | |
> + *| |   | |
> + * dsp_vs_end + |   | |   VOP_DSP_VTOTAL_VS_END
> + * dsp_vact_start --+   | |   VOP_DSP_VACT_ST_END
> + * dsp_vact_end + |   VOP_DSP_VACT_ST_END
> + * dsp_total -+   VOP_DSP_VTOTAL_VS_END
> + */
> +
> +static void vop_line_flag_irq_enable(struct vop *vop, int line_num)
> +{
> +   unsigned long flags;
> +
> +   if (WARN_ON(!vop->is_enabled))
> +   return;
> +
> +   spin_lock_irqsave(&vop->irq_lock, flags);
> +
> +   VOP_CTRL_SET(vop, line_flag_num_0, line_num);
> +   VOP_INTR_SET_TYPE(vop, enable, LINE_FLAG_INTR, 1);
> +   vop_cfg_done(vop);
> +
> +   spin_unlock_irqrestore(&vop->irq_lock, flags);
> +}
> +
> +static void vop_line_flag_irq_disable(struct vop *vop)
> +{
> +   unsigned long flags;
> +
> +   if (WARN_ON(!vop->is_enabled))
> +   return;
> +
> +   spin_lock_irqsave(&vop->irq_lock, flags);
> +
> +   VOP_INTR_SET_TYPE(vop, enable, LINE_FLAG_INTR, 0);
> +
> +   spin_unlock_irqrestore(&vop->irq_lock, flags);
> +}
> +
>  static void vop_enable(struct drm_crtc *crtc)
>  {
> struct vop *vop = to_vop(crtc);
> @@ -1157,6 +1212,13 @@ static irqreturn_t vop_isr(int irq, void *data)
> ret = IRQ_HANDLED;
> }
>
> +   if (active_irqs & LINE_FLAG_INTR) {
> +   if (!completion_done(&vop->line_flag_completion))
> +   complete(&vop->line_flag_completion);

I think there's potential to miss flags here if the timing is just
right because the 

RE: [RFC PATCH 0/2] net: ethernet: Add support for gmii2rgmii converter

2016-07-01 Thread Appana Durga Kedareswara Rao
Hi Andrew,

> On Fri, Jul 01, 2016 at 11:50:10AM +0530, Kedareswara rao Appana wrote:
> > This patch series does the following
> > ---> Add support for gmii2rgmii converter.
> 
> How generic is this gmii2rgmii IP block? Could it be used with any GMII and
> RGMII device?

This converter does GMII-to-RGMII conversion.
It can be used with any MAC that shares an MDIO bus with the external PHY
and this converter. The converter IP has been validated with MACB,
but it should work with any MAC that has a shared MDIO bus (i.e. a single
MDIO bus with multiple PHYs)...

This converter works like below

MACB <==> GMII2RGMII <==> RGMII_PHY

MACB MDIO <===> GMII2RGMII
MACB MDIO <===> RGMII_PHY

Using the MACB MDIO bus we can access both the converter and the external PHY.
We need to program the line speed of the converter at run time based on the
external PHY's negotiated speed.

MDIO interface is used to set operating speed (line speed) 

The converter has only one register (0x10) that we need to program to set the 
operating speed.
The composition of this register is similar to the IEEE standard 802.3 MDIO 
control register 0x0.

Please let me know if you are still unclear about how this converter works.

> 
> Should it be placed in drivers/net/phy, so making it easier to reuse?

OK, will move it to the drivers/net/phy folder in the next version...

> 
> Also, Russell King's phylink might be a more natural structure for this. It is
> hard to know when those patches will land, but it might be worth looking at.

Ok sure will take a look at this series once posted...

Regards,
Kedar.

> 
>Andrew


Re: [PATCH v6 00/10] acpi, clocksource: add GTDT driver and GTDT support in arm_arch_timer

2016-07-01 Thread Will Deacon
On Thu, Jun 30, 2016 at 09:48:02PM +0800, Hanjun Guo wrote:
> On 2016/6/30 21:27, Rafael J. Wysocki wrote:
> >On Thursday, June 30, 2016 10:10:02 AM Hanjun Guo wrote:
> >>GTDT is part of ACPI spec, drivers/acpi/ is for driver code of
> >>ACPI spec, I think it can stay in drivers/acpi/ from this point
> >>of view, am I right?
> >
> >The question is not "Can it?", but "Does it need to?".
> >
> >It is in the spec, but still there's only one architecture needing it.
> >
> >There is no way to test it on any other architecture and no reason to build 
> >it
> >for any other architecture, so why does it need to be located in 
> >drivers/acpi/ ?
> 
> I'm fine to move it to other places such as arch/arm64/kernel/, but I
> would like to ask ARM64 maintainer's suggestion for this.
> 
> Will, Catalin, what's your opinion on this?

We don't have any device-tree code for the architected timer under
arch/arm64, so I don't see why we should need anything for ACPI either.

Will


Re: [PATCH] [media] cec: add missing inline stubs

2016-07-01 Thread Hans Verkuil
On 07/01/2016 05:13 PM, Arnd Bergmann wrote:
> On Friday, July 1, 2016 4:35:09 PM CEST Hans Verkuil wrote:
>> On 07/01/2016 01:19 PM, Arnd Bergmann wrote:
>>> The linux/cec.h header file contains empty inline definitions for
>>> a subset of the API for the case in which CEC is not enabled,
>>> however we have driver that call other functions as well:
>>>
>>> drivers/media/i2c/adv7604.c: In function 'adv76xx_cec_tx_raw_status':
>>> drivers/media/i2c/adv7604.c:1956:3: error: implicit declaration of function 
>>> 'cec_transmit_done' [-Werror=implicit-function-declaration]
>>> drivers/media/i2c/adv7604.c: In function 'adv76xx_cec_isr':
>>> drivers/media/i2c/adv7604.c:2012:4: error: implicit declaration of function 
>>> 'cec_received_msg' [-Werror=implicit-function-declaration]
>>> drivers/media/i2c/adv7604.c: In function 'adv76xx_probe':
>>> drivers/media/i2c/adv7604.c:3482:20: error: implicit declaration of 
>>> function 'cec_allocate_adapter' [-Werror=implicit-function-declaration]
>>
>> I don't understand this. These calls are under #if 
>> IS_ENABLED(CONFIG_VIDEO_ADV7842_CEC),
>> and that should be 0 if MEDIA_CEC is not selected.
>>
>> Am I missing some weird config combination?
> 
> This must have been a build error I ran into before your patch, when I
> had this one applied locally instead:
> 
> diff --git a/include/media/cec.h b/include/media/cec.h
> index c462f9b44074..564a6a06bed7 100644
> --- a/include/media/cec.h
> +++ b/include/media/cec.h
> @@ -187,7 +187,7 @@ static inline bool cec_is_sink(const struct cec_adapter 
> *adap)
>   return adap->phys_addr == 0;
>  }
>  
> -#if IS_ENABLED(CONFIG_MEDIA_CEC)
> +#if IS_REACHABLE(CONFIG_MEDIA_CEC)
>  struct cec_adapter *cec_allocate_adapter(const struct cec_adap_ops *ops,
>   void *priv, const char *name, u32 caps, u8 available_las,
>   struct device *parent);
> 
> because that was hiding the declarations when the code could not
> reach it. With your newer patch that is not possible any more.
> 
> I also wasn't aware that each of these already had their own Kconfig
> symbols. Could we just do this instead of your patch then:
> 
> diff --git a/drivers/media/i2c/Kconfig b/drivers/media/i2c/Kconfig
> index ce9006e10a30..73e047220905 100644
> --- a/drivers/media/i2c/Kconfig
> +++ b/drivers/media/i2c/Kconfig
> @@ -222,6 +222,7 @@ config VIDEO_ADV7604
>  config VIDEO_ADV7604_CEC
>   bool "Enable Analog Devices ADV7604 CEC support"
>   depends on VIDEO_ADV7604 && MEDIA_CEC
> + depends on VIDEO_ADV7604=m || MEDIA_CEC=y
>   ---help---
> When selected the adv7604 will support the optional
> HDMI CEC feature.
> @@ -243,6 +244,7 @@ config VIDEO_ADV7842
>  config VIDEO_ADV7842_CEC
>   bool "Enable Analog Devices ADV7842 CEC support"
>   depends on VIDEO_ADV7842 && MEDIA_CEC
> + depends on VIDEO_ADV7842=m || MEDIA_CEC=y
>   ---help---
> When selected the adv7842 will support the optional
> HDMI CEC feature.
> @@ -475,6 +477,7 @@ config VIDEO_ADV7511
>  config VIDEO_ADV7511_CEC
>   bool "Enable Analog Devices ADV7511 CEC support"
>   depends on VIDEO_ADV7511 && MEDIA_CEC
> + depends on VIDEO_ADV7511=m || MEDIA_CEC=y
>   ---help---
> When selected the adv7511 will support the optional
> HDMI CEC feature.
> diff --git a/drivers/media/platform/vivid/Kconfig 
> b/drivers/media/platform/vivid/Kconfig
> index 8e6918c5c87c..8e31146d079a 100644
> --- a/drivers/media/platform/vivid/Kconfig
> +++ b/drivers/media/platform/vivid/Kconfig
> @@ -26,6 +26,7 @@ config VIDEO_VIVID
>  config VIDEO_VIVID_CEC
>   bool "Enable CEC emulation support"
>   depends on VIDEO_VIVID && MEDIA_CEC
> + depends on VIDEO_VIVID=m || MEDIA_CEC=y
>   ---help---
> When selected the vivid module will emulate the optional
> HDMI CEC feature.
> 
> which is still not overly nice, but it manages to avoid the
> IS_REACHABLE() check and it lets MEDIA_CEC be a module.

The only IS_REACHABLE is for the RC_CORE check, and that should remain.

With my patch MEDIA_CEC can remain a module provided MEDIA_SUPPORT is also
a module. All drivers depending on MEDIA_CEC also depend on MEDIA_SUPPORT,
so that works.

As I mentioned in a previous post, this will change in the future but while
it is in staging I prefer to keep it like this.

Regards,

Hans


RE: [RFC PATCH 1/2] net: ethernet: xilinx: Add gmii2rgmii converter support

2016-07-01 Thread Appana Durga Kedareswara Rao
Hi Florian,

Thanks for the review.

> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +static void xgmii2rgmii_fix_mac_speed(void *priv, unsigned int speed)
> > +{
> > +   struct gmii2rgmii *xphy = (struct xphy *)priv;
> 
> Why not pass struct xphy pointer directly?

Ok will fix in v2...

> 
> > +   struct phy_device *gmii2rgmii_phydev = xphy->gmii2rgmii_phy_dev;
> > +   u16 gmii2rgmii_reg = 0;
> > +
> > +   switch (speed) {
> > +   case 1000:
> > +   gmii2rgmii_reg |= XILINX_GMII2RGMII_SPEED1000;
> > +   break;
> > +   case 100:
> > +   gmii2rgmii_reg |= XILINX_GMII2RGMII_SPEED100;
> > +   break;
> > +   default:
> > +   return;
> > +   }
> > +
> > +   xphy->mdio_write(xphy->mii_bus, gmii2rgmii_phydev->mdio.addr,
> > +XILINX_GMII2RGMII_REG_NUM,
> > +gmii2rgmii_reg);
> > +}
> > +
> > +int gmii2rgmii_phyprobe(struct gmii2rgmii *xphy) {
> > +   struct device_node *phy_node;
> > +   struct phy_device *phydev;
> > +   struct device_node *np = (struct device_node *)xphy->platform_data;
> > +
> > +   phy_node = of_parse_phandle(np, "gmii2rgmii-phy-handle", 0);
> 
> Is that property documented in a binding document?

Will document. Will fix in v2...

> 
> > +   if (phy_node) {
> 
> Should not there be an else clause which does not assign
> xphy->fix_mac_speed in case this property lookup fails?

Will fix in v2...

> 
> > +   phydev = of_phy_attach(xphy->dev, phy_node, 0, 0);
> > +   if (!phydev) {
> > +   netdev_err(xphy->dev,
> > +  "%s: no gmii to rgmii converter found\n",
> > +  xphy->dev->name);
> > +   return -1;
> > +   }
> > +   xphy->gmii2rgmii_phy_dev = phydev;
> > +   }
> > +   xphy->fix_mac_speed = xgmii2rgmii_fix_mac_speed;
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL(gmii2rgmii_phyprobe);
> > +
> > +MODULE_DESCRIPTION("Xilinx GMII2RGMII converter driver");
> > +MODULE_LICENSE("GPL");
> > diff --git a/include/linux/xilinx_gmii2rgmii.h
> > b/include/linux/xilinx_gmii2rgmii.h
> > new file mode 100644
> > index 000..64e1659
> > --- /dev/null
> > +++ b/include/linux/xilinx_gmii2rgmii.h
> > @@ -0,0 +1,24 @@
> > +#ifndef _GMII2RGMII_H
> > +#define _GMII2RGMII_H
> > +
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define XILINX_GMII2RGMII_FULLDPLX BMCR_FULLDPLX
> > +#define XILINX_GMII2RGMII_SPEED1000   BMCR_SPEED1000
> > +#define XILINX_GMII2RGMII_SPEED100 BMCR_SPEED100
> > +#define XILINX_GMII2RGMII_REG_NUM  0x10
> 
> So the register semantics are fairly standard but not the register location, 
> have
> you considered writing a small PHY driver for this block?

I tried, but this PHY doesn't have any vendor/device IDs.
This converter won't fit into the PHY framework, as we need to program the
PHY control register with the external PHY's negotiated speed, as explained
in the other mail thread...

Regards,
Kedar. 


RE: [RFC PATCH 0/2] net: ethernet: Add support for gmii2rgmii converter

2016-07-01 Thread Appana Durga Kedareswara Rao
Hi Florian,

Thanks for the review...

> On 30/06/2016 23:20, Kedareswara rao Appana wrote:
> > This patch series does the following
> > ---> Add support for gmii2rgmii converter.
> > ---> Add support for gmii2rgmii converter in the macb driver.
> >
> > Earlier sent one RFC patch https://patchwork.kernel.org/patch/9186615/
> > which includes converter related stuff also in macb driver.
> > This patch series fixes this issue.
> 
> This still seems very ad hoc and not completely explained. Can you clarify how
> the gmii2rgmii converter gets used?

Sorry, I should have explained it clearly in the patch.
Will fix it in v2.

This converter works like below

MACB <==> GMII2RGMII <==> RGMII_PHY

MACB MDIO <===> GMII2RGMII
MACB MDIO <===> RGMII_PHY

Using the MACB MDIO bus we can access both the converter and the external PHY.
We need to program the line speed of the converter at run time based on the
external PHY's negotiated speed.

MDIO interface is used to set operating speed (line speed) 

The converter has only one register (0x10) that we need to program to set the 
operating speed.
The composition of this register is similar to the IEEE standard 802.3 MDIO 
control register 0x0.

Please let me know if you are still unclear about how this converter works.

IP data sheet is available here @ 
http://www.xilinx.com/support/documentation/ip_documentation/gmii_to_rgmii/v3_0/pg160-gmii-to-rgmii.pdf
 

   
> 
> Is the expectation that a MACB Ethernet adapter will be connected to a RGMII
> PHY like this:
> 
> MACB <=> GMII2RGMII <=> RGMII PHY
> MACB MDIO <===> RGMII_PHY
> 
> and still have the ability to control both the PHY device's MDIO registers 
> and the
> GMII2RGMII converter and we need to make sure both have matching settings,
> or is it something like this:
> 
> MACB <=> GMII2RGMII <=> RGMII PHY
> MACB MDIO unconnected
> 
> and we lose the ability to query the PHY via MDIO?
> 
> As Nicolas pointed out, providing a binding documentation update may help
> reviewers understand what is being accomplished here.

Sure will fix in v2.

Regards,
Kedar.



Re: [PATCH v4] vmlinux.lds: account for destructor sections

2016-07-01 Thread Dmitry Vyukov
On Fri, Jul 1, 2016 at 5:19 PM, Dmitry Vyukov  wrote:
> If CONFIG_KASAN is enabled and gcc is configured with
> --disable-initfini-array and/or gold linker is used,
> gcc emits .ctors/.dtors and .text.startup/.text.exit
> sections instead of .init_array/.fini_array.
> .dtors section is not explicitly accounted in the linker
> script and messes vvar/percpu layout. Want:
>
> 822bfd80 D _edata
> 822c D __vvar_beginning_hack
> 822c A __vvar_page
> 822c0080 0098 D vsyscall_gtod_data
> 822c1000 A __init_begin
> 822c1000 D init_per_cpu__irq_stack_union
> 822c1000 A __per_cpu_load
> 822d3000 D init_per_cpu__gdt_page
>
> Got:
>
> 8279a600 D _edata
> 8279b000 A __vvar_page
> 8279c000 A __init_begin
> 8279c000 D init_per_cpu__irq_stack_union
> 8279c000 A __per_cpu_load
> 8279e000 D __vvar_beginning_hack
> 8279e080 0098 D vsyscall_gtod_data
> 827ae000 D init_per_cpu__gdt_page
>
> This happens because __vvar_page and .vvar get different
> addresses in arch/x86/kernel/vmlinux.lds.S:
>
> . = ALIGN(PAGE_SIZE);
> __vvar_page = .;
>
> .vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
> /* work around gold bug 13023 */
> __vvar_beginning_hack = .;
>
> Discard .dtors/.fini_array/.text.exit, since we don't call dtors.
> Merge .text.startup into init text.
>
> Cc:  # v4.0+
> Reviewed-by: Andrey Ryabinin 
> Signed-off-by: Dmitry Vyukov 
>
> ---
>
> Changes since v1:
>  - discard .dtors
>  - don't define .mem sections
>
> Changes since v2:
>  - use 'vmlinux.lds' subsystem prefix instead of 'kasan'
>
> Changes since v3:
>  - add 'Cc:  # v4.0+' tag
>  - CC linux-kernel@vger.kernel.org,a...@linux-foundation.org


Andrew, can you please take this to mm tree?
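For readers following along, the "discard .dtors/.fini_array/.text.exit" part of the commit message amounts to a linker-script fragment along these lines. This is illustrative only: the section names come from the commit message above, but the exact placement inside include/asm-generic/vmlinux.lds.h is an assumption.

```
/* inside the /DISCARD/ output section of the linker script:
 * these sections are emitted by gcc with --disable-initfini-array
 * or the gold linker, and the kernel never runs destructors */
/DISCARD/ : {
        *(.dtors)
        *(.fini_array)
        *(.text.exit)
}
```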


strange Mac OSX RST behavior

2016-07-01 Thread Jason Baron
I'm wondering if anybody else has run into this...

On Mac OSX 10.11.5 (latest version), we have found that when tcp
connections are abruptly terminated (via ^C), a FIN is sent followed
by an RST packet. The RST is sent with the same sequence number as the
FIN, and thus dropped since the stack only accepts RST packets matching
rcv_nxt (RFC 5961). This could also be resolved if Mac OSX replied with
an RST on the closed socket, but it appears that it does not.

The workaround here is then to reset the connection, if the RST is
is equal to rcv_nxt - 1, if we have already received a FIN.

The RST attack surface is limited b/c we only accept the RST after we've
accepted a FIN and have not previously sent a FIN and received back the
corresponding ACK. In other words RST is only accepted in the tcp
states: TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING.

I'm interested in whether anybody else has run into this issue. It's problematic
since it ties up server resources for sockets sitting in TCP_CLOSE_WAIT.
We are also in the process of contacting Apple to see what can be done
here...workaround patch is below.


Here is the sequence from wireshark; Mac OSX is the client sending the
FIN:

84581  14.752908  ->  TCP 66 49896 > http [FIN, ACK] 
Seq=673257230 Ack=924722210 Win=131072 Len=0 TSval=622455547 TSecr=346246436
84984  14.788056  ->  TCP 60 49896 > http [RST] 
Seq=673257230 Win=0 Len=0
84985  14.788061  ->  TCP 66 http > 49896 [ACK] 
Seq=924739994 Ack=673257231 Win=28960 Len=0 TSval=346246723 TSecr=622455547

followed by a bunch of retransmits from server:

85138  14.994217  ->  TCP 1054 [TCP segment of a 
reassembled PDU]
85237  15.348217  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85337  16.056224  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85436  17.472225  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85540  20.304222  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85644  25.968218  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85745  37.280230  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]
85845  59.904235  ->  TCP 1054 [TCP Retransmission] 
[TCP segment of a reassembled PDU]

Thanks,

-Jason

---
 net/ipv4/tcp_input.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 94d4aff97523..b3c55b91140c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5155,6 +5155,25 @@ static int tcp_copy_to_iovec(struct sock *sk, struct 
sk_buff *skb, int hlen)
return err;
 }
 
+/*
+ * Mac OSX 10.11.5 can send a FIN followed by a RST where the RST
+ * has the same sequence number as the FIN. This is not compliant
+ * with RFC 5961, but ends up in a number of sockets tied up mostly
+ * in TCP_CLOSE_WAIT. The rst attack surface is limited b/c we only
+ * accept the RST after we've accepted a FIN and have not previously
+ * sent a FIN and received back the corresponding ACK.
+ */
+static bool tcp_fin_rst_check(struct sock *sk, struct sk_buff *skb)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   return unlikely((TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1)) &&
+   (TCP_SKB_CB(skb)->end_seq == (tp->rcv_nxt - 1)) &&
+   (sk->sk_state == TCP_CLOSE_WAIT ||
+sk->sk_state == TCP_LAST_ACK ||
+sk->sk_state == TCP_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5193,7 +5212,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct 
sk_buff *skb,
  LINUX_MIB_TCPACKSKIPPEDSEQ,
  &tp->last_oow_ack_time))
tcp_send_dupack(sk, skb);
-   }
+   } else if (tcp_fin_rst_check(sk, skb))
+   tcp_reset(sk);
goto discard;
}
 
@@ -5206,7 +5226,8 @@ static bool tcp_validate_incoming(struct sock *sk, struct 
sk_buff *skb,
 * else
 * Send a challenge ACK
 */
-   if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+   if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+   tcp_fin_rst_check(sk, skb)) {
rst_seq_match = true;
} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
struct tcp_sack_block *sp = &tp->selective_acks[0];
-- 
2.6.1



Re: [PATCH v4] vmlinux.lds: account for destructor sections

2016-07-01 Thread Dmitry Vyukov
On Fri, Jul 1, 2016 at 5:19 PM, Dmitry Vyukov  wrote:
> If CONFIG_KASAN is enabled and gcc is configured with
> --disable-initfini-array and/or gold linker is used,
> gcc emits .ctors/.dtors and .text.startup/.text.exit
> sections instead of .init_array/.fini_array.
> .dtors section is not explicitly accounted in the linker
> script and messes vvar/percpu layout. Want:
>
> 822bfd80 D _edata
> 822c D __vvar_beginning_hack
> 822c A __vvar_page
> 822c0080 0098 D vsyscall_gtod_data
> 822c1000 A __init_begin
> 822c1000 D init_per_cpu__irq_stack_union
> 822c1000 A __per_cpu_load
> 822d3000 D init_per_cpu__gdt_page
>
> Got:
>
> 8279a600 D _edata
> 8279b000 A __vvar_page
> 8279c000 A __init_begin
> 8279c000 D init_per_cpu__irq_stack_union
> 8279c000 A __per_cpu_load
> 8279e000 D __vvar_beginning_hack
> 8279e080 0098 D vsyscall_gtod_data
> 827ae000 D init_per_cpu__gdt_page
>
> This happens because __vvar_page and .vvar get different
> addresses in arch/x86/kernel/vmlinux.lds.S:
>
> . = ALIGN(PAGE_SIZE);
> __vvar_page = .;
>
> .vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
> /* work around gold bug 13023 */
> __vvar_beginning_hack = .;
>
> Discard .dtors/.fini_array/.text.exit, since we don't call dtors.
> Merge .text.startup into init text.
>
> Cc:  # v4.0+
> Reviewed-by: Andrey Ryabinin 
> Signed-off-by: Dmitry Vyukov 
>
> ---
>
> Changes since v1:
>  - discard .dtors
>  - don't define .mem sections
>
> Changes since v2:
>  - use 'vmlinux.lds' subsystem prefix instead of 'kasan'
>
> Changes since v3:
>  - add 'Cc:  # v4.0+' tag
>  - CC linux-kernel@vger.kernel.org,a...@linux-foundation.org


Andrew, can you please take this to mm tree?


Re: [PATCH v3] vmlinux.lds: account for destructor sections

2016-07-01 Thread Dmitry Vyukov
On Fri, Jul 1, 2016 at 4:58 PM, Andrey Ryabinin  wrote:
> 2016-06-29 20:28 GMT+03:00 Dmitry Vyukov :
>> On Fri, Jun 24, 2016 at 6:57 PM, Andrey Ryabinin  
>> wrote:
>>> 2016-06-24 18:39 GMT+03:00 Dmitry Vyukov :
 If CONFIG_KASAN is enabled and gcc is configured with
 --disable-initfini-array and/or gold linker is used,
 gcc emits .ctors/.dtors and .text.startup/.text.exit
 sections instead of .init_array/.fini_array.
 .dtors section is not explicitly accounted in the linker
 script and messes vvar/percpu layout. Want:

 822bfd80 D _edata
 822c D __vvar_beginning_hack
 822c A __vvar_page
 822c0080 0098 D vsyscall_gtod_data
 822c1000 A __init_begin
 822c1000 D init_per_cpu__irq_stack_union
 822c1000 A __per_cpu_load
 822d3000 D init_per_cpu__gdt_page

 Got:

 8279a600 D _edata
 8279b000 A __vvar_page
 8279c000 A __init_begin
 8279c000 D init_per_cpu__irq_stack_union
 8279c000 A __per_cpu_load
 8279e000 D __vvar_beginning_hack
 8279e080 0098 D vsyscall_gtod_data
 827ae000 D init_per_cpu__gdt_page

 This happens because __vvar_page and .vvar get different
 addresses in arch/x86/kernel/vmlinux.lds.S:

 . = ALIGN(PAGE_SIZE);
 __vvar_page = .;

 .vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
 /* work around gold bug 13023 */
 __vvar_beginning_hack = .;

 Discard .dtors/.fini_array/.text.exit, since we don't call dtors.
 Merge .text.startup into init text.

 Signed-off-by: Dmitry Vyukov 
>>>
>>> Reviewed-by: Andrey Ryabinin 
>>
>>
>> Who can take it to some tree?
>
> akpm tends to be the maintainer of last resort.
> But you probably need to resend the patch, because LKML was not in CC list.
> Also, please add stable tag: Cc:  # v4.0+

Remailed. Thanks.


[PATCH v4] vmlinux.lds: account for destructor sections

2016-07-01 Thread Dmitry Vyukov
If CONFIG_KASAN is enabled and gcc is configured with
--disable-initfini-array and/or gold linker is used,
gcc emits .ctors/.dtors and .text.startup/.text.exit
sections instead of .init_array/.fini_array.
The .dtors section is not explicitly accounted for in the linker
script and messes up the vvar/percpu layout. Want:

822bfd80 D _edata
822c D __vvar_beginning_hack
822c A __vvar_page
822c0080 0098 D vsyscall_gtod_data
822c1000 A __init_begin
822c1000 D init_per_cpu__irq_stack_union
822c1000 A __per_cpu_load
822d3000 D init_per_cpu__gdt_page

Got:

8279a600 D _edata
8279b000 A __vvar_page
8279c000 A __init_begin
8279c000 D init_per_cpu__irq_stack_union
8279c000 A __per_cpu_load
8279e000 D __vvar_beginning_hack
8279e080 0098 D vsyscall_gtod_data
827ae000 D init_per_cpu__gdt_page

This happens because __vvar_page and .vvar get different
addresses in arch/x86/kernel/vmlinux.lds.S:

. = ALIGN(PAGE_SIZE);
__vvar_page = .;

.vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
/* work around gold bug 13023 */
__vvar_beginning_hack = .;

Discard .dtors/.fini_array/.text.exit, since we don't call dtors.
Merge .text.startup into init text.

Cc:  # v4.0+
Reviewed-by: Andrey Ryabinin 
Signed-off-by: Dmitry Vyukov 

---

Changes since v1:
 - discard .dtors
 - don't define .mem sections

Changes since v2:
 - use 'vmlinux.lds' subsystem prefix instead of 'kasan'

Changes since v3:
 - add 'Cc:  # v4.0+' tag
 - CC linux-kernel@vger.kernel.org,a...@linux-foundation.org

---
 include/asm-generic/vmlinux.lds.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index 6a67ab9..081d0f2 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -542,15 +542,19 @@
 
 #define INIT_TEXT  \
*(.init.text)   \
+   *(.text.startup)\
MEM_DISCARD(init.text)
 
 #define EXIT_DATA  \
*(.exit.data)   \
+   *(.fini_array)  \
+   *(.dtors)   \
MEM_DISCARD(exit.data)  \
MEM_DISCARD(exit.rodata)
 
 #define EXIT_TEXT  \
*(.exit.text)   \
+   *(.text.exit)   \
MEM_DISCARD(exit.text)
 
 #define EXIT_CALL  \
-- 
2.8.0.rc3.226.g39d4020



Re: [PATCH] iio: dac: mcp4725: Remove unneeded conversions to bool

2016-07-01 Thread Andrew F. Davis
On 07/01/2016 10:03 AM, Peter Meerwald-Stadler wrote:
> 
>> Found with scripts/coccinelle/misc/boolconv.cocci.
> 
> I'd argue that
> 
> pd = (inbuf[0] >> 1) & 0x3;
> if (pd) {
> data->powerdown = true;
> data->powerdown_mode = pd - 1;
> } else {
> data->powerdown = false;
> data->powerdown_mode = 2; /* largest register to gnd */
> }
> 
> is less compact but probably easier to read/understand;
> 

I agree, this is a much better fix. Would you like to submit this
version as a patch?

Andrew

>> Signed-off-by: Andrew F. Davis 
> 
> but I'm fine with the proposed patch as well to make cocci happy
> 
> regards, p.
> 
>> ---
>>  drivers/iio/dac/mcp4725.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/iio/dac/mcp4725.c b/drivers/iio/dac/mcp4725.c
>> index cca935c..c3bb3fd 100644
>> --- a/drivers/iio/dac/mcp4725.c
>> +++ b/drivers/iio/dac/mcp4725.c
>> @@ -361,7 +361,7 @@ static int mcp4725_probe(struct i2c_client *client,
>>  return err;
>>  }
>>  pd = (inbuf[0] >> 1) & 0x3;
>> -data->powerdown = pd > 0 ? true : false;
>> +data->powerdown = pd > 0;
>>  data->powerdown_mode = pd ? pd - 1 : 2; /* largest register to gnd */
>>  data->dac_value = (inbuf[1] << 4) | (inbuf[2] >> 4);
>>  
>>
> 


Re: [PATCH 6/7] dt-bindings: net: bgmac: add bindings documentation for bgmac

2016-07-01 Thread Jon Mason
On Fri, Jul 1, 2016 at 5:46 AM, Arnd Bergmann  wrote:
> On Thursday, June 30, 2016 6:59:13 PM CEST Jon Mason wrote:
>> +
>> +Required properties:
>> + - compatible: "brcm,bgmac-nsp"
>> + - reg:Address and length of the GMAC registers,
>> +   Address and length of the GMAC IDM registers
>> + - reg-names:  Names of the registers.  Must have both "gmac_base" and
>> +   "idm_base"
>> + - interrupts: Interrupt number
>> +
>
>
> "brcm,bgmac-nsp" sounds a bit too general. As I understand, this is a family
> of SoCs that might not all have the exact same implementation of this
> ethernet device, as we can see from the long lookup table in bgmac_probe().

The Broadcom iProc family of SoCs contains:
Northstar
Northstar Plus
Cygnus
Northstar 2
a few SoCs that are under development
and a number of ethernet switches (which might never be officially supported)

Each one of these SoCs could have a different revision of the gmac IP
block, but they should be uniform within each SoC (though there might
be an A0/B0 change necessary).  The Northstar Plus product family has a
number of different implementations, but the SoC is unchanged.  So, I
think this might be too specific, when we really need a general compat
string.

Broadcom has a history of sharing IP blocks amongst the different
divisions.  So, this driver might be used on other SoC families (as it
apparently has been done in the past, based on the code you
reference).  I do not know of any way to know what legacy, non-iProc
chips have used this IP block.  I can make this "brcm,iproc-bgmac",
and add "brcm,iproc-nsp-bgmac" as an alternative compatible string in
this file (which I believe you are suggesting), but there might be
non-iProc SoCs that use this driver.  Is this acceptable?

Thanks,
Jon

> Please document the specific product numbers here that are publicly
> known already. Having the driver match just on "brcm,bgmac-nsp" as a fallback
> is fine, so you can document that one as required for all users.
>
> Arnd


Re: [PATCH v1 03/11] KVM: x86: dynamic kvm_apic_map

2016-07-01 Thread Paolo Bonzini


On 01/07/2016 17:06, Paolo Bonzini wrote:
>>> >> > Should it?
>> Yes, x2APIC ID cannot be changed in hardware and is initialized to the
>> initial APIC ID.
>> Letting LAPIC_SET change x2APIC ID would allow scenarios where userspace
>> reuses old VMs instead of building new ones after reconfiguration.
>> I don't think it's a sensible use case and it is currently broken,
>> because we don't exit to userspace when changing APIC mode, so KVM would
>> just set APIC ID to VCPU ID on any transition and userspace couldn't
>> amend it.

Forgot to reply about this: letting SET_LAPIC change x2APIC IDs is nonsense.

In x2APIC mode + new capability disabled SET_LAPIC should ignore the id
register altogether for backwards compatibility.

In x2APIC mode + new capability enabled it should either ignore it, or
fail if the x2APIC ID doesn't match the VCPU id.  I suspect the latter
is better because it would help catching the case where userspace is
erroneously shifting the id left to bits 31-24.

Paolo


Re: [RFC PATCH 0/2] net: ethernet: Add support for gmii2rgmii converter

2016-07-01 Thread Andrew Lunn
On Fri, Jul 01, 2016 at 11:50:10AM +0530, Kedareswara rao Appana wrote:
> This patch series does the following
> ---> Add support for gmii2rgmii converter.

How generic is this gmii2rgmii IP block? Could it be used with any
GMII and RGMII device?

Should it be placed in drivers/net/phy, so making it easier to reuse?

Also, Russell Kings phylink might be a more natural structure for
this. It is hard to know when those patches will land, but it might be
worth looking at.

   Andrew


Re: [PATCH] [media] cec: add missing inline stubs

2016-07-01 Thread Arnd Bergmann
On Friday, July 1, 2016 4:35:09 PM CEST Hans Verkuil wrote:
> On 07/01/2016 01:19 PM, Arnd Bergmann wrote:
> > The linux/cec.h header file contains empty inline definitions for
> > a subset of the API for the case in which CEC is not enabled,
> > however we have drivers that call other functions as well:
> > 
> > drivers/media/i2c/adv7604.c: In function 'adv76xx_cec_tx_raw_status':
> > drivers/media/i2c/adv7604.c:1956:3: error: implicit declaration of function 
> > 'cec_transmit_done' [-Werror=implicit-function-declaration]
> > drivers/media/i2c/adv7604.c: In function 'adv76xx_cec_isr':
> > drivers/media/i2c/adv7604.c:2012:4: error: implicit declaration of function 
> > 'cec_received_msg' [-Werror=implicit-function-declaration]
> > drivers/media/i2c/adv7604.c: In function 'adv76xx_probe':
> > drivers/media/i2c/adv7604.c:3482:20: error: implicit declaration of 
> > function 'cec_allocate_adapter' [-Werror=implicit-function-declaration]
> 
> I don't understand this. These calls are under #if 
> IS_ENABLED(CONFIG_VIDEO_ADV7842_CEC),
> and that should be 0 if MEDIA_CEC is not selected.
> 
> Am I missing some weird config combination?

This must have been a build error I ran into before your patch, when I
had this one applied locally instead:

diff --git a/include/media/cec.h b/include/media/cec.h
index c462f9b44074..564a6a06bed7 100644
--- a/include/media/cec.h
+++ b/include/media/cec.h
@@ -187,7 +187,7 @@ static inline bool cec_is_sink(const struct cec_adapter 
*adap)
return adap->phys_addr == 0;
 }
 
-#if IS_ENABLED(CONFIG_MEDIA_CEC)
+#if IS_REACHABLE(CONFIG_MEDIA_CEC)
 struct cec_adapter *cec_allocate_adapter(const struct cec_adap_ops *ops,
void *priv, const char *name, u32 caps, u8 available_las,
struct device *parent);

because that was hiding the declarations when the code could not
reach it. With your newer patch that is not possible any more.

I also wasn't aware that each of these already had their own Kconfig
symbols. Could we just do this instead of your patch then:

diff --git a/drivers/media/i2c/Kconfig b/drivers/media/i2c/Kconfig
index ce9006e10a30..73e047220905 100644
--- a/drivers/media/i2c/Kconfig
+++ b/drivers/media/i2c/Kconfig
@@ -222,6 +222,7 @@ config VIDEO_ADV7604
 config VIDEO_ADV7604_CEC
bool "Enable Analog Devices ADV7604 CEC support"
depends on VIDEO_ADV7604 && MEDIA_CEC
+   depends on VIDEO_ADV7604=m || MEDIA_CEC=y
---help---
  When selected the adv7604 will support the optional
  HDMI CEC feature.
@@ -243,6 +244,7 @@ config VIDEO_ADV7842
 config VIDEO_ADV7842_CEC
bool "Enable Analog Devices ADV7842 CEC support"
depends on VIDEO_ADV7842 && MEDIA_CEC
+   depends on VIDEO_ADV7842=m || MEDIA_CEC=y
---help---
  When selected the adv7842 will support the optional
  HDMI CEC feature.
@@ -475,6 +477,7 @@ config VIDEO_ADV7511
 config VIDEO_ADV7511_CEC
bool "Enable Analog Devices ADV7511 CEC support"
depends on VIDEO_ADV7511 && MEDIA_CEC
+   depends on VIDEO_ADV7511=m || MEDIA_CEC=y
---help---
  When selected the adv7511 will support the optional
  HDMI CEC feature.
diff --git a/drivers/media/platform/vivid/Kconfig 
b/drivers/media/platform/vivid/Kconfig
index 8e6918c5c87c..8e31146d079a 100644
--- a/drivers/media/platform/vivid/Kconfig
+++ b/drivers/media/platform/vivid/Kconfig
@@ -26,6 +26,7 @@ config VIDEO_VIVID
 config VIDEO_VIVID_CEC
bool "Enable CEC emulation support"
depends on VIDEO_VIVID && MEDIA_CEC
+   depends on VIDEO_VIVID=m || MEDIA_CEC=y
---help---
  When selected the vivid module will emulate the optional
  HDMI CEC feature.

which is still not overly nice, but it manages to avoid the
IS_REACHABLE() check and it lets MEDIA_CEC be a module.

Arnd


Re: [PATCH] arm64: defconfig: Enable QDF2432 config options

2016-07-01 Thread Timur Tabi

Christopher Covington wrote:

Due to distribution differences [1][2], I see =y built-in as the default
on mobile platforms and =m modular as the default on server platforms.


I don't think we should mix "server" defconfig entries with "mobile" 
defconfig entries.  It's an arm64 defconfig, neither server nor mobile. 
The other entries are =y, so these should be =y.


No one uses this defconfig as-is for any platform.


But thinking about it, we could get a poweroff, reset, over-temperature,
or other interrupt before we get to the initramfs. So the pin controller
driver should be =y built-in.

One might make rootfs media =y built-in to make testing easier, but I
prefer to put my rootfs on SATA, as SD/MMC is much slower and smaller by
comparison.


Let's not overthink things, and just make *everything* =y by default.

--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the
Code Aurora Forum, hosted by The Linux Foundation.



Re: [docs-next PATCH] Documentation/sphinx: skip build if user requested specific DOCBOOKS

2016-07-01 Thread Mauro Carvalho Chehab
Em Fri, 01 Jul 2016 16:31:14 +0300
Jani Nikula  escreveu:

> On Fri, 01 Jul 2016, Mauro Carvalho Chehab  wrote:
> > Em Fri,  1 Jul 2016 15:24:44 +0300
> > Jani Nikula  escreveu:
> >  
> >> If the user requested specific DocBooks to be built using 'make
> >> DOCBOOKS=foo.xml htmldocs', assume no Sphinx build is desired. This
> >> check is transitional, and can be removed once we drop the DocBook
> >> build.  
> >
> > I guess this is actually a wrong assumption. I mean, it is valid to
> > build multiple documents at the same time. Sometimes I do things
> > like:
> > make DOCBOOKS="media_api.xml device-drivers.xml" htmldocs
> >
> > When I want both docs to be compiled.
> >
> > What I would be expecting is that Sphinx would be looking into
> > the DOCBOOKS targets and see if (some) of them belongs to it.
> >
> > Alternatively, we could add a separate makefile var for the
> > Sphinx targets, but the logic would be more complex, as it
> > should:  
> 
> Please let's not conflate DOCBOOKS to mean something other than
> DocBooks. I think it'll be easier that way.

OK. Works for me.

> So I guess we'll need a way to build just a subset of the Sphinx
> documentation. I would like that to be a somewhat generic thing, not
> requiring a separate conf file for each subset. Is the granularity of a
> directory enough? I've been meaning to look into passing different
>  and  to sphinx-build for this, but I don't have
> the time now.

In the case of the media doc, it is actually a matter of building
one rst file (linux_tv/index.rst). For that, I'm equally happy
if we specify it via  or via .

I suspect we'll end by needing filenames granularity in some future,
but, if you find easier to just handle  for now, that's
OK. 

Regards,
Mauro



Re: [PATCH v1 06/11] KVM: x86: use hardware-compatible format for APIC ID register

2016-07-01 Thread Paolo Bonzini


On 01/07/2016 16:54, Radim Krčmář wrote:
>> >(Hint: we
>> > want kvm-unit-tests for this).
> :) Something like this?
> 
>   enable_xapic()
>   id = apic_id()
>   set_apic_id(id+1) // ?
>   enable_x2apic()
>   id == apic_id() & 0xff
>   disable_apic()
>   enable_xapic()
>   id == apic_id()
> 

Yes, plus checking that it "moves" appropriately between low and high bits.

Thanks,

Paolo



Re: [PATCH v1 03/11] KVM: x86: dynamic kvm_apic_map

2016-07-01 Thread Paolo Bonzini


On 01/07/2016 16:38, Radim Krčmář wrote:
>> > Should it?
> Yes, x2APIC ID cannot be changed in hardware and is initialized to the
> initial APIC ID.
> Letting LAPIC_SET change x2APIC ID would allow scenarios where userspace
> reuses old VMs instead of building new ones after reconfiguration.
> I don't think it's a sensible use case and it is currently broken,
> because we don't exit to userspace when changing APIC mode, so KVM would
> just set APIC ID to VCPU ID on any transition and userspace couldn't
> amend it.
> 
>> > According to QEMU if you have e.g. 3 cores per socket one
>> > socket takes 4 APIC IDs.  For Knights Landing the "worst" prime factor in
>> > 288 is 3^2 so you need APIC IDs up to 288 * (4/3)^2 = 512.
> The topology can result in sparse APIC ID and APIC ID is initialized
> from VCPU ID, so userspace has to pick VCPU ID accordingly.

Right, I was confusing KVM_MAX_VCPUS and KVM_MAX_VCPU_ID.  So the
overflow case cannot happen and neither can the 32GB allocation.  On the
other hand, I suspect you need to bump KVM_MAX_VCPU_ID beyond its
current default setting (which is equal to KVM_MAX_VCPUS), up to 511 or
1023.

Thanks,

Paolo



Re: [PATCH] arm64: defconfig: Enable QDF2432 config options

2016-07-01 Thread Christopher Covington
On 07/01/2016 10:14 AM, Timur Tabi wrote:
> Christopher Covington wrote:
>>   arch/arm64/configs/defconfig |  4 
>>   scripts/patch-details.sh | 21 ++---
> 
> I don't think these two files should be combined.

Oops, sorry. Fixed in v2.

>>  CONFIG_PINCTRL_SINGLE=y
>>  CONFIG_PINCTRL_MSM8916=y
>> +CONFIG_PINCTRL_QDF2XXX=m
>>  CONFIG_PINCTRL_QCOM_SPMI_PMIC=y
>>  CONFIG_GPIO_SYSFS=y
>>  CONFIG_GPIO_DWAPB=y
>> @@ -245,6 +246,7 @@ CONFIG_MMC=y
>>  CONFIG_MMC_BLOCK_MINORS=32
>>  CONFIG_MMC_ARMMMCI=y
>>  CONFIG_MMC_SDHCI=y
>> +CONFIG_MMC_SDHCI_ACPI=m
>>  CONFIG_MMC_SDHCI_PLTFM=y
> 
> Why are these =m when all the others are =y?

Due to distribution differences [1][2], I see =y built-in as the default
on mobile platforms and =m modular as the default on server platforms.

But thinking about it, we could get a poweroff, reset, over-temperature,
or other interrupt before we get to the initramfs. So the pin controller
driver should be =y built-in.

One might make rootfs media =y built-in to make testing easier, but I
prefer to put my rootfs on SATA, as SD/MMC is much slower and smaller by
comparison.

Thanks,
Cov

Examples I found to spot check my recollection:
1.
https://github.com/CyanogenMod/android_kernel_lge_hammerhead/blob/cm-13.0/arch/arm/configs/cyanogenmod_hammerhead_defconfig

2.
https://git.centos.org/blob/sig-altarch!kernel.git/c3cb6cc6edb5400321fd9d83413add4cf61d942e/SOURCES!config-arm64
-- 
Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH] iio: dac: mcp4725: Remove unneeded conversions to bool

2016-07-01 Thread Peter Meerwald-Stadler

> Found with scripts/coccinelle/misc/boolconv.cocci.

I'd argue that

pd = (inbuf[0] >> 1) & 0x3;
if (pd) {
	data->powerdown = true;
	data->powerdown_mode = pd - 1;
} else {
	data->powerdown = false;
	data->powerdown_mode = 2; /* largest register to gnd */
}

is less compact but probably easier to read/understand;

> Signed-off-by: Andrew F. Davis 

but I'm fine with the proposed patch as well to make cocci happy

regards, p.

> ---
>  drivers/iio/dac/mcp4725.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/iio/dac/mcp4725.c b/drivers/iio/dac/mcp4725.c
> index cca935c..c3bb3fd 100644
> --- a/drivers/iio/dac/mcp4725.c
> +++ b/drivers/iio/dac/mcp4725.c
> @@ -361,7 +361,7 @@ static int mcp4725_probe(struct i2c_client *client,
>   return err;
>   }
>   pd = (inbuf[0] >> 1) & 0x3;
> - data->powerdown = pd > 0 ? true : false;
> + data->powerdown = pd > 0;
>   data->powerdown_mode = pd ? pd - 1 : 2; /* largest register to gnd */
>   data->dac_value = (inbuf[1] << 4) | (inbuf[2] >> 4);
>  
> 

-- 

Peter Meerwald-Stadler
+43-664-218 (mobile)




Re: [RFC PATCH 0/2] net: ethernet: Add support for gmii2rgmii converter

2016-07-01 Thread Florian Fainelli
Le 30/06/2016 23:20, Kedareswara rao Appana a écrit :
> This patch series does the following
> ---> Add support for gmii2rgmii converter.
> ---> Add support for gmii2rgmii converter in the macb driver.
> 
> Earlier sent one RFC patch https://patchwork.kernel.org/patch/9186615/ 
> which includes converter related stuff also in macb driver.
> This patch series fixes this issue.

This still seems very ad hoc and not completely explained. Can you
clarify how the gmii2rgmii converter gets used?

Is the expectation that a MACB Ethernet adapter will be connected to a
RGMII PHY like this:

MACB <=> GMII2RGMII <=> RGMII PHY
MACB MDIO <===> RGMII_PHY

and still have the ability to control both the PHY device's MDIO
registers and the GMII2RGMII converter and we need to make sure both
have matching settings, or is it something like this:

MACB <=> GMII2RGMII <=> RGMII PHY
MACB MDIO unconnected

and we lose the ability to query the PHY via MDIO?

As Nicolas pointed out, providing a binding documentation update may
help reviewers understand what is being accomplished here.

Thanks!

> 
> Kedareswara rao Appana (2):
>   net: ethernet: xilinx: Add gmii2rgmii converter support
>   net: macb: Add gmii2rgmii phy converter support
> 
>  drivers/net/ethernet/cadence/macb.c |   21 ++-
>  drivers/net/ethernet/cadence/macb.h |3 +
>  drivers/net/ethernet/xilinx/Kconfig |7 ++
>  drivers/net/ethernet/xilinx/Makefile|1 +
>  drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c |   76 
> +++
>  include/linux/xilinx_gmii2rgmii.h   |   24 +++
>  6 files changed, 131 insertions(+), 1 deletions(-)
>  create mode 100644 drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c
>  create mode 100644 include/linux/xilinx_gmii2rgmii.h
> 


-- 
Florian



Re: [PATCH] doc: flat-table directive

2016-07-01 Thread Mauro Carvalho Chehab
Em Fri, 1 Jul 2016 16:07:47 +0200
Markus Heiser  escreveu:

> Am 01.07.2016 um 15:09 schrieb Jani Nikula :
> 
> > On Fri, 01 Jul 2016, Markus Heiser  wrote:  
> >> In Sphinx, the code-block directive is a literal block, no refs or markup
> >> will be interpreted. This is different to what you know from DocBook.
> >> I will look for a solution, that matches your requirement.  
> > 
> > Parsed literal block solves the problem, but brings other problems like
> > requiring escaping for a lot of stuff. And you lose syntax
> > highlighting. It would be great to have a middle ground, like a
> > "semi-parsed code block" where the references are parsed but nothing
> > else.
> > 
> > http://docutils.sourceforge.net/docs/ref/rst/directives.html#parsed-literal-block
> >   
> 
> Yes, "parsed literal" blocks are not the solution and I have none 
> yet ... and you are right, we need something "semi". I doubt whether we 
> will eventually find a solution for this, but I will
> think about it ... I don't know how, but it must be a solution that is
> transparent to the pygment highlighter and not distort the
> node tree. I have to study the sphinx-writer and pygment first.

Yeah, a "semi" solution would be perfect. In the specific case of the
header files, a parsed literal should work. The code that generates
the *.h.xml files right now already does escaping where it needs to
make docbook happy. OK, the characters that need to be escaped will
be different, but it shouldn't be hard to do, I guess. At the media
Makefile, it currently does escaping for <, > and &:

ESCAPE = \
	-e "s/&/\\&amp;/g" \
	-e "s/</\\&lt;/g" \
	-e "s/>/\\&gt;/g"
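
That escaping step can be exercised in plain shell. This is a hedged sketch of the same sed idea; in the Makefile the backslashes are doubled because Make consumes one level of escaping.

```shell
# Escape &, <, > into XML entities; '&' must be handled first so the
# entities produced for < and > are not themselves re-escaped.
printf 'a < b && b > c\n' | \
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
# prints: a &lt; b &amp;&amp; b &gt; c
```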

> 
> 
> >> OK, checking dead internal links might be one requirement more. Normally 
> >> sphinx reports internal refs which are broken with a WARNING, but I have to
> >> analyze what xmllint checks and how we could realize something similar
> >> in sphinx / or if it is enough what sphinx reports.  
> > 
> > When we turn function() and  structure references to Sphinx
> > references, we pretty much rely on *not* being strict about all the
> > targets existing. At least for now. In an ideal world we'd aim for -n
> > and -W sphinx-build options, but we're far from that, and I don't know
> > if it's really possible ever.

I guess that will depend on how much care the subsystem maintainer has
with that. When building things on media, we have zero warnings when
building the Kernel with W=1 on x86. We also check both sparse and smatch 
regularly, trying to zero the warnings there too (currently, there are just
a few of them, all false positives and not trivial to remove).

We also have zero warnings on media documentation using xmllint
pedantic mode.

> > Is it possible to set -n and/or -W on a per-rst file basis, either in
> > the top config file or from within the rst file itself? Then we could
> > gradually improve this, and subsystems that really care about this could
> > be in the forefront.  
> 
> There is a nitpick_ignore config, but this will not help.
> As far as I can see, if you want similar on a per file basis,
> you need to implement a (HTML) builder which checks
> on which tree-level the *current* node is and if this node
> is in a doctree of one of your files in your *configured file-list*,
> then turns the warning into an error ... if we see more
> requirements coming in which need an HTML builder,
> we can implement one. I have implemented a man-page builder
> for the kernel-doc comments, because it was inevitable
> but I think it is not a good idea to reimplement the
> HTML builder in this first usage of sphinx.
> 
> Maybe this is totally the wrong way; maybe it is better
> to implement a *lint* builder from scratch (should not
> be hard).

That would work.

> 
> >> But before I will send out some small patches which are needed 
> >> first (IMHO). E.g. customizing the RTD theme for rendering large 
> >> tables in HTML well and activation of useful extensions like todolist.
> >> I have this in my "chaotic bulk" patch :-) ... I will separate it out
> >> an send it to Jon one by one.  
> > 
> > Btw I don't think we are really attached to the RTD theme. It's just
> > something I picked that was prettier than the default theme, and was
> > widely available and packaged in distros.  
> 
> IMHO it is not perfect, but the most elaborate you will find on the net.
> 
> > Ideally, we should probably
> > keep the customization at a level where it's possible for people to
> > easily use different themes.  
> 
> Layout is done in the theme, we have no chance to influence 
> the layout out of / before the theme. 
> 
> > That said, rendering big tables in the RTD
> > theme is definitely an issue.
> > 
> > I'd also aim to be fairly conservative at first in terms of the rst
> > features and Sphinx extensions we use. Keep it simple. It's really easy
> > to go overboard in the beginning. See how things pan out and gradually
> > 

> > to go overboard in the beginning. See how things pan out and gradually
> > extend from there.  
> 
> Yes, KIS ... I send the theme patch and you will see 

Re: [PATCH] doc: flat-table directive

2016-07-01 Thread Mauro Carvalho Chehab
On Fri, 1 Jul 2016 16:07:47 +0200,
Markus Heiser  wrote:

> On 01.07.2016 at 15:09, Jani Nikula  wrote:
> 
> > On Fri, 01 Jul 2016, Markus Heiser  wrote:  
> >> In Sphinx, the code-block directive is a literal block, no refs or markup
> >> will be interpreted. This is different to what you know from DocBook.
> >> I will look for a solution, that matches your requirement.  
> > 
> > Parsed literal block solves the problem, but brings other problems like
> > requiring escaping for a lot of stuff. And you lose syntax
> > highlighting. It would be great to have a middle ground, like a
> > "semi-parsed code block" where the references are parsed but nothing
> > else.
> > 
> > http://docutils.sourceforge.net/docs/ref/rst/directives.html#parsed-literal-block
> >   
> 
> Yes, "parsed literal" blocks is not the solution and I have none 
> yet ... and you are right, we need something "semi". I doubt whether we 
> will eventually find a solution for this, but I will
> think about it ... I don't know how, but it must be a solution that is
> transparent to the Pygments highlighter and does not distort the
> node tree. I have to study the sphinx-writer and Pygments first.

Yeah, a "semi" solution would be perfect. In the specific case of the
header files, a parsed literal should work. The code that generates
the *.h.xml files right now already does escaping where it needs to
make DocBook happy. OK, the characters that need to be escaped will
be different, but that shouldn't be hard to do, I guess. In the media
Makefile, it currently does escaping for <, > and &:

ESCAPE = \
	-e "s/&/\\&amp;/g" \
	-e "s/</\\&lt;/g" \
	-e "s/>/\\&gt;/g"
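The same three-character escaping the media Makefile does with sed can be sketched in plain Python; note that `&` must be handled first so the entities produced for `<` and `>` are not themselves re-escaped:

```python
def escape_for_docbook(text: str) -> str:
    """Escape the three characters the media Makefile rewrites with sed.
    '&' is replaced first so the '&lt;'/'&gt;' output is not re-escaped."""
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))
```

This is only a model of the Makefile rule, not a drop-in replacement for it.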

> 
> 
> >> OK, checking dead internal links might be one more requirement. Normally 
> >> sphinx reports internal refs which are broken with a WARNING, but I have to
> >> analyze what xmllint checks and how we could realize something similar
> >> in sphinx / or whether what sphinx reports is enough.  
> > 
> > When we turn function() and  structure references to Sphinx
> > references, we pretty much rely on *not* being strict about all the
> > targets existing. At least for now. In an ideal world we'd aim for -n
> > and -W sphinx-build options, but we're far from that, and I don't know
> > if it's really possible ever.

I guess that will depend on how much care the subsystem maintainer has
with that. When building things on media, we have zero warnings when
building the Kernel with W=1 on x86. We also check both sparse and smatch 
regularly, trying to zero the warnings there too (currently, there are just
a few of them, all false positives and not trivial to remove).

We also have zero warnings on media documentation using xmllint
pedantic mode.

> > Is it possible to set -n and/or -W on a per-rst file basis, either in
> > the top config file or from within the rst file itself? Then we could
> > gradually improve this, and subsystems that really care about this could
> > be in the forefront.  
> 
> There is a nitpick_ignore config, but this will not help.
> As far as I can see, if you want something similar on a per-file basis,
> you need to implement a (HTML) builder which checks
> on which tree level the *current* node is, and if this node
> is in a doctree of one of your files in your *configured file list*,
> then turns the warning into an error ... if more requirements
> come in that need an HTML builder, we can implement one.
> I have implemented a man-page builder
> for the kernel-doc comments, because it was inevitable,
> but I think it is not a good idea to reimplement the
> HTML builder in this first usage of Sphinx.
> 
> Maybe this is totally the wrong way; maybe it is better
> to implement a *lint* builder from scratch (it should not
> be hard).

That would work.

> 
> >> But first I will send out some small patches which are needed 
> >> (IMHO). E.g. customizing the RTD theme for rendering large 
> >> tables in HTML well, and activation of useful extensions like todolist.
> >> I have this in my "chaotic bulk" patch :-) ... I will separate it out
> >> and send it to Jon one by one.  
> > 
> > Btw I don't think we are really attached to the RTD theme. It's just
> > something I picked that was prettier than the default theme, and was
> > widely available and packaged in distros.  
> 
> IMHO it is not perfect, but it is the most elaborate you will find on the net.
> 
> > Ideally, we should probably
> > keep the customization at a level where it's possible for people to
> > easily use different themes.  
> 
> Layout is done in the theme; we have no way to influence 
> the layout outside of / before the theme. 
> 
> > That said, rendering big tables in the RTD
> > theme is definitely an issue.
> > 
> > I'd also aim to be fairly conservative at first in terms of the rst
> > features and Sphinx extensions we use. Keep it simple. It's really easy
> > to go overboard in the beginning. See how things pan out and gradually
> > extend from there.  
> 
> Yes, KIS ... I'll send the theme patch and you will see 

Re: EXT: Re: [PATCH RFC 1/4] dma: imx-sdma - reduce transfer latency for DMA cyclic clients

2016-07-01 Thread Nandor Han



On 28/06/16 17:34, Vinod Koul wrote:

On Thu, Jun 09, 2016 at 03:16:30PM +0300, Nandor Han wrote:

Having the SDMA driver use a tasklet for running the clients'
callback introduces some issues:
   - a probability of desynchronized data, because of the
     race condition created since the DMA transaction status
     is retrieved only when the callback is executed, leaving
     plenty of time for the transaction status to get altered.
   - inter-transfer latency, which can leave channels idle.

Move the callback execution, for cyclic channels, to the SDMA
interrupt (as advised in `Documentation/dmaengine/provider.txt`)
to (a) reduce the inter-transfer latency and (b) eliminate the
race condition where the DMA transaction status might
be changed by the time it is read.

The responsibility for the SDMA interrupt latency
is moved to the SDMA clients, which should, case by case, defer
the work to bottom halves when needed.
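The race described above can be shown with a toy model (plain Python, not kernel code): the client's callback samples the transaction status only when it actually runs, so a tasklet-style deferred callback may observe a status that a later event has already overwritten.

```python
def run(defer_callback: bool) -> str:
    """Toy model of the status race: the callback reads the DMA status
    only when it runs. Deferring it (tasklet-style) leaves a window in
    which the status can be altered before the client sees it."""
    status = {"value": "IN_PROGRESS"}
    seen = []

    def callback():
        seen.append(status["value"])   # client samples the status here

    status["value"] = "DONE"           # transfer completes in the IRQ handler
    if not defer_callback:
        callback()                     # IRQ-context callback: status is fresh
    status["value"] = "ERROR"          # a later event alters the status...
    if defer_callback:
        callback()                     # ...before the deferred callback runs
    return seen[0]
```

Here `run(False)` (callback from the interrupt) reports the completed transfer's real status, while `run(True)` (deferred callback) sees the altered one.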


Both of these look fine. Please change the patch titles to dmaengine: 

Are these going to be merged thru dmaengine tree or serial one?



I will soon send a v2 where I will fix the titles. If you are OK with 
the whole patchset it can be merged through the dmaengine tree; otherwise 
it probably goes through the serial one.


Let me know which is the best option.

Thanks,
   Nandor



Signed-off-by: Nandor Han 
---
  drivers/dma/imx-sdma.c | 36 
  1 file changed, 16 insertions(+), 20 deletions(-)

diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c
index 0f6fd42..e497847 100644
--- a/drivers/dma/imx-sdma.c
+++ b/drivers/dma/imx-sdma.c
@@ -654,12 +654,6 @@ static void sdma_event_disable(struct sdma_channel *sdmac, 
unsigned int event)
writel_relaxed(val, sdma->regs + chnenbl);
  }

-static void sdma_handle_channel_loop(struct sdma_channel *sdmac)
-{
-   if (sdmac->desc.callback)
-   sdmac->desc.callback(sdmac->desc.callback_param);
-}
-
  static void sdma_update_channel_loop(struct sdma_channel *sdmac)
  {
struct sdma_buffer_descriptor *bd;
@@ -678,13 +672,25 @@ static void sdma_update_channel_loop(struct sdma_channel 
*sdmac)
sdmac->status = DMA_ERROR;

bd->mode.status |= BD_DONE;
+
+   /*
+* The callback is called from the interrupt context in order
+* to reduce latency and to avoid the risk of altering the
+* SDMA transaction status by the time the client tasklet is
+* executed.
+*/
+
+   if (sdmac->desc.callback)
+   sdmac->desc.callback(sdmac->desc.callback_param);
+
sdmac->buf_tail++;
sdmac->buf_tail %= sdmac->num_bd;
}
  }

-static void mxc_sdma_handle_channel_normal(struct sdma_channel *sdmac)
+static void mxc_sdma_handle_channel_normal(unsigned long data)
  {
+   struct sdma_channel *sdmac = (struct sdma_channel *) data;
struct sdma_buffer_descriptor *bd;
int i, error = 0;

@@ -711,16 +717,6 @@ static void mxc_sdma_handle_channel_normal(struct 
sdma_channel *sdmac)
sdmac->desc.callback(sdmac->desc.callback_param);
  }

-static void sdma_tasklet(unsigned long data)
-{
-   struct sdma_channel *sdmac = (struct sdma_channel *) data;
-
-   if (sdmac->flags & IMX_DMA_SG_LOOP)
-   sdma_handle_channel_loop(sdmac);
-   else
-   mxc_sdma_handle_channel_normal(sdmac);
-}
-
  static irqreturn_t sdma_int_handler(int irq, void *dev_id)
  {
struct sdma_engine *sdma = dev_id;
@@ -737,8 +733,8 @@ static irqreturn_t sdma_int_handler(int irq, void *dev_id)

if (sdmac->flags & IMX_DMA_SG_LOOP)
sdma_update_channel_loop(sdmac);
-
-   tasklet_schedule(&sdmac->tasklet);
+   else
+   tasklet_schedule(&sdmac->tasklet);

__clear_bit(channel, );
}
@@ -1739,7 +1735,7 @@ static int sdma_probe(struct platform_device *pdev)
dma_cookie_init(&sdmac->chan);
sdmac->channel = i;

-   tasklet_init(&sdmac->tasklet, sdma_tasklet,
+   tasklet_init(&sdmac->tasklet, mxc_sdma_handle_channel_normal,
 (unsigned long) sdmac);
/*
 * Add the channel to the DMAC list. Do not add channel 0 though
--
2.8.3





Re: [PATCH] spi: imx: wait_for_completion_timeout(..) for PIO transfers

2016-07-01 Thread Mark Brown
On Fri, Jul 01, 2016 at 02:32:58PM +0200, Christian Gmeiner wrote:
> In some rare cases I see the following 'task blocked' information. It
> looks like the PIO transfer has some problems and never succeeds. Make
> use of wait_for_completion_timeout(..) to detect this case and
> return -ETIMEDOUT.
> 
> v2: remove backtrace from commit message

Please do not submit new versions of already applied patches; instead,
submit incremental updates to the existing code.  Modifying existing
commits creates problems for other users building on top of those
commits, so it's best practice to only change published git commits if
absolutely essential.

If you're including an inter-version changelog please keep it after the
--- as covered in SubmittingPatches.
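The `wait_for_completion_timeout()` pattern from the patch under discussion can be sketched in userspace terms; this is a hypothetical analogue where `threading.Event` stands in for the kernel completion and the function names are made up for illustration:

```python
import threading

ETIMEDOUT = 110  # errno value conventionally used for a timed-out operation

def pio_transfer(done: threading.Event, timeout_s: float) -> int:
    """Wait for a transfer-complete signal with a timeout, as
    wait_for_completion_timeout() does in the kernel, and report
    -ETIMEDOUT if the signal never arrives."""
    if not done.wait(timeout=timeout_s):
        return -ETIMEDOUT
    return 0
```

With the event never set, `pio_transfer` returns `-110`; once the "hardware" signals completion via `done.set()`, it returns `0`.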




Re: [PATCH] drivers/perf: arm-pmu: Handle per-interrupt affinity mask

2016-07-01 Thread Caesar Wang

Hi Marc,

On 2016-07-01 21:21, Marc Zyngier wrote:

On a big.LITTLE system, PMUs can be wired to CPUs using per-CPU
interrupts (PPIs). In this case, it is important to make sure that
the enable/disable operations happen on the right set of CPUs.

So instead of relying on the interrupt-affinity property, we can
use the actual percpu affinity that DT exposes as part of the
interrupt specifier. The DT binding is also updated to reflect
the fact that the interrupt-affinity property shouldn't be used
in that case.

Signed-off-by: Marc Zyngier 


Tested-by: Caesar Wang 

I picked this up on my local branch.
8bc671a FROMLIST: drivers/perf: arm-pmu: Handle per-interrupt affinity mask
3d723f4 FROMLIST: arm64: dts: rockchip: support the pmu node for rk3399
1359b92 FIXUP: FROMLIST: arm64: dts: rockchip: change all interrupts 
cells for 4 on rk3399 SoCs

...

Tested on rk3399 board.
localhost / # perf list

List of pre-defined events (to be used in -e):
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
cache-references [Hardware event]
cache-misses [Hardware event]
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
...

perf stat --cpu 0/1/2/3 to monitor,
e.g. cpu0:

localhost / # perf stat --cpu 0
^C
Performance counter stats for 'CPU(s) 0':

3374.917571 task-clock (msec) # 1.001 CPUs utilized [100.00%]
20 context-switches # 0.006 K/sec [100.00%]
2 cpu-migrations # 0.001 K/sec [100.00%]
55 page-faults # 0.016 K/sec
7151843 cycles # 0.002 GHz [100.00%]
<not supported>  stalled-cycles-frontend
<not supported>  stalled-cycles-backend
4272536 instructions # 0.60 insns per cycle [100.00%]
568406 branches # 0.168 M/sec [100.00%]
65652 branch-misses # 11.55% of all branches

Also, 'perf top' to monitor the PMU interrupts from cpus
---




---
  Documentation/devicetree/bindings/arm/pmu.txt |  4 +++-
  drivers/perf/arm_pmu.c| 22 +-
  2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/Documentation/devicetree/bindings/arm/pmu.txt 
b/Documentation/devicetree/bindings/arm/pmu.txt
index 74d5417..61c8b46 100644
--- a/Documentation/devicetree/bindings/arm/pmu.txt
+++ b/Documentation/devicetree/bindings/arm/pmu.txt
@@ -39,7 +39,9 @@ Optional properties:
 When using a PPI, specifies a list of phandles to CPU
   nodes corresponding to the set of CPUs which have
   a PMU of this type signalling the PPI listed in the
-  interrupts property.
+  interrupts property, unless this is already specified
+  by the PPI interrupt specifier itself (in which case
+  the interrupt-affinity property shouldn't be present).
  
 This property should be present when there is more than

   a single SPI.
diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index 140436a..065ccec 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -603,7 +603,8 @@ static void cpu_pmu_free_irq(struct arm_pmu *cpu_pmu)
  
  	irq = platform_get_irq(pmu_device, 0);

if (irq >= 0 && irq_is_percpu(irq)) {
-   on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1);
+   on_each_cpu_mask(&cpu_pmu->supported_cpus,
+cpu_pmu_disable_percpu_irq, &irq, 1);
free_percpu_irq(irq, &hw_events->percpu_pmu);
} else {
for (i = 0; i < irqs; ++i) {
@@ -645,7 +646,9 @@ static int cpu_pmu_request_irq(struct arm_pmu *cpu_pmu, 
irq_handler_t handler)
irq);
return err;
}
-   on_each_cpu(cpu_pmu_enable_percpu_irq, &irq, 1);
+
+   on_each_cpu_mask(&cpu_pmu->supported_cpus,
+cpu_pmu_enable_percpu_irq, &irq, 1);
} else {
for (i = 0; i < irqs; ++i) {
int cpu = i;
@@ -961,9 +964,18 @@ static int of_pmu_irq_cfg(struct arm_pmu *pmu)
i++;
} while (1);
  
-	/* If we didn't manage to parse anything, claim to support all CPUs */

-   if (cpumask_weight(&pmu->supported_cpus) == 0)
-   cpumask_setall(&pmu->supported_cpus);
+   /* If we didn't manage to parse anything, try the interrupt affinity */
+   if (cpumask_weight(&pmu->supported_cpus) == 0) {
+   if (!using_spi) {
+   /* If using PPIs, check the affinity of the partition */
+   int irq = platform_get_irq(pdev, 0);
+   irq_get_percpu_devid_partition(irq,
+  &pmu->supported_cpus);
+   } else {
+   /* Otherwise default to all CPUs */
+   cpumask_setall(&pmu->supported_cpus);
+   }
+   }
  
  	/* If we matched up the IRQ affinities, use them to route the SPIs */
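
The fallback this hunk adds to of_pmu_irq_cfg() can be sketched as a small model (plain Python with hypothetical parameter names, not kernel code): keep any CPUs parsed from the interrupt-affinity property; otherwise use the PPI partition's affinity, or default to every CPU when the PMU uses SPIs.

```python
def supported_cpus(parsed, using_spi, partition, all_cpus):
    """Mirror the fallback in of_pmu_irq_cfg(): prefer CPUs parsed from
    DT; for PPIs fall back to the partition affinity (as
    irq_get_percpu_devid_partition() does); otherwise default to all
    CPUs (as cpumask_setall() does)."""
    if parsed:
        return set(parsed)
    if not using_spi:
        return set(partition)   # PPI: affinity from the interrupt specifier
    return set(all_cpus)        # SPI: claim to support every CPU
```

This is only a decision-logic model; the real code operates on cpumasks in place rather than returning a new set.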

   


Re: [RFC PATCH 1/2] net: ethernet: xilinx: Add gmii2rgmii converter support

2016-07-01 Thread Florian Fainelli
On 30/06/2016 23:20, Kedareswara rao Appana wrote:
> This patch adds support for the gmii2rgmii converter.
> 
> The GMII to RGMII IP core provides the Reduced Gigabit Media
> Independent Interface (RGMII) between Ethernet physical media
> devices and the Gigabit Ethernet controller. This core can
> switch dynamically between the three different speed modes of
> operation.
> The MDIO interface is used to set the operating speed of the Ethernet MAC.
> 
> Signed-off-by: Kedareswara rao Appana 
> ---
>  drivers/net/ethernet/xilinx/Kconfig |7 ++
>  drivers/net/ethernet/xilinx/Makefile|1 +
>  drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c |   76 
> +++
>  include/linux/xilinx_gmii2rgmii.h   |   24 +++
>  4 files changed, 108 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c
>  create mode 100644 include/linux/xilinx_gmii2rgmii.h
> 
> diff --git a/drivers/net/ethernet/xilinx/Kconfig 
> b/drivers/net/ethernet/xilinx/Kconfig
> index 4f5c024..d7df70a 100644
> --- a/drivers/net/ethernet/xilinx/Kconfig
> +++ b/drivers/net/ethernet/xilinx/Kconfig
> @@ -39,4 +39,11 @@ config XILINX_LL_TEMAC
> This driver supports the Xilinx 10/100/1000 LocalLink TEMAC
> core used in Xilinx Spartan and Virtex FPGAs
>  
> +config XILINX_GMII2RGMII
> + tristate "Xilinx GMII2RGMII converter driver"
> + ---help---
> +   This driver supports the Xilinx GMII to RGMII IP core. It provides
> +   the Reduced Gigabit Media Independent Interface (RGMII) between
> +   Ethernet physical media devices and the Gigabit Ethernet controller.
> +
>  endif # NET_VENDOR_XILINX
> diff --git a/drivers/net/ethernet/xilinx/Makefile 
> b/drivers/net/ethernet/xilinx/Makefile
> index 214205e..bca0da0 100644
> --- a/drivers/net/ethernet/xilinx/Makefile
> +++ b/drivers/net/ethernet/xilinx/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_XILINX_LL_TEMAC) += ll_temac.o
>  obj-$(CONFIG_XILINX_EMACLITE) += xilinx_emaclite.o
>  xilinx_emac-objs := xilinx_axienet_main.o xilinx_axienet_mdio.o
>  obj-$(CONFIG_XILINX_AXI_EMAC) += xilinx_emac.o
> +obj-$(CONFIG_XILINX_GMII2RGMII) += xilinx_gmii2rgmii.o
> diff --git a/drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c 
> b/drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c
> new file mode 100644
> index 000..ca9f1ad
> --- /dev/null
> +++ b/drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c
> @@ -0,0 +1,76 @@
> +/* Xilinx GMII2RGMII Converter driver
> + *
> + * Copyright (C) 2016 Xilinx, Inc.
> + *
> + * Author: Kedareswara rao Appana 
> + *
> + * Description:
> + * This driver is developed for Xilinx GMII2RGMII Converter
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static void xgmii2rgmii_fix_mac_speed(void *priv, unsigned int speed)
> +{
> + struct gmii2rgmii *xphy = (struct xphy *)priv;

Why not pass struct xphy pointer directly?

> + struct phy_device *gmii2rgmii_phydev = xphy->gmii2rgmii_phy_dev;
> + u16 gmii2rgmii_reg = 0;
> +
> + switch (speed) {
> + case 1000:
> + gmii2rgmii_reg |= XILINX_GMII2RGMII_SPEED1000;
> + break;
> + case 100:
> + gmii2rgmii_reg |= XILINX_GMII2RGMII_SPEED100;
> + break;
> + default:
> + return;
> + }
> +
> + xphy->mdio_write(xphy->mii_bus, gmii2rgmii_phydev->mdio.addr,
> +  XILINX_GMII2RGMII_REG_NUM,
> +  gmii2rgmii_reg);
> +}
> +
> +int gmii2rgmii_phyprobe(struct gmii2rgmii *xphy)
> +{
> + struct device_node *phy_node;
> + struct phy_device *phydev;
> + struct device_node *np = (struct device_node *)xphy->platform_data;
> +
> + phy_node = of_parse_phandle(np, "gmii2rgmii-phy-handle", 0);

Is that property documented in a binding document?

> + if (phy_node) {

Should not there be an else clause which does not assign
xphy->fix_mac_speed in case this property lookup fails?

> + phydev = of_phy_attach(xphy->dev, phy_node, 0, 0);
> + if (!phydev) {
> + netdev_err(xphy->dev,
> +"%s: no gmii to rgmii converter found\n",
> +xphy->dev->name);
> + return -1;
> + }
> + xphy->gmii2rgmii_phy_dev = phydev;
> + }
> + 
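The speed-to-register mapping in xgmii2rgmii_fix_mac_speed() quoted above amounts to a small lookup with an early return for unsupported speeds. A sketch in Python (the bit values here are made-up placeholders; the real ones are the XILINX_GMII2RGMII_* macros from the driver's header):

```python
# Hypothetical placeholder values for the converter's speed bits; the
# driver's real constants live in include/linux/xilinx_gmii2rgmii.h.
SPEED1000 = 0x0040
SPEED100 = 0x2000

def speed_to_reg(speed: int):
    """Mirror the switch in xgmii2rgmii_fix_mac_speed(): map a link
    speed to the register value to write over MDIO, or None for
    unsupported speeds (the C code simply returns without writing)."""
    return {1000: SPEED1000, 100: SPEED100}.get(speed)
```

A table-driven lookup like this keeps the supported speeds in one place, which is why review comments on such drivers often suggest it over an open-coded switch.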

Re: [PATCH v2 3/3] drivers core: allow id match override when manually binding driver

2016-07-01 Thread Mark Brown
On Fri, Jul 01, 2016 at 10:58:34AM +0200, Michal Suchanek wrote:
> On 1 July 2016 at 10:25, Mark Brown  wrote:

> > It's been repeatedly suggested to you that the tooling for this stuff
> > could use some work.  Please go and put some effort into that rather
> > than continuing this thread which is accomplishing nothing.

> You completely miss the point. No tooling will make people reconfigure
> the kernel when the configuration in fact stays the same.

> Sure the tooling does need work. And it would help getting the cases
> when the tooling is NOT needed out of the way.

I understand the problem perfectly, no amount of repeating yourself is
going to change the problems that the bodge you are trying to force in
creates for other users and the maintainability of the system.




Re: [RFC PATCH 1/2] net: ethernet: xilinx: Add gmii2rgmii converter support

2016-07-01 Thread Florian Fainelli
Le 30/06/2016 23:20, Kedareswara rao Appana a écrit :
> This patch adds support for gmii2rgmii converter.
> 
> The GMII to RGMII IP core provides the Reduced Gigabit Media
> Independent Interface (RGMII) between Ethernet physical media
> Devices and the Gigabit Ethernet controller. This core can
> switch dynamically between the three different speed modes of
> Operation.
> MDIO interface is used to set operating speed of Ethernet MAC
> 
> Signed-off-by: Kedareswara rao Appana 
> ---
>  drivers/net/ethernet/xilinx/Kconfig |7 ++
>  drivers/net/ethernet/xilinx/Makefile|1 +
>  drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c |   76 
> +++
>  include/linux/xilinx_gmii2rgmii.h   |   24 +++
>  4 files changed, 108 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c
>  create mode 100644 include/linux/xilinx_gmii2rgmii.h
> 
> diff --git a/drivers/net/ethernet/xilinx/Kconfig b/drivers/net/ethernet/xilinx/Kconfig
> index 4f5c024..d7df70a 100644
> --- a/drivers/net/ethernet/xilinx/Kconfig
> +++ b/drivers/net/ethernet/xilinx/Kconfig
> @@ -39,4 +39,11 @@ config XILINX_LL_TEMAC
> This driver supports the Xilinx 10/100/1000 LocalLink TEMAC
> core used in Xilinx Spartan and Virtex FPGAs
>  
> +config XILINX_GMII2RGMII
> + tristate "Xilinx GMII2RGMII converter driver"
> + ---help---
> +   This driver support xilinx GMII to RGMII IP core it provides
> +   the Reduced Gigabit Media Independent Interface(RGMII) between
> +   Ethernet physical media devices and the Gigabit Ethernet controller.
> +
>  endif # NET_VENDOR_XILINX
> diff --git a/drivers/net/ethernet/xilinx/Makefile b/drivers/net/ethernet/xilinx/Makefile
> index 214205e..bca0da0 100644
> --- a/drivers/net/ethernet/xilinx/Makefile
> +++ b/drivers/net/ethernet/xilinx/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_XILINX_LL_TEMAC) += ll_temac.o
>  obj-$(CONFIG_XILINX_EMACLITE) += xilinx_emaclite.o
>  xilinx_emac-objs := xilinx_axienet_main.o xilinx_axienet_mdio.o
>  obj-$(CONFIG_XILINX_AXI_EMAC) += xilinx_emac.o
> +obj-$(CONFIG_XILINX_GMII2RGMII) += xilinx_gmii2rgmii.o
> diff --git a/drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c b/drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c
> new file mode 100644
> index 000..ca9f1ad
> --- /dev/null
> +++ b/drivers/net/ethernet/xilinx/xilinx_gmii2rgmii.c
> @@ -0,0 +1,76 @@
> +/* Xilinx GMII2RGMII Converter driver
> + *
> + * Copyright (C) 2016 Xilinx, Inc.
> + *
> + * Author: Kedareswara rao Appana 
> + *
> + * Description:
> + * This driver is developed for Xilinx GMII2RGMII Converter
> + *
> + * This program is free software: you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation, either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static void xgmii2rgmii_fix_mac_speed(void *priv, unsigned int speed)
> +{
> + struct gmii2rgmii *xphy = (struct xphy *)priv;

Why not pass struct xphy pointer directly?

> + struct phy_device *gmii2rgmii_phydev = xphy->gmii2rgmii_phy_dev;
> + u16 gmii2rgmii_reg = 0;
> +
> + switch (speed) {
> + case 1000:
> + gmii2rgmii_reg |= XILINX_GMII2RGMII_SPEED1000;
> + break;
> + case 100:
> + gmii2rgmii_reg |= XILINX_GMII2RGMII_SPEED100;
> + break;
> + default:
> + return;
> + }
> +
> + xphy->mdio_write(xphy->mii_bus, gmii2rgmii_phydev->mdio.addr,
> +  XILINX_GMII2RGMII_REG_NUM,
> +  gmii2rgmii_reg);
> +}
> +
> +int gmii2rgmii_phyprobe(struct gmii2rgmii *xphy)
> +{
> + struct device_node *phy_node;
> + struct phy_device *phydev;
> + struct device_node *np = (struct device_node *)xphy->platform_data;
> +
> + phy_node = of_parse_phandle(np, "gmii2rgmii-phy-handle", 0);

Is that property documented in a binding document?

> + if (phy_node) {

Should not there be an else clause which does not assign
xphy->fix_mac_speed in case this property lookup fails?

> + phydev = of_phy_attach(xphy->dev, phy_node, 0, 0);
> + if (!phydev) {
> + netdev_err(xphy->dev,
> +"%s: no gmii to rgmii converter found\n",
> +xphy->dev->name);
> + return -1;
> + }
> + xphy->gmii2rgmii_phy_dev = phydev;
> + }
> + xphy->fix_mac_speed = xgmii2rgmii_fix_mac_speed;
> +
> +  

Re: [PATCH v3] vmlinux.lds: account for destructor sections

2016-07-01 Thread Andrey Ryabinin
2016-06-29 20:28 GMT+03:00 Dmitry Vyukov :
> On Fri, Jun 24, 2016 at 6:57 PM, Andrey Ryabinin  wrote:
>> 2016-06-24 18:39 GMT+03:00 Dmitry Vyukov :
>>> If CONFIG_KASAN is enabled and gcc is configured with
>>> --disable-initfini-array and/or gold linker is used,
>>> gcc emits .ctors/.dtors and .text.startup/.text.exit
>>> sections instead of .init_array/.fini_array.
>>> .dtors section is not explicitly accounted in the linker
>>> script and messes vvar/percpu layout. Want:
>>>
>>> 822bfd80 D _edata
>>> 822c D __vvar_beginning_hack
>>> 822c A __vvar_page
>>> 822c0080 0098 D vsyscall_gtod_data
>>> 822c1000 A __init_begin
>>> 822c1000 D init_per_cpu__irq_stack_union
>>> 822c1000 A __per_cpu_load
>>> 822d3000 D init_per_cpu__gdt_page
>>>
>>> Got:
>>>
>>> 8279a600 D _edata
>>> 8279b000 A __vvar_page
>>> 8279c000 A __init_begin
>>> 8279c000 D init_per_cpu__irq_stack_union
>>> 8279c000 A __per_cpu_load
>>> 8279e000 D __vvar_beginning_hack
>>> 8279e080 0098 D vsyscall_gtod_data
>>> 827ae000 D init_per_cpu__gdt_page
>>>
>>> This happens because __vvar_page and .vvar get different
>>> addresses in arch/x86/kernel/vmlinux.lds.S:
>>>
>>> . = ALIGN(PAGE_SIZE);
>>> __vvar_page = .;
>>>
>>> .vvar : AT(ADDR(.vvar) - LOAD_OFFSET) {
>>> /* work around gold bug 13023 */
>>> __vvar_beginning_hack = .;
>>>
>>> Discard .dtors/.fini_array/.text.exit, since we don't call dtors.
>>> Merge .text.startup into init text.
>>>
>>> Signed-off-by: Dmitry Vyukov 
>>
>> Reviewed-by: Andrey Ryabinin 
>
>
> Who can take it to some tree?

akpm tends to be the maintainer of last resort.
But you probably need to resend the patch, because LKML was not in the CC list.
Also, please add stable tag: Cc:  # v4.0+
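In linker-script terms, the fix Dmitry describes looks roughly like the following sketch. Section names are the ones named in the discussion; the placement is an assumption about the generic `vmlinux.lds.h` conventions, not the literal patch.

```
/DISCARD/ : {
        *(.dtors .dtors.*)              /* the kernel never runs destructors */
        *(.fini_array .fini_array.*)
        *(.text.exit .text.exit.*)
}

.init.text : {
        *(.text.startup)                /* fold gcc startup text into init text */
}
```

Once these sections are explicitly placed or discarded, they can no longer float between `_edata` and `__vvar_page` and shift the vvar/percpu layout.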


Re: [PATCH 3/4] sched/fair: Add REBALANCE_AFFINITY rebalancing code

2016-07-01 Thread Peter Zijlstra
On Fri, Jul 01, 2016 at 09:15:55AM -0500, James Hartsock wrote:
> On Fri, Jul 1, 2016 at 3:24 AM, Peter Zijlstra  wrote:
> 
> > On Fri, Jul 01, 2016 at 09:35:46AM +0200, Jiri Olsa wrote:
> > > well this is issue our partner met in the setup,
> > > and I'm not sure what was their motivation for that,
> > > perhaps James could clarify in here..
> > >
> > > I tried to make the 'scratch that itch' solution as
> > > mentioned in earlier discussion.
> > >
> > > So IIUIC what you say is that it needs to be more generic
> > > solution..?  I'll continue staring at it then ;-)
> >
> > I just want to know what problem we're trying to solve..
> >
> > Because it appears this is running 1 task each on a 'weird' subset of
> > cpus and things going badly. If this really is the case, then teaching
> > active balance to only move tasks to idle cpus or something should also
> > cure things.
> >
> > Also, I'm curious why people set such weird masks.
> >
> 
> I think the original issue was reported/seen when using a straight range
> of CPUs, but in that range they crossed NUMA nodes.  Then, not knowing
> what was triggering the issue and trying to reproduce it, we started
> getting some crazy masks.
> 
> The work-around the customer has been using is SCHED_RR, as it doesn't
> have this balance-across-NUMA issue.  But it is also the fact that
> SCHED_RR doesn't have this issue that makes it "seem" like a defect in
> SCHED_OTHER.  I have shared with the customer that this is triggered by
> the across-NUMA-node taskset, so they are aware of that.  But if this is
> something that is seen as a limitation of SCHED_OTHER and not reasonable
> to address upstream, I think it is at least something we should get
> documented.

But what exact usecase? A single task per cpu, or something else?

Note that RR has different constraints than OTHER, but in both cases
having skewed masks across a topology divide is unlikely to be good for
performance.

Esp. in the extreme case reported here, where one task is on an entirely
different node than the rest of them, that task will cause cacheline
transfers between the nodes slowing down itself as well as all the other
tasks that have to pull it back in.


Re: [PATCH v4 09/29] fork: Add generic vmalloced stack support

2016-07-01 Thread Borislav Petkov
On Sun, Jun 26, 2016 at 02:55:31PM -0700, Andy Lutomirski wrote:
> If CONFIG_VMAP_STACK is selected, kernel stacks are allocated with
> vmalloc_node.
> 
> grsecurity has had a similar feature (called
> GRKERNSEC_KSTACKOVERFLOW) for a long time.
> 
> Cc: Oleg Nesterov 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/Kconfig| 29 +
>  arch/ia64/include/asm/thread_info.h |  2 +-
>  include/linux/sched.h   | 15 +++
>  kernel/fork.c   | 87 +
>  4 files changed, 113 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 15996290fed4..18a2c3a7b460 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -661,4 +661,33 @@ config ARCH_NO_COHERENT_DMA_MMAP
>  config CPU_NO_EFFICIENT_FFS
>   def_bool n
>  
> +config HAVE_ARCH_VMAP_STACK
> + def_bool n
> + help
> +   An arch should select this symbol if it can support kernel stacks
> +   in vmalloc space.  This means:
> +
> +   - vmalloc space must be large enough to hold many kernel stacks.
> + This may rule out many 32-bit architectures.
> +
> +   - Stacks in vmalloc space need to work reliably.  For example, if
> + vmap page tables are created on demand, either this mechanism
> + needs to work while the stack points to a virtual address with
> + unpopulated page tables or arch code (switch_to and switch_mm,
> + most likely) needs to ensure that the stack's page table entries
> + are populated before running on a possibly unpopulated stack.
> +
> +   - If the stack overflows into a guard page, something reasonable
> + should happen.  The definition of "reasonable" is flexible, but
> + instantly rebooting without logging anything would be unfriendly.

Nice, I wish more people would actually *explain* their Kconfig options
properly.

...

> diff --git a/kernel/fork.c b/kernel/fork.c
> index 146c9840c079..06761de69360 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -158,19 +158,37 @@ void __weak arch_release_thread_stack(unsigned long *stack)
>   * Allocate pages if THREAD_SIZE is >= PAGE_SIZE, otherwise use a
>   * kmemcache based allocator.
>   */
> -# if THREAD_SIZE >= PAGE_SIZE
> -static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
> -   int node)
> +# if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
> +static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
>  {
> +#ifdef CONFIG_VMAP_STACK
> + void *stack = __vmalloc_node_range(
> + THREAD_SIZE, THREAD_SIZE, VMALLOC_START, VMALLOC_END,
> + THREADINFO_GFP | __GFP_HIGHMEM, PAGE_KERNEL,
> + 0, node, __builtin_return_address(0));

Reformat:

void *stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
                                   VMALLOC_START, VMALLOC_END,
                                   THREADINFO_GFP | __GFP_HIGHMEM,
                                   PAGE_KERNEL,
                                   0, node, __builtin_return_address(0));


> +
> + /*
> +  * We can't call find_vm_area() in interrupt context, and
> +  * free_thread_info can be called in interrupt context, so cache

free_thread_stack() ?

> +  * the vm_struct.
> +  */
> + if (stack)
> + tsk->stack_vm_area = find_vm_area(stack);
> + return stack;
> +#else
>   struct page *page = alloc_kmem_pages_node(node, THREADINFO_GFP,
> THREAD_SIZE_ORDER);
>  
>   return page ? page_address(page) : NULL;
> +#endif
>  }
>  
> -static inline void free_thread_stack(unsigned long *stack)
> +static inline void free_thread_stack(struct task_struct *tsk)
>  {
> - free_kmem_pages((unsigned long)stack, THREAD_SIZE_ORDER);
> + if (task_stack_vm_area(tsk))
> + vfree(tsk->stack);
> + else
> + free_kmem_pages((unsigned long)tsk->stack, THREAD_SIZE_ORDER);
>  }
>  # else
>  static struct kmem_cache *thread_stack_cache;
> @@ -181,9 +199,9 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk,
>   return kmem_cache_alloc_node(thread_stack_cache, THREADINFO_GFP, node);
>  }
>  
> -static void free_thread_stack(unsigned long *stack)
> +static void free_thread_stack(struct task_struct *tsk)
>  {
> - kmem_cache_free(thread_stack_cache, stack);
> + kmem_cache_free(thread_stack_cache, tsk->stack);
>  }
>  
>  void thread_stack_cache_init(void)
> @@ -213,24 +231,49 @@ struct kmem_cache *vm_area_cachep;
>  /* SLAB cache for mm_struct structures (tsk->mm) */
>  static struct kmem_cache *mm_cachep;
>  
> -static void account_kernel_stack(unsigned long *stack, int account)
> +static void account_kernel_stack(struct task_struct *tsk, int account)
>  {
> - /* All stack pages are in the same zone and 

Re: mfd: dm355evm_msp: Refactoring for add_child()

2016-07-01 Thread SF Markus Elfring
> FYI, code looks fine.

Thanks for your acknowledgement.


> ... but please take this opportunity to set-up your submission
> environment i.e. using `git format-patch` and `git send-email`.


Would you like to see any special settings to be mentioned
in a section like "15) Explicit In-Reply-To headers" from
the document "SubmittingPatches"?



> you've done that, please re-submit this patch with my:

Does the association of this patch with a somewhat relevant discussion
really hinder the desired commit?

Regards,
Markus
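The submission environment Lee asks for above can be sketched in a few commands. Paths, the empty stand-in commit, and the list/message-id placeholders are illustrative, not taken from the thread.

```shell
# A sketch of the `git format-patch` / `git send-email` workflow.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# An empty commit stands in for the real refactoring patch.
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "mfd: dm355evm_msp: Refactoring for add_child()"
git format-patch -1 HEAD -o out/        # writes out/0001-mfd-...patch
ls out/
# Then send it threaded into the existing discussion:
#   git send-email --to=<list> --in-reply-to='<message-id>' out/0001-*.patch
```

The `--in-reply-to` option is what the "Explicit In-Reply-To headers" section of SubmittingPatches refers to: it threads the resent patch under the earlier discussion without any manual header editing.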


Re: [PATCH v1 06/11] KVM: x86: use hardware-compatible format for APIC ID register

2016-07-01 Thread Radim Krčmář
2016-07-01 16:12+0200, Paolo Bonzini:
> On 01/07/2016 15:11, Radim Krčmář wrote:
 >> +static void __kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
 >> +   struct kvm_lapic_state *s, bool set)
 >> +{
 >> +   if (apic_x2apic_mode(vcpu->arch.apic)) {
 >> +   u32 *id = (u32 *)(s->regs + APIC_ID);
 >> +   if (set)
 >> +   *id >>= 24;
 >> +   else
 >> +   *id <<= 24;
 >> +   }
>>> > 
>>> > When setting, this should read from the apic_base being set.  So I think
>>> > you cannot use apic_x2apic_mode.
>> 
>> apic_x2apic_mode uses apic_base MSR, so its value does not depend on
>> LAPIC_SET/GET.  I don't like the dependency much, but all combinations
>> of values/orders should work well.
> 
> You're right of course.  However it should be documented in the
> KVM_GET_LAPIC/KVM_SET_LAPIC ioctl docs that KVM_SET_MSR for apic_base
> should be performed first.

Ok, that will make our life easier.

> Should kvm_lapic_set_base change the value of the ID register when
> x2apic mode is left just like we do for entering x2apic mode?

Yes, the patch does it.  The only possible change from x2APIC mode is to
hardware-disabled mode, and a re-enabled xAPIC starts with the initial
APIC ID in the upper 8 bits.

>(Hint: we
> want kvm-unit-tests for this).

:) Something like this?

  enable_xapic()
  id = apic_id()
  set_apic_id(id+1) // ?
  enable_x2apic()
  id == apic_id() & 0xff
  disable_apic()
  enable_xapic()
  id == apic_id()

