Re: [PATCH 18/22] perf scripts python: exported-sql-viewer.py: Add IPC information to the Branch reports

2019-06-02 Thread Adrian Hunter
On 31/05/19 7:44 PM, Arnaldo Carvalho de Melo wrote:
> On Mon, May 20, 2019 at 02:37:24PM +0300, Adrian Hunter wrote:
>> Enhance the "All branches" and "Selected branches" reports to display IPC
>> information if it is available.
> 
> So, testing this I noticed that it all starts with the left arrow in every
> line, that should mean there is some tree there, i.e. look at all those ▶
> symbols:
> 
> Time              CPU  Command          PID    TID    Branch Type  In Tx  Insn Cnt  Cyc Cnt  IPC   Branch
> ▶ 187836112195670  7   simple-retpolin  23003  23003  trace begin  No     0         0        0     0 unknown (unknown) -> 7f6f33d4f110 _start (ld-2.28.so)
> ▶ 187836112195987  7   simple-retpolin  23003  23003  trace end    No     0         883      0     7f6f33d4f110 _start (ld-2.28.so) -> 0 unknown (unknown)
> ▶ 187836112199189  7   simple-retpolin  23003  23003  trace begin  No     0         0        0     0 unknown (unknown) -> 7f6f33d4f110 _start (ld-2.28.so)
> ▶ 187836112199189  7   simple-retpolin  23003  23003  call         No     0         0        0     7f6f33d4f113 _start+0x3 (ld-2.28.so) -> 7f6f33d4ff50 _dl_start (ld-2.28.so)
> ▶ 187836112199544  7   simple-retpolin  23003  23003  trace end    No     17        996      0.02  7f6f33d4ff73 _dl_start+0x23 (ld-2.28.so) -> 0 unknown (unknown)
> ▶ 187836112200939  7   simple-retpolin  23003  23003  trace begin  No     0         0        0     0 unknown (unknown) -> 7f6f33d4ff73 _dl_start+0x23 (ld-2.28.so)
> ▶ 187836112201229  7   simple-retpolin  23003  23003  trace end    No     1         816      0.00  7f6f33d4ff7a _dl_start+0x2a (ld-2.28.so) -> 0 unknown (unknown)
> ▶ 187836112203500  7   simple-retpolin  23003  23003  trace begin  No     0         0        0     0 unknown (unknown) -> 7f6f33d4ff7a _dl_start+0x2a (ld-2.28.so)
> 
> But if you click on it, that ▶ disappears and a new click doesn't make it
> reappear, looks buggy, but seems like a minor oddity that will not prevent me
> from applying it now, please check and provide a fix on top of this,

The arrow is there to display disassembly, but only if xed is installed and
the object is in the build-id cache.  Unfortunately, it is not efficient to
determine whether there is anything to expand before the user clicks.
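
For reference (a stock-perf sketch of mine, not from the original mail): the
object can be added to the build-id cache with, e.g.:

  perf buildid-cache --add /lib64/ld-2.28.so

after which, with xed installed, clicking the arrow expands the branch into
its disassembly.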


RE: [External] Re: linux kernel page allocation failure and tuning of page cache

2019-06-02 Thread Nagal, Amit UTC CCS



-----Original Message-----
From: Matthew Wilcox [mailto:wi...@infradead.org] 
Sent: Saturday, June 1, 2019 1:01 AM
To: Nagal, Amit UTC CCS 
Cc: linux-kernel@vger.kernel.org; linux...@kvack.org; CHAWLA, RITU UTC CCS 
; net...@vger.kernel.org
Subject: [External] Re: linux kernel page allocation failure and tuning of page 
cache

> 1) The platform is a low-memory platform with 64MB of memory.
> 
> 2) We are doing around a 45MB TCP data transfer from PC to target using the
> netcat utility. On the target, a process receives data over a socket and
> writes the data to a flash disk.

>I think your network is faster than your disk ...

Ok. I need to check it. But how does this affect the page reclaim procedure?

> 5) Sometimes we observed kernel memory getting exhausted, as a page allocation
> failure happens in the kernel with the backtrace printed below:
> # [  775.947949] nc.traditional: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)

>We're in the soft interrupt handler at this point, so we have very few options 
>for freeing memory; we can't wait for I/O to complete, for example.

>That said, this is a TCP connection.  We could drop the packet silently 
>without such a noisy warning.  Perhaps just collect statistics on how many 
>packets we dropped due to a low memory situation.

I will collect statistics for it.

> [  775.956362] CPU: 0 PID: 1288 Comm: nc.traditional Tainted: G   O    4.9.123-pic6-g31a13de-dirty #19
> [  775.966085] Hardware name: Generic R7S72100 (Flattened Device Tree)
> [  775.972501] [] (unwind_backtrace) from [] (show_stack+0xb/0xc)
> [  775.980118] [] (show_stack) from [] (warn_alloc+0x89/0xba)
> [  775.987361] [] (warn_alloc) from [] (__alloc_pages_nodemask+0x1eb/0x634)
> [  775.995790] [] (__alloc_pages_nodemask) from [] (__alloc_page_frag+0x39/0xde)
> [  776.004685] [] (__alloc_page_frag) from [] (__netdev_alloc_skb+0x51/0xb0)
> [  776.013217] [] (__netdev_alloc_skb) from [] (sh_eth_poll+0xbf/0x3c0)
> [  776.021342] [] (sh_eth_poll) from [] (net_rx_action+0x77/0x170)
> [  776.029051] [] (net_rx_action) from [] (__do_softirq+0x107/0x160)
> [  776.036896] [] (__do_softirq) from [] (irq_exit+0x5d/0x80)
> [  776.044165] [] (irq_exit) from [] (__handle_domain_irq+0x57/0x8c)
> [  776.052007] [] (__handle_domain_irq) from [] (gic_handle_irq+0x31/0x48)
> [  776.060362] [] (gic_handle_irq) from [] (__irq_svc+0x65/0xac)
> [  776.067835] Exception stack(0xc1cafd70 to 0xc1cafdb8)
> [  776.072876] fd60: 0002751c c1dec6a0 000c 521c3be5
> [  776.081042] fd80: 56feb08e f64823a6 ffb35f7b feab513d f9cb0643 056c c1caff10 e000
> [  776.089204] fda0: b1f49160 c1cafdc4 c180c677 c0234ace 200e0033
> [  776.095816] [] (__irq_svc) from [] (__copy_to_user_std+0x7e/0x430)
> [  776.103796] [] (__copy_to_user_std) from [] (copy_page_to_iter+0x105/0x250)
> [  776.112503] [] (copy_page_to_iter) from [] (skb_copy_datagram_iter+0xa3/0x108)
> [  776.121469] [] (skb_copy_datagram_iter) from [] (tcp_recvmsg+0x3ab/0x5f4)
> [  776.130045] [] (tcp_recvmsg) from [] (inet_recvmsg+0x21/0x2c)
> [  776.137576] [] (inet_recvmsg) from [] (sock_read_iter+0x51/0x6e)
> [  776.145384] [] (sock_read_iter) from [] (__vfs_read+0x97/0xb0)
> [  776.152967] [] (__vfs_read) from [] (vfs_read+0x51/0xb0)
> [  776.159983] [] (vfs_read) from [] (SyS_read+0x27/0x52)
> [  776.166837] [] (SyS_read) from [] (ret_fast_syscall+0x1/0x54)
> [  776.174308] Mem-Info:
> [  776.176650] active_anon:2037 inactive_anon:23 isolated_anon:0
> [  776.176650]  active_file:2636 inactive_file:7391 isolated_file:32
> [  776.176650]  unevictable:0 dirty:1366 writeback:1281 unstable:0

>Almost all the dirty pages are under writeback at this point.

> [  776.176650]  slab_reclaimable:719 slab_unreclaimable:724
> [  776.176650]  mapped:1990 shmem:26 pagetables:159 bounce:0
> [  776.176650]  free:373 free_pcp:6 free_cma:0

>We have 373 free pages, but refused to allocate one of them to GFP_ATOMIC?
>I don't understand why that failed.  We also didn't try to steal an 
>inactive_file or inactive_anon page, which seems like an obvious thing we 
>might want to do.

Yes, that's where I am concerned. We do not have a swap device, so I am
assuming that inactive_anon pages are not stolen, but inactive_file pages
could have been used.

> [  776.209062] Node 0 active_anon:8148kB inactive_anon:92kB active_file:10544kB inactive_file:29564kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:7960kB dirty:5464kB writeback:5124kB shmem:104kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
> [  776.233602] Normal free:1492kB min:964kB low:1204kB high:1444kB active_anon:8148kB inactive_anon:92kB active_file:10544kB inactive_file:29564kB unevictable:0kB writepending:10588kB present:65536kB managed:59304kB mlocked:0kB slab_reclaimable:2876kB slab_unreclaimable:2896kB kernel_stack:1152kB 

[PATCH v1 2/4] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM

2019-06-02 Thread Minchan Kim
The local variable 'references' in shrink_page_list defaults to
PAGEREF_RECLAIM_CLEAN. That prevents dirty pages from being reclaimed when
CMA tries to migrate pages. Strictly speaking, we don't need it, because
CMA already disallows writeout via .may_writepage = 0 in
reclaim_clean_pages_from_list.

Moreover, the default prevents anonymous pages from being swapped out even
when force_reclaim = true in shrink_page_list, which an upcoming patch
relies on. So this patch changes the default value of 'references' to
PAGEREF_RECLAIM and renames force_reclaim to ignore_references to make the
intent clearer.

This is preparatory work for the next patch.

* RFCv1
 * use ignore_references as parameter name - hannes

Acked-by: Johannes Weiner 
Signed-off-by: Minchan Kim 
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 84dcb651d05c..0973a46a0472 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
  struct scan_control *sc,
  enum ttu_flags ttu_flags,
  struct reclaim_stat *stat,
- bool force_reclaim)
+ bool ignore_references)
 {
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
struct address_space *mapping;
struct page *page;
int may_enter_fs;
-   enum page_references references = PAGEREF_RECLAIM_CLEAN;
+   enum page_references references = PAGEREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;
 
@@ -1247,7 +1247,7 @@ static unsigned long shrink_page_list(struct list_head 
*page_list,
}
}
 
-   if (!force_reclaim)
+   if (!ignore_references)
references = page_check_references(page, sc);
 
switch (references) {
-- 
2.22.0.rc1.311.g5d7573a151-goog



[PATCH v1 3/4] mm: account nr_isolated_xxx in [isolate|putback]_lru_page

2019-06-02 Thread Minchan Kim
The isolated-page counting uses a per-CPU counter, so batching the updates
would not be a huge gain. Rather than complicating the code to batch them,
let's make it more straightforward by adding the counting logic into the
[isolate|putback]_lru_page APIs.

Link: http://lkml.kernel.org/r/20190531165927.ga20...@cmpxchg.org
Suggested-by: Johannes Weiner 
Signed-off-by: Minchan Kim 
---
 mm/compaction.c |  2 --
 mm/gup.c|  7 +--
 mm/khugepaged.c |  3 ---
 mm/memory-failure.c |  3 ---
 mm/memory_hotplug.c |  4 
 mm/mempolicy.c  |  6 +-
 mm/migrate.c| 37 -
 mm/vmscan.c | 22 --
 8 files changed, 26 insertions(+), 58 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9e1b9acb116b..c6591682deda 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -982,8 +982,6 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
/* Successfully isolated */
del_page_from_lru_list(page, lruvec, page_lru(page));
-   inc_node_page_state(page,
-   NR_ISOLATED_ANON + page_is_file_cache(page));
 
 isolate_success:
list_add(&page->lru, &cc->migratepages);
diff --git a/mm/gup.c b/mm/gup.c
index 63ac50e48072..2d9a9bc358c7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1360,13 +1360,8 @@ static long check_and_migrate_cma_pages(struct 
task_struct *tsk,
drain_allow = false;
}
 
-   if (!isolate_lru_page(head)) {
+   if (!isolate_lru_page(head))
    list_add_tail(&head->lru, &cma_page_list);
-   mod_node_page_state(page_pgdat(head),
-   NR_ISOLATED_ANON +
-   page_is_file_cache(head),
-   hpage_nr_pages(head));
-   }
}
}
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index a335f7c1fac4..3359df994fb4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -503,7 +503,6 @@ void __khugepaged_exit(struct mm_struct *mm)
 
 static void release_pte_page(struct page *page)
 {
-   dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
unlock_page(page);
putback_lru_page(page);
 }
@@ -602,8 +601,6 @@ static int __collapse_huge_page_isolate(struct 
vm_area_struct *vma,
result = SCAN_DEL_PAGE_LRU;
goto out;
}
-   inc_node_page_state(page,
-   NR_ISOLATED_ANON + page_is_file_cache(page));
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index bc749265a8f3..2187bad7ceff 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1796,9 +1796,6 @@ static int __soft_offline_page(struct page *page, int 
flags)
 * so use !__PageMovable instead for LRU page's mapping
 * cannot have PAGE_MAPPING_MOVABLE.
 */
-   if (!__PageMovable(page))
-   inc_node_page_state(page, NR_ISOLATED_ANON +
-   page_is_file_cache(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a88c5f334e5a..a41bea24d0c9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1390,10 +1390,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long 
end_pfn)
ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
if (!ret) { /* Success */
list_add_tail(&page->lru, &source);
-   if (!__PageMovable(page))
-   inc_node_page_state(page, NR_ISOLATED_ANON +
-   page_is_file_cache(page));
-
} else {
pr_warn("failed to isolate pfn %lx\n", pfn);
dump_page(page, "isolation failed");
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5b3bf1747c19..cfb0590f69bb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -948,12 +948,8 @@ static void migrate_page_add(struct page *page, struct 
list_head *pagelist,
 * Avoid migrating a page that is shared with others.
 */
if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
-   if (!isolate_lru_page(head)) {
+   if (!isolate_lru_page(head))
list_add_tail(&head->lru, pagelist);
- 

[PATCH v1 0/4] Introduce MADV_COLD and MADV_PAGEOUT

2019-06-02 Thread Minchan Kim
This patch is part of a previous series:
https://lore.kernel.org/lkml/20190531064313.193437-1-minc...@kernel.org/T/#u
Originally, it was created for an external madvise hinting feature.

https://lkml.org/lkml/2019/5/31/463
Michal wanted to separate the discussion from the external hinting interface,
so this patchset includes only the first part of my entire patchset:
  - introduce MADV_COLD and MADV_PAGEOUT hints to madvise.

However, I keep the entire description so that others can more easily
understand why this kind of hint was born.

Thanks.

This patchset is against next-20190530.

Below is the description of the previous entire patchset.
= 8< =

- Background

The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot start.
While we continually try to improve the performance of cold starts, hot
starts will always be significantly less power hungry as well as faster so
we are trying to make hot start more likely than cold start.

To increase hot starts, Android userspace manages the order in which apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked list
for lmkd (the low memory killer daemon). Apps are likely to be killed by lmkd
if the system has to reclaim memory. In that sense they are similar to
entries in any other cache.
Those apps are kept alive for opportunistic performance improvements but
those performance improvements will vary based on the memory requirements of
individual workloads.

- Problem

Naturally, cached apps were the dominant consumers of memory on the system.
However, they were not significant consumers of swap, even though they are
good candidates for swap. On investigation, we found that swapping out only
begins once the low zone watermark is hit and kswapd wakes up, but the
overall allocation rate in the system might trip lmkd thresholds and cause a
cached process to be killed (we measured the performance of swapping out vs.
zapping the memory by killing a process; unsurprisingly, zapping is 10x
faster even though we use zram, which is much faster than real storage), so
kills from lmkd will often satisfy the high zone watermark, resulting in
very few pages actually being moved to swap.

- Approach

The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel's LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd
by reclaiming apps as soon as they entered the cached state. Additionally,
it gives the platform many more chances to use the information it has to
optimize memory efficiency.

To achieve the goal, the patchset introduces two new options for madvise.
One is MADV_COLD, which deactivates active pages, and the other is
MADV_PAGEOUT, which reclaims private pages instantly. These new options
complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in
that it hints to the kernel that the memory region is not currently needed
and should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in
that it hints to the kernel that the memory region is not currently needed
and should be reclaimed when memory pressure rises.

This approach is similar in spirit to madvise(MADV_DONTNEED), but the
information required to make the reclaim decision is not known to the app.
Instead, it is known to a centralized userspace daemon, and that daemon
must be able to initiate reclaim on its own without any app involvement.
To address that concern, this patch introduces a new syscall:

struct pr_madvise_param {
int size;   /* the size of this structure */
int cookie; /* reserved to support atomicity */
    int nr_elem;/* count of below array fields */
int __user *hints;  /* hints for each range */
/* to store result of each operation */
const struct iovec __user *results;
/* input address ranges */
const struct iovec __user *ranges;
};

int process_madvise(int pidfd, struct pr_madvise_param *u_param,
unsigned long flags);

The syscall takes a pidfd to give hints to an external process and provides
a pair of results/ranges vector arguments so that it can give several hints
to multiple address ranges all at once. It also has a cookie variable to
support atomicity of the API for address range operations. IOW, if the
target process changes its address space after the monitor process has
parsed the address ranges via map_files or maps, the API can detect the
race and cancel the entire address space operation. This is not implemented
yet. Daniel Colascione suggested an idea (please read 
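
For a concrete picture, here is a hypothetical user-space sketch of driving
the proposed API (nothing below is final: the syscall number is unassigned,
process_madvise() stands in for a wrapper that does not exist yet, and addr,
len and pidfd are assumed to come from elsewhere):

	int hint = MADV_COLD;		/* one hint per address range */
	int ret_val;
	struct iovec range  = { .iov_base = addr, .iov_len = len };
	struct iovec result = { .iov_base = &ret_val, .iov_len = sizeof(ret_val) };
	struct pr_madvise_param param = {
		.size    = sizeof(param),
		.cookie  = 0,		/* atomicity support not implemented yet */
		.nr_elem = 1,
		.hints   = &hint,
		.results = &result,
		.ranges  = &range,
	};

	if (process_madvise(pidfd, &param, 0) < 0)
		perror("process_madvise");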

[PATCH v1 1/4] mm: introduce MADV_COLD

2019-06-02 Thread Minchan Kim
When a process expects no accesses to a certain memory range, it can give
a hint to the kernel that the pages can be reclaimed when memory pressure
happens, but that the data should be preserved for future use.  This can
reduce workingset eviction and so ends up increasing performance.

This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help the kernel decide which
pages to evict early during memory pressure.

It works for every LRU page, like MADV_[DONTNEED|FREE]. IOW, it moves

active file page -> inactive file LRU
active anon page -> inactive anon LRU

Unlike MADV_FREE, it doesn't move active anonymous pages to the head of the
inactive file LRU, because MADV_COLD has slightly different semantics.
MADV_FREE means it's okay to discard the page under memory pressure because
the content of the page is *garbage*, so freeing such pages has almost zero
overhead: we don't need to swap them out, and a later access causes just a
minor fault. Thus, it makes sense to put those freeable pages on the
inactive file LRU to compete with other used-once pages, and it even gives
them a bonus so they can be reclaimed on a swapless system. However,
MADV_COLD doesn't mean the content is garbage, so reclaiming such pages
requires swap-out/in in the end. So it's better to move them to the
inactive anon LRU list, not the file LRU. Furthermore, this helps avoid
unnecessary scanning of cold anonymous pages if the system doesn't have a
swap device.

All of the error rules are the same as for MADV_DONTNEED.

Note:
This hint works only with private pages (IOW, page_mapcount(page) < 2)
because a shared page has a greater chance of being accessed by the other
processes sharing it, even after the caller resets the reference bits.
That ends up preventing the reclaim of the page and wasting CPU cycles.
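
To make the usage concrete, here is a minimal user-space sketch (my
illustration, not part of the patch; it assumes a kernel with this patch
applied and takes MADV_COLD's value from the patched mman-common.h, since
libc headers don't know it yet):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLD
#define MADV_COLD 5	/* value from this patch's mman-common.h */
#endif

int main(void)
{
	size_t len = 16 * 4096;
	/* private anonymous mapping: the hint only works on private pages */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	buf[0] = 1;	/* fault a page in so there is something to deactivate */

	/* mark the range cold: deactivate the pages so they are evicted
	 * early under memory pressure, while preserving their contents */
	return madvise(buf, len, MADV_COLD) ? 1 : 0;
}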

* RFCv2
 * add more description - mhocko

* RFCv1
 * renaming from MADV_COOL to MADV_COLD - hannes

* internal review
 * use clear_page_young in deactivate_page - joelaf
 * Revise the description - surenb
 * Renaming from MADV_WARM to MADV_COOL - surenb

Signed-off-by: Minchan Kim 
---
 include/linux/page-flags.h |   1 +
 include/linux/page_idle.h  |  15 
 include/linux/swap.h   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/internal.h  |   2 +-
 mm/madvise.c   | 115 -
 mm/oom_kill.c  |   2 +-
 mm/swap.c  |  43 +
 8 files changed, 176 insertions(+), 4 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page 
*page)
 TESTPAGEFLAG(Young, young, PF_ANY)
 SETPAGEFLAG(Young, young, PF_ANY)
 TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
SetPageYoung(page);
 }
 
+static inline void clear_page_young(struct page *page)
+{
+   ClearPageYoung(page);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
 }
 
+static void clear_page_young(struct page *page)
+{
+   struct page_ext *page_ext = lookup_page_ext(page);
+
+   if (unlikely(!page_ext))
+   return;
+
+   clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
 static inline bool test_and_clear_page_young(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
 extern void rotate_reclaimable_page(struct page *page);
 extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
 extern void mark_page_lazyfree(struct page *page);
 extern void swap_setup(void);
 
diff --git a/include/uapi/asm-generic/mman-common.h 
b/include/uapi/asm-generic/mman-common.h
index bea0278f65ab..1190f4e7f7b9 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -43,6 +43,7 @@
 #define MADV_SEQUENTIAL2   /* expect sequential page 
references */
 #define MADV_WILLNEED  3   /* will need these pages */
 #define MADV_DONTNEED  4   /* don't need these pages */
+#define MADV_COLD  5   /* deactivate these pages */

[PATCH v1 4/4] mm: introduce MADV_PAGEOUT

2019-06-02 Thread Minchan Kim
When a process expects no accesses to a certain memory range
for a long time, it can hint to the kernel that the pages can be
reclaimed instantly, but that the data should be preserved for future use.
This can reduce workingset eviction and so ends up increasing
performance.

This patch introduces the new MADV_PAGEOUT hint to the madvise(2)
syscall. MADV_PAGEOUT can be used by a process to mark a memory
range as not expected to be used for a long time, so that the kernel
reclaims *any LRU* pages instantly. The hint can help the kernel decide
which pages to evict proactively.

All of the error rules are the same as for MADV_DONTNEED.

Note:
This hint works only with private pages (IOW, page_mapcount(page) < 2)
because a shared page has a greater chance of being accessed by the other
processes sharing it, so reclaiming it could cause a major fault soon,
which is inefficient.
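
A corresponding usage sketch (again my illustration: buf and len are assumed
to refer to an existing private anonymous mapping, as in the MADV_COLD
example of patch 1/4, with <sys/mman.h> and <stdio.h> included, and the
value taken from the patched mman-common.h):

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 6	/* value from this patch's mman-common.h */
#endif

	/* unlike MADV_COLD, this asks the kernel to reclaim the range right
	 * away (swap-out/writeback now) rather than just deactivating it */
	if (madvise(buf, len, MADV_PAGEOUT) != 0)
		perror("madvise(MADV_PAGEOUT)");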

* RFC v2
 * make reclaim_pages simple via factoring out isolate logic - hannes

* RFCv1
 * rename from MADV_COLD to MADV_PAGEOUT - hannes
 * bail out if process is being killed - Hillf
 * fix reclaim_pages bugs - Hillf

Signed-off-by: Minchan Kim 
---
 include/linux/swap.h   |   1 +
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/madvise.c   | 126 +
 mm/vmscan.c|  34 +++
 4 files changed, 162 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ce997edb8bb..063c0c1e112b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -365,6 +365,7 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern unsigned long vm_total_pages;
 
+extern unsigned long reclaim_pages(struct list_head *page_list);
 #ifdef CONFIG_NUMA
 extern int node_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
diff --git a/include/uapi/asm-generic/mman-common.h 
b/include/uapi/asm-generic/mman-common.h
index 1190f4e7f7b9..92e347a89ddc 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -44,6 +44,7 @@
 #define MADV_WILLNEED  3   /* will need these pages */
 #define MADV_DONTNEED  4   /* don't need these pages */
 #define MADV_COLD  5   /* deactivate these pages */
+#define MADV_PAGEOUT   6   /* reclaim these pages */
 
 /* common parameters: try to keep these consistent across architectures */
 #define MADV_FREE  8   /* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index ab158766858a..b010249cb8b6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -41,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_WILLNEED:
case MADV_DONTNEED:
case MADV_COLD:
+   case MADV_PAGEOUT:
case MADV_FREE:
return 0;
default:
@@ -415,6 +416,128 @@ static long madvise_cold(struct vm_area_struct *vma,
return 0;
 }
 
+static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
+   unsigned long end, struct mm_walk *walk)
+{
+   pte_t *orig_pte, *pte, ptent;
+   spinlock_t *ptl;
+   LIST_HEAD(page_list);
+   struct page *page;
+   int isolated = 0;
+   struct vm_area_struct *vma = walk->vma;
+   unsigned long next;
+
+   if (fatal_signal_pending(current))
+   return -EINTR;
+
+   next = pmd_addr_end(addr, end);
+   if (pmd_trans_huge(*pmd)) {
+   ptl = pmd_trans_huge_lock(pmd, vma);
+   if (!ptl)
+   return 0;
+
+   if (is_huge_zero_pmd(*pmd))
+   goto huge_unlock;
+
+   page = pmd_page(*pmd);
+   if (page_mapcount(page) > 1)
+   goto huge_unlock;
+
+   if (next - addr != HPAGE_PMD_SIZE) {
+   int err;
+
+   get_page(page);
+   spin_unlock(ptl);
+   lock_page(page);
+   err = split_huge_page(page);
+   unlock_page(page);
+   put_page(page);
+   if (!err)
+   goto regular_page;
+   return 0;
+   }
+
+   if (isolate_lru_page(page))
+   goto huge_unlock;
+
+   list_add(&page->lru, &page_list);
+huge_unlock:
+   spin_unlock(ptl);
+   reclaim_pages(&page_list);
+   return 0;
+   }
+
+   if (pmd_trans_unstable(pmd))
+   return 0;
+regular_page:
+   orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+   for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+   ptent = *pte;
+   if (!pte_present(ptent))
+   continue;
+
+   page = vm_normal_page(vma, addr, ptent);
+   if (!page)
+   

Re: [PATCH] regulator: bd70528: Drop unused include

2019-06-02 Thread Vaittinen, Matti
Thanks Linus!

On Sat, 2019-06-01 at 01:06 +0200, Linus Walleij wrote:
> This driver does not use any symbols from 
> so just drop the include.
> 
> Cc: Matti Vaittinen 
> Signed-off-by: Linus Walleij 
Acked-By: Matti Vaittinen 

Br,
Matti Vaittinen


Re: [PATCH] regulator: bd718x7: Drop unused include

2019-06-02 Thread Vaittinen, Matti
And thanks for this too =)

On Sat, 2019-06-01 at 01:08 +0200, Linus Walleij wrote:
> This driver does not use any symbols from 
> so just drop the include.
> 
> Cc: Matti Vaittinen 
> Signed-off-by: Linus Walleij 
Acked-By: Matti Vaittinen 

Br,
Matti Vaittinen



Re: rcu_read_lock lost its compiler barrier

2019-06-02 Thread Herbert Xu
On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> 
> 1.These guarantees are of full memory barriers, -not- compiler
>   barriers.

What I'm saying is that wherever they are, they must come with
compiler barriers.  I'm not aware of any synchronisation mechanism
in the kernel that gives a memory barrier without a compiler barrier.
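
For reference, the kernel's generic definition of barrier() is just an empty
asm with a "memory" clobber, so any memory barrier implemented as an asm
with that clobber is automatically a compiler barrier as well:

#define barrier() __asm__ __volatile__("": : :"memory")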

> 2.These rules don't say exactly where these full memory barriers
>   go.  SRCU is at one extreme, placing those full barriers in
>   srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU
>   at the other, placing these barriers entirely within the callback
>   queueing/invocation, grace-period computation, and the scheduler.
>   Preemptible Tree RCU is in the middle, with rcu_read_unlock()
>   sometimes including a full memory barrier, but other times with
>   the full memory barrier being confined as it is with !PREEMPT
>   Tree RCU.

The rules do say that the (full) memory barrier must precede any RCU
read-side critical section that occurs after the synchronize_rcu, and must
come after the end of any RCU read-side critical section that occurs before
the synchronize_rcu.

All I'm arguing is that wherever that full mb is, as long as it
also carries with it a barrier() (which it must do if it's done
using an existing kernel mb/locking primitive), then we're fine.

> Interleaving and inserting full memory barriers as per the rules above:
> 
>   CPU1: WRITE_ONCE(a, 1)
>   CPU1: synchronize_rcu   
>   /* Could put a full memory barrier here, but it wouldn't help. */

CPU1: smp_mb();
CPU2: smp_mb();

Let's put them in because I think they are critical.  smp_mb() also
carries with it a barrier().

>   CPU2: rcu_read_lock();
>   CPU1: b = 2;
>   CPU2: if (READ_ONCE(a) == 0)
>   CPU2: if (b != 1)  /* Weakly ordered CPU moved this up! */
>   CPU2: b = 1;
>   CPU2: rcu_read_unlock
> 
> In fact, CPU2's load from b might be moved up to race with CPU1's store,
> which (I believe) is why the model complains in this case.

Let's put aside my doubt over how we're even allowing a compiler
to turn

b = 1

into

if (b != 1)
b = 1

Since you seem to be assuming that (a == 0) is true in this case
(as the assignment b = 1 is carried out), the presence of the full
memory barrier means that the RCU read-side critical section must
have started prior to the synchronize_rcu.  This means that
synchronize_rcu is not allowed to return until at least the end
of the grace period, or at least until the end of rcu_read_unlock.

So it actually should be:

CPU1: WRITE_ONCE(a, 1)
CPU1: synchronize_rcu called
/* Could put a full memory barrier here, but it wouldn't help. */

CPU1: smp_mb();
CPU2: smp_mb();

CPU2: grace period starts
...time passes...
CPU2: rcu_read_lock();
CPU2: if (READ_ONCE(a) == 0)
CPU2: if (b != 1)  /* Weakly ordered CPU moved this up! */
CPU2: b = 1;
CPU2: rcu_read_unlock
...time passes...
CPU2: grace period ends

/* This full memory barrier is also guaranteed by RCU. */
CPU2: smp_mb();

CPU1 synchronize_rcu returns
CPU1: b = 2;

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [EXT] INFO: trying to register non-static key in del_timer_sync (2)

2019-06-02 Thread Dmitry Vyukov
On Sat, Jun 1, 2019 at 7:52 PM Ganapathi Bhat  wrote:
>
> Hi syzbot,
>
> >
> > syzbot found the following crash on:
> >
> As per the
> link (https://syzkaller.appspot.com/bug?extid=dc4127f950da51639216), the issue
> is fixed; is that OK? Let us know if we need to do something.

Hi Ganapathi,

The "fixed" status relates to the similar past bug that was reported
and fixed more than a year ago:
https://groups.google.com/forum/#!msg/syzkaller-bugs/3YnGX1chF2w/jeQjeihtBAAJ
https://syzkaller.appspot.com/bug?id=b4b5c74c57c4b69f4fff86131abb799106182749

This one is still well alive and kicking, with 1200+ crashes, and the
last one happened less than 30 min ago.


[GIT] Sparc

2019-06-02 Thread David Miller


Please pull to get these three bug fixes; the TLB flushing one is of
particular brown paper bag quality...

Thanks.

The following changes since commit f2c7c76c5d0a443053e94adb9f0918fa2fb85c3a:

  Linux 5.2-rc3 (2019-06-02 13:55:33 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc.git 

for you to fetch changes up to 56cd0aefa475079e9613085b14a0f05037518fed:

  sparc: perf: fix updated event period in response to PERF_EVENT_IOC_PERIOD 
(2019-06-02 22:16:33 -0700)


Gen Zhang (1):
  mdesc: fix a missing-check bug in get_vdev_port_node_info()

James Clarke (1):
  sparc64: Fix regression in non-hypervisor TLB flush xcall

Young Xiao (1):
  sparc: perf: fix updated event period in response to PERF_EVENT_IOC_PERIOD

 arch/sparc/kernel/mdesc.c  | 2 ++
 arch/sparc/kernel/perf_event.c | 4 
 arch/sparc/mm/ultra.S  | 4 ++--
 3 files changed, 8 insertions(+), 2 deletions(-)


Re: [PATCH] mdesc: fix a missing-check bug in get_vdev_port_node_info()

2019-06-02 Thread David Miller
From: Gen Zhang 
Date: Fri, 31 May 2019 09:24:18 +0800

> In get_vdev_port_node_info(), 'node_info->vdev_port.name' is allocated
> by kstrdup_const(), which returns NULL on failure. So
> 'node_info->vdev_port.name' should be checked.
> 
> Signed-off-by: Gen Zhang 

Applied, thanks.


Re: [PATCH] sparc: perf: fix updated event period in response to PERF_EVENT_IOC_PERIOD

2019-06-02 Thread David Miller
From: Young Xiao <92siuy...@gmail.com>
Date: Wed, 29 May 2019 10:21:48 +0800

> The PERF_EVENT_IOC_PERIOD ioctl command can be used to change the
> sample period of a running perf_event. Consequently, when calculating
> the next event period, the new period will only be considered after the
> previous one has overflowed.
> 
> This patch changes the calculation of the remaining event ticks so that
> they are offset if the period has changed.
> 
> See commit 3581fe0ef37c ("ARM: 7556/1: perf: fix updated event period in
> response to PERF_EVENT_IOC_PERIOD") for details.
> 
> Signed-off-by: Young Xiao <92siuy...@gmail.com>

Applied, thanks.


Re: [PATCHv6 5/6] arm64: dts: lx2160a: Add PCIe controller DT nodes

2019-06-02 Thread Karthikeyan Mitran
Hi Hou Zhiqiang,
   Two instances [@360 and @380] of the six have a different
window count; the RC cannot have more than 8 windows.
apio-wins = <256>;  // Can we change it to 8?
ppio-wins = <24>;   // Can we change it to 8?

On Tue, May 28, 2019 at 12:20 PM Z.q. Hou  wrote:
>
> From: Hou Zhiqiang 
>
> The LX2160A integrated 6 PCIe Gen4 controllers.
>
> Signed-off-by: Hou Zhiqiang 
> Reviewed-by: Minghuan Lian 
> ---
> V6:
>  - No change.
>
>  .../arm64/boot/dts/freescale/fsl-lx2160a.dtsi | 163 ++
>  1 file changed, 163 insertions(+)
>
> diff --git a/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi 
> b/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
> index 125a8cc2c5b3..7a2b91ff1fbc 100644
> --- a/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
> +++ b/arch/arm64/boot/dts/freescale/fsl-lx2160a.dtsi
> @@ -964,5 +964,168 @@
> };
> };
> };
> +
> +   pcie@340 {
> +   compatible = "fsl,lx2160a-pcie";
> +   reg = <0x00 0x0340 0x0 0x0010   /* controller 
> registers */
> +  0x80 0x 0x0 0x1000>; /* 
> configuration space */
> +   reg-names = "csr_axi_slave", "config_axi_slave";
> +   interrupts = , /* 
> AER interrupt */
> +, /* 
> PME interrupt */
> +; /* 
> controller interrupt */
> +   interrupt-names = "aer", "pme", "intr";
> +   #address-cells = <3>;
> +   #size-cells = <2>;
> +   device_type = "pci";
> +   dma-coherent;
> +   apio-wins = <8>;
> +   ppio-wins = <8>;
> +   bus-range = <0x0 0xff>;
> +   ranges = <0x8200 0x0 0x4000 0x80 0x4000 
> 0x0 0x4000>; /* non-prefetchable memory */
> +   msi-parent = <>;
> +   #interrupt-cells = <1>;
> +   interrupt-map-mask = <0 0 0 7>;
> +   interrupt-map = < 0 0 1  0 0 GIC_SPI 109 
> IRQ_TYPE_LEVEL_HIGH>,
> +   < 0 0 2  0 0 GIC_SPI 110 
> IRQ_TYPE_LEVEL_HIGH>,
> +   < 0 0 3  0 0 GIC_SPI 111 
> IRQ_TYPE_LEVEL_HIGH>,
> +   < 0 0 4  0 0 GIC_SPI 112 
> IRQ_TYPE_LEVEL_HIGH>;
> +   status = "disabled";
> +   };
> +
> +   pcie@350 {
> +   compatible = "fsl,lx2160a-pcie";
> +   reg = <0x00 0x0350 0x0 0x0010   /* controller 
> registers */
> +  0x88 0x 0x0 0x1000>; /* 
> configuration space */
> +   reg-names = "csr_axi_slave", "config_axi_slave";
> +   interrupts = , /* 
> AER interrupt */
> +, /* 
> PME interrupt */
> +; /* 
> controller interrupt */
> +   interrupt-names = "aer", "pme", "intr";
> +   #address-cells = <3>;
> +   #size-cells = <2>;
> +   device_type = "pci";
> +   dma-coherent;
> +   apio-wins = <8>;
> +   ppio-wins = <8>;
> +   bus-range = <0x0 0xff>;
> +   ranges = <0x8200 0x0 0x4000 0x88 0x4000 
> 0x0 0x4000>; /* non-prefetchable memory */
> +   msi-parent = <>;
> +   #interrupt-cells = <1>;
> +   interrupt-map-mask = <0 0 0 7>;
> +   interrupt-map = < 0 0 1  0 0 GIC_SPI 114 
> IRQ_TYPE_LEVEL_HIGH>,
> +   < 0 0 2  0 0 GIC_SPI 115 
> IRQ_TYPE_LEVEL_HIGH>,
> +   < 0 0 3  0 0 GIC_SPI 116 
> IRQ_TYPE_LEVEL_HIGH>,
> +   < 0 0 4  0 0 GIC_SPI 117 
> IRQ_TYPE_LEVEL_HIGH>;
> +   status = "disabled";
> +   };
> +
> +   pcie@360 {
> +   compatible = "fsl,lx2160a-pcie";
> +   reg = <0x00 0x0360 0x0 0x0010   /* controller 
> registers */
> +  0x90 0x 0x0 0x1000>; /* 
> configuration space */
> +   reg-names = "csr_axi_slave", "config_axi_slave";
> +   interrupts = , /* 
> AER interrupt */
> +, /* 
> PME interrupt */
> +; /* 
> controller interrupt */
> +   interrupt-names = "aer", "pme", "intr";
> +   #address-cells = <3>;
> +   #size-cells 

Re: [PATCH v3 1/3] PCI: Introduce pcibios_ignore_alignment_request

2019-06-02 Thread Alexey Kardashevskiy



On 03/06/2019 12:23, Shawn Anastasio wrote:
> 
> 
> On 5/30/19 10:56 PM, Alexey Kardashevskiy wrote:
>>
>>
>> On 31/05/2019 08:49, Shawn Anastasio wrote:
>>> On 5/29/19 10:39 PM, Alexey Kardashevskiy wrote:


 On 28/05/2019 17:39, Shawn Anastasio wrote:
>
>
> On 5/28/19 1:27 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 28/05/2019 15:36, Oliver wrote:
>>> On Tue, May 28, 2019 at 2:03 PM Shawn Anastasio 
>>> wrote:

 Introduce a new pcibios function pcibios_ignore_alignment_request
 which allows the PCI core to defer to platform-specific code to
 determine whether or not to ignore alignment requests for PCI
 resources.

 The existing behavior is to simply ignore alignment requests when
 PCI_PROBE_ONLY is set. This behavior is maintained by the
 default implementation of pcibios_ignore_alignment_request.

 Signed-off-by: Shawn Anastasio 
 ---
     drivers/pci/pci.c   | 9 +++--
     include/linux/pci.h | 1 +
     2 files changed, 8 insertions(+), 2 deletions(-)

 diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
 index 8abc843b1615..8207a09085d1 100644
 --- a/drivers/pci/pci.c
 +++ b/drivers/pci/pci.c
 @@ -5882,6 +5882,11 @@ resource_size_t __weak
 pcibios_default_alignment(void)
    return 0;
     }

 +int __weak pcibios_ignore_alignment_request(void)
 +{
 +   return pci_has_flag(PCI_PROBE_ONLY);
 +}
 +
     #define RESOURCE_ALIGNMENT_PARAM_SIZE COMMAND_LINE_SIZE
     static char
 resource_alignment_param[RESOURCE_ALIGNMENT_PARAM_SIZE] = {0};
     static DEFINE_SPINLOCK(resource_alignment_lock);
 @@ -5906,9 +5911,9 @@ static resource_size_t
 pci_specified_resource_alignment(struct pci_dev *dev,
    p = resource_alignment_param;
    if (!*p && !align)
    goto out;
 -   if (pci_has_flag(PCI_PROBE_ONLY)) {
 +   if (pcibios_ignore_alignment_request()) {
    align = 0;
 -   pr_info_once("PCI: Ignoring requested alignments
 (PCI_PROBE_ONLY)\n");
 +   pr_info_once("PCI: Ignoring requested
 alignments\n");
    goto out;
    }
>>>
>>> I think the logic here is questionable to begin with. If the user
>>> has
>>> explicitly requested re-aligning a resource via the command line
>>> then
>>> we should probably do it even if PCI_PROBE_ONLY is set. When it
>>> breaks
>>> they get to keep the pieces.
>>>
>>> That said, the real issue here is that PCI_PROBE_ONLY probably
>>> shouldn't be set under qemu/kvm. Under the other hypervisor
>>> (PowerVM)
>>> hotplugged devices are configured by firmware before it's passed to
>>> the guest and we need to keep the FW assignments otherwise things
>>> break. QEMU however doesn't do any BAR assignments and relies on
>>> that
>>> being handled by the guest. At boot time this is done by SLOF, but
>>> Linux only keeps SLOF around until it's extracted the device-tree.
>>> Once that's done SLOF gets blown away and the kernel needs to do
>>> its
>>> own BAR assignments. I'm guessing there's a hack in there to make it
>>> work today, but it's a little surprising that it works at all...
>>
>>
>> The hack is to run a modified qemu-aware "/usr/sbin/rtas_errd" in the
>> guest which receives an event from qemu (RAS_EPOW from
>> /proc/interrupts), fetches device tree chunks (and as I understand
>> it -
>> they come with BARs from phyp but without from qemu) and writes
>> "1" to
>> "/sys/bus/pci/rescan" which calls pci_assign_resource() eventually:
>
> Interesting. Does this mean that the PHYP hotplug path doesn't
> call pci_assign_resource?


 I'd expect dlpar_add_slot() to be called under phyp and eventually
 pci_device_add() which (I think) may or may not trigger later
 reassignment.


> If so it means the patch may not
> break that platform after all, though it still may not be
> the correct way of doing things.


 We should probably stop enforcing the PCI_PROBE_ONLY flag - it seems
 that (unless resource_alignment= is used) the pseries guest should just
 walk through all allocated resources and leave them unchanged.
>>>
>>> If we add a pcibios_default_alignment() implementation like was
>>> suggested earlier, then it will behave as if the user has
>>> specified resource_alignment= by default and SLOF's assignments
>>> won't be honored (I think).
>>
>>
>> I removed pci_add_flags(PCI_PROBE_ONLY) from pSeries_setup_arch and
>> tried booting with and without 

linux-next: manual merge of the akpm-current tree with the dma-mapping tree

2019-06-02 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the akpm-current tree got a conflict in:

  include/linux/genalloc.h

between commit:

  3334e1dc5d71 ("lib/genalloc: add gen_pool_dma_zalloc() for zeroed DMA 
allocations")

from the dma-mapping tree and commit:

  1c6b703cba18 ("lib/genalloc: introduce chunk owners")

from the akpm-current tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc include/linux/genalloc.h
index 6c62eeca754f,b0ab64879ccb..
--- a/include/linux/genalloc.h
+++ b/include/linux/genalloc.h
@@@ -116,13 -124,47 +124,48 @@@ static inline int gen_pool_add(struct g
return gen_pool_add_virt(pool, addr, -1, size, nid);
  }
  extern void gen_pool_destroy(struct gen_pool *);
- extern unsigned long gen_pool_alloc(struct gen_pool *, size_t);
- extern unsigned long gen_pool_alloc_algo(struct gen_pool *, size_t,
-   genpool_algo_t algo, void *data);
+ unsigned long gen_pool_alloc_algo_owner(struct gen_pool *pool, size_t size,
+   genpool_algo_t algo, void *data, void **owner);
+ 
+ static inline unsigned long gen_pool_alloc_owner(struct gen_pool *pool,
+   size_t size, void **owner)
+ {
+   return gen_pool_alloc_algo_owner(pool, size, pool->algo, pool->data,
+   owner);
+ }
+ 
+ static inline unsigned long gen_pool_alloc_algo(struct gen_pool *pool,
+   size_t size, genpool_algo_t algo, void *data)
+ {
+   return gen_pool_alloc_algo_owner(pool, size, algo, data, NULL);
+ }
+ 
+ /**
+  * gen_pool_alloc - allocate special memory from the pool
+  * @pool: pool to allocate from
+  * @size: number of bytes to allocate from the pool
+  *
+  * Allocate the requested number of bytes from the specified pool.
+  * Uses the pool allocation function (with first-fit algorithm by default).
+  * Can not be used in NMI handler on architectures without
+  * NMI-safe cmpxchg implementation.
+  */
+ static inline unsigned long gen_pool_alloc(struct gen_pool *pool, size_t size)
+ {
+   return gen_pool_alloc_algo(pool, size, pool->algo, pool->data);
+ }
+ 
  extern void *gen_pool_dma_alloc(struct gen_pool *pool, size_t size,
dma_addr_t *dma);
 +void *gen_pool_dma_zalloc(struct gen_pool *pool, size_t size, dma_addr_t 
*dma);
- extern void gen_pool_free(struct gen_pool *, unsigned long, size_t);
+ extern void gen_pool_free_owner(struct gen_pool *pool, unsigned long addr,
+   size_t size, void **owner);
+ static inline void gen_pool_free(struct gen_pool *pool, unsigned long addr,
+ size_t size)
+ {
+   gen_pool_free_owner(pool, addr, size, NULL);
+ }
+ 
  extern void gen_pool_for_each_chunk(struct gen_pool *,
void (*)(struct gen_pool *, struct gen_pool_chunk *, void *), void *);
  extern size_t gen_pool_avail(struct gen_pool *);




Re: [RFC] mm: Generalize notify_page_fault()

2019-06-02 Thread Anshuman Khandual



On 05/31/2019 11:18 PM, Matthew Wilcox wrote:
> On Fri, May 31, 2019 at 02:17:43PM +0530, Anshuman Khandual wrote:
>> On 05/30/2019 07:09 PM, Matthew Wilcox wrote:
>>> On Thu, May 30, 2019 at 05:31:15PM +0530, Anshuman Khandual wrote:
 On 05/30/2019 04:36 PM, Matthew Wilcox wrote:
> The two handle preemption differently.  Why is x86 wrong and this one
> correct?

 Here it expects context to be already non-preemptible where as the proposed
 generic function makes it non-preemptible with a preempt_[disable|enable]()
 pair for the required code section, irrespective of it's present state. Is
 not this better ?
>>>
>>> git log -p arch/x86/mm/fault.c
>>>
>>> search for 'kprobes'.
>>>
>>> tell me what you think.
>>
>> Are you referring to these following commits
>>
>> a980c0ef9f6d ("x86/kprobes: Refactor kprobes_fault() like 
>> kprobe_exceptions_notify()")
>> b506a9d08bae ("x86: code clarification patch to Kprobes arch code")
>>
>> In particular the latter one (b506a9d08bae). It explains how the invoking 
>> context
>> in itself should be non-preemptible for the kprobes processing context 
>> irrespective
>> of whether kprobe_running() or perhaps smp_processor_id() is safe or not. 
>> Hence it
>> does not make much sense to continue when original invoking context is 
>> preemptible.
>> Instead just bail out earlier. This seems to be making more sense than 
>> preempt
>> disable-enable pair. If there are no concerns about this change from other 
>> platforms,
>> I will change the preemption behavior in proposed generic function next time 
>> around.
> 
> Exactly.
> 
> So, any of the arch maintainers know of a reason they behave differently
> from x86 in this regard?  Or can Anshuman use the x86 implementation
> for all the architectures supporting kprobes?

So the generic notify_page_fault() will be like this.

int __kprobes notify_page_fault(struct pt_regs *regs, unsigned int trap)
{
int ret = 0;

/*
 * To be potentially processing a kprobe fault and to be allowed
 * to call kprobe_running(), we have to be non-preemptible.
 */
if (kprobes_built_in() && !preemptible() && !user_mode(regs)) {
if (kprobe_running() && kprobe_fault_handler(regs, trap))
ret = 1;
}
return ret;
}


[PATCH] sched/fair: don't restart enqueued cfs quota slack timer

2019-06-02 Thread Liangyan
From: "liangyan.ply" 

start_cfs_slack_bandwidth() restarts the quota slack timer. If it is
called frequently, the timer is restarted continuously and may never
get a chance to expire and unthrottle cfs tasks.
As a result, throttled tasks cannot be unthrottled in time
even though they have remaining quota.
Signed-off-by: Liangyan 
---
 kernel/sched/fair.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d90a64620072..fdb03c752f97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4411,9 +4411,11 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
	if (runtime_refresh_within(cfs_b, min_left))
		return;

-	hrtimer_start(&cfs_b->slack_timer,
+	if (!hrtimer_active(&cfs_b->slack_timer)) {
+		hrtimer_start(&cfs_b->slack_timer,
			ns_to_ktime(cfs_bandwidth_slack_period),
			HRTIMER_MODE_REL);
+	}
 }
 
 /* we know any runtime found here is valid as update_curr() precedes return */
-- 
2.14.4.44.g2045bb6



Re: [PATCH] PCI: endpoint: Add DMA to Linux PCI EP Framework

2019-06-02 Thread Kishon Vijay Abraham I
Hi Alan,

On 31/05/19 11:46 PM, Alan Mikhak wrote:
> On Thu, May 30, 2019 at 10:08 PM Kishon Vijay Abraham I  wrote:
>> Hi Alan,
>>>
>>> Hi Kishon,
>>>
>>> I have some improvements in mind for a v2 patch in response to
>>> feedback from Gustavo Pimentel that the current implementation is HW
>>> specific. I hesitate from submitting a v2 patch because it seems best
>>> to seek comment on possible directions this may be taking.
>>>
>>> One alternative is to wait for or modify test functions in
>>> pci-epf-test.c to call DMAengine client APIs, if possible. I imagine
>>> pci-epf-test.c test functions would still allocate the necessary local
>>> buffer on the endpoint side for the same canned tests for everyone to
>>> use. They would prepare the buffer in the existing manner by filling
>>> it with random bytes and calculate CRC in the case of a write test.
>>> However, they would then initiate DMA operations by using DMAengine
>>> client APIs in a generic way instead of calling memcpy_toio() and
>>> memcpy_fromio(). They would post-process the buffer in the existing
>>
>> No, you can't remove memcpy_toio/memcpy_fromio APIs. There could be platforms
>> without system DMA or they could have system DMA but without MEMCOPY channels
>> or without DMA in their PCI controller.
> 
> I agree. I wouldn't remove memcpy_toio/fromio. That is the reason this
> patch introduces the '-d' flag for pcitest to communicate that user
> intent across the PCIe bus to pci-epf-test so the endpoint can
> initiate the transfer using either memcpy_toio/fromio or DMA.
> 
>>> manner such as the checking for CRC in the case of a read test.
>>> Finally, they would release the resources and report results back to
>>> the user of pcitest across the PCIe bus through the existing methods.
>>>
>>> Another alternative I have in mind for v2 is to change the struct
>>> pci_epc_dma that this patch added to pci-epc.h from the following:
>>>
>>> struct pci_epc_dma {
>>> u32 control;
>>> u32 size;
>>> u64 sar;
>>> u64 dar;
>>> };
>>>
>>> to something similar to the following:
>>>
>>> struct pci_epc_dma {
>>> size_t  size;
>>> void *buffer;
>>> int flags;
>>> };
>>>
>>> The 'flags' field can be a bit field or separate boolean values to
>>> specify such things as linked-list mode vs single-block, etc.
>>> Associated #defines would be removed from pci-epc.h to be replaced if
>>> needed with something generic. The 'size' field specifies the size of
>>> DMA transfer that can fit in the buffer.
>>
>> I still have to look closer into your DMA patch, but linked-list mode or
>> single block mode shouldn't be a user-selectable option; it should be
>> determined by the size of the transfer.
> 
> Please consider the following when taking a closer look at this patch.

After seeing comments from Vinod and Arnd, it looks like the better way of
adding DMA support would be to register the DMA within the PCI endpoint
controller with the DMA subsystem (as a dmaengine) and use only dmaengine
APIs in pci_epf_test.
> 
> In my specific use case, I need to verify that any valid block size,
> including a one byte transfer, can be transferred across the PCIe bus
> by memcpy_toio/fromio() or by DMA either as a single block or as
> linked-list. That is why, instead of deciding based on transfer size,
> this patch introduces the '-L' flag for pcitest to communicate the
> user intent across the PCIe bus to pci-epf-test so the endpoint can
> initiate the DMA transfer using a single block or in linked-list mode.
The -L option seems to select an internal DMA configuration which might be
specific to one implementation. As Gustavo already pointed out, we should have only
generic options in pcitest. This would no longer be applicable when we move to
dmaengine.

Thanks
Kishon


Re: [PATCH] PCI: endpoint: Add DMA to Linux PCI EP Framework

2019-06-02 Thread Vinod Koul
Hi Kishon,

On 03-06-19, 09:54, Kishon Vijay Abraham I wrote:

> right. For the endpoint case, drivers/pci/controller should register with the
> dmaengine, i.e. if the controller has an embedded DMA (I think it should be okay
> to keep that in drivers/pci/controller itself instead of drivers/dma), and
> drivers/pci/endpoint/functions/ should use dmaengine APIs (depending on the
> platform, this will either use the system DMA or the DMA within the PCI controller).

Typically I would prefer the driver to be part of drivers/dma.
Would this be a standalone driver or part of the endpoint driver? In
the former case we can move it to dmaengine; for the latter I guess it
makes sense to stay in PCI.

Thanks
-- 
~Vinod


[PATCH] cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending

2019-06-02 Thread Pavankumar Kondeti
When "deep" suspend is enabled, all CPUs except the primary CPU
are hotplugged out. Since CPU hotplug is a costly operation,
check if we have to abort the suspend in between each CPU
hotplug. This would improve the system suspend abort latency
upon detecting a wakeup condition.

Signed-off-by: Pavankumar Kondeti 
---
 kernel/cpu.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/kernel/cpu.c b/kernel/cpu.c
index f2ef104..784b33d 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1221,6 +1221,13 @@ int freeze_secondary_cpus(int primary)
for_each_online_cpu(cpu) {
if (cpu == primary)
continue;
+
+   if (pm_wakeup_pending()) {
+   pr_info("Aborting disabling non-boot CPUs..\n");
+   error = -EBUSY;
+   break;
+   }
+
trace_suspend_resume(TPS("CPU_OFF"), cpu, true);
error = _cpu_down(cpu, 1, CPUHP_OFFLINE);
trace_suspend_resume(TPS("CPU_OFF"), cpu, false);
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] PCI: endpoint: Add DMA to Linux PCI EP Framework

2019-06-02 Thread Kishon Vijay Abraham I
Hi,

On 31/05/19 1:19 PM, Arnd Bergmann wrote:
> On Fri, May 31, 2019 at 8:32 AM Vinod Koul  wrote:
>> On 31-05-19, 10:50, Kishon Vijay Abraham I wrote:
>>> On 31/05/19 10:37 AM, Vinod Koul wrote:
 On 30-05-19, 11:16, Kishon Vijay Abraham I wrote:
>
> right, my initial thought process was to use only dmaengine APIs in
> pci-epf-test so that the system DMA or DMA within the PCIe controller can 
> be
> used transparently. But can we register DMA within the PCIe controller to 
> the
> DMA subsystem? AFAIK only system DMA should register with the DMA 
> subsystem.
> (ADMA in SDHCI doesn't use dmaengine). Vinod Koul can confirm.

 So would this DMA be dedicated for PCI and all PCI devices on the bus?
>>>
>>> Yes, this DMA will be used only by PCI ($patch is w.r.t. PCIe device mode, so
>>> all endpoint functions, both physical and virtual, will use the DMA in
>>> the controller).
 If so I do not see a reason why this cannot be using dmaengine. The use
>>>
>>> Thanks for clarifying. I was under the impression any DMA within a 
>>> peripheral
>>> controller shouldn't use DMAengine.
>>
>> That is indeed a correct assumption. The dmaengine helps in cases where
>> we have a dma controller with multiple users, for a single user case it
>> might be overhead to setup dma driver and then use it thru framework.
>>
>> Someone needs to see the benefit and cost of using the framework and
>> decide.
> 
> I think the main question is about how generalized we want this to be.
> There are lots of difference PCIe endpoint implementations, and in
> case of some licensable IP cores like the designware PCIe there are
> many variants, as each SoC will do the implementation in a slightly
> different way.
> 
> If we can have a single endpoint driver than can either have an
> integrated DMA engine or use an external one, then abstracting that
> DMA engine helps make the driver work more readily either way.
> 
> Similarly, there may be PCIe endpoint implementations that have
> a dedicated DMA engine in them that is not usable for anything else,
> but that is closely related to an IP core we already have a dmaengine
> driver for. In this case, we can avoid duplication.

right. Either way it makes more sense to register DMA embedded within the PCIe
endpoint controller instead of creating epc_ops for DMA transfers.

Thanks
Kishon


[PATCH 12/15] dcache: Provide a dentry constructor

2019-06-02 Thread Tobin C. Harding
In order to support object migration on the dentry cache we need to have
a determined object state at all times. Without a constructor the object
would have a random state after allocation.

Provide a dentry constructor.

Signed-off-by: Tobin C. Harding 
---
 fs/dcache.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c435398f2c81..867d97a86940 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1603,6 +1603,16 @@ void d_invalidate(struct dentry *dentry)
 }
 EXPORT_SYMBOL(d_invalidate);
 
+static void dcache_ctor(void *p)
+{
+   struct dentry *dentry = p;
+
+   /* Mimic lockref_mark_dead() */
+   dentry->d_lockref.count = -128;
+
+   spin_lock_init(&dentry->d_lock);
+}
+
 /**
  * __d_alloc   -   allocate a dcache entry
  * @sb: filesystem it will belong to
@@ -1658,7 +1668,6 @@ struct dentry *__d_alloc(struct super_block *sb, const 
struct qstr *name)
 
dentry->d_lockref.count = 1;
dentry->d_flags = 0;
-   spin_lock_init(&dentry->d_lock);
seqcount_init(&dentry->d_seq);
dentry->d_inode = NULL;
dentry->d_parent = dentry;
@@ -3096,14 +3105,17 @@ static void __init dcache_init_early(void)
 
 static void __init dcache_init(void)
 {
-   /*
-* A constructor could be added for stable state like the lists,
-* but it is probably not worth it because of the cache nature
-* of the dcache.
-*/
-   dentry_cache = KMEM_CACHE_USERCOPY(dentry,
-   SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD|SLAB_ACCOUNT,
-   d_iname);
+   slab_flags_t flags =
+   SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | SLAB_MEM_SPREAD | SLAB_ACCOUNT;
+
+   dentry_cache =
+   kmem_cache_create_usercopy("dentry",
+  sizeof(struct dentry),
+  __alignof__(struct dentry),
+  flags,
+  offsetof(struct dentry, d_iname),
+  sizeof_field(struct dentry, d_iname),
+  dcache_ctor);
 
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
-- 
2.21.0
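
As background for the constructor added above, note that a slab
constructor runs once per object when a slab page is allocated, not on
every kmem_cache_alloc(), so every free path must leave the object back
in its constructed state.  A minimal illustration of the pattern (names
invented, not from the patch):

#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
	spinlock_t lock;
	int refcount;
};

static struct kmem_cache *foo_cachep;

static void foo_ctor(void *p)
{
	struct foo *f = p;

	f->refcount = 0;		/* "dead" until first allocation */
	spin_lock_init(&f->lock);
}

static int __init foo_cache_init(void)
{
	foo_cachep = kmem_cache_create("foo", sizeof(struct foo), 0,
				       SLAB_PANIC, foo_ctor);
	return foo_cachep ? 0 : -ENOMEM;
}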



[PATCH 14/15] slub: Enable moving objects to/from specific nodes

2019-06-02 Thread Tobin C. Harding
We have just implemented Slab Movable Objects (SMO, object migration).
Currently object migration is used to defrag a cache.  On NUMA systems
it would be nice to be able to control the source and destination nodes
when moving objects.

Add CONFIG_SLUB_SMO_NODE to guard this feature.  CONFIG_SLUB_SMO_NODE
depends on CONFIG_SLUB_DEBUG because we use the full list.

Implement moving all objects (including those in full slabs) to a
specific node.  Expose this functionality to userspace via a sysfs
entry.

Add sysfs entry:

   /sys/kernel/slab/<cache>/move

With this users get access to the following functionality:

 - Move all objects to specified node.

echo "N1" > move

 - Move all objects from specified node to other specified
   node (from N1 -> to N2):

echo "N1 N2" > move

This also enables shrinking slabs on a specific node:

echo "N1 N1" > move

Signed-off-by: Tobin C. Harding 
---
 mm/Kconfig |   7 ++
 mm/slub.c  | 247 +
 2 files changed, 254 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index f0c76ba47695..c1438b9e578b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -259,6 +259,13 @@ config ARCH_ENABLE_THP_MIGRATION
 config CONTIG_ALLOC
def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
+config SLUB_SMO_NODE
+   bool "Enable per node control of Slab Movable Objects"
+   depends on SLUB && SYSFS
+   select SLUB_DEBUG
+   help
+ On NUMA systems enable moving objects to and from a specified node.
+
 config PHYS_ADDR_T_64BIT
def_bool 64BIT
 
diff --git a/mm/slub.c b/mm/slub.c
index 2157205df7ba..23566e5a712b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4336,6 +4336,130 @@ static void move_slab_page(struct page *page, void 
*scratch, int node)
s->migrate(s, vector, count, node, private);
 }
 
+#ifdef CONFIG_SLUB_SMO_NODE
+/*
+ * kmem_cache_move() - Attempt to move all slab objects.
+ * @s: The cache we are working on.
+ * @node: The node to move objects away from.
+ * @target_node: The node to move objects on to.
+ *
+ * Attempts to move all objects (partial slabs and full slabs) to target
+ * node.
+ *
+ * Context: Takes the list_lock.
+ * Return: The number of slabs remaining on node.
+ */
+static unsigned long kmem_cache_move(struct kmem_cache *s,
+int node, int target_node)
+{
+   struct kmem_cache_node *n = get_node(s, node);
+   LIST_HEAD(move_list);
+   struct page *page, *page2;
+   unsigned long flags;
+   void **scratch;
+
+   if (!s->migrate) {
+   pr_warn("%s SMO not enabled, cannot move objects\n", s->name);
+   goto out;
+   }
+
+   scratch = alloc_scratch(s);
+   if (!scratch)
+   goto out;
+
+   spin_lock_irqsave(&n->list_lock, flags);
+
+   list_for_each_entry_safe(page, page2, &n->partial, lru) {
+   if (!slab_trylock(page))
+   /* Busy slab. Get out of the way */
+   continue;
+
+   if (page->inuse) {
+   list_move(&page->lru, &move_list);
+   /* Stop page being considered for allocations */
+   n->nr_partial--;
+   page->frozen = 1;
+
+   slab_unlock(page);
+   } else {/* Empty slab page */
+   list_del(&page->lru);
+   n->nr_partial--;
+   slab_unlock(page);
+   discard_slab(s, page);
+   }
+   }
+   list_for_each_entry_safe(page, page2, &n->full, lru) {
+   if (!slab_trylock(page))
+   continue;
+
+   list_move(&page->lru, &move_list);
+   page->frozen = 1;
+   slab_unlock(page);
+   }
+
+   spin_unlock_irqrestore(&n->list_lock, flags);
+
+   list_for_each_entry(page, &move_list, lru) {
+   if (page->inuse)
+   move_slab_page(page, scratch, target_node);
+   }
+   kfree(scratch);
+
+   /* Bail here to save taking the list_lock */
+   if (list_empty(&move_list))
+   goto out;
+
+   /* Inspect results and dispose of pages */
+   spin_lock_irqsave(&n->list_lock, flags);
+   list_for_each_entry_safe(page, page2, &move_list, lru) {
+   list_del(&page->lru);
+   slab_lock(page);
+   page->frozen = 0;
+
+   if (page->inuse) {
+   if (page->inuse == page->objects) {
+   list_add(&page->lru, &n->full);
+   slab_unlock(page);
+   } else {
+   n->nr_partial++;
+   list_add_tail(&page->lru, &n->partial);
+   slab_unlock(page);
+   }
+   } else {
+   slab_unlock(page);
+   discard_slab(s, page);
+   }
+   }

[PATCH 15/15] slub: Enable balancing slabs across nodes

2019-06-02 Thread Tobin C. Harding
We have just implemented Slab Movable Objects (SMO).  On NUMA systems
slabs can become unbalanced i.e. many slabs on one node while other
nodes have few slabs.  Using SMO we can balance the slabs across all
the nodes.

The algorithm used is as follows:

 1. Move all objects to node 0 (this has the effect of defragmenting the
cache).

 2. Calculate the desired number of slabs for each node (this is done
using the approximation nr_slabs / nr_nodes).

 3. Loop over the nodes moving the desired number of slabs from node 0
to the node.

Feature is conditionally built in with CONFIG_SLUB_SMO_NODE; this is
because we need the full list (we enable SLUB_DEBUG to get this).  Future
versions may separate the full list out of SLUB_DEBUG.

Expose this functionality to userspace via a sysfs entry.  Add sysfs
entry:

   /sys/kernel/slab/<cache>/balance

Write of '1' to this file triggers balance, no other value accepted.

This feature relies on SMO being enabled for the cache; this is done
with a call to the function below, after the isolate/migrate functions
have been defined:

kmem_cache_setup_mobility(s, isolate, migrate)

Signed-off-by: Tobin C. Harding 
---
 mm/slub.c | 130 ++
 1 file changed, 130 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 23566e5a712b..70e46c4db757 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4458,6 +4458,119 @@ static unsigned long kmem_cache_move_to_node(struct 
kmem_cache *s, int node)
 
return left;
 }
+
+/*
+ * kmem_cache_move_slabs() - Attempt to move @num slabs to target_node,
+ * @s: The cache we are working on.
+ * @node: The node to move objects from.
+ * @target_node: The node to move objects to.
+ * @num: The number of slabs to move.
+ *
+ * Attempts to move @num slabs from @node to @target_node.  This is done
+ * by migrating objects from slabs on the full_list.
+ *
+ * Return: The number of slabs moved or error code.
+ */
+static long kmem_cache_move_slabs(struct kmem_cache *s,
+ int node, int target_node, long num)
+{
+   struct kmem_cache_node *n = get_node(s, node);
+   LIST_HEAD(move_list);
+   struct page *page, *page2;
+   unsigned long flags;
+   void **scratch;
+   long done = 0;
+
+   if (!s->migrate) {
+   pr_warn("%s SMO not enabled, cannot move objects\n", s->name);
+   goto out;
+   }
+
+   if (node == target_node)
+   return -EINVAL;
+
+   scratch = alloc_scratch(s);
+   if (!scratch)
+   return -ENOMEM;
+
+   spin_lock_irqsave(&n->list_lock, flags);
+
+   list_for_each_entry_safe(page, page2, &n->full, lru) {
+   if (!slab_trylock(page))
+   /* Busy slab. Get out of the way */
+   continue;
+
+   list_move(&page->lru, &move_list);
+   page->frozen = 1;
+   slab_unlock(page);
+
+   if (++done >= num)
+   break;
+   }
+   spin_unlock_irqrestore(&n->list_lock, flags);
+
+   list_for_each_entry(page, &move_list, lru) {
+   if (page->inuse)
+   move_slab_page(page, scratch, target_node);
+   }
+   kfree(scratch);
+
+   /* Bail here to save taking the list_lock */
+   if (list_empty(&move_list))
+   goto out;
+
+   /* Inspect results and dispose of pages */
+   spin_lock_irqsave(&n->list_lock, flags);
+   list_for_each_entry_safe(page, page2, &move_list, lru) {
+   list_del(&page->lru);
+   slab_lock(page);
+   page->frozen = 0;
+
+   if (page->inuse) {
+   /*
+* This is best effort only, if slab still has
+* objects just put it back on the partial list.
+*/
+   n->nr_partial++;
+   list_add_tail(&page->lru, &n->partial);
+   slab_unlock(page);
+   } else {
+   slab_unlock(page);
+   discard_slab(s, page);
+   }
+   }
+   spin_unlock_irqrestore(&n->list_lock, flags);
+out:
+   return done;
+}
+
+/*
+ * kmem_cache_balance_nodes() - Balance slabs across nodes.
+ * @s: The cache we are working on.
+ */
+static void kmem_cache_balance_nodes(struct kmem_cache *s)
+{
+   struct kmem_cache_node *n = get_node(s, 0);
+   unsigned long desired_nr_slabs_per_node;
+   unsigned long nr_slabs;
+   int nr_nodes = 0;
+   int nid;
+
+   (void)kmem_cache_move_to_node(s, 0);
+
+   for_each_node_state(nid, N_NORMAL_MEMORY)
+   nr_nodes++;
+
+   nr_slabs = atomic_long_read(&n->nr_slabs);
+   desired_nr_slabs_per_node = nr_slabs / nr_nodes;
+
+   for_each_node_state(nid, N_NORMAL_MEMORY) {
+   if (nid == 0)
+   continue;
+
+   kmem_cache_move_slabs(s, 0, nid, desired_nr_slabs_per_node);
+   }
+}
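
As a worked example of the approximation in step 2 (numbers are
illustrative, not from the patch): with nr_slabs = 1000 and
nr_nodes = 4, desired_nr_slabs_per_node = 1000 / 4 = 250, so after
step 1 has gathered everything on node 0, 250 slabs worth of objects
are moved to each of nodes 1-3 and node 0 keeps the rest.  Integer
division truncates, so node 0 may end up with slightly more than its
share.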
 

[PATCH 10/15] xarray: Implement migration function for xa_node objects

2019-06-02 Thread Tobin C. Harding
Recently Slab Movable Objects (SMO) was implemented for the SLUB
allocator.  The XArray can take advantage of this and make the xa_node
slab cache objects movable.

Implement functions to migrate objects and activate SMO when we
initialise the XArray slab cache.

This is based on initial code by Matthew Wilcox and was modified to work
with slab object migration.

Cc: Matthew Wilcox 
Signed-off-by: Tobin C. Harding 
---
 lib/xarray.c | 61 
 1 file changed, 61 insertions(+)

diff --git a/lib/xarray.c b/lib/xarray.c
index 861c042daa1d..9354e0f01f26 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1993,12 +1993,73 @@ static void xa_node_ctor(void *arg)
INIT_LIST_HEAD(&node->private_list);
 }
 
+static void xa_object_migrate(struct xa_node *node, int numa_node)
+{
+   struct xarray *xa = READ_ONCE(node->array);
+   void __rcu **slot;
+   struct xa_node *new_node;
+   int i;
+
+   /* Freed or not yet in tree then skip */
+   if (!xa || xa == XA_RCU_FREE)
+   return;
+
+   new_node = kmem_cache_alloc_node(xa_node_cachep, GFP_KERNEL, numa_node);
+   if (!new_node) {
+   pr_err("%s: slab cache allocation failed\n", __func__);
+   return;
+   }
+
+   xa_lock_irq(xa);
+
+   /* Check again. */
+   if (xa != node->array) {
+   node = new_node;
+   goto unlock;
+   }
+
+   memcpy(new_node, node, sizeof(struct xa_node));
+
+   if (list_empty(&node->private_list))
+   INIT_LIST_HEAD(&new_node->private_list);
+   else
+   list_replace(&node->private_list, &new_node->private_list);
+
+   for (i = 0; i < XA_CHUNK_SIZE; i++) {
+   void *x = xa_entry_locked(xa, new_node, i);
+
+   if (xa_is_node(x))
+   rcu_assign_pointer(xa_to_node(x)->parent, new_node);
+   }
+   if (!new_node->parent)
+   slot = &xa->xa_head;
+   else
+   slot = &xa_parent_locked(xa, new_node)->slots[new_node->offset];
+   rcu_assign_pointer(*slot, xa_mk_node(new_node));
+
+unlock:
+   xa_unlock_irq(xa);
+   xa_node_free(node);
+   rcu_barrier();
+}
+
+static void xa_migrate(struct kmem_cache *s, void **objects, int nr,
+  int node, void *_unused)
+{
+   int i;
+
+   for (i = 0; i < nr; i++)
+   xa_object_migrate(objects[i], node);
+}
+
 void __init xarray_slabcache_init(void)
 {
xa_node_cachep = kmem_cache_create("xarray_node",
   sizeof(struct xa_node), 0,
   SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
   xa_node_ctor);
+
+   kmem_cache_setup_mobility(xa_node_cachep, NULL, xa_migrate);
 }
 
 #ifdef XA_DEBUG
-- 
2.21.0
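
A note on the rcu_barrier() at the end of xa_object_migrate() above:
concurrent RCU readers may still be walking the old node while it is
replaced, which is why the old node goes through call_rcu() and the
migrate path waits before returning.  A minimal sketch of such a reader
(illustrative, not from the patch):

#include <linux/xarray.h>

static void *peek_entry(struct xarray *xa, unsigned long index)
{
	void *entry;

	rcu_read_lock();
	entry = xa_load(xa, index);	/* may traverse the old xa_node */
	rcu_read_unlock();

	return entry;
}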



[PATCH 07/15] tools/testing/slab: Add object migration test module

2019-06-02 Thread Tobin C. Harding
We just implemented slab movable objects for the SLUB allocator.  We
should test that code.  In order to do so we need to be able to do a
number of things:

 - Create a cache
 - Enable Slab Movable Objects for the cache
 - Allocate objects to the cache
 - Free objects from within specific slabs of the cache

We can do all this via a loadable module.

Add a module that defines functions that can be triggered from userspace
via a debugfs entry. From the source:

  /*
   * SLUB defragmentation a.k.a. Slab Movable Objects (SMO).
   *
   * This module is used for testing the SLUB allocator.  Enables
   * userspace to run kernel functions via a debugfs file.
   *
   *   debugfs: /sys/kernel/debug/smo/callfn (write only)
   *
   * String written to `callfn` is parsed by the module and associated
   * function is called.  See fn_tab for mapping of strings to functions.
   */

References to allocated objects are kept by the module in a linked list
so that userspace can control which object to free.

We introduce the following four functions via the function table

  "enable": Enables object migration for the test cache.
  "alloc X": Allocates X objects
  "free X [Y]": Frees X objects starting at list position Y (default Y==0)
  "test": Runs [stress] tests from within the module (see below).

   {"enable", smo_enable_cache_mobility},
   {"alloc", smo_alloc_objects},
   {"free", smo_free_object},
   {"test", smo_run_module_tests},

Freeing from the start of the list creates a hole in the slab being
freed from (i.e. creates a partial slab).  The results of running these
commands can be seen using `slabinfo` (available in tools/vm/):

gcc -o slabinfo tools/vm/slabinfo.c

Stress tests can be run from within the module.  These tests are
internal to the module because we verify that object references are
still good after object migration.  These are called 'stress' tests
because it is intended that they create/free a lot of objects.
Userspace can control the number of objects to create, default is 1000.

Example test session


Relevant /proc/slabinfo column headers:

  name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>

  # mount -t debugfs none /sys/kernel/debug/
  $ cd path/to/linux/tools/testing/slab; make
  ...

  # insmod slub_defrag.ko
  # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
  smo_test             0      0    392   20    2

From this we can see that the module created cache 'smo_test' with 20
objects per slab and 2 pages per slab (and cache is currently empty).

We can play with the slab allocator manually:

  # insmod slub_defrag.ko
  # echo 'alloc 21' > callfn
  # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
  smo_test            21     40    392   20    2

We see here that 21 active objects have been allocated creating 2
slabs (40 total objects).

  # slabinfo smo_test --report

  Slabcache: smo_test Aliases:  0 Order :  1 Objects: 21

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object :  56  Total  :   2   Sanity Checks : On   Total:   16384
  SlabObj: 392  Full   :   1   Redzoning : On   Used :1176
  SlabSiz:8192  Partial:   1   Poisoning : On   Loss :   15208
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig:7056
  Align  :   8  Objects:  20   Tracing   : Off  Lpadd: 704

Now free an object from the first slot of the first slab

  # echo 'free 1' > callfn
  # cat /proc/slabinfo | grep smo_test | sed 's/:.*//'
  smo_test            20     40    392   20    2

  # slabinfo smo_test --report

  Slabcache: smo_test Aliases:  0 Order :  1 Objects: 20

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object :  56  Total  :   2   Sanity Checks : On   Total:   16384
  SlabObj: 392  Full   :   0   Redzoning : On   Used :1120
  SlabSiz:8192  Partial:   2   Poisoning : On   Loss :   15264
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig:6720
  Align  :   8  Objects:  20   Tracing   : Off  Lpadd: 704

Calling shrink now on the cache does nothing because object migration is
not enabled (output omitted).  If we enable object migration then shrink
the cache we expect the object from the second slab to be moved to the
first slot in the first slab and the second slab to be removed from the
partial list.

  # echo 'enable' > callfn
  # slabinfo smo_test --shrink
  # slabinfo smo_test --report

  Slabcache: smo_test Aliases:  0 Order :  1 Objects: 20
  ** Defragmentation at 30%

  Sizes (bytes) Slabs  DebugMemory
  
  Object :  56  Total  :   1   Sanity Checks : On   Total:8192
  SlabObj: 392  Full   :   1   

[PATCH 09/15] lib: Separate radix_tree_node and xa_node slab cache

2019-06-02 Thread Tobin C. Harding
Earlier, Slab Movable Objects (SMO) was implemented.  The XArray is now
able to take advantage of SMO in order to make xarray nodes
movable (when using the SLUB allocator).

Currently the radix tree uses the same slab cache as the XArray.  Only
XArray nodes are movable, _not_ radix tree nodes.  We can give the radix
tree its own slab cache to overcome this.

In preparation for implementing XArray object migration (xa_node
objects) via Slab Movable Objects add a slab cache solely for XArray
nodes and make the XArray use this slab cache instead of the
radix_tree_node slab cache.

Cc: Matthew Wilcox 
Signed-off-by: Tobin C. Harding 
---
 include/linux/xarray.h |  3 +++
 init/main.c|  2 ++
 lib/radix-tree.c   |  2 +-
 lib/xarray.c   | 48 ++
 4 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 0e01e6129145..773f91f8e1db 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -42,6 +42,9 @@
 
 #define BITS_PER_XA_VALUE  (BITS_PER_LONG - 1)
 
+/* Called from init/main.c */
+void xarray_slabcache_init(void);
+
 /**
  * xa_mk_value() - Create an XArray entry from an integer.
  * @v: Value to store in XArray.
diff --git a/init/main.c b/init/main.c
index 66a196c5e4c3..8c409a5dc937 100644
--- a/init/main.c
+++ b/init/main.c
@@ -107,6 +107,7 @@ static int kernel_init(void *);
 
 extern void init_IRQ(void);
 extern void radix_tree_init(void);
+extern void xarray_slabcache_init(void);
 
 /*
  * Debug helper: via this flag we know that we are in 'early bootup code'
@@ -622,6 +623,7 @@ asmlinkage __visible void __init start_kernel(void)
 "Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
radix_tree_init();
+   xarray_slabcache_init();
 
/*
 * Set up housekeeping before setting up workqueues to allow the unbound
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 18c1dfbb1765..e6127c4c84b5 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -31,7 +31,7 @@
 /*
  * Radix tree node cache.
  */
-struct kmem_cache *radix_tree_node_cachep;
+static struct kmem_cache *radix_tree_node_cachep;
 
 /*
  * The radix tree is variable-height, so an insert operation not only has
diff --git a/lib/xarray.c b/lib/xarray.c
index 6be3acbb861f..861c042daa1d 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -27,6 +27,8 @@
  * @entry refers to something stored in a slot in the xarray
  */
 
+static struct kmem_cache *xa_node_cachep;
+
 static inline unsigned int xa_lock_type(const struct xarray *xa)
 {
return (__force unsigned int)xa->xa_flags & 3;
@@ -244,9 +246,21 @@ void *xas_load(struct xa_state *xas)
 }
 EXPORT_SYMBOL_GPL(xas_load);
 
-/* Move the radix tree node cache here */
-extern struct kmem_cache *radix_tree_node_cachep;
-extern void radix_tree_node_rcu_free(struct rcu_head *head);
+static void xa_node_rcu_free(struct rcu_head *head)
+{
+   struct xa_node *node = container_of(head, struct xa_node, rcu_head);
+
+   /*
+* Must only free zeroed nodes into the slab.  We can be left with
+* non-NULL entries by radix_tree_free_nodes, so clear the entries
+* and tags here.
+*/
+   memset(node->slots, 0, sizeof(node->slots));
+   memset(node->tags, 0, sizeof(node->tags));
+   INIT_LIST_HEAD(&node->private_list);
+
+   kmem_cache_free(xa_node_cachep, node);
+}
 
 #define XA_RCU_FREE((struct xarray *)1)
 
@@ -254,7 +268,7 @@ static void xa_node_free(struct xa_node *node)
 {
XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
node->array = XA_RCU_FREE;
-   call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+   call_rcu(&node->rcu_head, xa_node_rcu_free);
 }
 
 /*
@@ -270,7 +284,7 @@ static void xas_destroy(struct xa_state *xas)
if (!node)
return;
XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
-   kmem_cache_free(radix_tree_node_cachep, node);
+   kmem_cache_free(xa_node_cachep, node);
xas->xa_alloc = NULL;
 }
 
@@ -298,7 +312,7 @@ bool xas_nomem(struct xa_state *xas, gfp_t gfp)
xas_destroy(xas);
return false;
}
-   xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+   xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp);
if (!xas->xa_alloc)
return false;
XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
@@ -327,10 +341,10 @@ static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
}
if (gfpflags_allow_blocking(gfp)) {
xas_unlock_type(xas, lock_type);
-   xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+   xas->xa_alloc = kmem_cache_alloc(xa_node_cachep, gfp);
xas_lock_type(xas, lock_type);
} else {
-   xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+

[PATCH 13/15] dcache: Implement partial shrink via Slab Movable Objects

2019-06-02 Thread Tobin C. Harding
The dentry slab cache is susceptible to internal fragmentation.  Now
that we have Slab Movable Objects we can attempt to defragment the
dcache.  Dentry objects are inherently _not_ relocatable; however, under
some conditions they can be free'd.  This is the same as shrinking the
dcache but instead of shrinking the whole cache we only attempt to free
those objects that are located in partially full slab pages.  There is
no guarantee that this will reduce the memory usage of the system, it is
a compromise between fragmented memory and total cache shrinkage with
the hope that some memory pressure can be alleviated.

This is implemented using the newly added Slab Movable Objects
infrastructure.  The dcache 'migration' function is intentionally _not_
called 'd_migrate' because we only free, we do not migrate.  Call it
'd_partial_shrink' to make explicit that no reallocation is done.

In order to enable SMO a call to kmem_cache_setup_mobility() must be
made, we do this during initialization of the dcache.

Implement isolate and 'migrate' functions for the dentry slab cache.
Enable SMO for the dcache during initialization.

Signed-off-by: Tobin C. Harding 
---
 fs/dcache.c | 75 +
 1 file changed, 75 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 867d97a86940..3ca721752723 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3072,6 +3072,79 @@ void d_tmpfile(struct dentry *dentry, struct inode 
*inode)
 }
 EXPORT_SYMBOL(d_tmpfile);
 
+/*
+ * d_isolate() - Dentry isolation callback function.
+ * @s: The dentry cache.
+ * @v: Vector of pointers to the objects to isolate.
+ * @nr: Number of objects in @v.
+ *
+ * The slab allocator is holding off frees. We can safely examine
+ * the object without the danger of it vanishing from under us.
+ */
+static void *d_isolate(struct kmem_cache *s, void **v, int nr)
+{
+   struct list_head *dispose;
+   struct dentry *dentry;
+   int i;
+
+   dispose = kmalloc(sizeof(*dispose), GFP_KERNEL);
+   if (!dispose)
+   return NULL;
+
+   INIT_LIST_HEAD(dispose);
+
+   for (i = 0; i < nr; i++) {
+   dentry = v[i];
+   spin_lock(&dentry->d_lock);
+
+   if (dentry->d_lockref.count > 0 ||
+   dentry->d_flags & DCACHE_SHRINK_LIST) {
+   spin_unlock(&dentry->d_lock);
+   continue;
+   }
+
+   if (dentry->d_flags & DCACHE_LRU_LIST)
+   d_lru_del(dentry);
+
+   d_shrink_add(dentry, dispose);
+   spin_unlock(&dentry->d_lock);
+   }
+
+   return dispose;
+}
+
+/*
+ * d_partial_shrink() - Dentry migration callback function.
+ * @s: The dentry cache.
+ * @_unused: We do not access the vector.
+ * @__unused: No need for length of vector.
+ * @___unused: We do not do any allocation.
+ * @private: list_head pointer representing the shrink list.
+ *
+ * Dispose of the shrink list created during isolation function.
+ *
+ * Dentry objects can _not_ be relocated and shrinking the whole dcache
+ * can be expensive.  This is an effort to free dentry objects that are
+ * stopping slab pages from being free'd without clearing the whole dcache.
+ *
+ * This callback is called from the SLUB allocator object migration
+ * infrastructure in attempt to free up slab pages by freeing dentry
+ * objects from partially full slabs.
+ */
+static void d_partial_shrink(struct kmem_cache *s, void **_unused, int __unused,
+int ___unused, void *private)
+{
+   struct list_head *dispose = private;
+
+   if (!private)   /* kmalloc error during isolate. */
+   return;
+
+   if (!list_empty(dispose))
+   shrink_dentry_list(dispose);
+
+   kfree(private);
+}
+
 static __initdata unsigned long dhash_entries;
 static int __init set_dhash_entries(char *str)
 {
@@ -3117,6 +3190,8 @@ static void __init dcache_init(void)
   sizeof_field(struct dentry, d_iname),
   dcache_ctor);
 
+   kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);
+
/* Hash may have been set up in dcache_init_early */
if (!hashdist)
return;
-- 
2.21.0
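
Once this patch is applied, the partial shrink can be exercised from
userspace via SLUB's sysfs interface.  A small sketch (it assumes the
standard /sys/kernel/slab layout and is not part of the patch):

#include <stdio.h>

static void show(const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (f && fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	if (f)
		fclose(f);
}

int main(void)
{
	FILE *f;

	show("/sys/kernel/slab/dentry/partial");

	/* Writing '1' to shrink runs kmem_cache_shrink() and hence
	 * d_isolate()/d_partial_shrink() on partially full slabs. */
	f = fopen("/sys/kernel/slab/dentry/shrink", "w");
	if (!f) {
		perror("shrink");
		return 1;
	}
	fputs("1", f);
	fclose(f);

	show("/sys/kernel/slab/dentry/partial");
	return 0;
}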



[PATCH 11/15] tools/testing/slab: Add XArray movable objects tests

2019-06-02 Thread Tobin C. Harding
We just implemented movable objects for the XArray.  Let's test it
in-tree.

Add test module for the XArray's movable objects implementation.

Functionality of the XArray Slab Movable Object implementation can
usually be seen simply by using `slabinfo` on a running machine since
the radix tree is typically in use on a running machine and will have
partial slabs.  For repeated testing we can use the test module to
simulate a workload on the XArray, then use `slabinfo` to verify that
object migration is functioning.

If testing on freshly spun up VM (low radix tree workload) it may be
necessary to load/unload the module a number of times to create partial
slabs.

Example test session


Relevant /proc/slabinfo column headers:

  name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>

Prior to testing slabinfo report for radix_tree_node:

  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8352
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object : 576  Total  : 497   Sanity Checks : On   Total: 8142848
  SlabObj: 912  Full   : 473   Redzoning : On   Used : 4810752
  SlabSiz:   16384  Partial:  24   Poisoning : On   Loss : 3332096
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2806272
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  437360

Here you can see the kernel was built with Slab Movable Objects enabled
for the XArray (XArray uses the radix tree below the surface).

After inserting the test module (note we have triggered allocation of a
number of radix tree nodes increasing the object count but decreasing the
number of partial slabs):

  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8442
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes) Slabs  DebugMemory
  
  Object : 576  Total  : 499   Sanity Checks : On   Total: 8175616
  SlabObj: 912  Full   : 484   Redzoning : On   Used : 4862592
  SlabSiz:   16384  Partial:  15   Poisoning : On   Loss : 3313024
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2836512
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  439120

Now we can shrink the radix_tree_node cache:

  # slabinfo radix_tree_node --shrink
  # slabinfo radix_tree_node --report

  Slabcache: radix_tree_node  Aliases:  0 Order :  2 Objects: 8515
  ** Reclaim accounting active
  ** Defragmentation at 30%

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object : 576  Total  : 501   Sanity Checks : On   Total: 8208384
  SlabObj: 912  Full   : 500   Redzoning : On   Used : 4904640
  SlabSiz:   16384  Partial:   1   Poisoning : On   Loss : 3303744
  Loss   : 336  CpuSlab:   0   Tracking  : On   Lalig: 2861040
  Align  :   8  Objects:  17   Tracing   : Off  Lpadd:  440880

Note the single remaining partial slab.

Signed-off-by: Tobin C. Harding 
---
 tools/testing/slab/Makefile |   2 +-
 tools/testing/slab/slub_defrag_xarray.c | 211 
 2 files changed, 212 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/slab/slub_defrag_xarray.c

diff --git a/tools/testing/slab/Makefile b/tools/testing/slab/Makefile
index 440c2e3e356f..44c18d9a4d52 100644
--- a/tools/testing/slab/Makefile
+++ b/tools/testing/slab/Makefile
@@ -1,4 +1,4 @@
-obj-m += slub_defrag.o
+obj-m += slub_defrag.o slub_defrag_xarray.o
 
 KTREE=../../..
 
diff --git a/tools/testing/slab/slub_defrag_xarray.c 
b/tools/testing/slab/slub_defrag_xarray.c
new file mode 100644
index 000000000000..41143f73256c
--- /dev/null
+++ b/tools/testing/slab/slub_defrag_xarray.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define SMOX_CACHE_NAME "smox_test"
+static struct kmem_cache *cachep;
+
+/*
+ * Declare XArrays globally so we can clean them up on module unload.
+ */
+
+/* Used by test_smo_xarray()*/
+DEFINE_XARRAY(things);
+
+/* Thing to store pointers to in the XArray */
+struct smox_thing {
+   long id;
+};
+
+/* It's up to the caller to ensure id is unique */
+static struct smox_thing *alloc_thing(int id)
+{
+   struct smox_thing *thing;
+
+   thing = kmem_cache_alloc(cachep, GFP_KERNEL);
+   if (!thing)
+   return ERR_PTR(-ENOMEM);
+
+   thing->id = id;
+   return thing;
+}
+
+/**
+ * smox_object_ctor() - SMO object constructor function.
+ * @ptr: Pointer to memory where the object should be constructed.
+ */
+void 

[PATCH 08/15] tools/testing/slab: Add object migration test suite

2019-06-02 Thread Tobin C. Harding
We just added a module that enables testing the SLUB allocator's ability
to defrag/shrink caches via movable objects.  Tests are better when they
are automated.

Add automated testing via a python script for SLUB movable objects.

Example output:

  $ cd path/to/linux/tools/testing/slab
  $ ./slub_defrag.py
  Please run script as root

  $ sudo ./slub_defrag.py
  

  $ sudo ./slub_defrag.py --debug
  Loading module ...
  Slab cache smo_test created
  Objects per slab: 20
  Running sanity checks ...

  Running module stress test (see dmesg for additional test output) ...
  Removing module slub_defrag ...
  Loading module ...
  Slab cache smo_test created

  Running test non-movable ...
  testing slab 'smo_test' prior to enabling movable objects ...
  verified non-movable slabs are NOT shrinkable

  Running test movable ...
  testing slab 'smo_test' after enabling movable objects ...
  verified movable slabs are shrinkable

  Removing module slub_defrag ...

Signed-off-by: Tobin C. Harding 
---
 tools/testing/slab/slub_defrag.c  |   1 +
 tools/testing/slab/slub_defrag.py | 451 ++
 2 files changed, 452 insertions(+)
 create mode 100755 tools/testing/slab/slub_defrag.py

diff --git a/tools/testing/slab/slub_defrag.c b/tools/testing/slab/slub_defrag.c
index 4a5c24394b96..8332e69ee868 100644
--- a/tools/testing/slab/slub_defrag.c
+++ b/tools/testing/slab/slub_defrag.c
@@ -337,6 +337,7 @@ static int smo_run_module_tests(int nr_objs, int keep)
 
 /*
  * struct functions() - Map command to a function pointer.
+ * If you update this please update the documentation in slub_defrag.py
  */
 struct functions {
char *fn_name;
diff --git a/tools/testing/slab/slub_defrag.py 
b/tools/testing/slab/slub_defrag.py
new file mode 100755
index 000000000000..41747c0db39b
--- /dev/null
+++ b/tools/testing/slab/slub_defrag.py
@@ -0,0 +1,451 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import subprocess
+import sys
+from os import path
+
+# SLUB Movable Objects test suite.
+#
+# Requirements:
+#  - CONFIG_SLUB=y
+#  - CONFIG_SLUB_DEBUG=y
+#  - The slub_defrag module in this directory.
+
+# Test SMO using a kernel module that enables triggering arbitrary
+# kernel code from userspace via a debugfs file.
+#
+# Module code is in ./slub_defrag.c, basically the functionality is as
+# follows:
+#
+#  - Creates debugfs file /sys/kernel/debug/smo/callfn
+#  - Writes to 'callfn' are parsed as a command string and the function
+#associated with command is called.
+#  - Defines 4 commands (all commands operate on smo_test cache):
+# - 'test': Runs module stress tests.
+# - 'alloc N': Allocates N slub objects
+# - 'free N POS': Frees N objects starting at POS (see below)
+# - 'enable': Enables SLUB Movable Objects
+#
+# The module maintains a list of allocated objects.  Allocation adds
+# objects to the tail of the list.  Free'ing frees from the head of the
+# list.  This has the effect of creating free slots in the slab.  For
+# finer grained control over where in the cache slots are free'd the
+# POS (position) argument may be used.
+
+# The main() function is reasonably readable; the test suite does the
+# following:
+#
+# 1. Runs the module stress tests.
+# 2. Tests the cache without movable objects enabled.
+#- Creates multiple partial slabs as explained above.
+#- Verifies that partial slabs are _not_ removed by shrink (see below).
+# 3. Tests the cache with movable objects enabled.
+#- Creates multiple partial slabs as explained above.
+#- Verifies that partial slabs _are_ removed by shrink (see below).
+
+# The sysfs file /sys/kernel/slab/<cache>/shrink enables calling the
+# function kmem_cache_shrink() (see mm/slab_common.c and mm/slub.c).
+# Shrinking a cache attempts to consolidate all partial slabs by moving
+# objects if object migration is enabled for the cache, otherwise
+# shrinking a cache simply re-orders the partial list so that the most
+# densely populated slabs are at the head of the list.
+
+# Enable/disable debugging output (also enabled via -d | --debug).
+debug = False
+
+# Used in debug messages and when running `insmod`.
+MODULE_NAME = "slub_defrag"
+
+# Slab cache created by the test module.
+CACHE_NAME = "smo_test"
+
+# Set by get_slab_config()
+objects_per_slab = 0
+pages_per_slab = 0
+debugfs_mounted = False # Set to true if we mount debugfs.
+
+
+def eprint(*args, **kwargs):
+print(*args, file=sys.stderr, **kwargs)
+
+
+def dprint(*args, **kwargs):
+if debug:
+print(*args, file=sys.stderr, **kwargs)
+
+
+def run_shell(cmd):
+return subprocess.call([cmd], shell=True)
+
+
+def run_shell_get_stdout(cmd):
+return subprocess.check_output([cmd], shell=True)
+
+
+def assert_root():
+user = run_shell_get_stdout('whoami')
+if user != b'root\n':
+eprint("Please run script as root")
+sys.exit(1)
+
+
+def mount_debugfs():
+mounted = False
+
+# Check if debugfs is mounted at a known 

[PATCH 06/15] tools/vm/slabinfo: Add defrag_used_ratio output

2019-06-02 Thread Tobin C. Harding
Add output for the newly added defrag_used_ratio sysfs knob.

Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index d2c22f9ee2d8..ef4ff93df4cc 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
int movable, ctor;
+   int defrag_used_ratio;
int remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
@@ -549,6 +550,8 @@ static void report(struct slabinfo *s)
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
printf("** Reclaim accounting active\n");
+   if (s->movable)
+   printf("** Defragmentation at %d%%\n", s->defrag_used_ratio);
 
printf("\nSizes (bytes) Slabs  Debug
Memory\n");

printf("\n");
@@ -1279,6 +1282,7 @@ static void read_slab_dir(void)
slab->deactivate_bypass = get_obj("deactivate_bypass");
slab->remote_node_defrag_ratio =
get_obj("remote_node_defrag_ratio");
+   slab->defrag_used_ratio = get_obj("defrag_used_ratio");
chdir("..");
if (read_slab_obj(slab, "ops")) {
if (strstr(buffer, "ctor :"))
-- 
2.21.0



[PATCH 05/15] tools/vm/slabinfo: Add remote node defrag ratio output

2019-06-02 Thread Tobin C. Harding
Add output line for NUMA remote node defrag ratio.

Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index cbfc56c44c2f..d2c22f9ee2d8 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -34,6 +34,7 @@ struct slabinfo {
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
int movable, ctor;
+   int remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -377,6 +378,10 @@ static void slab_numa(struct slabinfo *s, int mode)
if (skip_zero && !s->slabs)
return;
 
+   if (mode) {
+   printf("\nNUMA remote node defrag ratio: %3d\n",
+  s->remote_node_defrag_ratio);
+   }
if (!line) {
printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
for(node = 0; node <= highest_node; node++)
@@ -1272,6 +1277,8 @@ static void read_slab_dir(void)
slab->cpu_partial_free = get_obj("cpu_partial_free");
slab->alloc_node_mismatch = 
get_obj("alloc_node_mismatch");
slab->deactivate_bypass = get_obj("deactivate_bypass");
+   slab->remote_node_defrag_ratio =
+   get_obj("remote_node_defrag_ratio");
chdir("..");
if (read_slab_obj(slab, "ops")) {
if (strstr(buffer, "ctor :"))
-- 
2.21.0



[PATCH 01/15] slub: Add isolate() and migrate() methods

2019-06-02 Thread Tobin C. Harding
Add the two methods needed for moving objects and enable the display of
the callbacks via the /sys/kernel/slab interface.

Add documentation explaining the use of these methods and the prototypes
for slab.h. Add functions to setup the callbacks method for a slab
cache.

Add empty functions for SLAB/SLOB. The API is generic so it could be
theoretically implemented for these allocators as well.

Change sysfs 'ctor' field to be 'ops' to contain all the callback
operations defined for a slab cache.  Display the existing 'ctor'
callback in the ops fields contents along with 'isolate' and 'migrate'
callbacks.

Signed-off-by: Tobin C. Harding 
---
 include/linux/slab.h | 70 
 include/linux/slub_def.h |  3 ++
 mm/slub.c| 59 +
 3 files changed, 126 insertions(+), 6 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9449b19c5f10..886fc130334d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -154,6 +154,76 @@ void memcg_create_kmem_cache(struct mem_cgroup *, struct 
kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
 void memcg_destroy_kmem_caches(struct mem_cgroup *);
 
+/*
+ * Function prototypes passed to kmem_cache_setup_mobility() to enable
+ * mobile objects and targeted reclaim in slab caches.
+ */
+
+/**
+ * typedef kmem_cache_isolate_func - Object migration callback function.
+ * @s: The cache we are working on.
+ * @ptr: Pointer to an array of pointers to the objects to isolate.
+ * @nr: Number of objects in @ptr array.
+ *
+ * The purpose of kmem_cache_isolate_func() is to pin each object so that
+ * they cannot be freed until kmem_cache_migrate_func() has processed
+ * them. This may be accomplished by increasing the refcount or setting
+ * a flag.
+ *
+ * The object pointer array passed is also passed to
+ * kmem_cache_migrate_func().  The function may remove objects from the
+ * array by setting pointers to %NULL. This is useful if we can
+ * determine that an object is being freed because
+ * kmem_cache_isolate_func() was called when the subsystem was calling
+ * kmem_cache_free().  In that case it is not necessary to increase the
+ * refcount or specially mark the object because the release of the slab
+ * lock will lead to the immediate freeing of the object.
+ *
+ * Context: Called with locks held so that the slab objects cannot be
+ *  freed.  We are in an atomic context and no slab operations
+ *  may be performed.
+ * Return: A pointer that is passed to the migrate function. If any
+ * objects cannot be touched at this point then the pointer may
+ * indicate a failure and then the migration function can simply
+ * remove the references that were already obtained. The private
+ * data could be used to track the objects that were already pinned.
+ */
+typedef void *kmem_cache_isolate_func(struct kmem_cache *s, void **ptr, int 
nr);
+
+/**
+ * typedef kmem_cache_migrate_func - Object migration callback function.
+ * @s: The cache we are working on.
+ * @ptr: Pointer to an array of pointers to the objects to migrate.
+ * @nr: Number of objects in @ptr array.
+ * @node: The NUMA node where the object should be allocated.
+ * @private: The pointer returned by kmem_cache_isolate_func().
+ *
+ * This function is responsible for migrating objects.  Typically, for
+ * each object in the input array you will want to allocate an new
+ * object, copy the original object, update any pointers, and free the
+ * old object.
+ *
+ * After this function returns all pointers to the old object should now
+ * point to the new object.
+ *
+ * Context: Called with no locks held and interrupts enabled.  Sleeping
+ *  is possible.  Any operation may be performed.
+ */
+typedef void kmem_cache_migrate_func(struct kmem_cache *s, void **ptr,
+int nr, int node, void *private);
+
+/*
+ * kmem_cache_setup_mobility() is used to setup callbacks for a slab cache.
+ */
+#ifdef CONFIG_SLUB
+void kmem_cache_setup_mobility(struct kmem_cache *, kmem_cache_isolate_func,
+  kmem_cache_migrate_func);
+#else
+static inline void
+kmem_cache_setup_mobility(struct kmem_cache *s, kmem_cache_isolate_func 
isolate,
+ kmem_cache_migrate_func migrate) {}
+#endif
+
 /*
  * Please use this macro to create slab caches. Simply specify the
  * name of the structure and maybe some flags that are listed above.
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index d2153789bd9f..2879a2f5f8eb 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -99,6 +99,9 @@ struct kmem_cache {
gfp_t allocflags;   /* gfp flags to use on each alloc */
int refcount;   /* Refcount for slab cache destroy */
void (*ctor)(void *);
+   kmem_cache_isolate_func *isolate;
+   kmem_cache_migrate_func *migrate;
+
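
To make the callback contract concrete, a hypothetical user of the API
could look like the sketch below.  The struct, the refcount scheme and
all names are invented for illustration (the real users later in this
series are the XArray and the dcache), and the pointer fix-up step is
elided:

#include <linux/slab.h>
#include <linux/refcount.h>

struct foo {
	refcount_t ref;
	int payload;
};

static struct kmem_cache *foo_cachep;

/* Pin each object so it cannot vanish before migrate() runs. */
static void *foo_isolate(struct kmem_cache *s, void **objs, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct foo *f = objs[i];

		if (!refcount_inc_not_zero(&f->ref))
			objs[i] = NULL;	/* already being freed, skip */
	}
	return NULL;			/* no private state needed */
}

/* Allocate a replacement on @node, copy, repoint users, free the old. */
static void foo_migrate(struct kmem_cache *s, void **objs, int nr,
			int node, void *private)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct foo *old = objs[i], *new;

		if (!old)
			continue;
		new = kmem_cache_alloc_node(s, GFP_KERNEL, node);
		if (!new) {
			refcount_dec(&old->ref);	/* drop the pin */
			continue;
		}
		refcount_set(&new->ref, 1);
		new->payload = old->payload;
		/* ...repoint every user of 'old' to 'new' here... */
		kmem_cache_free(s, old);
	}
}

static int __init foo_init(void)
{
	foo_cachep = KMEM_CACHE(foo, 0);
	if (!foo_cachep)
		return -ENOMEM;
	kmem_cache_setup_mobility(foo_cachep, foo_isolate, foo_migrate);
	return 0;
}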

[PATCH 02/15] tools/vm/slabinfo: Add support for -C and -M options

2019-06-02 Thread Tobin C. Harding
-C lists caches that use a ctor.

-M lists caches that support object migration.

Add command line options to show caches with a constructor and caches
that are movable (i.e. have migrate function).

Signed-off-by: Tobin C. Harding 
---
 tools/vm/slabinfo.c | 40 
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c
index 73818f1b2ef8..cbfc56c44c2f 100644
--- a/tools/vm/slabinfo.c
+++ b/tools/vm/slabinfo.c
@@ -33,6 +33,7 @@ struct slabinfo {
unsigned int hwcache_align, object_size, objs_per_slab;
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
+   int movable, ctor;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -67,6 +68,8 @@ int show_report;
 int show_alias;
 int show_slab;
 int skip_zero = 1;
+int show_movable;
+int show_ctor;
 int show_numa;
 int show_track;
 int show_first_alias;
@@ -109,11 +112,13 @@ static void fatal(const char *x, ...)
 
 static void usage(void)
 {
-   printf("slabinfo 4/15/2011. (c) 2007 sgi/(c) 2011 Linux Foundation.\n\n"
-   "slabinfo [-aADefhilnosrStTvz1LXBU] [N=K] [-dafzput] 
[slab-regexp]\n"
+   printf("slabinfo 4/15/2017. (c) 2007 sgi/(c) 2011 Linux Foundation/(c) 
2017 Jump Trading LLC.\n\n"
+  "slabinfo [-aACDefhilMnosrStTvz1LXBU] [N=K] [-dafzput] 
[slab-regexp]\n"
+
"-a|--aliases   Show aliases\n"
"-A|--activity  Most active slabs first\n"
"-B|--Bytes Show size in bytes\n"
+   "-C|--ctor  Show slabs with ctors\n"
"-D|--display-activeSwitch line format to activity\n"
"-e|--empty Show empty slabs\n"
"-f|--first-alias   Show first alias\n"
@@ -121,6 +126,7 @@ static void usage(void)
"-i|--inverted  Inverted list\n"
"-l|--slabs Show slabs\n"
"-L|--Loss  Sort by loss\n"
+   "-M|--movable   Show caches that support movable 
objects\n"
"-n|--numa  Show NUMA information\n"
"-N|--lines=K   Show the first K slabs\n"
"-o|--ops   Show kmem_cache_ops\n"
@@ -588,6 +594,12 @@ static void slabcache(struct slabinfo *s)
if (show_empty && s->slabs)
return;
 
+   if (show_ctor && !s->ctor)
+   return;
+
+   if (show_movable && !s->movable)
+   return;
+
if (sort_loss == 0)
store_size(size_str, slab_size(s));
else
@@ -602,6 +614,10 @@ static void slabcache(struct slabinfo *s)
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+   if (s->ctor)
+   *p++ = 'C';
+   if (s->movable)
+   *p++ = 'M';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -636,7 +652,8 @@ static void slabcache(struct slabinfo *s)
printf("%-21s %8ld %7d %15s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
s->objs_per_slab, s->order,
-   s->slabs ? (s->partial * 100) / s->slabs : 100,
+   s->slabs ? (s->partial * 100) /
+   (s->slabs * s->objs_per_slab) : 100,
s->slabs ? (s->objects * s->object_size * 100) /
(s->slabs * (page_size << s->order)) : 100,
flags);
@@ -1256,6 +1273,13 @@ static void read_slab_dir(void)
slab->alloc_node_mismatch = 
get_obj("alloc_node_mismatch");
slab->deactivate_bypass = get_obj("deactivate_bypass");
chdir("..");
+   if (read_slab_obj(slab, "ops")) {
+   if (strstr(buffer, "ctor :"))
+   slab->ctor = 1;
+   if (strstr(buffer, "migrate :"))
+   slab->movable = 1;
+   }
+
if (slab->name[0] == ':')
alias_targets++;
slab++;
@@ -1332,6 +1356,8 @@ static void xtotals(void)
 }
 
 struct option opts[] = {
+   { "ctor", no_argument, NULL, 'C' },
+   { "movable", no_argument, NULL, 'M' },
{ "aliases", no_argument, NULL, 'a' },
{ "activity", no_argument, NULL, 'A' },
{ "debug", optional_argument, NULL, 'd' },
@@ -1367,7 +1393,7 @@ int main(int argc, char *argv[])
 
page_size = getpagesize();
 
-   while ((c = getopt_long(argc, 

[PATCH 03/15] slub: Sort slab cache list

2019-06-02 Thread Tobin C. Harding
It is advantageous to have all defragmentable slabs together at the
beginning of the list of slabs so that there is no need to scan the
complete list. Put defragmentable caches first when adding a slab cache
and others last.

Signed-off-by: Tobin C. Harding 
---
 mm/slab_common.c | 2 +-
 mm/slub.c| 6 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 58251ba63e4a..db5e9a0b1535 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -393,7 +393,7 @@ static struct kmem_cache *create_cache(const char *name,
goto out_free_cache;
 
s->refcount = 1;
-   list_add(&s->list, &slab_caches);
+   list_add_tail(&s->list, &slab_caches);
memcg_link_cache(s);
 out:
if (err)
diff --git a/mm/slub.c b/mm/slub.c
index 1c380a2bc78a..66d474397c0f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4333,6 +4333,8 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
return;
}
 
+   mutex_lock(&slab_mutex);
+
s->isolate = isolate;
s->migrate = migrate;
 
@@ -4341,6 +4343,10 @@ void kmem_cache_setup_mobility(struct kmem_cache *s,
 * to disable fast cmpxchg based processing.
 */
s->flags &= ~__CMPXCHG_DOUBLE;
+
+   list_move(&s->list, &slab_caches);  /* Move to top */
+
+   mutex_unlock(&slab_mutex);
 }
 EXPORT_SYMBOL(kmem_cache_setup_mobility);
 
-- 
2.21.0
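
The payoff of this ordering shows up in the defrag scan added later in
the series: a walker can stop at the first cache that lacks a migrate
callback.  A simplified sketch of that early termination (illustrative;
slab_caches and slab_mutex are mm-internal, see mm/slab.h):

static void defrag_all_movable_caches(void)
{
	struct kmem_cache *s;

	mutex_lock(&slab_mutex);
	list_for_each_entry(s, &slab_caches, list) {
		if (!s->migrate)
			break;	/* remainder of the list is non-movable */
		kmem_cache_shrink(s);
	}
	mutex_unlock(&slab_mutex);
}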



[PATCH 04/15] slub: Slab defrag core

2019-06-02 Thread Tobin C. Harding
Internal fragmentation can occur within pages used by the slub
allocator.  Under some workloads large numbers of pages can be used by
partial slab pages.  This under-utilisation is bad simply because it
wastes memory but also because if the system is under memory pressure
higher order allocations may become difficult to satisfy.  If we can
defrag slab caches we can alleviate these problems.

Implement Slab Movable Objects in order to defragment slab caches.

Slab defragmentation may occur:

1. Unconditionally when __kmem_cache_shrink() is called on a slab cache
   by the kernel calling kmem_cache_shrink().

2. Unconditionally through the use of the slabinfo command.

slabinfo <cache> -s

3. Conditionally via the use of kmem_cache_defrag()

- Use Slab Movable Objects when shrinking cache.

Currently when the kernel calls kmem_cache_shrink() we curate the
partial slabs list.  If object migration is not enabled for the cache we
still do this, if however, SMO is enabled we attempt to move objects in
partially full slabs in order to defragment the cache.  Shrink attempts
to move all objects in order to reduce the cache to a single partial
slab for each node.

- Add conditional per node defrag via new function:

kmem_defrag_slabs(int node).

kmem_defrag_slabs() attempts to defragment all slab caches for
node. Defragmentation is done conditionally dependent on MAX_PARTIAL
_and_ defrag_used_ratio.

   Caches are only considered for defragmentation if the number of
   partial slabs exceeds MAX_PARTIAL (per node).

   Also, defragmentation only occurs if the usage ratio of the slab is
   lower than the configured percentage (sysfs field added in this
   patch).  Fragmentation ratios are measured by calculating the
   percentage of objects in use compared to the total number of objects
   that the slab page can accommodate.

   The scanning of slab caches is optimized because the defragmentable
   slabs come first on the list. Thus we can terminate scans on the
   first slab encountered that does not support defragmentation.

   kmem_defrag_slabs() takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.

   Defragmentation may be disabled by setting defrag ratio to 0

echo 0 > /sys/kernel/slab/<cache>/defrag_used_ratio

- Add a defrag ratio sysfs field and set it to 30% by default. A limit
of 30% specifies that more than 3 out of 10 available slots for objects
need to be in use otherwise slab defragmentation will be attempted on
the remaining objects.

In order for a cache to be defragmentable the cache must support object
migration (SMO).  Enabling SMO for a cache is done via a call to the
recently added function:

void kmem_cache_setup_mobility(struct kmem_cache *,
   kmem_cache_isolate_func,
   kmem_cache_migrate_func);

Signed-off-by: Tobin C. Harding 
---
 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 include/linux/slab.h|   1 +
 include/linux/slub_def.h|   7 +
 mm/slub.c   | 385 
 4 files changed, 334 insertions(+), 73 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-slab 
b/Documentation/ABI/testing/sysfs-kernel-slab
index 29601d93a1c2..8bd893968e4f 100644
--- a/Documentation/ABI/testing/sysfs-kernel-slab
+++ b/Documentation/ABI/testing/sysfs-kernel-slab
@@ -180,6 +180,20 @@ Description:
list.  It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
 
+What:  /sys/kernel/slab/cache/defrag_used_ratio
+Date:  June 2019
+KernelVersion: 5.2
+Contact:   Christoph Lameter 
+   Pekka Enberg ,
+Description:
+   The defrag_used_ratio file allows the control of how aggressive
+   slab fragmentation reduction works at reclaiming objects from
+   sparsely populated slabs. This is a percentage. If a slab has
+   less than this percentage of objects allocated then reclaim will
+   attempt to reclaim objects so that the whole slab page can be
+   freed. 0% specifies no reclaim attempt (defrag disabled), 100%
+   specifies attempt to reclaim all pages.  The default is 30%.
+
 What:  /sys/kernel/slab/cache/deactivate_to_tail
 Date:  February 2008
 KernelVersion: 2.6.25
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 886fc130334d..4bf381b34829 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -149,6 +149,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char 
*name,
void (*ctor)(void *));
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
+unsigned long kmem_defrag_slabs(int node);
 
 void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 void 
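
As a worked example of the defrag_used_ratio check described above
(numbers illustrative): with the default ratio of 30, a slab page that
can hold 20 objects is considered sparsely populated when fewer than
20 * 30 / 100 = 6 of its slots are in use.  A sketch of the implied
eligibility test (not the patch's exact code):

static bool slab_is_defrag_candidate(struct kmem_cache *s, struct page *page)
{
	return page->inuse * 100 < page->objects * s->defrag_used_ratio;
}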

[PATCH 00/15] Slab Movable Objects (SMO)

2019-06-02 Thread Tobin C. Harding
Hi,

TL;DR - Add object migration (SMO) to the SLUB allocator and implement
object migration for the XArray and the dcache. 

Thanks for your patience with all the RFC's of this patch set.  Here it
is, ready for prime time.

Internal fragmentation can occur within pages used by the slub
allocator.  Under some workloads large numbers of pages can be used by
partial slab pages.  This under-utilisation is bad simply because it
wastes memory but also because if the system is under memory pressure
higher order allocations may become difficult to satisfy.  If we can
defrag slab caches we can alleviate these problems.

In order to be able to defrag slab caches we need to be able to migrate
objects to a new slab.  Slab object migration is the core functionality
added by this patch series.

Internal slab fragmentation is a long known problem.  This series does
not claim to completely _fix_ the issue.  Instead we are adding core
code to the SLUB allocator to enable users of the allocator to help
mitigate internal fragmentation.  Object migration is on a per cache
basis, with each cache being able to take advantage of object migration
to varying degrees depending on the nature of the objects stored in the
cache.

Series includes test modules and test code that can be used to verify the
claimed behaviour.

Patch #1 - Adds the callbacks used to enable SMO for a particular cache.

Patch #2 - Updates the slabinfo tool to show operations related to SMO.

Patch #3 - Sorts the cache list putting migratable slabs at front.

Patch #4 - Adds the SMO infrastructure.  This is the core patch of the
   series.

Patch #5, #6 - Further update slabinfo tool for information just added.

Patch #7 - Add a module for testing SMO.

Patch #8 - Add unit test suite in Python utilising test module from #7.

Patch #9 - Add a new slab cache for the XArray (separate from radix tree).

Patch #10 - Implement SMO for the XArray.

Patch #11 - Add module for testing XArray SMO implementation.

Patch #12 - Add a dentry constructor.

Patch #13 - Use SMO to attempt to reduce fragmentation of the dcache by
selectively freeing dentry objects.

Patch #14 - Add functionality to move slab objects to a specific NUMA node.

Patch #15 - Add functionality to balance slab objects across all NUMA nodes.

The last RFC (RFCv5 and discussion on it) included code to conditionally
exclude SMO for the dcache.  This has been removed.  IMO it is now not
needed.  Al sufficiently bollock'ed me during development that I believe
the dentry code is good and does not negatively affect the dcache.  If
someone would like to prove me wrong simply remove the call to

kmem_cache_setup_mobility(dentry_cache, d_isolate, d_partial_shrink);

Testing:

The series has been tested to verify that objects are moved using bare
metal (core i5) and also Qemu.  This has not been tested on big metal or
on NUMA hardware.

I have no measurements on performance gains achievable with this set, I
have just verified that the migration works and does not appear to break
anything.

Patch #14 and #15 depend on

CONFIG_SLUB_DEBUG_ON or boot with 'slub_debug'

Thanks for taking the time to look at this.

Tobin


Tobin C. Harding (15):
  slub: Add isolate() and migrate() methods
  tools/vm/slabinfo: Add support for -C and -M options
  slub: Sort slab cache list
  slub: Slab defrag core
  tools/vm/slabinfo: Add remote node defrag ratio output
  tools/vm/slabinfo: Add defrag_used_ratio output
  tools/testing/slab: Add object migration test module
  tools/testing/slab: Add object migration test suite
  lib: Separate radix_tree_node and xa_node slab cache
  xarray: Implement migration function for xa_node objects
  tools/testing/slab: Add XArray movable objects tests
  dcache: Provide a dentry constructor
  dcache: Implement partial shrink via Slab Movable Objects
  slub: Enable moving objects to/from specific nodes
  slub: Enable balancing slabs across nodes

 Documentation/ABI/testing/sysfs-kernel-slab |  14 +
 fs/dcache.c | 105 ++-
 include/linux/slab.h|  71 ++
 include/linux/slub_def.h|  10 +
 include/linux/xarray.h  |   3 +
 init/main.c |   2 +
 lib/radix-tree.c|   2 +-
 lib/xarray.c| 109 ++-
 mm/Kconfig  |   7 +
 mm/slab_common.c|   2 +-
 mm/slub.c   | 827 ++--
 tools/testing/slab/Makefile |  10 +
 tools/testing/slab/slub_defrag.c| 567 ++
 tools/testing/slab/slub_defrag.py   | 451 +++
 tools/testing/slab/slub_defrag_xarray.c | 211 +
 tools/vm/slabinfo.c |  51 +-
 16 files changed, 2339 insertions(+), 103 deletions(-)
 create mode 100644 tools/testing/slab/Makefile
 create mode 100644 

Re: [RFC PATCH v5 16/16] dcache: Add CONFIG_DCACHE_SMO

2019-06-02 Thread Tobin C. Harding
On Wed, May 29, 2019 at 04:16:51PM +, Roman Gushchin wrote:
> On Wed, May 29, 2019 at 01:54:06PM +1000, Tobin C. Harding wrote:
> > On Tue, May 21, 2019 at 02:05:38AM +, Roman Gushchin wrote:
> > > On Tue, May 21, 2019 at 11:31:18AM +1000, Tobin C. Harding wrote:
> > > > On Tue, May 21, 2019 at 12:57:47AM +, Roman Gushchin wrote:
> > > > > On Mon, May 20, 2019 at 03:40:17PM +1000, Tobin C. Harding wrote:
> > > > > > In an attempt to make the SMO patchset as non-invasive as possible 
> > > > > > add a
> > > > > > config option CONFIG_DCACHE_SMO (under "Memory Management options") 
> > > > > > for
> > > > > > enabling SMO for the DCACHE.  Without this option the dcache
> > > > > > constructor is
> > > > > > used but no other code is built in; with this option enabled, slab
> > > > > > mobility is enabled and the isolate/migrate functions are built in.
> > > > > > 
> > > > > > Add CONFIG_DCACHE_SMO to guard the partial shrinking of the dcache 
> > > > > > via
> > > > > > Slab Movable Objects infrastructure.
> > > > > 
> > > > > Hm, isn't it better to make it a static branch? Or basically anything
> > > > > that allows switching on the fly?
> > > > 
> > > > If that is wanted, turning SMO on and off per cache, we can probably do
> > > > this in the SMO code in SLUB.
> > > 
> > > Not necessarily per cache, but without recompiling the kernel.
> > > > 
> > > > > It seems that the cost of just building it in shouldn't be that high.
> > > > > And the question if the defragmentation worth the trouble is so much
> > > > > easier to answer if it's possible to turn it on and off without 
> > > > > rebooting.
> > > > 
> > > > If the question is 'is defragmentation worth the trouble for the
> > > > dcache', I'm not sure having SMO turned off helps answer that question.
> > > > If one doesn't shrink the dentry cache there should be very little
> > > > overhead in having SMO enabled.  So if one wants to explore this
> > > > question then they can turn on the config option.  Please correct me if
> > > > I'm wrong.
> > > 
> > > The problem with a config option is that it's hard to switch over.
> > > 
> > > So just to test your changes in production a new kernel should be built,
> > > tested and rolled out to a representative set of machines (which can be
> > > measured in thousands of machines). Then if results are questionable,
> > > it should be rolled back.
> > > 
> > > What you're actually guarding is the kmem_cache_setup_mobility() call,
> > > which can be perfectly avoided using a boot option, for example. Turning
> > > it on and off completely dynamic isn't that hard too.
> > 
> > Hi Roman,
> > 
> > I've added a boot parameter to SLUB so that admins can enable/disable
> > SMO at boot time system wide.  Then for each object that implements SMO
> > (currently XArray and dcache) I've also added a boot parameter to
> > enable/disable SMO for that cache specifically (these depend on SMO
> > being enabled system wide).
> > 
> > All three boot parameters default to 'off', I've added a config option
> > to default each to 'on'.
> > 
> > I've got a little more testing to do on another part of the set then the
> > PATCH version is coming at you :)
> > 
> > This is more a courtesy email than a request for comment, but please
> > feel free to shout if you don't like the method outlined above.
> > 
> > Fully dynamic config is not currently possible because currently the SMO
> > implementation does not support disabling mobility for a cache once it
> > is turned on, a bit of extra logic would need to be added and some state
> > stored - I'm not sure it warrants it ATM but that can be easily added
> > later if wanted.  Maybe Christoph will give his opinion on this.
> 
> Perfect!

Hi Roman,

I'm about to post PATCH series.  I have removed all the boot time config
options in contrast to what I stated in this thread.  I feel it requires
some comment so as not to seem rude to you.  Please feel free to
re-raise these issues on the series if you feel it is a better place to
do it than on this thread.

I still hear you re making testing easier if there are boot parameters.
I don't have extensive experience testing on a large number of machines
so I have no basis to contradict what you said.

It was suggested to me that having switches to turn SMO off implies the
series is not ready.  I am claiming that SMO _is_ ready and also that it
has no negative effects (especially on the dcache).  I therefore think
this comment is pertinent.

So ... I re-did the boot parameters defaulting to 'on'.  However I could
then see no reason (outside of testing) to turn them off.  It seems ugly
to have code that is only required during testing and never after.
Please correct me if I'm wrong.
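
For what it's worth, such a switch is only a handful of lines; a sketch
of what the boot parameter looked like (names hypothetical):

	static bool smo_enabled __ro_after_init;	/* default off */

	static int __init smo_setup(char *str)
	{
		smo_enabled = true;
		return 1;
	}
	__setup("slub_smo", smo_setup);

with kmem_cache_setup_mobility() becoming a no-op unless smo_enabled is
set.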

Finally I decided that since adding a boot parameter is trivial,
hackers could easily add one if they wanted to test a specific
cache.  Otherwise we just test 'patched kernel' vs 'unpatched kernel'.
Again, please correct me if I'm wrong.

So, that is my reasoning.

Re: [PATCH] PCI: endpoint: Add DMA to Linux PCI EP Framework

2019-06-02 Thread Kishon Vijay Abraham I
Hi Vinod,

On 31/05/19 12:02 PM, Vinod Koul wrote:
> On 31-05-19, 10:50, Kishon Vijay Abraham I wrote:
>> Hi Vinod,
>>
>> On 31/05/19 10:37 AM, Vinod Koul wrote:
>>> Hi Kishon,
>>>
>>> On 30-05-19, 11:16, Kishon Vijay Abraham I wrote:
 +Vinod Koul

 Hi,

 On 30/05/19 4:07 AM, Alan Mikhak wrote:
> On Mon, May 27, 2019 at 2:09 AM Gustavo Pimentel
>  wrote:
>>
>> On Fri, May 24, 2019 at 20:42:43, Alan Mikhak 
>> wrote:
>>
>> Hi Alan,
>>
>>> On Fri, May 24, 2019 at 1:59 AM Gustavo Pimentel
>>>  wrote:

 Hi Alan,

 This patch implementation is very HW implementation dependent and
 requires the DMA to exposed through PCIe BARs, which aren't always the
 case. Besides, you are defining some control bits on
 include/linux/pci-epc.h that may not have any meaning to other types of
 DMA.

 I don't think this was what Kishon had in mind when he developed the
 pcitest, but let see what Kishon was to say about it.

 I've developed a DMA driver for DWC PCI using Linux Kernel DMAengine 
 API
 and which I submitted some days ago.
 By having a DMA driver which implemented using DMAengine API, means the
 pcitest can use the DMAengine client API, which will be completely
 generic to any other DMA implementation.

 right, my initial thought process was to use only dmaengine APIs in
 pci-epf-test so that the system DMA or DMA within the PCIe controller can 
 be
 used transparently. But can we register DMA within the PCIe controller to 
 the
 DMA subsystem? AFAIK only system DMA should register with the DMA 
 subsystem.
 (ADMA in SDHCI doesn't use dmaengine). Vinod Koul can confirm.
>>>
>>> So would this DMA be dedicated for PCI and all PCI devices on the bus?
>>
>> Yes, this DMA will be used only by PCI ($patch is w.r.t PCIe device mode. So
>> all endpoint functions both physical and virtual functions will use the DMA 
>> in
>> the controller).
>>> If so I do not see a reason why this cannot be using dmaengine. The use
>>
>> Thanks for clarifying. I was under the impression any DMA within a peripheral
>> controller shouldn't use DMAengine.
> 
> That is indeed a correct assumption. The dmaengine helps in cases where
> we have a dma controller with multiple users, for a single user case it
> might be overhead to setup dma driver and then use it thru framework.
> 
> Someone needs to see the benefit and cost of using the framework and
> decide.

The DMA within the endpoint controller can indeed be used by multiple users for
e.g in the case of multi function EP devices or SR-IOV devices, all the
function drivers can use the DMA in the endpoint controller.

I think it makes sense to use dmaengine for DMA within the endpoint controller.
> 
>>> case would be memcpy for DMA right or mem to device (vice versa) transfers?
>>
>> The device is memory mapped so it would be only memcopy.
>>>
>>> Btw many driver in sdhci do use dmaengine APIs and yes we are missing
>>> support in framework than individual drivers
>>
>> I think dmaengine APIs is used only when the platform uses system DMA and not
>> ADMA within the SDHCI controller. IOW there is no dma_async_device_register()
>> to register ADMA in SDHCI with DMA subsystem.
> 
> We are looking it from the different point of view. You are looking for
> dmaengine drivers in that (which would be in drivers/dma/) and I am
> pointing to users of dmaengine in that.
> 
> So the users in mmc would be ones using dmaengine APIs:
> $git grep -l dmaengine_prep_* drivers/mmc/
> 
> which tells me 17 drivers!

right. For the endpoint case, drivers/pci/controller should register with the
dmaengine, i.e. if the controller has an embedded DMA (I think it should be okay
to keep that in drivers/pci/controller itself instead of drivers/dma) and
drivers/pci/endpoint/functions/ should use dmaengine API's (Depending on the
platform, this will either use system DMA or DMA within the PCI controller).
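
To make that concrete, a minimal sketch of the dmaengine client side in
an endpoint function driver, for a single memcpy transfer (helper name
hypothetical; error handling and the completion wait trimmed):

	static int epf_dma_copy(dma_addr_t dst, dma_addr_t src, size_t len)
	{
		struct dma_async_tx_descriptor *tx;
		struct dma_chan *chan;
		dma_cap_mask_t mask;

		dma_cap_zero(mask);
		dma_cap_set(DMA_MEMCPY, mask);
		/* resolves to the system DMA or the EP controller's own
		 * DMA, whichever registered a memcpy-capable channel */
		chan = dma_request_channel(mask, NULL, NULL);
		if (!chan)
			return -ENODEV;

		tx = dmaengine_prep_dma_memcpy(chan, dst, src, len,
					       DMA_PREP_INTERRUPT);
		if (!tx) {
			dma_release_channel(chan);
			return -EIO;
		}

		dmaengine_submit(tx);
		dma_async_issue_pending(chan);
		return 0;
	}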

Thanks
Kishon


Re: [PATCH] sched/core: add __sched tag for io_schedule()

2019-06-02 Thread Gao Xiang



On 2019/5/31 22:37, Tejun Heo wrote:
> On Fri, May 31, 2019 at 04:29:12PM +0800, Gao Xiang wrote:
>> non-inline io_schedule() was introduced in
>> commit 10ab56434f2f ("sched/core: Separate out io_schedule_prepare() and
>> io_schedule_finish()").
>> Tag it __sched to keep it in line with io_schedule_timeout(); otherwise
>> "/proc/<pid>/wchan" will report io_schedule()
>> rather than its callers when waiting for io.
>>
>> Reported-by: Jilong Kou 
>> Cc: Tejun Heo 
>> Cc: Ingo Molnar 
>> Cc: Peter Zijlstra 
>> Signed-off-by: Gao Xiang 
> 
> Acked-by: Tejun Heo 

Cc:  # 4.11+

Thanks Tejun. This patch will be needed for io performance analysis
since we found that the Android systrace tool cannot show the callers of
iowait raised from io_schedule() on the linux-4.14 LTS kernel.
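
For reference, the change is only the annotation; with the patch applied
the function reads:

	void __sched io_schedule(void)
	{
		int token;

		token = io_schedule_prepare();
		schedule();
		io_schedule_finish(token);
	}
	EXPORT_SYMBOL(io_schedule);

so the wchan unwinder skips it and reports the real caller.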

Hi Andrew, could you kindly take this patch?

Thanks,
Gao Xiang

> 
> Thanks.
> 


Re: [PATCH -next v2] mm/hotplug: fix a null-ptr-deref during NUMA boot

2019-06-02 Thread Pingfan Liu
On Fri, May 31, 2019 at 5:03 PM Michal Hocko  wrote:
>
> On Thu 30-05-19 20:55:32, Pingfan Liu wrote:
> > On Wed, May 29, 2019 at 2:20 AM Michal Hocko  wrote:
> > >
> > > [Sorry for a late reply]
> > >
> > > On Thu 23-05-19 11:58:45, Pingfan Liu wrote:
> > > > On Wed, May 22, 2019 at 7:16 PM Michal Hocko  wrote:
> > > > >
> > > > > On Wed 22-05-19 15:12:16, Pingfan Liu wrote:
> > > [...]
> > > > > > But in fact, we already have for_each_node_state(nid, N_MEMORY) to
> > > > > > cover this purpose.
> > > > >
> > > > > I do not really think we want to spread N_MEMORY outside of the core 
> > > > > MM.
> > > > > It is quite confusing IMHO.
> > > > > .
> > > > But it has already like this. Just git grep N_MEMORY.
> > >
> > > I might be wrong but I suspect a closer review would reveal that the use
> > > will be inconsistent or dubious so following the existing users is not
> > > the best approach.
> > >
> > > > > > Furthermore, changing the definition of online may
> > > > > > break something in the scheduler, e.g. in task_numa_migrate(), where
> > > > > > it calls for_each_online_node.
> > > > >
> > > > > Could you be more specific please? Why should numa balancing consider
> > > > > nodes without any memory?
> > > > >
> > > > As my understanding, the destination cpu can be on a memory less node.
> > > > BTW, there are several functions in the scheduler facing the same
> > > > scenario, task_numa_migrate() is an example.
> > >
> > > Even if the destination node is memoryless then any migration would fail
> > > because there is no memory. Anyway I still do not see how using online
> > > node would break anything.
> > >
> > Suppose we have nodes A, B, C, where C is memoryless but has a small
> > distance to B compared with the one from A to B. Then a task may be
> > running on A but prefer to run on B due to memory footprint.
> > task_numa_migrate() allows us to migrate the task to node C. Changing
> > for_each_online_node() will break this.
>
> That would require the task to have preferred node to be C no? Or do I
> missunderstand the task migration logic?
I think in task_numa_migrate(), the migration logic should look like:
  env.dst_nid = p->numa_preferred_nid; //Here dst nid is B
But later in
  if (env.best_cpu == -1 || (p->numa_group &&
p->numa_group->active_nodes > 1)) {
for_each_online_node(nid) {
[...]
   task_numa_find_cpu(&env, taskimp, groupimp); // Here is a
chance to change p->numa_preferred_nid

There are several others broken by changing for_each_online_node():
-1. show_numa_stats()
-2. init_numa_topology_type(), where sched_numa_topology_type may be
mistakenly evaluated.
-3. ... one can check the calls to for_each_online_node() one by one in the scheduler.

That is my understanding of the code.
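
To illustrate the difference between the two iterators (the two helpers
are hypothetical):

	int nid;

	/* visits memoryless (e.g. CPU-only) nodes too -- what the
	 * scheduler relies on when picking a destination CPU */
	for_each_online_node(nid)
		consider_cpus_on(nid);

	/* visits only nodes that actually have memory -- fine for
	 * allocation paths, wrong for CPU placement */
	for_each_node_state(nid, N_MEMORY)
		consider_memory_on(nid);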

Thanks,
  Pingfan


Re: rcu_read_lock lost its compiler barrier

2019-06-02 Thread Herbert Xu
On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> >
> > CPU2: if (b != 1)
> > CPU2: b = 1;
> 
> Stop right there.  The kernel is full of code that assumes that
> assignment to an int/long is atomic.  If your compiler breaks this
> assumption then we can kiss the kernel good-bye.

The slippery slope apparently started here:

: commit ea435467500612636f8f4fb639ff6e76b2496e4b
: Author: Matthew Wilcox 
: Date:   Tue Jan 6 14:40:39 2009 -0800
: 
: atomic_t: unify all arch definitions
:
: diff --git a/arch/x86/include/asm/atomic_32.h 
b/arch/x86/include/asm/atomic_32.h
: index ad5b9f6ecddf..85b46fba4229 100644
: --- a/arch/x86/include/asm/atomic_32.h
: +++ b/arch/x86/include/asm/atomic_32.h
: ...
: @@ -10,15 +11,6 @@
:   * resource counting etc..
:   */
:
: -/*
: - * Make sure gcc doesn't try to be clever and move things around
: - * on us. We need to use _exactly_ the address the user gave us,
: - * not some alias that contains the same information.
: - */
: -typedef struct {
: -   int counter;
: -} atomic_t;
:
: diff --git a/include/linux/types.h b/include/linux/types.h
: index 121f349cb7ec..3b864f2d9560 100644
: --- a/include/linux/types.h
: +++ b/include/linux/types.h
: @@ -195,6 +195,16 @@ typedef u32 phys_addr_t;
:  
:  typedef phys_addr_t resource_size_t;
:
: +typedef struct {
: +   volatile int counter;
: +} atomic_t;
: +

Before evolving into the READ_ONCE/WRITE_ONCE that we have now.

Linus, are we now really supporting a compiler where an assignment
(or a read) from an int/long/pointer can be non-atomic without the
volatile marker? Because if that's the case then we have a lot of
code to audit.
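
To spell out the assumption in code form (a sketch, not any particular
kernel function):

	int b;

	void plain_writer(void)
	{
		b = 1;			/* assumed to be one untorn store */
	}

	void marked_writer(void)
	{
		WRITE_ONCE(b, 1);	/* the only form the model guarantees */
	}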

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-02 Thread Pingfan Liu
On Sat, Jun 1, 2019 at 1:06 AM John Hubbard  wrote:
>
> On 5/31/19 4:05 AM, Pingfan Liu wrote:
> > On Fri, May 31, 2019 at 7:21 AM John Hubbard  wrote:
> >> On 5/30/19 2:47 PM, Ira Weiny wrote:
> >>> On Thu, May 30, 2019 at 06:54:04AM +0800, Pingfan Liu wrote:
> >> [...]
> >> Rather lightly tested...I've compile-tested with CONFIG_CMA and 
> >> !CONFIG_CMA,
> >> and boot tested with CONFIG_CMA, but could use a second set of eyes on 
> >> whether
> >> I've added any off-by-one errors, or worse. :)
> >>
> > Do you mind I send V2 based on your above patch? Anyway, it is a simple bug 
> > fix.
> >
>
> Sure, that's why I sent it. :)  Note that Ira also recommended splitting the
> "nr --> nr_pinned" renaming into a separate patch.
>
Thanks for your kind help. I will split out nr_pinned to a separate patch.

Regards,
  Pingfan


Re: [PATCH] mm/gup: fix omission of check on FOLL_LONGTERM in get_user_pages_fast()

2019-06-02 Thread Pingfan Liu
On Sat, Jun 1, 2019 at 1:12 AM Ira Weiny  wrote:
>
> On Fri, May 31, 2019 at 07:05:27PM +0800, Pingfan Liu wrote:
> > On Fri, May 31, 2019 at 7:21 AM John Hubbard  wrote:
> > >
> > >
> > > Rather lightly tested...I've compile-tested with CONFIG_CMA and 
> > > !CONFIG_CMA,
> > > and boot tested with CONFIG_CMA, but could use a second set of eyes on 
> > > whether
> > > I've added any off-by-one errors, or worse. :)
> > >
> > Do you mind I send V2 based on your above patch? Anyway, it is a simple bug 
> > fix.
>
> FWIW please split out the nr_pinned change to a separate patch.
>
OK.

Thanks,
  Pingfan


Re: rcu_read_lock lost its compiler barrier

2019-06-02 Thread Herbert Xu
On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
>
>   CPU2: if (b != 1)
>   CPU2: b = 1;

Stop right there.  The kernel is full of code that assumes that
assignment to an int/long is atomic.  If your compiler breaks this
assumption then we can kiss the kernel good-bye.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH 1/2] dt-bindings: i3c: Document MediaTek I3C master bindings

2019-06-02 Thread Qii Wang
Document MediaTek I3C master DT bindings.

Signed-off-by: Qii Wang 
---
 .../devicetree/bindings/i3c/mtk,i3c-master.txt |   50 
 1 file changed, 50 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/i3c/mtk,i3c-master.txt

diff --git a/Documentation/devicetree/bindings/i3c/mtk,i3c-master.txt 
b/Documentation/devicetree/bindings/i3c/mtk,i3c-master.txt
new file mode 100644
index 000..89ec380
--- /dev/null
+++ b/Documentation/devicetree/bindings/i3c/mtk,i3c-master.txt
@@ -0,0 +1,50 @@
+Bindings for MediaTek I3C master block
+=
+
+Required properties:
+
+- compatible: shall be "mediatek,i3c-master"
+- reg: physical base address of the controller and apdma base, length of
+  memory mapped region.
+- reg-names: should be "main" for controller and "dma" for apdma.
+- interrupts: interrupt number to the cpu.
+- clock-div: the fixed value for frequency divider of clock source in i3c
+  module. Each IC may be different.
+- clocks: clock name from clock manager.
+- clock-names: must include "main" and "dma".
+
+Mandatory properties defined by the generic binding (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details):
+
+- #address-cells: shall be set to 3
+- #size-cells: shall be set to 0
+
+Optional properties defined by the generic binding (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details):
+
+- i2c-scl-hz
+- i3c-scl-hz
+
+I3C devices connected on the bus follow the generic description (see
+Documentation/devicetree/bindings/i3c/i3c.txt for more details).
+
+Example:
+
+   i3c0: i3c@1100d000 {
+   compatible = "mediatek,i3c-master";
+   reg = <0x1100d000 0x100>,
+ <0x11000300 0x80>;
+   reg-names = "main", "dma";
+   interrupts = ;
+   clock-div = <16>;
+   clocks = <_ck>, <_dma_ck>;
+   clock-names = "main", "dma";
+   #address-cells = <1>;
+   #size-cells = <0>;
+   i2c-scl-hz = <10>;
+
+   nunchuk: nunchuk@52 {
+   compatible = "nintendo,nunchuk";
+   reg = <0x52 0x8010 0>;
+   };
+   };
-- 
1.7.9.5



[PATCH 2/2] i3c: master: Add driver for MediaTek IP

2019-06-02 Thread Qii Wang
Add a driver for MediaTek I3C master IP.

Signed-off-by: Qii Wang 
---
 drivers/i3c/master/Kconfig  |   10 +
 drivers/i3c/master/Makefile |1 +
 drivers/i3c/master/i3c-master-mtk.c | 1246 +++
 3 files changed, 1257 insertions(+)
 create mode 100644 drivers/i3c/master/i3c-master-mtk.c

diff --git a/drivers/i3c/master/Kconfig b/drivers/i3c/master/Kconfig
index 26c6b58..acc00d9 100644
--- a/drivers/i3c/master/Kconfig
+++ b/drivers/i3c/master/Kconfig
@@ -20,3 +20,13 @@ config DW_I3C_MASTER
 
  This driver can also be built as a module.  If so, the module
  will be called dw-i3c-master.
+
+config MTK_I3C_MASTER
+   tristate "MediaTek I3C master driver"
+   depends on I3C
+   depends on HAS_IOMEM
+   depends on !(ALPHA || PARISC)
+   help
+ This selects the MediaTek(R) I3C master controller driver.
+ If you want to use MediaTek(R) I3C interface, say Y here.
+ If unsure, say N or M.
diff --git a/drivers/i3c/master/Makefile b/drivers/i3c/master/Makefile
index fc53939..fe7ccf5 100644
--- a/drivers/i3c/master/Makefile
+++ b/drivers/i3c/master/Makefile
@@ -1,2 +1,3 @@
 obj-$(CONFIG_CDNS_I3C_MASTER)  += i3c-master-cdns.o
 obj-$(CONFIG_DW_I3C_MASTER)+= dw-i3c-master.o
+obj-$(CONFIG_MTK_I3C_MASTER)   += i3c-master-mtk.o
diff --git a/drivers/i3c/master/i3c-master-mtk.c 
b/drivers/i3c/master/i3c-master-mtk.c
new file mode 100644
index 000..a209bb6
--- /dev/null
+++ b/drivers/i3c/master/i3c-master-mtk.c
@@ -0,0 +1,1246 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 MediaTek Design Systems Inc.
+ *
+ * Author: Qii Wang 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DRV_NAME   "i3c-master-mtk"
+
+#define SLAVE_ADDR 0x04
+#define INTR_MASK  0x08
+#define INTR_STAT  0x0c
+#define INTR_TRANSAC_COMP  BIT(0)
+#define INTR_ACKERR		GENMASK(2, 1)
+#define INTR_ARB_LOST  BIT(3)
+#define INTR_RS_MULTI  BIT(4)
+#define INTR_MAS_ERR   BIT(8)
+#define INTR_ALL   (INTR_MAS_ERR | INTR_ARB_LOST |\
+   INTR_ACKERR | INTR_TRANSAC_COMP)
+
+#define CONTROL0x10
+#define CONTROL_WRAPPER		BIT(0)
+#define CONTROL_RS BIT(1)
+#define CONTROL_DMA_EN BIT(2)
+#define CONTROL_CLK_EXT_EN BIT(3)
+#define CONTROL_DIR_CHANGE  BIT(4)
+#define CONTROL_ACKERR_DET_EN  BIT(5)
+#define CONTROL_LEN_CHANGE BIT(6)
+#define CONTROL_DMAACK_EN  BIT(8)
+#define CONTROL_ASYNC_MODE BIT(9)
+
+#define TRANSFER_LEN   0x14
+#define TRANSAC_LEN0x18
+#define TRANSAC_LEN_WRRD   0x0002
+#define TRANS_ONE_LEN  0x0001
+
+#define DELAY_LEN  0x1c
+#define DELAY_LEN_DEFAULT  0x000a
+
+#define TIMING 0x20
+#define TIMING_VALUE(sample_cnt, step_cnt) ({ \
+   typeof(sample_cnt) sample_cnt_ = (sample_cnt); \
+   typeof(step_cnt) step_cnt_ = (step_cnt); \
+   (((sample_cnt_) << 8) | (step_cnt_)); \
+})
+
+#define START  0x24
+#define START_EN   BIT(0)
+#define START_MUL_TRIG BIT(14)
+#define START_MUL_CNFG BIT(15)
+
+#define EXT_CONF   0x28
+#define EXT_CONF_DEFAULT   0x0a1f
+
+#define LTIMING0x2c
+#define LTIMING_VALUE(sample_cnt, step_cnt) ({ \
+   typeof(sample_cnt) sample_cnt_ = (sample_cnt); \
+   typeof(step_cnt) step_cnt_ = (step_cnt); \
+   (((sample_cnt_) << 6) | (step_cnt_) | \
+   ((sample_cnt_) << 12) | ((step_cnt_) << 9)); \
+})
+
+#define HS 0x30
+#define HS_CLR_VALUE   0x
+#define HS_DEFAULT_VALUE   0x0083
+#define HS_VALUE(sample_cnt, step_cnt) ({ \
+   typeof(sample_cnt) sample_cnt_ = (sample_cnt); \
+   typeof(step_cnt) step_cnt_ = (step_cnt); \
+   (HS_DEFAULT_VALUE | \
+   ((sample_cnt_) << 12) | ((step_cnt_) << 8)); \
+})
+
+#define IO_CONFIG  0x34
+#define IO_CONFIG_PUSH_PULL0x
+
+#define FIFO_ADDR_CLR  0x38
+#define FIFO_CLR   0x0003
+
+#define MCU_INTR   0x40
+#define MCU_INTR_EN		BIT(0)
+
+#define TRANSFER_LEN_AUX   0x44
+#define CLOCK_DIV  0x48
+#define CLOCK_DIV_DEFAULT  ((INTER_CLK_DIV - 1) << 8 |\
+   (INTER_CLK_DIV - 1))
+
+#define SOFTRESET  0x50
+#define SOFT_RST   BIT(0)
+
+#define TRAFFIC0x54
+#define TRAFFIC_DAA_EN BIT(4)
+#define TRAFFIC_TBIT   BIT(7)
+#define TRAFFIC_HEAD_ONLY  BIT(9)
+#define TRAFFIC_SKIP_SLV_ADDR  BIT(10)
+#define TRAFFIC_HANDOFFBIT(14)
+
+#define DEF_DA 0x68
+#define DEF_DAA_SLV_PARITY BIT(8)
+
+#define SHAPE  0x6c
+#define SHAPE_TBIT_STALL   BIT(1)
+
+#define HFIFO_DATA   

[PATCH 0/2] Add MediaTek I3C master controller driver

2019-06-02 Thread Qii Wang
This series are based on 5.2-rc1, we provide two patches to
support MediaTek I3C master controller.

Qii Wang (2):
  dt-bindings: i3c: Document MediaTek I3C master bindings
  i3c: master: Add driver for MediaTek IP

 .../devicetree/bindings/i3c/mtk,i3c-master.txt |   50 +
 drivers/i3c/master/Kconfig |   10 +
 drivers/i3c/master/Makefile|1 +
 drivers/i3c/master/i3c-master-mtk.c| 1246 
 4 files changed, 1307 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/i3c/mtk,i3c-master.txt
 create mode 100644 drivers/i3c/master/i3c-master-mtk.c

-- 
1.7.9.5


Re: rcu_read_lock lost its compiler barrier

2019-06-02 Thread Paul E. McKenney
On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 01:54:12PM -0700, Linus Torvalds wrote:
> > On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu  
> > wrote:
> > >
> > > You can't then go and decide to remove the compiler barrier!  To do
> > > that you'd need to audit every single use of rcu_read_lock in the
> > > kernel to ensure that they're not depending on the compiler barrier.
> > 
> > What's the possible case where it would matter when there is no preemption?
> 
> The case we were discussing is from net/ipv4/inet_fragment.c from
> the net-next tree:
> 
> void fqdir_exit(struct fqdir *fqdir)
> {
>   ...
>   fqdir->dead = true;
> 
>   /* call_rcu is supposed to provide memory barrier semantics,
>* separating the setting of fqdir->dead with the destruction
>* work.  This implicit barrier is paired with inet_frag_kill().
>*/
> 
>   INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn);
>   queue_rcu_work(system_wq, &fqdir->destroy_rwork);
> }
> 
> and
> 
> void inet_frag_kill(struct inet_frag_queue *fq)
> {
>   ...
>   rcu_read_lock();
>   /* The RCU read lock provides a memory barrier
>* guaranteeing that if fqdir->dead is false then
>* the hash table destruction will not start until
>* after we unlock.  Paired with inet_frags_exit_net().
>*/
>   if (!fqdir->dead) {
>   rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
>  fqdir->f->rhash_params);
>   ...
>   }
>   ...
>   rcu_read_unlock();
>   ...
> }
> 
> I simplified this to
> 
> Initial values:
> 
> a = 0
> b = 0
> 
> CPU1  CPU2
>   
> a = 1 rcu_read_lock
> synchronize_rcu   if (a == 0)
> b = 2 b = 1
>   rcu_read_unlock
> 
> On exit we want this to be true:
> b == 2
> 
> Now what Paul was telling me is that unless every memory operation
> is done with READ_ONCE/WRITE_ONCE then his memory model shows that
> the exit constraint won't hold.

But please note that the plain-variable portion of the memory model is
very new and likely still has a bug or two.  In fact, see below.

>  IOW, we need
> 
> CPU1  CPU2
>   
> WRITE_ONCE(a, 1)  rcu_read_lock
> synchronize_rcu   if (READ_ONCE(a) == 0)
> WRITE_ONCE(b, 2)  WRITE_ONCE(b, 1)
>   rcu_read_unlock
> 
> Now I think this bullshit because if we really needed these compiler
> barriers then we surely would need real memory barriers to go with
> them.

On the one hand, you have no code before your rcu_read_lock() and also
no code after your rcu_read_unlock().  So in this particular example,
adding compiler barriers to these guys won't help you.

On the other hand, on CPU 1's write to "b", I agree with you and disagree
with the model, though perhaps my partners in LKMM crime will show me the
error of my ways on this point.  On CPU 2's write to "b", I can see the
memory model's point, but getting there requires some gymnastics on the
part of both the compiler and the CPU.  The WRITE_ONCE() and READ_ONCE()
for "a" is the normal requirement for variables that are concurrently
loaded and stored.

Please note that garden-variety uses of RCU have similar requirements,
namely the rcu_assign_pointer() on the one side and the rcu_dereference()
on the other.  Your use case allows rcu_assign_pointer() to be weakened
to WRITE_ONCE() and rcu_dereference() to be weakened to READ_ONCE()
(not that this last is all that much of a weakening these days).
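
Concretely, for the flag above that would mean something like this (a
hypothetical simplification of the inet_fragment.c code quoted earlier):

	/* fqdir_exit() side */
	WRITE_ONCE(fqdir->dead, true);

	/* inet_frag_kill() side, inside the RCU read-side section */
	if (!READ_ONCE(fqdir->dead))
		rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
				       fqdir->f->rhash_params);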

> In fact, the sole purpose of the RCU mechanism is to provide those
> memory barriers.  Quoting from
> Documentation/RCU/Design/Requirements/Requirements.html:
> 
>   Each CPU that has an RCU read-side critical section that
>   begins before synchronize_rcu() starts is
>   guaranteed to execute a full memory barrier between the time
>   that the RCU read-side critical section ends and the time that
>   synchronize_rcu() returns.
>   Without this guarantee, a pre-existing RCU read-side critical section
>   might hold a reference to the newly removed struct foo
>   after the kfree() on line 14 of
>   remove_gp_synchronous().
>   Each CPU that has an RCU read-side critical section that ends
>   after synchronize_rcu() returns is guaranteed
>   to execute a full memory barrier between the time that
>   synchronize_rcu() begins and the time that the RCU
>   read-side critical section begins.
>   Without this guarantee, a later RCU read-side critical section
>   running after the kfree() on line 14 of
>   

[PATCH v10 4/7] rpmsg: add rpmsg support for mt8183 SCP.

2019-06-02 Thread Pi-Hsun Shih
Add simple rpmsg support for the mt8183 SCP that uses IPI / IPC directly.

Signed-off-by: Pi-Hsun Shih 
---
Changes from v9, v8, v7:
 - No change.

Changes from v6:
 - Decouple mtk_rpmsg from mtk_scp by putting all necessary information
   (name service IPI id, register/unregister/send functions) into a
   struct, and passing it to the mtk_rpmsg_create_rproc_subdev function.

Changes from v5:
 - CONFIG_MTK_SCP now selects CONFIG_RPMSG_MTK_SCP, and the dummy
   implementation for mtk_rpmsg_{create,destroy}_rproc_subdev when
   CONFIG_RPMSG_MTK_SCP is not defined is removed.

Changes from v4:
 - Match and fill the device tree node to the created rpmsg subdevice,
   so the rpmsg subdevice can utilize the properties and subnodes on
   device tree (This is similar to what drivers/rpmsg/qcom_smd.c does).

Changes from v3:
 - Change from unprepare to stop, to stop the rpmsg driver before the
   rproc is stopped, avoiding problem that some rpmsg would fail after
   rproc is stopped.
 - Add missing spin_lock_init, and use destroy_ept instead of kref_put.

Changes from v2:
 - Unregister IPI handler on unprepare.
 - Lock the channel list on operations.
 - Move SCP_IPI_NS_SERVICE to 0xFF.

Changes from v1:
 - Do cleanup properly in mtk_rpmsg.c, which also removes the problem of
   short-lived work items.
 - Fix several issues checkpatch found.
---
 drivers/remoteproc/Kconfig|   1 +
 drivers/remoteproc/mtk_common.h   |   2 +
 drivers/remoteproc/mtk_scp.c  |  38 ++-
 drivers/remoteproc/mtk_scp_ipi.c  |   1 +
 drivers/rpmsg/Kconfig |   9 +
 drivers/rpmsg/Makefile|   1 +
 drivers/rpmsg/mtk_rpmsg.c | 396 ++
 include/linux/platform_data/mtk_scp.h |   4 +-
 include/linux/rpmsg/mtk_rpmsg.h   |  30 ++
 9 files changed, 477 insertions(+), 5 deletions(-)
 create mode 100644 drivers/rpmsg/mtk_rpmsg.c
 create mode 100644 include/linux/rpmsg/mtk_rpmsg.h

diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
index ad3a0de04d9e..82747a5b9caf 100644
--- a/drivers/remoteproc/Kconfig
+++ b/drivers/remoteproc/Kconfig
@@ -26,6 +26,7 @@ config IMX_REMOTEPROC
 config MTK_SCP
tristate "Mediatek SCP support"
depends on ARCH_MEDIATEK
+   select RPMSG_MTK_SCP
help
  Say y here to support Mediatek's System Companion Processor (SCP) via
  the remote processor framework.
diff --git a/drivers/remoteproc/mtk_common.h b/drivers/remoteproc/mtk_common.h
index 7504ae1bc0ef..19a907810271 100644
--- a/drivers/remoteproc/mtk_common.h
+++ b/drivers/remoteproc/mtk_common.h
@@ -54,6 +54,8 @@ struct mtk_scp {
void __iomem *cpu_addr;
phys_addr_t phys_addr;
size_t dram_size;
+
+   struct rproc_subdev *rpmsg_subdev;
 };
 
 /**
diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c
index bebecd470b8d..0c73aba6858d 100644
--- a/drivers/remoteproc/mtk_scp.c
+++ b/drivers/remoteproc/mtk_scp.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mtk_common.h"
 #include "remoteproc_internal.h"
@@ -513,6 +514,31 @@ static int scp_map_memory_region(struct mtk_scp *scp)
return 0;
 }
 
+static struct mtk_rpmsg_info mtk_scp_rpmsg_info = {
+   .send_ipi = scp_ipi_send,
+   .register_ipi = scp_ipi_register,
+   .unregister_ipi = scp_ipi_unregister,
+   .ns_ipi_id = SCP_IPI_NS_SERVICE,
+};
+
+static void scp_add_rpmsg_subdev(struct mtk_scp *scp)
+{
+   scp->rpmsg_subdev =
+   mtk_rpmsg_create_rproc_subdev(to_platform_device(scp->dev),
+ &mtk_scp_rpmsg_info);
+   if (scp->rpmsg_subdev)
+   rproc_add_subdev(scp->rproc, scp->rpmsg_subdev);
+}
+
+static void scp_remove_rpmsg_subdev(struct mtk_scp *scp)
+{
+   if (scp->rpmsg_subdev) {
+   rproc_remove_subdev(scp->rproc, scp->rpmsg_subdev);
+   mtk_rpmsg_destroy_rproc_subdev(scp->rpmsg_subdev);
+   scp->rpmsg_subdev = NULL;
+   }
+}
+
 static int scp_probe(struct platform_device *pdev)
 {
struct device *dev = &pdev->dev;
@@ -594,22 +620,25 @@ static int scp_probe(struct platform_device *pdev)
init_waitqueue_head(&scp->run.wq);
init_waitqueue_head(&scp->ack_wq);
 
+   scp_add_rpmsg_subdev(scp);
+
ret = devm_request_threaded_irq(dev, platform_get_irq(pdev, 0), NULL,
scp_irq_handler, IRQF_ONESHOT,
pdev->name, scp);
 
if (ret) {
dev_err(dev, "failed to request irq\n");
-   goto destroy_mutex;
+   goto remove_subdev;
}
 
ret = rproc_add(rproc);
if (ret)
-   goto destroy_mutex;
+   goto remove_subdev;
 
-   return ret;
+   return 0;
 
-destroy_mutex:
+remove_subdev:
+   scp_remove_rpmsg_subdev(scp);
mutex_destroy(&scp->lock);
 free_rproc:
rproc_free(rproc);
@@ -621,6 

[PATCH v10 6/7] mfd: cros_ec: differentiate SCP from EC by feature bit.

2019-06-02 Thread Pi-Hsun Shih
System Companion Processor (SCP) is Cortex M4 co-processor on some
MediaTek platform that can run EC-style firmware. Since a SCP and EC
would both exist on a system, and use the cros_ec_dev driver, we need to
differentiate between them for the userspace, or they would both be
registered at /dev/cros_ec, causing a conflict.

Signed-off-by: Pi-Hsun Shih 
Acked-by: Enric Balletbo i Serra 
---
Changes from v9:
 - Remove changes in cros_ec_commands.h (which is sync in
   https://lore.kernel.org/lkml/20190518063949.GY4319@dell/T/).

Changes from v8:
 - No change.

Changes from v7:
 - Address comments in v7.
 - Rebase the series onto https://lore.kernel.org/patchwork/patch/1059196/.

Changes from v6, v5, v4, v3, v2:
 - No change.

Changes from v1:
 - New patch extracted from Patch 5.
---
 drivers/mfd/cros_ec_dev.c   | 10 ++
 include/linux/mfd/cros_ec.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/drivers/mfd/cros_ec_dev.c b/drivers/mfd/cros_ec_dev.c
index a5391f96eafd..66107de3dbce 100644
--- a/drivers/mfd/cros_ec_dev.c
+++ b/drivers/mfd/cros_ec_dev.c
@@ -440,6 +440,16 @@ static int ec_device_probe(struct platform_device *pdev)
ec_platform->ec_name = CROS_EC_DEV_TP_NAME;
}
 
+   /* Check whether this is actually a SCP rather than an EC. */
+   if (cros_ec_check_features(ec, EC_FEATURE_SCP)) {
+   dev_info(dev, "CrOS SCP MCU detected.\n");
+   /*
+* Help userspace differentiating ECs from SCP,
+* regardless of the probing order.
+*/
+   ec_platform->ec_name = CROS_EC_DEV_SCP_NAME;
+   }
+
/*
 * Add the class device
 * Link to the character device for creating the /dev entry
diff --git a/include/linux/mfd/cros_ec.h b/include/linux/mfd/cros_ec.h
index cfa78bb4990f..751cb3756d49 100644
--- a/include/linux/mfd/cros_ec.h
+++ b/include/linux/mfd/cros_ec.h
@@ -27,6 +27,7 @@
 #define CROS_EC_DEV_PD_NAME "cros_pd"
 #define CROS_EC_DEV_TP_NAME "cros_tp"
 #define CROS_EC_DEV_ISH_NAME "cros_ish"
+#define CROS_EC_DEV_SCP_NAME "cros_scp"
 
 /*
  * The EC is unresponsive for a time after a reboot command.  Add a
-- 
2.22.0.rc1.257.g3120a18244-goog



[PATCH v10 7/7] arm64: dts: mt8183: add scp node

2019-06-02 Thread Pi-Hsun Shih
From: Eddie Huang 

Add scp node to mt8183 and mt8183-evb

Signed-off-by: Erin Lo 
Signed-off-by: Pi-Hsun Shih 
Signed-off-by: Eddie Huang 
---
Changes from v9:
 - Remove extra reserve-memory-vpu_share node.

Changes from v8:
 - New patch.
---
 arch/arm64/boot/dts/mediatek/mt8183-evb.dts | 11 +++
 arch/arm64/boot/dts/mediatek/mt8183.dtsi| 12 
 2 files changed, 23 insertions(+)

diff --git a/arch/arm64/boot/dts/mediatek/mt8183-evb.dts 
b/arch/arm64/boot/dts/mediatek/mt8183-evb.dts
index d8e555cbb5d3..e46e34ce3159 100644
--- a/arch/arm64/boot/dts/mediatek/mt8183-evb.dts
+++ b/arch/arm64/boot/dts/mediatek/mt8183-evb.dts
@@ -24,6 +24,17 @@
chosen {
stdout-path = "serial0:921600n8";
};
+
+   reserved-memory {
+   #address-cells = <2>;
+   #size-cells = <2>;
+   ranges;
+   scp_mem_reserved: scp_mem_region {
+   compatible = "shared-dma-pool";
+   reg = <0 0x5000 0 0x290>;
+   no-map;
+   };
+   };
 };
 
  {
diff --git a/arch/arm64/boot/dts/mediatek/mt8183.dtsi 
b/arch/arm64/boot/dts/mediatek/mt8183.dtsi
index c2749c4631bc..133146b52904 100644
--- a/arch/arm64/boot/dts/mediatek/mt8183.dtsi
+++ b/arch/arm64/boot/dts/mediatek/mt8183.dtsi
@@ -254,6 +254,18 @@
clock-names = "spi", "wrap";
};
 
+   scp: scp@1050 {
+   compatible = "mediatek,mt8183-scp";
+   reg = <0 0x1050 0 0x8>,
+ <0 0x105c 0 0x5000>;
+   reg-names = "sram", "cfg";
+   interrupts = ;
+   clocks = < CLK_INFRA_SCPSYS>;
+   clock-names = "main";
+   memory-region = <&scp_mem_reserved>;
+   status = "disabled";
+   };
+
auxadc: auxadc@11001000 {
compatible = "mediatek,mt8183-auxadc",
 "mediatek,mt8173-auxadc";
-- 
2.22.0.rc1.257.g3120a18244-goog



[PATCH v10 3/7] remoteproc: mt8183: add reserved memory manager API

2019-06-02 Thread Pi-Hsun Shih
From: Erin Lo 

Add memory table mapping API for other driver to lookup
reserved physical and virtual memory

Signed-off-by: Erin Lo 
Signed-off-by: Pi-Hsun Shih 
---
Changes from v9:
 - No change.

Changes from v8:
 - Add more reserved regions for camera ISP.

Changes from v7, v6, v5:
 - No change.

Changes from v4:
 - New patch.
---
 drivers/remoteproc/mtk_scp.c  | 135 ++
 include/linux/platform_data/mtk_scp.h |  24 +
 2 files changed, 159 insertions(+)

diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c
index c4d900f4fe1c..bebecd470b8d 100644
--- a/drivers/remoteproc/mtk_scp.c
+++ b/drivers/remoteproc/mtk_scp.c
@@ -348,6 +348,137 @@ void *scp_mapping_dm_addr(struct platform_device *pdev, 
u32 mem_addr)
 }
 EXPORT_SYMBOL_GPL(scp_mapping_dm_addr);
 
+#if SCP_RESERVED_MEM
+phys_addr_t scp_mem_base_phys;
+phys_addr_t scp_mem_base_virt;
+phys_addr_t scp_mem_size;
+
+static struct scp_reserve_mblock scp_reserve_mblock[] = {
+   {
+   .num = SCP_ISP_MEM_ID,
+   .start_phys = 0x0,
+   .start_virt = 0x0,
+   .size = 0x20, /*2MB*/
+   },
+   {
+   .num = SCP_ISP_MEM2_ID,
+   .start_phys = 0x0,
+   .start_virt = 0x0,
+   .size = 0x80, /*8MB*/
+   },
+   {
+   .num = SCP_DIP_MEM_ID,
+   .start_phys = 0x0,
+   .start_virt = 0x0,
+   .size = 0x90, /*9MB*/
+   },
+   {
+   .num = SCP_MDP_MEM_ID,
+   .start_phys = 0x0,
+   .start_virt = 0x0,
+   .size = 0x60, /*6MB*/
+   },
+   {
+   .num = SCP_FD_MEM_ID,
+   .start_phys = 0x0,
+   .start_virt = 0x0,
+   .size = 0x10, /*1MB*/
+   },
+};
+
+static int scp_reserve_mem_init(struct mtk_scp *scp)
+{
+   enum scp_reserve_mem_id_t id;
+   phys_addr_t accumlate_memory_size = 0;
+
+   scp_mem_base_phys = (phys_addr_t) (scp->phys_addr + MAX_CODE_SIZE);
+   scp_mem_size = (phys_addr_t) (scp->dram_size - MAX_CODE_SIZE);
+
+   dev_info(scp->dev,
+"phys:0x%llx - 0x%llx (0x%llx)\n",
+scp_mem_base_phys,
+scp_mem_base_phys + scp_mem_size,
+scp_mem_size);
+   accumlate_memory_size = 0;
+   for (id = 0; id < SCP_NUMS_MEM_ID; id++) {
+   scp_reserve_mblock[id].start_phys =
+   scp_mem_base_phys + accumlate_memory_size;
+   accumlate_memory_size += scp_reserve_mblock[id].size;
+   dev_info(scp->dev,
+"[reserve_mem:%d]: phys:0x%llx - 0x%llx (0x%llx)\n",
+id, scp_reserve_mblock[id].start_phys,
+scp_reserve_mblock[id].start_phys +
+scp_reserve_mblock[id].size,
+scp_reserve_mblock[id].size);
+   }
+   return 0;
+}
+
+static int scp_reserve_memory_ioremap(struct mtk_scp *scp)
+{
+   enum scp_reserve_mem_id_t id;
+   phys_addr_t accumlate_memory_size = 0;
+
+   scp_mem_base_virt = (phys_addr_t)(size_t)ioremap_wc(scp_mem_base_phys,
+   scp_mem_size);
+
+   dev_info(scp->dev,
+"virt:0x%llx - 0x%llx (0x%llx)\n",
+   (phys_addr_t)scp_mem_base_virt,
+   (phys_addr_t)scp_mem_base_virt + (phys_addr_t)scp_mem_size,
+   scp_mem_size);
+   for (id = 0; id < SCP_NUMS_MEM_ID; id++) {
+   scp_reserve_mblock[id].start_virt =
+   scp_mem_base_virt + accumlate_memory_size;
+   accumlate_memory_size += scp_reserve_mblock[id].size;
+   }
+   /* the reserved memory should be larger than expected memory
+* or scp_reserve_mblock does not match dts
+*/
+   WARN_ON(accumlate_memory_size > scp_mem_size);
+#ifdef DEBUG
+   for (id = 0; id < NUMS_MEM_ID; id++) {
+   dev_info(scp->dev,
+"[mem_reserve-%d] 
phys:0x%llx,virt:0x%llx,size:0x%llx\n",
+id,
+scp_get_reserve_mem_phys(id),
+scp_get_reserve_mem_virt(id),
+scp_get_reserve_mem_size(id));
+   }
+#endif
+   return 0;
+}
+phys_addr_t scp_get_reserve_mem_phys(enum scp_reserve_mem_id_t id)
+{
+   if (id >= SCP_NUMS_MEM_ID) {
+   pr_err("[SCP] no reserve memory for %d", id);
+   return 0;
+   } else
+   return scp_reserve_mblock[id].start_phys;
+}
+EXPORT_SYMBOL_GPL(scp_get_reserve_mem_phys);
+
+phys_addr_t scp_get_reserve_mem_virt(enum scp_reserve_mem_id_t id)
+{
+   if (id >= SCP_NUMS_MEM_ID) {
+   pr_err("[SCP] no reserve memory for %d", id);
+   return 0;
+   } else
+   return scp_reserve_mblock[id].start_virt;
+}

[PATCH v10 5/7] dt-bindings: Add binding for cros-ec-rpmsg.

2019-06-02 Thread Pi-Hsun Shih
Add a DT binding documentation for ChromeOS EC driver over rpmsg.

Signed-off-by: Pi-Hsun Shih 
Acked-by: Rob Herring 
---
Changes from v9, v8, v7, v6:
 - No change.

Changes from v5:
 - New patch.
---
 Documentation/devicetree/bindings/mfd/cros-ec.txt | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/mfd/cros-ec.txt 
b/Documentation/devicetree/bindings/mfd/cros-ec.txt
index 6245c9b1a68b..4860eabd0f72 100644
--- a/Documentation/devicetree/bindings/mfd/cros-ec.txt
+++ b/Documentation/devicetree/bindings/mfd/cros-ec.txt
@@ -3,7 +3,7 @@ ChromeOS Embedded Controller
 Google's ChromeOS EC is a Cortex-M device which talks to the AP and
 implements various function such as keyboard and battery charging.
 
-The EC can be connect through various means (I2C, SPI, LPC) and the
+The EC can be connect through various means (I2C, SPI, LPC, RPMSG) and the
 compatible string used depends on the interface. Each connection method has
 its own driver which connects to the top level interface-agnostic EC driver.
 Other Linux driver (such as cros-ec-keyb for the matrix keyboard) connect to
@@ -17,6 +17,9 @@ Required properties (SPI):
 - compatible: "google,cros-ec-spi"
 - reg: SPI chip select
 
+Required properties (RPMSG):
+- compatible: "google,cros-ec-rpmsg"
+
 Optional properties (SPI):
 - google,cros-ec-spi-pre-delay: Some implementations of the EC need a little
   time to wake up from sleep before they can receive SPI transfers at a high
-- 
2.22.0.rc1.257.g3120a18244-goog



[PATCH v10 2/7] remoteproc/mediatek: add SCP support for mt8183

2019-06-02 Thread Pi-Hsun Shih
From: Erin Lo 

Provide a basic driver to control Cortex M4 co-processor

Signed-off-by: Erin Lo 
Signed-off-by: Nicolas Boichat 
Signed-off-by: Pi-Hsun Shih 
---
Changes from v9:
 - No change.

Changes from v8:
 - Add a missing space.

Changes from v7:
 - Moved the location of shared SCP buffer.
 - Fix clock enable/disable sequence.
 - Add more IPI ID that would be used.

Changes from v6:
 - No change.

Changes from v5:
 - Changed some space to tab.

Changes from v4:
 - Rename most function from mtk_scp_* to scp_*.
 - Change the irq to threaded handler.
 - Load ELF file instead of plain binary file as firmware by default
   (Squashed patch 6 in v4 into this patch).

Changes from v3:
 - Fix some issue found by checkpatch.
 - Make writes aligned in scp_ipi_send.

Changes from v2:
 - Squash patch 3 from v2 (separate the ipi interface) into this patch.
 - Remove unused name argument from scp_ipi_register.
 - Add scp_ipi_unregister for proper cleanup.
 - Move IPI ids in sync with firmware.
 - Add mb() in proper place, and correctly clear the run->signaled.

Changes from v1:
 - Extract functions and rename variables in mtk_scp.c.
---
 drivers/remoteproc/Kconfig|   9 +
 drivers/remoteproc/Makefile   |   1 +
 drivers/remoteproc/mtk_common.h   |  75 
 drivers/remoteproc/mtk_scp.c  | 513 ++
 drivers/remoteproc/mtk_scp_ipi.c  | 162 
 include/linux/platform_data/mtk_scp.h | 141 +++
 6 files changed, 901 insertions(+)
 create mode 100644 drivers/remoteproc/mtk_common.h
 create mode 100644 drivers/remoteproc/mtk_scp.c
 create mode 100644 drivers/remoteproc/mtk_scp_ipi.c
 create mode 100644 include/linux/platform_data/mtk_scp.h

diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
index 18be41b8aa7e..ad3a0de04d9e 100644
--- a/drivers/remoteproc/Kconfig
+++ b/drivers/remoteproc/Kconfig
@@ -23,6 +23,15 @@ config IMX_REMOTEPROC
 
  It's safe to say N here.
 
+config MTK_SCP
+   tristate "Mediatek SCP support"
+   depends on ARCH_MEDIATEK
+   help
+ Say y here to support Mediatek's System Companion Processor (SCP) via
+ the remote processor framework.
+
+ It's safe to say N here.
+
 config OMAP_REMOTEPROC
tristate "OMAP remoteproc support"
depends on ARCH_OMAP4 || SOC_OMAP5
diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
index ce5d061e92be..16b3e5e7a81c 100644
--- a/drivers/remoteproc/Makefile
+++ b/drivers/remoteproc/Makefile
@@ -10,6 +10,7 @@ remoteproc-y  += remoteproc_sysfs.o
 remoteproc-y   += remoteproc_virtio.o
 remoteproc-y   += remoteproc_elf_loader.o
 obj-$(CONFIG_IMX_REMOTEPROC)   += imx_rproc.o
+obj-$(CONFIG_MTK_SCP)  += mtk_scp.o mtk_scp_ipi.o
 obj-$(CONFIG_OMAP_REMOTEPROC)  += omap_remoteproc.o
 obj-$(CONFIG_WKUP_M3_RPROC)+= wkup_m3_rproc.o
 obj-$(CONFIG_DA8XX_REMOTEPROC) += da8xx_remoteproc.o
diff --git a/drivers/remoteproc/mtk_common.h b/drivers/remoteproc/mtk_common.h
new file mode 100644
index ..7504ae1bc0ef
--- /dev/null
+++ b/drivers/remoteproc/mtk_common.h
@@ -0,0 +1,75 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2018 MediaTek Inc.
+ */
+
+#ifndef __RPROC_MTK_COMMON_H
+#define __RPROC_MTK_COMMON_H
+
+#include 
+#include 
+#include 
+#include 
+
+#define MT8183_SW_RSTN 0x0
+#define MT8183_SW_RSTN_BIT BIT(0)
+#define MT8183_SCP_TO_HOST 0x1C
+#define MT8183_SCP_IPC_INT_BIT BIT(0)
+#define MT8183_SCP_WDT_INT_BIT BIT(8)
+#define MT8183_HOST_TO_SCP 0x28
+#define MT8183_HOST_IPC_INT_BITBIT(0)
+#define MT8183_SCP_SRAM_PDN0x402C
+
+#define SCP_FW_VER_LEN 32
+
+struct scp_run {
+   u32 signaled;
+   s8 fw_ver[SCP_FW_VER_LEN];
+   u32 dec_capability;
+   u32 enc_capability;
+   wait_queue_head_t wq;
+};
+
+struct scp_ipi_desc {
+   scp_ipi_handler_t handler;
+   void *priv;
+};
+
+struct mtk_scp {
+   struct device *dev;
+   struct rproc *rproc;
+   struct clk *clk;
+   void __iomem *reg_base;
+   void __iomem *sram_base;
+   size_t sram_size;
+
+   struct share_obj *recv_buf;
+   struct share_obj *send_buf;
+   struct scp_run run;
+   struct mutex lock; /* for protecting mtk_scp data structure */
+   struct scp_ipi_desc ipi_desc[SCP_IPI_MAX];
+   bool ipi_id_ack[SCP_IPI_MAX];
+   wait_queue_head_t ack_wq;
+
+   void __iomem *cpu_addr;
+   phys_addr_t phys_addr;
+   size_t dram_size;
+};
+
+/**
+ * struct share_obj - SRAM buffer shared with
+ *   AP and SCP
+ *
+ * @id:		IPI id
+ * @len:   share buffer length
+ * @share_buf: share buffer data
+ */
+struct share_obj {
+   s32 id;
+   u32 len;
+   u8 share_buf[288];
+};
+
+void scp_memcpy_aligned(void 

[PATCH v10 1/7] dt-bindings: Add a binding for Mediatek SCP

2019-06-02 Thread Pi-Hsun Shih
From: Erin Lo 

Add a DT binding documentation of SCP for the
MT8183 SoC from Mediatek.

Signed-off-by: Erin Lo 
Signed-off-by: Pi-Hsun Shih 
Reviewed-by: Rob Herring 
---
Changes from v9, v8, v7, v6:
 - No change.

Changes from v5:
 - Remove dependency on CONFIG_RPMSG_MTK_SCP.

Changes from v4:
 - Add detail of more properties.
 - Document the usage of mtk,rpmsg-name in subnode from the new design.

Changes from v3:
 - No change.

Changes from v2:
 - No change. I realized that for this patch series, there's no need to
   add anything under the mt8183-scp node (neither the mt8183-rpmsg or
   the cros-ec-rpmsg) for them to work, since mt8183-rpmsg is added
   directly as a rproc_subdev by code, and cros-ec-rpmsg is dynamically
   created by SCP name service.

Changes from v1:
 - No change.
---
 .../bindings/remoteproc/mtk,scp.txt   | 36 +++
 1 file changed, 36 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/remoteproc/mtk,scp.txt

diff --git a/Documentation/devicetree/bindings/remoteproc/mtk,scp.txt 
b/Documentation/devicetree/bindings/remoteproc/mtk,scp.txt
new file mode 100644
index ..3ba668bab14b
--- /dev/null
+++ b/Documentation/devicetree/bindings/remoteproc/mtk,scp.txt
@@ -0,0 +1,36 @@
+Mediatek SCP Bindings
+
+
+This binding provides support for ARM Cortex M4 Co-processor found on some
+Mediatek SoCs.
+
+Required properties:
+- compatible   Should be "mediatek,mt8183-scp"
+- reg  Should contain the address ranges for the two memory
+   regions, SRAM and CFG.
+- reg-namesContains the corresponding names for the two memory
+   regions. These should be named "sram" & "cfg".
+- clocks   Clock for co-processor (See: 
../clock/clock-bindings.txt)
+- clock-names  Contains the corresponding name for the clock. This
+   should be named "main".
+
+Subnodes
+
+
+Subnodes of the SCP represent rpmsg devices. The names of the devices are not
+important. The properties of these nodes are defined by the individual bindings
+for the rpmsg devices - but must contain the following property:
+
+- mtk,rpmsg-name   Contains the name for the rpmsg device. Used to match
+   the subnode to rpmsg device announced by SCP.
+
+Example:
+
+   scp: scp@1050 {
+   compatible = "mediatek,mt8183-scp";
+   reg = <0 0x1050 0 0x8>,
+ <0 0x105c 0 0x5000>;
+   reg-names = "sram", "cfg";
+   clocks = < CLK_INFRA_SCPSYS>;
+   clock-names = "main";
+   };
-- 
2.22.0.rc1.257.g3120a18244-goog



[PATCH v10 0/7] Add support for mt8183 SCP.

2019-06-02 Thread Pi-Hsun Shih
Add support for controlling and communicating with mt8183's system
control processor (SCP), using the remoteproc & rpmsg framework.
And also add a cros_ec driver for CrOS EC host command over rpmsg.

The overall structure of the series is:
* remoteproc/mtk_scp.c: Control the start / stop of SCP (Patch 2, 3).
* remoteproc/mtk_scp_ipi.c: Communicates to SCP using inter-processor
  interrupt (IPI) and shared memory (Patch 2, 3); a usage sketch
  follows this list.
* rpmsg/mtk_rpmsg.c: Wrapper to wrap the IPI communication into a rpmsg
  device. Supports name service for SCP firmware to
  announce channels (Patch 4).
* platform/chrome/cros_ec_rpmsg.c: Communicates with the SCP over the
  rpmsg framework (like what platform/chrome/cros_ec_{i2c,spi}.c does)
  (Patch 5, 6).
* add scp dts node to mt8183 platform (Patch 7).
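
A rough sketch of how a client drives the IPI layer (function and ID
names paraphrased from the patches; treat the signatures as
approximate):

	static void my_ipi_handler(void *data, unsigned int len, void *priv)
	{
		/* runs from the SCP interrupt path with the message payload */
	}

	/* register a handler for one IPI id, then send a request */
	scp_ipi_register(scp_pdev, SCP_IPI_NS_SERVICE, my_ipi_handler, NULL);
	scp_ipi_send(scp_pdev, SCP_IPI_NS_SERVICE, buf, len, /*wait=*/0);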

This series (In particular, Patch 7) is based on
https://patchwork.kernel.org/cover/10962385/.

Changes from v9:
 - Remove reserve-memory-vpu_share node.
 - Remove change to cros_ec_commands.h (That is already in
   https://lore.kernel.org/lkml/20190518063949.GY4319@dell/T/)

Changes from v8:
 - Rebased onto https://patchwork.kernel.org/cover/10962385/.
 - Drop merged cros_ec_rpmsg patch, and add scp dts node patch.
 - Add more reserved memory region.

Changes from v7:
 - Rebase onto https://lore.kernel.org/patchwork/patch/1059196/.
 - Fix clock enable/disable timing for SCP driver.
 - Add more SCP IPI ID.

Changes from v6:
 - Decouple mtk_rpmsg from mtk_scp.
 - Change data of EC response to be aligned to 4 bytes.

Changes from v5:
 - Add device tree binding document for cros_ec_rpmsg.
 - Better document in comments for cros_ec_rpmsg.
 - Remove dependency on CONFIG_RPMSG_MTK_SCP in the device tree binding document.

Changes from v4:
 - Merge patch 6 (Load ELF firmware) into patch 2, so the driver loads
   ELF firmware by default, and no longer accept plain binary.
 - rpmsg_device listed in device tree (as a child of the SCP node) would
   have it's device tree node mapped to the rpmsg_device, so the rpmsg
   driver can use the properties on device tree.

Changes from v3:
 - Make writing to SCP SRAM aligned.
 - Add a new patch (Patch 6) to load ELF instead of bin firmware.
 - Add host event support for EC driver.
 - Fix some bugs found in testing (missing spin_lock_init,
   rproc_subdev_unprepare to rproc_subdev_stop).
 - Fix some coding style issue found by checkpatch.pl.

Changes from v2:
 - Fold patch 3 into patch 2 in v2.
 - Move IPI id around to support cross-testing for old and new firmware.
 - Finish more TODO items.

Changes from v1:
 - Extract functions and rename variables in mtk_scp.c.
 - Do cleanup properly in mtk_rpmsg.c, which also removes the problem of
   short-lived work items.
 - Code format fix based on feedback for cros_ec_rpmsg.c.
 - Extract feature detection for SCP into separate patch (Patch 6).

Eddie Huang (1):
  arm64: dts: mt8183: add scp node

Erin Lo (3):
  dt-bindings: Add a binding for Mediatek SCP
  remoteproc/mediatek: add SCP support for mt8183
  remoteproc: mt8183: add reserved memory manager API

Pi-Hsun Shih (3):
  rpmsg: add rpmsg support for mt8183 SCP.
  dt-bindings: Add binding for cros-ec-rpmsg.
  mfd: cros_ec: differentiate SCP from EC by feature bit.

 .../devicetree/bindings/mfd/cros-ec.txt   |   5 +-
 .../bindings/remoteproc/mtk,scp.txt   |  36 +
 arch/arm64/boot/dts/mediatek/mt8183-evb.dts   |  11 +
 arch/arm64/boot/dts/mediatek/mt8183.dtsi  |  12 +
 drivers/mfd/cros_ec_dev.c |  10 +
 drivers/remoteproc/Kconfig|  10 +
 drivers/remoteproc/Makefile   |   1 +
 drivers/remoteproc/mtk_common.h   |  77 ++
 drivers/remoteproc/mtk_scp.c  | 678 ++
 drivers/remoteproc/mtk_scp_ipi.c  | 163 +
 drivers/rpmsg/Kconfig |   9 +
 drivers/rpmsg/Makefile|   1 +
 drivers/rpmsg/mtk_rpmsg.c | 396 ++
 include/linux/mfd/cros_ec.h   |   1 +
 include/linux/platform_data/mtk_scp.h | 167 +
 include/linux/rpmsg/mtk_rpmsg.h   |  30 +
 16 files changed, 1606 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/devicetree/bindings/remoteproc/mtk,scp.txt
 create mode 100644 drivers/remoteproc/mtk_common.h
 create mode 100644 drivers/remoteproc/mtk_scp.c
 create mode 100644 drivers/remoteproc/mtk_scp_ipi.c
 create mode 100644 drivers/rpmsg/mtk_rpmsg.c
 create mode 100644 include/linux/platform_data/mtk_scp.h
 create mode 100644 include/linux/rpmsg/mtk_rpmsg.h

-- 
2.22.0.rc1.257.g3120a18244-goog



Re: [PATCH] scsi: ibmvscsi: Don't use rc uninitialized in ibmvscsi_do_work

2019-06-02 Thread Nathan Chancellor
Hi Michael,

On Sun, Jun 02, 2019 at 08:15:38PM +1000, Michael Ellerman wrote:
> Hi Nathan,
> 
> It's always preferable IMHO to keep any initialisation as localised as
> possible, so that the compiler can continue to warn about uninitialised
> usages elsewhere. In this case that would mean doing the rc = 0 in the
> switch, something like:

I am certainly okay with implementing this in a v2. I mulled over which
would be preferred, I suppose I guessed wrong :) Thank you for the
review and input.

> 
> diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c 
> b/drivers/scsi/ibmvscsi/ibmvscsi.c
> index 727c31dc11a0..7ee5755cf636 100644
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -2123,9 +2123,6 @@ static void ibmvscsi_do_work(struct ibmvscsi_host_data 
> *hostdata)
>  
> spin_lock_irqsave(hostdata->host->host_lock, flags);
> switch (hostdata->action) {
> -   case IBMVSCSI_HOST_ACTION_NONE:
> -   case IBMVSCSI_HOST_ACTION_UNBLOCK:
> -   break;
> case IBMVSCSI_HOST_ACTION_RESET:
> spin_unlock_irqrestore(hostdata->host->host_lock, flags);
> rc = ibmvscsi_reset_crq_queue(&hostdata->queue, hostdata);
> @@ -2142,7 +2139,10 @@ static void ibmvscsi_do_work(struct ibmvscsi_host_data 
> *hostdata)
> if (!rc)
> rc = ibmvscsi_send_crq(hostdata, 
> 0xC001LL, 0);
> break;
> +   case IBMVSCSI_HOST_ACTION_NONE:
> +   case IBMVSCSI_HOST_ACTION_UNBLOCK:
> default:
> +   rc = 0;
> break;
> }
> 
> 
> But then that makes me wonder if that's actually correct?
> 
> If we get an action that we don't recognise should we just throw it away
> like that? (by doing hostdata->action = IBMVSCSI_HOST_ACTION_NONE). Tyrel?

However, because of this, I will hold off on v2 until Tyrel can give
some feedback.

Thanks,
Nathan


[PATCH] ipvlan: Don't propagate IFF_ALLMULTI changes on down interfaces.

2019-06-02 Thread Young Xiao
Clearing the IFF_ALLMULTI flag on a down interface could cause an allmulti
overflow on the underlying interface.

Attempting to set IFF_ALLMULTI on the underlying interface would cause an
error and the log message:

"allmulti touches root, set allmulti failed."

Signed-off-by: Young Xiao <92siuy...@gmail.com>
---
 drivers/net/ipvlan/ipvlan_main.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index bbeb162..523bb83 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -242,8 +242,10 @@ static void ipvlan_change_rx_flags(struct net_device *dev, 
int change)
struct ipvl_dev *ipvlan = netdev_priv(dev);
struct net_device *phy_dev = ipvlan->phy_dev;
 
-   if (change & IFF_ALLMULTI)
-   dev_set_allmulti(phy_dev, dev->flags & IFF_ALLMULTI? 1 : -1);
+   if (dev->flags & IFF_UP) {
+   if (change & IFF_ALLMULTI)
+   dev_set_allmulti(phy_dev, dev->flags & IFF_ALLMULTI ? 1 : -1);
+   }
 }
 
 static void ipvlan_set_multicast_mac_filter(struct net_device *dev)
-- 
2.7.4
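
For context, a toy model of the refcount imbalance the patch avoids
(illustrative user-space C, not the ipvlan code; the log string is the one
quoted in the commit message above):

	#include <stdio.h>

	/* Toy model of dev_set_allmulti() on the lower device: the count
	 * must stay balanced and must never go negative. */
	static int phy_allmulti;

	static int dev_set_allmulti_model(int inc)
	{
		phy_allmulti += inc;
		if (phy_allmulti < 0) {
			phy_allmulti -= inc;	/* roll back */
			printf("allmulti touches root, set allmulti failed.\n");
			return -1;
		}
		return 0;
	}

	int main(void)
	{
		/* The ipvlan device is down, so the earlier +1 was never
		 * propagated; without the fix, clearing IFF_ALLMULTI still
		 * propagates a -1, underflowing the lower device's count: */
		return dev_set_allmulti_model(-1);
	}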



Re: rcu_read_lock lost its compiler barrier

2019-06-02 Thread Herbert Xu
On Sun, Jun 02, 2019 at 05:06:17PM -0700, Paul E. McKenney wrote:
>
> Please note that preemptible Tree RCU has lacked the compiler barrier on
> all but the outermost rcu_read_unlock() for years before Boqun's patch.

Actually this is not true.  Boqun's patch (commit bb73c52bad36) does
not add a barrier() to __rcu_read_lock.  In fact I dug into the git
history and this compiler barrier() has existed in preemptible tree
RCU since the very start in 2009:

: commit f41d911f8c49a5d65c86504c19e8204bb605c4fd
: Author: Paul E. McKenney 
: Date:   Sat Aug 22 13:56:52 2009 -0700
:
: rcu: Merge preemptable-RCU functionality into hierarchical RCU
:
: +/*
: + * Tree-preemptable RCU implementation for rcu_read_lock().
: + * Just increment ->rcu_read_lock_nesting, shared state will be updated
: + * if we block.
: + */
: +void __rcu_read_lock(void)
: +{
: +   ACCESS_ONCE(current->rcu_read_lock_nesting)++;
: +   barrier();  /* needed if we ever invoke rcu_read_lock in rcutree.c */
: +}
: +EXPORT_SYMBOL_GPL(__rcu_read_lock);

However, you are correct that in the non-preempt tree RCU case,
the compiler barrier in __rcu_read_lock was not always present.
In fact it was added by:

: commit 386afc91144b36b42117b0092893f15bc8798a80
: Author: Linus Torvalds 
: Date:   Tue Apr 9 10:48:33 2013 -0700
:
: spinlocks and preemption points need to be at least compiler barriers

I suspect this is what prompted you to remove it in 2015.

> I do not believe that reverting that patch will help you at all.
> 
> But who knows?  So please point me at the full code body that was being
> debated earlier on this thread.  It will no doubt take me quite a while to
> dig through it, given my being on the road for the next couple of weeks,
> but so it goes.

Please refer to my response to Linus for the code in question.

In any case, I am now even more certain that compiler barriers are
not needed in the code in question.  The reasoning is quite simple.
If you need those compiler barriers then you surely need real memory
barriers.

Vice versa, if real memory barriers are already present thanks to
RCU, then you don't need those compiler barriers.

In fact this calls into question the use of READ_ONCE/WRITE_ONCE in
RCU primitives such as rcu_dereference and rcu_assign_pointer.  IIRC
when RCU was first added to the Linux kernel we did not have compiler
barriers in rcu_dereference and rcu_assign_pointer.  They were added
later on.

As compiler barriers per se are useless, these are surely meant to
be coupled with the memory barriers provided by RCU grace periods
and synchronize_rcu.  But then those real memory barriers would have
compiler barriers too.  So why do we need the compiler barriers in
rcu_dereference and rcu_assign_pointer?
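
(For reference, a much-simplified sketch of what the two primitives reduce
to today, ignoring the sparse/lockdep checking and the historical
smp_read_barrier_depends() in include/linux/rcupdate.h; these are not the
exact kernel definitions:

	#define my_rcu_assign_pointer(p, v)	smp_store_release(&(p), (v))
	#define my_rcu_dereference(p)		READ_ONCE(p)

That is, a release store and a volatile load; neither is a full barrier()
on its own.)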

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [PATCH] arm64: dts: imx8mm: Fix build warnings

2019-06-02 Thread Anson Huang
Hi, Fabio

> -Original Message-
> From: Fabio Estevam 
> Sent: Monday, June 3, 2019 10:49 AM
> To: Anson Huang 
> Cc: Rob Herring ; Mark Rutland
> ; Shawn Guo ; Sascha
> Hauer ; Sascha Hauer ;
> Leonard Crestez ; Aisheng Dong
> ; viresh kumar ; Jacky
> Bai ; open list:OPEN FIRMWARE AND FLATTENED
> DEVICE TREE BINDINGS ; moderated
> list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE  ker...@lists.infradead.org>; linux-kernel ; dl-
> linux-imx 
> Subject: Re: [PATCH] arm64: dts: imx8mm: Fix build warnings
> 
> Hi Anson,
> 
> On Sun, Jun 2, 2019 at 9:46 PM  wrote:
> >
> > From: Anson Huang 
> >
> > This patch fixes below build warning with "W=1":
> 
> I have already sent patches to fix these warnings.

OK, thanks, then please ignore this patch.

Anson.


Re: [PATCH 3/3] ACPI / device_sysfs: Add eject show attr to monitor eject status

2019-06-02 Thread Chester Lin
On Fri, May 31, 2019 at 06:38:59AM -0700, Greg KH wrote:
> On Fri, May 31, 2019 at 02:56:42PM +0800, Chester Lin wrote:
> > Add an acpi_eject_show attribute so that users can monitor the current
> > eject status: an ejection can sometimes take time to finish, so we need
> > to know whether it is still in progress or not.
> > 
> > Signed-off-by: Chester Lin 
> > ---
> >  drivers/acpi/device_sysfs.c | 20 +++-
> >  drivers/acpi/internal.h |  1 +
> >  drivers/acpi/scan.c | 27 +++
> >  3 files changed, 47 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/acpi/device_sysfs.c b/drivers/acpi/device_sysfs.c
> > index 78c2653bf020..70b22eec6bbc 100644
> > --- a/drivers/acpi/device_sysfs.c
> > +++ b/drivers/acpi/device_sysfs.c
> > @@ -403,7 +403,25 @@ acpi_eject_store(struct device *d, struct 
> > device_attribute *attr,
> > return status == AE_NO_MEMORY ? -ENOMEM : -EAGAIN;
> >  }
> >  
> > -static DEVICE_ATTR(eject, 0200, NULL, acpi_eject_store);
> > +static ssize_t acpi_eject_show(struct device *d,
> > +   struct device_attribute *attr, char *buf)
> > +{
> > +   struct acpi_device *acpi_device = to_acpi_device(d);
> > +   acpi_object_type not_used;
> > +   acpi_status status;
> > +
> > +   if ((!acpi_device->handler || !acpi_device->handler->hotplug.enabled)
> > +   && !acpi_device->driver)
> > +   return -ENODEV;
> > +
> > +   status = acpi_get_type(acpi_device->handle, &not_used);
> > +   if (ACPI_FAILURE(status) || !acpi_device->flags.ejectable)
> > +   return -ENODEV;
> > +
> > +   return sprintf(buf, "%s\n", acpi_eject_status_string(acpi_device));
> > +}
> > +
> > +static DEVICE_ATTR(eject, 0644, acpi_eject_show, acpi_eject_store);
> 
> DEVICE_ATTR_RW()?
> 
> And you need to document the new sysfs file in Documentation/ABI/
> 
> thanks,
> 
> greg k-h
> 

Hi Greg,

Thank you for the reminder and I will fix these two in v2.

Regards,
Chester
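
For reference, a minimal sketch of the DEVICE_ATTR_RW() form Greg is
suggesting: the macro derives the 0644 permissions and the handler names
from the attribute name, so the functions get renamed (illustrative only,
bodies elided):

	static ssize_t eject_show(struct device *d,
				  struct device_attribute *attr, char *buf)
	{
		/* ... same body as acpi_eject_show() above ... */
		return 0;
	}

	static ssize_t eject_store(struct device *d,
				   struct device_attribute *attr,
				   const char *buf, size_t count)
	{
		/* ... same body as acpi_eject_store() above ... */
		return count;
	}

	static DEVICE_ATTR_RW(eject);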


Re: [PATCH] arm64: dts: imx8mm: Fix build warnings

2019-06-02 Thread Fabio Estevam
Hi Anson,

On Sun, Jun 2, 2019 at 9:46 PM  wrote:
>
> From: Anson Huang 
>
> This patch fixes below build warning with "W=1":

I have already sent patches to fix these warnings.


Re: rcu_read_lock lost its compiler barrier

2019-06-02 Thread Herbert Xu
On Sun, Jun 02, 2019 at 01:54:12PM -0700, Linus Torvalds wrote:
> On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu  
> wrote:
> >
> > You can't then go and decide to remove the compiler barrier!  To do
> > that you'd need to audit every single use of rcu_read_lock in the
> > kernel to ensure that they're not depending on the compiler barrier.
> 
> What's the possible case where it would matter when there is no preemption?

The case we were discussing is from net/ipv4/inet_fragment.c from
the net-next tree:

void fqdir_exit(struct fqdir *fqdir)
{
...
fqdir->dead = true;

/* call_rcu is supposed to provide memory barrier semantics,
 * separating the setting of fqdir->dead with the destruction
 * work.  This implicit barrier is paired with inet_frag_kill().
 */

INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn);
queue_rcu_work(system_wq, &fqdir->destroy_rwork);
}

and

void inet_frag_kill(struct inet_frag_queue *fq)
{
...
rcu_read_lock();
/* The RCU read lock provides a memory barrier
 * guaranteeing that if fqdir->dead is false then
 * the hash table destruction will not start until
 * after we unlock.  Paired with inet_frags_exit_net().
 */
if (!fqdir->dead) {
rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
   fqdir->f->rhash_params);
...
}
...
rcu_read_unlock();
...
}

I simplified this to

Initial values:

a = 0
b = 0

CPU1                    CPU2

a = 1                   rcu_read_lock
synchronize_rcu         if (a == 0)
b = 2                           b = 1
                        rcu_read_unlock

On exit we want this to be true:
b == 2

Now what Paul was telling me is that unless every memory operation
is done with READ_ONCE/WRITE_ONCE then his memory model shows that
the exit constraint won't hold.  IOW, we need

CPU1                    CPU2

WRITE_ONCE(a, 1)        rcu_read_lock
synchronize_rcu         if (READ_ONCE(a) == 0)
WRITE_ONCE(b, 2)                WRITE_ONCE(b, 1)
                        rcu_read_unlock

Now I think this is bullshit because if we really needed these compiler
barriers then we surely would need real memory barriers to go with
them.

In fact, the sole purpose of the RCU mechanism is to provide those
memory barriers.  Quoting from
Documentation/RCU/Design/Requirements/Requirements.html:

Each CPU that has an RCU read-side critical section that
begins before synchronize_rcu() starts is
guaranteed to execute a full memory barrier between the time
that the RCU read-side critical section ends and the time that
synchronize_rcu() returns.
Without this guarantee, a pre-existing RCU read-side critical section
might hold a reference to the newly removed struct foo
after the kfree() on line 14 of
remove_gp_synchronous().
Each CPU that has an RCU read-side critical section that ends
after synchronize_rcu() returns is guaranteed
to execute a full memory barrier between the time that
synchronize_rcu() begins and the time that the RCU
read-side critical section begins.
Without this guarantee, a later RCU read-side critical section
running after the kfree() on line 14 of
remove_gp_synchronous() might
later run do_something_gp() and find the
newly deleted struct foo.

My review of the RCU code shows that these memory barriers are
indeed present (at least when we're not in tiny mode where all
this discussion would be moot anyway).  For example, in call_rcu
we eventually get down to rcu_segcblist_enqueue which has an smp_mb.
On the reader side (correct me if I'm wrong Paul) the memory
barrier is implicitly coming from the scheduler.

My point is that within our kernel whenever we have a CPU memory
barrier we always have a compiler barrier too.  Therefore my code
example above does not need any extra compiler barriers such as
the ones provided by READ_ONCE/WRITE_ONCE.
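
(Concretely, and as a simplified illustration rather than a quote of the
current headers: the compiler barrier is just an empty asm with a "memory"
clobber, and any real CPU barrier carries the same clobber, e.g. on x86:

	#define barrier()	__asm__ __volatile__("" : : : "memory")
	#define mb()		__asm__ __volatile__("mfence" : : : "memory")

so emitting the CPU barrier necessarily constrains the compiler as well.)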

I think perhaps Paul was perhaps thinking that I'm expecting
rcu_read_lock/rcu_read_unlock themselves to provide the memory
or compiler barriers.  That would indeed be wrong but this is
not what I need.  All I need is the RCU semantics as documented
for there to be memory and compiler barriers around the whole
grace period.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[cgroup] c03cd7738a: BUG:KASAN:slab-out-of-bounds_in_c

2019-06-02 Thread kernel test robot
FYI, we noticed the following commit (built with gcc-7):

commit: c03cd7738a83b13739f00546166969342c8ff014 ("cgroup: Include dying 
leaders with live threads in PROCS iterations")
https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git for-next

in testcase: trinity
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 2G

caused below changes (please refer to attached dmesg/kmsg for entire 
log/backtrace):


+----------------------------------------------------------+------------+------------+
|                                                          | b636fd38dc | c03cd7738a |
+----------------------------------------------------------+------------+------------+
| boot_successes                                           | 18         | 5          |
| boot_failures                                            | 1          | 9          |
| BUG:kernel_hang_in_boot-around-mounting-root_stage       | 1          |            |
| BUG:KASAN:slab-out-of-bounds_in_c                        | 0          | 7          |
| WARNING:at_lib/refcount.c:#refcount_inc_checked          | 0          | 8          |
| RIP:refcount_inc_checked                                 | 0          | 8          |
| WARNING:at_lib/refcount.c:#refcount_sub_and_test_checked | 0          | 8          |
| RIP:refcount_sub_and_test_checked                        | 0          | 8          |
| BUG:KASAN:use-after-free_in_c                            | 0          | 1          |
+----------------------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot 


[   18.337218] BUG: KASAN: slab-out-of-bounds in 
css_task_iter_advance+0x1bd/0x240
[   18.338974] Read of size 4 at addr 888050ff294c by task systemd/1
[   18.340408] 
[   18.340960] CPU: 1 PID: 1 Comm: systemd Not tainted 5.2.0-rc2-00013-gc03cd77 
#1
[   18.342728] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.10.2-1 04/01/2014
[   18.344685] Call Trace:
[   18.345424]  dump_stack+0x7d/0xb8
[   18.346304]  ? css_task_iter_advance+0x1bd/0x240
[   18.347420]  print_address_description+0xa1/0x330
[   18.348547]  ? css_task_iter_advance+0x1bd/0x240
[   18.349658]  ? css_task_iter_advance+0x1bd/0x240
[   18.350767]  ? css_task_iter_advance+0x1bd/0x240
[   18.351878]  __kasan_report+0x11d/0x163
[   18.352850]  ? css_task_iter_advance+0x1bd/0x240
[   18.353965]  kasan_report+0x2f/0x40
[   18.354873]  __asan_load4+0x6a/0x90
[   18.355780]  css_task_iter_advance+0x1bd/0x240
[   18.356857]  css_task_iter_start+0xd0/0x120
[   18.357889]  pidlist_array_load+0x107/0x540
[   18.358921]  ? cgroup_pidlist_find+0xa0/0xa0
[   18.359972]  cgroup_pidlist_start+0x24e/0x2b0
[   18.361037]  cgroup_seqfile_start+0x57/0x60
[   18.362065]  ? cgroup_file_release+0x60/0x60
[   18.363111]  kernfs_seq_start+0x86/0xd0
[   18.364080]  seq_read+0x16e/0x750
[   18.364960]  kernfs_fop_read+0x23c/0x2b0
[   18.365949]  ? security_file_permission+0x140/0x1c0
[   18.367106]  ? kernfs_fop_write+0x280/0x280
[   18.368149]  __vfs_read+0x59/0xb0
[   18.369024]  vfs_read+0xeb/0x1d0
[   18.369888]  ksys_read+0x134/0x1b0
[   18.370787]  ? kernel_write+0xa0/0xa0
[   18.371734]  ? __this_cpu_preempt_check+0x2f/0x150
[   18.372922]  __x64_sys_read+0x43/0x50
[   18.373876]  do_syscall_64+0xd3/0x3a0
[   18.374930]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   18.376117] RIP: 0033:0x7f02f62f56d0
[   18.377046] Code: b6 fe ff ff 48 8d 3d 17 be 08 00 48 83 ec 08 e8 06 db 01 
00 66 0f 1f 44 00 00 83 3d 39 30 2c 00 00 75 10 b8 00 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 31 c3 48 83 ec 08 e8 de 9b 01 00 48 89 04 24
[   18.381026] RSP: 002b:7ffd6f7342b8 EFLAGS: 0246 ORIG_RAX: 

[   18.382834] RAX: ffda RBX: 55f99ecc8110 RCX: 7f02f62f56d0
[   18.384393] RDX: 1000 RSI: 55f99ec82710 RDI: 0023
[   18.385954] RBP: 0d68 R08: 7f02f65b41a8 R09: 1010
[   18.387509] R10: 0050 R11: 0246 R12: 7f02f65b0440
[   18.389058] R13: 7f02f65af900 R14:  R15: 
[   18.390621] 
[   18.391170] Allocated by task 1:
[   18.392034]  __kasan_kmalloc+0xe4/0x150
[   18.393199]  kasan_kmalloc+0x28/0x40
[   18.394119]  find_css_set+0x1ad/0x770
[   18.395058]  cgroup_migrate_prepare_dst+0x10d/0x3a0
[   18.396226]  cgroup_attach_task+0x1ee/0x290
[   18.397258]  __cgroup1_procs_write+0x17a/0x210
[   18.398523]  cgroup1_procs_write+0x2a/0x40
[   18.399539]  cgroup_file_write+0x190/0x330
[   18.400559]  kernfs_fop_write+0x1d9/0x280
[   18.401563]  __vfs_write+0x59/0xb0
[   18.402461]  vfs_write+0x13c/0x2d0
[   18.403354]  ksys_write+0x134/0x1b0
[   18.404262]  __x64_sys_write+0x43/0x50
[   18.405218]  do_syscall_64+0xd3/0x3a0

RE: [PATCH] usb: dwc3: Enable the USB snooping

2019-06-02 Thread Ran Wang
Hi Felipe,

On Thursday, May 30, 2019 17:09, Ran Wang wrote:
> 
> 
> > >> >> >  /* Global Debug Queue/FIFO Space Available Register */
> > >> >> >  #define DWC3_GDBGFIFOSPACE_NUM(n)  ((n) & 0x1f)
> > >> >> >  #define DWC3_GDBGFIFOSPACE_TYPE(n) (((n) << 5) & 0x1e0)
> > >> >> > @@ -859,6 +867,7 @@ struct dwc3_scratchpad_array {
> > >> >> >   * 3   - Reserved
> > >> >> >   * @imod_interval: set the interrupt moderation interval in 250ns
> > >> >> >   * increments or 0 to disable.
> > >> >> > + * @dma_coherent: set if enable dma-coherent.
> > >> >>
> > >> >> you're not enabling dma coherency, you're enabling cache snooping.
> > >> >> And this property should describe that. Also, keep in mind that
> > >> >> different devices may want different cache types for each of
> > >> >> those fields, so your property would have to be a lot more
> > >> >> complex. Something
> > like:
> > >> >>
> > >> >>   snps,cache-type = , , ...
> > >> >>
> > >> >> Then driver would have to parse this properly to setup GSBUSCFG0.
> > >
> > > According to the DesignWare Cores SuperSpeed USB 3.0 Controller
> > > Databook (v2.60a), it has described Type Bit Assignments for all
> > > supported
> > master bus type:
> > > AHB, AXI3, AXI4 and Native. I found the bit definition are different
> > > among
> > them.
> > > So, for the example you gave above, feel a little bit confused.
> > > Did you mean:
> > > snps,cache-type = ,  > > "cacheable">, , 
> >
> > yeah, something like that.
> 
> I think DATA_RD  should be a macro, right? So, where I can put its define?
> Create a dwc3.h in include/dt-bindings/usb/ ?

Could you please give me some advice here? I'd like to prepare the next
version of the patch after getting this settled.

> Another question that remains open: the DWC3 databook's Table 6-5 (Cache
> Type Bit Assignments) shows that the bit definitions differ per MBUS_TYPE,
> as below:
> 
>  MBUS_TYPE | bit[3]         | bit[2]        | bit[1]      | bit[0]
>  ------------------------------------------------------------------
>  AHB       | Cacheable      | Bufferable    | Privilege   | Data
>  AXI3      | Write Allocate | Read Allocate | Cacheable   | Bufferable
>  AXI4      | Allocate Other | Allocate      | Modifiable  | Bufferable
>  AXI4      | Other Allocate | Allocate      | Modifiable  | Bufferable
>  Native    | Same as AXI    | Same as AXI   | Same as AXI | Same as AXI
>  ------------------------------------------------------------------
>  Note: The AHB, AXI3, AXI4, and PCIe busses use different names for certain
>  signals, which have the same meaning:
>Bufferable = Posted
>Cacheable = Modifiable = Snoop (negation of No Snoop)
> 
> For Layerscape SoCs, MBUS_TYPE is AXI3. So I am not sure how to use
> snps,cache-type = , to cover all MBUS_TYPEs?
> (You can notice that AHB's and AXI3's cacheable bits are in different
> positions.) Or do I just need to handle the AXI3 case?

Also on this open. Thank you in advance.

Regards,
Ran
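
For illustration only, a hypothetical sketch of the parsing being discussed.
The property layout, the helper name, and the shift below are placeholders
and are not the real GSBUSCFG0 bit assignments:

	/* Hypothetical: one 4-bit cache-type value per transfer type
	 * (e.g. DATA_RD, DESC_RD, DATA_WR, DESC_WR), packed into a
	 * GSBUSCFG0 image. */
	static int dwc3_parse_cache_type(struct device *dev, u32 *cfg)
	{
		u32 vals[4];
		int ret, i;

		ret = device_property_read_u32_array(dev, "snps,cache-type",
						     vals, ARRAY_SIZE(vals));
		if (ret)
			return ret;

		for (i = 0; i < ARRAY_SIZE(vals); i++)
			*cfg |= (vals[i] & 0xf) << (16 + 4 * i);  /* placeholder */

		return 0;
	}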


[PATCH] unicore32: check stack pointer in get_wchan

2019-06-02 Thread Young Xiao
get_wchan() is lockless. The task may wake up at any time and change its
own stack, thus each next stack frame may be overwritten and filled with
random stuff.

This patch fixes an oops in unwind_frame() by adding stack pointer
validation on each step (as the x86 code does); unwind_frame() already
checks the frame pointer.

See commit 1b15ec7a7427 ("ARM: 7912/1: check stack pointer in get_wchan")
for details.

Signed-off-by: Young Xiao <92siuy...@gmail.com>
---
 arch/unicore32/kernel/process.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/unicore32/kernel/process.c b/arch/unicore32/kernel/process.c
index 2bc10b8..1899ebc 100644
--- a/arch/unicore32/kernel/process.c
+++ b/arch/unicore32/kernel/process.c
@@ -277,6 +277,7 @@ EXPORT_SYMBOL(dump_fpu);
 unsigned long get_wchan(struct task_struct *p)
 {
struct stackframe frame;
+   unsigned long stack_page;
int count = 0;
if (!p || p == current || p->state == TASK_RUNNING)
return 0;
@@ -285,9 +286,11 @@ unsigned long get_wchan(struct task_struct *p)
frame.sp = thread_saved_sp(p);
frame.lr = 0;   /* recovered from the stack */
frame.pc = thread_saved_pc(p);
+   stack_page = (unsigned long)task_stack_page(p);
do {
-   int ret = unwind_frame(&frame);
-   if (ret < 0)
+   if (frame.sp < stack_page ||
+   frame.sp >= stack_page + THREAD_SIZE ||
+   unwind_frame(&frame) < 0)
return 0;
if (!in_sched_functions(frame.pc))
return frame.pc;
-- 
2.7.4



Re: [PATCH v3 1/3] PCI: Introduce pcibios_ignore_alignment_request

2019-06-02 Thread Shawn Anastasio




On 5/30/19 10:56 PM, Alexey Kardashevskiy wrote:



On 31/05/2019 08:49, Shawn Anastasio wrote:

On 5/29/19 10:39 PM, Alexey Kardashevskiy wrote:



On 28/05/2019 17:39, Shawn Anastasio wrote:



On 5/28/19 1:27 AM, Alexey Kardashevskiy wrote:



On 28/05/2019 15:36, Oliver wrote:

On Tue, May 28, 2019 at 2:03 PM Shawn Anastasio 
wrote:


Introduce a new pcibios function pcibios_ignore_alignment_request
which allows the PCI core to defer to platform-specific code to
determine whether or not to ignore alignment requests for PCI
resources.

The existing behavior is to simply ignore alignment requests when
PCI_PROBE_ONLY is set. This behavior is maintained by the
default implementation of pcibios_ignore_alignment_request.

Signed-off-by: Shawn Anastasio 
---
    drivers/pci/pci.c   | 9 +++--
    include/linux/pci.h | 1 +
    2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8abc843b1615..8207a09085d1 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5882,6 +5882,11 @@ resource_size_t __weak
pcibios_default_alignment(void)
   return 0;
    }

+int __weak pcibios_ignore_alignment_request(void)
+{
+   return pci_has_flag(PCI_PROBE_ONLY);
+}
+
    #define RESOURCE_ALIGNMENT_PARAM_SIZE COMMAND_LINE_SIZE
    static char
resource_alignment_param[RESOURCE_ALIGNMENT_PARAM_SIZE] = {0};
    static DEFINE_SPINLOCK(resource_alignment_lock);
@@ -5906,9 +5911,9 @@ static resource_size_t
pci_specified_resource_alignment(struct pci_dev *dev,
   p = resource_alignment_param;
   if (!*p && !align)
   goto out;
-   if (pci_has_flag(PCI_PROBE_ONLY)) {
+   if (pcibios_ignore_alignment_request()) {
   align = 0;
-   pr_info_once("PCI: Ignoring requested alignments
(PCI_PROBE_ONLY)\n");
+   pr_info_once("PCI: Ignoring requested alignments\n");
   goto out;
   }


I think the logic here is questionable to begin with. If the user has
explicitly requested re-aligning a resource via the command line then
we should probably do it even if PCI_PROBE_ONLY is set. When it breaks
they get to keep the pieces.

That said, the real issue here is that PCI_PROBE_ONLY probably
shouldn't be set under qemu/kvm. Under the other hypervisor (PowerVM)
hotplugged devices are configured by firmware before it's passed to
the guest and we need to keep the FW assignments otherwise things
break. QEMU however doesn't do any BAR assignments and relies on that
being handled by the guest. At boot time this is done by SLOF, but
Linux only keeps SLOF around until it's extracted the device-tree.
Once that's done SLOF gets blown away and the kernel needs to do it's
own BAR assignments. I'm guessing there's a hack in there to make it
work today, but it's a little surprising that it works at all...



The hack is to run a modified qemu-aware "/usr/sbin/rtas_errd" in the
guest which receives an event from qemu (RAS_EPOW from
/proc/interrupts), fetches device tree chunks (and as I understand it -
they come with BARs from phyp but without from qemu) and writes "1" to
"/sys/bus/pci/rescan" which calls pci_assign_resource() eventually:


Interesting. Does this mean that the PHYP hotplug path doesn't
call pci_assign_resource?



I'd expect dlpar_add_slot() to be called under phyp and eventually
pci_device_add() which (I think) may or may not trigger later
reassignment.



If so it means the patch may not
break that platform after all, though it still may not be
the correct way of doing things.



We should probably stop enforcing the PCI_PROBE_ONLY flag - it seems
that (unless resource_alignment= is used) the pseries guest should just
walk through all allocated resources and leave them unchanged.


If we add a pcibios_default_alignment() implementation like was
suggested earlier, then it will behave as if the user has
specified resource_alignment= by default and SLOF's assignments
won't be honored (I think).



I removed pci_add_flags(PCI_PROBE_ONLY) from pSeries_setup_arch and
tried booting with and without pci=resource_alignment= and I can see no
difference - BARs are still aligned to 64K as programmed in SLOF; if I
hack SLOF to align to 4K or 32K - BARs get packed and the guest leaves
them unchanged.



I guess it boils down to one question - is it important that we
observe SLOF's initial BAR assignments?


It isn't if it's SLOF but it is if it's phyp. It used to not
allow/support BAR reassignment and even if it does not, I'd rather avoid
touching them.


A quick update. I tried removing pci_add_flags(PCI_PROBE_ONLY) which
worked, but if I add an implementation of pcibios_default_alignment
which simply returns PAGE_SIZE, my VM fails to boot and many errors
from the virtio disk driver are printed to the console.

After some investigation, it seems that with pcibios_default_alignment
present, Linux will reallocate all resources provided by SLOF on
boot. I'm still not sure why exactly 

[PATCH v3] mtd: rawnand: Add Macronix NAND read retry support

2019-06-02 Thread Mason Yang
Add support for Macronix NAND read retry.

Macronix NANDs support a specific read operation for data recovery,
which can be enabled with a SET_FEATURE.
The driver checks byte 167 of the vendor block in the ONFI parameter
page to see whether this high-reliability function is supported.

Signed-off-by: Mason Yang 
---
 drivers/mtd/nand/raw/nand_macronix.c | 45 
 1 file changed, 45 insertions(+)

diff --git a/drivers/mtd/nand/raw/nand_macronix.c 
b/drivers/mtd/nand/raw/nand_macronix.c
index fad57c3..58511ae 100644
--- a/drivers/mtd/nand/raw/nand_macronix.c
+++ b/drivers/mtd/nand/raw/nand_macronix.c
@@ -8,6 +8,50 @@
 
 #include "internals.h"
 
+#define MACRONIX_READ_RETRY_BIT BIT(0)
+#define MACRONIX_NUM_READ_RETRY_MODES 6
+
+struct nand_onfi_vendor_macronix {
+   u8 reserved;
+   u8 reliability_func;
+} __packed;
+
+static int macronix_nand_setup_read_retry(struct nand_chip *chip, int mode)
+{
+   u8 feature[ONFI_SUBFEATURE_PARAM_LEN];
+
+   if (!chip->parameters.supports_set_get_features ||
+   !test_bit(ONFI_FEATURE_ADDR_READ_RETRY,
+ chip->parameters.set_feature_list))
+   return -ENOTSUPP;
+
+   feature[0] = mode;
+   return nand_set_features(chip, ONFI_FEATURE_ADDR_READ_RETRY, feature);
+}
+
+static void macronix_nand_onfi_init(struct nand_chip *chip)
+{
+   struct nand_parameters *p = &chip->parameters;
+   struct nand_onfi_vendor_macronix *mxic;
+
+   if (!p->onfi)
+   return;
+
+   mxic = (struct nand_onfi_vendor_macronix *)p->onfi->vendor;
+   if ((mxic->reliability_func & MACRONIX_READ_RETRY_BIT) == 0)
+   return;
+
+   chip->read_retries = MACRONIX_NUM_READ_RETRY_MODES;
+   chip->setup_read_retry = macronix_nand_setup_read_retry;
+
+   if (p->supports_set_get_features) {
+   bitmap_set(p->set_feature_list,
+  ONFI_FEATURE_ADDR_READ_RETRY, 1);
+   bitmap_set(p->get_feature_list,
+  ONFI_FEATURE_ADDR_READ_RETRY, 1);
+   }
+}
+
 /*
  * Macronix AC series does not support using SET/GET_FEATURES to change
  * the timings unlike what is declared in the parameter page. Unflag
@@ -56,6 +100,7 @@ static int macronix_nand_init(struct nand_chip *chip)
chip->options |= NAND_BBM_FIRSTPAGE | NAND_BBM_SECONDPAGE;
 
macronix_nand_fix_broken_get_timings(chip);
+   macronix_nand_onfi_init(chip);
 
return 0;
 }
-- 
1.9.1
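
For context, a simplified sketch of how the raw NAND core consumes the two
hooks the patch fills in: on an uncorrectable page it steps through the
retry modes and resets to mode 0 afterwards (paraphrased from the read path
in nand_base.c, not verbatim):

	/* Inside the page-read loop, on an ECC failure: */
	if (ecc_failed && retry_mode + 1 < chip->read_retries) {
		retry_mode++;
		chip->setup_read_retry(chip, retry_mode);
		goto read_retry;		/* re-read the same page */
	}
	/* ... and once the page is done (or retries are exhausted): */
	if (retry_mode) {
		chip->setup_read_retry(chip, 0);	/* back to default */
		retry_mode = 0;
	}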



Re: [v4 3/7] drm/mediatek: add dsi reg commit disable control

2019-06-02 Thread CK Hu
Hi, Jitao:

On Sat, 2019-06-01 at 17:26 +0800, Jitao Shi wrote:
> The new DSI IP has shadow registers and working registers. Register
> values are written to the shadow registers and then, when triggered
> via the commit register, moved into the working registers.
> 
> This function is on by default, but this driver doesn't use it,
> so add a control to disable it.

Reviewed-by: CK Hu 

> 
> Signed-off-by: Jitao Shi 
> ---
>  drivers/gpu/drm/mediatek/mtk_dsi.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/gpu/drm/mediatek/mtk_dsi.c 
> b/drivers/gpu/drm/mediatek/mtk_dsi.c
> index a48db056df6c..eea47294079e 100644
> --- a/drivers/gpu/drm/mediatek/mtk_dsi.c
> +++ b/drivers/gpu/drm/mediatek/mtk_dsi.c
> @@ -131,6 +131,10 @@
>  #define VM_CMD_EN		BIT(0)
>  #define TS_VFP_EN		BIT(5)
>  
> +#define DSI_SHADOW_DEBUG	0x190U
> +#define FORCE_COMMIT		BIT(0)
> +#define BYPASS_SHADOW		BIT(1)
> +
>  #define CONFIG   (0xff << 0)
>  #define SHORT_PACKET 0
>  #define LONG_PACKET  2
> @@ -157,6 +161,7 @@ struct phy;
>  
>  struct mtk_dsi_driver_data {
>   const u32 reg_cmdq_off;
> + bool has_shadow_ctl;
>  };
>  
>  struct mtk_dsi {
> @@ -594,6 +599,11 @@ static int mtk_dsi_poweron(struct mtk_dsi *dsi)
>   }
>  
>   mtk_dsi_enable(dsi);
> +
> + if (dsi->driver_data->has_shadow_ctl)
> + writel(FORCE_COMMIT | BYPASS_SHADOW,
> +dsi->regs + DSI_SHADOW_DEBUG);
> +
>   mtk_dsi_reset_engine(dsi);
>   mtk_dsi_phy_timconfig(dsi);
>  




linux-next: build failure after merge of the clockevents tree

2019-06-02 Thread Stephen Rothwell
Hi Daniel,

After merging the clockevents tree, today's linux-next build (x86_64
allmodconfig) failed like this:

drivers/clocksource/timer-atmel-tcb.c: In function 'tcb_clksrc_init':
drivers/clocksource/timer-atmel-tcb.c:448:17: error: invalid use of undefined 
type 'struct delay_timer'
   tc_delay_timer.read_current_timer = tc_delay_timer_read32;
 ^
drivers/clocksource/timer-atmel-tcb.c:461:17: error: invalid use of undefined 
type 'struct delay_timer'
   tc_delay_timer.read_current_timer = tc_delay_timer_read;
 ^
drivers/clocksource/timer-atmel-tcb.c:476:16: error: invalid use of undefined 
type 'struct delay_timer'
  tc_delay_timer.freq = divided_rate;
^
drivers/clocksource/timer-atmel-tcb.c:477:2: error: implicit declaration of 
function 'register_current_timer_delay'; did you mean 'read_current_timer'? 
[-Werror=implicit-function-declaration]
  register_current_timer_delay(_delay_timer);
  ^~~~
  read_current_timer
drivers/clocksource/timer-atmel-tcb.c: At top level:
drivers/clocksource/timer-atmel-tcb.c:129:27: error: storage size of 
'tc_delay_timer' isn't known
 static struct delay_timer tc_delay_timer;
   ^~
cc1: some warnings being treated as errors

Caused by commit

  dd40f5020581 ("clocksource/drivers/tcb_clksrc: Register delay timer")

I have reverted that commit for today.

-- 
Cheers,
Stephen Rothwell




[PATCH V2 net-next 09/10] net: hns3: add opcode about query and clear RAS & MSI-X to special opcode

2019-06-02 Thread Huazhong Tan
From: Weihang Li 

There are four commands used to query and clear RAS and MSI-X
interrupt status. They should be contained in the array of special opcodes
because these commands consist of several descriptors, and we need to check
the return value in the first descriptor rather than in the last one as for
other opcodes. In addition, we shouldn't set the NEXT_FLAG of the first
descriptor.

This patch fixes the above issues.

Signed-off-by: Weihang Li 
Signed-off-by: Peng Li 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c |  6 +-
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c | 16 
 2 files changed, 5 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
index e532905..7a3bde7 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c
@@ -173,7 +173,11 @@ static bool hclge_is_special_opcode(u16 opcode)
 HCLGE_OPC_STATS_MAC,
 HCLGE_OPC_STATS_MAC_ALL,
 HCLGE_OPC_QUERY_32_BIT_REG,
-HCLGE_OPC_QUERY_64_BIT_REG};
+HCLGE_OPC_QUERY_64_BIT_REG,
+HCLGE_QUERY_CLEAR_MPF_RAS_INT,
+HCLGE_QUERY_CLEAR_PF_RAS_INT,
+HCLGE_QUERY_CLEAR_ALL_MPF_MSIX_INT,
+HCLGE_QUERY_CLEAR_ALL_PF_MSIX_INT};
int i;
 
for (i = 0; i < ARRAY_SIZE(spec_opcode); i++) {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
index 83b07ce..b4a7e6a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
@@ -1098,8 +1098,6 @@ static int hclge_handle_mpf_ras_error(struct hclge_dev 
*hdev,
/* query all main PF RAS errors */
hclge_cmd_setup_basic_desc(&desc[0], HCLGE_QUERY_CLEAR_MPF_RAS_INT,
   true);
-   desc[0].flag |= cpu_to_le16(HCLGE_CMD_FLAG_NEXT);
-
ret = hclge_cmd_send(&hdev->hw, &desc[0], num);
if (ret) {
dev_err(dev, "query all mpf ras int cmd failed (%d)\n", ret);
@@ -1262,8 +1260,6 @@ static int hclge_handle_mpf_ras_error(struct hclge_dev 
*hdev,
 
/* clear all main PF RAS errors */
hclge_cmd_reuse_desc(&desc[0], false);
-   desc[0].flag |= cpu_to_le16(HCLGE_CMD_FLAG_NEXT);
-
ret = hclge_cmd_send(&hdev->hw, &desc[0], num);
if (ret)
dev_err(dev, "clear all mpf ras int cmd failed (%d)\n", ret);
@@ -1293,8 +1289,6 @@ static int hclge_handle_pf_ras_error(struct hclge_dev 
*hdev,
/* query all PF RAS errors */
hclge_cmd_setup_basic_desc(&desc[0], HCLGE_QUERY_CLEAR_PF_RAS_INT,
   true);
-   desc[0].flag |= cpu_to_le16(HCLGE_CMD_FLAG_NEXT);
-
ret = hclge_cmd_send(&hdev->hw, &desc[0], num);
if (ret) {
dev_err(dev, "query all pf ras int cmd failed (%d)\n", ret);
@@ -1348,8 +1342,6 @@ static int hclge_handle_pf_ras_error(struct hclge_dev 
*hdev,
 
/* clear all PF RAS errors */
hclge_cmd_reuse_desc(&desc[0], false);
-   desc[0].flag |= cpu_to_le16(HCLGE_CMD_FLAG_NEXT);
-
ret = hclge_cmd_send(&hdev->hw, &desc[0], num);
if (ret)
dev_err(dev, "clear all pf ras int cmd failed (%d)\n", ret);
@@ -1667,8 +1659,6 @@ int hclge_handle_hw_msix_error(struct hclge_dev *hdev,
/* query all main PF MSIx errors */
hclge_cmd_setup_basic_desc(&desc[0], HCLGE_QUERY_CLEAR_ALL_MPF_MSIX_INT,
   true);
-   desc[0].flag |= cpu_to_le16(HCLGE_CMD_FLAG_NEXT);
-
ret = hclge_cmd_send(&hdev->hw, &desc[0], mpf_bd_num);
if (ret) {
dev_err(dev, "query all mpf msix int cmd failed (%d)\n",
@@ -1700,8 +1690,6 @@ int hclge_handle_hw_msix_error(struct hclge_dev *hdev,
 
/* clear all main PF MSIx errors */
hclge_cmd_reuse_desc(&desc[0], false);
-   desc[0].flag |= cpu_to_le16(HCLGE_CMD_FLAG_NEXT);
-
ret = hclge_cmd_send(&hdev->hw, &desc[0], mpf_bd_num);
if (ret) {
dev_err(dev, "clear all mpf msix int cmd failed (%d)\n",
@@ -1713,8 +1701,6 @@ int hclge_handle_hw_msix_error(struct hclge_dev *hdev,
memset(desc, 0, bd_num * sizeof(struct hclge_desc));
hclge_cmd_setup_basic_desc(&desc[0], HCLGE_QUERY_CLEAR_ALL_PF_MSIX_INT,
   true);
-   desc[0].flag |= cpu_to_le16(HCLGE_CMD_FLAG_NEXT);
-
ret = hclge_cmd_send(&hdev->hw, &desc[0], pf_bd_num);
if (ret) {
dev_err(dev, "query all pf msix int cmd failed (%d)\n",
@@ -1753,8 +1739,6 @@ int hclge_handle_hw_msix_error(struct hclge_dev *hdev,
 
/* clear all PF MSIx errors */
hclge_cmd_reuse_desc(&desc[0], false);
-   desc[0].flag |= 

[PATCH V2 net-next 10/10] net: hns3: delay and separate enabling of NIC and ROCE HW errors

2019-06-02 Thread Huazhong Tan
From: Weihang Li 

All RAS and MSI-X error interrupts should be enabled only in the final
stage of HNS3 initialization. This means they should be enabled in
hclge_init_xxx_client_instance() instead of hclge_init_ae_dev().
Especially for MSI-X: if it is enabled before the vector0 IRQ is opened,
there is a chance that an MSI-X error will cause the initialization of
the NIC client instance to fail. So this patch delays the enabling of HW
errors. In addition, we separate the enabling of ROCE RAS from NIC,
because it's not reasonable to enable ROCE RAS if we don't even have a
ROCE driver.

Signed-off-by: Weihang Li 
Signed-off-by: Peng Li 
Signed-off-by: Huazhong tan 
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c |  9 +
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h |  3 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 45 +++---
 3 files changed, 36 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
index b4a7e6a..784512d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
@@ -1493,7 +1493,7 @@ hclge_log_and_clear_rocee_ras_error(struct hclge_dev 
*hdev)
return reset_type;
 }
 
-static int hclge_config_rocee_ras_interrupt(struct hclge_dev *hdev, bool en)
+int hclge_config_rocee_ras_interrupt(struct hclge_dev *hdev, bool en)
 {
struct device *dev = &hdev->pdev->dev;
struct hclge_desc desc;
@@ -1566,10 +1566,9 @@ static const struct hclge_hw_blk hw_blk[] = {
{ /* sentinel */ }
 };
 
-int hclge_hw_error_set_state(struct hclge_dev *hdev, bool state)
+int hclge_config_nic_hw_error(struct hclge_dev *hdev, bool state)
 {
const struct hclge_hw_blk *module = hw_blk;
-   struct device *dev = &hdev->pdev->dev;
int ret = 0;
 
while (module->name) {
@@ -1581,10 +1580,6 @@ int hclge_hw_error_set_state(struct hclge_dev *hdev, 
bool state)
module++;
}
 
-   ret = hclge_config_rocee_ras_interrupt(hdev, state);
-   if (ret)
-   dev_err(dev, "fail(%d) to configure ROCEE err int\n", ret);
-
return ret;
 }
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
index c56b11e..81d115a 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
@@ -119,7 +119,8 @@ struct hclge_hw_error {
 };
 
 int hclge_config_mac_tnl_int(struct hclge_dev *hdev, bool en);
-int hclge_hw_error_set_state(struct hclge_dev *hdev, bool state);
+int hclge_config_nic_hw_error(struct hclge_dev *hdev, bool state);
+int hclge_config_rocee_ras_interrupt(struct hclge_dev *hdev, bool en);
 pci_ers_result_t hclge_handle_hw_ras_error(struct hnae3_ae_dev *ae_dev);
 int hclge_handle_hw_msix_error(struct hclge_dev *hdev,
   unsigned long *reset_requests);
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 4873a8e..35d2a45 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -8202,10 +8202,16 @@ static int hclge_init_nic_client_instance(struct 
hnae3_ae_dev *ae_dev,
set_bit(HCLGE_STATE_NIC_REGISTERED, &hdev->state);
hnae3_set_client_init_flag(client, ae_dev, 1);
 
+   /* Enable nic hw error interrupts */
+   ret = hclge_config_nic_hw_error(hdev, true);
+   if (ret)
+   dev_err(&ae_dev->pdev->dev,
+   "fail(%d) to enable hw error interrupts\n", ret);
+
if (netif_msg_drv(&hdev->vport->nic))
hclge_info_show(hdev);
 
-   return 0;
+   return ret;
 }
 
 static int hclge_init_roce_client_instance(struct hnae3_ae_dev *ae_dev,
@@ -8285,7 +8291,13 @@ static int hclge_init_client_instance(struct 
hnae3_client *client,
}
}
 
-   return 0;
+   /* Enable roce ras interrupts */
+   ret = hclge_config_rocee_ras_interrupt(hdev, true);
+   if (ret)
+   dev_err(&ae_dev->pdev->dev,
+   "fail(%d) to enable roce ras interrupts\n", ret);
+
+   return ret;
 
 clear_nic:
hdev->nic_client = NULL;
@@ -8589,13 +8601,6 @@ static int hclge_init_ae_dev(struct hnae3_ae_dev *ae_dev)
goto err_mdiobus_unreg;
}
 
-   ret = hclge_hw_error_set_state(hdev, true);
-   if (ret) {
-   dev_err(&pdev->dev,
-   "fail(%d) to enable hw error interrupts\n", ret);
-   goto err_mdiobus_unreg;
-   }
-
INIT_KFIFO(hdev->mac_tnl_log);
 
hclge_dcb_ops_set(hdev);
@@ -8719,15 +8724,26 @@ static int hclge_reset_ae_dev(struct hnae3_ae_dev 
*ae_dev)
}
 
/* Re-enable the hw error interrupts because
-* the interrupts get disabled on core/global 

[PATCH V2 net-next 06/10] net: hns3: set ops to null when unregister ad_dev

2019-06-02 Thread Huazhong Tan
From: Weihang Li 

The hclge/hclgevf and hns3 modules can be unloaded independently.
When hclge/hclgevf is unloaded first, the ops of ae_dev should
be set to NULL, otherwise it will cause a use-after-free problem.

Fixes: 38caee9d3ee8 ("net: hns3: Add support of the HNAE3 framework")
Signed-off-by: Weihang Li 
Signed-off-by: Peng Li 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.c 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.c
index fa8b850..738e013 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.c
@@ -251,6 +251,7 @@ void hnae3_unregister_ae_algo(struct hnae3_ae_algo *ae_algo)
 
ae_algo->ops->uninit_ae_dev(ae_dev);
hnae3_set_bit(ae_dev->flag, HNAE3_DEV_INITED_B, 0);
+   ae_dev->ops = NULL;
}
 
list_del(&ae_algo->node);
@@ -351,6 +352,7 @@ void hnae3_unregister_ae_dev(struct hnae3_ae_dev *ae_dev)
 
ae_algo->ops->uninit_ae_dev(ae_dev);
hnae3_set_bit(ae_dev->flag, HNAE3_DEV_INITED_B, 0);
+   ae_dev->ops = NULL;
}
 
list_del(&ae_dev->node);
-- 
2.7.4



[PATCH V2 net-next 03/10] net: hns3: fix VLAN filter restore issue after reset

2019-06-02 Thread Huazhong Tan
From: Jian Shen 

In the original code, the driver only restores VLAN filter entries
for the PF after reset; the VLAN entries of the VFs are lost in
this case.

This patch fixes it by recording the VLAN IDs of each function
when a VLAN is added, and restoring those VLAN IDs after reset.

Fixes: 681ec3999b3d ("net: hns3: fix for vlan table lost problem when 
resetting")
Signed-off-by: Jian Shen 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|  3 ++
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c| 34 ++
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|  1 -
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 42 +++---
 4 files changed, 43 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index 51c2ff1..2e478d9 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -338,6 +338,8 @@ struct hnae3_ae_dev {
  *   Set vlan filter config of Ports
  * set_vf_vlan_filter()
  *   Set vlan filter config of vf
+ * restore_vlan_table()
+ *   Restore vlan filter entries after reset
  * enable_hw_strip_rxvtag()
  *   Enable/disable hardware strip vlan tag of packets received
  * set_gro_en
@@ -505,6 +507,7 @@ struct hnae3_ae_ops {
void (*set_timer_task)(struct hnae3_handle *handle, bool enable);
int (*mac_connect_phy)(struct hnae3_handle *handle);
void (*mac_disconnect_phy)(struct hnae3_handle *handle);
+   void (*restore_vlan_table)(struct hnae3_handle *handle);
 };
 
 struct hnae3_dcb_ops {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index f6dc305..1e68bcb 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1548,15 +1548,11 @@ static int hns3_vlan_rx_add_vid(struct net_device 
*netdev,
__be16 proto, u16 vid)
 {
struct hnae3_handle *h = hns3_get_handle(netdev);
-   struct hns3_nic_priv *priv = netdev_priv(netdev);
int ret = -EIO;
 
if (h->ae_algo->ops->set_vlan_filter)
ret = h->ae_algo->ops->set_vlan_filter(h, proto, vid, false);
 
-   if (!ret)
-   set_bit(vid, priv->active_vlans);
-
return ret;
 }
 
@@ -1564,33 +1560,11 @@ static int hns3_vlan_rx_kill_vid(struct net_device 
*netdev,
 __be16 proto, u16 vid)
 {
struct hnae3_handle *h = hns3_get_handle(netdev);
-   struct hns3_nic_priv *priv = netdev_priv(netdev);
int ret = -EIO;
 
if (h->ae_algo->ops->set_vlan_filter)
ret = h->ae_algo->ops->set_vlan_filter(h, proto, vid, true);
 
-   if (!ret)
-   clear_bit(vid, priv->active_vlans);
-
-   return ret;
-}
-
-static int hns3_restore_vlan(struct net_device *netdev)
-{
-   struct hns3_nic_priv *priv = netdev_priv(netdev);
-   int ret = 0;
-   u16 vid;
-
-   for_each_set_bit(vid, priv->active_vlans, VLAN_N_VID) {
-   ret = hns3_vlan_rx_add_vid(netdev, htons(ETH_P_8021Q), vid);
-   if (ret) {
-   netdev_err(netdev, "Restore vlan: %d filter, ret:%d\n",
-  vid, ret);
-   return ret;
-   }
-   }
-
return ret;
 }
 
@@ -4301,12 +4275,8 @@ static int hns3_reset_notify_restore_enet(struct 
hnae3_handle *handle)
vlan_filter_enable = netdev->flags & IFF_PROMISC ? false : true;
hns3_enable_vlan_filter(netdev, vlan_filter_enable);
 
-   /* Hardware table is only clear when pf resets */
-   if (!(handle->flags & HNAE3_SUPPORT_VF)) {
-   ret = hns3_restore_vlan(netdev);
-   if (ret)
-   return ret;
-   }
+   if (handle->ae_algo->ops->restore_vlan_table)
+   handle->ae_algo->ops->restore_vlan_table(handle);
 
return hns3_restore_fd_rules(netdev);
 }
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
index 408efd5..efab15f 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.h
@@ -550,7 +550,6 @@ struct hns3_nic_priv {
struct notifier_block notifier_block;
/* Vxlan/Geneve information */
struct hns3_udp_tunnel udp_tnl[HNS3_UDP_TNL_MAX];
-   unsigned long active_vlans[BITS_TO_LONGS(VLAN_N_VID)];
struct hns3_enet_coalesce tx_coal;
struct hns3_enet_coalesce rx_coal;
 };
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 1215455..4873a8e 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -7401,10 +7401,6 @@ static void 

Re: [PATCH v2 net-next] net: link_watch: prevent starvation when processing linkwatch wq

2019-06-02 Thread Yunsheng Lin
On 2019/5/31 19:17, Salil Mehta wrote:
>> From: netdev-ow...@vger.kernel.org [mailto:netdev-
>> ow...@vger.kernel.org] On Behalf Of Yunsheng Lin
>> Sent: Friday, May 31, 2019 10:01 AM
>> To: da...@davemloft.net
>> Cc: hkallwe...@gmail.com; f.faine...@gmail.com;
>> step...@networkplumber.org; net...@vger.kernel.org; linux-
>> ker...@vger.kernel.org; Linuxarm 
>> Subject: [PATCH v2 net-next] net: link_watch: prevent starvation when
>> processing linkwatch wq
>>
>> When the user has configured a large number of virtual netdevs, such
>> as 4K vlans, a carrier on/off operation on the real netdev
>> will also cause its virtual netdevs' link state to be processed
>> in linkwatch. Currently, the processing is done in a work queue,
>> which may cause cpu and rtnl locking starvation problems.
>>
>> This patch releases the cpu and rtnl lock when link watch worker
>> has processed a fixed number of netdev' link watch event.
>>
>> Currently __linkwatch_run_queue is called with rtnl lock, so
>> enforce it with ASSERT_RTNL();
>>
>> Signed-off-by: Yunsheng Lin 
>> ---
>> V2: use cond_resched and rtnl_unlock after processing a fixed
>> number of events
>> ---
>>  net/core/link_watch.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/net/core/link_watch.c b/net/core/link_watch.c
>> index 7f51efb..07eebfb 100644
>> --- a/net/core/link_watch.c
>> +++ b/net/core/link_watch.c
>> @@ -168,9 +168,18 @@ static void linkwatch_do_dev(struct net_device
>> *dev)
>>
>>  static void __linkwatch_run_queue(int urgent_only)
>>  {
>> +#define MAX_DO_DEV_PER_LOOP 100
>> +
>> +int do_dev = MAX_DO_DEV_PER_LOOP;
>>  struct net_device *dev;
>>  LIST_HEAD(wrk);
>>
>> +ASSERT_RTNL();
>> +
>> +/* Give urgent case more budget */
>> +if (urgent_only)
>> +do_dev += MAX_DO_DEV_PER_LOOP;
>> +
>>  /*
>>   * Limit the number of linkwatch events to one
>>   * per second so that a runaway driver does not
>> @@ -200,6 +209,14 @@ static void __linkwatch_run_queue(int urgent_only)
>>  }
>>  spin_unlock_irq(&lweventlist_lock);
>>  linkwatch_do_dev(dev);
>> +
>> +if (--do_dev < 0) {
>> +rtnl_unlock();
>> +cond_resched();
> 
> 
> 
> Sorry, missed in my earlier comment. I could see multiple problems here
> and please correct me if I am wrong:
> 
> 1. It looks like releasing the rtnl_lock here and then res-scheduling might
>not be safe, especially when you have already held *lweventlist_lock*
>(which is global and not per-netdev), and when you are trying to
>reschedule. This can cause *deadlock* with itself.
> 
>Reason: once you release the rtnl_lock() the similar leg of function 
>netdev_wait_allrefs() could be called for some other netdevice which
>might end up in waiting for same global linkwatch event list lock
>i.e. *lweventlist_lock*.

lweventlist_lock has been released before releasing the rtnl_lock and
rescheduling.

> 
> 2. After releasing the rtnl_lock() we have not ensured that all the rcu
>operations are complete. Perhaps we need to take rcu_barrier() before
>retaking the rtnl_lock()
Why do we need to ensure all the rcu operations are complete here?

> 
> 
> 
> 
>> +do_dev = MAX_DO_DEV_PER_LOOP;
> 
> 
> 
> Here, I think rcu_barrier() should exist.

In netdev_wait_allrefs, rcu_barrier is indeed called between
__rtnl_unlock and rtnl_lock and is added by below commit
0115e8e30d6f ("net: remove delay at device dismantle"), which
seems to work with NETDEV_UNREGISTER_FINAL.

And the NETDEV_UNREGISTER_FINAL is removed by commit
070f2d7e264a ("net: Drop NETDEV_UNREGISTER_FINAL"), which says
something about whether the rcu_barrier is still needed.

"dev_change_net_namespace() and netdev_wait_allrefs()
have rcu_barrier() before NETDEV_UNREGISTER_FINAL call,
and the source commits say they were introduced to
delemit the call with NETDEV_UNREGISTER, but this patch
leaves them on the places, since they require additional
analysis, whether we need in them for something else."

So the reason of calling rcu_barrier in netdev_wait_allrefs
is unclear now.

Also rcu_barrier in netdev_wait_allrefs is added to fix the
device dismantle problem, so for linkwatch, maybe it is not
needed.

> 
> 
> 
>> +rtnl_lock();
>> +}
>> +
>>  spin_lock_irq(&lweventlist_lock);
>>  }
> 
> 
> .
> 



[PATCH V2 net-next 04/10] net: hns3: set the port shaper according to MAC speed

2019-06-02 Thread Huazhong Tan
From: Yunsheng Lin 

This patch sets the port shaper according to the MAC speed as
suggested by hardware user manual.

Signed-off-by: Yunsheng Lin 
Signed-off-by: Peng Li 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
index a7bbb6d..fac5193 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c
@@ -397,7 +397,7 @@ static int hclge_tm_port_shaper_cfg(struct hclge_dev *hdev)
u8 ir_u, ir_b, ir_s;
int ret;
 
-   ret = hclge_shaper_para_calc(HCLGE_ETHER_MAX_RATE,
+   ret = hclge_shaper_para_calc(hdev->hw.mac.speed,
 HCLGE_SHAPER_LVL_PORT,
 &ir_b, &ir_u, &ir_s);
if (ret)
-- 
2.7.4



[PATCH V2 net-next 07/10] net: hns3: add handling of two bits in MAC tunnel interrupts

2019-06-02 Thread Huazhong Tan
From: Weihang Li 

LINK_UP and LINK_DOWN are two bits of the MAC tunnel interrupts, but the
HNS3 driver previously didn't handle them. If they were enabled, the
value of these two bits would change on link down and link up, causing
the HNS3 driver to keep receiving IRQs that it couldn't handle.

This patch adds handling of these two bits of interrupts: we record
and clear them as we do for the other MAC tunnel interrupts.

Signed-off-by: Weihang Li 
Signed-off-by: Peng Li 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c | 2 +-
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c
index ed1f533..e1007d9 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c
@@ -1053,7 +1053,7 @@ static void hclge_dbg_dump_mac_tnl_status(struct 
hclge_dev *hdev)
 
while (kfifo_get(&hdev->mac_tnl_log, &stats)) {
rem_nsec = do_div(stats.time, HCLGE_BILLION_NANO_SECONDS);
-   dev_info(>pdev->dev, "[%07lu.%03lu]status = 0x%x\n",
+   dev_info(>pdev->dev, "[%07lu.%03lu] status = 0x%x\n",
 (unsigned long)stats.time, rem_nsec / 1000,
 stats.status);
}
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
index 9645590..c56b11e 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h
@@ -47,9 +47,9 @@
 #define HCLGE_NCSI_ERR_INT_TYPE0x9
 #define HCLGE_MAC_COMMON_ERR_INT_EN0x107FF
 #define HCLGE_MAC_COMMON_ERR_INT_EN_MASK   0x107FF
-#define HCLGE_MAC_TNL_INT_EN   GENMASK(7, 0)
-#define HCLGE_MAC_TNL_INT_EN_MASK  GENMASK(7, 0)
-#define HCLGE_MAC_TNL_INT_CLR  GENMASK(7, 0)
+#define HCLGE_MAC_TNL_INT_EN   GENMASK(9, 0)
+#define HCLGE_MAC_TNL_INT_EN_MASK  GENMASK(9, 0)
+#define HCLGE_MAC_TNL_INT_CLR  GENMASK(9, 0)
 #define HCLGE_PPU_MPF_ABNORMAL_INT0_EN GENMASK(31, 0)
 #define HCLGE_PPU_MPF_ABNORMAL_INT0_EN_MASKGENMASK(31, 0)
 #define HCLGE_PPU_MPF_ABNORMAL_INT1_EN GENMASK(31, 0)
-- 
2.7.4



[PATCH V2 net-next 05/10] net: hns3: add a check to pointer in error_detected and slot_reset

2019-06-02 Thread Huazhong Tan
From: Weihang Li 

If we add a VF without loading hclgevf.ko and a RAS error then occurs,
PCIe AER will call error_detected and slot_reset of all functions,
and will get a NULL pointer when we check ae_dev->ops->handle_hw_ras_error.
This will cause a call trace and failures in the handling of follow-up
RAS errors.

This patch checks ae_dev and ae_dev->ops first to solve the above issues.

Signed-off-by: Weihang Li 
Signed-off-by: Peng Li 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 1e68bcb..0501b78 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1920,9 +1920,9 @@ static pci_ers_result_t hns3_error_detected(struct 
pci_dev *pdev,
if (state == pci_channel_io_perm_failure)
return PCI_ERS_RESULT_DISCONNECT;
 
-   if (!ae_dev) {
+   if (!ae_dev || !ae_dev->ops) {
dev_err(&pdev->dev,
-   "Can't recover - error happened during device init\n");
+   "Can't recover - error happened before device 
initialized\n");
return PCI_ERS_RESULT_NONE;
}
 
@@ -1941,6 +1941,9 @@ static pci_ers_result_t hns3_slot_reset(struct pci_dev 
*pdev)
 
dev_info(dev, "requesting reset due to PCI error\n");
 
+   if (!ae_dev || !ae_dev->ops)
+   return PCI_ERS_RESULT_NONE;
+
/* request the reset */
if (ae_dev->ops->reset_event) {
if (!ae_dev->override_pci_need_reset)
-- 
2.7.4



[PATCH V2 net-next 02/10] net: hns3: don't configure new VLAN ID into VF VLAN table when it's full

2019-06-02 Thread Huazhong Tan
From: Jian Shen 

The VF VLAN table can support no more than 256 VLANs. When the user
adds too many VLANs, the VF VLAN table becomes full, and the firmware
will close the VF VLAN table for the function. When the VF VLAN table
is full and the user keeps adding new VLANs, it's unnecessary to
configure the VF VLAN table, because it will always fail and print a
warning message. The worst case is adding 4K VLANs and then resetting:
it will take so much time to restore these VLANs that the VF
reset may fail by timeout.

Fixes: 6c251711b37f ("net: hns3: Disable vf vlan filter when vf vlan table is 
full")
Signed-off-by: Jian Shen 
Signed-off-by: Peng Li 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 8 
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index f0f618d..1215455 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -7025,6 +7025,12 @@ static int hclge_set_vf_vlan_common(struct hclge_dev 
*hdev, int vfid,
u8 vf_byte_off;
int ret;
 
+   /* if vf vlan table is full, firmware will close vf vlan filter, it
+* is unable and unnecessary to add new vlan id to vf vlan filter
+*/
+   if (test_bit(vfid, hdev->vf_vlan_full) && !is_kill)
+   return 0;
+
hclge_cmd_setup_basic_desc(&desc[0],
   HCLGE_OPC_VLAN_FILTER_VF_CFG, false);
hclge_cmd_setup_basic_desc(&desc[1],
@@ -7060,6 +7066,7 @@ static int hclge_set_vf_vlan_common(struct hclge_dev 
*hdev, int vfid,
return 0;
 
if (req0->resp_code == HCLGE_VF_VLAN_NO_ENTRY) {
+   set_bit(vfid, hdev->vf_vlan_full);
dev_warn(&hdev->pdev->dev,
 "vf vlan table is full, vf vlan filter is 
disabled\n");
return 0;
@@ -8621,6 +8628,7 @@ static int hclge_reset_ae_dev(struct hnae3_ae_dev *ae_dev)
 
hclge_stats_clear(hdev);
memset(hdev->vlan_table, 0, sizeof(hdev->vlan_table));
+   memset(hdev->vf_vlan_full, 0, sizeof(hdev->vf_vlan_full));
 
ret = hclge_cmd_init(hdev);
if (ret) {
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
index 2b3bc95..414f7db 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.h
@@ -820,6 +820,7 @@ struct hclge_dev {
struct hclge_vlan_type_cfg vlan_type_cfg;
 
unsigned long vlan_table[VLAN_N_VID][BITS_TO_LONGS(HCLGE_VPORT_NUM)];
+   unsigned long vf_vlan_full[BITS_TO_LONGS(HCLGE_VPORT_NUM)];
 
struct hclge_fd_cfg fd_cfg;
struct hlist_head fd_rule_list;
-- 
2.7.4



[PATCH V2 net-next 08/10] net: hns3: remove setting bit of reset_requests when handling mac tunnel interrupts

2019-06-02 Thread Huazhong Tan
From: Weihang Li 

We shouldn't set the HNAE3_NONE_RESET bit of the variable that represents
a reset request during the handling of MSI-X errors, or it may cause
issues when a reset is triggered.

Signed-off-by: Weihang Li 
Signed-off-by: Peng Li 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
index 55c4a1b..83b07ce 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
@@ -1783,7 +1783,6 @@ int hclge_handle_hw_msix_error(struct hclge_dev *hdev,
ret = hclge_clear_mac_tnl_int(hdev);
if (ret)
dev_err(dev, "clear mac tnl int failed (%d)\n", ret);
-   set_bit(HNAE3_NONE_RESET, reset_requests);
}
 
 msi_error:
-- 
2.7.4
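
Why the removed line mattered: reset_requests is a bitmap that is later
scanned for pending reset levels, so recording "no reset" by setting a
bit still makes the mask look non-empty. A minimal sketch of that
failure mode, with hypothetical enum values standing in for the
hnae3_reset_type levels:

#include <stdio.h>

enum reset_type {	/* hypothetical stand-ins for hnae3_reset_type */
	RESET_NONE,
	RESET_FUNC,
	RESET_GLOBAL,
};

int main(void)
{
	unsigned long reset_requests = 0;

	/* The removed line: recording "none" still sets a bit... */
	reset_requests |= 1UL << RESET_NONE;

	/* ...so a later "any reset pending?" check wrongly fires. */
	if (reset_requests)
		printf("spurious reset request (mask=0x%lx)\n",
		       reset_requests);
	return 0;
}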



[PATCH V2 net-next 01/10] net: hns3: remove redundant core reset

2019-06-02 Thread Huazhong Tan
Since the core reset is similar to the global reset, this patch
removes it and uses the global reset instead.

Signed-off-by: Huazhong Tan 
Signed-off-by: Peng Li 
---
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|  1 -
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c | 24 +--
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 28 --
 3 files changed, 12 insertions(+), 41 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hnae3.h 
b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
index a18645e..51c2ff1 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hnae3.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hnae3.h
@@ -154,7 +154,6 @@ enum hnae3_reset_type {
HNAE3_VF_FULL_RESET,
HNAE3_FLR_RESET,
HNAE3_FUNC_RESET,
-   HNAE3_CORE_RESET,
HNAE3_GLOBAL_RESET,
HNAE3_IMP_RESET,
HNAE3_UNKNOWN_RESET,
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
index 4ac8063..55c4a1b 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c
@@ -87,25 +87,25 @@ static const struct hclge_hw_error 
hclge_msix_sram_ecc_int[] = {
 
 static const struct hclge_hw_error hclge_igu_int[] = {
{ .int_msk = BIT(0), .msg = "igu_rx_buf0_ecc_mbit_err",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ .int_msk = BIT(2), .msg = "igu_rx_buf1_ecc_mbit_err",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ /* sentinel */ }
 };
 
 static const struct hclge_hw_error hclge_igu_egu_tnl_int[] = {
{ .int_msk = BIT(0), .msg = "rx_buf_overflow",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ .int_msk = BIT(1), .msg = "rx_stp_fifo_overflow",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ .int_msk = BIT(2), .msg = "rx_stp_fifo_undeflow",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ .int_msk = BIT(3), .msg = "tx_buf_overflow",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ .int_msk = BIT(4), .msg = "tx_buf_underrun",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ .int_msk = BIT(5), .msg = "rx_stp_buf_overflow",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ /* sentinel */ }
 };
 
@@ -413,13 +413,13 @@ static const struct hclge_hw_error 
hclge_ppu_mpf_abnormal_int_st2[] = {
 
 static const struct hclge_hw_error hclge_ppu_mpf_abnormal_int_st3[] = {
{ .int_msk = BIT(4), .msg = "gro_bd_ecc_mbit_err",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ .int_msk = BIT(5), .msg = "gro_context_ecc_mbit_err",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ .int_msk = BIT(6), .msg = "rx_stash_cfg_ecc_mbit_err",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ .int_msk = BIT(7), .msg = "axi_rd_fbd_ecc_mbit_err",
- .reset_level = HNAE3_CORE_RESET },
+ .reset_level = HNAE3_GLOBAL_RESET },
{ /* sentinel */ }
 };
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 0545f38..f0f618d 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2706,15 +2706,6 @@ static u32 hclge_check_event_cause(struct hclge_dev 
*hdev, u32 *clearval)
return HCLGE_VECTOR0_EVENT_RST;
}
 
-   if (BIT(HCLGE_VECTOR0_CORERESET_INT_B) & rst_src_reg) {
-		dev_info(&hdev->pdev->dev, "core reset interrupt\n");
-		set_bit(HCLGE_STATE_CMD_DISABLE, &hdev->state);
-		set_bit(HNAE3_CORE_RESET, &hdev->reset_pending);
-   *clearval = BIT(HCLGE_VECTOR0_CORERESET_INT_B);
-   hdev->rst_stats.core_rst_cnt++;
-   return HCLGE_VECTOR0_EVENT_RST;
-   }
-
/* check for vector0 msix event source */
if (msix_src_reg & HCLGE_VECTOR0_REG_MSIX_MASK) {
		dev_dbg(&hdev->pdev->dev, "received event 0x%x\n",
@@ -2941,10 +2932,6 @@ static int hclge_reset_wait(struct hclge_dev *hdev)
reg = HCLGE_GLOBAL_RESET_REG;
reg_bit = HCLGE_GLOBAL_RESET_BIT;
break;
-   case HNAE3_CORE_RESET:
-   reg = HCLGE_GLOBAL_RESET_REG;
-   reg_bit = HCLGE_CORE_RESET_BIT;
-   break;
case HNAE3_FUNC_RESET:
reg = HCLGE_FUN_RST_ING;
reg_bit = HCLGE_FUN_RST_ING_B;
@@ -3076,12 +3063,6 @@ 

[PATCH V2 net-next 00/10] code optimizations & bugfixes for HNS3 driver

2019-06-02 Thread Huazhong Tan
This patch-set includes code optimizations and bugfixes for the HNS3
ethernet controller driver.

[patch 1/10] removes the redundant core reset type

[patch 2/10 - 3/10] fixes two VLAN related issues

[patch 4/10] fixes a TM issue

[patch 5/10 - 10/10] includes some patches related to RAS & MSI-X error

Change log:
V1->V2: removes two patches which need to change HNS's infiniband
	driver as well; they will be upstreamed later together with
	the infiniband one.

Huazhong Tan (1):
  net: hns3: remove redundant core reset

Jian Shen (2):
  net: hns3: don't configure new VLAN ID into VF VLAN table when it's
full
  net: hns3: fix VLAN filter restore issue after reset

Weihang Li (6):
  net: hns3: add a check to pointer in error_detected and slot_reset
  net: hns3: set ops to null when unregister ad_dev
  net: hns3: add handling of two bits in MAC tunnel interrupts
  net: hns3: remove setting bit of reset_requests when handling mac
tunnel interrupts
  net: hns3: add opcode about query and clear RAS & MSI-X to special
opcode
  net: hns3: delay and separate enabling of NIC and ROCE HW errors

Yunsheng Lin (1):
  net: hns3: set the port shaper according to MAC speed

 drivers/net/ethernet/hisilicon/hns3/hnae3.c|   2 +
 drivers/net/ethernet/hisilicon/hns3/hnae3.h|   4 +-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c|  41 ++-
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.h|   1 -
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.c |   6 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c |   2 +-
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_err.c |  50 +++--
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_err.h |   9 +-
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.c| 123 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_main.h|   1 +
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_tm.c  |   2 +-
 11 files changed, 117 insertions(+), 124 deletions(-)

-- 
2.7.4



[v2, PATCH 1/4] net: stmmac: dwmac-mediatek: enable Ethernet power domain

2019-06-02 Thread Biao Huang
Add Ethernet power on/off operations to the init/exit flow.

Signed-off-by: Biao Huang 
---
 .../net/ethernet/stmicro/stmmac/dwmac-mediatek.c   |7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c
index 126b66b..b84269e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c
@@ -9,6 +9,7 @@
 #include <linux/of.h>
 #include <linux/of_device.h>
 #include <linux/of_net.h>
+#include <linux/pm_runtime.h>
 #include <linux/regmap.h>
 #include <linux/stmmac.h>
 
@@ -298,6 +299,9 @@ static int mediatek_dwmac_init(struct platform_device 
*pdev, void *priv)
return ret;
}
 
+	pm_runtime_enable(&pdev->dev);
+	pm_runtime_get_sync(&pdev->dev);
+
return 0;
 }
 
@@ -307,6 +311,9 @@ static void mediatek_dwmac_exit(struct platform_device 
*pdev, void *priv)
const struct mediatek_dwmac_variant *variant = plat->variant;
 
clk_bulk_disable_unprepare(variant->num_clks, plat->clks);
+
+	pm_runtime_put_sync(&pdev->dev);
+	pm_runtime_disable(&pdev->dev);
 }
 
 static int mediatek_dwmac_probe(struct platform_device *pdev)
-- 
1.7.9.5
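
The init/exit hooks above follow the usual runtime-PM bring-up pairing:
enable runtime PM and take a synchronous reference so the Ethernet power
domain is powered before the MAC registers are touched, then release the
reference and disable runtime PM in reverse order on teardown. A minimal
sketch of that pairing (error handling elided; this is not the full
driver):

#include <linux/platform_device.h>
#include <linux/pm_runtime.h>

static int example_init(struct platform_device *pdev)
{
	/* Power up the domain before touching MAC registers. */
	pm_runtime_enable(&pdev->dev);
	pm_runtime_get_sync(&pdev->dev);
	return 0;
}

static void example_exit(struct platform_device *pdev)
{
	/* Release in reverse order so the domain may power down. */
	pm_runtime_put_sync(&pdev->dev);
	pm_runtime_disable(&pdev->dev);
}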



[v2, PATCH 4/4] net: stmmac: dwmac4: fix flow control issue

2019-06-02 Thread Biao Huang
Current dwmac4_flow_ctrl does not clear the
GMAC_RX_FLOW_CTRL_RFE/GMAC_TX_FLOW_CTRL_TFE bits, so the MAC hw
keeps flow control enabled even when flow control off is requested
via ethtool. Add code to fix it.

Fixes: 477286b53f55 ("stmmac: add GMAC4 core support")
Signed-off-by: Biao Huang 
---
 drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c |8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
index 2544cff..9322b71 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c
@@ -488,8 +488,9 @@ static void dwmac4_flow_ctrl(struct mac_device_info *hw, 
unsigned int duplex,
if (fc & FLOW_RX) {
pr_debug("\tReceive Flow-Control ON\n");
flow |= GMAC_RX_FLOW_CTRL_RFE;
-   writel(flow, ioaddr + GMAC_RX_FLOW_CTRL);
}
+   writel(flow, ioaddr + GMAC_RX_FLOW_CTRL);
+
if (fc & FLOW_TX) {
pr_debug("\tTransmit Flow-Control ON\n");
 
@@ -497,7 +498,7 @@ static void dwmac4_flow_ctrl(struct mac_device_info *hw, 
unsigned int duplex,
pr_debug("\tduplex mode: PAUSE %d\n", pause_time);
 
for (queue = 0; queue < tx_cnt; queue++) {
-   flow |= GMAC_TX_FLOW_CTRL_TFE;
+   flow = GMAC_TX_FLOW_CTRL_TFE;
 
if (duplex)
flow |=
@@ -505,6 +506,9 @@ static void dwmac4_flow_ctrl(struct mac_device_info *hw, 
unsigned int duplex,
 
writel(flow, ioaddr + GMAC_QX_TX_FLOW_CTRL(queue));
}
+   } else {
+   for (queue = 0; queue < tx_cnt; queue++)
+   writel(0, ioaddr + GMAC_QX_TX_FLOW_CTRL(queue));
}
 }
 
-- 
1.7.9.5
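
The underlying bug is a classic one: the enable path wrote the flow
control registers, but the disable path wrote nothing, so the previous
enable stayed latched in hardware. The fix is to write the computed
value unconditionally, so that a cleared flag actually reaches the MAC.
A hedged sketch of the pattern (the register and bit names below are
placeholders, not the real dwmac4 layout):

#include <stdint.h>

#define RX_FLOW_CTRL_RFE	(1u << 0)	/* placeholder bits */
#define TX_FLOW_CTRL_TFE	(1u << 1)

/* Stand-in for writel(); a real driver writes ioremapped MMIO. */
static void reg_write(volatile uint32_t *reg, uint32_t val)
{
	*reg = val;
}

static void flow_ctrl(volatile uint32_t *rx_reg, volatile uint32_t *tx_reg,
		      int rx_on, int tx_on)
{
	/* Write even when the value is 0, otherwise "off" never lands. */
	reg_write(rx_reg, rx_on ? RX_FLOW_CTRL_RFE : 0);
	reg_write(tx_reg, tx_on ? TX_FLOW_CTRL_TFE : 0);
}

After the fix, "ethtool -A eth0 rx off tx off" followed by
"ethtool -a eth0" should report pause off for both directions.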



[v2, PATCH 2/4] net: stmmac: dwmac-mediatek: disable rx watchdog

2019-06-02 Thread Biao Huang
Disable the rx watchdog for dwmac-mediatek, so the hw will issue
an rx interrupt as soon as a packet is received, reducing the
response time of the rx path.

Signed-off-by: Biao Huang 
---
 .../net/ethernet/stmicro/stmmac/dwmac-mediatek.c   |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c
index b84269e..79f2ee3 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-mediatek.c
@@ -356,6 +356,7 @@ static int mediatek_dwmac_probe(struct platform_device 
*pdev)
plat_dat->has_gmac4 = 1;
plat_dat->has_gmac = 0;
plat_dat->pmt = 0;
+   plat_dat->riwt_off = 1;
plat_dat->maxmtu = ETH_DATA_LEN;
plat_dat->bsp_priv = priv_plat;
plat_dat->init = mediatek_dwmac_init;
-- 
1.7.9.5
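
For context: the rx interrupt watchdog (RIWT) is a coalescing timer that
lets the DMA delay the rx interrupt to batch several packets, trading
latency for fewer interrupts. Setting riwt_off tells the stmmac core not
to program that timer, so every received packet raises an interrupt
immediately. A simplified model of the choice this flag controls (not
the actual stmmac code):

/* Simplified model of what plat->riwt_off selects. */
struct plat_cfg {
	int riwt_off;		/* 1: no rx watchdog, interrupt per packet */
	unsigned int riwt_us;	/* coalescing delay when the watchdog is on */
};

static unsigned int rx_irq_delay_us(const struct plat_cfg *plat)
{
	if (plat->riwt_off)
		return 0;	/* immediate rx interrupt */
	return plat->riwt_us;	/* deferred, batched interrupts */
}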



[v2, PATCH 0/4] complete dwmac-mediatek driver and fix flow control issue

2019-06-02 Thread Biao Huang
Changes in v2:
patch#1: there is no extra action in mediatek_dwmac_remove, so remove it

v1:
This series mainly completes the dwmac-mediatek driver:
1. add power on/off operations for dwmac-mediatek.
2. disable rx watchdog to reduce rx path response time.
3. change the default value of tx-frames from 25 to 1, so
   ptp4l tests will pass by default.

It also fixes the issue that flow control cannot be disabled
again once it has been enabled.

Biao Huang (4): 
  net: stmmac: dwmac-mediatek: enable Ethernet power domain 
  net: stmmac: dwmac-mediatek: disable rx watchdog  
  net: stmmac: modify default value of tx-frames
  net: stmmac: dwmac4: fix flow control issue   

 drivers/net/ethernet/stmicro/stmmac/common.h   |2 +-   
 .../net/ethernet/stmicro/stmmac/dwmac-mediatek.c   |8  
 drivers/net/ethernet/stmicro/stmmac/dwmac4_core.c  |8 ++-- 
 3 files changed, 15 insertions(+), 3 deletions(-)  

--  
1.7.9.5



[v2, PATCH 3/4] net: stmmac: modify default value of tx-frames

2019-06-02 Thread Biao Huang
The default value of tx-frames is 25; the tx timestamp then reaches
the stack too late, and ptp4l fails:
ptp4l -i eth0 -f gPTP.cfg -m
ptp4l: selected /dev/ptp0 as PTP clock
ptp4l: port 1: INITIALIZING to LISTENING on INITIALIZE
ptp4l: port 0: INITIALIZING to LISTENING on INITIALIZE
ptp4l: port 1: link up
ptp4l: timed out while polling for tx timestamp
ptp4l: increasing tx_timestamp_timeout may correct this issue,
   but it is likely caused by a driver bug
ptp4l: port 1: send peer delay response failed
ptp4l: port 1: LISTENING to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)

ptp4l tests pass when changing tx-frames from 25 to 1 with the
ethtool -C option.
It should be fine to set the default value of tx-frames to 1, so
ptp4l will pass by default.

Signed-off-by: Biao Huang 
---
 drivers/net/ethernet/stmicro/stmmac/common.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h 
b/drivers/net/ethernet/stmicro/stmmac/common.h
index 26bbcd8..6a08cec 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -261,7 +261,7 @@ struct stmmac_safety_stats {
 #define STMMAC_COAL_TX_TIMER   1000
 #define STMMAC_MAX_COAL_TX_TICK	100000
 #define STMMAC_TX_MAX_FRAMES   256
-#define STMMAC_TX_FRAMES   25
+#define STMMAC_TX_FRAMES   1
 
 /* Packets types */
 enum packets_types {
-- 
1.7.9.5
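
Why tx-frames matters for PTP: the driver raises a tx-completion
interrupt only every tx-frames transmissions, and the tx timestamp is
delivered to the stack from that completion path. With the old default
of 25, a lone PTP packet may wait for up to 24 further transmissions
before its timestamp is returned; with 1, every frame completes
immediately. A toy model of the counter (not the stmmac code):

#include <stdbool.h>
#include <stdio.h>

/* Raise a completion interrupt (delivering tx timestamps) only
 * every tx_frames transmissions -- the coalescing this patch tunes. */
static bool tx_complete_irq(unsigned int *count, unsigned int tx_frames)
{
	return (++*count % tx_frames) == 0;
}

int main(void)
{
	unsigned int count = 0;

	for (int i = 1; i <= 3; i++)
		printf("frame %d -> irq: %d (tx-frames=25)\n", i,
		       tx_complete_irq(&count, 25));	/* delayed */

	count = 0;
	printf("frame 1 -> irq: %d (tx-frames=1)\n",
	       tx_complete_irq(&count, 1));		/* immediate */
	return 0;
}

The same setting can also be changed at runtime with
"ethtool -C eth0 tx-frames 1", as the commit message notes.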



[PATCH] arm64: dts: imx8mm: Move gic node into soc node

2019-06-02 Thread Anson . Huang
From: Anson Huang 

The GIC is inside the SoC from an architecture perspective, so it
should be located inside the soc node in the DT.

Signed-off-by: Anson Huang 
---
 arch/arm64/boot/dts/freescale/imx8mm.dtsi | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/boot/dts/freescale/imx8mm.dtsi 
b/arch/arm64/boot/dts/freescale/imx8mm.dtsi
index dc99f45..429312e 100644
--- a/arch/arm64/boot/dts/freescale/imx8mm.dtsi
+++ b/arch/arm64/boot/dts/freescale/imx8mm.dtsi
@@ -169,15 +169,6 @@
clock-output-names = "clk_ext4";
};
 
-	gic: interrupt-controller@38800000 {
-		compatible = "arm,gic-v3";
-		reg = <0x0 0x38800000 0 0x10000>, /* GIC Dist */
-		      <0x0 0x38880000 0 0xC0000>; /* GICR (RD_base + SGI_base) */
-		#interrupt-cells = <3>;
-		interrupt-controller;
-		interrupts = <GIC_PPI 9 IRQ_TYPE_LEVEL_HIGH>;
-	};
-
psci {
compatible = "arm,psci-1.0";
method = "smc";
@@ -739,6 +730,15 @@
dma-names = "rx-tx";
status = "disabled";
};
+
+		gic: interrupt-controller@38800000 {
+			compatible = "arm,gic-v3";
+			reg = <0x38800000 0x10000>, /* GIC Dist */
+			      <0x38880000 0xc0000>; /* GICR (RD_base + SGI_base) */
+			#interrupt-cells = <3>;
+			interrupt-controller;
+			interrupts = <GIC_PPI 9 IRQ_TYPE_LEVEL_HIGH>;
+		};
};
 
usbphynop1: usbphynop1 {
-- 
2.7.4



Re: [PATCH net-next 00/12] code optimizations & bugfixes for HNS3 driver

2019-06-02 Thread tanhuazhong




On 2019/6/1 8:18, David Miller wrote:
> From: David Miller 
> Date: Fri, 31 May 2019 17:15:29 -0700 (PDT)
>
>> From: Huazhong Tan 
>> Date: Fri, 31 May 2019 16:54:46 +0800
>>
>>> This patch-set includes code optimizations and bugfixes for the HNS3
>>> ethernet controller driver.
>>>
>>> [patch 1/12] removes the redundant core reset type
>>>
>>> [patch 2/12 - 3/12] fixes two VLAN related issues
>>>
>>> [patch 4/12] fixes a TM issue
>>>
>>> [patch 5/12 - 12/12] includes some patches related to RAS & MSI-X error
>>
>> Series applied.
>
> I reverted, you need to actually build test the infiniband side of your
> driver.
>
> drivers/infiniband/hw/hns/hns_roce_hw_v2.c: In function
> ‘hns_roce_v2_msix_interrupt_abn’:
> drivers/infiniband/hw/hns/hns_roce_hw_v2.c:5032:14: warning: passing
> argument 2 of ‘ops->set_default_reset_request’ makes pointer from
> integer without a cast [-Wint-conversion]
>    HNAE3_FUNC_RESET);
>    ^~~~
> drivers/infiniband/hw/hns/hns_roce_hw_v2.c:5032:14: note: expected
> ‘long unsigned int *’ but argument is of type ‘int’
>    C-c C-cmake[5]: *** Deleting file 'drivers/net/wireless/ath/carl9170/cmd.o'

Sorry, I will remove [10/12 - 11/12] for V2; these two patches need to
modify HNS's infiniband driver at the same time, so they will be
upstreamed later with the infiniband's one.




Re: rcu_read_lock lost its compiler barrier

2019-06-02 Thread Paul E. McKenney
On Sun, Jun 02, 2019 at 01:56:07PM +0800, Herbert Xu wrote:
> Digging up an old email because I was not aware of this previously
> but Paul pointed me to it during another discussion.
> 
> On Mon, Sep 21, 2015 at 01:43:27PM -0700, Paul E. McKenney wrote:
> > On Mon, Sep 21, 2015 at 09:30:49PM +0200, Frederic Weisbecker wrote:
> >
> > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > > > index d63bb77..6c3cece 100644
> > > > --- a/include/linux/rcupdate.h
> > > > +++ b/include/linux/rcupdate.h
> > > > @@ -297,12 +297,14 @@ void synchronize_rcu(void);
> > > >  
> > > >  static inline void __rcu_read_lock(void)
> > > >  {
> > > > -   preempt_disable();
> > > > +   if (IS_ENABLED(CONFIG_PREEMPT_COUNT))
> > > > +   preempt_disable();
> > > 
> > > preempt_disable() is a no-op when !CONFIG_PREEMPT_COUNT, right?
> > > Or rather it's a barrier(), which is anyway implied by rcu_read_lock().
> > > 
> > > So perhaps we can get rid of the IS_ENABLED() check?
> > 
> > Actually, barrier() is not intended to be implied by rcu_read_lock().
> > In a non-preemptible RCU implementation, it doesn't help anything
> > to have the compiler flush its temporaries upon rcu_read_lock()
> > and rcu_read_unlock().
> 
> This is seriously broken.  RCU has been around for years and is
> used throughout the kernel while the compiler barrier existed.

Please note that preemptible Tree RCU has lacked the compiler barrier on
all but the outermost rcu_read_unlock() for years before Boqun's patch.

So exactly where in the code that we are currently discussing
are you relying on compiler barriers in either rcu_read_lock() or
rcu_read_unlock()?

The grace-period guarantee allows the compiler ordering to be either in
the readers (SMP&&PREEMPT), in the grace-period mechanism (SMP&&!PREEMPT),
or both (SRCU).

> You can't then go and decide to remove the compiler barrier!  To do
> that you'd need to audit every single use of rcu_read_lock in the
> kernel to ensure that they're not depending on the compiler barrier.
> 
> This is also contrary to the definition of almost every other
> *_lock primitive in the kernel where the compiler barrier is
> included.
> 
> So please revert this patch.

I do not believe that reverting that patch will help you at all.

But who knows?  So please point me at the full code body that was being
debated earlier on this thread.  It will no doubt take me quite a while to
dig through it, given my being on the road for the next couple of weeks,
but so it goes.

Thanx, Paul
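
For readers following along: the disagreement is about whether
rcu_read_lock() implies a compiler barrier. barrier() is the classic GCC
memory clobber, and the patch quoted above makes the non-preemptible
__rcu_read_lock() expand to nothing at all. A self-contained sketch of
the before/after expansions being debated (simplified; the real
definitions live in the kernel headers):

/* The classic GCC compiler barrier: forces the compiler to discard
 * register-cached memory values across this point. */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Before: with !CONFIG_PREEMPT_COUNT, preempt_disable() degrades to
 * barrier(), so rcu_read_lock() still ordered the compiler. */
static inline void rcu_read_lock_before(void)
{
	barrier();
}

/* After the quoted patch: a pure no-op -- the compiler may keep
 * values cached in registers across the critical section. */
static inline void rcu_read_lock_after(void)
{
}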



[PATCH V2 3/3] arm64: defconfig: Select CONFIG_CLK_IMX8MN by default

2019-06-02 Thread Anson . Huang
From: Anson Huang 

Enable CONFIG_CLK_IMX8MN to support i.MX8MN clock driver.

Signed-off-by: Anson Huang 
---
No changes.
---
 arch/arm64/configs/defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 8d4f25c..aef797c 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -654,6 +654,7 @@ CONFIG_COMMON_CLK_CS2000_CP=y
 CONFIG_COMMON_CLK_S2MPS11=y
 CONFIG_CLK_QORIQ=y
 CONFIG_COMMON_CLK_PWM=y
+CONFIG_CLK_IMX8MN=y
 CONFIG_CLK_IMX8MM=y
 CONFIG_CLK_IMX8MQ=y
 CONFIG_CLK_IMX8QXP=y
-- 
2.7.4



[PATCH V2 1/3] dt-bindings: imx: Add clock binding doc for i.MX8MN

2019-06-02 Thread Anson . Huang
From: Anson Huang 

Add the clock binding doc for i.MX8MN.

Signed-off-by: Anson Huang 
---
No changes.
---
 .../devicetree/bindings/clock/imx8mn-clock.txt |  29 +++
 include/dt-bindings/clock/imx8mn-clock.h   | 215 +
 2 files changed, 244 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/clock/imx8mn-clock.txt
 create mode 100644 include/dt-bindings/clock/imx8mn-clock.h

diff --git a/Documentation/devicetree/bindings/clock/imx8mn-clock.txt 
b/Documentation/devicetree/bindings/clock/imx8mn-clock.txt
new file mode 100644
index 000..d83db5c
--- /dev/null
+++ b/Documentation/devicetree/bindings/clock/imx8mn-clock.txt
@@ -0,0 +1,29 @@
+* Clock bindings for NXP i.MX8M Nano
+
+Required properties:
+- compatible: Should be "fsl,imx8mn-ccm"
+- reg: Address and length of the register set
+- #clock-cells: Should be <1>
+- clocks: list of clock specifiers, must contain an entry for each required
+  entry in clock-names
+- clock-names: should include the following entries:
+- "osc_32k"
+- "osc_24m"
+- "clk_ext1"
+- "clk_ext2"
+- "clk_ext3"
+- "clk_ext4"
+
+clk: clock-controller@30380000 {
+	compatible = "fsl,imx8mn-ccm";
+	reg = <0x0 0x30380000 0x0 0x10000>;
+	#clock-cells = <1>;
+	clocks = <&osc_32k>, <&osc_24m>, <&clk_ext1>, <&clk_ext2>,
+		 <&clk_ext3>, <&clk_ext4>;
+	clock-names = "osc_32k", "osc_24m", "clk_ext1", "clk_ext2",
+		      "clk_ext3", "clk_ext4";
+};
+
+The clock consumer should specify the desired clock by having the clock
+ID in its "clocks" phandle cell. See include/dt-bindings/clock/imx8mn-clock.h
+for the full list of i.MX8M Nano clock IDs.
diff --git a/include/dt-bindings/clock/imx8mn-clock.h 
b/include/dt-bindings/clock/imx8mn-clock.h
new file mode 100644
index 000..5255b1c
--- /dev/null
+++ b/include/dt-bindings/clock/imx8mn-clock.h
@@ -0,0 +1,215 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright 2018-2019 NXP
+ */
+
+#ifndef __DT_BINDINGS_CLOCK_IMX8MN_H
+#define __DT_BINDINGS_CLOCK_IMX8MN_H
+
+#define IMX8MN_CLK_DUMMY   0
+#define IMX8MN_CLK_32K 1
+#define IMX8MN_CLK_24M 2
+#define IMX8MN_OSC_HDMI_CLK3
+#define IMX8MN_CLK_EXT14
+#define IMX8MN_CLK_EXT25
+#define IMX8MN_CLK_EXT36
+#define IMX8MN_CLK_EXT47
+#define IMX8MN_AUDIO_PLL1_REF_SEL  8
+#define IMX8MN_AUDIO_PLL2_REF_SEL  9
+#define IMX8MN_VIDEO_PLL1_REF_SEL  10
+#define IMX8MN_DRAM_PLL_REF_SEL11
+#define IMX8MN_GPU_PLL_REF_SEL 12
+#define IMX8MN_VPU_PLL_REF_SEL 13
+#define IMX8MN_ARM_PLL_REF_SEL 14
+#define IMX8MN_SYS_PLL1_REF_SEL15
+#define IMX8MN_SYS_PLL2_REF_SEL16
+#define IMX8MN_SYS_PLL3_REF_SEL17
+#define IMX8MN_AUDIO_PLL1  18
+#define IMX8MN_AUDIO_PLL2  19
+#define IMX8MN_VIDEO_PLL1  20
+#define IMX8MN_DRAM_PLL21
+#define IMX8MN_GPU_PLL 22
+#define IMX8MN_VPU_PLL 23
+#define IMX8MN_ARM_PLL 24
+#define IMX8MN_SYS_PLL125
+#define IMX8MN_SYS_PLL226
+#define IMX8MN_SYS_PLL327
+#define IMX8MN_AUDIO_PLL1_BYPASS   28
+#define IMX8MN_AUDIO_PLL2_BYPASS   29
+#define IMX8MN_VIDEO_PLL1_BYPASS   30
+#define IMX8MN_DRAM_PLL_BYPASS 31
+#define IMX8MN_GPU_PLL_BYPASS  32
+#define IMX8MN_VPU_PLL_BYPASS  33
+#define IMX8MN_ARM_PLL_BYPASS  34
+#define IMX8MN_SYS_PLL1_BYPASS 35
+#define IMX8MN_SYS_PLL2_BYPASS 36
+#define IMX8MN_SYS_PLL3_BYPASS 37
+#define IMX8MN_AUDIO_PLL1_OUT  38
+#define IMX8MN_AUDIO_PLL2_OUT  39
+#define IMX8MN_VIDEO_PLL1_OUT  40
+#define IMX8MN_DRAM_PLL_OUT41
+#define IMX8MN_GPU_PLL_OUT 42
+#define IMX8MN_VPU_PLL_OUT 43
+#define IMX8MN_ARM_PLL_OUT 44
+#define IMX8MN_SYS_PLL1_OUT45
+#define IMX8MN_SYS_PLL2_OUT46
+#define IMX8MN_SYS_PLL3_OUT47
+#define IMX8MN_SYS_PLL1_40M48
+#define IMX8MN_SYS_PLL1_80M49
+#define IMX8MN_SYS_PLL1_100M   50
+#define IMX8MN_SYS_PLL1_133M   51
+#define IMX8MN_SYS_PLL1_160M   52
+#define IMX8MN_SYS_PLL1_200M   53
+#define IMX8MN_SYS_PLL1_266M   54
+#define IMX8MN_SYS_PLL1_400M   

[PATCH V2 2/3] clk: imx: Add support for i.MX8MN clock driver

2019-06-02 Thread Anson . Huang
From: Anson Huang 

This patch adds i.MX8MN clock driver support.

Signed-off-by: Anson Huang 
---
Changes since V1:
- add GPIOx clocks.
---
 drivers/clk/imx/Kconfig  |   6 +
 drivers/clk/imx/Makefile |   1 +
 drivers/clk/imx/clk-imx8mn.c | 614 +++
 3 files changed, 621 insertions(+)
 create mode 100644 drivers/clk/imx/clk-imx8mn.c

diff --git a/drivers/clk/imx/Kconfig b/drivers/clk/imx/Kconfig
index 0eaf418..1ac0c79 100644
--- a/drivers/clk/imx/Kconfig
+++ b/drivers/clk/imx/Kconfig
@@ -14,6 +14,12 @@ config CLK_IMX8MM
help
Build the driver for i.MX8MM CCM Clock Driver
 
+config CLK_IMX8MN
+   bool "IMX8MN CCM Clock Driver"
+   depends on ARCH_MXC && ARM64
+   help
+   Build the driver for i.MX8MN CCM Clock Driver
+
 config CLK_IMX8MQ
bool "IMX8MQ CCM Clock Driver"
depends on ARCH_MXC && ARM64
diff --git a/drivers/clk/imx/Makefile b/drivers/clk/imx/Makefile
index 05641c6..70a55cd 100644
--- a/drivers/clk/imx/Makefile
+++ b/drivers/clk/imx/Makefile
@@ -25,6 +25,7 @@ obj-$(CONFIG_MXC_CLK_SCU) += \
clk-scu.o \
clk-lpcg-scu.o
 
+obj-$(CONFIG_CLK_IMX8MN) += clk-imx8mn.o
 obj-$(CONFIG_CLK_IMX8MM) += clk-imx8mm.o
 obj-$(CONFIG_CLK_IMX8MQ) += clk-imx8mq.o
 obj-$(CONFIG_CLK_IMX8QXP) += clk-imx8qxp.o clk-imx8qxp-lpcg.o
diff --git a/drivers/clk/imx/clk-imx8mn.c b/drivers/clk/imx/clk-imx8mn.c
new file mode 100644
index 000..7a92c75a
--- /dev/null
+++ b/drivers/clk/imx/clk-imx8mn.c
@@ -0,0 +1,614 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2018-2019 NXP.
+ */
+
+#include <dt-bindings/clock/imx8mn-clock.h>
+#include <linux/clk.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/of.h>
+#include <linux/of_address.h>
+#include <linux/platform_device.h>
+#include <linux/types.h>
+
+#include "clk.h"
+
+static u32 share_count_sai2;
+static u32 share_count_sai3;
+static u32 share_count_sai5;
+static u32 share_count_sai6;
+static u32 share_count_sai7;
+static u32 share_count_disp;
+static u32 share_count_pdm;
+static u32 share_count_nand;
+
+enum {
+   ARM_PLL,
+   GPU_PLL,
+   VPU_PLL,
+   SYS_PLL1,
+   SYS_PLL2,
+   SYS_PLL3,
+   DRAM_PLL,
+   AUDIO_PLL1,
+   AUDIO_PLL2,
+   VIDEO_PLL2,
+   NR_PLLS,
+};
+
+#define PLL_1416X_RATE(_rate, _m, _p, _s)  \
+   {   \
+   .rate   =   (_rate),\
+   .mdiv   =   (_m),   \
+   .pdiv   =   (_p),   \
+   .sdiv   =   (_s),   \
+   }
+
+#define PLL_1443X_RATE(_rate, _m, _p, _s, _k)  \
+   {   \
+   .rate   =   (_rate),\
+   .mdiv   =   (_m),   \
+   .pdiv   =   (_p),   \
+   .sdiv   =   (_s),   \
+   .kdiv   =   (_k),   \
+   }
+
+static const struct imx_pll14xx_rate_table imx8mn_pll1416x_tbl[] = {
+	PLL_1416X_RATE(1800000000U, 225, 3, 0),
+	PLL_1416X_RATE(1600000000U, 200, 3, 0),
+	PLL_1416X_RATE(1200000000U, 300, 3, 1),
+	PLL_1416X_RATE(1000000000U, 250, 3, 1),
+	PLL_1416X_RATE(800000000U,  200, 3, 1),
+	PLL_1416X_RATE(750000000U,  250, 2, 2),
+	PLL_1416X_RATE(700000000U,  350, 3, 2),
+	PLL_1416X_RATE(600000000U,  300, 3, 2),
+};
+
+static const struct imx_pll14xx_rate_table imx8mn_audiopll_tbl[] = {
+	PLL_1443X_RATE(786432000U, 655, 5, 2, 23593),
+	PLL_1443X_RATE(722534400U, 301, 5, 1, 3670),
+};
+
+static const struct imx_pll14xx_rate_table imx8mn_videopll_tbl[] = {
+	PLL_1443X_RATE(650000000U, 325, 3, 2, 0),
+	PLL_1443X_RATE(594000000U, 198, 2, 2, 0),
+};
+
+static const struct imx_pll14xx_rate_table imx8mn_drampll_tbl[] = {
+	PLL_1443X_RATE(650000000U, 325, 3, 2, 0),
+};
+
+static struct imx_pll14xx_clk imx8mn_audio_pll __initdata = {
+   .type = PLL_1443X,
+   .rate_table = imx8mn_audiopll_tbl,
+};
+
+static struct imx_pll14xx_clk imx8mn_video_pll __initdata = {
+   .type = PLL_1443X,
+   .rate_table = imx8mn_videopll_tbl,
+};
+
+static struct imx_pll14xx_clk imx8mn_dram_pll __initdata = {
+   .type = PLL_1443X,
+   .rate_table = imx8mn_drampll_tbl,
+};
+
+static struct imx_pll14xx_clk imx8mn_arm_pll __initdata = {
+   .type = PLL_1416X,
+   .rate_table = imx8mn_pll1416x_tbl,
+};
+
+static struct imx_pll14xx_clk imx8mn_gpu_pll __initdata = {
+   .type = PLL_1416X,
+   .rate_table = imx8mn_pll1416x_tbl,
+};
+
+static struct imx_pll14xx_clk imx8mn_vpu_pll __initdata = {
+   .type = PLL_1416X,
+   .rate_table = imx8mn_pll1416x_tbl,
+};
+
+static struct imx_pll14xx_clk imx8mn_sys_pll __initdata = {
+   .type = PLL_1416X,
+   .rate_table = imx8mn_pll1416x_tbl,

Re: [PATCH v4] ARM: dts: aspeed: Add YADRO VESNIN BMC

2019-06-02 Thread Andrew Jeffery



On Fri, 31 May 2019, at 18:40, Alexander Filippov wrote:
> VESNIN is an OpenPower machine with an Aspeed 2400 BMC SoC manufactured
> by YADRO.
> 
> Signed-off-by: Alexander Filippov 

Reviewed-by: Andrew Jeffery 

> ---
>  arch/arm/boot/dts/Makefile  |   1 +
>  arch/arm/boot/dts/aspeed-bmc-opp-vesnin.dts | 224 
>  2 files changed, 225 insertions(+)
>  create mode 100644 arch/arm/boot/dts/aspeed-bmc-opp-vesnin.dts
> 
> diff --git a/arch/arm/boot/dts/Makefile b/arch/arm/boot/dts/Makefile
> index dab2914fa293..64a956372fe1 100644
> --- a/arch/arm/boot/dts/Makefile
> +++ b/arch/arm/boot/dts/Makefile
> @@ -1272,6 +1272,7 @@ dtb-$(CONFIG_ARCH_ASPEED) += \
>   aspeed-bmc-opp-lanyang.dtb \
>   aspeed-bmc-opp-palmetto.dtb \
>   aspeed-bmc-opp-romulus.dtb \
> + aspeed-bmc-opp-vesnin.dtb \
>   aspeed-bmc-opp-witherspoon.dtb \
>   aspeed-bmc-opp-zaius.dtb \
>   aspeed-bmc-portwell-neptune.dtb \
> diff --git a/arch/arm/boot/dts/aspeed-bmc-opp-vesnin.dts 
> b/arch/arm/boot/dts/aspeed-bmc-opp-vesnin.dts
> new file mode 100644
> index ..0b9e29c3212e
> --- /dev/null
> +++ b/arch/arm/boot/dts/aspeed-bmc-opp-vesnin.dts
> @@ -0,0 +1,224 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +// Copyright 2019 YADRO
> +/dts-v1/;
> +
> +#include "aspeed-g4.dtsi"
> +#include 
> +
> +/ {
> + model = "Vesnin BMC";
> + compatible = "yadro,vesnin-bmc", "aspeed,ast2400";
> +
> + chosen {
> +		stdout-path = &uart5;
> + bootargs = "console=ttyS4,115200 earlyprintk";
> + };
> +
> + memory {
> +		reg = <0x40000000 0x20000000>;
> + };
> +
> + reserved-memory {
> + #address-cells = <1>;
> + #size-cells = <1>;
> + ranges;
> +
> +		vga_memory: framebuffer@5f000000 {
> +			no-map;
> +			reg = <0x5f000000 0x01000000>; /* 16MB */
> +		};
> +		flash_memory: region@5c000000 {
> +			no-map;
> +			reg = <0x5c000000 0x02000000>; /* 32M */
> +		};
> + };
> +
> + leds {
> + compatible = "gpio-leds";
> +
> + heartbeat {
> +			gpios = <&gpio ASPEED_GPIO(R, 4) GPIO_ACTIVE_LOW>;
> + };
> + power_red {
> +			gpios = <&gpio ASPEED_GPIO(N, 1) GPIO_ACTIVE_LOW>;
> + };
> +
> + id_blue {
> +			gpios = <&gpio ASPEED_GPIO(O, 0) GPIO_ACTIVE_LOW>;
> + };
> +
> + alarm_red {
> +			gpios = <&gpio ASPEED_GPIO(N, 6) GPIO_ACTIVE_LOW>;
> + };
> +
> + alarm_yel {
> +			gpios = <&gpio ASPEED_GPIO(N, 7) GPIO_ACTIVE_HIGH>;
> + };
> + };
> +
> + gpio-keys {
> + compatible = "gpio-keys";
> +
> + button_checkstop {
> + label = "checkstop";
> + linux,code = <74>;
> +			gpios = <&gpio ASPEED_GPIO(P, 5) GPIO_ACTIVE_LOW>;
> + };
> +
> + button_identify {
> + label = "identify";
> + linux,code = <152>;
> +			gpios = <&gpio ASPEED_GPIO(O, 7) GPIO_ACTIVE_LOW>;
> + };
> + };
> +};
> +
> + {
> + status = "okay";
> + flash@0 {
> + status = "okay";
> + m25p,fast-read;
> +label = "bmc";
> +#include "openbmc-flash-layout.dtsi"
> + };
> +};
> +
> + {
> + status = "okay";
> + pinctrl-names = "default";
> + pinctrl-0 = <_spi1debug_default>;
> +
> + flash@0 {
> + status = "okay";
> + label = "pnor";
> + m25p,fast-read;
> + };
> +};
> +
> + {
> + status = "okay";
> +
> + use-ncsi;
> + no-hw-checksum;
> +
> + pinctrl-names = "default";
> + pinctrl-0 = <_rmii1_default>;
> +};
> +
> +
> + {
> + status = "okay";
> +};
> +
> +_ctrl {
> + status = "okay";
> + memory-region = <_memory>;
> + flash = <>;
> +};
> +
> + {
> + status = "okay";
> +};
> +
> + {
> + status = "okay";
> + pinctrl-names = "default";
> + pinctrl-0 = <_txd2_default _rxd2_default>;
> +};
> +
> + {
> + status = "okay";
> +
> + eeprom@50 {
> + compatible = "atmel,24c256";
> + reg = <0x50>;
> + pagesize = <64>;
> + };
> +};
> +
> + {
> + status = "okay";
> +
> + tmp75@49 {
> + compatible = "ti,tmp75";
> + reg = <0x49>;
> + };
> +};
> +
> + {
> + status = "okay";
> +};
> +
> + {
> + status = "okay";
> +};
> +
> + {
> + status = "okay";
> +
> + occ-hwmon@50 {
> + compatible = "ibm,p8-occ-hwmon";
> + reg = <0x50>;
> + };
> +};
> +
> + {
> + status = "okay";
> +
> + occ-hwmon@51 {
> + compatible = "ibm,p8-occ-hwmon";
> + reg = <0x51>;
> + };
> +};
> +
> + {
> + status = "okay";

[PATCH V2 2/3] arm64: dts: freescale: Add i.MX8MN dtsi support

2019-06-02 Thread Anson . Huang
From: Anson Huang 

The i.MX8M Nano Media Applications Processor is a new SoC of the i.MX8M
family, it is a 14nm FinFET product of the growing mScale family targeting
the consumer market. It is built in Samsung 14LPP to achieve both high
performance and low power consumption and relies on a powerful fully
coherent core complex based on a quad core ARM Cortex-A53 cluster,
Cortex-M7 low-power coprocessor and graphics accelerator.

This patch adds the basic dtsi support for i.MX8MN.

Signed-off-by: Anson Huang 
---
Changes since V1:
- fix build warnings of soc/aips bus unit name and reg properties;
- move gic into soc node;
- move usbphynop1/usbphynop2 node outside the soc node.
---
 arch/arm64/boot/dts/freescale/imx8mn.dtsi | 710 ++
 1 file changed, 710 insertions(+)
 create mode 100644 arch/arm64/boot/dts/freescale/imx8mn.dtsi

diff --git a/arch/arm64/boot/dts/freescale/imx8mn.dtsi 
b/arch/arm64/boot/dts/freescale/imx8mn.dtsi
new file mode 100644
index 000..1fb9148
--- /dev/null
+++ b/arch/arm64/boot/dts/freescale/imx8mn.dtsi
@@ -0,0 +1,710 @@
+// SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+/*
+ * Copyright 2019 NXP
+ */
+
+#include <dt-bindings/clock/imx8mn-clock.h>
+#include <dt-bindings/gpio/gpio.h>
+#include <dt-bindings/input/input.h>
+#include <dt-bindings/interrupt-controller/arm-gic.h>
+
+#include "imx8mn-pinfunc.h"
+
+/ {
+   compatible = "fsl,imx8mn";
+   interrupt-parent = <>;
+   #address-cells = <2>;
+   #size-cells = <2>;
+
+   aliases {
+		ethernet0 = &fec1;
+		gpio0 = &gpio1;
+		gpio1 = &gpio2;
+		gpio2 = &gpio3;
+		gpio3 = &gpio4;
+		gpio4 = &gpio5;
+		i2c0 = &i2c1;
+		i2c1 = &i2c2;
+		i2c2 = &i2c3;
+		i2c3 = &i2c4;
+		mmc0 = &usdhc1;
+		mmc1 = &usdhc2;
+		mmc2 = &usdhc3;
+		serial0 = &uart1;
+		serial1 = &uart2;
+		serial2 = &uart3;
+		serial3 = &uart4;
+		spi0 = &ecspi1;
+		spi1 = &ecspi2;
+		spi2 = &ecspi3;
+   };
+
+   cpus {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   A53_0: cpu@0 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a53";
+   reg = <0x0>;
+   clock-latency = <61036>;
+			clocks = <&clk IMX8MN_CLK_ARM>;
+   enable-method = "psci";
+			next-level-cache = <&A53_L2>;
+   };
+
+   A53_1: cpu@1 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a53";
+   reg = <0x1>;
+   clock-latency = <61036>;
+			clocks = <&clk IMX8MN_CLK_ARM>;
+   enable-method = "psci";
+			next-level-cache = <&A53_L2>;
+   };
+
+   A53_2: cpu@2 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a53";
+   reg = <0x2>;
+   clock-latency = <61036>;
+			clocks = <&clk IMX8MN_CLK_ARM>;
+   enable-method = "psci";
+			next-level-cache = <&A53_L2>;
+   };
+
+   A53_3: cpu@3 {
+   device_type = "cpu";
+   compatible = "arm,cortex-a53";
+   reg = <0x3>;
+   clock-latency = <61036>;
+			clocks = <&clk IMX8MN_CLK_ARM>;
+   enable-method = "psci";
+			next-level-cache = <&A53_L2>;
+   };
+
+   A53_L2: l2-cache0 {
+   compatible = "cache";
+   };
+   };
+
+	memory@40000000 {
+   device_type = "memory";
+		reg = <0x0 0x40000000 0 0x80000000>;
+   };
+
+   osc_32k: clock-osc-32k {
+   compatible = "fixed-clock";
+   #clock-cells = <0>;
+   clock-frequency = <32768>;
+   clock-output-names = "osc_32k";
+   };
+
+   osc_24m: clock-osc-24m {
+   compatible = "fixed-clock";
+   #clock-cells = <0>;
+		clock-frequency = <24000000>;
+   clock-output-names = "osc_24m";
+   };
+
+   clk_ext1: clock-ext1 {
+   compatible = "fixed-clock";
+   #clock-cells = <0>;
+		clock-frequency = <133000000>;
+   clock-output-names = "clk_ext1";
+   };
+
+   clk_ext2: clock-ext2 {
+   compatible = "fixed-clock";
+   #clock-cells = <0>;
+		clock-frequency = <133000000>;
+   clock-output-names = "clk_ext2";
+   };
+
+   clk_ext3: clock-ext3 {
+   compatible = "fixed-clock";
+   #clock-cells = <0>;
+		clock-frequency = <133000000>;
+   clock-output-names = "clk_ext3";
+   };
+
+   clk_ext4: clock-ext4 {
+   compatible = "fixed-clock";
+  

[PATCH V2 3/3] arm64: dts: freescale: Add i.MX8MN DDR4 EVK board support

2019-06-02 Thread Anson . Huang
From: Anson Huang 

This patch adds basic i.MX8MN DDR4 EVK board support.

Signed-off-by: Anson Huang 
---
No changes.
---
 arch/arm64/boot/dts/freescale/Makefile|   1 +
 arch/arm64/boot/dts/freescale/imx8mn-ddr4-evk.dts | 217 ++
 2 files changed, 218 insertions(+)
 create mode 100644 arch/arm64/boot/dts/freescale/imx8mn-ddr4-evk.dts

diff --git a/arch/arm64/boot/dts/freescale/Makefile 
b/arch/arm64/boot/dts/freescale/Makefile
index 0bd122f..2cdd4cc 100644
--- a/arch/arm64/boot/dts/freescale/Makefile
+++ b/arch/arm64/boot/dts/freescale/Makefile
@@ -20,6 +20,7 @@ dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-ls2088a-rdb.dtb
 dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-lx2160a-qds.dtb
 dtb-$(CONFIG_ARCH_LAYERSCAPE) += fsl-lx2160a-rdb.dtb
 
+dtb-$(CONFIG_ARCH_MXC) += imx8mn-ddr4-evk.dtb
 dtb-$(CONFIG_ARCH_MXC) += imx8mm-evk.dtb
 dtb-$(CONFIG_ARCH_MXC) += imx8mq-evk.dtb
 dtb-$(CONFIG_ARCH_MXC) += imx8mq-zii-ultra-rmb3.dtb
diff --git a/arch/arm64/boot/dts/freescale/imx8mn-ddr4-evk.dts 
b/arch/arm64/boot/dts/freescale/imx8mn-ddr4-evk.dts
new file mode 100644
index 000..da552c2
--- /dev/null
+++ b/arch/arm64/boot/dts/freescale/imx8mn-ddr4-evk.dts
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+/*
+ * Copyright 2019 NXP
+ */
+
+/dts-v1/;
+
+#include "imx8mn.dtsi"
+
+/ {
+   model = "NXP i.MX8MNano DDR4 EVK board";
+   compatible = "fsl,imx8mn-ddr4-evk", "fsl,imx8mn";
+
+   chosen {
+		stdout-path = &uart2;
+   };
+
+   reg_usdhc2_vmmc: regulator-usdhc2 {
+   compatible = "regulator-fixed";
+   pinctrl-names = "default";
+   pinctrl-0 = <_reg_usdhc2_vmmc>;
+   regulator-name = "VSD_3V3";
+		regulator-min-microvolt = <3300000>;
+		regulator-max-microvolt = <3300000>;
+		gpio = <&gpio2 19 GPIO_ACTIVE_HIGH>;
+   enable-active-high;
+   };
+};
+
+ {
+   pinctrl-names = "default";
+
+   pinctrl_fec1: fec1grp {
+   fsl,pins = <
+   MX8MN_IOMUXC_ENET_MDC_ENET1_MDC 0x3
+   MX8MN_IOMUXC_ENET_MDIO_ENET1_MDIO   0x3
+   MX8MN_IOMUXC_ENET_TD3_ENET1_RGMII_TD3   0x1f
+   MX8MN_IOMUXC_ENET_TD2_ENET1_RGMII_TD2   0x1f
+   MX8MN_IOMUXC_ENET_TD1_ENET1_RGMII_TD1   0x1f
+   MX8MN_IOMUXC_ENET_TD0_ENET1_RGMII_TD0   0x1f
+   MX8MN_IOMUXC_ENET_RD3_ENET1_RGMII_RD3   0x91
+   MX8MN_IOMUXC_ENET_RD2_ENET1_RGMII_RD2   0x91
+   MX8MN_IOMUXC_ENET_RD1_ENET1_RGMII_RD1   0x91
+   MX8MN_IOMUXC_ENET_RD0_ENET1_RGMII_RD0   0x91
+   MX8MN_IOMUXC_ENET_TXC_ENET1_RGMII_TXC   0x1f
+   MX8MN_IOMUXC_ENET_RXC_ENET1_RGMII_RXC   0x91
+   MX8MN_IOMUXC_ENET_RX_CTL_ENET1_RGMII_RX_CTL 0x91
+   MX8MN_IOMUXC_ENET_TX_CTL_ENET1_RGMII_TX_CTL 0x1f
+   MX8MN_IOMUXC_SAI2_RXC_GPIO4_IO220x19
+   >;
+   };
+
+   pinctrl_reg_usdhc2_vmmc: regusdhc2vmmc {
+   fsl,pins = <
+   MX8MN_IOMUXC_SD2_RESET_B_GPIO2_IO19 0x41
+   >;
+   };
+
+   pinctrl_uart2: uart2grp {
+   fsl,pins = <
+   MX8MN_IOMUXC_UART2_RXD_UART2_DCE_RX 0x140
+   MX8MN_IOMUXC_UART2_TXD_UART2_DCE_TX 0x140
+   >;
+   };
+
+   pinctrl_usdhc2_gpio: usdhc2grpgpio {
+   fsl,pins = <
+   MX8MN_IOMUXC_GPIO1_IO15_GPIO1_IO15  0x1c4
+   >;
+   };
+
+   pinctrl_usdhc2: usdhc2grp {
+   fsl,pins = <
+   MX8MN_IOMUXC_SD2_CLK_USDHC2_CLK 0x190
+   MX8MN_IOMUXC_SD2_CMD_USDHC2_CMD 0x1d0
+   MX8MN_IOMUXC_SD2_DATA0_USDHC2_DATA0 0x1d0
+   MX8MN_IOMUXC_SD2_DATA1_USDHC2_DATA1 0x1d0
+   MX8MN_IOMUXC_SD2_DATA2_USDHC2_DATA2 0x1d0
+   MX8MN_IOMUXC_SD2_DATA3_USDHC2_DATA3 0x1d0
+   MX8MN_IOMUXC_GPIO1_IO04_USDHC2_VSELECT  0x1d0
+   >;
+   };
+
+   pinctrl_usdhc2_100mhz: usdhc2grp100mhz {
+   fsl,pins = <
+   MX8MN_IOMUXC_SD2_CLK_USDHC2_CLK 0x194
+   MX8MN_IOMUXC_SD2_CMD_USDHC2_CMD 0x1d4
+   MX8MN_IOMUXC_SD2_DATA0_USDHC2_DATA0 0x1d4
+   MX8MN_IOMUXC_SD2_DATA1_USDHC2_DATA1 0x1d4
+   MX8MN_IOMUXC_SD2_DATA2_USDHC2_DATA2 0x1d4
+   MX8MN_IOMUXC_SD2_DATA3_USDHC2_DATA3 0x1d4
+   MX8MN_IOMUXC_GPIO1_IO04_USDHC2_VSELECT  0x1d0
+   >;
+   };
+
+   pinctrl_usdhc2_200mhz: usdhc2grp200mhz {
+   fsl,pins = <
+   

[PATCH V2 1/3] dt-bindings: arm: imx: Add the soc binding for i.MX8MN

2019-06-02 Thread Anson . Huang
From: Anson Huang 

This patch adds the soc & board binding for i.MX8MN.

Signed-off-by: Anson Huang 
---
No changes.
---
 Documentation/devicetree/bindings/arm/fsl.yaml | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/devicetree/bindings/arm/fsl.yaml 
b/Documentation/devicetree/bindings/arm/fsl.yaml
index 407138e..b1a5231 100644
--- a/Documentation/devicetree/bindings/arm/fsl.yaml
+++ b/Documentation/devicetree/bindings/arm/fsl.yaml
@@ -171,6 +171,12 @@ properties:
   - const: compulab,cl-som-imx7
   - const: fsl,imx7d
 
+  - description: i.MX8MN based Boards
+    items:
+      - enum:
+          - fsl,imx8mn-ddr4-evk       # i.MX8MN DDR4 EVK Board
+      - const: fsl,imx8mn
+
   - description: i.MX8MM based Boards
 items:
   - enum:
-- 
2.7.4


