Re: [PATCH 2/2] mm: add PageWaiters indicating tasks are waiting for a page bit

2016-12-27 Thread Nicholas Piggin
On Tue, 27 Dec 2016 10:58:59 -0800
Linus Torvalds  wrote:

> On Tue, Dec 27, 2016 at 3:19 AM, Nicholas Piggin  wrote:
> >
> > Attached is part of a patch I've been mulling over for a while. I
> > expect you to hate it, and it does not solve this problem for x86,
> > but I like being able to propagate values from atomic ops back
> > to the compiler. Of course, volatile then can't be used either which
> > is another spanner...  
> 
> Yeah, that patch is disgusting, and doesn't even help x86.

No, although it would help some cases (but granted the bitops tend to
be problematic in this regard). To be clear I'm not asking to merge it,
just wondered your opinion. (We need something more for unlock_page
anyway because the memory barrier in the way).

> It also
> depends on the compiler doing the right thing in ways that are not
> obviously true.

Can you elaborate on this? GCC will do the optimization (modulo a
regression https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77647)

> I'd much rather just add the "xyz_return()" primitives for and/or, the
> way we already have atomic_add_return() and friends.
> 
> In fact, we could probably play games with bit numbering, and actually
> use the atomic ops we already have. For example, if the lock bit was
> the top bit, we could unlock by doing "atomic_add_return()" with that
> bit, and look at the remaining bits that way.
> 
> That would actually work really well on x86, since there we have
> "xadd", but we do *not* have "set/clear bit and return old word".
> 
> We could make a special case for just the page lock bit, make it bit #7, and 
> use
> 
>movb $128,%al
>lock xaddb %al,flags
> 
> and then test the bits in %al.
> 
> And all the RISC architectures would be ok with that too, because they
> can just use ll/sc and test the bits with that. So for them, adding a
> "atomic_long_and_return()" would be very natural in the general case.
> 
> Hmm?
> 
> The other alternative is to keep the lock bit as bit #0, and just make
> the contention bit be the high bit. Then, on x86, you can do
> 
> lock andb $0xfe,flags
> js contention
> 
> which might be even better. Again, it would be a very special
> operation just for unlock. Something like
> 
>bit_clear_and_branch_if_negative_byte(mem, label);
> 
> and again, it would be trivial to do on most architectures.
> 
> Let me try to write a patch or two for testing.

Patch seems okay, but it's kind of a horrible primitive. What if you
did clear_bit_unlock_and_test_bit, which does a __builtin_constant_p
test on the bit numbers and if they are < 7 and == 7, then do the
fastpath?

Nitpick, can the enum do "= 7" to catch careless bugs? Or BUILD_BUG_ON.

And I'd to do the same for PG_writeback. AFAIKS whatever approach is
used for PG_locked should work just the same, so no problem there.

Thanks,
Nick


Re: [PATCH 2/2] mm: add PageWaiters indicating tasks are waiting for a page bit

2016-12-27 Thread Nicholas Piggin
On Mon, 26 Dec 2016 11:07:52 -0800
Linus Torvalds  wrote:

> On Sun, Dec 25, 2016 at 5:16 PM, Nicholas Piggin  wrote:
> >
> > I did actually play around with that. I could not get my skylake
> > to forward the result from a lock op to a subsequent load (the
> > latency was the same whether you use lock ; andb or lock ; andl
> > (32 cycles for my test loop) whereas with non-atomic versions I
> > was getting about 15 cycles for andb vs 2 for andl.  
> 
> Yes, interesting. It does look like the locked ops don't end up having
> the partial write issue and the size of the op doesn't matter.
> 
> But it's definitely the case that the write buffer hit immediately
> after the atomic read-modify-write ends up slowing things down, so the
> profile oddity isn't just a profile artifact. I wrote a stupid test
> program that did an atomic increment, and then read either the same
> value, or an adjacent value in memory (so same instruvtion sequence,
> the difference just being what memory location the read accessed).
> 
> Reading the same value after the atomic update was *much* more
> expensive than reading the adjacent value, so it causes some kind of
> pipeline hickup (by about 50% of the cost of the atomic op itself:
> iow, the "atomic-op followed by read same location" was over 1.5x
> slower than "atomic op followed by read of another location").
> 
> So the atomic ops don't serialize things entirely, but they *hate*
> having the value read (regardless of size) right after being updated,
> because it causes some kind of nasty pipeline issue.

Sure, I would expect independent operations to be able to run ahead
of the atomic op, and this might point to speculation of consistency
for loads -- an independent younger load can be executed speculatively
before the atomic op and flushed if the cacheline was lost before the
load is completed in order.

I bet forwarding from the store queue in case of a locked op is more
difficult. I guess it could be done in the same way, but the load hits
the store queue ahead of the cache then it's more work to then have
the load go to the cache so it can find the line to speculate on while
the flush is in progress. Common case of load hit non-atomic store
would not require this case so it may just not be worthwhile.

Anyway that's speculation (ha). What matters is we know the load is
nasty.

> 
> A cmpxchg does seem to avoid the issue.

Yes, I wonder what to do. POWER CPUs have very similar issues and we
have noticed unlock_page and several other cases where atomic ops cause
load stalls. With its ll/sc, POWER would prefer not to do a cmpxchg.

Attached is part of a patch I've been mulling over for a while. I
expect you to hate it, and it does not solve this problem for x86,
but I like being able to propagate values from atomic ops back
to the compiler. Of course, volatile then can't be used either which
is another spanner...

Short term option is to just have a specific primitive for
clear-unlock-and-test, which we kind of need anyway here to avoid the
memory barrier in an arch-independent way.

Thanks,
Nick

---

After removing the smp_mb__after_atomic and volatile from test_bit,
applying this directive to atomic primitives results in test_bit able
to recognise if the value is in a register. unlock_page improves:

 lwsync
 ldarx   r10,0,r3
 andcr10,r10,r9
 stdcx.  r10,0,r3
 bne-99c 
-ld  r9,0(r3)
-andi.   r10,r9,2
+andi.   r10,r10,2
 beqlr
 b   97c 
---
 arch/powerpc/include/asm/bitops.h   |  2 ++
 arch/powerpc/include/asm/local.h| 12 
 include/asm-generic/bitops/non-atomic.h |  2 +-
 include/linux/compiler.h| 19 +++
 mm/filemap.c|  2 +-
 5 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/bitops.h 
b/arch/powerpc/include/asm/bitops.h
index 59abc620f8e8..0c3e0c384b7d 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -70,6 +70,7 @@ static __inline__ void fn(unsigned long mask, \
: "=&r" (old), "+m" (*p)\
: "r" (mask), "r" (p)   \
: "cc", "memory");  \
+   compiler_assign_ptr_val(p, old);\
 }
 
 DEFINE_BITOP(set_bits, or, "")
@@ -117,6 +118,7 @@ static __inline__ unsigned long fn( \
: "=&r" (old), "=&r" (t)\
: "r" (mask), "r" (p)   \
: "cc", "memory");  \
+   compiler_assign_ptr_val(p, old);\
return (old & mask);   

Re: [PATCH 2/2] mm: add PageWaiters indicating tasks are waiting for a page bit

2016-12-25 Thread Nicholas Piggin
On Sun, 25 Dec 2016 13:51:17 -0800
Linus Torvalds  wrote:

> On Sat, Dec 24, 2016 at 7:00 PM, Nicholas Piggin  wrote:
> > Add a new page flag, PageWaiters, to indicate the page waitqueue has
> > tasks waiting. This can be tested rather than testing waitqueue_active
> > which requires another cacheline load.  
> 
> Ok, I applied this one too. I think there's room for improvement, but
> I don't think it's going to help to just wait another release cycle
> and hope something happens.
> 
> Example room for improvement from a profile of unlock_page():
> 
>46.44 │  lock   andb $0xfe,(%rdi)
>34.22 │  mov(%rdi),%rax
> 
> this has the old "do atomic op on a byte, then load the whole word"
> issue that we used to have with the nasty zone lookup code too. And it
> causes a horrible pipeline hickup because the load will not forward
> the data from the (partial) store.
> 
>  Its' really a misfeature of our asm optimizations of the atomic bit
> ops. Using "andb" is slightly smaller, but in this case in particular,
> an "andq" would be a ton faster, and the mask still fits in an imm8,
> so it's not even hugely larger.

I did actually play around with that. I could not get my skylake
to forward the result from a lock op to a subsequent load (the
latency was the same whether you use lock ; andb or lock ; andl
(32 cycles for my test loop) whereas with non-atomic versions I
was getting about 15 cycles for andb vs 2 for andl.

I guess the lock op drains the store queue to coherency and does
not allow forwarding so as to provide the memory ordering
semantics.

> But it might also be a good idea to simply use a "cmpxchg" loop here.
> That also gives atomicity guarantees that we don't have with the
> "clear bit and then load the value".

cmpxchg ends up at 19 cycles including the initial load, so it
may be worthwhile. Powerpc has a similar problem with doing a
clear_bit; test_bit (not the size mismatch, but forwarding from
atomic ops being less capable).

Thanks,
Nick


[PATCH 2/2] mm: add PageWaiters indicating tasks are waiting for a page bit

2016-12-24 Thread Nicholas Piggin
Add a new page flag, PageWaiters, to indicate the page waitqueue has
tasks waiting. This can be tested rather than testing waitqueue_active
which requires another cacheline load.

This bit is always set when the page has tasks on page_waitqueue(page),
and is set and cleared under the waitqueue lock. It may be set when
there are no tasks on the waitqueue, which will cause a harmless extra
wakeup check that will clears the bit.

The generic bit-waitqueue infrastructure is no longer used for pages.
Instead, waitqueues are used directly with a custom key type. The
generic code was not flexible enough to have PageWaiters manipulation
under the waitqueue lock (which simplifies concurrency).

This improves the performance of page lock intensive microbenchmarks by
2-3%.

Putting two bits in the same word opens the opportunity to remove the
memory barrier between clearing the lock bit and testing the waiters
bit, after some work on the arch primitives (e.g., ensuring memory
operand widths match and cover both bits).

Signed-off-by: Nicholas Piggin 
---
 include/linux/mm.h |   2 +
 include/linux/page-flags.h |   9 ++
 include/linux/pagemap.h|  23 +++---
 include/linux/writeback.h  |   1 -
 include/trace/events/mmflags.h |   1 +
 init/main.c|   3 +-
 mm/filemap.c   | 181 +
 mm/internal.h  |   2 +
 mm/swap.c  |   2 +
 9 files changed, 174 insertions(+), 50 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4424784ac374..fe6b4036664a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1758,6 +1758,8 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, 
pmd_t *pmd)
return ptl;
 }
 
+extern void __init pagecache_init(void);
+
 extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a57c909a15e4..c56b39890a41 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,6 +73,7 @@
  */
 enum pageflags {
PG_locked,  /* Page is locked. Don't touch. */
+   PG_waiters, /* Page has waiters, check its waitqueue */
PG_error,
PG_referenced,
PG_uptodate,
@@ -169,6 +170,9 @@ static __always_inline int PageCompound(struct page *page)
  * for compound page all operations related to the page flag applied to
  * head page.
  *
+ * PF_ONLY_HEAD:
+ * for compound page, callers only ever operate on the head page.
+ *
  * PF_NO_TAIL:
  * modifications of the page flag must be done on small or head pages,
  * checks can be done on tail pages too.
@@ -178,6 +182,9 @@ static __always_inline int PageCompound(struct page *page)
  */
 #define PF_ANY(page, enforce)  page
 #define PF_HEAD(page, enforce) compound_head(page)
+#define PF_ONLY_HEAD(page, enforce) ({ \
+   VM_BUG_ON_PGFLAGS(PageTail(page), page);\
+   page;})
 #define PF_NO_TAIL(page, enforce) ({   \
VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page); \
compound_head(page);})
@@ -255,6 +262,7 @@ static inline int TestClearPage##uname(struct page *page) { 
return 0; }
TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
 
 __PAGEFLAG(Locked, locked, PF_NO_TAIL)
+PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, 
PF_ONLY_HEAD)
 PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, 
PF_NO_COMPOUND)
 PAGEFLAG(Referenced, referenced, PF_HEAD)
TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
@@ -743,6 +751,7 @@ static inline int page_has_private(struct page *page)
 
 #undef PF_ANY
 #undef PF_HEAD
+#undef PF_ONLY_HEAD
 #undef PF_NO_TAIL
 #undef PF_NO_COMPOUND
 #endif /* !__GENERATING_BOUNDS_H */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7dbe9148b2f8..d7f25f754d60 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -486,22 +486,14 @@ static inline int lock_page_or_retry(struct page *page, 
struct mm_struct *mm,
  * and for filesystems which need to wait on PG_private.
  */
 extern void wait_on_page_bit(struct page *page, int bit_nr);
-
 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
-extern int wait_on_page_bit_killable_timeout(struct page *page,
-int bit_nr, unsigned long timeout);
-
-static inline int wait_on_page_locked_killable(struct page *page)
-{
-   if (!PageLocked(page))
-   return 0;
-   return wait_on_page_bit_killable(compound_head(page), PG_locked);
-}
+extern void wake_up_page_bit(struct page *page, int bit_nr);
 
-extern wait_queue_head_t *page_waitqueu

[PATCH 0/2] PageWaiters again

2016-12-24 Thread Nicholas Piggin
I cleaned up the changelog a bit and made a few tweaks to patch 1 as
described in my reply to Hugh. Resending with SOBs.

Thanks,
Nick

Nicholas Piggin (2):
  mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked
  mm: add PageWaiters indicating tasks are waiting for a page bit

 include/linux/mm.h |   2 +
 include/linux/page-flags.h |  33 ++--
 include/linux/pagemap.h|  23 +++---
 include/linux/writeback.h  |   1 -
 include/trace/events/mmflags.h |   2 +-
 init/main.c|   3 +-
 mm/filemap.c   | 181 +
 mm/internal.h  |   2 +
 mm/memory-failure.c|   4 +-
 mm/migrate.c   |  14 ++--
 mm/swap.c  |   2 +
 11 files changed, 199 insertions(+), 68 deletions(-)

-- 
2.11.0



[PATCH 1/2] mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked

2016-12-24 Thread Nicholas Piggin
A page is not added to the swap cache without being swap backed,
so PageSwapBacked mappings can use PG_owner_priv_1 for PageSwapCache.

Acked-by: Hugh Dickins 
Signed-off-by: Nicholas Piggin 
---
 include/linux/page-flags.h | 24 
 include/trace/events/mmflags.h |  1 -
 mm/memory-failure.c|  4 +---
 mm/migrate.c   | 14 --
 4 files changed, 25 insertions(+), 18 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda91238..a57c909a15e4 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -87,7 +87,6 @@ enum pageflags {
PG_private_2,   /* If pagecache, has fs aux data */
PG_writeback,   /* Page is under writeback */
PG_head,/* A head page */
-   PG_swapcache,   /* Swap page: swp_entry_t in private */
PG_mappedtodisk,/* Has blocks allocated on-disk */
PG_reclaim, /* To be reclaimed asap */
PG_swapbacked,  /* Page is backed by RAM/swap */
@@ -110,6 +109,9 @@ enum pageflags {
/* Filesystems */
PG_checked = PG_owner_priv_1,
 
+   /* SwapBacked */
+   PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */
+
/* Two page bits are conscripted by FS-Cache to maintain local caching
 * state.  These bits are set on pages belonging to the netfs's inodes
 * when those inodes are being locally cached.
@@ -314,7 +316,13 @@ PAGEFLAG_FALSE(HighMem)
 #endif
 
 #ifdef CONFIG_SWAP
-PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
+static __always_inline int PageSwapCache(struct page *page)
+{
+   return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags);
+
+}
+SETPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
+CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
 #else
 PAGEFLAG_FALSE(SwapCache)
 #endif
@@ -701,12 +709,12 @@ static inline void ClearPageSlabPfmemalloc(struct page 
*page)
  * Flags checked when a page is freed.  Pages being freed should not have
  * these flags set.  It they are, there is a problem.
  */
-#define PAGE_FLAGS_CHECK_AT_FREE \
-   (1UL << PG_lru   | 1UL << PG_locked| \
-1UL << PG_private | 1UL << PG_private_2 | \
-1UL << PG_writeback | 1UL << PG_reserved | \
-1UL << PG_slab  | 1UL << PG_swapcache | 1UL << PG_active | \
-1UL << PG_unevictable | __PG_MLOCKED)
+#define PAGE_FLAGS_CHECK_AT_FREE   \
+   (1UL << PG_lru  | 1UL << PG_locked  |   \
+1UL << PG_private  | 1UL << PG_private_2   |   \
+1UL << PG_writeback| 1UL << PG_reserved|   \
+1UL << PG_slab | 1UL << PG_active  |   \
+1UL << PG_unevictable  | __PG_MLOCKED)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5a81ab48a2fb..30c2adbdebe8 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -95,7 +95,6 @@
{1UL << PG_private_2,   "private_2" },  \
{1UL << PG_writeback,   "writeback" },  \
{1UL << PG_head,"head"  },  \
-   {1UL << PG_swapcache,   "swapcache" },  \
{1UL << PG_mappedtodisk,"mappedtodisk"  },  \
{1UL << PG_reclaim, "reclaim"   },  \
{1UL << PG_swapbacked,  "swapbacked"},  \
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 19e796d36a62..f283c7e0a2a3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -764,12 +764,11 @@ static int me_huge_page(struct page *p, unsigned long pfn)
  */
 
 #define dirty  (1UL << PG_dirty)
-#define sc (1UL << PG_swapcache)
+#define sc ((1UL << PG_swapcache) | (1UL << PG_swapbacked))
 #define unevict(1UL << PG_unevictable)
 #define mlock  (1UL << PG_mlocked)
 #define writeback  (1UL << PG_writeback)
 #define lru(1UL << PG_lru)
-#define swapbacked (1UL << PG_swapbacked)
 #define head   (1UL << PG_head)
 #define slab   (1UL << PG_slab)
 #define reserved   (1UL << PG_reserved)
@@ -819,7 +818,6 @@ static struct page_state {
 #undef mlock
 #undef writeback
 #undef lru
-#undef swapbacked
 #undef head
 #undef slab
 #undef reserved
diff --git a/mm/migrate.c b/mm/migrate.c
index 0ed24b1fa77b..87f4d0f81819 100644
--- a/mm/migrate.c
+++ b/mm/migr

Re: [PATCH 1/2] mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked

2016-12-24 Thread Nicholas Piggin
On Thu, 22 Dec 2016 11:55:28 -0800 (PST)
Hugh Dickins  wrote:

> On Thu, 22 Dec 2016, Nicholas Piggin wrote:
> 
> I agree with every word of that changelog ;)
> 
> And I'll stamp this with
> Acked-by: Hugh Dickins 

Thanks Hugh.
 
> The thing that Peter remembers I commented on (which 0day caught too),
> was to remove PG_swapcache from PAGE_FLAGS_CHECK_AT_FREE: you've done
> that now, so this is good.  (Note in passing: wouldn't it be good to
> add PG_waiters to PAGE_FLAGS_CHECK_AT_FREE in the 2/2?)
> 
> Though I did yesterday notice a few more problematic uses of
> PG_swapcache, which you'll probably need to refine to exclude
> other uses of PG_owner_priv_1; though no great hurry for those,
> so not necessarily in this same patch.  Do your own grep, but
> 
> fs/proc/page.c derives its KPF_SWAPCACHE from PG_swapcache,
> needs refining.
> 
> kernel/kexec_core.c says VMCOREINFO_NUMBER(PG_swapcache):
> I haven't looked into what that's about, it will probably just
> have to be commented as now including other uses of the same bit.
> 
> mm/memory-failure.c has an error_states[] table that involves
> testing PG_swapcache as "sc", but looks as if it can be changed
> to factor in "swapbacked" too.

I've added the swapbacked check to mm/memory-failure.c, the others look
like they're just dealing with bit number, so not much to do about it
really. I also just made the migration case more explicit, seeing as the
others are.

Hopefully that doesn't negate your ack because I'm adding that too.

Thanks,
Nick


Re: [PATCH RESEND 0/2] kbuild: dead code elimination: ftrace fixes

2016-12-22 Thread Nicholas Piggin
On Thu, 22 Dec 2016 09:51:45 +0100
Marcin Nowakowski  wrote:

> Enabling dead code & data elimination currently breaks ftrace operation,
> as the __mcount_loc section is removed (as it is not referenced directly
> anywhere in the code).
> Moreover, there are a lot of entries missing in the __mcount_loc section
> as the recordmcount tool doesn't currently properly handle the section
> names as created by the use of -ffunction-sections.

Thanks for keeping on top of these. I didn't have any objections, so
if you would send them to Michal Marek for merge and cc linux-kbuild
that would be appreciated.

Thanks,
Nick


Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-21 Thread Nicholas Piggin
On Wed, 21 Dec 2016 11:50:49 -0800
Linus Torvalds  wrote:

> On Wed, Dec 21, 2016 at 11:01 AM, Nicholas Piggin  wrote:
> > Peter's patch is less code and in that regard a bit nicer. I tried
> > going that way once, but I just thought it was a bit too sloppy to
> > do nicely with wait bit APIs.  
> 
> So I have to admit that when I read through your and PeterZ's patches
> back-to-back, yours was easier to understand.
> 
> PeterZ's is smaller but kind of subtle. The whole "return zero from
> lock_page_wait() and go around again" and the locking around that
> isn't exactly clear. In contrast, yours has the obvious waitqueue
> spinlock.
> 
> I'll think about it.  And yes, it would be good to have more testing,
> but at the same time xmas is imminent, and waiting around too much
> isn't going to help either..

Sure. Let's see if Dave and Mel get a chance to do some testing.

It might be a squeeze before Christmas. I realize we're going to fix
it anyway so on one hand might as well get something in. On the other
I didn't want to add a subtle bug then have everyone go on vacation.

How about I send up the page flag patch by Friday and that can bake
while the main patch gets more testing / review?

Thanks,
Nick


Re: Build warning on 32-bit PPC - bisected to commit 989cea5c14be

2016-12-21 Thread Nicholas Piggin
On Wed, 21 Dec 2016 13:49:07 -0600
Larry Finger  wrote:

> I am getting the following warning when I build kernel 4.9-git on my 
> PowerBook 
> G4 with a 32-bit PPC processor:
> 
>AS  arch/powerpc/kernel/misc_32.o
> arch/powerpc/kernel/misc_32.S:299:7: warning: "CONFIG_FSL_BOOKE" is not 
> defined 
> [-Wundef]
> 
> This problem has been bisected to commit 989cea5c14be ("kbuild: prevent 
> lib-ksyms.o rebuilds").
> 
> Thanks,
> 
> Larry

Hi Larry,

This is strange you've bisected it there, I can't see how that patch would
trigger it. That said, powerpc has had a few small build system glitches.

It looks like this warning could be fixed by changing #elif CONFIG_FSL_BOOKE
to #elif defined (CONFIG_FSL_BOOKE). Want to send a patch (if it works)?

Thanks,
Nick


Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-21 Thread Nicholas Piggin
On Thu, 22 Dec 2016 04:33:31 +1000
Nicholas Piggin  wrote:

> On Wed, 21 Dec 2016 10:02:27 -0800
> Linus Torvalds  wrote:
> 

> > I do think your approach of just re-using the existing bit waiting
> > with just a page-specific waiting function is nicer than Nick's "let's
> > just roll new waiting functions" approach. It also avoids the extra
> > initcall.
> > 
> > Nick, comments?  
> 
> Well yes we should take my patch 1 and use the new bit for this
> purpose regardless of what way we go with patch 2. I'll reply to
> that in the other mail.

Actually when I hit send, I thought your next mail was addressing a
different subject. So back here.

Peter's patch is less code and in that regard a bit nicer. I tried
going that way once, but I just thought it was a bit too sloppy to
do nicely with wait bit APIs.

- The page can be added to waitqueue without PageWaiters being set.
  This is transient condition where the lock is retested, but it
  remains that PageWaiters is not quite the same as waitqueue_active
  to some degree.

- This set + retest means every time a page gets a waiter, the cost
  is 2 test-and-set for the lock bit plus 2 spin_lock+spin_unlock for
  the waitqueue.

- Setting PageWaiters is done outside the waitqueue lock, so you also
  have a new interleavings to think about versus clearing the bit.

- It fails to clear up the bit and return to fastpath when there are
  hash collisions. Yes I know this is a rare case and on average it
  probably does not matter. But jitter is important, but also we
  really *want* to keep the waitqueue table small and lean like you
  have made it if possible. None of this 100KB per zone crap -- I do
  want to keep it small and tolerating collisions better would help
  that.

Anyway that's about my 2c. Keep in mind Mel just said he might have
seen a lockup with Peter's patch, and mine has not been hugely tested
either, so let's wait for a bit more testing before merging either.

Although we could start pipelining the process by merging patch 1 if
Hugh acks it (cc'ed), then I'll resend with SOB and Ack.

Thanks,
Nick


Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-21 Thread Nicholas Piggin
On Wed, 21 Dec 2016 10:12:36 -0800
Linus Torvalds  wrote:

> On Wed, Dec 21, 2016 at 4:30 AM, Nicholas Piggin  wrote:
> >
> > I've been doing a bit of testing, and I don't know why you're seeing
> > this.
> >
> > I don't think I've been able to trigger any actual page lock contention
> > so nothing gets put on the waitqueue to really bounce cache lines around
> > that I can see.  
> 
> The "test is the waitqueue is empty" is going to cause cache misses
> even if there is no contention.
> 
> In fact, that's why I want the contention bit in the struct page - not
> because of any NUMA issues, but simply due to cache misses.
> 
> And yes, with no contention the bit waiting should hopefully be able
> to cache things shared - which should make the bouncing much less -
> but there's going to be a shitload of false sharing with any actual
> IO, so you will get bouncing due to that.

Well that's what I'm actually interested in, but I could not get it to
do much bouncing at all. There was a significant amount of writes going
through when having the backing store files on writeback filesystem,
but even that was not really triggering a lot of actual waiters.

Not that I don't believe it could happen, and Dave's system is a lot
bigger and faster and more NUMA than the one I was testing on. I'm
just curious.

Thanks,
Nick


Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-21 Thread Nicholas Piggin
On Wed, 21 Dec 2016 10:02:27 -0800
Linus Torvalds  wrote:

> On Wed, Dec 21, 2016 at 12:32 AM, Peter Zijlstra  wrote:
> >
> > FWIW, here's mine.. compiles and boots on a NUMA x86_64 machine.  
> 
> So I like how your patch is smaller, but your patch is also broken.
> 
> First off, the whole contention bit is *not* NUMA-specific. It should
> help non-NUMA too, by avoiding the stupid extra cache miss.
> 
> Secondly, CONFIG_NUMA is a broken thing to test anyway, since adding a
> bit for the NUMA case can overflow the page flags as far as I can tell
> (MIPS seems to support NUMA on 32-bit, for example, but I didn't
> really check the Kconfig details). Making it dependent on 64-bit might
> be ok (and would fix the issue above - I don't think we really need to
> care too much about 32-bit any more)
> 
> But making it conditional at all means that now you have those two
> different cases for this, which is a maintenance nightmare. So don't
> do it even if we could say "screw 32-bit".
> 
> Anyway, the conditional thing could be fixed by just taking Nick's
> patch 1/2, and your patch (with the conditional bits stripped out).
> 
> I do think your approach of just re-using the existing bit waiting
> with just a page-specific waiting function is nicer than Nick's "let's
> just roll new waiting functions" approach. It also avoids the extra
> initcall.
> 
> Nick, comments?

Well yes we should take my patch 1 and use the new bit for this
purpose regardless of what way we go with patch 2. I'll reply to
that in the other mail.

Thanks,
Nick


[PATCH 2/2] mm: add PageWaiters bit to indicate waitqueue should be checked

2016-12-21 Thread Nicholas Piggin
---
 include/linux/mm.h |   2 +
 include/linux/page-flags.h |   9 +++
 include/linux/pagemap.h|  23 +++---
 include/linux/writeback.h  |   1 -
 include/trace/events/mmflags.h |   1 +
 init/main.c|   3 +-
 mm/filemap.c   | 180 +
 mm/internal.h  |   2 +
 mm/swap.c  |   2 +
 9 files changed, 173 insertions(+), 50 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4424784ac374..fe6b4036664a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1758,6 +1758,8 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, 
pmd_t *pmd)
return ptl;
 }
 
+extern void __init pagecache_init(void);
+
 extern void free_area_init(unsigned long * zones_size);
 extern void free_area_init_node(int nid, unsigned long * zones_size,
unsigned long zone_start_pfn, unsigned long *zholes_size);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a57c909a15e4..c56b39890a41 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,6 +73,7 @@
  */
 enum pageflags {
PG_locked,  /* Page is locked. Don't touch. */
+   PG_waiters, /* Page has waiters, check its waitqueue */
PG_error,
PG_referenced,
PG_uptodate,
@@ -169,6 +170,9 @@ static __always_inline int PageCompound(struct page *page)
  * for compound page all operations related to the page flag applied to
  * head page.
  *
+ * PF_ONLY_HEAD:
+ * for compound page, callers only ever operate on the head page.
+ *
  * PF_NO_TAIL:
  * modifications of the page flag must be done on small or head pages,
  * checks can be done on tail pages too.
@@ -178,6 +182,9 @@ static __always_inline int PageCompound(struct page *page)
  */
 #define PF_ANY(page, enforce)  page
 #define PF_HEAD(page, enforce) compound_head(page)
+#define PF_ONLY_HEAD(page, enforce) ({ \
+   VM_BUG_ON_PGFLAGS(PageTail(page), page);\
+   page;})
 #define PF_NO_TAIL(page, enforce) ({   \
VM_BUG_ON_PGFLAGS(enforce && PageTail(page), page); \
compound_head(page);})
@@ -255,6 +262,7 @@ static inline int TestClearPage##uname(struct page *page) { 
return 0; }
TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
 
 __PAGEFLAG(Locked, locked, PF_NO_TAIL)
+PAGEFLAG(Waiters, waiters, PF_ONLY_HEAD) __CLEARPAGEFLAG(Waiters, waiters, 
PF_ONLY_HEAD)
 PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, 
PF_NO_COMPOUND)
 PAGEFLAG(Referenced, referenced, PF_HEAD)
TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
@@ -743,6 +751,7 @@ static inline int page_has_private(struct page *page)
 
 #undef PF_ANY
 #undef PF_HEAD
+#undef PF_ONLY_HEAD
 #undef PF_NO_TAIL
 #undef PF_NO_COMPOUND
 #endif /* !__GENERATING_BOUNDS_H */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7dbe9148b2f8..d7f25f754d60 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -486,22 +486,14 @@ static inline int lock_page_or_retry(struct page *page, 
struct mm_struct *mm,
  * and for filesystems which need to wait on PG_private.
  */
 extern void wait_on_page_bit(struct page *page, int bit_nr);
-
 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
-extern int wait_on_page_bit_killable_timeout(struct page *page,
-int bit_nr, unsigned long timeout);
-
-static inline int wait_on_page_locked_killable(struct page *page)
-{
-   if (!PageLocked(page))
-   return 0;
-   return wait_on_page_bit_killable(compound_head(page), PG_locked);
-}
+extern void wake_up_page_bit(struct page *page, int bit_nr);
 
-extern wait_queue_head_t *page_waitqueue(struct page *page);
 static inline void wake_up_page(struct page *page, int bit)
 {
-   __wake_up_bit(page_waitqueue(page), &page->flags, bit);
+   if (!PageWaiters(page))
+   return;
+   wake_up_page_bit(page, bit);
 }
 
 /* 
@@ -517,6 +509,13 @@ static inline void wait_on_page_locked(struct page *page)
wait_on_page_bit(compound_head(page), PG_locked);
 }
 
+static inline int wait_on_page_locked_killable(struct page *page)
+{
+   if (!PageLocked(page))
+   return 0;
+   return wait_on_page_bit_killable(compound_head(page), PG_locked);
+}
+
 /* 
  * Wait for a page to complete writeback
  */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index c78f9f0920b5..5527d910ba3d 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -375,7 +375,6 @@ void global_dirty_limits(unsigned long *pbackground, 
unsigned long *pdirty);
 unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
 
 void wb_update_bandwidth(struct bdi_writeback *wb, unsigned

[PATCH 0/2] respin of PageWaiters patch

2016-12-21 Thread Nicholas Piggin
Seeing as Mel said he would test it (and maybe Dave could as well), I
will post my patches again. There was a couple of page flags bugs pointed
out last time, which should be fixed.

Thanks,
Nick




Nicholas Piggin (2):
  mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked
  mm: add PageWaiters bit to indicate waitqueue should be checked

 include/linux/mm.h |   2 +
 include/linux/page-flags.h |  33 ++--
 include/linux/pagemap.h|  23 +++---
 include/linux/writeback.h  |   1 -
 include/trace/events/mmflags.h |   2 +-
 init/main.c|   3 +-
 mm/filemap.c   | 180 +
 mm/internal.h  |   2 +
 mm/swap.c  |   2 +
 9 files changed, 189 insertions(+), 59 deletions(-)

-- 
2.11.0



[PATCH 1/2] mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked

2016-12-21 Thread Nicholas Piggin
---
 include/linux/page-flags.h | 24 
 include/trace/events/mmflags.h |  1 -
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda91238..a57c909a15e4 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -87,7 +87,6 @@ enum pageflags {
PG_private_2,   /* If pagecache, has fs aux data */
PG_writeback,   /* Page is under writeback */
PG_head,/* A head page */
-   PG_swapcache,   /* Swap page: swp_entry_t in private */
PG_mappedtodisk,/* Has blocks allocated on-disk */
PG_reclaim, /* To be reclaimed asap */
PG_swapbacked,  /* Page is backed by RAM/swap */
@@ -110,6 +109,9 @@ enum pageflags {
/* Filesystems */
PG_checked = PG_owner_priv_1,
 
+   /* SwapBacked */
+   PG_swapcache = PG_owner_priv_1, /* Swap page: swp_entry_t in private */
+
/* Two page bits are conscripted by FS-Cache to maintain local caching
 * state.  These bits are set on pages belonging to the netfs's inodes
 * when those inodes are being locally cached.
@@ -314,7 +316,13 @@ PAGEFLAG_FALSE(HighMem)
 #endif
 
 #ifdef CONFIG_SWAP
-PAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
+static __always_inline int PageSwapCache(struct page *page)
+{
+   return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags);
+
+}
+SETPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
+CLEARPAGEFLAG(SwapCache, swapcache, PF_NO_COMPOUND)
 #else
 PAGEFLAG_FALSE(SwapCache)
 #endif
@@ -701,12 +709,12 @@ static inline void ClearPageSlabPfmemalloc(struct page 
*page)
  * Flags checked when a page is freed.  Pages being freed should not have
  * these flags set.  It they are, there is a problem.
  */
-#define PAGE_FLAGS_CHECK_AT_FREE \
-   (1UL << PG_lru   | 1UL << PG_locked| \
-1UL << PG_private | 1UL << PG_private_2 | \
-1UL << PG_writeback | 1UL << PG_reserved | \
-1UL << PG_slab  | 1UL << PG_swapcache | 1UL << PG_active | \
-1UL << PG_unevictable | __PG_MLOCKED)
+#define PAGE_FLAGS_CHECK_AT_FREE   \
+   (1UL << PG_lru  | 1UL << PG_locked  |   \
+1UL << PG_private  | 1UL << PG_private_2   |   \
+1UL << PG_writeback| 1UL << PG_reserved|   \
+1UL << PG_slab | 1UL << PG_active  |   \
+1UL << PG_unevictable  | __PG_MLOCKED)
 
 /*
  * Flags checked when a page is prepped for return by the page allocator.
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5a81ab48a2fb..30c2adbdebe8 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -95,7 +95,6 @@
{1UL << PG_private_2,   "private_2" },  \
{1UL << PG_writeback,   "writeback" },  \
{1UL << PG_head,"head"  },  \
-   {1UL << PG_swapcache,   "swapcache" },  \
{1UL << PG_mappedtodisk,"mappedtodisk"  },  \
{1UL << PG_reclaim, "reclaim"   },  \
{1UL << PG_swapbacked,  "swapbacked"},  \
-- 
2.11.0



Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-21 Thread Nicholas Piggin
On Mon, 19 Dec 2016 14:58:26 -0800
Dave Hansen  wrote:

> I saw a 4.8->4.9 regression (details below) that I attributed to:
> 
>   9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues
> 
> That commit took the bitlock waitqueues from being dynamically-allocated
> per-zone to being statically allocated and global.  As suggested by
> Linus, this makes them per-node, but keeps them statically-allocated.
> 
> It leaves us with more waitqueues than the global approach, inherently
> scales it up as we gain nodes, and avoids generating code for
> page_zone() which was evidently quite ugly.  The patch is pretty darn
> tiny too.
> 
> This turns what was a ~40% 4.8->4.9 regression into a 17% gain over
> what on 4.8 did.  That gain is a _bit_ surprising, but not entirely
> unexpected since we now get much simpler code from no page_zone() and a
> fixed-size array for which we don't have to follow a pointer (and get to
> do power-of-2 math).
> 
> This boots in a small VM and on a multi-node NUMA system, but has not
> been tested widely.  It definitely needs to be validated to make sure
> it properly initialzes the new structure in the various node hotplug
> cases before anybody goes applying it.
> 
> Original report below (sent privately by accident).
> 
>  include/linux/mmzone.h |5 +
>  kernel/sched/core.c|   16 
>  mm/filemap.c   |   22 +-
>  mm/page_alloc.c|5 +
>  4 files changed, 31 insertions(+), 17 deletions(-)
> 
> ---
> 
> I'm seeing a 4.8->4.9 regression:
> 
>   
> https://www.sr71.net/~dave/intel/2016-12-19-page_fault_3.processes.png
> 
> This is on a 160-thread 8-socket system.  The workloads is a bunch of
> processes faulting in pages to separate MAP_SHARED mappings:
> 
>   
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
> 
> The smoking gun in the profiles is that __wake_up_bit() (called via
> unlock_page()) goes from ~1% to 4% in the profiles.
> 
> The workload is pretty maniacally touching random parts of the
> waitqueue, and missing the cache heavily.  Now that it is shared, I
> suspect we're transferring cachelines across node boundaries in a way
> that we were not with the per-zone waitqueues.
> 
> This is *only* showing up with MAP_SHARED pages, not with anonymous
> pages.  I think we do a lock_page()/unlock_page() pair in
> do_shared_fault(), which we avoid in the anonymous case.  Reverting:
> 
>   9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues
> 
> restores things to 4.8 behavior.  The fact that automated testing didn't
> catch this probably means that it's pretty rare to find hardware that
> actually shows the problem, so I don't think it's worth reverting
> anything in mainline.

I've been doing a bit of testing, and I don't know why you're seeing
this.

I don't think I've been able to trigger any actual page lock contention
so nothing gets put on the waitqueue to really bounce cache lines around
that I can see. Yes there are some loads to check the waitqueue, but
those should be cached read shared among CPUs so not cause interconnect
traffic I would have thought.

After testing my PageWaiters patch, I'm maybe seeing 2% improvment on
this workload (although powerpc doesn't do so well on this one due to
virtual memory management overheads -- maybe x86 will show a bit more)

But the point is that after this, there should be no waitqueue activity
at all. I haven't chased up a system with a lot of IOPS connected that
would be needed to realistically test contended cases (which I thought
I needed to show an improvement from per-node tables, but your test
suggests not...)

Thanks,
Nick


> 
> ut, the commit says:
> 
> > As part of that earlier discussion, we had a much better solution for
> > the NUMA scalability issue - by just making the page lock have a
> > separate contention bit, the waitqueue doesn't even have to be looked at
> > for the normal case.  
> 
> So, maybe we should do that moving forward since we at least found one
> case that's pretty aversely affected.
> 
> Cc: Andreas Gruenbacher 
> Cc: Bob Peterson 
> Cc: Mel Gorman 
> Cc: Peter Zijlstra 
> Cc: Andy Lutomirski 
> Cc: Steven Whitehouse 
> Cc: Linus Torvalds 
> 
> ---
> 
>  b/include/linux/mmzone.h |5 +
>  b/kernel/sched/core.c|   16 
>  b/mm/filemap.c   |   22 +-
>  b/mm/page_alloc.c|5 +
>  4 files changed, 31 insertions(+), 17 deletions(-)
> 
> diff -puN include/linux/mmzone.h~static-per-zone-waitqueue 
> include/linux/mmzone.h
> --- a/include/linux/mmzone.h~static-per-zone-waitqueue2016-12-19 
> 11:35:12.210823059 -0800
> +++ b/include/linux/mmzone.h  2016-12-19 13:11:53.335170271 -0800
> @@ -27,6 +27,9 @@
>  #endif
>  #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
>  
> +#define WAIT_TABLE_BITS 8
> +#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
> +
>  /*
>   * PAGE_ALLOC_COSTL

Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-21 Thread Nicholas Piggin
On Wed, 21 Dec 2016 09:09:31 +0100
Peter Zijlstra  wrote:

> On Tue, Dec 20, 2016 at 10:02:46AM -0800, Linus Torvalds wrote:
> > On Tue, Dec 20, 2016 at 9:31 AM, Linus Torvalds
> >  wrote:  
> > >
> > > I'll go back and try to see why the page flag contention patch didn't
> > > get applied.  
> > 
> > Ahh, a combination of warring patches by Nick and PeterZ, and worry
> > about the page flag bits.  
> 
> I think Nick actually had a patch freeing up a pageflag, although Hugh
> had a comment on that.

Yeah I think he basically acked it. It had a small compound debug
false positive but I think it's okay. I'm just testing it again.
 
> That said, I'm not a huge fan of his waiters patch, I'm still not sure
> why he wants to write another whole wait loop, but whatever. Whichever
> you prefer I suppose.

Ah, I was waiting for some feedback, thanks.

Well I wanted to do it that way to keep the manipulation of the new
bit under the same lock as the waitqueue, so as not to introduce new
memory orderings vs testing waitqueue_active.

Thanks,
Nick


Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-20 Thread Nicholas Piggin
On Tue, 20 Dec 2016 12:58:25 +
Mel Gorman  wrote:

> On Tue, Dec 20, 2016 at 12:31:13PM +1000, Nicholas Piggin wrote:
> > On Mon, 19 Dec 2016 16:20:05 -0800
> > Dave Hansen  wrote:
> >   
> > > On 12/19/2016 03:07 PM, Linus Torvalds wrote:  
> > > > +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> > > > +{
> > > > +   const int __maybe_unused nid = 
> > > > page_to_nid(virt_to_page(word));
> > > > +
> > > > +   return __bit_waitqueue(word, bit, nid);
> > > > 
> > > > No can do. Part of the problem with the old coffee was that it did that
> > > > virt_to_page() crud. That doesn't work with the virtually mapped stack. 
> > > > 
> > > 
> > > Ahhh, got it.
> > > 
> > > So, what did you have in mind?  Just redirect bit_waitqueue() to the
> > > "first_online_node" waitqueues?
> > > 
> > > wait_queue_head_t *bit_waitqueue(void *word, int bit)
> > > {
> > > return __bit_waitqueue(word, bit, first_online_node);
> > > }
> > > 
> > > We could do some fancy stuff like only do virt_to_page() for things in
> > > the linear map, but I'm not sure we'll see much of a gain for it.  None
> > > of the other waitqueue users look as pathological as the 'struct page'
> > > ones.  Maybe:
> > > 
> > > wait_queue_head_t *bit_waitqueue(void *word, int bit)
> > > {
> > >   int nid
> > >   if (word >= VMALLOC_START) /* all addrs not in linear map */
> > >   nid = first_online_node;
> > >   else
> > >   nid = page_to_nid(virt_to_page(word));
> > > return __bit_waitqueue(word, bit, nid);
> > > }  
> > 
> > I think he meant just make the page_waitqueue do the per-node thing
> > and leave bit_waitqueue as the global bit.
> >   
> 
> I'm pressed for time but at a glance, that might require a separate
> structure of wait_queues for page waitqueue. Most users of bit_waitqueue
> are not operating with pages. The first user is based on a word inside
> a block_device for example. All non-page users could assume node-0.

Yes it would require something or other like that. Trivial to keep things
balanced (if not local) over nodes by take a simple hash of the virtual
address to spread over the nodes. Or just keep using this separate global
table for the bit_waitqueue...

But before Linus grumps at me again, let's try to do the waitqueue
avoidance bit first before we worry about that :)

> It
> shrinks the available hash table space but as before, maybe collisions
> are not common enough to actually matter. That would be worth checking
> out. Alternatively, careful auditing to pick a node when it's known it's
> safe to call virt_to_page may work but it would be fragile.
> 
> Unfortunately I won't be able to review or test any patches until January
> 3rd after I'm back online properly. Right now, I have intermittent internet
> access at best. During the next 4 days, I know I definitely will not have
> any internet access.
> 
> The last time around, there were three patch sets to avoid the overhead for
> pages in particular. One was dropped (mine, based on Nick's old work) as
> it was too complicated. Peter had some patches but after enough hammering
> it failed due to a missed wakup that I didn't pin down before having to
> travel to a conference.
> 
> I hadn't tested Nick's prototype although it looked fine because others
> reviewed it before I looked and I was waiting for another version to
> appear. If one appears, I'll take a closer look and bash it across a few
> machines to see if it has any lost wakeup problems.
> 

Sure I'll respin it this week.

Thanks,
Nick


Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-19 Thread Nicholas Piggin
On Mon, 19 Dec 2016 16:20:05 -0800
Dave Hansen  wrote:

> On 12/19/2016 03:07 PM, Linus Torvalds wrote:
> > +wait_queue_head_t *bit_waitqueue(void *word, int bit)
> > +{
> > +   const int __maybe_unused nid = page_to_nid(virt_to_page(word));
> > +
> > +   return __bit_waitqueue(word, bit, nid);
> > 
> > No can do. Part of the problem with the old coffee was that it did that
> > virt_to_page() crud. That doesn't work with the virtually mapped stack.   
> 
> Ahhh, got it.
> 
> So, what did you have in mind?  Just redirect bit_waitqueue() to the
> "first_online_node" waitqueues?
> 
> wait_queue_head_t *bit_waitqueue(void *word, int bit)
> {
> return __bit_waitqueue(word, bit, first_online_node);
> }
> 
> We could do some fancy stuff like only do virt_to_page() for things in
> the linear map, but I'm not sure we'll see much of a gain for it.  None
> of the other waitqueue users look as pathological as the 'struct page'
> ones.  Maybe:
> 
> wait_queue_head_t *bit_waitqueue(void *word, int bit)
> {
>   int nid
>   if (word >= VMALLOC_START) /* all addrs not in linear map */
>   nid = first_online_node;
>   else
>   nid = page_to_nid(virt_to_page(word));
> return __bit_waitqueue(word, bit, nid);
> }

I think he meant just make the page_waitqueue do the per-node thing
and leave bit_waitqueue as the global bit.

It would be cool if CPUs had an instruction that translates an address
though. You could avoid all that lookup and just do it with the TLB :)

Thanks,
Nick




Re: [RFC][PATCH] make global bitlock waitqueues per-node

2016-12-19 Thread Nicholas Piggin
On Mon, 19 Dec 2016 14:58:26 -0800
Dave Hansen  wrote:

> I saw a 4.8->4.9 regression (details below) that I attributed to:
> 
>   9dcb8b685f mm: remove per-zone hashtable of bitlock waitqueues
> 
> That commit took the bitlock waitqueues from being dynamically-allocated
> per-zone to being statically allocated and global.  As suggested by
> Linus, this makes them per-node, but keeps them statically-allocated.
> 
> It leaves us with more waitqueues than the global approach, inherently
> scales it up as we gain nodes, and avoids generating code for
> page_zone() which was evidently quite ugly.  The patch is pretty darn
> tiny too.
> 
> This turns what was a ~40% 4.8->4.9 regression into a 17% gain over
> what on 4.8 did.  That gain is a _bit_ surprising, but not entirely
> unexpected since we now get much simpler code from no page_zone() and a
> fixed-size array for which we don't have to follow a pointer (and get to
> do power-of-2 math).

I'll have to respin the PageWaiters patch and resend it. There were
just a couple of small issues picked up in review. I've just got side
tracked with getting a few other things done and haven't had time to
benchmark it properly.

I'd still like to see what per-node waitqueues does on top of that. If
it's significant for realistic workloads then it could be done for the
page waitqueues as Linus said.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-15 Thread Nicholas Piggin
On Thu, 15 Dec 2016 14:15:31 +0100
Hannes Frederic Sowa  wrote:

> On 15.12.2016 13:03, Nicholas Piggin wrote:
> > On Thu, 15 Dec 2016 12:19:02 +0100
> > Hannes Frederic Sowa  wrote:
> >   
> >> On 15.12.2016 03:06, Nicholas Piggin wrote:  
> >>> On Wed, 14 Dec 2016 15:04:36 +0100
> >>> Hannes Frederic Sowa  wrote:
> >>> 
> >>>> On 09.12.2016 17:03, Greg Kroah-Hartman wrote:
> >>>>> On Sat, Dec 10, 2016 at 01:56:53AM +1000, Nicholas Piggin wrote:  
> >>>>>> On Fri, 9 Dec 2016 15:36:04 +0100
> >>>>>> Stanislav Kozina  wrote:
> >>>>>>  
> >>>>>>>>>>> The question is how to provide a similar guarantee if a different 
> >>>>>>>>>>> way?
> >>>>>>>>>> As a tool to aid distro reviewers, modversions has some value, but 
> >>>>>>>>>> the
> >>>>>>>>>> debug info parsing tools that have been mentioned in this thread 
> >>>>>>>>>> seem
> >>>>>>>>>> superior (not that I've tested them).
> >>>>>>>>> On the other hand the big advantage of modversions is that it also
> >>>>>>>>> verifies the checksum during runtime (module loading). In other 
> >>>>>>>>> words, I
> >>>>>>>>> believe that any other solution should still generate some form of
> >>>>>>>>> checksum/watermark which can be easily checked for compatibility on
> >>>>>>>>> module load.
> >>>>>>>>> It should not be hard to add to the DWARF based tools though. We'd 
> >>>>>>>>> just
> >>>>>>>>> parse DWARF data instead of the C code.
> >>>>>>>> A runtime check is still done, with per-module vermagic which distros
> >>>>>>>> can change when they bump the ABI version. Is it really necessary to
> >>>>>>>> have more than that (i.e., per-symbol versioning)?
> >>>>>>>
> >>>>>>>  From my point of view, it is. We need to allow changing ABI for some 
> >>>>>>> modules while maintaining it for others.
> >>>>>>> In fact I think that there should be version not only for every 
> >>>>>>> exported 
> >>>>>>> symbol (in the EXPORT_SYMBOL() sense), but also for every public type 
> >>>>>>> (in the sense of eg. structure defined in the public header file).
> >>>>>>>   
> >>>>>>
> >>>>>> Well the distro can just append _v2, _v3 to the name of the function
> >>>>>> or type if it has to break compat for some reason. Would that be 
> >>>>>> enough?  
> >>>>>
> >>>>> There are other ways that distros can work around when upstream "breaks"
> >>>>> the ABI, sometimes they can rename functions, and others they can
> >>>>> "preload" structures with padding in anticipation for when/if fields get
> >>>>> added to them.  But that's all up to the distros, no need for us to
> >>>>> worry about that at all :)  
> >>>>
> >>>> The _v2 and _v3 functions are probably the ones that also get used by
> >>>> future backports in the distro kernel itself and are probably the reason
> >>>> for the ABI change in the first place. Thus going down this route will
> >>>> basically require distros to touch every future backport patch and will
> >>>> in general generate a big mess internally.
> >>>
> >>> What kind of big mess? You have to check the logic of each backport even
> >>> if it does apply cleanly, so the added overhead of the name change should
> >>> be relatively tiny, no?
> >>
> >> Basically single patches are backported in huge series. Reviewing each
> >> single patch also definitely makes sense, a review of the series as a
> >> whole is much more worthwhile because it focuses more on logic.
> >>
> >> The patches themselves are checked by individual robots or humans
> >> against merge conflict introduced mistakes which ring alarm bells for
> >> people to look more closely during 

Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-15 Thread Nicholas Piggin
On Thu, 15 Dec 2016 12:19:02 +0100
Hannes Frederic Sowa  wrote:

> On 15.12.2016 03:06, Nicholas Piggin wrote:
> > On Wed, 14 Dec 2016 15:04:36 +0100
> > Hannes Frederic Sowa  wrote:
> >   
> >> On 09.12.2016 17:03, Greg Kroah-Hartman wrote:  
> >>> On Sat, Dec 10, 2016 at 01:56:53AM +1000, Nicholas Piggin wrote:
> >>>> On Fri, 9 Dec 2016 15:36:04 +0100
> >>>> Stanislav Kozina  wrote:
> >>>>
> >>>>>>>>> The question is how to provide a similar guarantee if a different 
> >>>>>>>>> way?  
> >>>>>>>> As a tool to aid distro reviewers, modversions has some value, but 
> >>>>>>>> the
> >>>>>>>> debug info parsing tools that have been mentioned in this thread seem
> >>>>>>>> superior (not that I've tested them).  
> >>>>>>> On the other hand the big advantage of modversions is that it also
> >>>>>>> verifies the checksum during runtime (module loading). In other 
> >>>>>>> words, I
> >>>>>>> believe that any other solution should still generate some form of
> >>>>>>> checksum/watermark which can be easily checked for compatibility on
> >>>>>>> module load.
> >>>>>>> It should not be hard to add to the DWARF based tools though. We'd 
> >>>>>>> just
> >>>>>>> parse DWARF data instead of the C code.  
> >>>>>> A runtime check is still done, with per-module vermagic which distros
> >>>>>> can change when they bump the ABI version. Is it really necessary to
> >>>>>> have more than that (i.e., per-symbol versioning)?  
> >>>>>
> >>>>>  From my point of view, it is. We need to allow changing ABI for some 
> >>>>> modules while maintaining it for others.
> >>>>> In fact I think that there should be version not only for every 
> >>>>> exported 
> >>>>> symbol (in the EXPORT_SYMBOL() sense), but also for every public type 
> >>>>> (in the sense of eg. structure defined in the public header file).
> >>>>
> >>>> Well the distro can just append _v2, _v3 to the name of the function
> >>>> or type if it has to break compat for some reason. Would that be enough? 
> >>>>
> >>>
> >>> There are other ways that distros can work around when upstream "breaks"
> >>> the ABI, sometimes they can rename functions, and others they can
> >>> "preload" structures with padding in anticipation for when/if fields get
> >>> added to them.  But that's all up to the distros, no need for us to
> >>> worry about that at all :)
> >>
> >> The _v2 and _v3 functions are probably the ones that also get used by
> >> future backports in the distro kernel itself and are probably the reason
> >> for the ABI change in the first place. Thus going down this route will
> >> basically require distros to touch every future backport patch and will
> >> in general generate a big mess internally.  
> > 
> > What kind of big mess? You have to check the logic of each backport even
> > if it does apply cleanly, so the added overhead of the name change should
> > be relatively tiny, no?  
> 
> Basically single patches are backported in huge series. Reviewing each
> single patch also definitely makes sense, a review of the series as a
> whole is much more worthwhile because it focuses more on logic.
> 
> The patches themselves are checked by individual robots or humans
> against merge conflict introduced mistakes which ring alarm bells for
> people to look more closely during review.
> 
> Merge conflicts introduced mistakes definitely can happen because
> developers/backporters lose the focus from the actual logic but deal
> with shifting lines around or just fixing up postfixes to function names.
> 
> We still try to align the kernel as much as possible with upstream,
> because most developers can't really hold the differences between
> upstream and the internal functions in their heads (is this function RMW
> safe in this version but not that kernel version...).

I agree with all this, but in the case of a function rename, you can
automate it all with scripts if that's what you want.

When you have your list of exported symbols with non-zero version number,
then you can scr

Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-14 Thread Nicholas Piggin
On Wed, 14 Dec 2016 15:04:36 +0100
Hannes Frederic Sowa  wrote:

> On 09.12.2016 17:03, Greg Kroah-Hartman wrote:
> > On Sat, Dec 10, 2016 at 01:56:53AM +1000, Nicholas Piggin wrote:  
> >> On Fri, 9 Dec 2016 15:36:04 +0100
> >> Stanislav Kozina  wrote:
> >>  
> >>>>>>> The question is how to provide a similar guarantee if a different 
> >>>>>>> way?
> >>>>>> As a tool to aid distro reviewers, modversions has some value, but the
> >>>>>> debug info parsing tools that have been mentioned in this thread seem
> >>>>>> superior (not that I've tested them).
> >>>>> On the other hand the big advantage of modversions is that it also
> >>>>> verifies the checksum during runtime (module loading). In other words, I
> >>>>> believe that any other solution should still generate some form of
> >>>>> checksum/watermark which can be easily checked for compatibility on
> >>>>> module load.
> >>>>> It should not be hard to add to the DWARF based tools though. We'd just
> >>>>> parse DWARF data instead of the C code.
> >>>> A runtime check is still done, with per-module vermagic which distros
> >>>> can change when they bump the ABI version. Is it really necessary to
> >>>> have more than that (i.e., per-symbol versioning)?
> >>>
> >>>  From my point of view, it is. We need to allow changing ABI for some 
> >>> modules while maintaining it for others.
> >>> In fact I think that there should be version not only for every exported 
> >>> symbol (in the EXPORT_SYMBOL() sense), but also for every public type 
> >>> (in the sense of eg. structure defined in the public header file).  
> >>
> >> Well the distro can just append _v2, _v3 to the name of the function
> >> or type if it has to break compat for some reason. Would that be enough?  
> > 
> > There are other ways that distros can work around when upstream "breaks"
> > the ABI, sometimes they can rename functions, and others they can
> > "preload" structures with padding in anticipation for when/if fields get
> > added to them.  But that's all up to the distros, no need for us to
> > worry about that at all :)  
> 
> The _v2 and _v3 functions are probably the ones that also get used by
> future backports in the distro kernel itself and are probably the reason
> for the ABI change in the first place. Thus going down this route will
> basically require distros to touch every future backport patch and will
> in general generate a big mess internally.

What kind of big mess? You have to check the logic of each backport even
if it does apply cleanly, so the added overhead of the name change should
be relatively tiny, no?

> 
> I think it is important to keep versioning information outside of the
> source code. Some kind of modversions will still be required, but
> distros should be able to decide if they put in some kind of checksum or
> a string, what suites them most.

The module crc symbols are just an integer that requires a match, so it
could easily be populated by a list that the distro keeps, rather than
by genksyms. Most of the complexity is on the build side, so that would
still be an improvement for the kernel. So we *could* do this if the
distros need it.

Thanks,
Nick



Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-12 Thread Nicholas Piggin
On Mon, 12 Dec 2016 10:48:47 +0100
Stanislav Kozina  wrote:

>  A runtime check is still done, with per-module vermagic which distros
>  can change when they bump the ABI version. Is it really necessary to
>  have more than that (i.e., per-symbol versioning)?  
> >>>   From my point of view, it is. We need to allow changing ABI for some
> >>> modules while maintaining it for others.
> >>> In fact I think that there should be version not only for every exported
> >>> symbol (in the EXPORT_SYMBOL() sense), but also for every public type
> >>> (in the sense of eg. structure defined in the public header file).  
> >> Well the distro can just append _v2, _v3 to the name of the function
> >> or type if it has to break compat for some reason. Would that be enough?  
> > There are other ways that distros can work around when upstream "breaks"
> > the ABI, sometimes they can rename functions, and others they can
> > "preload" structures with padding in anticipation for when/if fields get
> > added to them.  But that's all up to the distros, no need for us to
> > worry about that at all :)  
> 
> Currently, the ABI version (checksum) is stored outside of the actual 
> code in the __ksymtab section. That means that the distributions can 
> still apply upstream patches cleanly and only update the version 
> checksum if these break ABI.
> 
> With the _v2, _v3 suffixes (or similar solutions) we'd be effectively 
> storing the ABI versions directly in the code and that would cause 
> conflicts when pulling further patches from upstream.
> 
> My view is that it would be than easier to maintain out-of-tree 
> modversions (or similar tool) rather than to solve all these conflicts.

That kind of name clash should hardly be an issue, in comparison with the
care it requires to merge fixes on top of a backport which has already
changed ABI. It tends to be trivially fixable just by search/replace
in the patchfile before applying, if nothing else. But you probably *want*
to be flagged on those things and merge by hand anyway.

Backporting alone I don't think can justify symbol versioning, but I'd
like to hear from distro maintainers if any disagree.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-11 Thread Nicholas Piggin
On Sat, 10 Dec 2016 13:41:03 +0100
Greg Kroah-Hartman  wrote:

> On Fri, Dec 09, 2016 at 11:46:54PM +0100, Dodji Seketeli wrote:
> > Hello,
> > 
> > Nicholas Piggin  a écrit:
> > 
> > [...]
> >   
> > > That said, a dwarf based checker tool should be able to do as good a job
> > > (maybe a bit better because report is very informative and it may pick up
> > > compiler alignments or padding options).  
> > 
> > So, Nicholas was kind enough to send me the two Linux Kernel binaries
> > that he built with the tiny little interface change that we were
> > discussing earlier.  Here is what the abidiff[1] tools says about that
> > interface change:
> > 
> > $ time ~/git/libabigail/kabidiff/build/tools/abidiff vmlinux.abi1.abi 
> > vmlinux.abi2.abi
> > Functions changes summary: 0 Removed, 1 Changed, 0 Added function
> > Variables changes summary: 0 Removed, 0 Changed, 0 Added variable
> > 
> > 1 function with some indirect sub-type change:
> > 
> >   [C]'function int foo(blah*)' at memory.c:82:1 has some indirect 
> > sub-type changes:
> > parameter 1 of type 'blah*' has sub-type changes:
> >   in pointed to type 'struct blah' at memory.c:78:1:
> > type size changed from 32 to 64 bits
> > 1 data member insertion:
> >   'int blah::y', at offset 0 (in bits) at memory.c:79:1
> > 1 data member change:
> >  'int blah::x' offset changed from 0 to 32 (in bits) (by +32 
> > bits)
> > 
> > 
> > 
> > real0m2.595s
> > user0m2.489s
> > sys 0m0.108s
> > $ 
> > 
> > I kept the timing information to give you an idea of the time it takes
> > on a non-optimized build of abidiff.
> > 
> > One could for instance want that types that are not defined in header
> > files be kept out of the change report.  In that case it's possible to
> > write a little suppression specification file like this one:
> > 
> > $ cat vmlinux.abignore 
> > [suppress_type]
> >   source_location_not_regexp = .*\\.h
> > $
> > 
> > You can then pass that suppression file to the tool:
> > 
> > $ ~/git/libabigail/kabidiff/build/tools/abidiff --suppr 
> > vmlinux.abignore vmlinux.abi1.abi vmlinux.abi2.abi
> > Functions changes summary: 0 Removed, 0 Changed (1 filtered out), 0 
> > Added function
> > Variables changes summary: 0 Removed, 0 Changed, 0 Added variable
> > 
> > 
> > real0m2.574s
> > user0m2.473s
> > sys 0m0.102s
> > $
> > 
> > So this is the kind of interface change analysis tool we are working on
> > at the moment.
> > 
> > One could also imagine a tool that would compute a CRC that takes the
> > very same suppression specification files into account, letting people
> > to decide that some interface changes are OK.  That CRC would thus be
> > added to the special ELF sections we already have today.  We could keep
> > the modversion machinery, but with a greater dose of flexibility.
> > Whenever modversion detects a change, abidiff would tell people what the
> > change is exactly.
> > 
> > What do you guys think?  
> 
> YES YES YES!!!
> 
> Now I don't work on a distro anymore, but I would think that something
> like this would be really useful, pointing out exactly what changed is
> very important for distro maintainers to determine what they want to do
> (either fix up the abi change with strange hacks, or ignore it due to
> the change being in an area they don't care at all about, i.e. a random
> driver subsystem.)
> 
> So yes, I think this is really good stuff.  But if the distro
> maintainers correct me and think it's useless, then I need to revisit my
> view of exactly what they do for their customers :)

Agree completely. BTW (for those who might be looking into these tools),
we also have https://github.com/skozina/kabi-dw that Stanislav (cc'ed)
mentioned earlier.

It's true that the current modversions __crc_ matching infrastructure is
"just" a symbol versioning system, and we could keep it and just populate
it with something other than genksyms (e.g., a symbol version list provided
by distros). But the starting point should be *no* versioning and simply
using names to break linkage. Unless there's a compelling reason not to,
symbols are simpler, easier, everyone knows how they work.

The other question would be whether to pull a minimal tool into the kernel
source or keep them out of tree (but possibly add some helper scripts etc).
I guess we'll need to see what distros want.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-09 Thread Nicholas Piggin
On Fri, 09 Dec 2016 15:21:33 +
Ian Campbell  wrote:

> On Fri, 2016-12-09 at 13:33 +1000, Nicholas Piggin wrote:
> > 
> > Well I simply tested the outcome. If you have:
> > 
> > struct blah {
> >   int x;
> > };
> > int foo(struct blah *blah)
> > {
> >   return blah->x;
> > }
> > EXPORT(foo);
> > 
> > $ nm vmlinux | grep __crc_foo
> > a0cf13a0 A __crc_foo
> > 
> > Now change to
> > 
> > struct blah {
> >   int y;
> >   int x;
> > };
> > 
> > $ nm vmlinux | grep __crc_foo
> > a0cf13a0 A __crc_foo
> > 
> > It just doesn't catch these things.  
> 
> I found the same when I just added your snippet to init/main.c.
> 
> _But_ when I moved the struct into include/types.h (which happened to
> be included by init/main.c) then, with just x in the struct:
> 
> $ make -s init/main.{o,symtypes} && grep -E foo\|blah init/main.symtypes 
> && nm init/main.o  | grep __crc_foo
> s#blah struct blah { int x ; } 
> foo int foo ( s#blah * ) 
> 0cd0312e A __crc_foo
> 
> but adding y:
> 
> $ make -s init/main.{o,symtypes} && grep -E foo\|blah init/main.symtypes 
> && nm init/main.o  | grep __crc_foo
> s#blah struct blah { int x ; int y ; } 
> foo int foo ( s#blah * ) 
> eda220c6 A __crc_foo
> 
> So it does catch things in that case.
> 
> With struct blah inline in main.c it was:
> 
> $ make -s init/main.{o,symtypes} && grep -E foo\|blah init/main.symtypes 
> && nm init/main.o  | grep __crc_foo
> s#blah struct blah { UNKNOWN } 
> foo int foo ( s#blah * ) 
> a0cf13a0 A __crc_foo
> 
> So I suppose it only cares about structs which are in headers, which I
> guess makes sense. I think it is working in at least one of the
> important cases.

Aha thanks, well that's my mistake. Clever little bugger, isn't it? Okay
it's not so useless as I first thought!

That said, a dwarf based checker tool should be able to do as good a job
(maybe a bit better because report is very informative and it may pick up
compiler alignments or padding options). So I still think it's worth
looking at those if we can remove modversions.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-09 Thread Nicholas Piggin
On Fri, 9 Dec 2016 15:36:04 +0100
Stanislav Kozina  wrote:

>  The question is how to provide a similar guarantee if a different way?  
> >>> As a tool to aid distro reviewers, modversions has some value, but the
> >>> debug info parsing tools that have been mentioned in this thread seem
> >>> superior (not that I've tested them).  
> >> On the other hand the big advantage of modversions is that it also
> >> verifies the checksum during runtime (module loading). In other words, I
> >> believe that any other solution should still generate some form of
> >> checksum/watermark which can be easily checked for compatibility on
> >> module load.
> >> It should not be hard to add to the DWARF based tools though. We'd just
> >> parse DWARF data instead of the C code.  
> > A runtime check is still done, with per-module vermagic which distros
> > can change when they bump the ABI version. Is it really necessary to
> > have more than that (i.e., per-symbol versioning)?  
> 
>  From my point of view, it is. We need to allow changing ABI for some 
> modules while maintaining it for others.
> In fact I think that there should be version not only for every exported 
> symbol (in the EXPORT_SYMBOL() sense), but also for every public type 
> (in the sense of eg. structure defined in the public header file).

Well the distro can just append _v2, _v3 to the name of the function
or type if it has to break compat for some reason. Would that be enough?

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-09 Thread Nicholas Piggin
On Fri, 9 Dec 2016 08:55:51 +0100
Stanislav Kozina  wrote:

> >> The question is how to provide a similar guarantee if a different way?  
> > As a tool to aid distro reviewers, modversions has some value, but the
> > debug info parsing tools that have been mentioned in this thread seem
> > superior (not that I've tested them).  
> 
> On the other hand the big advantage of modversions is that it also 
> verifies the checksum during runtime (module loading). In other words, I 
> believe that any other solution should still generate some form of 
> checksum/watermark which can be easily checked for compatibility on 
> module load.
> It should not be hard to add to the DWARF based tools though. We'd just 
> parse DWARF data instead of the C code.

A runtime check is still done, with per-module vermagic which distros
can change when they bump the ABI version. Is it really necessary to
have more than that (i.e., per-symbol versioning)?

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-08 Thread Nicholas Piggin
On Thu, 1 Dec 2016 10:20:39 -0500
Don Zickus  wrote:

> On Thu, Dec 01, 2016 at 03:32:15PM +1100, Nicholas Piggin wrote:
> > > Anyway, MODVERSIONS is our way of protecting our kabi for the last 10 
> > > years.
> > > It isn't perfect and we have fixed the genksyms tool over the years, but 
> > > so
> > > far it mostly works fine.  
> > 
> > Okay. It would be good to get all the distros in on this.
> > 
> > What I want to do is work out exactly what it is that modversions is
> > giving you.
> > 
> > We know it's fairly nasty code to maintain and it does not detect ABI
> > changes very well. But it's not such a burden that we can't maintain
> > it if there are good reasons to keep it.  
> 
> Hi Nick,
> 
> I won't disagree with you there. :-)

Sorry for the late reply, I was moving house and got side tracked.

> modversions is a pretty heavy handed approach that basically says if all the
> symbols and types haven't changed for a given EXPORT_SYMBOL (recursively
> checked), then there is a high degree of confidence the OOT driver will not
> only load, but run correctly.

It's heavy handed in that it is quite complex in the kernel build system,
but it is also light handed in that it does not do a very good job.

I would say the degree of confidence is not very high. People have told
me modversions follows pointers to objects in its calculation, but I have
not seen that to be the case. Even if you did have that, it can not replace
a code review for semantics of data and code.

> The question is how to provide a similar guarantee if a different way?

As a tool to aid distro reviewers, modversions has some value, but the
debug info parsing tools that have been mentioned in this thread seem
superior (not that I've tested them).

> 
> We have plenty of customers with 10 year old drivers, where the expertise
> has long left the company.  The engineers still around, recompile and make
> tweaks to get things working on the latest RHEL.  Verify it passes testing
> and release it.  Then they hope to not touch it again for a few years until
> the next RHEL comes along.
> 
> Scary, huh? :-)

Oh yeah my aim here is not to make distro or out of tree module vendors
life harder, actually the opposite. If it turns out modversions really is
the best approach, I'm not in a position to complain about its complexity
because we have Suse and Redhat people maintaining the build and module
systems :) I just want to see if we can do things better.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-08 Thread Nicholas Piggin
On Thu, 1 Dec 2016 17:12:41 +0100
Michal Marek  wrote:

> On 2016-12-01 04:39, Nicholas Piggin wrote:
> > On Thu, 01 Dec 2016 02:35:54 +
> > Ben Hutchings  wrote:  
> >> As I understand it, genksyms incorporates the definitions of a
> >> function's parameter and return types - not just their names - and all
> >> the types they refer to, recursively.  So a structure size change
> >> should change the version of all functions where the function and its
> >> caller pass that structure between them, however indirectly.  It finds
> >> such indirect ABI breakage for me fairly regularly, though of course I
> >> don't know that it finds everything.  
> > 
> > It is only the type name.
> > 
> > Not only that but even if you did extend it further to structure type
> > arrangement then you still have to deal with other structures followed
> > via pointers. Or (rarer but not unheard of):
> > 
> > - changes to structures without changes of the types of their members
> > - changes to arguments without changes of their type  
> 
> This is already covered by genksyms. Try make V=1 with
> CONFIG_MODVERSIONS=y and add the -D option to one of the genksyms
> command. I wanted to paste the expanded signature for
> register_filesystem() as an example, but vger would probably drop the
> mail for being too big :).

Well I simply tested the outcome. If you have:

struct blah {
  int x;
};
int foo(struct blah *blah)
{
  return blah->x;
}
EXPORT(foo);

$ nm vmlinux | grep __crc_foo
a0cf13a0 A __crc_foo

Now change to

struct blah {
  int y;
  int x;
};

$ nm vmlinux | grep __crc_foo
a0cf13a0 A __crc_foo

It just doesn't catch these things. Honestly, stable ABI distros *have*
to review all patches to ensure the ABI is unchanged. Some tools could
help significantly, but for that, the debug info ABI checking tools that
have been mentioned in this thread are far better tool for this job than
modversions.

Thanks,
Nick


Re: linker-tables v5 testing

2016-12-08 Thread Nicholas Piggin
On Fri, 2 Dec 2016 07:49:52 -0800
"Luis R. Rodriguez"  wrote:

> On Thu, Dec 1, 2016 at 10:31 PM, Nicholas Piggin
>  wrote:
> >
> > On 2 Dec 2016 2:35 AM, "Luis R. Rodriguez"  wrote:  
> >>
> >> On Wed, Nov 30, 2016 at 9:20 PM, Nicholas Piggin 
> >> wrote:  
> >> > On Thu, 1 Dec 2016 16:04:30 +1100
> >> > Nicholas Piggin  wrote:
> >> >  
> >> >> On Wed, 30 Nov 2016 19:15:27 -0800
> >> >> "Luis R. Rodriguez"  wrote:
> >> >>  
> >> >> > On Wed, Nov 30, 2016 at 6:51 PM, Nicholas Piggin 
> >> >> > wrote:  
> >> >> > > On Wed, 30 Nov 2016 18:38:16 +0100
> >> >> > > "Luis R. Rodriguez"  wrote:
> >> >> > >  
> >> >> > >> On Wed, Nov 30, 2016 at 02:09:47PM +1100, Nicholas Piggin wrote:  
> >> >>  
> >> >> > >> What is wrong with that ? Separating linker table and section
> >> >> > >> ranges is  
> >> >> > >
> >> >> > > It's not that you separate those, of course you need that. It's
> >> >> > > that
> >> >> > > you also separate other sections from the input section
> >> >> > > descriptions:
> >> >> > >
> >> >> > > -   *(.text.hot .text .text.fixup .text.unlikely)
> >> >> > > \
> >> >> > > +   *(.text.hot .text)
> >> >> > > \
> >> >> > > +   *(SORT(.text.rng.*))
> >> >> > > \
> >> >> > > +   *(.text.fixup .text.unlikely)
> >> >> > > \  
> >> >
> >> > Ahh, you're doing it to avoid clash with compiler generated sections.  
> >>
> >> Nope, its for two reasons:
> >>
> >> 1) To be able to construct arrays without modifying the linker script
> >> we had to get crafty, and opted in for the trick of picking two
> >> arbitrary delimiters for use of section start, and section end, namely
> >> the tilde character ("~") and the empty string (""), and then stuffing
> >> anything in between. For this to work properly we must SORT() these
> >> specially crafted sections as well.
> >>
> >> 2) Because I don't want to regress .text if SORT()'ing it breaks
> >> something. In theory it should not but I rather be careful.  
> >
> > Oh yes I wasn't talking about the sort itself, but about why you broke up
> > the existing input section descriptions.  
> 
> Ah, but I was also talking about this, the only thing is...
> 
> > That was my only concern, e.g.,
> > .data and .data.[_0-9a-zA-Z] spoke be in the same description.  
> 
> I was talking about splitting up .data and both .data.tbl .data.rng --
> you were talking specifically only about .data and .data.[_0-9a-zA-Z]
> and now I see the concern! Yes, you are -- we can avoid this, its just
> your glob would also capture .data.tbl and data.rng and I want to sort
> these so I do it first. But as you already deduced, this should still
> have no harm as I am just stuffing it in well sorted first and the
> later glob just captures that. I could have just used .data..tbl and
> data..rng as you noted. Either way is fine by me.
> 
> Do you have a preference?

I would say use "..", just so there is less ambiguity and more
flexibility in moving around the input section descriptions if needed.

> 
> >> > The usual way to cope with that seems to be to use two dots for your
> >> > name.
> >> > .text..rng.*  
> >>
> >> I have been wondering why people started doing that, it was not clear
> >> nor documented anywhere. So no, it was not my original motivation, but
> >> if it helps, it will be good to document this as well.  
> >
> > Yeah it's convention because . can be part of a section name but not a C
> > symbol.  
> 
> Great, makes sense -- do you have references for this practice BTW or
> did you just infer this?

I believe I've seen something, perhaps it was in the binutils manuals,
but I tried just now and couldn't find anything.

It is a common practice to use dots in labels to avoid clashes with C
names in general. Specifically for sections we already have some cases
in there which I believe were added for similar reasons.

Thanks,
Nick


Re: linux-next: Tree for Dec 7 (kallsyms failure)

2016-12-07 Thread Nicholas Piggin
On Thu, 8 Dec 2016 15:29:35 +1100
Stephen Rothwell  wrote:

> Hi all,
> 
> On Wed, 7 Dec 2016 18:30:57 -0800 Randy Dunlap  wrote:
> >
> > On 12/07/16 15:56, Stephen Rothwell wrote:  
> > > 
> > > On Wed, 7 Dec 2016 15:42:32 -0800 Randy Dunlap  
> > > wrote:
> > >>
> > >> I started seeing this yesterday (2016-1206).
> > >> This is on x86_64.
> > >>
> > >> Anybody know about it?
> > >>
> > >> kallsyms failure: relative symbol value 0x8100 out of range 
> > >> in relative mode
> > > 
> > > I got a similar failure starting a few days ago on my powerpc
> > > allyesconfig build.  I was assuming that it was PowerPC specific, but
> > > noone has found a cause yet.
> > > 
> > 
> > It may just be an invalid randconfig.  I modified scripts/kallsyms.c and
> > I see this message:
> > kallsyms failure: relative symbol value 0x8100 [symbol: 
> > Tstartup_64] out of range in relative mode
> > 
> > and it makes sense that startup_64 would (or could) be at 
> > 0x8100...
> > especially since CONFIG_PHYSICAL_START=0x100 and
> > (from Documentation/x86/x86_64/mm.txt)
> > 8000 - 9fff (=512 MB)  kernel text mapping, from 
> > phys 0
> > 
> > 
> > Ard, what do you think about this?  
> 
> The similar failure I saw in the powerpc allyesconfig build
> 
> kallsyms failure: relative symbol value 0xc000 out of range in 
> relative mode
> 
> was caused by commit
> 
>   8ab2ae655bfe ("default exported asm symbols to zero")
> 
> which has been reverted in Linus' tree today.
> 

Huh, so it seems like the explicit 0 symbols were being picked up as the
kallsyms relative base, putting the 0xc... symbols beyond reach.


Re: linux-next: build failure in the powerpc allyesconfig build

2016-12-04 Thread Nicholas Piggin
On Mon, 5 Dec 2016 16:24:31 +1100
Stephen Rothwell  wrote:

> Hi all,
> 
> On Mon, 5 Dec 2016 16:22:04 +1100 Stephen Rothwell  
> wrote:
> >
> > After mergeing everything but Andrew's tree, today's linux-next build
> > (powerpc allyesconfig) failed like this:
> > 
> > kallsyms failure: relative symbol value 0xc000 out of range in 
> > relative mode
> > 
> > I have no idea what caused this, so I have left the powerpc allyesconfig
> > build broken for now.  
> 
> Just as a data point, I tried reverting commit
> 
>   dadc4a1bb9f0 ("powerpc/64: Fix placement of .text to be immediately 
> following .head.text")
> 
> but that did not help.

In scripts/kallsyms.c, is there a table[i].sym string you can print?


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-01 Thread Nicholas Piggin
On Thu, 1 Dec 2016 12:33:02 +0100
Stanislav Kozina  wrote:

> On 12/01/2016 12:09 PM, Nicholas Piggin wrote:
> > On Thu, 1 Dec 2016 11:48:09 +0100
> > Stanislav Kozina  wrote:
> >  
> >> On 12/01/2016 05:13 AM, Don Zickus wrote:
> >>
> >> ...
> >>  
> >>> I think GregKH pointed to one such tool, libabigail?  We are working on
> >>> others too.  
> >> I should mention one of the others here:
> >> https://github.com/skozina/kabi-dw
> >>
> >> It's quite comparable to libabigail in the way it works, the main
> >> differences are:
> >>- written in pure C
> >>- depends only on elf-utils and flex/yacc
> >>- it's much simpler (4k LOC)
> >>- stores the type information in the text files and compares those
> >> instead of directly comparing two sets of DWARF data  
> > Now this seems much better for distro ABI checking.
> >
> > The next question is, do they need any kernel support for rare cases
> > where they do have to break the ABI of an export? Simple rename of the
> > function with a _v2 postfix might be enough. We could retain some per
> > symbol versioning in the kernel if needed, but how much would it
> > actually help?  
> 
> The biggest pain point AFAICT is to identify what types (functions, 
> structs, enums, ...) should be considered a part of the stable ABI.

Sure. This is something an automated checker can't solve completely.
Any changes would have to be considered in terms of their impact to
the ABI. It's not just data but also instruction changes involved.

This is policy that should not be mandated by the kernel. Which is
why I'm in favor of using tools like this and just providing mechanism
so distros can implement their own polices.

> And 
> the problem with modversions is that it pulls in just everything which 
> gets (accidentally?) #included in the source file.

I think that's SRCVERSION which is something else. But modversions
has problems too.

> The actual ABI maintenance is a different problem, but there are many 
> possible approaches, the _v2 suffix being one of them.

Would be good to get a consensus on that too.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-12-01 Thread Nicholas Piggin
On Thu, 1 Dec 2016 11:48:09 +0100
Stanislav Kozina  wrote:

> On 12/01/2016 05:13 AM, Don Zickus wrote:
> 
> ...
> 
> > I think GregKH pointed to one such tool, libabigail?  We are working on
> > others too.  
> 
> I should mention one of the others here:
> https://github.com/skozina/kabi-dw
> 
> It's quite comparable to libabigail in the way it works, the main 
> differences are:
>   - written in pure C
>   - depends only on elf-utils and flex/yacc
>   - it's much simpler (4k LOC)
>   - stores the type information in the text files and compares those 
> instead of directly comparing two sets of DWARF data

Now this seems much better for distro ABI checking.

The next question is, do they need any kernel support for rare cases
where they do have to break the ABI of an export? Simple rename of the
function with a _v2 postfix might be enough. We could retain some per
symbol versioning in the kernel if needed, but how much would it
actually help?

Thanks,
Nick


Re: linker-tables v5 testing

2016-11-30 Thread Nicholas Piggin
On Thu, 1 Dec 2016 16:04:30 +1100
Nicholas Piggin  wrote:

> On Wed, 30 Nov 2016 19:15:27 -0800
> "Luis R. Rodriguez"  wrote:
> 
> > On Wed, Nov 30, 2016 at 6:51 PM, Nicholas Piggin  wrote:
> > > On Wed, 30 Nov 2016 18:38:16 +0100
> > > "Luis R. Rodriguez"  wrote:
> > >  
> > >> On Wed, Nov 30, 2016 at 02:09:47PM +1100, Nicholas Piggin wrote:
> 
> > >> What is wrong with that ? Separating linker table and section ranges is  
> > >
> > > It's not that you separate those, of course you need that. It's that
> > > you also separate other sections from the input section descriptions:
> > >
> > > -   *(.text.hot .text .text.fixup .text.unlikely)   \
> > > +   *(.text.hot .text)  \
> > > +   *(SORT(.text.rng.*))\
> > > +   *(.text.fixup .text.unlikely)   \

Ahh, you're doing it to avoid clash with compiler generated sections.
The usual way to cope with that seems to be to use two dots for your name.
.text..rng.*


Re: linker-tables v5 testing

2016-11-30 Thread Nicholas Piggin
On Wed, 30 Nov 2016 19:15:27 -0800
"Luis R. Rodriguez"  wrote:

> On Wed, Nov 30, 2016 at 6:51 PM, Nicholas Piggin  wrote:
> > On Wed, 30 Nov 2016 18:38:16 +0100
> > "Luis R. Rodriguez"  wrote:
> >  
> >> On Wed, Nov 30, 2016 at 02:09:47PM +1100, Nicholas Piggin wrote:

> >> What is wrong with that ? Separating linker table and section ranges is  
> >
> > It's not that you separate those, of course you need that. It's that
> > you also separate other sections from the input section descriptions:
> >
> > -   *(.text.hot .text .text.fixup .text.unlikely)   \
> > +   *(.text.hot .text)  \
> > +   *(SORT(.text.rng.*))\
> > +   *(.text.fixup .text.unlikely)   \
> >
> > [snip]  
> 
> And ?

Umm.. don't remember :)


> >> >   If we have an array of pointers and size, it's trivial C code to 
> >> > iterate over
> >> >   it. If it needs to have a set of LINKTABLE accessors over the top of it
> >> >   for this use case, then that would seem to be a failure of the 
> >> > underlying
> >> >   API, no?  
> >>
> >> Still did not get it.  
> >
> > Well fundamentally the linker table is just a way to declare some type of
> > array that any code can add some elements to, right?  
> 
> Well in particular to a special section, and we're providing standard
> easy way to do this without any hacky linker script or assembly.

Right.

> > The non-C aspect of it
> > is this ability of producers to be decentralized.  
> 
> Not sure what you mean here, we only do a little bit of linker table
> magic, tweaks and then provide helpers.

I mean that you don't need some LINKTABLE_FOREACH accessor for it,
because from the consumer point of view, it's a simple array. It can
easily be handled by plain C. All we need is to get the array base
and size.


> > The consumer is not really different from any other array though. They just
> > want to know the address and the number of elements, and that's all. You can
> > provide that with two macros and don't really need macros like for each, run
> > all, etc because it's just simple C iteration over an array.  
> 
> I added those at the request of hpa, and one of the reasons was that
> we tend to historically use these "arrays" in special ways all over
> the kernel. But these arrays are very special, they are not typical
> arrays. This makes emphasis on its use. They are also needed given the
> special formatting we have for start / end of these in a generic form.
> This should all help consolidate this and make it easier to use in a
> generic form. It should also help avoid mismatch use when the intent
> was a section range and we're on a linker table, or the other way
> around.

I disagree. Making a simple iterator for array, or hiding a simple
function call behind it, really doesn't buy you anything. It's not
like linked list for example, where the result of the iterators is
much nicer than the open coded C.


> >> > - Is it really important to be able to add new allocators without 
> >> > modifying
> >> >   a central file for the linker script? Yes it's a benefit, but is it 
> >> > enough
> >> >   to justify the complexity?  
> >>
> >> If by allocators you mean the ability to add new entries into sections 
> >> easily
> >> without having to modify the linker script -- then my answer is:  
> >
> > Yes, but after reading a little closer it may not really be a problem I
> > first thought. Thanks for providing the detailed points.  
> 
> OK. So just a bit of bike shedding ?

No I just misread how a part of it was implemented.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-30 Thread Nicholas Piggin
On Wed, 30 Nov 2016 23:13:25 -0500
Don Zickus  wrote:

> On Wed, Nov 30, 2016 at 10:40:02AM -0800, Linus Torvalds wrote:
> > On Wed, Nov 30, 2016 at 10:18 AM, Nicholas Piggin  
> > wrote:  
> > >
> > > Here's an initial rough hack at removing modversions. It gives an idea
> > > of the complexity we're carrying for this feature (keeping in mind most
> > > of the lines removed are generated parser).  
> > 
> > You definitely don't have to try to convince me. We've had many issues
> > with modversions over the years. This was just the "last drop" as far
> > as I'm concerned, we've had random odd crc generation failures due to
> > some build races too.
> >   
> > > In its place I just added a simple config option to override vermagic
> > > so distros can manage it entirely themselves.  
> > 
> > So at least Fedora doesn't even enable CONFIG_MODVERSIONS as-is. I'm
> > _hoping_ it's just Debian that wants this, and we'd need to get some
> > input from the Debian people whether that "control vermagic" is
> > sufficient? I suspect it isn't, but I can't come up with any simple
> > alternate model either..  
> 
> Oddly, I just posted a patch to enable this for Fedora and then someone
> pointed me at this thread. :-/
> 
> Sorry for chiming in late, but yes RHEL is a big user of MODVERSIONS for our
> kabi protection work.  Despite our best intentions we still have lots of
> partners and customers that provide value-add out-of-tree drivers to their
> customers.  These module builders requested we have a mechanism to allow
> rolling modules forward for each of our minor RHEL updates without breaking
> their drivers.
> 
> They requested this to save time and money on rebuilding and retesting.  It
> also helps deal with situations where RHEL puts out a security fix or new
> minor release and the provider of OOT driver has not released the
> appropriate update.  Customers like the ability to roll their special
> drivers forward quickly to their schedule.
> 
> Now we don't protect every symbol, just a select few that our meets our
> customers needs (and developers willing to support it).
> 
> Anyway, MODVERSIONS is our way of protecting our kabi for the last 10 years.
> It isn't perfect and we have fixed the genksyms tool over the years, but so
> far it mostly works fine.

Okay. It would be good to get all the distros in on this.

What I want to do is work out exactly what it is that modversions is
giving you.

We know it's fairly nasty code to maintain and it does not detect ABI
changes very well. But it's not such a burden that we can't maintain
it if there are good reasons to keep it.

> I am not sure what 'control vermagic' is, but it sounds like a string check,
> which won't protect against the boatload of backports we do to structs,
> enums, and functions.

Basically vermagic is the string all modules and the kernel get, which
must match in order to load modules. If you have modversions disabled,
then vermagic includes the kernel version. If modversions is enabled,
then vermagic does not include the kernel version but the CRCs have to
also match.

Controlling it explicitly is just a couple of lines where a distro can
control it (so they can update their kernel version without breaking).
It's not meant to solve everything, just the first one.
 
> Currently we are exploring various ways to get smarter here.  The genksyms
> tool has its limitations and handling kabi hacks in RHEL is getting
> tiresome.
> 
> I think GregKH pointed to one such tool, libabigail?  We are working on
> others too.
> 
> 
> Circling back to enabling MODVERSIONS in Fedora, that was to start the
> process of syncing Fedora with RHEL stuff in preparation for smarter tools.
> 
> 
> If you take away MODVERSIONS, that would put a damper in our work, but
> easily carried privately (much like MODSIGNING for 8 years until it went
> upstream :-) ).

I don't think that's necessary. A feature requirement for a distro is just
as valid as any other user of upstream. I don't want to hinder any distro,
I'm just still not quite seeing the big picture of exactly what functionality
you need from the kernel.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-30 Thread Nicholas Piggin
On Thu, 01 Dec 2016 02:35:54 +
Ben Hutchings  wrote:

> On Thu, 2016-12-01 at 12:55 +1100, Nicholas Piggin wrote:
> > On Wed, 30 Nov 2016 21:33:01 +  
> > > Ben Hutchings  wrote:  
> >   
> > > On Wed, 2016-11-30 at 10:40 -0800, Linus Torvalds wrote:  
> > > > > On Wed, Nov 30, 2016 at 10:18 AM, Nicholas Piggin  
> > > > > wrote:
> > > > > 
> > > > > Here's an initial rough hack at removing modversions. It gives an idea
> > > > > of the complexity we're carrying for this feature (keeping in mind 
> > > > > most
> > > > > of the lines removed are generated parser).    
> > > > 
> > > > You definitely don't have to try to convince me. We've had many issues
> > > > with modversions over the years. This was just the "last drop" as far
> > > > as I'm concerned, we've had random odd crc generation failures due to
> > > > some build races too.
> > > >     
> > > > > In its place I just added a simple config option to override vermagic
> > > > > so distros can manage it entirely themselves.    
> > > > 
> > > > So at least Fedora doesn't even enable CONFIG_MODVERSIONS as-is. I'm
> > > > _hoping_ it's just Debian that wants this,    
> > > 
> > > The last time I looked, RHEL and SLE did.  They change the release
> > > string for each new kernel version, but they will copy/link old out-of-
> > > tree modules into the new version's "weak-updates" module subdirectory
> > > if the symbol versions still match.
> > >   
> > > > and we'd need to get some
> > > > input from the Debian people whether that "control vermagic" is
> > > > sufficient? I suspect it isn't, but I can't come up with any simple
> > > > alternate model either..    
> > > 
> > > Allowing the vermagic to be changed separately doesn't help us, as we
> > > already control the release string.  If we were to change some of the
> > > module tools to consider vermagic then it would allow us to report the
> > > full version in the release string while not forcing rebuilds on every
> > > kernel upgrade - but that's not a pressing problem.  
> > 
> > Okay, but existing modversions AFAIKS does not solve your problems described
> > in yor your earlier mail either. Modversions hardly catches ABI breakage at
> > all, you can't rely on it that way. It's far more likely that some structure
> > size changes deep in the kernel than an exported function type signature
> > changes.  
> 
> As I understand it, genksyms incorporates the definitions of a
> function's parameter and return types - not just their names - and all
> the types they refer to, recursively.  So a structure size change
> should change the version of all functions where the function and its
> caller pass that structure between them, however indirectly.  It finds
> such indirect ABI breakage for me fairly regularly, though of course I
> don't know that it finds everything.

It is only the type name.

Not only that but even if you did extend it further to structure type
arrangement then you still have to deal with other structures followed
via pointers. Or (rarer but not unheard of):

- changes to structures without changes of the types of their members
- changes to arguments without changes of their type
- changes to semantics of functions
- data structures derived in ways other than exported symbols, e.g.,
  fixed register for `current` on some archs

This is actually a big problem with it, that it provides a false sense
of security. It simply can't be used to verify your ABI stability.

 [Aside: something like the tool Greg linked earlier,

 
https://kernel-recipes.org/en/2016/talks/would-an-abi-changes-visualization-tool-be-useful-to-linux-kernel-maintenance/

 Would be great if that worked with the kernel. Not necessarily as part
 of the build system, but at least a tool that distros could use to
 analyze ABI changes. It wouldn't catch everything, but it would be far
 better than modversions.]

> > I'm not sure how you know which exports are used only by in-tree modules
> > and which are used out of tree, but if you know that then you can version
> > them manually as we said by adding _v2 in the rare case you need to change
> > a behaviour.  
> 
> That's fine for individual functions.
> 
> > So I'm still having trouble understanding what modversions is giving you.  
> 
> Where there is a family of driver modules (

Re: linker-tables v5 testing

2016-11-30 Thread Nicholas Piggin
On Wed, 30 Nov 2016 18:38:16 +0100
"Luis R. Rodriguez"  wrote:

> On Wed, Nov 30, 2016 at 02:09:47PM +1100, Nicholas Piggin wrote:
> > On Wed, 30 Nov 2016 02:33:49 +0100
> > "Luis R. Rodriguez"  wrote:
> >   
> > > On Thu, Nov 24, 2016 at 08:18:40AM -0800, Guenter Roeck wrote:  
> > > > Hi Luis,
> > > > 
> > > > On 11/23/2016 08:11 PM, Luis R. Rodriguez wrote:
> > > > > Guenter,
> > > > > 
> > > > > I think I'm ready to start pushing a new patch set out for review.
> > > > > Before I do that -- can I trouble you for letting your test
> > > > > infrastructure hammer it? I'll only push out the patches for review
> > > > > based on this new set of changes once all tests come back OK for all
> > > > > architectures.
> > > > > 
> > > > > https://git.kernel.org/cgit/linux/kernel/git/mcgrof/linux-next.git/log/?h=20161117-linker-tables-v5
> > > > > 
> > > > > Fenguang & Guenter,
> > > > > 
> > > > > Any chance I can trouble you to enable the new driver:
> > > > > CONFIG_TEST_LINKTABLES=y on each kernel configuration as it will run a
> > > > > test driver which will WARN_ON() if it finds any errors.
> > > > > 
> > > > 
> > > > I see a number of compile failures as well as some crashes in your test 
> > > > driver.
> > > > Please have a look. http://kerneltests.org/builders, column 'testing'.
> > > > 
> > > 
> > > Thanks! I believe I have addressed most issues.. Any chance I can have 
> > > you re-try with
> > > this new branch:
> > > 
> > > https://git.kernel.org/cgit/linux/kernel/git/mcgrof/linux-next.git/log/?h=20161117-linker-tables-v5-try2
> > >   
> > 
> > Few minor things:
> > 
> > - Can module-common.lds be generated with standard cpp_lds_S?  
> 
> I actually no longer have a need to have module-common.lds generated given
> I no longer am using macro wrappers or helpers. IMHO it would be still be
> useful to have this generated but I think we can do this separately so
> I can drop these patches and we can address this after?

Okay.

> 
> > - Breaking up the input section descriptions like that in the linker
> >   script is not what we want AFAIKS. It forces the linker to put them
> >   in different locations.  
> 
> What is wrong with that ? Separating linker table and section ranges is

It's not that you separate those, of course you need that. It's that
you also separate other sections from the input section descriptions:

-   *(.text.hot .text .text.fixup .text.unlikely)   \
+   *(.text.hot .text)  \
+   *(SORT(.text.rng.*))\
+   *(.text.fixup .text.unlikely)   \

[snip]

> 
> > Broader issues:
> > 
> > - I still think calling these "sections" and "de facto Linux sections"
> >   etc. is a bit confusing, especially because assembler sections are
> >   also involved. Region, array, blob, anything. Then you get "section
> >   ranges" and "linker tables". Fundamentally all it is is a linear memory
> >   allocator for static data which is decentralized in the source code
> >   (as opposed to centralized which would just be `u8 mydata[size]`).  
> 
> You are right, sorry I had not updated that part of the documentation yet.
> I had a big nose dive into the ELF specifications after I wrote that
> documentation and as per ELF specs all we have here are sections using
> SHT_PROGBITS (@progbits on most architectures, on ARM this is %progbits)
> and then we have "Special sections" [0] (one is .init for example). The
> SHT_PROGBITS is a section type which can be used on programs to specify
> a section is custom and its use is defined by the program itself. Now,
> some of the "Special sections" though also have SHT_PROGBITS -- and because
> of this it implicates programs can further customize these sections. This
> is precisely why we can customize .init further on Linux for instance.
> 
> I think just calling them Linux sections suffices then. Thoughts?

I suppose. I don't have a much better idea so I'll end the bikeshedding
it. Oh just one last splash of paint: the other thing is "linker tables".
If you call the general concept a linux section, then can linker tables
be linux section tables, or linux section arrays?

> BTW 

Re: linux-next: build warning after merge of the kbuild tree

2016-11-30 Thread Nicholas Piggin
On Thu, 1 Dec 2016 10:06:08 +1100
Stephen Rothwell  wrote:

> Hi Michal,
> 
> After merging the kbuild tree, today's linux-next build (x86_64
> allmodconfig) produced this warning:
> 
> WARNING: EXPORT symbol "__sw_hweight32" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "memcpy_mcsafe_unrolled" [vmlinux] version generation 
> failed, symbol will not be versioned.
> WARNING: EXPORT symbol "empty_zero_page" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "__get_user_2" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "__put_user_2" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "__fentry__" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "native_load_gs_index" [vmlinux] version generation 
> failed, symbol will not be versioned.
> WARNING: EXPORT symbol "__copy_user_nocache" [vmlinux] version generation 
> failed, symbol will not be versioned.
> WARNING: EXPORT symbol "copy_user_generic_unrolled" [vmlinux] version 
> generation failed, symbol will not be versioned.
> WARNING: EXPORT symbol "memset" [vmlinux] version generation failed, symbol 
> will not be versioned.
> WARNING: EXPORT symbol "_copy_to_user" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "__sw_hweight64" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "__put_user_1" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "__memcpy" [vmlinux] version generation failed, symbol 
> will not be versioned.
> WARNING: EXPORT symbol "memmove" [vmlinux] version generation failed, symbol 
> will not be versioned.
> WARNING: EXPORT symbol "__put_user_8" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "copy_user_generic_string" [vmlinux] version 
> generation failed, symbol will not be versioned.
> WARNING: EXPORT symbol "__get_user_4" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "clear_page" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "__memset" [vmlinux] version generation failed, symbol 
> will not be versioned.
> WARNING: EXPORT symbol "__memmove" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "copy_user_enhanced_fast_string" [vmlinux] version 
> generation failed, symbol will not be versioned.
> WARNING: EXPORT symbol "_copy_from_user" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "__get_user_1" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "phys_base" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "__get_user_8" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "copy_page" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "__put_user_4" [vmlinux] version generation failed, 
> symbol will not be versioned.
> WARNING: EXPORT symbol "memcpy" [vmlinux] version generation failed, symbol 
> will not be versioned.
> 
> Introduced by commit
> 
>   d8c1eb86e952 ("kbuild: modpost warn if export version crc is missing")
> 

These are existing problems caught by the new warning. The modversions saga
has another chapter left, but one way or another we'll get rid of the warnings
for 4.10.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-30 Thread Nicholas Piggin
On Wed, 30 Nov 2016 21:33:01 +
Ben Hutchings  wrote:

> On Wed, 2016-11-30 at 10:40 -0800, Linus Torvalds wrote:
> > > On Wed, Nov 30, 2016 at 10:18 AM, Nicholas Piggin  
> > > wrote:
> > > 
> > > Here's an initial rough hack at removing modversions. It gives an idea
> > > of the complexity we're carrying for this feature (keeping in mind most
> > > of the lines removed are generated parser).  
> > 
> > You definitely don't have to try to convince me. We've had many issues
> > with modversions over the years. This was just the "last drop" as far
> > as I'm concerned, we've had random odd crc generation failures due to
> > some build races too.
> >   
> > > In its place I just added a simple config option to override vermagic
> > > so distros can manage it entirely themselves.  
> > 
> > So at least Fedora doesn't even enable CONFIG_MODVERSIONS as-is. I'm
> > _hoping_ it's just Debian that wants this,  
> 
> The last time I looked, RHEL and SLE did.  They change the release
> string for each new kernel version, but they will copy/link old out-of-
> tree modules into the new version's "weak-updates" module subdirectory
> if the symbol versions still match.
> 
> > and we'd need to get some
> > input from the Debian people whether that "control vermagic" is
> > sufficient? I suspect it isn't, but I can't come up with any simple
> > alternate model either..  
> 
> Allowing the vermagic to be changed separately doesn't help us, as we
> already control the release string.  If we were to change some of the
> module tools to consider vermagic then it would allow us to report the
> full version in the release string while not forcing rebuilds on every
> kernel upgrade - but that's not a pressing problem.

Okay, but existing modversions AFAIKS does not solve your problems described
in yor your earlier mail either. Modversions hardly catches ABI breakage at
all, you can't rely on it that way. It's far more likely that some structure
size changes deep in the kernel than an exported function type signature
changes.

I'm not sure how you know which exports are used only by in-tree modules
and which are used out of tree, but if you know that then you can version
them manually as we said by adding _v2 in the rare case you need to change
a behaviour.

So I'm still having trouble understanding what modversions is giving you.

> One thing that could work for us would be:
> 
> - Stricter version matching for in-tree modules (maybe some extra
>   part in vermagic that is skipped for out-of-tree modules)
> - Ability to blacklist use of a symbol, or all symbols in a module,
>   by out-of-tree modules
> 
> where the blacklist would be a matter of distribution policy.  But this
> would still require a fair amount of work by someone, and I doubt you'd
> want this upstream.

I don't think people are adverse to carrying some upstream complexity for
ditsros. Although for this fancy blacklist case, can it just be done in
userspace?

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-30 Thread Nicholas Piggin
On Tue, 29 Nov 2016 12:35:57 -0800
Linus Torvalds  wrote:

> On Tue, Nov 29, 2016 at 11:57 AM, Ben Hutchings  wrote:
> >
> > If the modversion is missing then the fallback should be to a full
> > vermagic match, i.e. including the release string.  Something like
> > this (untested):  
> 
> This really seems way too complicated for this situation.
> 
> And it's wrong too. The whole point of modversions was that you didn't
> want to do the full version check.
> 
> We already know there were *some* crc's (we checked that at the top of
> check_version(), but we've also checked it in "same_magic()" - it's
> what makes us ignore the exact version number), but this particular
> symbol doesn't have a crc. Just let it through, because we have bugs
> in binutils.
> 
> So your extra complexity logic seems actively wrong. It makes
> MODVERSIONS not work at all, rather than limp along. You're better off
> just not having MODVERSIONS.

Here's an initial rough hack at removing modversions. It gives an idea
of the complexity we're carrying for this feature (keeping in mind most
of the lines removed are generated parser).

Note I removed the `rm -r scripts/genksyms` hunks to reduce mail size.

In its place I just added a simple config option to override vermagic
so distros can manage it entirely themselves.

Thanks,
Nick

---
 .gitignore   |1 -
 Documentation/dontdiff   |2 -
 Documentation/kbuild/modules.txt |4 -
 Makefile |7 -
 arch/arm64/include/asm/module.h  |4 -
 arch/avr32/boot/images/Makefile  |2 -
 arch/powerpc/include/asm/module.h|4 -
 arch/powerpc/kernel/module_64.c  |   24 +-
 arch/x86/tools/relocs.c  |1 -
 drivers/char/ipmi/ipmi_ssif.c|5 -
 drivers/firmware/efi/libstub/Makefile|4 +-
 drivers/message/fusion/mptlan.h  |3 -
 include/asm-generic/export.h |8 -
 include/asm-generic/vmlinux.lds.h|   35 -
 include/linux/export.h   |   17 +-
 include/linux/module.h   |   12 -
 include/linux/vermagic.h |   12 +-
 init/Kconfig |   28 +-
 kernel/module.c  |  213 +--
 scripts/Makefile |1 -
 scripts/Makefile.build   |  113 --
 scripts/Makefile.lib |3 +-
 scripts/Makefile.modpost |2 +-
 scripts/adjust_autoksyms.sh  |5 -
 scripts/export_report.pl |   30 +-
 scripts/genksyms/.gitignore  |5 -
 scripts/genksyms/Makefile|   14 -
 scripts/genksyms/genksyms.c  |  880 ---
 scripts/genksyms/genksyms.h  |   94 --
 scripts/genksyms/keywords.gperf  |   60 -
 scripts/genksyms/keywords.hash.c_shipped |  229 ---
 scripts/genksyms/lex.l   |  481 --
 scripts/genksyms/lex.lex.c_shipped   | 2291 
 scripts/genksyms/parse.tab.c_shipped | 2396 --
 scripts/genksyms/parse.tab.h_shipped |  118 --
 scripts/genksyms/parse.y |  513 ---
 scripts/mksysmap |6 +-
 scripts/mod/modpost.c|  122 +-
 scripts/module-common.lds|5 -
 scripts/namespace.pl |2 -
 40 files changed, 60 insertions(+), 7696 deletions(-)
 delete mode 100644 scripts/genksyms/.gitignore
 delete mode 100644 scripts/genksyms/Makefile
 delete mode 100644 scripts/genksyms/genksyms.c
 delete mode 100644 scripts/genksyms/genksyms.h
 delete mode 100644 scripts/genksyms/keywords.gperf
 delete mode 100644 scripts/genksyms/keywords.hash.c_shipped
 delete mode 100644 scripts/genksyms/lex.l
 delete mode 100644 scripts/genksyms/lex.lex.c_shipped
 delete mode 100644 scripts/genksyms/parse.tab.c_shipped
 delete mode 100644 scripts/genksyms/parse.tab.h_shipped
 delete mode 100644 scripts/genksyms/parse.y

diff --git a/.gitignore b/.gitignore
index c2ed4ec..cde8773 100644
--- a/.gitignore
+++ b/.gitignore
@@ -20,7 +20,6 @@
 *.mod.c
 *.i
 *.lst
-*.symtypes
 *.order
 *.elf
 *.bin
diff --git a/Documentation/dontdiff b/Documentation/dontdiff
index 5385cba..40eb3e0 100644
--- a/Documentation/dontdiff
+++ b/Documentation/dontdiff
@@ -47,7 +47,6 @@
 *.sgml
 *.so
 *.so.dbg
-*.symtypes
 *.tab.c
 *.tab.h
 *.tex
@@ -180,7 +179,6 @@ mktree
 modpost
 modules.builtin
 modules.order
-modversions.h*
 nconf
 ncscope.*
 offset.h
diff --git a/Documentation/kbuild/modules.txt b/Documentation/kbuild/modules.txt
index 3fb39e0..1568b8a 100644
--- a/Documentation/kbuild/modules.txt
+++ b/Documentation/kbuild/modules.txt
@@ -61,10 +61,6 @@ make sure the kernel contains the information required. The 
target
 exists solely as a simple way to prepare a kernel source tree for
 building external modules.
 
-NOTE: "modules_prepare" will 

Re: linker-tables v5 testing

2016-11-29 Thread Nicholas Piggin
On Wed, 30 Nov 2016 02:33:49 +0100
"Luis R. Rodriguez"  wrote:

> On Thu, Nov 24, 2016 at 08:18:40AM -0800, Guenter Roeck wrote:
> > Hi Luis,
> > 
> > On 11/23/2016 08:11 PM, Luis R. Rodriguez wrote:  
> > > Guenter,
> > > 
> > > I think I'm ready to start pushing a new patch set out for review.
> > > Before I do that -- can I trouble you for letting your test
> > > infrastructure hammer it? I'll only push out the patches for review
> > > based on this new set of changes once all tests come back OK for all
> > > architectures.
> > > 
> > > https://git.kernel.org/cgit/linux/kernel/git/mcgrof/linux-next.git/log/?h=20161117-linker-tables-v5
> > > 
> > > Fenguang & Guenter,
> > > 
> > > Any chance I can trouble you to enable the new driver:
> > > CONFIG_TEST_LINKTABLES=y on each kernel configuration as it will run a
> > > test driver which will WARN_ON() if it finds any errors.
> > >   
> > 
> > I see a number of compile failures as well as some crashes in your test 
> > driver.
> > Please have a look. http://kerneltests.org/builders, column 'testing'.
> >   
> 
> Thanks! I believe I have addressed most issues.. Any chance I can have you 
> re-try with
> this new branch:
> 
> https://git.kernel.org/cgit/linux/kernel/git/mcgrof/linux-next.git/log/?h=20161117-linker-tables-v5-try2

Few minor things:

- Can module-common.lds be generated with standard cpp_lds_S?

- Breaking up the input section descriptions like that in the linker
  script is not what we want AFAIKS. It forces the linker to put them
  in different locations.

Broader issues:

- I still think calling these "sections" and "de facto Linux sections"
  etc. is a bit confusing, especially because assembler sections are
  also involved. Region, array, blob, anything. Then you get "section
  ranges" and "linker tables". Fundamentally all it is is a linear memory
  allocator for static data which is decentralized in the source code
  (as opposed to centralized which would just be `u8 mydata[size]`).

- It would be nice to just clearly separate the memory allocator function
  from the syntatic sugar on top. Actually I think if the memory allocation
  and access functions are clean enough, you don't need anything more. If
  we have an array of pointers and size, it's trivial C code to iterate over
  it. If it needs to have a set of LINKTABLE accessors over the top of it
  for this use case, then that would seem to be a failure of the underlying
  API, no?

- Is it really important to be able to add new allocators without modifying
  a central file for the linker script? Yes it's a benefit, but is it enough
  to justify the complexity?

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-28 Thread Nicholas Piggin
On Tue, 29 Nov 2016 01:15:48 +
Ben Hutchings  wrote:

> [I've had to guess at the cc list for this, because we no longer have
> mail archives that preserve them.]

You got it about right.

> On Fri, 2016-11-25 at 10:01 -0800, Linus Torvalds wrote:
> > On Thu, Nov 24, 2016 at 4:40 PM, Nicholas Piggin  wrote: 
> >  
> > > > 
> > > > Yes, manual "marking" is never going to be a viable solution.  
> > > 
> > > I guess it really depends on how exactly you want to use it. For distros
> > > that do stable ABI but rarely may have to break something for security
> > > reasons, it should work and give exact control.  
> 
> This is roughly how Debian handles the kernel module ABI during a
> stable release.
> 
> > No. Because nobody else will care, so unless it's like a single symbol
> > or something, it will just be a maintenance nightmare.  
> 
> I agree with this.  We can explicitly "version" individual symbols
> anyway by doing something like:
> 
> -int foo(void);
> +#define foo foo_2
> +int foo_2(int);

Yeah... Benefit being it's very simple and everybody can see exactly
what it does and knows how it will work.

> 
> > > What else do people *actually* use it for? Preventing mismatched modules
> > > when .git version is not attached and release version of the kernel has
> > > not been bumped. Is that it?  
> > 
> > It used to be very useful for avoiding loading stale modules and then
> > wasting days on debugging something that wasn't the case when you had
> > forgotten to do "make modules_install". Change some subtle internal
> > ABI issue (add/remove a parameter, whatever) and it would really help.
> > 
> > These days, for me, LOCALVERSION_AUTO and module signing are what I
> > personally tend to use.
> >
> > The modversions stuff may just be too painful to bother with. Very few
> > people probably use it, and the ones that do likely don't have any
> > overriding reason why.  
> [...]
> 
> Debian has some strong reasons:
> 
> 1. Changing the release string requires any out-of-tree modules to be
> upgraded (at least rebuilt) on end-user systems.  So we try to avoid
> doing that during the lifetime of a stable release, i.e. we don't let
> the release string change.  Also, the release string is reflected in
> package names (e.g. linux-image-4.8.0-1-amd64), and introducing new
> package names requires manual approval by the Debian archive team.

This is something I've noticed. Would it be better if the module loader
ignores the kernel version and instead used some internal ABI version
string to check against? Otherwise (AFAICT) you always have 4.8.0 versions
despite being 4.8.7 kernel, and you can't upgrade a point release without
overwriting your old kernel and modules.

That is something we could potentially replace modversions with. It would
be a far more reasonable complexity to carry for downstream distros than
modversions. Though not something we can add for 4.9.

> 2. We want to allow ABI breaks for "internal" symbols used only by in-
> tree modules, as those breaks will be resolved by rebooting to complete
> the upgrade.  But we need a run-time check to prevent loading an
> incompatible module before the reboot.
> 
> 3. So far as I can see, module signing doesn't work for a distribution
> kernel with out-of-tree modules as there has to be a trust path from a
> built-in certificate to the module signing certificate.  So signature
> enforcement will have to be disabled on systems that use out-of-tree
> modules, thus it's not a substitute for modversions.
> 
> We expect Linux 4.9 to be the basis for a longterm stable branch and on
> that basis intend to include it in the next Debian stable release. 
> Even if the decision is to get rid of modversions, it would be very
> helpful if they could be revived for 4.9 so that we have some time to
> adapt our packaging practices to work without them in future.

It would be nice to get upstream to the point where 4.9 modversions
works if you just patch out depends BROKEN. That would require reverting
a few more of Al's arch patches.

Then in 4.10 we can re-add all those arch patches (which are less
controversial without the asm-prototypes.h workaround), and implement a
simple stable ABI version string check, and then in 4.11 we can remove
modversions.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-25 Thread Nicholas Piggin
On Fri, 25 Nov 2016 10:00:46 -0800
Linus Torvalds  wrote:

> On Thu, Nov 24, 2016 at 4:40 PM, Nicholas Piggin  wrote:
> >>
> >> Yes, manual "marking" is never going to be a viable solution.  
> >
> > I guess it really depends on how exactly you want to use it. For distros
> > that do stable ABI but rarely may have to break something for security
> > reasons, it should work and give exact control.  
> 
> No. Because nobody else will care, so unless it's like a single symbol
> or something, it will just be a maintenance nightmare.

Yeah that's true, and as I realized a distro can rename a symbol if they
make incompatible changes which happens very rarely. Avoids having to
carry some whole infrastructure upstream for it.

> 
> > What else do people *actually* use it for? Preventing mismatched modules
> > when .git version is not attached and release version of the kernel has
> > not been bumped. Is that it?  
> 
> It used to be very useful for avoiding loading stale modules and then
> wasting days on debugging something that wasn't the case when you had
> forgotten to do "make modules_install". Change some subtle internal
> ABI issue (add/remove a parameter, whatever) and it would really help.
> 
> These days, for me, LOCALVERSION_AUTO and module signing are what I
> personally tend to use.
> 
> The modversions stuff may just be too painful to bother with. Very few
> people probably use it, and the ones that do likely don't have any
> overriding reason why.
> 
> So I'd personally be ok with just saying "let's disable it for now",
> and see if anybody even notices and cares, and then has a good enough
> explanation of why. It's entirely possible that most users are "I
> enabled it ten years ago, I didn't even realize it was still in my
> defconfig".

That sounds good. Should we try to get 4.9 working (which we could
do relatively easily with a few arch reverts), and then disable
modversions for 4.10? (at which point we can un-revert Al's arch
patches)

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-24 Thread Nicholas Piggin
On Thu, 24 Nov 2016 16:24:10 +0100
Greg Kroah-Hartman  wrote:

> On Thu, Nov 24, 2016 at 09:31:52PM +1100, Nicholas Piggin wrote:
> > On Thu, 24 Nov 2016 10:56:22 +0100
> > Greg Kroah-Hartman  wrote:
> >   
> > > On Thu, Nov 24, 2016 at 06:53:22PM +1100, Nicholas Piggin wrote:  
> > > > On Thu, 24 Nov 2016 08:36:39 +0100
> > > > Greg Kroah-Hartman  wrote:
> > > > 
> > > > > On Thu, Nov 24, 2016 at 06:20:26PM +1100, Nicholas Piggin wrote:
> > > > > > But still, modversions is pretty complicated for what it gives us. 
> > > > > > It sends
> > > > > > preprocessed C into a C parser that makes CRCs using type 
> > > > > > definitions of
> > > > > > exported symbols, then turns those CRCs into a linker script which 
> > > > > > which is
> > > > > > used to link the .o file with. What we get in return is a quite 
> > > > > > limited and
> > > > > > symbol "versioning" system.
> > > > > > 
> > > > > > What if we ripped all that out and just attached an explicit 
> > > > > > version to
> > > > > > each export, and incompatible changes require an increment?  
> > > > > 
> > > > > How would that work for structures?  Would that be required for every
> > > > > EXPORT_SYMBOL* somehow?
> > > > 
> > > > Yeah just have EXPORT_SYMBOL take another parameter which attaches a 
> > > > version
> > > > number and use that as the value for the __crc_ symbol versions rather 
> > > > than
> > > > a calculated CRC.
> > > > 
> > > > Yes it would require some level of care from developers and may be a 
> > > > small
> > > > annoyance when changing exports. But making people think a tiny bit more
> > > > before chnaging exported ABI shouldn't be the end of the world.
> > > 
> > > That wouldn't work at all for structures that change, as we never
> > > explicitly "mark" them for export anywhere.  
> > 
> > Well, the module arrives at the objects one way or another via an exported
> > symbol. Although it can be by following a lot of pointers so yes it's
> > probably near impossible to do well.  
> 
> Yes, manual "marking" is never going to be a viable solution.

I guess it really depends on how exactly you want to use it. For distros
that do stable ABI but rarely may have to break something for security
reasons, it should work and give exact control.

What else do people *actually* use it for? Preventing mismatched modules
when .git version is not attached and release version of the kernel has
not been bumped. Is that it?

> > >  You need a tool that looks
> > > at either the source code (what we have today), or looks at the  
> > 
> > What we have today only looks at the type of the exported function or
> > variable I think (or does it? I didn't look that far into the parser).  
> 
> It should catch things if you change a structure layout of something
> that is an argument in a function (like a pointer to a structure),
> otherwise it wouldn't really be that good of a check, and kind of
> useless.
>
> > Does not follow down all possible derivable pointer types.  
> 
> It should be pretty good, as I think the code is based on the old SuSE
> scripts that used to do this really well.  But it's been a long time
> since I looked at it, so I could be wrong.

Yeah... turns out modversions is in the "kind of useless" camp.

The crc is based on the name of the type and that's it (that's what I
thought, I just now verify it).

> > > > > > Google tells me
> > > > > > Linus is not a neutral bystander on the topic of symbol versioning, 
> > > > > > so I'm
> > > > > > bracing for a robust response :) (actually I don't much care either 
> > > > > > way, I'm
> > > > > > happy to put a couple of bandaids on it and keep it going)  
> > > > > 
> > > > > There are tools that people are working on to make it more obvious 
> > > > > where
> > > > > API breaks happen by looking at the .o debug data instead of our crazy
> > > > > current system (which is really better than nothing), perhaps we 
> > > > > should
> > > > > start using them instead?
> > > > > 
> > > > > See here fo

Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-24 Thread Nicholas Piggin
On Thu, 24 Nov 2016 10:56:22 +0100
Greg Kroah-Hartman  wrote:

> On Thu, Nov 24, 2016 at 06:53:22PM +1100, Nicholas Piggin wrote:
> > On Thu, 24 Nov 2016 08:36:39 +0100
> > Greg Kroah-Hartman  wrote:
> >   
> > > On Thu, Nov 24, 2016 at 06:20:26PM +1100, Nicholas Piggin wrote:  
> > > > But still, modversions is pretty complicated for what it gives us. It 
> > > > sends
> > > > preprocessed C into a C parser that makes CRCs using type definitions of
> > > > exported symbols, then turns those CRCs into a linker script which 
> > > > which is
> > > > used to link the .o file with. What we get in return is a quite limited 
> > > > and
> > > > symbol "versioning" system.
> > > > 
> > > > What if we ripped all that out and just attached an explicit version to
> > > > each export, and incompatible changes require an increment?
> > > 
> > > How would that work for structures?  Would that be required for every
> > > EXPORT_SYMBOL* somehow?  
> > 
> > Yeah just have EXPORT_SYMBOL take another parameter which attaches a version
> > number and use that as the value for the __crc_ symbol versions rather than
> > a calculated CRC.
> > 
> > Yes it would require some level of care from developers and may be a small
> > annoyance when changing exports. But making people think a tiny bit more
> > before chnaging exported ABI shouldn't be the end of the world.  
> 
> That wouldn't work at all for structures that change, as we never
> explicitly "mark" them for export anywhere.

Well, the module arrives at the objects one way or another via an exported
symbol. Although it can be by following a lot of pointers so yes it's
probably near impossible to do well.

>  You need a tool that looks
> at either the source code (what we have today), or looks at the

What we have today only looks at the type of the exported function or
variable I think (or does it? I didn't look that far into the parser).
Does not follow down all possible derivable pointer types.

> object/debugging code (like the link I pointed at.)
> 
> > > > Google tells me
> > > > Linus is not a neutral bystander on the topic of symbol versioning, so 
> > > > I'm
> > > > bracing for a robust response :) (actually I don't much care either 
> > > > way, I'm
> > > > happy to put a couple of bandaids on it and keep it going)
> > > 
> > > There are tools that people are working on to make it more obvious where
> > > API breaks happen by looking at the .o debug data instead of our crazy
> > > current system (which is really better than nothing), perhaps we should
> > > start using them instead?
> > > 
> > > See here for more details about this:
> > >   
> > > https://kernel-recipes.org/en/2016/talks/would-an-abi-changes-visualization-tool-be-useful-to-linux-kernel-maintenance/
> > >   
> > 
> > Hmm. I guess it's basically similar to modversions, so has downsides of not
> > detecting a semantic change unless it changes the type. But still, if we 
> > could
> > replace our custom code with a tool like this for modversions functionality,
> > that alone would be a massive improvement. But requiring debug info might be
> > a bit of a show stopper. I also don't know if that would handle asm 
> > functions.  
> 
> I think we can live without asm functions changing their arguments as
> that is usually very rare.  And maybe debugging info being a requirement
> for those that want modversions (i.e. the distros), is ok as they
> already generate that as part of their build.

Maybe. I'd like to know how people really care about it. Linus post from

http://yarchive.net/comp/linux/modversions.html

Seem to be that he just likes it to prevent module loading if the git version
is not available. Fair usage, but could we do better with less effort? Maybe
ship with a source version that can do the same job. If you take care of that
case, then what is left?

> 
> But more importantly, that's a much longer-term solution, fixing what we
> have today to at least start working again is much more important before
> we start bikeshedding the whole mess :)

Oh yeah, we're fixing it. I just thought I'd bring it up since I have a few
important ears :)

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-24 Thread Nicholas Piggin
On Thu, 24 Nov 2016 10:32:12 +0100
Michal Marek  wrote:

> On 2016-11-24 08:53, Nicholas Piggin wrote:
> > On Thu, 24 Nov 2016 08:36:39 +0100
> > Greg Kroah-Hartman  wrote:
> >   
> >> On Thu, Nov 24, 2016 at 06:20:26PM +1100, Nicholas Piggin wrote:  
> >>> But still, modversions is pretty complicated for what it gives us. It 
> >>> sends
> >>> preprocessed C into a C parser that makes CRCs using type definitions of
> >>> exported symbols, then turns those CRCs into a linker script which which 
> >>> is
> >>> used to link the .o file with. What we get in return is a quite limited 
> >>> and
> >>> symbol "versioning" system.
> >>>
> >>> What if we ripped all that out and just attached an explicit version to
> >>> each export, and incompatible changes require an increment?
> >>
> >> How would that work for structures?  Would that be required for every
> >> EXPORT_SYMBOL* somehow?  
> > 
> > Yeah just have EXPORT_SYMBOL take another parameter which attaches a version
> > number and use that as the value for the __crc_ symbol versions rather than
> > a calculated CRC.  
> 
> The problem is that with every kernel release, the structures change in
> a way that you would have to bump the version of virtually every export.
>
> At which point, there would be little difference between
> CONFIG_MODVERSION on and off (without CONFIG_MODVERSION, we compare the
> kernel version strings when loading modules).


I'm not sure about that. If they are truly incompatible changes and
MODVERSIONS does not pick up a different CRC, then it's even worse if
incompatible modules are missed so often.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-24 Thread Nicholas Piggin
On Thu, 24 Nov 2016 10:38:04 +0100
Arnd Bergmann  wrote:

> On Thursday, November 24, 2016 6:53:22 PM CET Nicholas Piggin wrote:
> > >   
> > > > Google tells me
> > > > Linus is not a neutral bystander on the topic of symbol versioning, so 
> > > > I'm
> > > > bracing for a robust response  (actually I don't much care either way, 
> > > > I'm
> > > > happy to put a couple of bandaids on it and keep it going)
> > > 
> > > There are tools that people are working on to make it more obvious where
> > > API breaks happen by looking at the .o debug data instead of our crazy
> > > current system (which is really better than nothing), perhaps we should
> > > start using them instead?
> > > 
> > > See here for more details about this:
> > >   
> > > https://kernel-recipes.org/en/2016/talks/would-an-abi-changes-visualization-tool-be-useful-to-linux-kernel-maintenance/
> > >   
> > 
> > Hmm. I guess it's basically similar to modversions, so has downsides of not
> > detecting a semantic change unless it changes the type. But still, if we 
> > could
> > replace our custom code with a tool like this for modversions functionality,
> > that alone would be a massive improvement. But requiring debug info might be
> > a bit of a show stopper. I also don't know if that would handle asm 
> > functions.  
> 
> It's certainly not an option for v4.9 at this point. There is also no

Yeah I wasn't suggesting that for 4.9, or 4.10 even.

> realistic way we can get a correct asm/asm-prototypes.h for all the
> other architectures in place. At the moment, powerpc is the only
> one that actually works with modversions.
> 
> We can either make CONFIG_MODVERSIONS a per-architecture opt-in
> and let only the ones that have the header file select that, or
> revert all of Al's original patches that moved the exports.

alpha, m68k, s390, sparc, ia64 are affected and have no patch (or none
I've been cc'ed on) for asm-prototypes.h. Doesn't seem infeasible to add
them before 4.9, considering the fairly low risk of patch. Opt-in would
be pointless IMO, might as well just revert the arch part of Al's patch
instead.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-23 Thread Nicholas Piggin
On Thu, 24 Nov 2016 08:36:39 +0100
Greg Kroah-Hartman  wrote:

> On Thu, Nov 24, 2016 at 06:20:26PM +1100, Nicholas Piggin wrote:
> > But still, modversions is pretty complicated for what it gives us. It sends
> > preprocessed C into a C parser that makes CRCs using type definitions of
> > exported symbols, then turns those CRCs into a linker script which which is
> > used to link the .o file with. What we get in return is a quite limited and
> > symbol "versioning" system.
> > 
> > What if we ripped all that out and just attached an explicit version to
> > each export, and incompatible changes require an increment?  
> 
> How would that work for structures?  Would that be required for every
> EXPORT_SYMBOL* somehow?

Yeah just have EXPORT_SYMBOL take another parameter which attaches a version
number and use that as the value for the __crc_ symbol versions rather than
a calculated CRC.

Yes it would require some level of care from developers and may be a small
annoyance when changing exports. But making people think a tiny bit more
before chnaging exported ABI shouldn't be the end of the world.

> 
> > Google tells me
> > Linus is not a neutral bystander on the topic of symbol versioning, so I'm
> > bracing for a robust response :) (actually I don't much care either way, I'm
> > happy to put a couple of bandaids on it and keep it going)  
> 
> There are tools that people are working on to make it more obvious where
> API breaks happen by looking at the .o debug data instead of our crazy
> current system (which is really better than nothing), perhaps we should
> start using them instead?
> 
> See here for more details about this:
>   
> https://kernel-recipes.org/en/2016/talks/would-an-abi-changes-visualization-tool-be-useful-to-linux-kernel-maintenance/

Hmm. I guess it's basically similar to modversions, so has downsides of not
detecting a semantic change unless it changes the type. But still, if we could
replace our custom code with a tool like this for modversions functionality,
that alone would be a massive improvement. But requiring debug info might be
a bit of a show stopper. I also don't know if that would handle asm functions.

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-23 Thread Nicholas Piggin
On Thu, 24 Nov 2016 07:00:50 +0100
Ingo Molnar  wrote:

> * Nicholas Piggin  wrote:
> 
> > >  scripts/Makefile.build | 78 
> > > --
> > >  1 file changed, 72 insertions(+), 6 deletions(-)
> > > 
> > > It was applied 4 hours after it was sent in the -rc3 timeframe, and then 
> > > it went 
> > > upstream in -rc5:
> > > 
> > >  "Here are some regression fixes for kbuild:
> > > 
> > >- modversion support for exported asm symbols (Nick Piggin). The
> > >  affected architectures need separate patches adding
> > >  asm-prototypes.h.
> > > 
> > > ... the fine merge log even says that the commit 'needs separate patches'!
> > > 
> > > It's still totally broken upstream and it didn't fix any regressions 
> > > AFAICS (or if 
> > > it did then its changelog was very silent on that fact).  
> > 
> > Well it doesn't fix regression by itself, as discussed it needs architecture
> > patches. I've tried keeping linux-arch on cc for all this modversion 
> > breakage
> > stuff since it became clear it would require arch changes.
> > 
> > The actual x86 bug I suppose you would say is caused by 784d5699eddc5. But I
> > should probably have included more background in the above initial crc 
> > support
> > patch, e.g, at least reference 22823ab419d. So mea culpa for that.  
> 
> Indeed 784d5699eddc5 makes more sense:
> 
>   784d5699eddc ("x86: move exports to actual definitions")
>   22823ab419d8 ("EXPORT_SYMBOL() for asm")
> 
> ... and sorry about coming down on you and Marek!
> 
> I've Cc:-ed Al.
> 
> I think what happened is that 22823ab419d8 and 784d5699eddc caused the boot 
> regression (modular builds with modversions enabled not booting), and your 
> fix 
> half-fixed it - with the remaining fix (that adds the header to x86) fixing 
> the 
> rest.

That's about right. My patch *should* be a noop by itself (just provides the
framework for x86 fix to work). So if you notice any new breakage let me
know.

> 
> Still the fact remains that modversions was broken in -rc1 which delayed 
> testing 
> done by a number of prominent testers. :-(

Yep, not ideal. I have a patch or so which is supposed to make CRC failure
warnings more reliable.

But still, modversions is pretty complicated for what it gives us. It sends
preprocessed C into a C parser that makes CRCs using type definitions of
exported symbols, then turns those CRCs into a linker script which which is
used to link the .o file with. What we get in return is a quite limited and
symbol "versioning" system.

What if we ripped all that out and just attached an explicit version to
each export, and incompatible changes require an increment? Google tells me
Linus is not a neutral bystander on the topic of symbol versioning, so I'm
bracing for a robust response :) (actually I don't much care either way, I'm
happy to put a couple of bandaids on it and keep it going)

Thanks,
Nick


Re: [PATCH] x86/kbuild: enable modversions for symbols exported from asm

2016-11-23 Thread Nicholas Piggin
On Thu, 24 Nov 2016 05:40:28 +0100
Ingo Molnar  wrote:

> * Adam Borowski  wrote:
> 
> > Commit 4efca4ed ("kbuild: modversions for EXPORT_SYMBOL() for asm") adds
> > modversion support for symbols exported from asm files. Architectures
> > must include C-style declarations for those symbols in asm/asm-prototypes.h
> > in order for them to be versioned.
> > 
> > Add these declarations for x86, and an architecture-independent file that
> > can be used for common symbols.
> > 
> > User impact: kernels may fail to load modules at all when
> > CONFIG_MODVERSIONS=y.
> > 
> > Signed-off-by: Adam Borowski 
> > Tested-by: Kalle Valo 
> > Acked-by: Nicholas Piggin 
> > Tested-by: Peter Wu 
> > Tested-by: Oliver Hartkopp 
> > ---
> > Changes: corrected Peter Wu's address, added Tested-by: Oliver.
> > This is an unsplit version (x86/include/ and include/ together).
> > 
> >  arch/x86/include/asm/asm-prototypes.h | 12 
> >  include/asm-generic/asm-prototypes.h  |  7 +++
> >  2 files changed, 19 insertions(+)
> >  create mode 100644 arch/x86/include/asm/asm-prototypes.h
> >  create mode 100644 include/asm-generic/asm-prototypes.h  
> 
> Michal, I'm quite unhappy about how the offending commit that broke 
> modversions 
> for essentially _everyone_ who does more complex modular builds on x86 ended 
> up 
> upstream:
> 
>   commit 4efca4ed05cbdfd13ec3e8cb623fb77d6e4ab187
>   Author: Nicholas Piggin 
>   AuthorDate: Tue Nov 1 12:46:19 2016 +1100
>   Commit: Michal Marek 
>   CommitDate: Tue Nov 1 16:20:17 2016 +0100
> 
>   kbuild: modversions for EXPORT_SYMBOL() for asm
> 
>   Allow architectures to create asm/asm-prototypes.h file that
>   provides C prototypes for exported asm functions, which enables
>   proper CRC versions to be generated for them.

What did this break? It's the first I've heard of it. For all architectures
without asm/asm-prototypes.h it should have been a functional noop. Any
breakage is some bug in my patch so that would need to be fixed urgently.

> 
>  scripts/Makefile.build | 78 
> --
>  1 file changed, 72 insertions(+), 6 deletions(-)
> 
> It was applied 4 hours after it was sent in the -rc3 timeframe, and then it 
> went 
> upstream in -rc5:
> 
>  "Here are some regression fixes for kbuild:
> 
>- modversion support for exported asm symbols (Nick Piggin). The
>  affected architectures need separate patches adding
>  asm-prototypes.h.
> 
> ... the fine merge log even says that the commit 'needs separate patches'!
> 
> It's still totally broken upstream and it didn't fix any regressions AFAICS 
> (or if 
> it did then its changelog was very silent on that fact).

Well it doesn't fix regression by itself, as discussed it needs architecture
patches. I've tried keeping linux-arch on cc for all this modversion breakage
stuff since it became clear it would require arch changes.

The actual x86 bug I suppose you would say is caused by 784d5699eddc5. But I
should probably have included more background in the above initial crc support
patch, e.g, at least reference 22823ab419d. So mea culpa for that.

> Why was such a complex patch applied and why isn't it reverted or fixed 
> upstream?

It's been discussed and reviewed and tested for a long time (mainly on
linux-arch and linux-kbuild) and simply taken a while to find the least nasty
way to get 4.9 working.

The real problem is that this regression was found very late because it seems
very specific to the exact build environment. Simply enabling modversions was
not enough to break it on all configs (you would silently get 0 CRCs) so it
slipped through despite build tests. Then it took a quite a while longer to
settle on how to fix it.

Thanks,
Nick


Re: linux-next: build failure after merge of the powerpc tree

2016-11-22 Thread Nicholas Piggin
On Tue, 22 Nov 2016 21:21:14 +1100
Stephen Rothwell  wrote:

> Hi all,
> 
> On Tue, 22 Nov 2016 19:58:32 +1100 Stephen Rothwell  
> wrote:
> >
> > After merging the powerpc tree, today's linux-next build (powerpc
> > allyesconfig) failed like this:
> > 
> > Inconsistent kallsyms data
> > Try make KALLSYMS_EXTRA_PASS=1 as a workaround
> > 
> > Which is a vast improvement.  I'll try adding 'KALLSYMS_EXTRA_PASS=1'
> > later and see how it goes.  It would be nice not to need it, though.  
> 
> It build with KALLSYMS_EXTRA_PASS=1 on the command line.
> 

Yay!

The inconsistent kallsyms data is not something we're doing wrong or
toolchain bug so much as an inherent flaw in how kallsyms are done.
Basically the allyesconfig kallsyms section expands the kernel so much
that a subsequent linking requires even more stubs, which adds more
symbols, which means the next kallsyms will be a different size, which
means or offsets will be wrong :)

In theory I think it could actually be triggered on smaller kernels if
the size was just right. I'm sort of looking at ways to fix it:

https://patchwork.ozlabs.org/patch/684910/

Hopefully will get something in 4.10. Worst case we could detect the
kallsyms size change and do another pass in response without slowing
down the case where it's not required. Hmm, that might be the best
thing to do for a first pass.

Thanks,
Nick


Re: [PATCH 1/2] kbuild: provide include/asm/asm-prototypes.h for ARM

2016-11-22 Thread Nicholas Piggin
On Wed, 23 Nov 2016 00:41:07 +
Russell King - ARM Linux  wrote:

> On Tue, Nov 22, 2016 at 11:34:48AM -0500, Nicolas Pitre wrote:
> > On Tue, 22 Nov 2016, Arnd Bergmann wrote:  
> > > This adds an asm/asm-prototypes.h header for ARM to fix the broken symbol
> > > versioning for symbols exported from assembler files.
> > > 
> > > I couldn't find the correct prototypes for the compiler builtins,
> > > so I went with the fake 'void f(void)' prototypes that we had
> > > before, restoring the state before they were moved.
> > > 
> > > Originally I assumed that the problem was just a harmless warning
> > > in unusual configurations, but as Uwe found, we actually need this
> > > to load most modules when symbol versioning is enabled, as it is
> > > in many distro kernels.
> > > 
> > > Cc: Uwe Kleine-König 
> > > Fixes: 4dd1837d7589 ("arm: move exports to definitions")
> > > Signed-off-by: Arnd Bergmann 
> > > ---
> > > Compared to the earlier version, I dropped the changes to the
> > > csumpartial files, which now get handled correctly by Kbuild
> > > even when the export comes from a macro, and I also dropped the
> > > changes to the bitops files, which were already fixed in a
> > > patch from Nico.
> > > 
> > > The patch applies cleanly on top of the rmk/fixes tree but has
> > > no effect there, as it also needs 4efca4ed05cb ("kbuild: modversions
> > > for EXPORT_SYMBOL() for asm") and cc6acc11cad1 ("kbuild: be more
> > > careful about matching preprocessed asm ___EXPORT_SYMBOL").
> > > 
> > > With the combination of rmk/fixes, torvalds/master and these two
> > > patches, symbol versioning works again on ARM. As it is still
> > > broken on almost all other architectures (powerpc is fixed,
> > > x86 has a patch), I wonder if we should make CONFIG_MODVERSIONS
> > > as broken for everything else.  
> > 
> > I'm not sure I like this at all.
> > 
> > The goal for moving EXPORT_SYMBOL() to assembly code where symbols were 
> > defined is to make things close together and avoid those centralized 
> > list of symbols that you can easily miss when modifying the actual code.
> > 
> > This series is therefore bringing back a centralized list of symbols in 
> > a slightly different form, nullifying the advantages from having moved 
> > EXPORT_SYMBOL() to asm code.  To me this looks like a big step backward.
> > 
> > Why not simply extending the original idea of keeping exports close to 
> > the actual code by _also_ having a macro that provides the function 
> > prototype alongside the EXPORT_SYMBOL() instance?  That could even be 
> > expressed with some EXPORT_SYMBOL_PROTO(ret, sym, arg...) macro that 
> > does it all.  
> 
> What you're saying is that you don't like the solution that's taken
> weeks to get merged up to this point, so where do we go from here
> now?  This crap has been broken since 4.9-rc1, and is a regression.
> 
> I think at this point, we just declare that modversions are broken
> on ARM, and those who created this mess get to explain to people
> why the fsck they broke the kernel.

Let's fix it instead :)

Why not merge Arnd's 2 patches? I think he mostly addressed your concerns
of them. Or...

> 
> 4.9 is the next LTS kernel?  ROTFL!
> 
> I agree with Nicolas - it seems that the whole EXPORT_SYMBOL() crap
> has just been a pointless exercise in churn, resulting only in
> something "different" because it looks "cool" to do it some other
> way.  There's no real benefit here at all, only harm.
> 
> Just revert the damned patches that created this breakage in the
> first place please.  It's now way too late to be trying to fix it
> any other way.
> 

4dd1837d7589 diffstat is entirely in arch/arm. I think reverting that
would fix it (I haven't tested it myself so I would advise testing
before committing). So the ball is in your court.

As for process concerns, you have made valid points. Sometimes a mistake
is made or we make an incorrect assumption about how another person works
or what mailing lists they have read, or do not anticipate the fallout
from some change.

Everyone does it sometimes, so it's never a bad time to reflect on how
we work with others and try to do better.

Thanks,
Nick


Re: [PATCH 1/2] kbuild: provide include/asm/asm-prototypes.h for ARM

2016-11-22 Thread Nicholas Piggin
On Tue, 22 Nov 2016 11:34:48 -0500 (EST)
Nicolas Pitre  wrote:

> On Tue, 22 Nov 2016, Arnd Bergmann wrote:
> 
> > This adds an asm/asm-prototypes.h header for ARM to fix the broken symbol
> > versioning for symbols exported from assembler files.
> > 
> > I couldn't find the correct prototypes for the compiler builtins,
> > so I went with the fake 'void f(void)' prototypes that we had
> > before, restoring the state before they were moved.
> > 
> > Originally I assumed that the problem was just a harmless warning
> > in unusual configurations, but as Uwe found, we actually need this
> > to load most modules when symbol versioning is enabled, as it is
> > in many distro kernels.
> > 
> > Cc: Uwe Kleine-König 
> > Fixes: 4dd1837d7589 ("arm: move exports to definitions")
> > Signed-off-by: Arnd Bergmann 
> > ---
> > Compared to the earlier version, I dropped the changes to the
> > csumpartial files, which now get handled correctly by Kbuild
> > even when the export comes from a macro, and I also dropped the
> > changes to the bitops files, which were already fixed in a
> > patch from Nico.
> > 
> > The patch applies cleanly on top of the rmk/fixes tree but has
> > no effect there, as it also needs 4efca4ed05cb ("kbuild: modversions
> > for EXPORT_SYMBOL() for asm") and cc6acc11cad1 ("kbuild: be more
> > careful about matching preprocessed asm ___EXPORT_SYMBOL").
> > 
> > With the combination of rmk/fixes, torvalds/master and these two
> > patches, symbol versioning works again on ARM. As it is still
> > broken on almost all other architectures (powerpc is fixed,
> > x86 has a patch), I wonder if we should make CONFIG_MODVERSIONS
> > as broken for everything else.  
> 
> I'm not sure I like this at all.
> 
> The goal for moving EXPORT_SYMBOL() to assembly code where symbols were 
> defined is to make things close together and avoid those centralized 
> list of symbols that you can easily miss when modifying the actual code.

Right.

> 
> This series is therefore bringing back a centralized list of symbols in 
> a slightly different form, nullifying the advantages from having moved 
> EXPORT_SYMBOL() to asm code.  To me this looks like a big step backward.

Exported symbols have C declarations in headers already. For the most
part, anyway -- these ones Arnd adds are for compiler runtime which is
why some architectures haven't had the prototypes.

> Why not simply extending the original idea of keeping exports close to 
> the actual code by _also_ having a macro that provides the function 
> prototype alongside the EXPORT_SYMBOL() instance?  That could even be 
> expressed with some EXPORT_SYMBOL_PROTO(ret, sym, arg...) macro that 
> does it all.

Well, the reason is to get 4.9 working, I never thought asm-prototypes.h
was a beautiful solution or it should not be changed if we can find ways
to improve it.

EXPORT_SYMBOL_PROTO() for asm code seems possibly a good idea for 4.10.
Of course your exported symbols will still have their prototypes in
various headers -- that redundancy is inherent.

Thanks,
Nick


Re: [1/2] kbuild: provide include/asm/asm-prototypes.h for ARM

2016-11-21 Thread Nicholas Piggin
On Mon, 21 Nov 2016 19:13:55 +
Russell King - ARM Linux  wrote:

> On Mon, Nov 21, 2016 at 07:46:44PM +0100, Uwe Kleine-König wrote:
> > Hello,
> > 
> > On Mon, Oct 24, 2016 at 05:05:26PM +0200, Arnd Bergmann wrote:  
> > > This adds an asm/asm-prototypes.h header for ARM to fix the
> > > broken symbol versioning for symbols exported from assembler
> > > files.
> > > 
> > > In addition to the header, we have to do these other small
> > > changes:
> > > 
> > > - move the exports from bitops.h to {change,clear,set,...}bit.S
> > > - move the exports from csumpartialgeneric.S into the files
> > >   including it
> > > 
> > > I couldn't find the correct prototypes for the compiler builtins,
> > > so I went with the fake 'void f(void)' prototypes that we had
> > > before.
> > > 
> > > This leaves the mmioset/mmiocpy function for now, as it's not
> > > obvious how to best handle them.
> > > 
> > > Signed-off-by: Arnd Bergmann   
> > 
> > In my test builds of 4.9-rc5 plus
> > 
> > 4efca4ed05cb ("kbuild: modversions for EXPORT_SYMBOL() for asm")
> > cc6acc11cad1 ("kbuild: be more careful about matching preprocessed asm 
> > ___EXPORT_SYMBOL")
> > 
> > (which are in -rc6) I got many warnings à la:
> > 
> > WARNING: "memset" [drivers/media/usb/airspy/airspy.ko] has no CRC!
> > 
> > and booting the resulting kernel failed with messages of the type:
> > 
> > [3.024126] usbcore: no symbol version for __memzero
> > [3.029107] usbcore: Unknown symbol __memzero (err -22)
> > 
> > so hardly any module could be loaded. modprobe -f works however, but
> > that's not what my initramfs does.
> > 
> > With this patch and https://patchwork.kernel.org/patch/9392291/ ("ARM:
> > move mmiocpy/mmioset exports to io.c") I could compile a kernel without
> > CRC warnings and it boots fine. So it would be great to get these two
> > patches into 4.9.  
> 
> Yea, many things would be nice, but I've been unable to track the
> issues here - it really didn't help _not_ being copied on the
> original set of patches which introduced this mess.

Quick overview:

- asm exports changes allow EXPORT_SYMBOL to be moved into .S files,
  but they would not get modversion CRCs generated.

- The core kbuild patches to add modversions support for asm exports
  is now merged in Linus's tree from the recent kbuild tree pull.
  asm/asm-prototypes.h must contain C style declarations of the symbol
  for this to work.

- Architectures can now add their asm/asm-prototypes.h and things
  *should* start working.

- Dependency is not a hard one. If you add asm-prototypes.h before
  merging the core patches then it should not introduce any problems.


> I've merged Nicolas' patch, so now we need to work out what to do
> with the remaining bits - which I guess are the asm-prototypes.h
> and the mmio* bits.  I'm not aware of what's happening with the
> patches that they depend on (which is why I recently asked the
> question - again, I seem to be completely out of the loop due to
> lack of Cc's).
> 
> So I'm just throwing my hands up and saying "I don't know what to
> do" at this stage.
> 

I don't think you have missed much since last it came up, it's just
taken a bit of time to get the details right and get it merged.

Not sure what your tree looks like, but if you merge this patch 1/2
plus 2a/2 or 2b/2 into upstream, then ARM should be working.

Thanks,
Nick


Re: [PATCH resend] kbuild: provide include/asm/asm-prototypes.h for x86

2016-11-20 Thread Nicholas Piggin
Hi Adam,

Thanks. I'd suggest doing x86: or x86/kbuild: prefix for the patch. Also
possibly consider describing what the patch does at a higher level in your
subject line, e.g.:

  x86/kbuild: enable modversions for symbols exported from asm

Also, it wouldn't hurt to add a little changelog of your own. Describe
problem then solution, e.g.,

  Commit 4efca4ed ("kbuild: modversions for EXPORT_SYMBOL() for asm") adds
  modversion support for symbols exported from asm files. Architectures
  must include C-style declarations for those symbols in asm/asm-prototypes.h
  in order for them to be versioned.

  Add these declarations for x86, and an architecture-independent file that
  can be used for common symbols.

(if you want to use that as-is or rewrite it, no problem).

You can add Acked-by: Nicholas Piggin 

Also it's not a big deal, but if you redo the patch, you could consider
splitting it into two (first add the generic header, then the x86 header),
but both can go via the x86 tree.

Thanks,
Nick

On Mon, 21 Nov 2016 07:39:45 +0100
Adam Borowski  wrote:

> Nicholas Piggin wrote:
> > Architectures will need to have an include/asm/asm-prototypes.h that
> > defines or #include<>s C-style prototypes for exported asm functions.
> > We can do an asm-generic version for the common ones like memset so
> > there's not a lot of pointless duplication there.  
> 
> Signed-off-by: Adam Borowski 
> Tested-by: Kalle Valo 
> ---
>  arch/x86/include/asm/asm-prototypes.h | 12 
>  include/asm-generic/asm-prototypes.h  |  7 +++
>  2 files changed, 19 insertions(+)
>  create mode 100644 arch/x86/include/asm/asm-prototypes.h
>  create mode 100644 include/asm-generic/asm-prototypes.h
> 
> diff --git a/arch/x86/include/asm/asm-prototypes.h 
> b/arch/x86/include/asm/asm-prototypes.h
> new file mode 100644
> index 000..ae87224
> --- /dev/null
> +++ b/arch/x86/include/asm/asm-prototypes.h
> @@ -0,0 +1,12 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> diff --git a/include/asm-generic/asm-prototypes.h 
> b/include/asm-generic/asm-prototypes.h
> new file mode 100644
> index 000..df13637
> --- /dev/null
> +++ b/include/asm-generic/asm-prototypes.h
> @@ -0,0 +1,7 @@
> +#include 
> +extern void *__memset(void *, int, __kernel_size_t);
> +extern void *__memcpy(void *, const void *, __kernel_size_t);
> +extern void *__memmove(void *, const void *, __kernel_size_t);
> +extern void *memset(void *, int, __kernel_size_t);
> +extern void *memcpy(void *, const void *, __kernel_size_t);
> +extern void *memmove(void *, const void *, __kernel_size_t);



Re: [PATCH 1/2] kbuild: provide include/asm/asm-prototypes.h for ARM

2016-11-20 Thread Nicholas Piggin
On Sun, 20 Nov 2016 19:12:57 +
Russell King - ARM Linux  wrote:

> On Sun, Nov 20, 2016 at 10:32:50AM -0800, Linus Torvalds wrote:
> > On Sun, Nov 20, 2016 at 5:21 AM, Russell King - ARM Linux
> >  wrote:  
> > > On Tue, Oct 25, 2016 at 07:32:00PM +1100, Nicholas Piggin wrote:  
> > >>
> > >> Michal, what's your thoughts? If you merge my patch 2/2 and skip 1/2, it
> > >> should not give any new build warnings or errors, so then arch patches 
> > >> can
> > >> go via arch trees. 1/2 could go in after everyone is up to date.  
> > >
> > > So what's the conclusion on this?  I've just had a failure due to
> > > CONFIG_TRIM_UNUSED_KSYMS reported on ARM, and it looks like (at
> > > least some of) patch 1 could resolve it.  
> > 
> > Hmm. I've got
> > 
> >   cc6acc11cad1 kbuild: be more careful about matching preprocessed asm
> > ___EXPORT_SYMBOL
> >   4efca4ed05cb kbuild: modversions for EXPORT_SYMBOL() for asm
> > 
> > in my tree. Is that sufficient, or do we still have issues?  
> 
> Hmm, those seem to have gone in during the last week, so I haven't
> tested it yet (build running, but it'll take a while).  However, I
> don't think they'll solve _this_ problem.
> 
> Some of the issue here is that we use a mixture of assembly macros
> and preprocessor for the ARM bitops - the ARM bitops are created
> with an assembly macro which contains some pre-processor expanded
> macros (eg, EXPORT_SYMBOL()).
> 
> This means that the actual symbol being exported is not known to
> the preprocessor, so doing the "__is_defined(__KSYM_##sym)" inside
> "EXPORT_SYMBOL(\name)" becomes "__is_defined(__KSYM_\name)" to the
> preprocessor.  As "__KSYM_\name" is never defined, it always comes
> out as zero, hence we always use __cond_export_sym_0, which omits
> the symbol export from the assembly macro definition:
> 
>  .macro bitop, name, instr
> .globl \name ; .align 0 ; \name:
> 
> ...
> 
> .type \name, %function; .size \name, .-\name
> 
>  .endm
> 
> In other words, using preprocessor macros inside an assembly macro
> may not work as expected, and now leads to config-specific failures.
> 

Yes, that's a limitation. cpp expansion we can handle, but not gas macros.
You will need Arnd's patches for ARM.

http://marc.info/?l=linux-kbuild&m=147732160529499&w=2

If that doesn't fix it for you, send me your .config offline and I'll set
up a cross compile to work on it.

Again, any arch always has the option of going back to doing asm exports
in the old style of putting them into a .c file, but hopefully you'll find
Arnd's reworked patches to be something you're willing to merge.

Thanks,
Nick


Re: 'kbuild' merge before 4.9-rc1 breaks build and boot

2016-11-20 Thread Nicholas Piggin
On Sun, 20 Nov 2016 19:26:23 +0100
Peter Wu  wrote:

> Hi Nicholas,
> 
> Current git master (v4.9-rc5-364-g77079b1) with the latest kbuild fixes
> is still failing to load modules when built with CONFIG_MODVERSIONS=y on
> x86_64 using GCC 6.2.1.
> 
> It can still be reproduced with make defconfig, then enabling
> CONFIG_MODVERSIONS=y. The build output contains:
> 
> WARNING: "memcpy" [net/netfilter/nf_nat.ko] has no CRC!
> WARNING: "memmove" [net/netfilter/nf_nat.ko] has no CRC!
> WARNING: "_copy_to_user" [fs/efivarfs/efivarfs.ko] has no CRC!
> WARNING: "memcpy" [fs/efivarfs/efivarfs.ko] has no CRC!
> WARNING: "_copy_from_user" [fs/efivarfs/efivarfs.ko] has no CRC!

Hi Peter,

Sorry it's taken some time, bear with us. The arch specific patches need
to be merged now. Adam, what is the status of your patch? Please submit
to x86 maintainers if you haven't already.

Thanks,
Nick


Re: 'kbuild' merge before 4.9-rc1 breaks build and boot

2016-11-07 Thread Nicholas Piggin
On Mon, 7 Nov 2016 22:39:07 +0100
Peter Wu  wrote:

> On Mon, Nov 07, 2016 at 02:10:12PM -0500, Vince Weaver wrote:
> > On Thu, 27 Oct 2016, Peter Wu wrote:
> >   
> > > I can confirm Olivers issue, the current mainline kernel fails to boot
> > > on kernels with CONFIG_MODVERSIONS=y. Bisection points to:
> > > 
> > > commit 784d5699eddc55878627da20d3fe0c8542e2f1a2
> > > Author: Al Viro 
> > > Date:   Mon Jan 11 11:04:34 2016 -0500  
> > > > WARNING: "memset" [sound/usb/snd-usbmidi-lib.ko] has no CRC!
> > > > WARNING: "__fentry__" [sound/usb/snd-usbmidi-lib.ko] has no CRC!
> > > > WARNING: "memcpy" [sound/usb/snd-usbmidi-lib.ko] has no CRC!
> > > > WARNING: "__sw_hweight32" [sound/usb/snd-usbmidi-lib.ko] has no CRC!  
> > 
> > Has any progress been made with this problem?
> > 
> > I'm also encountering it on my debian-unstable box on any kernel more 
> > recent than 4.9-rc1 (up to and including 4.9-rc4).  I am glad someone 
> > managed to isolate it as I was unable to get a clean bisect.
> > 
> > Vince  
> 
> The original kbuild issue went in via
> merge commit 84d69848c97faab0c25aa2667b273404d2e2a64a which notes:
> 
>  - EXPORT_SYMBOL for asm source by Al Viro.
> 
>This does bring a regression, because genksyms no longer generates
>checksums for these symbols (CONFIG_MODVERSIONS). Nick Piggin is
>working on a patch to fix this.
> 
>Plus, we are talking about functions like strcpy(), which rarely
>change prototypes.
> 
> Adding Nicholas in the cc, hopefully he can give a status update.

I think Michal has everything needed now for the kbuild bits. The arch
specific patches can go via arch trees quite easily (there is no hard
dependency either way). This is the kbuild bit:

https://git.kernel.org/cgit/linux/kernel/git/mmarek/kbuild.git/commit/?h=rc-fixes&id=4efca4ed05cbdfd13ec3e8cb623fb77d6e4ab187

And it also needs this incremental bit not in Michal's tree yet:

---
 scripts/Makefile.build | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 3e223c2..05c6bb4 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -332,7 +332,7 @@ cmd_gensymtypes_S = 
\
 (echo "\#include " ;\
  echo "\#include " ;  \
 $(CPP) $(a_flags) $< |  \
- grep ^___EXPORT_SYMBOL |   \
+ grep ___EXPORT_SYMBOL |\
  sed 's/___EXPORT_SYMBOL \([a-zA-Z0-9_]*\),.*/EXPORT_SYMBOL(\1);/' ) |  \
 $(CPP) -D__GENKSYMS__ $(c_flags) -xc - |\
 $(GENKSYMS) $(if $(1), -T $(2)) \
-- 
2.9.3



Re: [GIT PULL] kbuild changes for v4.9-rc1

2016-10-27 Thread Nicholas Piggin
On Thu, 27 Oct 2016 11:10:03 +0300
Kalle Valo  wrote:

> (Adding Thorsten because of a serious regression and Steven because he
> tried to fix something in the same commit)
> 
> Nicholas Piggin  writes:
> 
> > On Wed, 19 Oct 2016 16:38:14 +0200
> > Michal Marek  wrote:
> >  
> >> Dne 18.10.2016 v 03:34 Nicholas Piggin napsal(a):  
> >> > We should probably just bring all these arch patches through the
> >> > kbuild tree.
> >> > 
> >> > I'm sorry for the breakage: I didn't realize it broke the build with
> >> > some configs, otherwise I would have given Michal a heads up before
> >> > his pull request, and worked to get this stuff in first.
> >> 
> >> It breaks with some binutils versions only (and only with
> >> CONFIG_MODVERSIONS=y, of course).  
> >
> > Yeah this seems to be the issue, it apparently slipped past all the
> > automated builds. It seems like the existing CRC warnings in the tree
> > only trigger in rare circumstances too, so something could be a bit
> > fragile there.  
> 
> I upgraded from 4.8 to 4.9-rc2 and noticed that kernel modules fail to
> load (log below). After investigating for some time I found this thread
> and apparently this is not still fixed, at least not in Linus' tree.
> 
> Reverting 784d5699eddc5 fixed the issue for me. As I don't see any fix
> available (please correct me if I'm wrong) we should just revert that
> commit until it's properly fixed.

With these two patches together, does it work for you?

http://marc.info/?l=linux-arch&m=147653546809512&w=2
http://marc.info/?l=linux-kernel&m=147669851906489&w=2

It would be helpful if you could test and let us know, because there seems
to be a very tiny number of configs and toolchains that causes problems.

> 
> Also note that there's a related fix from Steven:
> 
> [PATCH] x86: Fix export for mcount and __fentry__
> https://marc.info/?l=linux-kernel&m=147733572502413
> 
> For compiling the kernel I'm using Ubuntu 12.04:
> 
> ii  binutils 2.22-6ubuntu1.4  GNU assembler, linker and 
> binary utilities
> ii  gcc  4:4.6.3-1ubuntu5 GNU C compiler
> 
> The kernel is running on a separate machine with Ubuntu 14.04.
> 
> [  110.703414] bluetooth: disagrees about version of symbol __get_user_2
> [  110.703416] bluetooth: Unknown symbol __get_user_2 (err -22)
> [  110.703429] bluetooth: disagrees about version of symbol __put_user_2
> [  110.703430] bluetooth: Unknown symbol __put_user_2 (err -22)
> [  110.703579] bluetooth: disagrees about version of symbol __put_user_4
> [  110.703580] bluetooth: Unknown symbol __put_user_4 (err -22)
> [  110.703669] bluetooth: disagrees about version of symbol __put_user_1
> [  110.703670] bluetooth: Unknown symbol __put_user_1 (err -22)
> [  110.703688] bluetooth: disagrees about version of symbol mcount
> [  110.703689] bluetooth: Unknown symbol mcount (err -22)
> 

I haven't seen that one before. Did you definitely make and install new
modules?

Thanks,
Nick


Re: [GIT PULL] kbuild changes for v4.9-rc1

2016-10-27 Thread Nicholas Piggin
On Thu, 27 Oct 2016 16:14:28 +0300
Kalle Valo  wrote:

> Nicholas Piggin  writes:
> 
> > On Thu, 27 Oct 2016 11:10:03 +0300
> > Kalle Valo  wrote:
> >  
> >> (Adding Thorsten because of a serious regression and Steven because he
> >> tried to fix something in the same commit)
> >> 
> >> Nicholas Piggin  writes:
> >>   
> >> > On Wed, 19 Oct 2016 16:38:14 +0200
> >> > Michal Marek  wrote:
> >> >
> >> >> Dne 18.10.2016 v 03:34 Nicholas Piggin napsal(a):
> >> >> > We should probably just bring all these arch patches through the
> >> >> > kbuild tree.
> >> >> > 
> >> >> > I'm sorry for the breakage: I didn't realize it broke the build with
> >> >> > some configs, otherwise I would have given Michal a heads up before
> >> >> > his pull request, and worked to get this stuff in first.  
> >> >> 
> >> >> It breaks with some binutils versions only (and only with
> >> >> CONFIG_MODVERSIONS=y, of course).
> >> >
> >> > Yeah this seems to be the issue, it apparently slipped past all the
> >> > automated builds. It seems like the existing CRC warnings in the tree
> >> > only trigger in rare circumstances too, so something could be a bit
> >> > fragile there.
> >> 
> >> I upgraded from 4.8 to 4.9-rc2 and noticed that kernel modules fail to
> >> load (log below). After investigating for some time I found this thread
> >> and apparently this is not still fixed, at least not in Linus' tree.
> >> 
> >> Reverting 784d5699eddc5 fixed the issue for me. As I don't see any fix
> >> available (please correct me if I'm wrong) we should just revert that
> >> commit until it's properly fixed.  
> >
> > With these two patches together, does it work for you?
> >
> > http://marc.info/?l=linux-arch&m=147653546809512&w=2
> > http://marc.info/?l=linux-kernel&m=147669851906489&w=2
> >
> > It would be helpful if you could test and let us know, because there seems
> > to be a very tiny number of configs and toolchains that causes
> > problems.  
> 
> With these two patches (on top of ath-201610251249 from ath.git, in
> practice 4.9-rc2 + latest wireless patches) module loading works again.
> If you want you can add my Tested-by:
> 
> Tested-by: Kalle Valo 

Great, thanks for testing it.

> Can we get these patches to Linus' tree soon? It's annoying to revert
> 784d5699eddc5 everytime I update my tree.

Yes I think it's about ready to merge. Michal is returning from vacation
next week so we should get some progress soon.

> >> Also note that there's a related fix from Steven:
> >> 
> >> [PATCH] x86: Fix export for mcount and __fentry__
> >> https://marc.info/?l=linux-kernel&m=147733572502413
> >> 
> >> For compiling the kernel I'm using Ubuntu 12.04:
> >> 
> >> ii  binutils 2.22-6ubuntu1.4  GNU assembler, linker and 
> >> binary utilities
> >> ii  gcc  4:4.6.3-1ubuntu5 GNU C compiler
> >> 
> >> The kernel is running on a separate machine with Ubuntu 14.04.
> >> 
> >> [  110.703414] bluetooth: disagrees about version of symbol __get_user_2
> >> [  110.703416] bluetooth: Unknown symbol __get_user_2 (err -22)
> >> [  110.703429] bluetooth: disagrees about version of symbol __put_user_2
> >> [  110.703430] bluetooth: Unknown symbol __put_user_2 (err -22)
> >> [  110.703579] bluetooth: disagrees about version of symbol __put_user_4
> >> [  110.703580] bluetooth: Unknown symbol __put_user_4 (err -22)
> >> [  110.703669] bluetooth: disagrees about version of symbol __put_user_1
> >> [  110.703670] bluetooth: Unknown symbol __put_user_1 (err -22)
> >> [  110.703688] bluetooth: disagrees about version of symbol mcount
> >> [  110.703689] bluetooth: Unknown symbol mcount (err -22)
> >>   
> >
> > I haven't seen that one before. Did you definitely make and install new
> > modules?  
> 
> I'm pretty sure modules are correctly installed as I have used the same
> procedure for years: on my workstation I do 'make bindeb-pkg', copy the
> .deb to the test laptop and install the deb there. Also once I revert
> 784d5699eddc5 it starts immeadiately working.
> 

Sure, I was just checking because I've seen several types of failure but
not this one before. Thanks for reporting and testing.

Thanks,
Nick


Re: CONFIG_VMAP_STACK, on-stack struct, and wake_up_bit

2016-10-27 Thread Nicholas Piggin
On Thu, 27 Oct 2016 10:08:52 +0200
Peter Zijlstra  wrote:

> On Thu, Oct 27, 2016 at 12:07:26AM +0100, Mel Gorman wrote:
> > > but I consider PeterZ's
> > > patch the fix to that, so I wouldn't worry about it.
> > >   
> > 
> > Agreed. Peter, do you plan to finish that patch?  
> 
> I was waiting for you guys to hash out the 32bit issue. But if we're now
> OK with having this for 64bit only, I can certainly look at doing a new
> version.
> 
> I'll have to look at fixing Alpha's bitops for that first though,
> because as is that patch relies on atomics to the same word not needing
> ordering, but placing the contended/waiters bit in the high word for
> 64bit only sorta breaks that.

I got mine working too. Haven't removed the bitops barrier (that's
for another day), or sorted the page flags in this one. But the core
code is there.

It's a bit more intrusive than your patch, but I like the end result
better. It just stops using the generic bit waiter code completely
and uses its own keys. Ends up being making things easier, and we
could wait for other page details too if that was ever required.

It uses the same wait bit logic for all page waiters.
It keeps PageWaiters manipulation entirely under waitqueue lock, so no
additional data races beyond existing unlocked waitqueue_active tests.
And it checks to clear the waiter bit when no waiters for that page are
in the queue, so hash collisions with long waiters don't end up dragging
us into the slowpath always.

Also didn't uninline unlock_page yet. Still causes some text expansion,
but we should revisit that.

Thanks,
Nick

---
 include/linux/page-flags.h |   2 +
 include/linux/pagemap.h|  23 +++---
 include/trace/events/mmflags.h |   1 +
 mm/filemap.c   | 157 -
 mm/swap.c  |   2 +
 5 files changed, 138 insertions(+), 47 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 74e4dda..8059c04 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -73,6 +73,7 @@
  */
 enum pageflags {
PG_locked,  /* Page is locked. Don't touch. */
+   PG_waiters, /* Page has waiters, check its waitqueue */
PG_error,
PG_referenced,
PG_uptodate,
@@ -253,6 +254,7 @@ static inline int TestClearPage##uname(struct page *page) { 
return 0; }
TESTSETFLAG_FALSE(uname) TESTCLEARFLAG_FALSE(uname)
 
 __PAGEFLAG(Locked, locked, PF_NO_TAIL)
+PAGEFLAG(Waiters, waiters, PF_NO_COMPOUND) __CLEARPAGEFLAG(Waiters, waiters, 
PF_NO_COMPOUND)
 PAGEFLAG(Error, error, PF_NO_COMPOUND) TESTCLEARFLAG(Error, error, 
PF_NO_COMPOUND)
 PAGEFLAG(Referenced, referenced, PF_HEAD)
TESTCLEARFLAG(Referenced, referenced, PF_HEAD)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index dd15d39..97f2d0b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -477,22 +477,14 @@ static inline int lock_page_or_retry(struct page *page, 
struct mm_struct *mm,
  * and for filesystems which need to wait on PG_private.
  */
 extern void wait_on_page_bit(struct page *page, int bit_nr);
-
 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
-extern int wait_on_page_bit_killable_timeout(struct page *page,
-int bit_nr, unsigned long timeout);
-
-static inline int wait_on_page_locked_killable(struct page *page)
-{
-   if (!PageLocked(page))
-   return 0;
-   return wait_on_page_bit_killable(compound_head(page), PG_locked);
-}
+extern void wake_up_page_bit(struct page *page, int bit_nr);
 
-extern wait_queue_head_t *page_waitqueue(struct page *page);
 static inline void wake_up_page(struct page *page, int bit)
 {
-   __wake_up_bit(page_waitqueue(page), &page->flags, bit);
+   if (!PageWaiters(page))
+   return;
+   wake_up_page_bit(page, bit);
 }
 
 /* 
@@ -508,6 +500,13 @@ static inline void wait_on_page_locked(struct page *page)
wait_on_page_bit(compound_head(page), PG_locked);
 }
 
+static inline int wait_on_page_locked_killable(struct page *page)
+{
+   if (!PageLocked(page))
+   return 0;
+   return wait_on_page_bit_killable(compound_head(page), PG_locked);
+}
+
 /* 
  * Wait for a page to complete writeback
  */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5a81ab4..7ac8c0a 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -81,6 +81,7 @@
 
 #define __def_pageflag_names   \
{1UL << PG_locked,  "locked"},  \
+   {1UL << PG_waiters, "waiters"   },  \
{1UL << PG_error,   "error" },  \
{1UL << PG_referenced,  "referenced"},  \
{1UL << PG_uptodate,"uptodate"  },  \
di

Re: [PATCH] kbuild: protect compiled-in dtb from dead data elimination

2016-10-26 Thread Nicholas Piggin
On Wed, 26 Oct 2016 10:47:43 +0200
Marcin Nowakowski  wrote:

> DTBs are compiled into byte arrays that are placed inside
> .dtb.init.rodata section. As they are never referenced directly from the
> code, they get removed during dead data elimination and we end up with
> __dtb_start == __dtb_end.

Thanks, I got a bunch of patches from Paul for similar, so I
went through it and we've actually got a lot of tables we should
probably protect with KEEP(). For reference this is what I have
on top of Paul's patches. I'll send them up to kbuild after we
get this CRC build issue dealt with.

I'm glad you guys are finding it useful.

Thanks,
Nick



From: Nicholas Piggin 
Date: Tue, 25 Oct 2016 20:42:42 +1100
Subject: [PATCH] kbuild: dead code elimination keep various other tables

This should get a little bit less painful as things eventually move
to generic linker table helper macros, but for now be liberal in
keeping data tables.

Signed-off-by: Nicholas Piggin 
---
 include/asm-generic/vmlinux.lds.h | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index abe79e3..1cac167 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -184,7 +184,7 @@
 #define ACPI_PROBE_TABLE(name) \
. = ALIGN(8);   \
VMLINUX_SYMBOL(__##name##_acpi_probe_table) = .;\
-   *(__##name##_acpi_probe_table)  \
+   KEEP(*(__##name##_acpi_probe_table))\
VMLINUX_SYMBOL(__##name##_acpi_probe_table_end) = .;
 #else
 #define ACPI_PROBE_TABLE(name)
@@ -193,7 +193,7 @@
 #define KERNEL_DTB()   \
STRUCT_ALIGN(); \
VMLINUX_SYMBOL(__dtb_start) = .;\
-   *(.dtb.init.rodata) \
+   KEEP(*(.dtb.init.rodata))   \
VMLINUX_SYMBOL(__dtb_end) = .;
 
 /*
@@ -214,11 +214,11 @@
/* implement dynamic printk debug */\
. = ALIGN(8);   \
VMLINUX_SYMBOL(__start___jump_table) = .;   \
-   *(__jump_table) \
+   KEEP(*(__jump_table))   \
VMLINUX_SYMBOL(__stop___jump_table) = .;\
. = ALIGN(8);   \
VMLINUX_SYMBOL(__start___verbose) = .;  \
-   *(__verbose)\
+   KEEP(*(__verbose))  \
VMLINUX_SYMBOL(__stop___verbose) = .;   \
LIKELY_PROFILE()\
BRANCH_PROFILE()\
@@ -271,10 +271,10 @@
VMLINUX_SYMBOL(__start_rodata) = .; \
*(.rodata) *(.rodata.*) \
RO_AFTER_INIT_DATA  /* Read only after init */  \
-   *(__vermagic)   /* Kernel version magic */  \
+   KEEP(*(__vermagic)) /* Kernel version magic */  \
. = ALIGN(8);   \
VMLINUX_SYMBOL(__start___tracepoints_ptrs) = .; \
-   *(__tracepoints_ptrs)   /* Tracepoints: pointer array */\
+   KEEP(*(__tracepoints_ptrs)) /* Tracepoints: pointer array */ \
VMLINUX_SYMBOL(__stop___tracepoints_ptrs) = .;  \
*(__tracepoints_strings)/* Tracepoints: strings */  \
}   \
@@ -316,7 +316,7 @@
/* Built-in firmware blobs */   \
.builtin_fw: AT(ADDR(.builtin_fw) - LOAD_OFFSET) {  \
VMLINUX_SYMBOL(__start_builtin_fw) = .; \
-   *(.builtin_fw)  \
+   KEEP(*(.builtin_fw))\
VMLINUX_SYMBOL(__end_builtin_fw) = .;   \
}   \
\
@@ -394,7 +394,7 @@
\
/* Kernel symbol table: strings */  \
 __ksymtab_strings : AT(ADDR(

Re: [PATCH 1/2] kbuild: provide include/asm/asm-prototypes.h for ARM

2016-10-25 Thread Nicholas Piggin
On Mon, 24 Oct 2016 17:05:26 +0200
Arnd Bergmann  wrote:

> This adds an asm/asm-prototypes.h header for ARM to fix the
> broken symbol versioning for symbols exported from assembler
> files.
> 
> In addition to the header, we have to do these other small
> changes:
> 
> - move the exports from bitops.h to {change,clear,set,...}bit.S
> - move the exports from csumpartialgeneric.S into the files
>   including it
> 
> I couldn't find the correct prototypes for the compiler builtins,
> so I went with the fake 'void f(void)' prototypes that we had
> before.
> 
> This leaves the mmioset/mmiocpy function for now, as it's not
> obvious how to best handle them.


This looks nicer. I like variant B because it keeps the GENKSYMS cruft to
a single location, but either one isn't too bad.

I'd like to get moving on this, so let's at least get the generic kbuild
change merged. In the end, the kbuild code does not prevent a maintainer
from putting their EXPORT_SYMBOL in whatever location they like, so there
is no reason not to merge it (certainly there will be archs that do use
it).

Michal, what's your thoughts? If you merge my patch 2/2 and skip 1/2, it
should not give any new build warnings or errors, so then arch patches can
go via arch trees. 1/2 could go in after everyone is up to date.

Thanks,
Nick


Re: [RFC][PATCH] Add EXPORT_MACRO_SYMBOL() for asm

2016-10-23 Thread Nicholas Piggin
On Sat, 22 Oct 2016 15:08:52 -0400
Steven Rostedt  wrote:

> On Sat, 22 Oct 2016 10:44:41 +1100
> Nicholas Piggin  wrote:
> 
> > >  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> > > diff --git a/include/asm-generic/export.h b/include/asm-generic/export.h
> > > index 43199a049da5..cb86e746865e 100644
> > > --- a/include/asm-generic/export.h
> > > +++ b/include/asm-generic/export.h
> > > @@ -90,5 +90,10 @@
> > >   __EXPORT_SYMBOL(name, KSYM(name),)
> > >  #define EXPORT_DATA_SYMBOL_GPL(name) \
> > >   __EXPORT_SYMBOL(name, KSYM(name),_gpl)
> > > +/*
> > > + * If "name" is a macro of a function and not a function itself,
> > > + * it needs a second pass.
> > > + */
> > > +#define EXPORT_MACRO_SYMBOL(x) EXPORT_SYMBOL(x)
> > 
> > Seems okay, but what about just calling it EXPORT_SYMBOL?  
> 
> 
> Actually, this doesn't work. There's some magic going on it
> Makefile.build in the scripts directory that causes this to fail.
> 
> I have another patch I'll be sending on Monday that fixes this.

Yes it's grepping for EXPORT_SYMBOL_* I think. If you need to
create a new name, EXPORT_SYMBOL prefix would be preferred. But
yeah if you make the standard macro do macro expansion, it
should just work, no?



Re: [PATCH 2/5] stop_machine: yield CPU during stop machine

2016-10-21 Thread Nicholas Piggin
On Fri, 21 Oct 2016 14:05:36 +0200
Peter Zijlstra  wrote:

> On Fri, Oct 21, 2016 at 01:58:55PM +0200, Christian Borntraeger wrote:
> > stop_machine can take a very long time if the hypervisor does
> > overcommitment for guest CPUs. When waiting for "the one", lets
> > give up our CPU by using the new cpu_relax_yield.  
> 
> This seems something that would apply to most other virt stuff. Lets Cc
> a few more lists for that.
> 
> > Signed-off-by: Christian Borntraeger 
> > ---
> >  kernel/stop_machine.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > index ec9ab2f..1eb8266 100644
> > --- a/kernel/stop_machine.c
> > +++ b/kernel/stop_machine.c
> > @@ -194,7 +194,7 @@ static int multi_cpu_stop(void *data)
> > /* Simple state machine */
> > do {
> > /* Chill out and ensure we re-read multi_stop_state. */
> > -   cpu_relax();
> > +   cpu_relax_yield();
> > if (msdata->state != curstate) {
> > curstate = msdata->state;
> > switch (curstate) {
> > -- 
> > 2.5.5
> >   

This is the only caller of cpu_relax_yield()?

As a step to removing cpu_yield_lowlatency this series is nice so I
have no objection. But "general" kernel coders still have basically
no chance of using this properly.

I wonder what can be done about that. I've got that spin_do/while
series I'll rebase on top of this, but a spin_yield variant of them
is of no more help to the caller.

What makes this unique? Long latency and not performance critical?
Most places where we spin and maybe yield have been moved to arch
code, but I wonder whether we can make an easier to use architecture
independent API?

Thanks,
Nick


Re: [RFC][PATCH] Add EXPORT_MACRO_SYMBOL() for asm

2016-10-21 Thread Nicholas Piggin
On Fri, 21 Oct 2016 12:17:59 -0400
Steven Rostedt  wrote:

> Commit 784d5699eddc5 ("x86: move exports to actual definitions") removed the
> EXPORT_SYMBOL(__fentry__) and EXPORT_SYMBOL(mcount) from x8664_ksyms_64.c,
> and added EXPORT_SYMBOL(function_hook) in mcount_64.S instead. The problem
> is that function_hook isn't a function at all, but a macro that is defined
> as eithe mcount or __fentry__ depending on the support from gcc. But instead
> of adding more #ifdefs like x8684_ksyms_64.c had, I suggest having another
> export that can handle this similar to the way __string() works with
> converting macros to strings. By having:
> 
>  EXPORT_MACRO_SYMBOL(function_hook)
> 
> Where we have:
> 
>  #define EXPORT_MACRO_SYMBOL(x) EXPORT_SYMBOL(x)
> 
> It will convert the macro into what it is defined as before calling
> EXPORT_SYMBOL(), and this will just work properly again.
> 
> Cc: sta...@vger.kernel.org
> Fixes: Commit 784d5699eddc5 ("x86: move exports to actual definitions")
> Signed-off-by: Steven Rostedt 
> ---
>  arch/x86/kernel/mcount_64.S  | 2 +-
>  include/asm-generic/export.h | 5 +
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/mcount_64.S b/arch/x86/kernel/mcount_64.S
> index efe73aacf966..ccd9d912af27 100644
> --- a/arch/x86/kernel/mcount_64.S
> +++ b/arch/x86/kernel/mcount_64.S
> @@ -295,7 +295,7 @@ trace:
>   jmp fgraph_trace
>  END(function_hook)
>  #endif /* CONFIG_DYNAMIC_FTRACE */
> -EXPORT_SYMBOL(function_hook)
> +EXPORT_MACRO_SYMBOL(function_hook)
>  #endif /* CONFIG_FUNCTION_TRACER */
>  
>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> diff --git a/include/asm-generic/export.h b/include/asm-generic/export.h
> index 43199a049da5..cb86e746865e 100644
> --- a/include/asm-generic/export.h
> +++ b/include/asm-generic/export.h
> @@ -90,5 +90,10 @@
>   __EXPORT_SYMBOL(name, KSYM(name),)
>  #define EXPORT_DATA_SYMBOL_GPL(name) \
>   __EXPORT_SYMBOL(name, KSYM(name),_gpl)
> +/*
> + * If "name" is a macro of a function and not a function itself,
> + * it needs a second pass.
> + */
> +#define EXPORT_MACRO_SYMBOL(x) EXPORT_SYMBOL(x)

Seems okay, but what about just calling it EXPORT_SYMBOL?

Thanks,
Nick


Re: [PATCH 4/6] mm: remove free_unmap_vmap_area_addr

2016-10-20 Thread Nicholas Piggin
On Thu, 20 Oct 2016 17:46:36 -0700
Joel Fernandes  wrote:

> > @@ -1100,10 +1091,14 @@ void vm_unmap_ram(const void *mem, unsigned int 
> > count)
> > debug_check_no_locks_freed(mem, size);
> > vmap_debug_free_range(addr, addr+size);
> >
> > -   if (likely(count <= VMAP_MAX_ALLOC))
> > +   if (likely(count <= VMAP_MAX_ALLOC)) {
> > vb_free(mem, size);
> > -   else
> > -   free_unmap_vmap_area_addr(addr);
> > +   return;
> > +   }
> > +
> > +   va = find_vmap_area(addr);
> > +   BUG_ON(!va);  
> 
> Considering recent objections to BUG_ON [1], lets make this a WARN_ON
> while we're moving the code?

If you lost track of your kernel memory mappings, you are in big trouble
and fail stop is really the best course of action for containing the
problem, which could have security and data corruption implications. This
is covered by Linus' last paragraph in that commit.

Thanks,
Nick




Re: [PATCH] kbuild: provide include/asm/asm-prototypes.h for ARM

2016-10-20 Thread Nicholas Piggin
On Thu, 20 Oct 2016 15:33:27 +0100
Russell King - ARM Linux  wrote:

> On Fri, Oct 21, 2016 at 01:20:17AM +1100, Nicholas Piggin wrote:
> > Good catch, I'm surprised you're the first one who reported it. This patch
> > seems to do the trick for me:  
> 
> And me, thanks, so...
> 
> > 
> > From: Nicholas Piggin 
> > Date: Fri, 21 Oct 2016 01:13:33 +1100
> > Subject: [PATCH] kbuild: prevent lib-ksyms.o rebuilds
> > 
> > Signed-off-by: Nicholas Piggin   
> 
> Reported-by: Russell King 
> Tested-by: Russell King 

Thanks for testing it.

Hopefully if Arnd is able to respin the ARM patch to something a bit more
to your liking that doesn't expose prototypes, and no other problems arise,
we can avoid reverting. You had a poor first impression, but we may yet win
you over.



Re: [PATCH] kbuild: provide include/asm/asm-prototypes.h for ARM

2016-10-20 Thread Nicholas Piggin
On Thu, 20 Oct 2016 14:17:02 +0100
Russell King - ARM Linux  wrote:

> On Thu, Oct 20, 2016 at 03:08:14PM +1100, Nicholas Piggin wrote:
> > Fair point, what about leaving those as they are, and also adding
> > them to asm-prototypes.h protected with GENKSYMS ifdef? It's not
> > beautiful, but still better than armksyms.c before Al's patches (or
> > at least no worse).  
> 
> I disagree (also see below).  The armksyms way was understandable.
> The new way... I've no idea yet, because I wasn't even copied on

New way is you put the EXPORT_SYMBOL in the .S file, and give it a C
style prototype in asm-prototypes.h, and that's it. The build system
will do the rest. Either way you require some file of C prototypes,
but after Al's patches, at least the EXPORT goes with its definition.

As far as rebuilding too often, that sounds like a bug, more below.


> any of the patches.  I've no idea how the exports are now handled.
> I'm in a black hole with respect to that, and that's now a problem.
> 
> > > Now, it would have _ALSO_ been nice to have been at least COPIED on the
> > > original set of changes that caused the need for this change.  I wasn't.
> > > So I want to see the original set of changes reverted, because they're
> > > clearly causing breakage.  Let's revert them and then go through the
> > > proper process of maintainer review, rather than bypassing maintainers
> > > and screwing up architectures in the process.  There really is no
> > > excuse for this crap.  
> > 
> > You may have a point about improvement of the process. I wasn't
> > involved in the original patches, but we did cc linux-arch when the
> > .S CRC issue became known.  
> 
> Yes, but I'm not on linux-kernel-v2, and I've no desire to end up with
> another list I've no hope of keeping up with to my mailbox - I'll just
> ignore it.  99% of the messages on it at the time when vger kicked me
> off the list was x86 related discussion, and not really cross-arch
> issues.  As I say, it just became another linux-kernel list.

For the patches that touched arm code, I'd agree you should have been cc'ed.

Not to dismiss that concern at all, but for issues of interest to arch code
but not specific to any (such as discovery that the asm exports change would
require this change to restore CRC generation), I was just saying linux-arch
is most appropriate for better or worse. Ccing 30 arch lists is not workable.

And again not to dismiss your concern, I'm just here trying to come up with
a workable fix for the non-revert scenario. Revert is no less valid an
option, it's just not one I'm in favour of myself.


> > However let's work on the assumption that they won't be reverted at this
> > stage, and try to come up with something to fix it that you're happy with.  
> 
> Well, there's more problems with this new KSYMS approach than just the
> CRCs.  It forces a rebuild of the ksyms files every single time, which
> then causes a relink of the kernel:

Good catch, I'm surprised you're the first one who reported it. This patch
seems to do the trick for me:

From: Nicholas Piggin 
Date: Fri, 21 Oct 2016 01:13:33 +1100
Subject: [PATCH] kbuild: prevent lib-ksyms.o rebuilds

Signed-off-by: Nicholas Piggin 

---
 scripts/Makefile.build | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index de46ab0..e1f25d6 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -430,6 +430,9 @@ cmd_export_list = $(OBJDUMP) -h $< | \
 
 $(obj)/lib-ksyms.o: $(lib-target) FORCE
$(call if_changed,export_list)
+
+targets += $(obj)/lib-ksyms.o
+
 endif
 
 #
-- 
2.9.3



Re: [PATCH] kbuild: provide include/asm/asm-prototypes.h for ARM

2016-10-19 Thread Nicholas Piggin
On Wed, 19 Oct 2016 16:32:00 +0100
Russell King - ARM Linux  wrote:

> On Wed, Oct 19, 2016 at 05:02:55PM +0200, Arnd Bergmann wrote:
> > On Wednesday, October 19, 2016 4:52:06 PM CEST Michal Marek wrote:  
> > > Dne 17.10.2016 v 14:26 Arnd Bergmann napsal(a):  
> > > > This adds an asm/asm-prototypes.h header for ARM to fix the
> > > > broken symbol versioning for symbols exported from assembler
> > > > files.
> > > > 
> > > > In addition to the header, we have to do these other small
> > > > changes:
> > > > 
> > > > - move the 'extern' declarations out of memset_io/memcpy_io
> > > >   to make them visible to the symbol version generator
> > > > - move the exports from bitops.h to {change,clear,set,...}bit.S
> > > > - move the exports from csumpartialgeneric.S into the files
> > > >   including it
> > > > 
> > > > I couldn't find the correct prototypes for the compiler builtins,
> > > > so I went with the fake 'void f(void)' prototypes that we had
> > > > before.
> > > > 
> > > > Signed-off-by: Arnd Bergmann   
> > > 
> > > Hi Arnd,
> > > 
> > > just to make sure I'm looking at the right code - is this based on the
> > > patch by Nick here: https://patchwork.kernel.org/patch/9377783/?
> > >   
> > 
> > (adding Russell to Cc, I missed him during my earlier mail, which
> > is now archived at https://lkml.org/lkml/2016/10/17/356)  
> 
> I'm not in favour of this.
> 
>+extern void mmioset(void *, unsigned int, size_t);
>+extern void mmiocpy(void *, const void *, size_t);
>+
> #ifndef __ARMBE__
> static inline void memset_io(volatile void __iomem *dst, unsigned c,
>size_t count)
> {
>-   extern void mmioset(void *, unsigned int, size_t);
>mmioset((void __force *)dst, c, count);
> }
> 
> The reason they're declared _within_ memset_io() is to prevent people
> from using them by hiding their declaration.  Moving them outside is
> an open invitation to stupid people starting to use them as an "oh it
> must be an official API".
> 
> We know this happens, there's been a long history of this kind of stupid
> in the ARM community, not only with cache flushing APIs, but also the
> DMA APIs.
> 
> The way the existing code is written is a completely valid way to hide
> declarations from outside the intended caller's scope.
> 
> We've been here many times, we've had many people doing this crap, so
> I'm now at the point of NAKing changes which result in an increased
> visibility to the rest of the kernel of symbols that should not be
> used by stupid driver authors.
> 
> Now, why do we have these extra functions when they're just aliased to
> memset()/memcpy() - to avoid GCC optimising them because it thinks that
> they're standard memset()/memcpy().
> 
> So overall this gets a NAK from me.

Fair point, what about leaving those as they are, and also adding
them to asm-prototypes.h protected with GENKSYMS ifdef? It's not
beautiful, but still better than armksyms.c before Al's patches (or
at least no worse).


> Now, it would have _ALSO_ been nice to have been at least COPIED on the
> original set of changes that caused the need for this change.  I wasn't.
> So I want to see the original set of changes reverted, because they're
> clearly causing breakage.  Let's revert them and then go through the
> proper process of maintainer review, rather than bypassing maintainers
> and screwing up architectures in the process.  There really is no
> excuse for this crap.

You may have a point about improvement of the process. I wasn't
involved in the original patches, but we did cc linux-arch when the
.S CRC issue became known.

However let's work on the assumption that they won't be reverted at this
stage, and try to come up with something to fix it that you're happy with.

Thanks,
Nick


Re: [GIT PULL] kbuild changes for v4.9-rc1

2016-10-19 Thread Nicholas Piggin
On Wed, 19 Oct 2016 16:38:14 +0200
Michal Marek  wrote:

> Dne 18.10.2016 v 03:34 Nicholas Piggin napsal(a):
> > We should probably just bring all these arch patches through the
> > kbuild tree.
> > 
> > I'm sorry for the breakage: I didn't realize it broke the build with
> > some configs, otherwise I would have given Michal a heads up before
> > his pull request, and worked to get this stuff in first.  
> 
> It breaks with some binutils versions only (and only with
> CONFIG_MODVERSIONS=y, of course).

Yeah this seems to be the issue, it apparently slipped past all the
automated builds. It seems like the existing CRC warnings in the tree
only trigger in rare circumstances too, so something could be a bit
fragile there.

Thanks,
Nick


Re: [RFC] reduce latency in __purge_vmap_area_lazy

2016-10-18 Thread Nicholas Piggin
On Tue, 18 Oct 2016 08:56:05 +0200
Christoph Hellwig  wrote:

> Hi all,
> 
> this is my spin at sorting out the long lock hold times in
> __purge_vmap_area_lazy.  It is based on the patch from Joel sent this
> week.  I don't have any good numbers for it, but it survived an
> xfstests run on XFS which is a significant vmalloc user.  The
> changelogs could still be improved as well, but I'd rather get it
> out quickly for feedback and testing.

All seems pretty good to me.

Thanks,
Nick


Re: [GIT PULL] kbuild changes for v4.9-rc1

2016-10-17 Thread Nicholas Piggin
Hi Adam,

Thanks, this is looking good. powerpc will be able to use the generic
header.

On Tue, 18 Oct 2016 02:16:26 +0200
Adam Borowski  wrote:

> On Mon, Oct 17, 2016 at 02:22:34PM +0200, Mathieu OTHACEHE wrote:
> > > +#include 
> > > +#include   
> > 
> > Included twice.  
> 
> D'oh!
> 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +#include 
> > > +
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include   
> > 
> > No  for __sw_hweight32 and __sw_hweight64 ?   
> 
> diff --git a/include/asm-generic/asm-prototypes.h 
> b/include/asm-generic/asm-prototypes.h
> new file mode 100644
> index 000..df13637
> --- /dev/null
> +++ b/include/asm-generic/asm-prototypes.h
> @@ -0,0 +1,7 @@
> +#include 
> 
> ... which has these.
> 
> Alexey Dobriyan  wrote:
> } bitops.h is wrong header as well.
> } Why do you need bitops for bunch of function prototypes?
> 
> Unless you guys prefer using low-level headers only, that is.

Well you can't use asm/arch_hweight.h in a generic header of course.
I would suggest just including linux/ variants where practical for
the asm-generic/asm-prototypes.h header.

We should probably just bring all these arch patches through the
kbuild tree.

I'm sorry for the breakage: I didn't realize it broke the build with
some configs, otherwise I would have given Michal a heads up before
his pull request, and worked to get this stuff in first.

Thanks,
Nick


Re: [GIT PULL] kbuild changes for v4.9-rc1

2016-10-17 Thread Nicholas Piggin
On Mon, 17 Oct 2016 08:51:31 +0200
Adam Borowski  wrote:

> On Mon, Oct 17, 2016 at 02:57:09PM +1100, Nicholas Piggin wrote:
> > On Sat, 15 Oct 2016 17:22:05 -0700 Omar Sandoval  
> > wrote:  
> > > On Fri, Oct 14, 2016 at 10:12:46PM +0200, Michal Marek wrote:  
> > > > please pull these kbuild changes for v4.9-rc1:
> > > > 
> > > > - EXPORT_SYMBOL for asm source by Al Viro. This does bring a regression,
> > > >   because genksyms no longer generates checksums for these symbols
> > > >   (CONFIG_MODVERSIONS). Nick Piggin is working on a patch to fix this.
> > > >   Plus, we are talking about functions like strcpy(), which rarely
> > > >   change prototypes.
> > > 
> > > So this has broken all module loading for me. I get the following dmesg
> > > spew:
> > > ...
> > > [4.586914] scsi_mod: no symbol version for memset
> > > [4.587920] scsi_mod: Unknown symbol memset (err -22)
> > > [4.588443] scsi_mod: no symbol version for ___preempt_schedule
> > > [4.589026] scsi_mod: Unknown symbol ___preempt_schedule (err -22)
> > > ...
> > > 
> > > Reverting 784d5699eddc ("x86: move exports to actual definitions") fixes
> > > it for me. This is with GCC 6.2.1, binutils 2.27, attached config.  
> > 
> > Thanks for the report. Could you try this patch and see if it helps?  
> [patch snipped]
> 
> Omar probably won't wake up in quite a while, so I've tested the patch.
> Alas, doesn't help.  Similar spew (for the few modules I don't have =y),
> while reverting 784d5699eddc fixes it for me too.
> 
> Debian sid toolchain: gcc 6.2.0-6, binutils 2.27-8, config at
> https://angband.pl/tmp/config-4.9.0-rc1-debug+.gz

Forgot to engage my brain before posting.

Architectures will need to have an include/asm/asm-prototypes.h that
defines or #include<>s C-style prototypes for exported asm functions.
We can do an asm-generic version for the common ones like memset so
there's not a lot of pointless duplication there.

Care to do a patch for x86?

Thanks,
Nick


Re: [PATCH v2] mm: vmalloc: Replace purge_lock spinlock with atomic refcount

2016-10-16 Thread Nicholas Piggin
On Sat, 15 Oct 2016 03:42:42 -0700
Joel Fernandes  wrote:

> The purge_lock spinlock causes high latencies with non RT kernel. This has 
> been
> reported multiple times on lkml [1] [2] and affects applications like audio.
> 
> In this patch, I replace the spinlock with an atomic refcount so that
> preemption is kept turned on during purge. This Ok to do since [3] builds the
> lazy free list in advance and atomically retrieves the list so any instance of
> purge will have its own list it is purging. Since the individual vmap area
> frees are themselves protected by a lock, this is Ok.

This is a good idea, and good results, but that's not what the spinlock was
for -- it was for enforcing the sync semantics.

Going this route, you'll have to audit callers to expect changed behavior
and change documentation of sync parameter.

I suspect a better approach would be to instead use a mutex for this, and
require that all sync=1 callers be able to sleep. I would say that most
probably already can.

Thanks,
Nick


Re: [GIT PULL] kbuild changes for v4.9-rc1

2016-10-16 Thread Nicholas Piggin
On Sat, 15 Oct 2016 17:22:05 -0700
Omar Sandoval  wrote:

> On Fri, Oct 14, 2016 at 10:12:46PM +0200, Michal Marek wrote:
> > Hi Linus,
> > 
> > please pull these kbuild changes for v4.9-rc1:
> > 
> > - EXPORT_SYMBOL for asm source by Al Viro. This does bring a regression,
> >   because genksyms no longer generates checksums for these symbols
> >   (CONFIG_MODVERSIONS). Nick Piggin is working on a patch to fix this.
> >   Plus, we are talking about functions like strcpy(), which rarely
> >   change prototypes.  
> 
> So this has broken all module loading for me. I get the following dmesg
> spew:
> 
> ...
> [4.586914] scsi_mod: no symbol version for memset
> [4.587920] scsi_mod: Unknown symbol memset (err -22)
> [4.588443] scsi_mod: no symbol version for ___preempt_schedule
> [4.589026] scsi_mod: Unknown symbol ___preempt_schedule (err -22)
> ...
> 
> Reverting 784d5699eddc ("x86: move exports to actual definitions") fixes
> it for me. This is with GCC 6.2.1, binutils 2.27, attached config.
> 

Thanks for the report. Could you try this patch and see if it helps?


Allow architectures to create asm/asm-prototypes.h file that
provides C prototypes for exported asm functions, which enables
proper CRC versions to be generated for them.

It's done by creating a trivial C program that includes that file
and the EXPORT_SYMBOL()s from the .S file, and preprocesses that
then sends the result to genksyms.

Signed-off-by: Nicholas Piggin 
---
 scripts/Makefile.build | 71 +-
 1 file changed, 65 insertions(+), 6 deletions(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index de46ab0..edcf876 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -159,7 +159,8 @@ cmd_cpp_i_c   = $(CPP) $(c_flags) -o $@ $<
 $(obj)/%.i: $(src)/%.c FORCE
$(call if_changed_dep,cpp_i_c)
 
-cmd_gensymtypes =   \
+# These mirror gensymtypes_S and co below, keep them in synch.
+cmd_gensymtypes_c = \
 $(CPP) -D__GENKSYMS__ $(c_flags) $< |   \
 $(GENKSYMS) $(if $(1), -T $(2)) \
  $(patsubst y,-s _,$(CONFIG_HAVE_UNDERSCORE_SYMBOL_PREFIX)) \
@@ -169,7 +170,7 @@ cmd_gensymtypes =   
\
 quiet_cmd_cc_symtypes_c = SYM $(quiet_modtag) $@
 cmd_cc_symtypes_c = \
 set -e; \
-$(call cmd_gensymtypes,true,$@) >/dev/null; \
+$(call cmd_gensymtypes_c,true,$@) >/dev/null;   \
 test -s $@ || rm -f $@
 
 $(obj)/%.symtypes : $(src)/%.c FORCE
@@ -198,9 +199,10 @@ else
 #   the actual value of the checksum generated by genksyms
 
 cmd_cc_o_c = $(CC) $(c_flags) -c -o $(@D)/.tmp_$(@F) $<
-cmd_modversions =  
\
+
+cmd_modversions_c =
\
if $(OBJDUMP) -h $(@D)/.tmp_$(@F) | grep -q __ksymtab; then 
\
-   $(call cmd_gensymtypes,$(KBUILD_SYMTYPES),$(@:.o=.symtypes))
\
+   $(call cmd_gensymtypes_c,$(KBUILD_SYMTYPES),$(@:.o=.symtypes))  
\
> $(@D)/.tmp_$(@F:.o=.ver); 
\

\
$(LD) $(LDFLAGS) -r -o $@ $(@D)/.tmp_$(@F)  
\
@@ -268,13 +270,14 @@ endif # CONFIG_STACK_VALIDATION
 define rule_cc_o_c
$(call echo-cmd,checksrc) $(cmd_checksrc) \
$(call cmd_and_fixdep,cc_o_c) \
-   $(cmd_modversions)\
+   $(cmd_modversions_c)  \
$(cmd_objtool)\
$(call echo-cmd,record_mcount) $(cmd_record_mcount)
 endef
 
 define rule_as_o_S
$(call cmd_and_fixdep,as_o_S) \
+   $(cmd_modversions_S)  \
$(cmd_objtool)
 endef
 
@@ -314,6 +317,32 @@ modkern_aflags := $(KBUILD_AFLAGS_KERNEL) $(AFLAGS_KERNEL)
 $(real-objs-m)  : modkern_aflags := $(KBUILD_AFLAGS_MODULE) 
$(AFLAGS_MODULE)
 $(real-objs-m:.o=.s): modkern_aflags := $(KBUILD_AFLAGS_MODULE) 
$(AFLAGS_MODULE)
 
+# .S file exports must have their C prototypes defined in asm/asm-prototypes.h
+# or a file that it includes, in order to get versioned symbols. We build a
+# dummy C file that includes asm-pr

Re: ppc64 qemu test failure since commit f9aa67142 ("powerpc/64s: Consolidate Alignment 0x600 interrupt")

2016-10-11 Thread Nicholas Piggin
On Mon, 10 Oct 2016 07:15:11 -0700
Guenter Roeck  wrote:

> On 10/09/2016 10:49 PM, Nicholas Piggin wrote:
> > On Sun, 9 Oct 2016 08:21:21 -0700
> > Guenter Roeck  wrote:
> >  
> >> Nicholas,
> >>
> >> some of my qemu tests for ppc64 started failing on mainline (and -next).
> >> You can find a test log at
> >> http://kerneltests.org/builders/qemu-ppc64-master/builds/580/steps/qemubuildcommand/logs/stdio
> >>
> >> The scripts to run the test are available at
> >> https://github.com/groeck/linux-build-test/tree/master/rootfs/ppc64
> >>
> >> Bisect points to commit f9aa67142ef26 ("powerpc/64s: Consolidate Alignment 
> >> 0x600
> >> interrupt"). Bisect log is attached.
> >>
> >> Since I don't have the means to run the code on a real system, I have no 
> >> idea
> >> if the problem is caused by qemu or by the code. It is interesting, 
> >> though, that
> >> only the 'mac99' tests are affected.
> >>
> >> Please let me know if there is anything I can do to help tracking down the
> >> problem.  
> >
> > Thanks for this. That patch just moves a small amount of code, so it's 
> > likely
> > that it's caused something to get placed out of range of its caller, or the
> > linker started generating a stub for some reason. I can't immediately see 
> > the
> > problem, but it could be specific to your exact toolchain.
> >
> > Something that might help, would you be able to put the compiled vmlinux 
> > binaries
> > from before/after the bad patch somewhere I can grab them?
> >  
> 
> http://server.roeck-us.net/qemu/ppc64/mac99/
> 
> 'bad' is at f9aa67142ef26, 'good' is one commit earlier, 'tot' is from top of 
> tree
> (b66484cd7470, more specifically).
> 
> Key difference in System.map, from the bad case:
> 
> c0005c00 T __end_interrupts
> c0007000 t end_virt_trampolines
> c0008000 t 0010.long_branch.power4_fixup_nap+0
> c0008100 t fs_label
> c0008100 t start_text
> 
> 0010.long_branch.power4_fixup_nap+0 does not exist in the good case,
> and fs_label/start_text are at c0008000.
> 
> Toolchain is from poky 1.5.1, which uses gcc 4.8.1 and binutils 2.23.2.
> I also tried with the toolchain from poky 1.6, using gcc 4.8.2 and binutils 
> 2.24,
> with the same result.

Thank you for the quick response, this points to the exact problem.
I've attached a patch which should fix the bug. There are some checks
I've got planned that will catch this type of thing at build time and
be much easier to track down.

Thanks,
Nick

From: Nicholas Piggin 
Date: Tue, 11 Oct 2016 18:33:26 +1100
Subject: [PATCH] powerpc/64s: fix power4_fixup_nap placement

power4_fixup_nap is called from the "common" handlers, not the virt/real
handlers, therefore it should itself be a common handler. Placing it
down in the trampoline space caused it to go out of reach of its
callers, requiring a trampoline inserted at the start of the text
section, which breaks the fixed section address calculations.

Reported-by: Guenter Roeck 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 08992f8..f129408 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1377,7 +1377,7 @@ __end_interrupts:
 DEFINE_FIXED_SYMBOL(__end_interrupts)
 
 #ifdef CONFIG_PPC_970_NAP
-TRAMP_REAL_BEGIN(power4_fixup_nap)
+EXC_COMMON_BEGIN(power4_fixup_nap)
andcr9,r9,r10
std r9,TI_LOCAL_FLAGS(r11)
ld  r10,_LINK(r1)   /* make idle task do the */
-- 
2.9.3




Re: perf TUI fails with "failed to process type: 64"

2016-10-09 Thread Nicholas Piggin
On Mon, 10 Oct 2016 16:02:07 +1100
Michael Ellerman  wrote:

> Anton Blanchard  writes:
> 
> > Hi,
> >
> > Updating to mainline as of last night, I started seeing the following
> > error when running the perf report TUI:
> >
> > 0x46068 [0x8]: failed to process type: 68
> >
> > This event is just PERF_RECORD_FINISHED_ROUND:
> >
> > 0x46068 [0x8]: event: 68
> > .
> > . ... raw event: size 8 bytes
> > .  :  44 00 00 00 00 00 08 00  D...
> >
> > 0x46068 [0x8]: PERF_RECORD_FINISHED_ROUND
> >
> > Which of course is not our error. It took me a while to find the real
> > culprit:
> >
> >  14c00-14c00 g exc_virt_0x4c00_system_call  
>^
>What's this? The address? If so it's wrong?
>
> 
> > A zero length symbol, which __symbol__inc_addr_samples() barfs on:
> >
> > if (addr < sym->start || addr >= sym->end) {
> > ...
> > return -ERANGE;
> >
> > Seems like we have 3 bugs here:  
> ...
> >
> > 3. Why do we have zero length symbols in the first place? Does the recent
> >ppc64 exception clean up have something to do with it?  
> 
> Seems likely. But I can't see why.
> 
> AFAICS we have never emitted a size for those symbols:
> 
> Old:
> $ nm --print-size build/vmlinux | grep -w system_call_relon_pSeries
> c0004c00 T system_call_relon_pSeries
> 
> New:
> $ nm --print-size build/vmlinux | grep -w exc_virt_0x4c00_system_call
> c0004c00 T exc_virt_0x4c00_system_call
> 
> 
> It also doesn't look like we're emitting another symbol with the same
> address, which has caused confusion in the past:
> 
> Old:
> c0004c00 T exc_virt_0x4c00_system_call
> c0004d00 T exc_virt_0x4d00_single_step
> 
> New:
> c0004c00 T system_call_relon_pSeries
> c0004d00 T single_step_relon_pSeries
> 
> 
> So more digging required.

Yeah, strange. Maybe perf changes?

We were talking about adding size and type to the exception symbols
though.



Re: ppc64 qemu test failure since commit f9aa67142 ("powerpc/64s: Consolidate Alignment 0x600 interrupt")

2016-10-09 Thread Nicholas Piggin
On Sun, 9 Oct 2016 08:21:21 -0700
Guenter Roeck  wrote:

> Nicholas,
> 
> some of my qemu tests for ppc64 started failing on mainline (and -next).
> You can find a test log at
> http://kerneltests.org/builders/qemu-ppc64-master/builds/580/steps/qemubuildcommand/logs/stdio
> 
> The scripts to run the test are available at
> https://github.com/groeck/linux-build-test/tree/master/rootfs/ppc64
> 
> Bisect points to commit f9aa67142ef26 ("powerpc/64s: Consolidate Alignment 
> 0x600
> interrupt"). Bisect log is attached.
> 
> Since I don't have the means to run the code on a real system, I have no idea
> if the problem is caused by qemu or by the code. It is interesting, though, 
> that
> only the 'mac99' tests are affected.
> 
> Please let me know if there is anything I can do to help tracking down the
> problem.

Thanks for this. That patch just moves a small amount of code, so it's likely
that it's caused something to get placed out of range of its caller, or the
linker started generating a stub for some reason. I can't immediately see the
problem, but it could be specific to your exact toolchain.

Something that might help, would you be able to put the compiled vmlinux 
binaries
from before/after the bad patch somewhere I can grab them?

Thanks,
Nick


Re: [PATCH] fs/select: add vmalloc fallback for select(2)

2016-09-27 Thread Nicholas Piggin
On Tue, 27 Sep 2016 11:37:24 +
David Laight  wrote:

> From: Nicholas Piggin
> > Sent: 27 September 2016 12:25
> > On Tue, 27 Sep 2016 10:44:04 +0200
> > Vlastimil Babka  wrote:
> >   
> > > On 09/23/2016 06:47 PM, Jason Baron wrote:  
> > > > Hi,
> > > >
> > > > On 09/23/2016 03:24 AM, Nicholas Piggin wrote:  
> > > >> On Fri, 23 Sep 2016 14:42:53 +0800
> > > >> "Hillf Danton"  wrote:
> > > >>  
> > > >>>>
> > > >>>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where 
> > > >>>> size grows
> > > >>>> with the number of fds passed. We had a customer report page 
> > > >>>> allocation
> > > >>>> failures of order-4 for this allocation. This is a costly order, so 
> > > >>>> it might
> > > >>>> easily fail, as the VM expects such allocation to have a lower-order 
> > > >>>> fallback.
> > > >>>>
> > > >>>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> > > >>>> physically contiguous. Also the allocation is temporary for the 
> > > >>>> duration of the
> > > >>>> syscall, so it's unlikely to stress vmalloc too much.
> > > >>>>
> > > >>>> Note that the poll(2) syscall seems to use a linked list of order-0 
> > > >>>> pages, so
> > > >>>> it doesn't need this kind of fallback.  
> > > >>
> > > >> How about something like this? (untested)  
> > >
> > > This pushes the limit further, but might just delay the problem. Could be 
> > > an
> > > optimization on top if there's enough interest, though.  
> > 
> > What's your customer doing with those selects? If they care at all about
> > performance, I doubt they want select to attempt order-4 allocations, fail,
> > then use vmalloc :)  
> 
> If they care about performance they shouldn't be passing select() lists that
> are anywhere near that large.
> If the number of actual fd is small - use poll().

Right. Presumably it's some old app they're still using, no?


Re: [PATCH] fs/select: add vmalloc fallback for select(2)

2016-09-27 Thread Nicholas Piggin
On Tue, 27 Sep 2016 10:44:04 +0200
Vlastimil Babka  wrote:

> On 09/23/2016 06:47 PM, Jason Baron wrote:
> > Hi,
> >
> > On 09/23/2016 03:24 AM, Nicholas Piggin wrote:  
> >> On Fri, 23 Sep 2016 14:42:53 +0800
> >> "Hillf Danton"  wrote:
> >>  
> >>>>
> >>>> The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size 
> >>>> grows
> >>>> with the number of fds passed. We had a customer report page allocation
> >>>> failures of order-4 for this allocation. This is a costly order, so it 
> >>>> might
> >>>> easily fail, as the VM expects such allocation to have a lower-order 
> >>>> fallback.
> >>>>
> >>>> Such trivial fallback is vmalloc(), as the memory doesn't have to be
> >>>> physically contiguous. Also the allocation is temporary for the duration 
> >>>> of the
> >>>> syscall, so it's unlikely to stress vmalloc too much.
> >>>>
> >>>> Note that the poll(2) syscall seems to use a linked list of order-0 
> >>>> pages, so
> >>>> it doesn't need this kind of fallback.  
> >>
> >> How about something like this? (untested)  
> 
> This pushes the limit further, but might just delay the problem. Could be an 
> optimization on top if there's enough interest, though.

What's your customer doing with those selects? If they care at all about
performance, I doubt they want select to attempt order-4 allocations, fail,
then use vmalloc :)



Re: [PATCH] percpu: improve generic percpu modify-return implementation

2016-09-23 Thread Nicholas Piggin
On Thu, 22 Sep 2016 12:07:49 -0400
Tejun Heo  wrote:

> Hello,
> 
> On Thu, Sep 22, 2016 at 02:35:00PM +1000, Nicholas Piggin wrote:
> > Well thank you, how about you?  
> 
> Heh, can't complain.  Hope to see you around sometime.  It's been
> forever.

Yeah, it has been. Hopefully I'll see you around.


> > Trying a new mail client, sorry. It *seems* to be working now, how's
> > this?  
> 
> Hmm... Still encoded.

It looks to be a helpful surprise feature of claws. Any line starting with
the word "From" make it go q-p due to some mail servers treating that
differently. Sigh.

> 
> > From d0cb9052d6f4c31d24f999b7b0cecb34681eee9b Mon Sep 17 00:00:00 2001
> > From: Nicholas Piggin 
> > Date: Wed, 21 Sep 2016 18:23:43 +1000
> > Subject: [PATCH] percpu: improve generic percpu modify-return 
> > implementations
> > 
> > Some architectures require an additional load to find the address of
> > percpu pointers. In some implemenatations, the C aliasing rules do not
> > allow the result of that load to be kept over the store that modifies
> > the percpu variable, which causes additional loads.
> > 
> > Work around this by finding the pointer first, then operating on that.
> > 
> > It's also possible to mark things as restrict and those kind of games,
> > but that can require larger and arch specific changes.  
> 
> QP-decoded and applied to percpu/for-4.9.
> 
> Thanks.
> 

Thanks!


Re: [PATCH 3/5] mm/vmalloc.c: correct lazy_max_pages() return value

2016-09-23 Thread Nicholas Piggin
On Fri, 23 Sep 2016 13:00:35 +0800
zijun_hu  wrote:

> On 2016/9/23 11:30, Nicholas Piggin wrote:
> > On Fri, 23 Sep 2016 00:30:20 +0800
> > zijun_hu  wrote:
> >   
> >> On 2016/9/22 20:37, Michal Hocko wrote:  
> >>> On Thu 22-09-16 09:13:50, zijun_hu wrote:
> >>>> On 09/22/2016 08:35 AM, David Rientjes wrote:
> >>> [...]
> >>>>> The intent is as it is implemented; with your change, lazy_max_pages() 
> >>>>> is 
> >>>>> potentially increased depending on the number of online cpus.  This is 
> >>>>> only a heuristic, changing it would need justification on why the new
> >>>>> value is better.  It is opposite to what the comment says: "to be 
> >>>>> conservative and not introduce a big latency on huge systems, so go with
> >>>>> a less aggressive log scale."  NACK to the patch.
> >>>>>
> >>>> my change potentially make lazy_max_pages() decreased not increased, i 
> >>>> seems
> >>>> conform with the comment
> >>>>
> >>>> if the number of online CPUs is not power of 2, both have no any 
> >>>> difference
> >>>> otherwise, my change remain power of 2 value, and the original code 
> >>>> rounds up
> >>>> to next power of 2 value, for instance
> >>>>
> >>>> my change : (32, 64] -> 64
> >>>>   32 -> 32, 64 -> 64
> >>>> the original code: [32, 63) -> 64
> >>>>32 -> 64, 64 -> 128
> >>>
> >>> You still completely failed to explain _why_ this is an improvement/fix
> >>> or why it matters. This all should be in the changelog.
> >>> 
> >>
> >> Hi npiggin,
> >> could you give some comments for this patch since lazy_max_pages() is 
> >> introduced
> >> by you
> >>
> >> my patch is based on the difference between fls() and get_count_order() 
> >> mainly
> >> the difference between fls() and get_count_order() will be shown below
> >> more MM experts maybe help to decide which is more suitable
> >>
> >> if parameter > 1, both have different return value only when parameter is
> >> power of two, for example
> >>
> >> fls(32) = 6 VS get_count_order(32) = 5
> >> fls(33) = 6 VS get_count_order(33) = 6
> >> fls(63) = 6 VS get_count_order(63) = 6
> >> fls(64) = 7 VS get_count_order(64) = 6
> >>
> >> @@ -594,7 +594,9 @@ static unsigned long lazy_max_pages(void) 
> >> { 
> >> unsigned int log; 
> >>
> >> -log = fls(num_online_cpus()); 
> >> +log = num_online_cpus(); 
> >> +if (log > 1) 
> >> +log = (unsigned int)get_count_order(log); 
> >>
> >> return log * (32UL * 1024 * 1024 / PAGE_SIZE); 
> >> } 
> >>  
> > 
> > To be honest, I don't think I chose it with a lot of analysis.
> > It will depend on the kernel usage patterns, the arch code,
> > and the CPU microarchitecture, all of which would have changed
> > significantly.
> > 
> > I wouldn't bother changing it unless you do some bench marking
> > on different system sizes to see where the best performance is.
> > (If performance is equal, fewer lazy pages would be better.)
> > 
> > Good to see you taking a look at this vmalloc stuff. Don't be
> > discouraged if you run into some dead ends.
> > 
> > Thanks,
> > Nick
> >   
> thanks for your reply
> please don't pay attention to this patch any more since i don't have
> condition to do many test and comparison
> 
> i just feel my change maybe be consistent with operation of rounding up
> to power of 2

You could be right about that, but in cases like this, existing code
probably trumps original comments or intention.

What I mean is that it's a heuristic anyway, so the current behaviour
is what has been used and tested for a long time. Therefore any change
would need to be justified by showing it's an improvement.

I'm sure the existing size is not perfect, but we don't know whether
your change would be an improvement in practice. Does that make sense?

Thanks,
Nick


Re: [PATCH] fs/select: add vmalloc fallback for select(2)

2016-09-23 Thread Nicholas Piggin
On Fri, 23 Sep 2016 14:42:53 +0800
"Hillf Danton"  wrote:

> > 
> > The select(2) syscall performs a kmalloc(size, GFP_KERNEL) where size grows
> > with the number of fds passed. We had a customer report page allocation
> > failures of order-4 for this allocation. This is a costly order, so it might
> > easily fail, as the VM expects such allocation to have a lower-order 
> > fallback.
> > 
> > Such trivial fallback is vmalloc(), as the memory doesn't have to be
> > physically contiguous. Also the allocation is temporary for the duration of 
> > the
> > syscall, so it's unlikely to stress vmalloc too much.
> > 
> > Note that the poll(2) syscall seems to use a linked list of order-0 pages, 
> > so
> > it doesn't need this kind of fallback.

How about something like this? (untested)

Eric isn't wrong about vmalloc sucking :)

Thanks,
Nick


---
 fs/select.c | 57 +++--
 1 file changed, 43 insertions(+), 14 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 8ed9da5..3b4834c 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -555,6 +555,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
__user *outp,
void *bits;
int ret, max_fds;
unsigned int size;
+   size_t nr_bytes;
struct fdtable *fdt;
/* Allocate small arguments on the stack to save memory and be faster */
long stack_fds[SELECT_STACK_ALLOC/sizeof(long)];
@@ -576,21 +577,39 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
__user *outp,
 * since we used fdset we need to allocate memory in units of
 * long-words. 
 */
-   size = FDS_BYTES(n);
+   ret = -ENOMEM;
bits = stack_fds;
-   if (size > sizeof(stack_fds) / 6) {
-   /* Not enough space in on-stack array; must use kmalloc */
+   size = FDS_BYTES(n);
+   nr_bytes = 6 * size;
+
+   if (unlikely(nr_bytes > PAGE_SIZE)) {
+   /* Avoid multi-page allocation if possible */
ret = -ENOMEM;
-   bits = kmalloc(6 * size, GFP_KERNEL);
-   if (!bits)
-   goto out_nofds;
+   fds.in = kmalloc(size, GFP_KERNEL);
+   fds.out = kmalloc(size, GFP_KERNEL);
+   fds.ex = kmalloc(size, GFP_KERNEL);
+   fds.res_in = kmalloc(size, GFP_KERNEL);
+   fds.res_out = kmalloc(size, GFP_KERNEL);
+   fds.res_ex = kmalloc(size, GFP_KERNEL);
+
+   if (!(fds.in && fds.out && fds.ex &&
+   fds.res_in && fds.res_out && fds.res_ex))
+   goto out;
+   } else {
+   if (nr_bytes > sizeof(stack_fds)) {
+   /* Not enough space in on-stack array */
+   if (nr_bytes > PAGE_SIZE * 2)
+   bits = kmalloc(nr_bytes, GFP_KERNEL);
+   if (!bits)
+   goto out_nofds;
+   }
+   fds.in  = bits;
+   fds.out = bits +   size;
+   fds.ex  = bits + 2*size;
+   fds.res_in  = bits + 3*size;
+   fds.res_out = bits + 4*size;
+   fds.res_ex  = bits + 5*size;
}
-   fds.in  = bits;
-   fds.out = bits +   size;
-   fds.ex  = bits + 2*size;
-   fds.res_in  = bits + 3*size;
-   fds.res_out = bits + 4*size;
-   fds.res_ex  = bits + 5*size;
 
if ((ret = get_fd_set(n, inp, fds.in)) ||
(ret = get_fd_set(n, outp, fds.out)) ||
@@ -617,8 +636,18 @@ int core_sys_select(int n, fd_set __user *inp, fd_set 
__user *outp,
ret = -EFAULT;
 
 out:
-   if (bits != stack_fds)
-   kfree(bits);
+   if (unlikely(nr_bytes > PAGE_SIZE)) {
+   kfree(fds.in);
+   kfree(fds.out);
+   kfree(fds.ex);
+   kfree(fds.res_in);
+   kfree(fds.res_out);
+   kfree(fds.res_ex);
+   } else {
+   if (bits != stack_fds)
+   kfree(bits);
+   }
+
 out_nofds:
return ret;
 }
-- 
2.9.3



Re: [PATCH 3/5] mm/vmalloc.c: correct lazy_max_pages() return value

2016-09-22 Thread Nicholas Piggin
On Fri, 23 Sep 2016 00:30:20 +0800
zijun_hu  wrote:

> On 2016/9/22 20:37, Michal Hocko wrote:
> > On Thu 22-09-16 09:13:50, zijun_hu wrote:  
> >> On 09/22/2016 08:35 AM, David Rientjes wrote:  
> > [...]  
> >>> The intent is as it is implemented; with your change, lazy_max_pages() is 
> >>> potentially increased depending on the number of online cpus.  This is 
> >>> only a heuristic, changing it would need justification on why the new
> >>> value is better.  It is opposite to what the comment says: "to be 
> >>> conservative and not introduce a big latency on huge systems, so go with
> >>> a less aggressive log scale."  NACK to the patch.
> >>>  
> >> my change potentially make lazy_max_pages() decreased not increased, i 
> >> seems
> >> conform with the comment
> >>
> >> if the number of online CPUs is not power of 2, both have no any difference
> >> otherwise, my change remain power of 2 value, and the original code rounds 
> >> up
> >> to next power of 2 value, for instance
> >>
> >> my change : (32, 64] -> 64
> >> 32 -> 32, 64 -> 64
> >> the original code: [32, 63) -> 64
> >>32 -> 64, 64 -> 128  
> > 
> > You still completely failed to explain _why_ this is an improvement/fix
> > or why it matters. This all should be in the changelog.
> >   
> 
> Hi npiggin,
> could you give some comments for this patch since lazy_max_pages() is 
> introduced
> by you
> 
> my patch is based on the difference between fls() and get_count_order() mainly
> the difference between fls() and get_count_order() will be shown below
> more MM experts maybe help to decide which is more suitable
> 
> if parameter > 1, both have different return value only when parameter is
> power of two, for example
> 
> fls(32) = 6 VS get_count_order(32) = 5
> fls(33) = 6 VS get_count_order(33) = 6
> fls(63) = 6 VS get_count_order(63) = 6
> fls(64) = 7 VS get_count_order(64) = 6
> 
> @@ -594,7 +594,9 @@ static unsigned long lazy_max_pages(void) 
> { 
> unsigned int log; 
> 
> -log = fls(num_online_cpus()); 
> +log = num_online_cpus(); 
> +if (log > 1) 
> +log = (unsigned int)get_count_order(log); 
> 
> return log * (32UL * 1024 * 1024 / PAGE_SIZE); 
> } 
> 

To be honest, I don't think I chose it with a lot of analysis.
It will depend on the kernel usage patterns, the arch code,
and the CPU microarchitecture, all of which would have changed
significantly.

I wouldn't bother changing it unless you do some benchmarking
on different system sizes to see where the best performance is.
(If performance is equal, fewer lazy pages would be better.)

Good to see you taking a look at this vmalloc stuff. Don't be
discouraged if you run into some dead ends.

Thanks,
Nick


Re: [PATCH] percpu: improve generic percpu modify-return implementation

2016-09-21 Thread Nicholas Piggin
On Wed, 21 Sep 2016 10:23:43 -0400
Tejun Heo  wrote:

> Hello, Nick.
> 
> How have you been? :)

Hey Tejun,

Well thank you, how about you?
 
> On Wed, Sep 21, 2016 at 08:57:11PM +1000, Nicholas Piggin wrote:
> > On Wed, 21 Sep 2016 18:51:37 +1000
> > Nicholas Piggin  wrote:
> >   
> > > Some architectures require an additional load to find the address of
> > > percpu pointers. In some implemenatations, the C aliasing rules do not
> > > allow the result of that load to be kept over the store that modifies
> > > the percpu variable, which causes additional loads.  
> > 
> > Sorry I picked up an old patch here. This one should be better.
> > 
> > From d0cb9052d6f4c31d24f999b7b0cecb34681eee9b Mon Sep 17 00:00:00 2001
> > From: Nicholas Piggin 
> > Date: Wed, 21 Sep 2016 18:23:43 +1000
> > Subject: [PATCH] percpu: improve generic percpu modify-return 
> > implementations
> > 
> > Some architectures require an additional load to find the address of
> > percpu pointers. In some implemenatations, the C aliasing rules do not
> > allow the result of that load to be kept over the store that modifies
> > the percpu variable, which causes additional loads.
> > 
> > Work around this by finding the pointer first, then operating on that.
> > 
> > It's also possible to mark things as restrict and those kind of games,
> > but that can require larger and arch specific changes.
> > 
> > On powerpc, __this_cpu_inc_return compiles to:
> > 
> > ld 10,48(13)
> > ldx 9,3,10
> > addi 9,9,1
> > stdx 9,3,10
> > ld 9,48(13)
> > ldx 3,9,3
> > 
> > With this patch it compiles to:
> > 
> > ld 10,48(13)
> > ldx 9,3,10
> > addi 9,9,1
> > stdx 9,3,10
> > 
> > Signed-off-by: Nicholas Piggin   
> 
> Patch looks good to me but seems QP encoded.  Can you please resend?
> 
> Thanks and it's great to see you again!
> 

Trying a new mail client, sorry. It *seems* to be working now, how's
this?

From d0cb9052d6f4c31d24f999b7b0cecb34681eee9b Mon Sep 17 00:00:00 2001
From: Nicholas Piggin 
Date: Wed, 21 Sep 2016 18:23:43 +1000
Subject: [PATCH] percpu: improve generic percpu modify-return implementations

Some architectures require an additional load to find the address of
percpu pointers. In some implemenatations, the C aliasing rules do not
allow the result of that load to be kept over the store that modifies
the percpu variable, which causes additional loads.

Work around this by finding the pointer first, then operating on that.

It's also possible to mark things as restrict and those kind of games,
but that can require larger and arch specific changes.

On powerpc, __this_cpu_inc_return compiles to:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10
ld 9,48(13)
ldx 3,9,3

With this patch it compiles to:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10

Signed-off-by: Nicholas Piggin 

To: Tejun Heo 
To: Christoph Lameter 
Cc: linux-kernel@vger.kernel.org
Cc: linux-a...@vger.kernel.org
---
 include/asm-generic/percpu.h | 53 +---
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index 4d9f233..40e8870 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -65,6 +65,11 @@ extern void setup_per_cpu_areas(void);
 #define PER_CPU_DEF_ATTRIBUTES
 #endif
 
+#define raw_cpu_generic_read(pcp)  \
+({ \
+   *raw_cpu_ptr(&(pcp));   \
+})
+
 #define raw_cpu_generic_to_op(pcp, val, op)\
 do {   \
*raw_cpu_ptr(&(pcp)) op val;\
@@ -72,34 +77,39 @@ do {
\
 
 #define raw_cpu_generic_add_return(pcp, val)   \
 ({ \
-   raw_cpu_add(pcp, val);  \
-   raw_cpu_read(pcp);  \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
+   \
+   *__p += val;\
+   *__p;   \
 })
 
 #de

Re: [PATCH] percpu: improve generic percpu modify-return implementation

2016-09-21 Thread Nicholas Piggin
On Wed, 21 Sep 2016 15:16:25 -0500 (CDT)
Christoph Lameter  wrote:

> On Wed, 21 Sep 2016, Tejun Heo wrote:
> 
> > Hello, Nick.
> >
> > How have you been? :)
> >  
> 
> He is baack. Are we getting SL!B? ;-)
> 

Hey Christoph. Sure, why not.


Re: [PATCH] percpu: improve generic percpu modify-return implementation

2016-09-21 Thread Nicholas Piggin
On Wed, 21 Sep 2016 18:51:37 +1000
Nicholas Piggin  wrote:

> Some architectures require an additional load to find the address of
> percpu pointers. In some implemenatations, the C aliasing rules do not
> allow the result of that load to be kept over the store that modifies
> the percpu variable, which causes additional loads.

Sorry I picked up an old patch here. This one should be better.

From d0cb9052d6f4c31d24f999b7b0cecb34681eee9b Mon Sep 17 00:00:00 2001
From: Nicholas Piggin 
Date: Wed, 21 Sep 2016 18:23:43 +1000
Subject: [PATCH] percpu: improve generic percpu modify-return implementations

Some architectures require an additional load to find the address of
percpu pointers. In some implemenatations, the C aliasing rules do not
allow the result of that load to be kept over the store that modifies
the percpu variable, which causes additional loads.

Work around this by finding the pointer first, then operating on that.

It's also possible to mark things as restrict and those kind of games,
but that can require larger and arch specific changes.

On powerpc, __this_cpu_inc_return compiles to:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10
ld 9,48(13)
ldx 3,9,3

With this patch it compiles to:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10

Signed-off-by: Nicholas Piggin 
---
 include/asm-generic/percpu.h | 53 +---
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index 4d9f233..40e8870 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -65,6 +65,11 @@ extern void setup_per_cpu_areas(void);
 #define PER_CPU_DEF_ATTRIBUTES
 #endif
 
+#define raw_cpu_generic_read(pcp)  \
+({ \
+   *raw_cpu_ptr(&(pcp));   \
+})
+
 #define raw_cpu_generic_to_op(pcp, val, op)\
 do {   \
*raw_cpu_ptr(&(pcp)) op val;\
@@ -72,34 +77,39 @@ do {
\
 
 #define raw_cpu_generic_add_return(pcp, val)   \
 ({ \
-   raw_cpu_add(pcp, val);  \
-   raw_cpu_read(pcp);  \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
+   \
+   *__p += val;\
+   *__p;   \
 })
 
 #define raw_cpu_generic_xchg(pcp, nval)
\
 ({ \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
typeof(pcp) __ret;  \
-   __ret = raw_cpu_read(pcp);  \
-   raw_cpu_write(pcp, nval);   \
+   __ret = *__p;   \
+   *__p = nval;\
__ret;  \
 })
 
 #define raw_cpu_generic_cmpxchg(pcp, oval, nval)   \
 ({ \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
typeof(pcp) __ret;  \
-   __ret = raw_cpu_read(pcp);  \
+   __ret = *__p;   \
if (__ret == (oval))\
-   raw_cpu_write(pcp, nval);   \
+   *__p = nval;\
__ret;  \
 })
 
 #define raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) 
\
 ({ \
+   typeof(&(pcp1)) __p1 = raw_cpu_ptr(&(pcp1));\
+   typeof(&(pcp2)) __p2 = raw_cpu_ptr(&(pcp2));\
int __ret = 0;  \
-   if (raw_cpu_read(pcp1) == (oval1) &&\
-raw_cpu_read(pcp2)  == (oval2)) {  \
-   raw_cpu_write(

[PATCH] percpu: improve generic percpu modify-return implementation

2016-09-21 Thread Nicholas Piggin
Some architectures require an additional load to find the address of
percpu pointers. In some implemenatations, the C aliasing rules do not
allow the result of that load to be kept over the store that modifies
the percpu variable, which causes additional loads.

Work around this by finding the pointer first, then operating on that.

It's also possible to mark things as restrict and those kind of games,
but that can require larger and arch specific changes.

On powerpc, __this_cpu_inc_return compiles to:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10
ld 9,48(13)
ldx 3,9,3

With this patch it saves 2 loads:

ld 10,48(13)
ldx 9,3,10
addi 9,9,1
stdx 9,3,10

Signed-off-by: Nicholas Piggin 
---
 include/asm-generic/percpu.h | 54 +---
 1 file changed, 31 insertions(+), 23 deletions(-)

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index 4d9f233..3fe18fe 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -65,6 +65,11 @@ extern void setup_per_cpu_areas(void);
 #define PER_CPU_DEF_ATTRIBUTES
 #endif
 
+#define raw_cpu_generic_read(pcp)  \
+({ \
+   *raw_cpu_ptr(&(pcp));   \
+})
+
 #define raw_cpu_generic_to_op(pcp, val, op)\
 do {   \
*raw_cpu_ptr(&(pcp)) op val;\
@@ -72,34 +77,39 @@ do {
\
 
 #define raw_cpu_generic_add_return(pcp, val)   \
 ({ \
-   raw_cpu_add(pcp, val);  \
-   raw_cpu_read(pcp);  \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
+   \
+   *__p += val;\
+   *__p;   \
 })
 
 #define raw_cpu_generic_xchg(pcp, nval)
\
 ({ \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
typeof(pcp) __ret;  \
-   __ret = raw_cpu_read(pcp);  \
-   raw_cpu_write(pcp, nval);   \
+   __ret = *__p;   \
+   *__p = nval;\
__ret;  \
 })
 
 #define raw_cpu_generic_cmpxchg(pcp, oval, nval)   \
 ({ \
+   typeof(&(pcp)) __p = raw_cpu_ptr(&(pcp));   \
typeof(pcp) __ret;  \
-   __ret = raw_cpu_read(pcp);  \
+   __ret = *__p;   \
if (__ret == (oval))\
-   raw_cpu_write(pcp, nval);   \
+   *__p = nval;\
__ret;  \
 })
 
 #define raw_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) 
\
 ({ \
+   typeof(&(pcp1)) __p1 = raw_cpu_ptr(&(pcp1));\
+   typeof(&(pcp2)) __p2 = raw_cpu_ptr(&(pcp2));\
int __ret = 0;  \
-   if (raw_cpu_read(pcp1) == (oval1) &&\
-raw_cpu_read(pcp2)  == (oval2)) {  \
-   raw_cpu_write(pcp1, nval1); \
-   raw_cpu_write(pcp2, nval2); \
+   if (*__p1 == (oval1) && *__p2  == (oval2)) {\
+   *__p1 = nval1;  \
+   *__p2 = nval2;  \
__ret = 1;  \
}   \
(__ret);   

Re: [PATCH 3/3] mm/vmalloc: correct a few logic error in __insert_vmap_area()

2016-09-19 Thread Nicholas Piggin
On Tue, 20 Sep 2016 14:02:26 +0800
zijun_hu  wrote:

> From: zijun_hu 
> 
> correct a few logic error in __insert_vmap_area() since the else if
> condition is always true and meaningless
> 
> avoid endless loop under [un]mapping improper ranges whose boundary
> are not aligned to page
> 
> correct lazy_max_pages() return value if the number of online cpus
> is power of 2
> 
> improve performance for pcpu_get_vm_areas() via optimizing vmap_areas
> overlay checking algorithm and finding near vmap_areas by list_head
> other than rbtree
> 
> simplify /proc/vmallocinfo implementation via seq_file helpers
> for list_head
> 
> Signed-off-by: zijun_hu 
> Signed-off-by: zijun_hu 

Could you submit each of these changes as a separate patch? Would you
consider using capitalisation and punctuation in the changelog?

Did you measure any performance improvements, or do you have a workload
where vmalloc shows up in profiles?


> @@ -108,6 +108,9 @@ static void vunmap_page_range(unsigned long addr, 
> unsigned long end)
>   unsigned long next;
>  
>   BUG_ON(addr >= end);
> + WARN_ON(!PAGE_ALIGNED(addr | end));

I prefer to avoid mixing bitwise and arithmetic operations unless it's
necessary. Gcc should be able to optimise

WARN_ON(!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(end))

> + addr = round_down(addr, PAGE_SIZE);

I don't know if it's really necessary to relax the API like this for
internal vmalloc.c functions. If garbage is detected here, it's likely
due to a bug, and I'm not sure that rounding it would solve the problem.

For API functions perhaps it's reasonable -- in such cases you should
consider using WARN_ON_ONCE() or similar.

Thanks,
Nick


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-15 Thread Nicholas Piggin
On Fri, 16 Sep 2016 08:33:50 +1000
Dave Chinner  wrote:

> On Thu, Sep 15, 2016 at 09:42:22PM +1000, Nicholas Piggin wrote:
> > On Thu, 15 Sep 2016 20:32:10 +1000
> > Dave Chinner  wrote:  
> > > 
> > > You still haven't described anything about what a per-block flag
> > > design is supposed to look like :/  
> > 
> > For the API, or implementation? I'm not quite sure what you mean
> > here. For implementation it's possible to carefully ensure metadata
> > is persistent when allocating blocks in page fault but before
> > mapping pages. Truncate or hole punch or such things can be made to
> > work by invalidating all such mappings and holding them off until
> > you can cope with them again. Not necessarily for a filesystem with
> > *all* capabilities of XFS -- I don't know -- but for a complete basic
> > one.  
> 
> SO, essentially, it comes down to synchrnous metadta updates again.

Yes. I guess fundamentally you can't get away from either that or
preloading at some level.

(Also I don't know that there's a sane way to handle [cm]time properly,
so some things like that -- this is just about block allocation /
avoiding fdatasync).

> but synchronous updates would be conditional on whether an extent
> metadata with the "nofsync" flag asserted was updated? Where's the
> nofsync flag kept? in memory at a generic layer, or in the
> filesystem, potentially in an on-disk structure? How would the
> application set it for a given range?

I guess that comes back to the API. Whether you want it to be persistent,
request based, etc. It could be derived type of storage blocks that are
mapped there, stored per-inode, in-memory, or on extents on disk. I'm not
advocating for a particular API and of course less complexity better.

> 
> > > > > > Filesystem will
> > > > > > invalidate all such mappings before it does buffered IOs or hole 
> > > > > > punch,
> > > > > > and will sync metadata after allocating a new block but before 
> > > > > > returning
> > > > > > from a fault.  
> > > > > 
> > > > > ... requires synchronous metadata updates from page fault context,
> > > > > which we already know is not a good solution.  I'll quote one of
> > > > > Christoph's previous replies to save me the trouble:
> > > > > 
> > > > >   "You could write all metadata synchronously from the page
> > > > >   fault handler, but that's basically asking for all kinds of
> > > > >   deadlocks."
> > > > > So, let's redirect back to the "no sync" flag you were talking about
> > > > > - can you answer the questions I asked above? It would be especially
> > > > > important to highlight how the proposed feature would avoid requiring
> > > > > synchronous metadata updates in page fault contexts
> > > > 
> > > > Right. So what deadlocks are you concerned about?
> > > 
> > > It basically puts the entire journal checkpoint path under a page
> > > fault context. i.e. a whole new global locking context problem is  
> > 
> > Yes there are potentially some new lock orderings created if you
> > do that, depending on what locks the filesystem does.  
> 
> Well, that's the whole issue.

For filesystem implementations, but perhaps not mm/vfs implemenatation
AFAIKS.

> 
> > > created as this path can now be run both inside and outside the
> > > mmap_sem. Nothing ever good comes from running filesystem locking
> > > code both inside and outside the mmap_sem.  
> > 
> > You mean that some cases journal checkpoint runs with mmap_sem
> > held, and others without mmap_sem held? Not that mmap_sem is taken
> > inside journal checkpoint.  
> 
> Maybe not, but now we open up the potential for locks held inside
> or outside mmap sem to interact with the journal locks that are now
> held inside and outside mmap_sem. See below
> 
> > Then I don't really see why that's a
> > problem. I mean performance could suffer a bit, but with fault
> > retry you can almost always do the syncing outside mmap_sem in
> > practice.
> > 
> > Yes, I'll preemptively agree with you -- We don't want to add any
> > such burden if it is not needed and well justified.
> >   
> > > FWIW, We've never executed synchronous transactions inside page
> > > faults in XFS, and I think ext4 is in the same boat - it may 

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-15 Thread Nicholas Piggin
On Thu, 15 Sep 2016 20:32:10 +1000
Dave Chinner  wrote:

> On Thu, Sep 15, 2016 at 01:49:45PM +1000, Nicholas Piggin wrote:
> > On Thu, 15 Sep 2016 12:31:33 +1000
> > Dave Chinner  wrote:
> >   
> > > On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:  
> > > > On Wed, 14 Sep 2016 17:39:02 +1000  
> > > Sure, but one first has to describe the feature desired before all  
> > 
> > The DAX people have been.  
> 
> H. the only "DAX people" I know of are kernel developers who
> have been working on implementing DAX in the kernel - Willy, Ross,
> Dan, Jan, Christoph, Kirill, myelf and a few others around the
> fringes.
> 
> > They want to be able to get mappings
> > that can be synced without doing fsync.  
> 
> Oh, ok, the Intel Userspace PMEM library requirement. I though you
> had something more that this - whatever extra problem the per-block
> no fsync flag would solve?

Only the PMEM really. I don't want to add more complexity than
required.

> > The *exact* extent of
> > those capabilities and what the API exactly looks like is up for
> > discussion.  
> 
> Yup.
> 
> > Well you said it was impossible already and Christoph told them
> > they were smoking crack :)  
> 
> I have not said that. I have said bad things about bad
> proposals and called the PMEM library model broken, but I most
> definitely have not said that solving the problem is impossible.
>
> > > That's not an answer to the questions I asked about about the "no
> > > sync" flag you were proposing. You've redirected to the a different
> > > solution, one that   
> > 
> > No sync flag would do the same thing exactly in terms of consistency.
> > It would just do the no-sync sequence by default rather than being
> > asked for it. More of an API detail than implementation.  
> 
> You still haven't described anything about what a per-block flag
> design is supposed to look like :/

For the API, or implementation? I'm not quite sure what you mean
here. For implementation it's possible to carefully ensure metadata
is persistent when allocating blocks in page fault but before
mapping pages. Truncate or hole punch or such things can be made to
work by invalidating all such mappings and holding them off until
you can cope with them again. Not necessarily for a filesystem with
*all* capabilities of XFS -- I don't know -- but for a complete basic
one.

> > > > Filesystem will
> > > > invalidate all such mappings before it does buffered IOs or hole punch,
> > > > and will sync metadata after allocating a new block but before returning
> > > > from a fault.
> > > 
> > > ... requires synchronous metadata updates from page fault context,
> > > which we already know is not a good solution.  I'll quote one of
> > > Christoph's previous replies to save me the trouble:
> > > 
> > >   "You could write all metadata synchronously from the page
> > >   fault handler, but that's basically asking for all kinds of
> > >   deadlocks."
> > > So, let's redirect back to the "no sync" flag you were talking about
> > > - can you answer the questions I asked above? It would be especially
> > > important to highlight how the proposed feature would avoid requiring
> > > synchronous metadata updates in page fault contexts  
> > 
> > Right. So what deadlocks are you concerned about?  
> 
> It basically puts the entire journal checkpoint path under a page
> fault context. i.e. a whole new global locking context problem is

Yes there are potentially some new lock orderings created if you
do that, depending on what locks the filesystem does.

> created as this path can now be run both inside and outside the
> mmap_sem. Nothing ever good comes from running filesystem locking
> code both inside and outside the mmap_sem.

You mean that some cases journal checkpoint runs with mmap_sem
held, and others without mmap_sem held? Not that mmap_sem is taken
inside journal checkpoint. Then I don't really see why that's a
problem. I mean performance could suffer a bit, but with fault
retry you can almost always do the syncing outside mmap_sem in
practice.

Yes, I'll preemptively agree with you -- We don't want to add any
such burden if it is not needed and well justified.

> FWIW, We've never executed synchronous transactions inside page
> faults in XFS, and I think ext4 is in the same boat - it may be even
> worse because of the way it does ordered data dispatch through the
> journal. I don't really even want to think about

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-14 Thread Nicholas Piggin
On Thu, 15 Sep 2016 12:31:33 +1000
Dave Chinner  wrote:

> On Wed, Sep 14, 2016 at 08:19:36PM +1000, Nicholas Piggin wrote:
> > On Wed, 14 Sep 2016 17:39:02 +1000
> > Dave Chinner  wrote:  
> > > Ok, looking back over your example, you seem to be suggesting a new
> > > page fault behaviour is required from filesystems that has not been
> > > described or explained, and that behaviour is triggered
> > > (persistently) somehow from userspace. You've also suggested
> > > filesystems store a persistent per-block "no fsync" flag
> > > in their extent map as part of the implementation. Right?  
> > 
> > This is what we're talking about. Of course a filesystem can't just
> > start supporting the feature without any changes.  
> 
> Sure, but one first has to describe the feature desired before all

The DAX people have been. They want to be able to get mappings
that can be synced without doing fsync. The *exact* extent of
those capabilities and what the API exactly looks like is up for
discussion.


> parties can discuss it. We need more than vague references and
> allusions from you to define the solution you are proposing.
> 
> Once everyone understands what is being describing, we might be able
> to work out how it can be implemented in a simple, generic manner
> rather than require every filesystem to change their on-disk
> formats. IOWs, we need you to describe /details/ of semantics,
> behaviour and data integrity constraints that are required, not
> describe an implementation of something we have no knwoledge about.

Well you said it was impossible already and Christoph told them
they were smoking crack :)

Anyway, there's a few questions. Implementation and API. Some
filesystems may never cope with it. Of course it should be as
generic as possible though.


> > > Reading between the lines, I'm guessing that the "no fsync" flag has
> > > very specific update semantics, constraints and requirements.  Can
> > > you outline how you expect this flag to be set and updated, how it's
> > > used consistently between different applications (e.g. cp of a file
> > > vs the app using the file), behavioural constraints it implies for
> > > page faults vs non-mmap access to the data in the block, how
> > > you'd expect filesystems to deal with things like a hole punch
> > > landing in the middle of an extent marked with "no fsync", etc?  
> > 
> > Well that's what's being discussed.  An approach close to what I did is
> > to allow the app request a "no sync" type of mmap.  
> 
> That's not an answer to the questions I asked about about the "no
> sync" flag you were proposing. You've redirected to the a different
> solution, one that 

No sync flag would do the same thing exactly in terms of consistency.
It would just do the no-sync sequence by default rather than being
asked for it. More of an API detail than implementation.

> 
> > Filesystem will
> > invalidate all such mappings before it does buffered IOs or hole punch,
> > and will sync metadata after allocating a new block but before returning
> > from a fault.  
> 
> ... requires synchronous metadata updates from page fault context,
> which we already know is not a good solution.  I'll quote one of
> Christoph's previous replies to save me the trouble:
> 
>   "You could write all metadata synchronously from the page
>   fault handler, but that's basically asking for all kinds of
>   deadlocks."
> So, let's redirect back to the "no sync" flag you were talking about
> - can you answer the questions I asked above? It would be especially
> important to highlight how the proposed feature would avoid requiring
> synchronous metadata updates in page fault contexts

Right. So what deadlocks are you concerned about?

There could be a scale of capabilities here, for different filesystems
that do things differently. 

Some filesystems could require fsync for metadata, but allow fdatasync
to be skipped. Users would need to have some knowledge of block size
or do preallocation and sync.

That might put more burden on libraries/applications if there are
concurrent operations, but that might be something they can deal with
-- fdatasync already requires some knowledge of concurrent operations
(or lack thereof).


> > > [snip]
> > >   
> > > > If there is any huge complexity or unsolved problem, it is in XFS.
> > > > Conceptual problem is simple.
> > > 
> > > Play nice and be constructive, please?  
> > 
> > So you agree that the persistent memory people who have

Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-14 Thread Nicholas Piggin
On Wed, 14 Sep 2016 17:39:02 +1000
Dave Chinner  wrote:

> On Tue, Sep 13, 2016 at 11:53:11AM +1000, Nicholas Piggin wrote:
> > On Tue, 13 Sep 2016 07:34:36 +1000
> > Dave Chinner  wrote:
> > But let me understand your example in the absence of that.
> > 
> > - Application mmaps a file, faults in block 0
> > - FS allocates block, creates mappings, syncs metadata, sets "no fsync"
> >   flag for that block, and completes the fault.
> > - Application writes some data to block 0, completes userspace flushes
> > 
> > * At this point, a crash must return with above data (or newer).
> > 
> > - Application starts writing more stuff into block 0
> > - Concurrently, fault in block 1
> > - FS starts to allocate, splits trees including mappings to block 0
> > 
> > * Crash
> > 
> > Is that right?  
> 
> No.
> 
> - app write faults block 0, fs allocates
> < time passes while app does stuff to block 0 mapping >
> - fs syncs journal, block 0 metadata now persistent
> < time passes while app does stuff to block 0 mapping >
> - app structure grows, faults block 1, fs allocates
> - app adds pointers to data in block 1 from block 0, does
>   userspace pmem data sync.
> 
> *crash*
> 
> > How does your filesystem lose data before the sync
> > point?  
> 
> After recovery, file has a data in block 0, but no block 1 because
> the allocation transaction for block 1 was not flushed to the
> journal. Data in block 0 points to data in block 1, but block 1 does
> not exist. IOWs, the application has corrupt data because it never
> issued a data synchronisation request to the filesystem
> 
> 
> 
> Ok, looking back over your example, you seem to be suggesting a new
> page fault behaviour is required from filesystems that has not been
> described or explained, and that behaviour is triggered
> (persistently) somehow from userspace. You've also suggested
> filesystems store a persistent per-block "no fsync" flag
> in their extent map as part of the implementation. Right?

This is what we're talking about. Of course a filesystem can't just
start supporting the feature without any changes.


> Reading between the lines, I'm guessing that the "no fsync" flag has
> very specific update semantics, constraints and requirements.  Can
> you outline how you expect this flag to be set and updated, how it's
> used consistently between different applications (e.g. cp of a file
> vs the app using the file), behavioural constraints it implies for
> page faults vs non-mmap access to the data in the block, how
> you'd expect filesystems to deal with things like a hole punch
> landing in the middle of an extent marked with "no fsync", etc?

Well that's what's being discussed. An approach close to what I did is
to allow the app request a "no sync" type of mmap. Filesystem will
invalidate all such mappings before it does buffered IOs or hole punch,
and will sync metadata after allocating a new block but before returning
from a fault.

The app could query rather than request, but I found request seemed to
work better. The filesystem might be working with apps that don't use
the feature for example, and doesn't want to flush just in case any one
ever queried in the past.


> [snip]
> 
> > If there is any huge complexity or unsolved problem, it is in XFS.
> > Conceptual problem is simple.  
> 
> Play nice and be constructive, please?

So you agree that the persistent memory people who have come with some
requirements and ideas for an API should not be immediately shut down
with bogus handwaving.

Thanks,
Nick


Re: Build failure in -next due to 'kbuild: allow archs to select link dead code/data elimination'

2016-09-13 Thread Nicholas Piggin
On Tue, 13 Sep 2016 13:25:22 -0700
Guenter Roeck  wrote:

> On 09/12/2016 08:51 PM, Nicholas Piggin wrote:
> > On Mon, 12 Sep 2016 20:17:30 -0700
> > Guenter Roeck  wrote:
> >  
> >> Hi Nicholas,
> >>
> >> On 09/12/2016 07:00 PM, Nicholas Piggin wrote:  
> >>> On Mon, 12 Sep 2016 15:24:43 -0700
> >>> Guenter Roeck  wrote:
> >>>  
> >>>> Hi,
> >>>>
> >>>> your commit 'kbuild: allow archs to select link dead code/data 
> >>>> elimination'
> >>>> is causing the following build failure in -next when building 
> >>>> score:defconfig.
> >>>>
> >>>> arch/score/kernel/built-in.o: In function `work_resched':
> >>>> arch/score/kernel/entry.o:(.text+0xe84):
> >>>>  relocation truncated to fit: R_SCORE_PC19 against `schedule'
> >>>>
> >>>> Reverting the commit fixes the problem.
> >>>>
> >>>> Please let me know if I can help tracking down the problem.
> >>>> In case you need a score toochain, you can find the one I use at
> >>>> http://server.roeck-us.net/toolchains/score.tgz.
> >>>>
> >>>> Thanks,
> >>>> Guenter  
> >>>
> >>> It's not supposed to have any real effect unless the
> >>> option is selected, but there are a few changes to linker
> >>> script which must be causing it.
> >>>
> >>> There are two changes to vmlinux.lds.h. One is to KEEP a
> >>> few input sections, that *should* be kept anyway. The other
> >>> is to bring in additional sections into their correct output
> >>> section.
> >>>
> >>> Could you try reverting those lines of vmlinux.lds.h that
> >>> change the latter, i.e.:
> >>>
> >>> - *(.text.hot .text .text.fixup .text.unlikely)   \
> >>> + *(.text.hot .text .text.fixup .text.unlikely .text.*)   \  
> >>
> >> This is the culprit. After removing " .text.*" it builds fine.  
> >
> > Thanks for testing it. Some architectures have .text.x sections
> > not included here, I should have seen that. We should possibly
> > just revert that line and require implementing archs to do the
> > right thing.
> >  
> I would call the build failure a regression, so it should either be
> reverted, or we'll need some other solution to fix the build failure.

Yes it's definitely a regression and that part should be reverted.
To confirm, the following patch fixes your build?

Thanks,
Nick

commit 0ae28be83b4d6cd03ef5b481487d042f2b91954e
Author: Nicholas Piggin 
Date:   Wed Sep 14 12:24:03 2016 +1000

kbuild: -ffunction-sections fix for archs with conflicting sections

Enabling -ffunction-sections modified the generic linker script to
pull .text.* sections into regular TEXT_TEXT section, conflicting
with some architectures. Revert that change and require archs that
enable the option to ensure they have no conflicting section names,
and do the appropriate merging.

Signed-off-by: Nicholas Piggin 

diff --git a/arch/Kconfig b/arch/Kconfig
index 6d5b631..8248037 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -487,7 +487,9 @@ config LD_DEAD_CODE_DATA_ELIMINATION
  This requires that the arch annotates or otherwise protects
  its external entry points from being discarded. Linker scripts
  must also merge .text.*, .data.*, and .bss.* correctly into
- output sections.
+ output sections. Care must be taken not to pull in unrelated
+ sections (e.g., '.text.init'). Typically '.' in section names
+ is used to distinguish them from label names / C identifiers.
 
 config HAVE_CONTEXT_TRACKING
bool
diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index ad9d8f9..48dd44f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -198,9 +198,9 @@
 
 /*
  * .data section
- * -fdata-sections generates .data.identifier which needs to be pulled in
- * with .data, but don't want to pull in .data..stuff which has its own
- * requirements. Same for bss.
+ * LD_DEAD_CODE_DATA_ELIMINATION option enables -fdata-sections generates
+ * .data.identifier which needs to be pulled in with .data, but don't want to
+ * pull in .data..stuff which has its own requirements. Same for bss.
  */
 #define DATA_DATA  \
*(.data .data.[0-9a-zA-Z_]*)\
@@ -434,10 +434,15 @@
}
 
 /* .text 

Re: linux-next: manual merge of the kbuild tree with Linus' tree

2016-09-13 Thread Nicholas Piggin
On Tue, 13 Sep 2016 09:48:03 +0200
Arnd Bergmann  wrote:

> On Tuesday, September 13, 2016 2:02:57 PM CEST Stephen Rothwell wrote:
> > [For the new cc's, we are discussing the "thin archives" and "link dead
> > code/data elimination" patches in the kbuild tree.]
> > 
> > On Tue, 13 Sep 2016 09:39:45 +1000 Stephen Rothwell  
> > wrote:  
> > >
> > > On Mon, 12 Sep 2016 11:03:08 +0200 Michal Marek  wrote:  
> > > >
> > > > On 2016-09-12 04:53, Nicholas Piggin wrote:
> > > > > Question, what is the best way to merge dependent patches? Considering
> > > > > they will need a good amount of architecture testing, I think they 
> > > > > will
> > > > > have to go via arch trees. But it also does not make sense to merge 
> > > > > these
> > > > > kbuild changes upstream first, without having tested them.  
> > > > 
> > > > I think it makes sense to merge the kbuild changes via kbuild.git, even
> > > > if they are unused and untested. Any follow-up fixes required to enable
> > > > the first architecture can go through the respective architecture tree.
> > > > Does that sound OK?
> > > 
> > > And if you guarantee not to rebase the kbuild tree (or at least the
> > > subset containing these patches), then each of the architecture trees
> > > can just merge your tree (or a tag?) and then implement any necessary
> > > arch dependent changes.  I fixes are necessary, they can also be merged
> > > into the architecture trees.  
> > 
> > Except, of course, the kbuild tree still has the asm EXPORT_SYMBOL
> > patches that produce warnings on PowerPC  (And I am still reverting
> > the PowerPC specific one of those patches).  
> 
> Is that really powerpc specific? I have the same problem on ARM
> and I don't see how any architecture would not have it.
> 
> I prototyped the patch below, which fixes it for me, but I have
> not dared submit that workaround because it's butt ugly.

No it's not powerpc specific, it's just that powerpc build dies
if there are unresolved relocations.

Interesting approach. I have something different that may rival
yours for ugliness, but maybe keeps the muck a bit more contained.
I was just about to submit it, but now I'll wait to see if there is
a preference between the approaches:

(Note this patch alone does not resolve all export symbols, each
arch next needs to add C prototypes for their .S exports)

 scripts/Makefile.build | 71 +-
 1 file changed, 65 insertions(+), 6 deletions(-)

diff --git a/scripts/Makefile.build b/scripts/Makefile.build
index 11602e5..1e89908 100644
--- a/scripts/Makefile.build
+++ b/scripts/Makefile.build
@@ -158,7 +158,8 @@ cmd_cpp_i_c   = $(CPP) $(c_flags) -o $@ $<
 $(obj)/%.i: $(src)/%.c FORCE
$(call if_changed_dep,cpp_i_c)
 
-cmd_gensymtypes =   \
+# These mirror gensymtypes_S and co below, keep them in synch.
+cmd_gensymtypes_c = \
 $(CPP) -D__GENKSYMS__ $(c_flags) $< |   \
 $(GENKSYMS) $(if $(1), -T $(2)) \
  $(patsubst y,-s _,$(CONFIG_HAVE_UNDERSCORE_SYMBOL_PREFIX)) \
@@ -168,7 +169,7 @@ cmd_gensymtypes =   
\
 quiet_cmd_cc_symtypes_c = SYM $(quiet_modtag) $@
 cmd_cc_symtypes_c = \
 set -e; \
-$(call cmd_gensymtypes,true,$@) >/dev/null; \
+$(call cmd_gensymtypes_c,true,$@) >/dev/null;   \
 test -s $@ || rm -f $@
 
 $(obj)/%.symtypes : $(src)/%.c FORCE
@@ -197,9 +198,10 @@ else
 #   the actual value of the checksum generated by genksyms
 
 cmd_cc_o_c = $(CC) $(c_flags) -c -o $(@D)/.tmp_$(@F) $<
-cmd_modversions =  
\
+
+cmd_modversions_c =
\
if $(OBJDUMP) -h $(@D)/.tmp_$(@F) | grep -q __ksymtab; then 
\
-   $(call cmd_gensymtypes,$(KBUILD_SYMTYPES),$(@:.o=.symtypes))
\
+   $(call cmd_gensymtypes_c,$(KBUILD_SYMTYPES),$(@:.o=.symtypes))  
\
> $(@D)/.tmp_$(@F:.o=.ver); 
\

\
$(LD) $(LDFLAGS) -r -o $@ $(@D)/.tmp_$(@F)  
\
@@ -267,13

<    5   6   7   8   9   10   11   >