Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-09 Thread Johannes Weiner
On Mon, Jan 09, 2017 at 09:30:05PM +0100, Jan Kara wrote:
> On Sat 07-01-17 21:02:00, Johannes Weiner wrote:
> > On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> > > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner 
> > > > > > > >  wrote:
> > > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > > workload again and about a minute in this fired:
> > > > > > > > >
> > > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at 
> > > > > > > > > mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > > > 
> > > > > > > > Well, part of the changes during the merge window were the 
> > > > > > > > shadow
> > > > > > > > entry tracking changes that came in through Andrew's tree. 
> > > > > > > > Adding
> > > > > > > > Johannes Weiner to the participants.
> > 
> > Okay, the below patch should address this problem. Dave Jones managed
> > to reproduce it with the added WARN_ONs, and they made it obvious. He
> > cannot trigger it anymore with this fix applied. Thanks Dave!
> 
> FWIW the patch looks good to me. I'd just note that the need to pass the
> callback to deletion function and the fact that we do it only in cases
> where we think it is needed appears errorprone. With the warning you've
> added it should at least catch the cases where we got it wrong but more
> robust would be if the radix tree root contained a pointer to the callback
> function so that we would not rely on passing the callback to every place
> which can possibly free a node. Also conceptually this would make more
> sense to me since the fact that we may need to do some cleanup on node
> deletion is a property of the particular radix tree and how we use it.
> OTOH that would mean growing radix tree root by one pointer which would be
> unpopular I guess.

The last sentence is the crux, unfortunately. The first iteration of
the shadow shrinker linked up mappings that contained shadow entries,
rather than nodes. The code would have been drastically simpler in
pretty much all regards, without changes to the radix tree API. But
Dave Chinner objected to adding a pointer to the inode when we could
stick them into empty space (slab fragmentation) inside the nodes. A
fair point, I guess, especially when you consider metadata-intense
workloads. But now we're kind of stuck with the complexity of this.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-09 Thread Johannes Weiner
On Mon, Jan 09, 2017 at 09:30:05PM +0100, Jan Kara wrote:
> On Sat 07-01-17 21:02:00, Johannes Weiner wrote:
> > On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> > > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner 
> > > > > > > >  wrote:
> > > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > > workload again and about a minute in this fired:
> > > > > > > > >
> > > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at 
> > > > > > > > > mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > > > 
> > > > > > > > Well, part of the changes during the merge window were the 
> > > > > > > > shadow
> > > > > > > > entry tracking changes that came in through Andrew's tree. 
> > > > > > > > Adding
> > > > > > > > Johannes Weiner to the participants.
> > 
> > Okay, the below patch should address this problem. Dave Jones managed
> > to reproduce it with the added WARN_ONs, and they made it obvious. He
> > cannot trigger it anymore with this fix applied. Thanks Dave!
> 
> FWIW the patch looks good to me. I'd just note that the need to pass the
> callback to deletion function and the fact that we do it only in cases
> where we think it is needed appears errorprone. With the warning you've
> added it should at least catch the cases where we got it wrong but more
> robust would be if the radix tree root contained a pointer to the callback
> function so that we would not rely on passing the callback to every place
> which can possibly free a node. Also conceptually this would make more
> sense to me since the fact that we may need to do some cleanup on node
> deletion is a property of the particular radix tree and how we use it.
> OTOH that would mean growing radix tree root by one pointer which would be
> unpopular I guess.

The last sentence is the crux, unfortunately. The first iteration of
the shadow shrinker linked up mappings that contained shadow entries,
rather than nodes. The code would have been drastically simpler in
pretty much all regards, without changes to the radix tree API. But
Dave Chinner objected to adding a pointer to the inode when we could
stick them into empty space (slab fragmentation) inside the nodes. A
fair point, I guess, especially when you consider metadata-intense
workloads. But now we're kind of stuck with the complexity of this.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-09 Thread Jan Kara
On Sat 07-01-17 21:02:00, Johannes Weiner wrote:
> On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner 
> > > > > > >  wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at 
> > > > > > > > mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> 
> Okay, the below patch should address this problem. Dave Jones managed
> to reproduce it with the added WARN_ONs, and they made it obvious. He
> cannot trigger it anymore with this fix applied. Thanks Dave!

FWIW the patch looks good to me. I'd just note that the need to pass the
callback to deletion function and the fact that we do it only in cases
where we think it is needed appears errorprone. With the warning you've
added it should at least catch the cases where we got it wrong but more
robust would be if the radix tree root contained a pointer to the callback
function so that we would not rely on passing the callback to every place
which can possibly free a node. Also conceptually this would make more
sense to me since the fact that we may need to do some cleanup on node
deletion is a property of the particular radix tree and how we use it.
OTOH that would mean growing radix tree root by one pointer which would be
unpopular I guess.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-09 Thread Jan Kara
On Sat 07-01-17 21:02:00, Johannes Weiner wrote:
> On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner 
> > > > > > >  wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at 
> > > > > > > > mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> 
> Okay, the below patch should address this problem. Dave Jones managed
> to reproduce it with the added WARN_ONs, and they made it obvious. He
> cannot trigger it anymore with this fix applied. Thanks Dave!

FWIW the patch looks good to me. I'd just note that the need to pass the
callback to deletion function and the fact that we do it only in cases
where we think it is needed appears errorprone. With the warning you've
added it should at least catch the cases where we got it wrong but more
robust would be if the radix tree root contained a pointer to the callback
function so that we would not rely on passing the callback to every place
which can possibly free a node. Also conceptually this would make more
sense to me since the fact that we may need to do some cleanup on node
deletion is a property of the particular radix tree and how we use it.
OTOH that would mean growing radix tree root by one pointer which would be
unpopular I guess.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-07 Thread Linus Torvalds
On Sat, Jan 7, 2017 at 6:02 PM, Johannes Weiner  wrote:
>
> Linus? Andrew?

Looks fine to me. Will apply.

   Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-07 Thread Linus Torvalds
On Sat, Jan 7, 2017 at 6:02 PM, Johannes Weiner  wrote:
>
> Linus? Andrew?

Looks fine to me. Will apply.

   Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-07 Thread Johannes Weiner
On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  
> > > > > > wrote:
> > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > workload again and about a minute in this fired:
> > > > > > >
> > > > > > > [628867.607417] [ cut here ]
> > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > 
> > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > Johannes Weiner to the participants.

Okay, the below patch should address this problem. Dave Jones managed
to reproduce it with the added WARN_ONs, and they made it obvious. He
cannot trigger it anymore with this fix applied. Thanks Dave!

Linus? Andrew?

---

>From 503eeb20e68bdf3529bdc14aca1ce564880129f2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Fri, 6 Jan 2017 19:21:43 -0500
Subject: [PATCH] mm: workingset: fix use-after-free in shadow node shrinker

Several people report seeing warnings about inconsistent radix tree
nodes followed by crashes in the workingset code, which all looked
like use-after-free access from the shadow node shrinker. Dave Jones
managed to reproduce the issue with a debug patch applied, which
confirmed that the radix tree shrinking indeed frees shadow nodes
while they are still linked to the shadow LRU:

  WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
  CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
  Call Trace:
   dump_stack+0x4f/0x73
   __warn+0xcb/0xf0
   warn_slowpath_null+0x1d/0x20
   delete_node+0x1e4/0x200
   __radix_tree_delete_node+0xd/0x10
   shadow_lru_isolate+0xe6/0x220
   __list_lru_walk_one.isra.4+0x9b/0x190
   ? memcg_drain_all_list_lrus+0x1d0/0x1d0
   list_lru_walk_one+0x23/0x30
   scan_shadow_nodes+0x2e/0x40
   shrink_slab.part.44+0x23d/0x5d0
   ? 0xa023a077
   shrink_node+0x22c/0x330
   kswapd+0x392/0x8f0

This is the WARN_ON_ONCE(!list_empty(>private_list)) placed in
the inlined radix_tree_shrink().

The problem is with 14b468791fa9 ("mm: workingset: move shadow entry
tracking to radix tree exceptional tracking"), which passes an update
callback into the radix tree to link and unlink shadow leaf nodes when
tree entries change, but forgot to pass the callback when reclaiming a
shadow node. While the reclaimed shadow node itself is unlinked by the
shrinker, its deletion from the tree can cause the left-most leaf node
in the tree to be shrunk. If that happens to be a shadow node as well,
we don't unlink it from the LRU as we should.

Consider this tree, where the s are shadow entries:

 root->rnode
  |
 [0   n]
  |   |
   [s] [s]

Now the shadow node shrinker reclaims the rightmost leaf node through
the shadow node LRU:

 root->rnode
  |
 [0]
  |
  [s ]

Because the parent of the deleted node is the first level below the
root and has only one child in the left-most slot, the intermediate
level is shrunk and the node containing the single shadow is put in
its place:

 root->rnode
  |
 [s]

The shrinker again sees a single left-most slot in a first level node
and thus decides to store the shadow in root->rnode directly and free
the node - which is a leaf node on the shadow node LRU.

root->rnode
 |
 s

Without the update callback, the freed node remains on the shadow LRU,
where it causes later shrinker runs to crash.

Pass the node updater callback into __radix_tree_delete_node() in case
the deletion causes the left-most branch in the tree to collapse too.

Also add warnings when linked nodes are freed right away, rather than
wait for the use-after-free when the list is scanned much later.

Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree 
exceptional tracking")
Reported-by: Dave Chinner 
Reported-by: Hugh Dickins 
Reported-by: Andrea Arcangeli 
Reported-by: Dave Jones 
Tested-by: Dave Jones 
Signed-off-by: Johannes Weiner 
---
 include/linux/radix-tree.h |  4 +++-
 lib/radix-tree.c   | 11 +--
 mm/workingset.c|  3 ++-
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 5dea8f6440e4..52bda854593b 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-07 Thread Johannes Weiner
On Tue, Jan 03, 2017 at 01:28:25PM +0100, Jan Kara wrote:
> On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  
> > > > > > wrote:
> > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > workload again and about a minute in this fired:
> > > > > > >
> > > > > > > [628867.607417] [ cut here ]
> > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > 
> > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > Johannes Weiner to the participants.

Okay, the below patch should address this problem. Dave Jones managed
to reproduce it with the added WARN_ONs, and they made it obvious. He
cannot trigger it anymore with this fix applied. Thanks Dave!

Linus? Andrew?

---

>From 503eeb20e68bdf3529bdc14aca1ce564880129f2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Fri, 6 Jan 2017 19:21:43 -0500
Subject: [PATCH] mm: workingset: fix use-after-free in shadow node shrinker

Several people report seeing warnings about inconsistent radix tree
nodes followed by crashes in the workingset code, which all looked
like use-after-free access from the shadow node shrinker. Dave Jones
managed to reproduce the issue with a debug patch applied, which
confirmed that the radix tree shrinking indeed frees shadow nodes
while they are still linked to the shadow LRU:

  WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
  CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
  Call Trace:
   dump_stack+0x4f/0x73
   __warn+0xcb/0xf0
   warn_slowpath_null+0x1d/0x20
   delete_node+0x1e4/0x200
   __radix_tree_delete_node+0xd/0x10
   shadow_lru_isolate+0xe6/0x220
   __list_lru_walk_one.isra.4+0x9b/0x190
   ? memcg_drain_all_list_lrus+0x1d0/0x1d0
   list_lru_walk_one+0x23/0x30
   scan_shadow_nodes+0x2e/0x40
   shrink_slab.part.44+0x23d/0x5d0
   ? 0xa023a077
   shrink_node+0x22c/0x330
   kswapd+0x392/0x8f0

This is the WARN_ON_ONCE(!list_empty(>private_list)) placed in
the inlined radix_tree_shrink().

The problem is with 14b468791fa9 ("mm: workingset: move shadow entry
tracking to radix tree exceptional tracking"), which passes an update
callback into the radix tree to link and unlink shadow leaf nodes when
tree entries change, but forgot to pass the callback when reclaiming a
shadow node. While the reclaimed shadow node itself is unlinked by the
shrinker, its deletion from the tree can cause the left-most leaf node
in the tree to be shrunk. If that happens to be a shadow node as well,
we don't unlink it from the LRU as we should.

Consider this tree, where the s are shadow entries:

 root->rnode
  |
 [0   n]
  |   |
   [s] [s]

Now the shadow node shrinker reclaims the rightmost leaf node through
the shadow node LRU:

 root->rnode
  |
 [0]
  |
  [s ]

Because the parent of the deleted node is the first level below the
root and has only one child in the left-most slot, the intermediate
level is shrunk and the node containing the single shadow is put in
its place:

 root->rnode
  |
 [s]

The shrinker again sees a single left-most slot in a first level node
and thus decides to store the shadow in root->rnode directly and free
the node - which is a leaf node on the shadow node LRU.

root->rnode
 |
 s

Without the update callback, the freed node remains on the shadow LRU,
where it causes later shrinker runs to crash.

Pass the node updater callback into __radix_tree_delete_node() in case
the deletion causes the left-most branch in the tree to collapse too.

Also add warnings when linked nodes are freed right away, rather than
wait for the use-after-free when the list is scanned much later.

Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree 
exceptional tracking")
Reported-by: Dave Chinner 
Reported-by: Hugh Dickins 
Reported-by: Andrea Arcangeli 
Reported-by: Dave Jones 
Tested-by: Dave Jones 
Signed-off-by: Johannes Weiner 
---
 include/linux/radix-tree.h |  4 +++-
 lib/radix-tree.c   | 11 +--
 mm/workingset.c|  3 ++-
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 5dea8f6440e4..52bda854593b 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -306,7 +306,9 @@ void radix_tree_iter_replace(struct radix_tree_root *,
 void radix_tree_replace_slot(struct radix_tree_root *root,
 void 

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Laurence Oberman


- Original Message -
> From: "Laurence Oberman" <lober...@redhat.com>
> To: "Jan Kara" <j...@suse.cz>
> Cc: "Johannes Weiner" <han...@cmpxchg.org>, "Hugh Dickins" 
> <hu...@google.com>, "Linus Torvalds"
> <torva...@linux-foundation.org>, "Dave Chinner" <da...@fromorbit.com>, "Chris 
> Leech" <cle...@redhat.com>, "Linux
> Kernel Mailing List" <linux-kernel@vger.kernel.org>, "Lee Duncan" 
> <ldun...@suse.com>, open-is...@googlegroups.com,
> "Linux SCSI List" <linux-s...@vger.kernel.org>, linux-bl...@vger.kernel.org, 
> "Christoph Hellwig" <h...@lst.de>,
> "Andrea Arcangeli" <aarca...@redhat.com>
> Sent: Wednesday, January 4, 2017 10:26:09 AM
> Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at 
> iscsi_tcp_segment_done+0x20d/0x2e0
> 
> 
> 
> - Original Message -
> > From: "Jan Kara" <j...@suse.cz>
> > To: "Johannes Weiner" <han...@cmpxchg.org>
> > Cc: "Hugh Dickins" <hu...@google.com>, "Linus Torvalds"
> > <torva...@linux-foundation.org>, "Dave Chinner"
> > <da...@fromorbit.com>, "Chris Leech" <cle...@redhat.com>, "Linux Kernel
> > Mailing List"
> > <linux-kernel@vger.kernel.org>, "Lee Duncan" <ldun...@suse.com>,
> > open-is...@googlegroups.com, "Linux SCSI List"
> > <linux-s...@vger.kernel.org>, linux-bl...@vger.kernel.org, "Christoph
> > Hellwig" <h...@lst.de>, "Jan Kara"
> > <j...@suse.cz>, "Andrea Arcangeli" <aarca...@redhat.com>
> > Sent: Tuesday, January 3, 2017 7:28:25 AM
> > Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at
> > iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner
> > > > > > > <da...@fromorbit.com>
> > > > > > > wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at
> > > > > > > > mm/workingset.c:461
> > > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> > > > > > > 
> > > > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > > > affecting the working set code.
> > > > > > > 
> > > > > > > Well, I suspect that anything that creates memory pressure will
> > > > > > > end
> > > > > > > up
> > > > > > > triggering the working set code, so ..
> > > > > > > 
> > > > > > > That said, obviously memory corruption could be involved and
> > > > > > > result
> > > > > > > in
> > > > > > > random issues too, but I wouldn't really expect that in this
> > > > > > > code.
> > > > > > > 
> > > > > > > It would probably be really useful to get more data points - is
> > > > > > > the
> > > > > > > problem reliably in this area, or is it going to be random and
> > > > > > > all
> > > > > > > over the place.
> > > > > > 
> > > > > > Data point: kswapd got WARNING on mm/workingset.c:457 in
> > > > > > shadow_lru_isolate,
> > > > > > soon fo

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Laurence Oberman


- Original Message -
> From: "Laurence Oberman" 
> To: "Jan Kara" 
> Cc: "Johannes Weiner" , "Hugh Dickins" 
> , "Linus Torvalds"
> , "Dave Chinner" , "Chris 
> Leech" , "Linux
> Kernel Mailing List" , "Lee Duncan" 
> , open-is...@googlegroups.com,
> "Linux SCSI List" , linux-bl...@vger.kernel.org, 
> "Christoph Hellwig" ,
> "Andrea Arcangeli" 
> Sent: Wednesday, January 4, 2017 10:26:09 AM
> Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at 
> iscsi_tcp_segment_done+0x20d/0x2e0
> 
> 
> 
> - Original Message -
> > From: "Jan Kara" 
> > To: "Johannes Weiner" 
> > Cc: "Hugh Dickins" , "Linus Torvalds"
> > , "Dave Chinner"
> > , "Chris Leech" , "Linux Kernel
> > Mailing List"
> > , "Lee Duncan" ,
> > open-is...@googlegroups.com, "Linux SCSI List"
> > , linux-bl...@vger.kernel.org, "Christoph
> > Hellwig" , "Jan Kara"
> > , "Andrea Arcangeli" 
> > Sent: Tuesday, January 3, 2017 7:28:25 AM
> > Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at
> > iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner
> > > > > > > 
> > > > > > > wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at
> > > > > > > > mm/workingset.c:461
> > > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> > > > > > > 
> > > > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > > > affecting the working set code.
> > > > > > > 
> > > > > > > Well, I suspect that anything that creates memory pressure will
> > > > > > > end
> > > > > > > up
> > > > > > > triggering the working set code, so ..
> > > > > > > 
> > > > > > > That said, obviously memory corruption could be involved and
> > > > > > > result
> > > > > > > in
> > > > > > > random issues too, but I wouldn't really expect that in this
> > > > > > > code.
> > > > > > > 
> > > > > > > It would probably be really useful to get more data points - is
> > > > > > > the
> > > > > > > problem reliably in this area, or is it going to be random and
> > > > > > > all
> > > > > > > over the place.
> > > > > > 
> > > > > > Data point: kswapd got WARNING on mm/workingset.c:457 in
> > > > > > shadow_lru_isolate,
> > > > > > soon followed by NULL pointer deref in list_lru_isolate, one time
> > > > > > when
> > > > > > I tried out Sunday's git tree.  Not seen since, I haven't had time
> > > > > > to
> > > > > > investigate, just set it aside as something to worry about if it
> > > > > > happens
> > > > > > again.  But it looks like shadow_lru_isolate() has issues beyond
> > > > > > Dave's
> > > > > > case (I've no XFS and no iscsi), suspect unrelated to his other
> > > > > > problems.
> > > > > 
> > > > 

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Laurence Oberman


- Original Message -
> From: "Jan Kara" <j...@suse.cz>
> To: "Johannes Weiner" <han...@cmpxchg.org>
> Cc: "Hugh Dickins" <hu...@google.com>, "Linus Torvalds" 
> <torva...@linux-foundation.org>, "Dave Chinner"
> <da...@fromorbit.com>, "Chris Leech" <cle...@redhat.com>, "Linux Kernel 
> Mailing List"
> <linux-kernel@vger.kernel.org>, "Lee Duncan" <ldun...@suse.com>, 
> open-is...@googlegroups.com, "Linux SCSI List"
> <linux-s...@vger.kernel.org>, linux-bl...@vger.kernel.org, "Christoph 
> Hellwig" <h...@lst.de>, "Jan Kara"
> <j...@suse.cz>, "Andrea Arcangeli" <aarca...@redhat.com>
> Sent: Tuesday, January 3, 2017 7:28:25 AM
> Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at 
> iscsi_tcp_segment_done+0x20d/0x2e0
> 
> On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <da...@fromorbit.com>
> > > > > > wrote:
> > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > workload again and about a minute in this fired:
> > > > > > >
> > > > > > > [628867.607417] [ cut here ]
> > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461
> > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > 
> > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > Johannes Weiner to the participants.
> > > > > > 
> > > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > > affecting the working set code.
> > > > > > 
> > > > > > Well, I suspect that anything that creates memory pressure will end
> > > > > > up
> > > > > > triggering the working set code, so ..
> > > > > > 
> > > > > > That said, obviously memory corruption could be involved and result
> > > > > > in
> > > > > > random issues too, but I wouldn't really expect that in this code.
> > > > > > 
> > > > > > It would probably be really useful to get more data points - is the
> > > > > > problem reliably in this area, or is it going to be random and all
> > > > > > over the place.
> > > > > 
> > > > > Data point: kswapd got WARNING on mm/workingset.c:457 in
> > > > > shadow_lru_isolate,
> > > > > soon followed by NULL pointer deref in list_lru_isolate, one time
> > > > > when
> > > > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > > > investigate, just set it aside as something to worry about if it
> > > > > happens
> > > > > again.  But it looks like shadow_lru_isolate() has issues beyond
> > > > > Dave's
> > > > > case (I've no XFS and no iscsi), suspect unrelated to his other
> > > > > problems.
> > > > 
> > > > This seems consistent with what Dave observed: we encounter regular
> > > > pages in radix tree nodes on the shadow LRU that should only contain
> > > > nodes full of exceptional shadow entries. It could be an issue in the
> > > > new slot replacement code and the node tracking callback.
> > > 
> > > Both encounters seem to indicate use-after-free. Dave's node didn't
> > > warn about an unexpected node->count / node->exceptional state, but
> > > had entries that were inconsistent with that. Hugh got the counter
> > > warning but crashed on a list_head that's not NULLed in a live node.
> > > 
> > > workingset_update_node() should be called on page cache radix tree
> > > leaf nodes that go empty. I must be missing an update_node callback
> > > where a leaf node gets freed somewhere.
> > 
> 

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Laurence Oberman


- Original Message -
> From: "Jan Kara" 
> To: "Johannes Weiner" 
> Cc: "Hugh Dickins" , "Linus Torvalds" 
> , "Dave Chinner"
> , "Chris Leech" , "Linux Kernel 
> Mailing List"
> , "Lee Duncan" , 
> open-is...@googlegroups.com, "Linux SCSI List"
> , linux-bl...@vger.kernel.org, "Christoph 
> Hellwig" , "Jan Kara"
> , "Andrea Arcangeli" 
> Sent: Tuesday, January 3, 2017 7:28:25 AM
> Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at 
> iscsi_tcp_segment_done+0x20d/0x2e0
> 
> On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner 
> > > > > > wrote:
> > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > workload again and about a minute in this fired:
> > > > > > >
> > > > > > > [628867.607417] [ cut here ]
> > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461
> > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > 
> > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > Johannes Weiner to the participants.
> > > > > > 
> > > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > > affecting the working set code.
> > > > > > 
> > > > > > Well, I suspect that anything that creates memory pressure will end
> > > > > > up
> > > > > > triggering the working set code, so ..
> > > > > > 
> > > > > > That said, obviously memory corruption could be involved and result
> > > > > > in
> > > > > > random issues too, but I wouldn't really expect that in this code.
> > > > > > 
> > > > > > It would probably be really useful to get more data points - is the
> > > > > > problem reliably in this area, or is it going to be random and all
> > > > > > over the place.
> > > > > 
> > > > > Data point: kswapd got WARNING on mm/workingset.c:457 in
> > > > > shadow_lru_isolate,
> > > > > soon followed by NULL pointer deref in list_lru_isolate, one time
> > > > > when
> > > > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > > > investigate, just set it aside as something to worry about if it
> > > > > happens
> > > > > again.  But it looks like shadow_lru_isolate() has issues beyond
> > > > > Dave's
> > > > > case (I've no XFS and no iscsi), suspect unrelated to his other
> > > > > problems.
> > > > 
> > > > This seems consistent with what Dave observed: we encounter regular
> > > > pages in radix tree nodes on the shadow LRU that should only contain
> > > > nodes full of exceptional shadow entries. It could be an issue in the
> > > > new slot replacement code and the node tracking callback.
> > > 
> > > Both encounters seem to indicate use-after-free. Dave's node didn't
> > > warn about an unexpected node->count / node->exceptional state, but
> > > had entries that were inconsistent with that. Hugh got the counter
> > > warning but crashed on a list_head that's not NULLed in a live node.
> > > 
> > > workingset_update_node() should be called on page cache radix tree
> > > leaf nodes that go empty. I must be missing an update_node callback
> > > where a leaf node gets freed somewhere.
> > 
> > Sorry for dropping silent on this. I'm traveling over the holidays
> > with sporadic access to my emails and no access to real equipment.
> > 
> > The times I managed to sneak away to look at the code didn't turn up
> > anything useful yet.
> > 
> > Andrea encountered the warning as well and I gave him

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Christoph Hellwig
On Sat, Dec 24, 2016 at 02:17:26PM +0100, Hannes Reinecke wrote:
> Christoph, do you have a pointer to your patchset?

Here is a pointer to the current one after splitting it into properly
bisectable chunks.  Besides proper changelogs the biggest item left
is fixing up dm-mpath to not allocate its own request structures.

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-pc-refactor


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Christoph Hellwig
On Sat, Dec 24, 2016 at 02:17:26PM +0100, Hannes Reinecke wrote:
> Christoph, do you have a pointer to your patchset?

Here is a pointer to the current one after splitting it into properly
bisectable chunks.  Besides proper changelogs the biggest item left
is fixing up dm-mpath to not allocate its own request structures.

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-pc-refactor


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-03 Thread Jan Kara
On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  
> > > > > wrote:
> > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > workload again and about a minute in this fired:
> > > > > >
> > > > > > [628867.607417] [ cut here ]
> > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > > > > shadow_lru_isolate+0x171/0x220
> > > > > 
> > > > > Well, part of the changes during the merge window were the shadow
> > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > Johannes Weiner to the participants.
> > > > > 
> > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > affecting the working set code.
> > > > > 
> > > > > Well, I suspect that anything that creates memory pressure will end up
> > > > > triggering the working set code, so ..
> > > > > 
> > > > > That said, obviously memory corruption could be involved and result in
> > > > > random issues too, but I wouldn't really expect that in this code.
> > > > > 
> > > > > It would probably be really useful to get more data points - is the
> > > > > problem reliably in this area, or is it going to be random and all
> > > > > over the place.
> > > > 
> > > > Data point: kswapd got WARNING on mm/workingset.c:457 in 
> > > > shadow_lru_isolate,
> > > > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > > investigate, just set it aside as something to worry about if it happens
> > > > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > > > case (I've no XFS and no iscsi), suspect unrelated to his other 
> > > > problems.
> > > 
> > > This seems consistent with what Dave observed: we encounter regular
> > > pages in radix tree nodes on the shadow LRU that should only contain
> > > nodes full of exceptional shadow entries. It could be an issue in the
> > > new slot replacement code and the node tracking callback.
> > 
> > Both encounters seem to indicate use-after-free. Dave's node didn't
> > warn about an unexpected node->count / node->exceptional state, but
> > had entries that were inconsistent with that. Hugh got the counter
> > warning but crashed on a list_head that's not NULLed in a live node.
> > 
> > workingset_update_node() should be called on page cache radix tree
> > leaf nodes that go empty. I must be missing an update_node callback
> > where a leaf node gets freed somewhere.
> 
> Sorry for dropping silent on this. I'm traveling over the holidays
> with sporadic access to my emails and no access to real equipment.
> 
> The times I managed to sneak away to look at the code didn't turn up
> anything useful yet.
> 
> Andrea encountered the warning as well and I gave him a debugging
> patch (attached below), but he hasn't been able to reproduce this
> condition. I've personally never seen the warning trigger, even though
> the patches have been running on my main development machine for quite
> a while now. Albeit against an older base; I've updated to Linus's
> master branch now in case it's an interaction with other new code.
> 
> If anybody manages to reproduce this, that would be helpful. Any extra
> eyes on this would be much appreciated too until I'm back at my desk.

I was looking into this but I didn't find a way how we could possibly leave
radix tree node on LRU. So your debug patch looks like a good way forward.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-03 Thread Jan Kara
On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  
> > > > > wrote:
> > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > workload again and about a minute in this fired:
> > > > > >
> > > > > > [628867.607417] [ cut here ]
> > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > > > > shadow_lru_isolate+0x171/0x220
> > > > > 
> > > > > Well, part of the changes during the merge window were the shadow
> > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > Johannes Weiner to the participants.
> > > > > 
> > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > affecting the working set code.
> > > > > 
> > > > > Well, I suspect that anything that creates memory pressure will end up
> > > > > triggering the working set code, so ..
> > > > > 
> > > > > That said, obviously memory corruption could be involved and result in
> > > > > random issues too, but I wouldn't really expect that in this code.
> > > > > 
> > > > > It would probably be really useful to get more data points - is the
> > > > > problem reliably in this area, or is it going to be random and all
> > > > > over the place.
> > > > 
> > > > Data point: kswapd got WARNING on mm/workingset.c:457 in 
> > > > shadow_lru_isolate,
> > > > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > > investigate, just set it aside as something to worry about if it happens
> > > > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > > > case (I've no XFS and no iscsi), suspect unrelated to his other 
> > > > problems.
> > > 
> > > This seems consistent with what Dave observed: we encounter regular
> > > pages in radix tree nodes on the shadow LRU that should only contain
> > > nodes full of exceptional shadow entries. It could be an issue in the
> > > new slot replacement code and the node tracking callback.
> > 
> > Both encounters seem to indicate use-after-free. Dave's node didn't
> > warn about an unexpected node->count / node->exceptional state, but
> > had entries that were inconsistent with that. Hugh got the counter
> > warning but crashed on a list_head that's not NULLed in a live node.
> > 
> > workingset_update_node() should be called on page cache radix tree
> > leaf nodes that go empty. I must be missing an update_node callback
> > where a leaf node gets freed somewhere.
> 
> Sorry for dropping silent on this. I'm traveling over the holidays
> with sporadic access to my emails and no access to real equipment.
> 
> The times I managed to sneak away to look at the code didn't turn up
> anything useful yet.
> 
> Andrea encountered the warning as well and I gave him a debugging
> patch (attached below), but he hasn't been able to reproduce this
> condition. I've personally never seen the warning trigger, even though
> the patches have been running on my main development machine for quite
> a while now. Albeit against an older base; I've updated to Linus's
> master branch now in case it's an interaction with other new code.
> 
> If anybody manages to reproduce this, that would be helpful. Any extra
> eyes on this would be much appreciated too until I'm back at my desk.

I was looking into this but I didn't find a way how we could possibly leave
radix tree node on LRU. So your debug patch looks like a good way forward.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-02 Thread Johannes Weiner
On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  
> > > > wrote:
> > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > workload again and about a minute in this fired:
> > > > >
> > > > > [628867.607417] [ cut here ]
> > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > > > shadow_lru_isolate+0x171/0x220
> > > > 
> > > > Well, part of the changes during the merge window were the shadow
> > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > Johannes Weiner to the participants.
> > > > 
> > > > > Now, this workload does not touch the page cache at all - it's
> > > > > entirely an XFS metadata workload, so it should not really be
> > > > > affecting the working set code.
> > > > 
> > > > Well, I suspect that anything that creates memory pressure will end up
> > > > triggering the working set code, so ..
> > > > 
> > > > That said, obviously memory corruption could be involved and result in
> > > > random issues too, but I wouldn't really expect that in this code.
> > > > 
> > > > It would probably be really useful to get more data points - is the
> > > > problem reliably in this area, or is it going to be random and all
> > > > over the place.
> > > 
> > > Data point: kswapd got WARNING on mm/workingset.c:457 in 
> > > shadow_lru_isolate,
> > > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > investigate, just set it aside as something to worry about if it happens
> > > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > > case (I've no XFS and no iscsi), suspect unrelated to his other problems.
> > 
> > This seems consistent with what Dave observed: we encounter regular
> > pages in radix tree nodes on the shadow LRU that should only contain
> > nodes full of exceptional shadow entries. It could be an issue in the
> > new slot replacement code and the node tracking callback.
> 
> Both encounters seem to indicate use-after-free. Dave's node didn't
> warn about an unexpected node->count / node->exceptional state, but
> had entries that were inconsistent with that. Hugh got the counter
> warning but crashed on a list_head that's not NULLed in a live node.
> 
> workingset_update_node() should be called on page cache radix tree
> leaf nodes that go empty. I must be missing an update_node callback
> where a leaf node gets freed somewhere.

Sorry for dropping silent on this. I'm traveling over the holidays
with sporadic access to my emails and no access to real equipment.

The times I managed to sneak away to look at the code didn't turn up
anything useful yet.

Andrea encountered the warning as well and I gave him a debugging
patch (attached below), but he hasn't been able to reproduce this
condition. I've personally never seen the warning trigger, even though
the patches have been running on my main development machine for quite
a while now. Albeit against an older base; I've updated to Linus's
master branch now in case it's an interaction with other new code.

If anybody manages to reproduce this, that would be helpful. Any extra
eyes on this would be much appreciated too until I'm back at my desk.

Thanks

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6f382e07de77..0783af1c0ebb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -640,6 +640,8 @@ static inline void radix_tree_shrink(struct radix_tree_root 
*root,
update_node(node, private);
}
 
+   WARN_ON_ONCE(!list_empty(>private_list));
+
radix_tree_node_free(node);
}
 }
@@ -666,6 +668,8 @@ static void delete_node(struct radix_tree_root *root,
root->rnode = NULL;
}
 
+   WARN_ON_ONCE(!list_empty(>private_list));
+
radix_tree_node_free(node);
 
node = parent;
@@ -767,6 +771,7 @@ static void radix_tree_free_nodes(struct radix_tree_node 
*node)
struct radix_tree_node *old = child;
offset = child->offset + 1;
child = child->parent;
+   WARN_ON_ONCE(!list_empty(>private_list));
radix_tree_node_free(old);
if (old == entry_to_node(node))
return;



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-02 Thread Johannes Weiner
On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  
> > > > wrote:
> > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > workload again and about a minute in this fired:
> > > > >
> > > > > [628867.607417] [ cut here ]
> > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > > > shadow_lru_isolate+0x171/0x220
> > > > 
> > > > Well, part of the changes during the merge window were the shadow
> > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > Johannes Weiner to the participants.
> > > > 
> > > > > Now, this workload does not touch the page cache at all - it's
> > > > > entirely an XFS metadata workload, so it should not really be
> > > > > affecting the working set code.
> > > > 
> > > > Well, I suspect that anything that creates memory pressure will end up
> > > > triggering the working set code, so ..
> > > > 
> > > > That said, obviously memory corruption could be involved and result in
> > > > random issues too, but I wouldn't really expect that in this code.
> > > > 
> > > > It would probably be really useful to get more data points - is the
> > > > problem reliably in this area, or is it going to be random and all
> > > > over the place.
> > > 
> > > Data point: kswapd got WARNING on mm/workingset.c:457 in 
> > > shadow_lru_isolate,
> > > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > investigate, just set it aside as something to worry about if it happens
> > > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > > case (I've no XFS and no iscsi), suspect unrelated to his other problems.
> > 
> > This seems consistent with what Dave observed: we encounter regular
> > pages in radix tree nodes on the shadow LRU that should only contain
> > nodes full of exceptional shadow entries. It could be an issue in the
> > new slot replacement code and the node tracking callback.
> 
> Both encounters seem to indicate use-after-free. Dave's node didn't
> warn about an unexpected node->count / node->exceptional state, but
> had entries that were inconsistent with that. Hugh got the counter
> warning but crashed on a list_head that's not NULLed in a live node.
> 
> workingset_update_node() should be called on page cache radix tree
> leaf nodes that go empty. I must be missing an update_node callback
> where a leaf node gets freed somewhere.

Sorry for dropping silent on this. I'm traveling over the holidays
with sporadic access to my emails and no access to real equipment.

The times I managed to sneak away to look at the code didn't turn up
anything useful yet.

Andrea encountered the warning as well and I gave him a debugging
patch (attached below), but he hasn't been able to reproduce this
condition. I've personally never seen the warning trigger, even though
the patches have been running on my main development machine for quite
a while now. Albeit against an older base; I've updated to Linus's
master branch now in case it's an interaction with other new code.

If anybody manages to reproduce this, that would be helpful. Any extra
eyes on this would be much appreciated too until I'm back at my desk.

Thanks

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6f382e07de77..0783af1c0ebb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -640,6 +640,8 @@ static inline void radix_tree_shrink(struct radix_tree_root 
*root,
update_node(node, private);
}
 
+   WARN_ON_ONCE(!list_empty(>private_list));
+
radix_tree_node_free(node);
}
 }
@@ -666,6 +668,8 @@ static void delete_node(struct radix_tree_root *root,
root->rnode = NULL;
}
 
+   WARN_ON_ONCE(!list_empty(>private_list));
+
radix_tree_node_free(node);
 
node = parent;
@@ -767,6 +771,7 @@ static void radix_tree_free_nodes(struct radix_tree_node 
*node)
struct radix_tree_node *old = child;
offset = child->offset + 1;
child = child->parent;
+   WARN_ON_ONCE(!list_empty(>private_list));
radix_tree_node_free(old);
if (old == entry_to_node(node))
return;



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-24 Thread Christoph Hellwig
On Sat, Dec 24, 2016 at 02:17:26PM +0100, Hannes Reinecke wrote:
> Christoph, do you have a pointer to your patchset?
> Not that I'll be able to do any meaningful work until next year, but having 
> a look would be nice. Just to get a feeling where you want to head to; I 
> might be able to work on this start of January.

I'll push out a branch once it's revieable and not my current unbisectable
mess, should be soon.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-24 Thread Christoph Hellwig
On Sat, Dec 24, 2016 at 02:17:26PM +0100, Hannes Reinecke wrote:
> Christoph, do you have a pointer to your patchset?
> Not that I'll be able to do any meaningful work until next year, but having 
> a look would be nice. Just to get a feeling where you want to head to; I 
> might be able to work on this start of January.

I'll push out a branch once it's revieable and not my current unbisectable
mess, should be soon.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-24 Thread Hannes Reinecke

On 12/24/2016 11:07 AM, Christoph Hellwig wrote:

On Fri, Dec 23, 2016 at 11:42:45AM -0800, Linus Torvalds wrote:

Ugh. This patch is nasty.


It's the same SCSI has done for ages - except that is uses a separate
kmalloc for the sense buffer.


I think we should just fix blk_execute_rq() instead.


As you found out below it's not just blk_execute_rq, it's the whole
architecture of the BLOCK_PC code, which expects a caller provided
sense buffer.  But with the way blk-mq allocates request structures
we can actually fix it, but I first need to extent the way it allows
drivers to allocate private data to the old request code.  I've
actually already implemented that for SCSI long time ago, and have
started to life it to the block layer.


Would be cool to have a generic sense buffer.
I always found it slightly odd, pretending that 'struct request' is 
protocol-agnostic and refusing to add a sense data pointer, but at the 
same time having a field 'sense_len' (which gives the length of what 
exactly?).


Christoph, do you have a pointer to your patchset?
Not that I'll be able to do any meaningful work until next year, but 
having a look would be nice. Just to get a feeling where you want to 
head to; I might be able to work on this start of January.


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-24 Thread Hannes Reinecke

On 12/24/2016 11:07 AM, Christoph Hellwig wrote:

On Fri, Dec 23, 2016 at 11:42:45AM -0800, Linus Torvalds wrote:

Ugh. This patch is nasty.


It's the same SCSI has done for ages - except that is uses a separate
kmalloc for the sense buffer.


I think we should just fix blk_execute_rq() instead.


As you found out below it's not just blk_execute_rq, it's the whole
architecture of the BLOCK_PC code, which expects a caller provided
sense buffer.  But with the way blk-mq allocates request structures
we can actually fix it, but I first need to extent the way it allows
drivers to allocate private data to the old request code.  I've
actually already implemented that for SCSI long time ago, and have
started to life it to the block layer.


Would be cool to have a generic sense buffer.
I always found it slightly odd, pretending that 'struct request' is 
protocol-agnostic and refusing to add a sense data pointer, but at the 
same time having a field 'sense_len' (which gives the length of what 
exactly?).


Christoph, do you have a pointer to your patchset?
Not that I'll be able to do any meaningful work until next year, but 
having a look would be nice. Just to get a feeling where you want to 
head to; I might be able to work on this start of January.


Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-24 Thread Christoph Hellwig
On Fri, Dec 23, 2016 at 11:42:45AM -0800, Linus Torvalds wrote:
> Ugh. This patch is nasty.

It's the same SCSI has done for ages - except that is uses a separate
kmalloc for the sense buffer.

> I think we should just fix blk_execute_rq() instead.

As you found out below it's not just blk_execute_rq, it's the whole
architecture of the BLOCK_PC code, which expects a caller provided
sense buffer.  But with the way blk-mq allocates request structures
we can actually fix it, but I first need to extent the way it allows
drivers to allocate private data to the old request code.  I've
actually already implemented that for SCSI long time ago, and have
started to life it to the block layer.

Once that is done the callers won't need a sense buffer at all, and
can just look at the driver provided one.  Which currently is missing
in virtio-blk, so we'd need something similar to the above patch
anyway.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-24 Thread Christoph Hellwig
On Fri, Dec 23, 2016 at 11:42:45AM -0800, Linus Torvalds wrote:
> Ugh. This patch is nasty.

It's the same SCSI has done for ages - except that is uses a separate
kmalloc for the sense buffer.

> I think we should just fix blk_execute_rq() instead.

As you found out below it's not just blk_execute_rq, it's the whole
architecture of the BLOCK_PC code, which expects a caller provided
sense buffer.  But with the way blk-mq allocates request structures
we can actually fix it, but I first need to extent the way it allows
drivers to allocate private data to the old request code.  I've
actually already implemented that for SCSI long time ago, and have
started to life it to the block layer.

Once that is done the callers won't need a sense buffer at all, and
can just look at the driver provided one.  Which currently is missing
in virtio-blk, so we'd need something similar to the above patch
anyway.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-24 Thread Christoph Hellwig
On Fri, Dec 23, 2016 at 07:45:45PM -0700, Jens Axboe wrote:
> It's not that it's technically hard to fix up, it's more that it's a
> pain in the ass to have to do it. For instance, for blk_execute_rq(), we
> either should enforce that the caller allocates it dynamically and then
> free it, or we need nasty hack where the caller needs to know he has to
> free it. Pretty obvious what I would prefer there.
> 
> And yes, there would be a good chunk of other places where this would
> nede to be fixed up...

My planned rework for the BLOCK_PC code (split all fields for them out
of struct request and move them into a separate, driver-allocate structure)
would fix this up as a side-effect.  I really wanted to get it into 4.10,
but I didn't manage to fix it up.  I'll try to get it into 4.11 early.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-24 Thread Christoph Hellwig
On Fri, Dec 23, 2016 at 07:45:45PM -0700, Jens Axboe wrote:
> It's not that it's technically hard to fix up, it's more that it's a
> pain in the ass to have to do it. For instance, for blk_execute_rq(), we
> either should enforce that the caller allocates it dynamically and then
> free it, or we need nasty hack where the caller needs to know he has to
> free it. Pretty obvious what I would prefer there.
> 
> And yes, there would be a good chunk of other places where this would
> nede to be fixed up...

My planned rework for the BLOCK_PC code (split all fields for them out
of struct request and move them into a separate, driver-allocate structure)
would fix this up as a side-effect.  I really wanted to get it into 4.10,
but I didn't manage to fix it up.  I'll try to get it into 4.11 early.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-23 Thread Jens Axboe
On 12/23/2016 12:42 PM, Linus Torvalds wrote:
> On Fri, Dec 23, 2016 at 2:00 AM, Christoph Hellwig  wrote:
>>
>> From: Christoph Hellwig 
>> Date: Fri, 23 Dec 2016 10:57:06 +0100
>> Subject: virtio_blk: avoid DMA to stack for the sense buffer
>>
>> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
>> so to avoid DMA to the stack copy them to a field in the heap allocated
>> virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
>> including the SG_IO ioctl from userspace will crash the kernel.  Note that
>> this includes running tools like hdparm even when the host does not have
>> SCSI passthrough enabled.
> 
> Ugh. This patch is nasty.
> 
> I think we should just fix blk_execute_rq() instead.
> 
> But from a quick look, we also have at least sg_scsi_ioctl() and
> sg_io() doing the same thing.
> 
> And the SG_IO thing in bsg_ioctl(). And spi_execute() in scsi_transport_spi.c
> 
> And resp_requests() in scsi_debug.c.

It's not that it's technically hard to fix up, it's more that it's a
pain in the ass to have to do it. For instance, for blk_execute_rq(), we
either should enforce that the caller allocates it dynamically and then
free it, or we need nasty hack where the caller needs to know he has to
free it. Pretty obvious what I would prefer there.

And yes, there would be a good chunk of other places where this would
nede to be fixed up...

> So I guess ugly it may need to be, and the rule is that the sense
> buffer really can be on the stack and you can't DMA to/from it.
> Comments from others?

I'm just wondering why this is being hit now, we have a 4.9 release with
this issue and nobody reported it (that I saw)... Which is pretty sad.

If no one beats me to it, I'll try and get a patch done on Sunday. We're
in the midst of the holidays here.

-- 
Jens Axboe



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-23 Thread Jens Axboe
On 12/23/2016 12:42 PM, Linus Torvalds wrote:
> On Fri, Dec 23, 2016 at 2:00 AM, Christoph Hellwig  wrote:
>>
>> From: Christoph Hellwig 
>> Date: Fri, 23 Dec 2016 10:57:06 +0100
>> Subject: virtio_blk: avoid DMA to stack for the sense buffer
>>
>> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
>> so to avoid DMA to the stack copy them to a field in the heap allocated
>> virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
>> including the SG_IO ioctl from userspace will crash the kernel.  Note that
>> this includes running tools like hdparm even when the host does not have
>> SCSI passthrough enabled.
> 
> Ugh. This patch is nasty.
> 
> I think we should just fix blk_execute_rq() instead.
> 
> But from a quick look, we also have at least sg_scsi_ioctl() and
> sg_io() doing the same thing.
> 
> And the SG_IO thing in bsg_ioctl(). And spi_execute() in scsi_transport_spi.c
> 
> And resp_requests() in scsi_debug.c.

It's not that it's technically hard to fix up, it's more that it's a
pain in the ass to have to do it. For instance, for blk_execute_rq(), we
either should enforce that the caller allocates it dynamically and then
free it, or we need nasty hack where the caller needs to know he has to
free it. Pretty obvious what I would prefer there.

And yes, there would be a good chunk of other places where this would
nede to be fixed up...

> So I guess ugly it may need to be, and the rule is that the sense
> buffer really can be on the stack and you can't DMA to/from it.
> Comments from others?

I'm just wondering why this is being hit now, we have a 4.9 release with
this issue and nobody reported it (that I saw)... Which is pretty sad.

If no one beats me to it, I'll try and get a patch done on Sunday. We're
in the midst of the holidays here.

-- 
Jens Axboe



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-23 Thread Linus Torvalds
On Fri, Dec 23, 2016 at 2:00 AM, Christoph Hellwig  wrote:
>
> From: Christoph Hellwig 
> Date: Fri, 23 Dec 2016 10:57:06 +0100
> Subject: virtio_blk: avoid DMA to stack for the sense buffer
>
> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
> so to avoid DMA to the stack copy them to a field in the heap allocated
> virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
> including the SG_IO ioctl from userspace will crash the kernel.  Note that
> this includes running tools like hdparm even when the host does not have
> SCSI passthrough enabled.

Ugh. This patch is nasty.

I think we should just fix blk_execute_rq() instead.

But from a quick look, we also have at least sg_scsi_ioctl() and
sg_io() doing the same thing.

And the SG_IO thing in bsg_ioctl(). And spi_execute() in scsi_transport_spi.c

And resp_requests() in scsi_debug.c.

So I guess ugly it may need to be, and the rule is that the sense
buffer really can be on the stack and you can't DMA to/from it.
Comments from others?

Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-23 Thread Linus Torvalds
On Fri, Dec 23, 2016 at 2:00 AM, Christoph Hellwig  wrote:
>
> From: Christoph Hellwig 
> Date: Fri, 23 Dec 2016 10:57:06 +0100
> Subject: virtio_blk: avoid DMA to stack for the sense buffer
>
> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
> so to avoid DMA to the stack copy them to a field in the heap allocated
> virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
> including the SG_IO ioctl from userspace will crash the kernel.  Note that
> this includes running tools like hdparm even when the host does not have
> SCSI passthrough enabled.

Ugh. This patch is nasty.

I think we should just fix blk_execute_rq() instead.

But from a quick look, we also have at least sg_scsi_ioctl() and
sg_io() doing the same thing.

And the SG_IO thing in bsg_ioctl(). And spi_execute() in scsi_transport_spi.c

And resp_requests() in scsi_debug.c.

So I guess ugly it may need to be, and the rule is that the sense
buffer really can be on the stack and you can't DMA to/from it.
Comments from others?

Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-23 Thread Christoph Hellwig
On Thu, Dec 22, 2016 at 04:03:56PM -0800, Chris Leech wrote:
> Of course, looks like I've screwed up my bisect run on this so I'm still
> taking a look.  It triggers for me with 'hdparm -B /dev/vda' but may
> also depend on kernel configuration.
> 
> I started with the fedora rawhide config with a lot of debug on and
> trimed it down with a localmodconfig run in the VM to speed up rebuilds.

I think the configuration dependency is CONFIG_HAVE_ARCH_VMAP_STACK,
I've just reproduce the issue with it, and the backtrace points to
__virtblk_add_req when setting up the sense buffer.   And it turns out
that blk_execute_rq tries to do I/O to the on-stack sense buffer.
At least SCSI always has a kmalloced sense buffer, so I guess we'll
need something similar for virtio_blk for now.  For 4.11 I plan to
rework how BLOCK_PC commands work entirely, so hopefull we can make
the sense buffer handling a lot less wasteful.

---
>From 0a77bc424ed907c1e99b4756bb498370b498183a Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Fri, 23 Dec 2016 10:57:06 +0100
Subject: virtio_blk: avoid DMA to stack for the sense buffer

Most users of BLOCK_PC requests allocate the sense buffer on the stack,
so to avoid DMA to the stack copy them to a field in the heap allocated
virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
including the SG_IO ioctl from userspace will crash the kernel.  Note that
this includes running tools like hdparm even when the host does not have
SCSI passthrough enabled.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/virtio_blk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 5545a67..3c3b8f6 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -56,6 +56,7 @@ struct virtblk_req {
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
u8 status;
+   u8 sense[SCSI_SENSE_BUFFERSIZE];
struct scatterlist sg[];
 };
 
@@ -102,7 +103,8 @@ static int __virtblk_add_req(struct virtqueue *vq,
}
 
if (type == cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_SCSI_CMD)) {
-   sg_init_one(, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
+   memcpy(vbr->sense, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
+   sg_init_one(, vbr->sense, SCSI_SENSE_BUFFERSIZE);
sgs[num_out + num_in++] = 
sg_init_one(, >in_hdr, sizeof(vbr->in_hdr));
sgs[num_out + num_in++] = 
-- 
2.1.4



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-23 Thread Christoph Hellwig
On Thu, Dec 22, 2016 at 04:03:56PM -0800, Chris Leech wrote:
> Of course, looks like I've screwed up my bisect run on this so I'm still
> taking a look.  It triggers for me with 'hdparm -B /dev/vda' but may
> also depend on kernel configuration.
> 
> I started with the fedora rawhide config with a lot of debug on and
> trimed it down with a localmodconfig run in the VM to speed up rebuilds.

I think the configuration dependency is CONFIG_HAVE_ARCH_VMAP_STACK,
I've just reproduce the issue with it, and the backtrace points to
__virtblk_add_req when setting up the sense buffer.   And it turns out
that blk_execute_rq tries to do I/O to the on-stack sense buffer.
At least SCSI always has a kmalloced sense buffer, so I guess we'll
need something similar for virtio_blk for now.  For 4.11 I plan to
rework how BLOCK_PC commands work entirely, so hopefull we can make
the sense buffer handling a lot less wasteful.

---
>From 0a77bc424ed907c1e99b4756bb498370b498183a Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Fri, 23 Dec 2016 10:57:06 +0100
Subject: virtio_blk: avoid DMA to stack for the sense buffer

Most users of BLOCK_PC requests allocate the sense buffer on the stack,
so to avoid DMA to the stack copy them to a field in the heap allocated
virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
including the SG_IO ioctl from userspace will crash the kernel.  Note that
this includes running tools like hdparm even when the host does not have
SCSI passthrough enabled.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/virtio_blk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 5545a67..3c3b8f6 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -56,6 +56,7 @@ struct virtblk_req {
struct virtio_blk_outhdr out_hdr;
struct virtio_scsi_inhdr in_hdr;
u8 status;
+   u8 sense[SCSI_SENSE_BUFFERSIZE];
struct scatterlist sg[];
 };
 
@@ -102,7 +103,8 @@ static int __virtblk_add_req(struct virtqueue *vq,
}
 
if (type == cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_SCSI_CMD)) {
-   sg_init_one(, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
+   memcpy(vbr->sense, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
+   sg_init_one(, vbr->sense, SCSI_SENSE_BUFFERSIZE);
sgs[num_out + num_in++] = 
sg_init_one(, >in_hdr, sizeof(vbr->in_hdr));
sgs[num_out + num_in++] = 
-- 
2.1.4



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-23 Thread Johannes Weiner
On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
> > > > I unmounted the fs, mkfs'd it again, ran the
> > > > workload again and about a minute in this fired:
> > > >
> > > > [628867.607417] [ cut here ]
> > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > > shadow_lru_isolate+0x171/0x220
> > > 
> > > Well, part of the changes during the merge window were the shadow
> > > entry tracking changes that came in through Andrew's tree. Adding
> > > Johannes Weiner to the participants.
> > > 
> > > > Now, this workload does not touch the page cache at all - it's
> > > > entirely an XFS metadata workload, so it should not really be
> > > > affecting the working set code.
> > > 
> > > Well, I suspect that anything that creates memory pressure will end up
> > > triggering the working set code, so ..
> > > 
> > > That said, obviously memory corruption could be involved and result in
> > > random issues too, but I wouldn't really expect that in this code.
> > > 
> > > It would probably be really useful to get more data points - is the
> > > problem reliably in this area, or is it going to be random and all
> > > over the place.
> > 
> > Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > investigate, just set it aside as something to worry about if it happens
> > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > case (I've no XFS and no iscsi), suspect unrelated to his other problems.
> 
> This seems consistent with what Dave observed: we encounter regular
> pages in radix tree nodes on the shadow LRU that should only contain
> nodes full of exceptional shadow entries. It could be an issue in the
> new slot replacement code and the node tracking callback.

Both encounters seem to indicate use-after-free. Dave's node didn't
warn about an unexpected node->count / node->exceptional state, but
had entries that were inconsistent with that. Hugh got the counter
warning but crashed on a list_head that's not NULLed in a live node.

workingset_update_node() should be called on page cache radix tree
leaf nodes that go empty. I must be missing an update_node callback
where a leaf node gets freed somewhere.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-23 Thread Johannes Weiner
On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
> > > > I unmounted the fs, mkfs'd it again, ran the
> > > > workload again and about a minute in this fired:
> > > >
> > > > [628867.607417] [ cut here ]
> > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > > shadow_lru_isolate+0x171/0x220
> > > 
> > > Well, part of the changes during the merge window were the shadow
> > > entry tracking changes that came in through Andrew's tree. Adding
> > > Johannes Weiner to the participants.
> > > 
> > > > Now, this workload does not touch the page cache at all - it's
> > > > entirely an XFS metadata workload, so it should not really be
> > > > affecting the working set code.
> > > 
> > > Well, I suspect that anything that creates memory pressure will end up
> > > triggering the working set code, so ..
> > > 
> > > That said, obviously memory corruption could be involved and result in
> > > random issues too, but I wouldn't really expect that in this code.
> > > 
> > > It would probably be really useful to get more data points - is the
> > > problem reliably in this area, or is it going to be random and all
> > > over the place.
> > 
> > Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > investigate, just set it aside as something to worry about if it happens
> > again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> > case (I've no XFS and no iscsi), suspect unrelated to his other problems.
> 
> This seems consistent with what Dave observed: we encounter regular
> pages in radix tree nodes on the shadow LRU that should only contain
> nodes full of exceptional shadow entries. It could be an issue in the
> new slot replacement code and the node tracking callback.

Both encounters seem to indicate use-after-free. Dave's node didn't
warn about an unexpected node->count / node->exceptional state, but
had entries that were inconsistent with that. Hugh got the counter
warning but crashed on a list_head that's not NULLed in a live node.

workingset_update_node() should be called on page cache radix tree
leaf nodes that go empty. I must be missing an update_node callback
where a leaf node gets freed somewhere.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Johannes Weiner
On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
> > > I unmounted the fs, mkfs'd it again, ran the
> > > workload again and about a minute in this fired:
> > >
> > > [628867.607417] [ cut here ]
> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > shadow_lru_isolate+0x171/0x220
> > 
> > Well, part of the changes during the merge window were the shadow
> > entry tracking changes that came in through Andrew's tree. Adding
> > Johannes Weiner to the participants.
> > 
> > > Now, this workload does not touch the page cache at all - it's
> > > entirely an XFS metadata workload, so it should not really be
> > > affecting the working set code.
> > 
> > Well, I suspect that anything that creates memory pressure will end up
> > triggering the working set code, so ..
> > 
> > That said, obviously memory corruption could be involved and result in
> > random issues too, but I wouldn't really expect that in this code.
> > 
> > It would probably be really useful to get more data points - is the
> > problem reliably in this area, or is it going to be random and all
> > over the place.
> 
> Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> soon followed by NULL pointer deref in list_lru_isolate, one time when
> I tried out Sunday's git tree.  Not seen since, I haven't had time to
> investigate, just set it aside as something to worry about if it happens
> again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> case (I've no XFS and no iscsi), suspect unrelated to his other problems.

This seems consistent with what Dave observed: we encounter regular
pages in radix tree nodes on the shadow LRU that should only contain
nodes full of exceptional shadow entries. It could be an issue in the
new slot replacement code and the node tracking callback.

Looking...


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Johannes Weiner
On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
> > > I unmounted the fs, mkfs'd it again, ran the
> > > workload again and about a minute in this fired:
> > >
> > > [628867.607417] [ cut here ]
> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > shadow_lru_isolate+0x171/0x220
> > 
> > Well, part of the changes during the merge window were the shadow
> > entry tracking changes that came in through Andrew's tree. Adding
> > Johannes Weiner to the participants.
> > 
> > > Now, this workload does not touch the page cache at all - it's
> > > entirely an XFS metadata workload, so it should not really be
> > > affecting the working set code.
> > 
> > Well, I suspect that anything that creates memory pressure will end up
> > triggering the working set code, so ..
> > 
> > That said, obviously memory corruption could be involved and result in
> > random issues too, but I wouldn't really expect that in this code.
> > 
> > It would probably be really useful to get more data points - is the
> > problem reliably in this area, or is it going to be random and all
> > over the place.
> 
> Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> soon followed by NULL pointer deref in list_lru_isolate, one time when
> I tried out Sunday's git tree.  Not seen since, I haven't had time to
> investigate, just set it aside as something to worry about if it happens
> again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
> case (I've no XFS and no iscsi), suspect unrelated to his other problems.

This seems consistent with what Dave observed: we encounter regular
pages in radix tree nodes on the shadow LRU that should only contain
nodes full of exceptional shadow entries. It could be an issue in the
new slot replacement code and the node tracking callback.

Looking...


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Dave Chinner
On Fri, Dec 23, 2016 at 09:33:36AM +1100, Dave Chinner wrote:
> On Fri, Dec 23, 2016 at 09:15:00AM +1100, Dave Chinner wrote:
> > On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> > > Ok, so the numa issue was a red herring. With that fixed:
> > > 
> > > On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner  wrote:
> > > >
> > > > Better, but still bad. average files/s is not up to 200k files/s,
> > > > so still a good 10-15% off where it should be. xfs_repair is back
> > > > down to 10-15% off where it should be, too. bulkstat still fires off
> > > > a bad page reference count warning, iscsi still panics immediately.
> > > 
> > > Do you have CONFIG_BLK_WBT enabled, perhaps?
> > 
> > Ok, yes, that's enabled. Let me go turn it off and see what happens.
> 
> Numbers are still all over the place.
> 
> FSUse%Count SizeFiles/sec App Overhead
> .
>  0 28800 228175.5 21928812
>  0 30400 167880.5 39628229
>  0 32000 124289.5 41420925
>  0 33600 150577.9 35382318
>  0 35200 216535.4 16072628
>  0 36800 233414.4 11846654
>  0 38400 213812.0 13356633
>  0 40000 175905.7 53012015
>  0 41600 157028.7 34700794
>  0 43200 138829.1 50282461
> 
> And the average is now back down to 185k files/s. repair runtime is
> unchanged and still 10-15% off...
> 
> I've got to run away for a few hours right now, but I'll retest the
> 4.9 + xfs for-next branch when I get back to see if the problem is
> my curent config or whether there really is a perf problem lurking
> somewhere

Well, I'm not sure now. Taking that config back to 4.9 gave
results of 210k files/s. A bit faster, but still not the 230k
files/s I'm expecting. So I'm missing something in the .config
at this point, though I'm not sure what.

FWIW, updating from 4.9 to to the 4.10 tree, this happened:

$ $ make oldconfig; make -j 32
scripts/kconfig/conf  --oldconfig Kconfig
*
* Restart config...
*
*
* General setup
*

Yup, it definitely did something unexpected. And almost silently, I
might add - I didn't notice this the first time around, and wouldn't
hav enoticed it this time if I wasn't looking for something strange
to happen.

As iit is, I still haven't found what magic config option is
taking away that 10-15% of performance.  There's no unexpected debug
options set, and it's nothing obvious in the fs/block layer config.
I'll keep looking for the moment...

Cheers,

Dave.


-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Dave Chinner
On Fri, Dec 23, 2016 at 09:33:36AM +1100, Dave Chinner wrote:
> On Fri, Dec 23, 2016 at 09:15:00AM +1100, Dave Chinner wrote:
> > On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> > > Ok, so the numa issue was a red herring. With that fixed:
> > > 
> > > On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner  wrote:
> > > >
> > > > Better, but still bad. average files/s is not up to 200k files/s,
> > > > so still a good 10-15% off where it should be. xfs_repair is back
> > > > down to 10-15% off where it should be, too. bulkstat still fires off
> > > > a bad page reference count warning, iscsi still panics immediately.
> > > 
> > > Do you have CONFIG_BLK_WBT enabled, perhaps?
> > 
> > Ok, yes, that's enabled. Let me go turn it off and see what happens.
> 
> Numbers are still all over the place.
> 
> FSUse%Count SizeFiles/sec App Overhead
> .
>  0 28800 228175.5 21928812
>  0 30400 167880.5 39628229
>  0 32000 124289.5 41420925
>  0 33600 150577.9 35382318
>  0 35200 216535.4 16072628
>  0 36800 233414.4 11846654
>  0 38400 213812.0 13356633
>  0 40000 175905.7 53012015
>  0 41600 157028.7 34700794
>  0 43200 138829.1 50282461
> 
> And the average is now back down to 185k files/s. repair runtime is
> unchanged and still 10-15% off...
> 
> I've got to run away for a few hours right now, but I'll retest the
> 4.9 + xfs for-next branch when I get back to see if the problem is
> my curent config or whether there really is a perf problem lurking
> somewhere

Well, I'm not sure now. Taking that config back to 4.9 gave
results of 210k files/s. A bit faster, but still not the 230k
files/s I'm expecting. So I'm missing something in the .config
at this point, though I'm not sure what.

FWIW, updating from 4.9 to to the 4.10 tree, this happened:

$ $ make oldconfig; make -j 32
scripts/kconfig/conf  --oldconfig Kconfig
*
* Restart config...
*
*
* General setup
*

Yup, it definitely did something unexpected. And almost silently, I
might add - I didn't notice this the first time around, and wouldn't
hav enoticed it this time if I wasn't looking for something strange
to happen.

As iit is, I still haven't found what magic config option is
taking away that 10-15% of performance.  There's no unexpected debug
options set, and it's nothing obvious in the fs/block layer config.
I'll keep looking for the moment...

Cheers,

Dave.


-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Jens Axboe
On 12/22/2016 02:10 PM, Linus Torvalds wrote:
> Ok, so the numa issue was a red herring. With that fixed:
> 
> On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner  wrote:
>>
>> Better, but still bad. average files/s is not up to 200k files/s,
>> so still a good 10-15% off where it should be. xfs_repair is back
>> down to 10-15% off where it should be, too. bulkstat still fires off
>> a bad page reference count warning, iscsi still panics immediately.
> 
> Do you have CONFIG_BLK_WBT enabled, perhaps?
> 
> It's new to this merge window, and I'm not convinced it's been tuned.
> Particularly for your kinds of fairly extreme IO loads.

We have it running on thousands of machines at Facebook, and I've used
it in all of my testing, haven't seen any issues. It's been run pretty
damn hard.

-- 
Jens Axboe



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Jens Axboe
On 12/22/2016 02:10 PM, Linus Torvalds wrote:
> Ok, so the numa issue was a red herring. With that fixed:
> 
> On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner  wrote:
>>
>> Better, but still bad. average files/s is not up to 200k files/s,
>> so still a good 10-15% off where it should be. xfs_repair is back
>> down to 10-15% off where it should be, too. bulkstat still fires off
>> a bad page reference count warning, iscsi still panics immediately.
> 
> Do you have CONFIG_BLK_WBT enabled, perhaps?
> 
> It's new to this merge window, and I'm not convinced it's been tuned.
> Particularly for your kinds of fairly extreme IO loads.

We have it running on thousands of machines at Facebook, and I've used
it in all of my testing, haven't seen any issues. It's been run pretty
damn hard.

-- 
Jens Axboe



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Chris Leech
On Fri, Dec 23, 2016 at 07:53:50AM +0800, Ming Lei wrote:
> On Fri, Dec 23, 2016 at 2:50 AM, Chris Leech  wrote:
> > I'm not reproducing any problems with xfstests running over iscsi_tcp
> > right now.  Two 10G luns exported from an LIO target, attached directly
> > to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
> > TEST_DEV and SCRATCH_DEV.
> >
> > The virtio scatterlist issue that popped right away for me is triggered
> > by an hdparm ioctl, which is being run by tuned on Fedora.  And that
> > actually seems to happen back on 4.9 as well :(
> 
> Could you share us what the specific hdparm cmd line is? I tried several
> random cmds over virtio-blk/virito-scsi, looks not see this problem.

Of course, looks like I've screwed up my bisect run on this so I'm still
taking a look.  It triggers for me with 'hdparm -B /dev/vda' but may
also depend on kernel configuration.

I started with the fedora rawhide config with a lot of debug on and
trimed it down with a localmodconfig run in the VM to speed up rebuilds.

Chris


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Chris Leech
On Fri, Dec 23, 2016 at 07:53:50AM +0800, Ming Lei wrote:
> On Fri, Dec 23, 2016 at 2:50 AM, Chris Leech  wrote:
> > I'm not reproducing any problems with xfstests running over iscsi_tcp
> > right now.  Two 10G luns exported from an LIO target, attached directly
> > to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
> > TEST_DEV and SCRATCH_DEV.
> >
> > The virtio scatterlist issue that popped right away for me is triggered
> > by an hdparm ioctl, which is being run by tuned on Fedora.  And that
> > actually seems to happen back on 4.9 as well :(
> 
> Could you share us what the specific hdparm cmd line is? I tried several
> random cmds over virtio-blk/virito-scsi, looks not see this problem.

Of course, looks like I've screwed up my bisect run on this so I'm still
taking a look.  It triggers for me with 'hdparm -B /dev/vda' but may
also depend on kernel configuration.

I started with the fedora rawhide config with a lot of debug on and
trimed it down with a localmodconfig run in the VM to speed up rebuilds.

Chris


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Ming Lei
On Fri, Dec 23, 2016 at 2:50 AM, Chris Leech  wrote:
> On Thu, Dec 22, 2016 at 05:50:12PM +1100, Dave Chinner wrote:
>> On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
>> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
>> > >
>> > > There may be deeper issues. I just started running scalability tests
>> > > (e.g. 16-way fsmark create tests) and about a minute in I got a
>> > > directory corruption reported - something I hadn't seen in the dev
>> > > cycle at all.
>> >
>> > By "in the dev cycle", do you mean your XFS changes, or have you been
>> > tracking the merge cycle at least for some testing?
>>
>> I mean the three months leading up to the 4.10 merge, when all the
>> XFS changes were being tested against 4.9-rc kernels.
>>
>> The iscsi problem showed up when I updated the base kernel from
>> 4.9 to 4.10-current last week to test the pullreq I was going to
>> send you. I've been bust with other stuff until now, so I didn't
>> upgrade my working trees again until today in the hope the iscsi
>> problem had already been found and fixed.
>>
>> > > I unmounted the fs, mkfs'd it again, ran the
>> > > workload again and about a minute in this fired:
>> > >
>> > > [628867.607417] [ cut here ]
>> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
>> > > shadow_lru_isolate+0x171/0x220
>> >
>> > Well, part of the changes during the merge window were the shadow
>> > entry tracking changes that came in through Andrew's tree. Adding
>> > Johannes Weiner to the participants.
>> >
>> > > Now, this workload does not touch the page cache at all - it's
>> > > entirely an XFS metadata workload, so it should not really be
>> > > affecting the working set code.
>> >
>> > Well, I suspect that anything that creates memory pressure will end up
>> > triggering the working set code, so ..
>> >
>> > That said, obviously memory corruption could be involved and result in
>> > random issues too, but I wouldn't really expect that in this code.
>> >
>> > It would probably be really useful to get more data points - is the
>> > problem reliably in this area, or is it going to be random and all
>> > over the place.
>>
>> The iscsi problem is 100% reproducable. create a pair of iscsi luns,
>> mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
>> the filesystems.
>>
>> The test machine I'm having all these other problems on? stable and
>> steady as a rock using PMEM devices. Moment I go to use /dev/vdc
>> (i.e. run load/perf benchmarks) it starts falling over left, right
>> and center.
>
> I'm not reproducing any problems with xfstests running over iscsi_tcp
> right now.  Two 10G luns exported from an LIO target, attached directly
> to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
> TEST_DEV and SCRATCH_DEV.
>
> The virtio scatterlist issue that popped right away for me is triggered
> by an hdparm ioctl, which is being run by tuned on Fedora.  And that
> actually seems to happen back on 4.9 as well :(

Could you share us what the specific hdparm cmd line is? I tried several
random cmds over virtio-blk/virito-scsi, looks not see this problem.


Thanks,
Ming Lei


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Ming Lei
On Fri, Dec 23, 2016 at 2:50 AM, Chris Leech  wrote:
> On Thu, Dec 22, 2016 at 05:50:12PM +1100, Dave Chinner wrote:
>> On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
>> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
>> > >
>> > > There may be deeper issues. I just started running scalability tests
>> > > (e.g. 16-way fsmark create tests) and about a minute in I got a
>> > > directory corruption reported - something I hadn't seen in the dev
>> > > cycle at all.
>> >
>> > By "in the dev cycle", do you mean your XFS changes, or have you been
>> > tracking the merge cycle at least for some testing?
>>
>> I mean the three months leading up to the 4.10 merge, when all the
>> XFS changes were being tested against 4.9-rc kernels.
>>
>> The iscsi problem showed up when I updated the base kernel from
>> 4.9 to 4.10-current last week to test the pullreq I was going to
>> send you. I've been bust with other stuff until now, so I didn't
>> upgrade my working trees again until today in the hope the iscsi
>> problem had already been found and fixed.
>>
>> > > I unmounted the fs, mkfs'd it again, ran the
>> > > workload again and about a minute in this fired:
>> > >
>> > > [628867.607417] [ cut here ]
>> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
>> > > shadow_lru_isolate+0x171/0x220
>> >
>> > Well, part of the changes during the merge window were the shadow
>> > entry tracking changes that came in through Andrew's tree. Adding
>> > Johannes Weiner to the participants.
>> >
>> > > Now, this workload does not touch the page cache at all - it's
>> > > entirely an XFS metadata workload, so it should not really be
>> > > affecting the working set code.
>> >
>> > Well, I suspect that anything that creates memory pressure will end up
>> > triggering the working set code, so ..
>> >
>> > That said, obviously memory corruption could be involved and result in
>> > random issues too, but I wouldn't really expect that in this code.
>> >
>> > It would probably be really useful to get more data points - is the
>> > problem reliably in this area, or is it going to be random and all
>> > over the place.
>>
>> The iscsi problem is 100% reproducable. create a pair of iscsi luns,
>> mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
>> the filesystems.
>>
>> The test machine I'm having all these other problems on? stable and
>> steady as a rock using PMEM devices. Moment I go to use /dev/vdc
>> (i.e. run load/perf benchmarks) it starts falling over left, right
>> and center.
>
> I'm not reproducing any problems with xfstests running over iscsi_tcp
> right now.  Two 10G luns exported from an LIO target, attached directly
> to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
> TEST_DEV and SCRATCH_DEV.
>
> The virtio scatterlist issue that popped right away for me is triggered
> by an hdparm ioctl, which is being run by tuned on Fedora.  And that
> actually seems to happen back on 4.9 as well :(

Could you share us what the specific hdparm cmd line is? I tried several
random cmds over virtio-blk/virito-scsi, looks not see this problem.


Thanks,
Ming Lei


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Dave Chinner
On Fri, Dec 23, 2016 at 09:15:00AM +1100, Dave Chinner wrote:
> On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> > Ok, so the numa issue was a red herring. With that fixed:
> > 
> > On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner  wrote:
> > >
> > > Better, but still bad. average files/s is not up to 200k files/s,
> > > so still a good 10-15% off where it should be. xfs_repair is back
> > > down to 10-15% off where it should be, too. bulkstat still fires off
> > > a bad page reference count warning, iscsi still panics immediately.
> > 
> > Do you have CONFIG_BLK_WBT enabled, perhaps?
> 
> Ok, yes, that's enabled. Let me go turn it off and see what happens.

Numbers are still all over the place.

FSUse%Count SizeFiles/sec App Overhead
.
 0 28800 228175.5 21928812
 0 30400 167880.5 39628229
 0 32000 124289.5 41420925
 0 33600 150577.9 35382318
 0 35200 216535.4 16072628
 0 36800 233414.4 11846654
 0 38400 213812.0 13356633
 0 40000 175905.7 53012015
 0 41600 157028.7 34700794
 0 43200 138829.1 50282461

And the average is now back down to 185k files/s. repair runtime is
unchanged and still 10-15% off...

I've got to run away for a few hours right now, but I'll retest the
4.9 + xfs for-next branch when I get back to see if the problem is
my curent config or whether there really is a perf problem lurking
somewhere

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Dave Chinner
On Fri, Dec 23, 2016 at 09:15:00AM +1100, Dave Chinner wrote:
> On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> > Ok, so the numa issue was a red herring. With that fixed:
> > 
> > On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner  wrote:
> > >
> > > Better, but still bad. average files/s is not up to 200k files/s,
> > > so still a good 10-15% off where it should be. xfs_repair is back
> > > down to 10-15% off where it should be, too. bulkstat still fires off
> > > a bad page reference count warning, iscsi still panics immediately.
> > 
> > Do you have CONFIG_BLK_WBT enabled, perhaps?
> 
> Ok, yes, that's enabled. Let me go turn it off and see what happens.

Numbers are still all over the place.

FSUse%Count SizeFiles/sec App Overhead
.
 0 28800 228175.5 21928812
 0 30400 167880.5 39628229
 0 32000 124289.5 41420925
 0 33600 150577.9 35382318
 0 35200 216535.4 16072628
 0 36800 233414.4 11846654
 0 38400 213812.0 13356633
 0 40000 175905.7 53012015
 0 41600 157028.7 34700794
 0 43200 138829.1 50282461

And the average is now back down to 185k files/s. repair runtime is
unchanged and still 10-15% off...

I've got to run away for a few hours right now, but I'll retest the
4.9 + xfs for-next branch when I get back to see if the problem is
my curent config or whether there really is a perf problem lurking
somewhere

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Dave Chinner
On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> Ok, so the numa issue was a red herring. With that fixed:
> 
> On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner  wrote:
> >
> > Better, but still bad. average files/s is not up to 200k files/s,
> > so still a good 10-15% off where it should be. xfs_repair is back
> > down to 10-15% off where it should be, too. bulkstat still fires off
> > a bad page reference count warning, iscsi still panics immediately.
> 
> Do you have CONFIG_BLK_WBT enabled, perhaps?

Ok, yes, that's enabled. Let me go turn it off and see what happens.

> It's new to this merge window, and I'm not convinced it's been tuned.
> Particularly for your kinds of fairly extreme IO loads.

Well, these aren't extreme IO workloads. Yes, the filesystem does a
lot of work on the CPU, but the IO loads XFS generates aren't
particularly high - maybe 2-3000 IOPS at most, and mostly the
bandwidth is below 100MB/s.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Dave Chinner
On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> Ok, so the numa issue was a red herring. With that fixed:
> 
> On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner  wrote:
> >
> > Better, but still bad. average files/s is not up to 200k files/s,
> > so still a good 10-15% off where it should be. xfs_repair is back
> > down to 10-15% off where it should be, too. bulkstat still fires off
> > a bad page reference count warning, iscsi still panics immediately.
> 
> Do you have CONFIG_BLK_WBT enabled, perhaps?

Ok, yes, that's enabled. Let me go turn it off and see what happens.

> It's new to this merge window, and I'm not convinced it's been tuned.
> Particularly for your kinds of fairly extreme IO loads.

Well, these aren't extreme IO workloads. Yes, the filesystem does a
lot of work on the CPU, but the IO loads XFS generates aren't
particularly high - maybe 2-3000 IOPS at most, and mostly the
bandwidth is below 100MB/s.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Linus Torvalds
Ok, so the numa issue was a red herring. With that fixed:

On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner  wrote:
>
> Better, but still bad. average files/s is not up to 200k files/s,
> so still a good 10-15% off where it should be. xfs_repair is back
> down to 10-15% off where it should be, too. bulkstat still fires off
> a bad page reference count warning, iscsi still panics immediately.

Do you have CONFIG_BLK_WBT enabled, perhaps?

It's new to this merge window, and I'm not convinced it's been tuned.
Particularly for your kinds of fairly extreme IO loads.

  Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Linus Torvalds
Ok, so the numa issue was a red herring. With that fixed:

On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner  wrote:
>
> Better, but still bad. average files/s is not up to 200k files/s,
> so still a good 10-15% off where it should be. xfs_repair is back
> down to 10-15% off where it should be, too. bulkstat still fires off
> a bad page reference count warning, iscsi still panics immediately.

Do you have CONFIG_BLK_WBT enabled, perhaps?

It's new to this merge window, and I'm not convinced it's been tuned.
Particularly for your kinds of fairly extreme IO loads.

  Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Dave Chinner
On Fri, Dec 23, 2016 at 07:42:40AM +1100, Dave Chinner wrote:
> On Thu, Dec 22, 2016 at 09:24:12AM -0800, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner  wrote:
> > >
> > > This sort of thing is normally indicative of a memory reclaim or
> > > lock contention problem. Profile showed unusual spinlock contention,
> > > but then I realised there was only one kswapd thread running.
> > > Yup, sure enough, it's caused by a major change in memory reclaim
> > > behaviour:
> > >
> > > [0.00] Zone ranges:
> > > [0.00]   DMA  [mem 0x1000-0x00ff]
> > > [0.00]   DMA32[mem 0x0100-0x]
> > > [0.00]   Normal   [mem 0x0001-0x00083fff]
> > > [0.00] Movable zone start for each node
> > > [0.00] Early memory node ranges
> > > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > > [0.00]   node   0: [mem 0x0010-0xbffdefff]
> > > [0.00]   node   0: [mem 0x0001-0x0003bfff]
> > > [0.00]   node   0: [mem 0x0005c000-0x0005]
> > > [0.00]   node   0: [mem 0x0008-0x00083fff]
> > > [0.00] Initmem setup node 0 [mem 
> > > 0x1000-0x00083fff]
> > >
> > > the numa=fake=4 CLI option is broken.
> > 
> > Ok, I think that is independent of anything else. Removing block
> > people and adding the x86 people.
> > 
> > I'm not seeing anything at all that would change the fake numa stuff,
> > but maybe the cpu hotplug changes?
> > 
> > Thomas/Ingo/Peter - Dave is going away for several months, so you
> > won't get feedback from him, but can you look at this? Or maybe point
> > me towards the right people - I'm seeing no possible relevant changes
> > at all fir x85 numa since 4.9, so it must be some indirect breakage.
> > 
> > Dave is using fake-numa to do performance testing in a VM, and it's a
> > big deal for the node optimizations for writeback etc. Do you have any
> > ideas?
> > 
> > Dave, if you're still around, can you send out the kernel config file
> > you used...
> 
> Looking at this fresh this morning (i.e. not pissed off by having
> everything I tried to do fail in different ways all afternoon) I
> found this:
> 
> $ grep NUMA .config
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> # CONFIG_NUMA is not set
> $
> 
> The .config I was using for 4.9 got 'make oldconfig' upgraded, and
> looking at it there's a bunch of stuff that has been turned off that
> I know was set:
> 
> # CONFIG_EXPERT is not set
> # CONFIG_PARAVIRT_SPINLOCKS is not set
> # CONFIG_COMPACTION is not set
> 
> and stuff I never use so don't set was set, like kernel crash dump,
> a bunch of stuff for AMD CPUs, susp/resume and power management
> debug, every partition type and filesystem under the sun was
> selected, heaps of network devices enabled, etc.
> 
> So it looks like the problem has occurred during oldconfig, meaning
> I have no idea exactly WTF I was testing. Rebuilding now with a
> saner config, see what happens.

Better, but still bad. average files/s is not up to 200k files/s,
so still a good 10-15% off where it should be. xfs_repair is back
down to 10-15% off where it should be, too. bulkstat still fires off
a bad page reference count warning, iscsi still panics immediately.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Dave Chinner
On Fri, Dec 23, 2016 at 07:42:40AM +1100, Dave Chinner wrote:
> On Thu, Dec 22, 2016 at 09:24:12AM -0800, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner  wrote:
> > >
> > > This sort of thing is normally indicative of a memory reclaim or
> > > lock contention problem. Profile showed unusual spinlock contention,
> > > but then I realised there was only one kswapd thread running.
> > > Yup, sure enough, it's caused by a major change in memory reclaim
> > > behaviour:
> > >
> > > [0.00] Zone ranges:
> > > [0.00]   DMA  [mem 0x1000-0x00ff]
> > > [0.00]   DMA32[mem 0x0100-0x]
> > > [0.00]   Normal   [mem 0x0001-0x00083fff]
> > > [0.00] Movable zone start for each node
> > > [0.00] Early memory node ranges
> > > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > > [0.00]   node   0: [mem 0x0010-0xbffdefff]
> > > [0.00]   node   0: [mem 0x0001-0x0003bfff]
> > > [0.00]   node   0: [mem 0x0005c000-0x0005]
> > > [0.00]   node   0: [mem 0x0008-0x00083fff]
> > > [0.00] Initmem setup node 0 [mem 
> > > 0x1000-0x00083fff]
> > >
> > > the numa=fake=4 CLI option is broken.
> > 
> > Ok, I think that is independent of anything else. Removing block
> > people and adding the x86 people.
> > 
> > I'm not seeing anything at all that would change the fake numa stuff,
> > but maybe the cpu hotplug changes?
> > 
> > Thomas/Ingo/Peter - Dave is going away for several months, so you
> > won't get feedback from him, but can you look at this? Or maybe point
> > me towards the right people - I'm seeing no possible relevant changes
> > at all fir x85 numa since 4.9, so it must be some indirect breakage.
> > 
> > Dave is using fake-numa to do performance testing in a VM, and it's a
> > big deal for the node optimizations for writeback etc. Do you have any
> > ideas?
> > 
> > Dave, if you're still around, can you send out the kernel config file
> > you used...
> 
> Looking at this fresh this morning (i.e. not pissed off by having
> everything I tried to do fail in different ways all afternoon) I
> found this:
> 
> $ grep NUMA .config
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> # CONFIG_NUMA is not set
> $
> 
> The .config I was using for 4.9 got 'make oldconfig' upgraded, and
> looking at it there's a bunch of stuff that has been turned off that
> I know was set:
> 
> # CONFIG_EXPERT is not set
> # CONFIG_PARAVIRT_SPINLOCKS is not set
> # CONFIG_COMPACTION is not set
> 
> and stuff I never use so don't set was set, like kernel crash dump,
> a bunch of stuff for AMD CPUs, susp/resume and power management
> debug, every partition type and filesystem under the sun was
> selected, heaps of network devices enabled, etc.
> 
> So it looks like the problem has occurred during oldconfig, meaning
> I have no idea exactly WTF I was testing. Rebuilding now with a
> saner config, see what happens.

Better, but still bad. average files/s is not up to 200k files/s,
so still a good 10-15% off where it should be. xfs_repair is back
down to 10-15% off where it should be, too. bulkstat still fires off
a bad page reference count warning, iscsi still panics immediately.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Dave Chinner
On Thu, Dec 22, 2016 at 09:24:12AM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner  wrote:
> >
> > This sort of thing is normally indicative of a memory reclaim or
> > lock contention problem. Profile showed unusual spinlock contention,
> > but then I realised there was only one kswapd thread running.
> > Yup, sure enough, it's caused by a major change in memory reclaim
> > behaviour:
> >
> > [0.00] Zone ranges:
> > [0.00]   DMA  [mem 0x1000-0x00ff]
> > [0.00]   DMA32[mem 0x0100-0x]
> > [0.00]   Normal   [mem 0x0001-0x00083fff]
> > [0.00] Movable zone start for each node
> > [0.00] Early memory node ranges
> > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > [0.00]   node   0: [mem 0x0010-0xbffdefff]
> > [0.00]   node   0: [mem 0x0001-0x0003bfff]
> > [0.00]   node   0: [mem 0x0005c000-0x0005]
> > [0.00]   node   0: [mem 0x0008-0x00083fff]
> > [0.00] Initmem setup node 0 [mem 
> > 0x1000-0x00083fff]
> >
> > the numa=fake=4 CLI option is broken.
> 
> Ok, I think that is independent of anything else. Removing block
> people and adding the x86 people.
> 
> I'm not seeing anything at all that would change the fake numa stuff,
> but maybe the cpu hotplug changes?
> 
> Thomas/Ingo/Peter - Dave is going away for several months, so you
> won't get feedback from him, but can you look at this? Or maybe point
> me towards the right people - I'm seeing no possible relevant changes
> at all fir x85 numa since 4.9, so it must be some indirect breakage.
> 
> Dave is using fake-numa to do performance testing in a VM, and it's a
> big deal for the node optimizations for writeback etc. Do you have any
> ideas?
> 
> Dave, if you're still around, can you send out the kernel config file
> you used...

Looking at this fresh this morning (i.e. not pissed off by having
everything I tried to do fail in different ways all afternoon) I
found this:

$ grep NUMA .config
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
# CONFIG_NUMA is not set
$

The .config I was using for 4.9 got 'make oldconfig' upgraded, and
looking at it there's a bunch of stuff that has been turned off that
I know was set:

# CONFIG_EXPERT is not set
# CONFIG_PARAVIRT_SPINLOCKS is not set
# CONFIG_COMPACTION is not set

and stuff I never use so don't set was set, like kernel crash dump,
a bunch of stuff for AMD CPUs, susp/resume and power management
debug, every partition type and filesystem under the sun was
selected, heaps of network devices enabled, etc.

So it looks like the problem has occurred during oldconfig, meaning
I have no idea exactly WTF I was testing. Rebuilding now with a
saner config, see what happens.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Dave Chinner
On Thu, Dec 22, 2016 at 09:24:12AM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner  wrote:
> >
> > This sort of thing is normally indicative of a memory reclaim or
> > lock contention problem. Profile showed unusual spinlock contention,
> > but then I realised there was only one kswapd thread running.
> > Yup, sure enough, it's caused by a major change in memory reclaim
> > behaviour:
> >
> > [0.00] Zone ranges:
> > [0.00]   DMA  [mem 0x1000-0x00ff]
> > [0.00]   DMA32[mem 0x0100-0x]
> > [0.00]   Normal   [mem 0x0001-0x00083fff]
> > [0.00] Movable zone start for each node
> > [0.00] Early memory node ranges
> > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > [0.00]   node   0: [mem 0x0010-0xbffdefff]
> > [0.00]   node   0: [mem 0x0001-0x0003bfff]
> > [0.00]   node   0: [mem 0x0005c000-0x0005]
> > [0.00]   node   0: [mem 0x0008-0x00083fff]
> > [0.00] Initmem setup node 0 [mem 
> > 0x1000-0x00083fff]
> >
> > the numa=fake=4 CLI option is broken.
> 
> Ok, I think that is independent of anything else. Removing block
> people and adding the x86 people.
> 
> I'm not seeing anything at all that would change the fake numa stuff,
> but maybe the cpu hotplug changes?
> 
> Thomas/Ingo/Peter - Dave is going away for several months, so you
> won't get feedback from him, but can you look at this? Or maybe point
> me towards the right people - I'm seeing no possible relevant changes
> at all fir x85 numa since 4.9, so it must be some indirect breakage.
> 
> Dave is using fake-numa to do performance testing in a VM, and it's a
> big deal for the node optimizations for writeback etc. Do you have any
> ideas?
> 
> Dave, if you're still around, can you send out the kernel config file
> you used...

Looking at this fresh this morning (i.e. not pissed off by having
everything I tried to do fail in different ways all afternoon) I
found this:

$ grep NUMA .config
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
# CONFIG_NUMA is not set
$

The .config I was using for 4.9 got 'make oldconfig' upgraded, and
looking at it there's a bunch of stuff that has been turned off that
I know was set:

# CONFIG_EXPERT is not set
# CONFIG_PARAVIRT_SPINLOCKS is not set
# CONFIG_COMPACTION is not set

and stuff I never use so don't set was set, like kernel crash dump,
a bunch of stuff for AMD CPUs, susp/resume and power management
debug, every partition type and filesystem under the sun was
selected, heaps of network devices enabled, etc.

So it looks like the problem has occurred during oldconfig, meaning
I have no idea exactly WTF I was testing. Rebuilding now with a
saner config, see what happens.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Hugh Dickins
On Wed, 21 Dec 2016, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
> 
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?
> 
> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] [ cut here ]
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > shadow_lru_isolate+0x171/0x220
> 
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
> 
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
> 
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
> 
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
> 
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.

Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
soon followed by NULL pointer deref in list_lru_isolate, one time when
I tried out Sunday's git tree.  Not seen since, I haven't had time to
investigate, just set it aside as something to worry about if it happens
again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
case (I've no XFS and no iscsi), suspect unrelated to his other problems.

Hugh

> 
> That said:
> 
> > And worse, on that last error, the /host/ is now going into meltdown
> > (running 4.7.5) with 32 CPUs all burning down in ACPI code:
> 
> The obvious question here is how much you trust the environment if the
> host ends up also showing problems. Maybe you do end up having hw
> issues pop up too.
> 
> The primary suspect would presumably be the development kernel you're
> testing triggering something, but it has to be asked..
> 
>  Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Thomas Gleixner
On Thu, 22 Dec 2016, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner  wrote:
> >
> > This sort of thing is normally indicative of a memory reclaim or
> > lock contention problem. Profile showed unusual spinlock contention,
> > but then I realised there was only one kswapd thread running.
> > Yup, sure enough, it's caused by a major change in memory reclaim
> > behaviour:
> >
> > [0.00] Zone ranges:
> > [0.00]   DMA  [mem 0x1000-0x00ff]
> > [0.00]   DMA32[mem 0x0100-0x]
> > [0.00]   Normal   [mem 0x0001-0x00083fff]
> > [0.00] Movable zone start for each node
> > [0.00] Early memory node ranges
> > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > [0.00]   node   0: [mem 0x0010-0xbffdefff]
> > [0.00]   node   0: [mem 0x0001-0x0003bfff]
> > [0.00]   node   0: [mem 0x0005c000-0x0005]
> > [0.00]   node   0: [mem 0x0008-0x00083fff]
> > [0.00] Initmem setup node 0 [mem 
> > 0x1000-0x00083fff]
> >
> > the numa=fake=4 CLI option is broken.
> 
> Ok, I think that is independent of anything else. Removing block
> people and adding the x86 people.
> 
> I'm not seeing anything at all that would change the fake numa stuff,
> but maybe the cpu hotplug changes?

Nope. Double checked it for correctness. The cpu hotplug code there is not
involved in setting up kswapd threads. They are created by the memory node
stuff.

> Thomas/Ingo/Peter - Dave is going away for several months, so you
> won't get feedback from him, but can you look at this? Or maybe point
> me towards the right people - I'm seeing no possible relevant changes
> at all fir x85 numa since 4.9, so it must be some indirect breakage.
> 
> Dave is using fake-numa to do performance testing in a VM, and it's a
> big deal for the node optimizations for writeback etc. Do you have any
> ideas?

There was the nodeid -> cpuid mapping stuff, but that was in 4.9 so it's
not a post 4.9 wreckage, but maybe it's related nevertheless.

> Dave, if you're still around, can you send out the kernel config file
> you used...

Config file and kernel command line would be helpful.

Thanks,

tglx


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Hugh Dickins
On Wed, 21 Dec 2016, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
> 
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?
> 
> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] [ cut here ]
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > shadow_lru_isolate+0x171/0x220
> 
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
> 
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
> 
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
> 
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
> 
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.

Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
soon followed by NULL pointer deref in list_lru_isolate, one time when
I tried out Sunday's git tree.  Not seen since, I haven't had time to
investigate, just set it aside as something to worry about if it happens
again.  But it looks like shadow_lru_isolate() has issues beyond Dave's
case (I've no XFS and no iscsi), suspect unrelated to his other problems.

Hugh

> 
> That said:
> 
> > And worse, on that last error, the /host/ is now going into meltdown
> > (running 4.7.5) with 32 CPUs all burning down in ACPI code:
> 
> The obvious question here is how much you trust the environment if the
> host ends up also showing problems. Maybe you do end up having hw
> issues pop up too.
> 
> The primary suspect would presumably be the development kernel you're
> testing triggering something, but it has to be asked..
> 
>  Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Thomas Gleixner
On Thu, 22 Dec 2016, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner  wrote:
> >
> > This sort of thing is normally indicative of a memory reclaim or
> > lock contention problem. Profile showed unusual spinlock contention,
> > but then I realised there was only one kswapd thread running.
> > Yup, sure enough, it's caused by a major change in memory reclaim
> > behaviour:
> >
> > [0.00] Zone ranges:
> > [0.00]   DMA  [mem 0x1000-0x00ff]
> > [0.00]   DMA32[mem 0x0100-0x]
> > [0.00]   Normal   [mem 0x0001-0x00083fff]
> > [0.00] Movable zone start for each node
> > [0.00] Early memory node ranges
> > [0.00]   node   0: [mem 0x1000-0x0009efff]
> > [0.00]   node   0: [mem 0x0010-0xbffdefff]
> > [0.00]   node   0: [mem 0x0001-0x0003bfff]
> > [0.00]   node   0: [mem 0x0005c000-0x0005]
> > [0.00]   node   0: [mem 0x0008-0x00083fff]
> > [0.00] Initmem setup node 0 [mem 
> > 0x1000-0x00083fff]
> >
> > the numa=fake=4 CLI option is broken.
> 
> Ok, I think that is independent of anything else. Removing block
> people and adding the x86 people.
> 
> I'm not seeing anything at all that would change the fake numa stuff,
> but maybe the cpu hotplug changes?

Nope. Double checked it for correctness. The cpu hotplug code there is not
involved in setting up kswapd threads. They are created by the memory node
stuff.

> Thomas/Ingo/Peter - Dave is going away for several months, so you
> won't get feedback from him, but can you look at this? Or maybe point
> me towards the right people - I'm seeing no possible relevant changes
> at all fir x85 numa since 4.9, so it must be some indirect breakage.
> 
> Dave is using fake-numa to do performance testing in a VM, and it's a
> big deal for the node optimizations for writeback etc. Do you have any
> ideas?

There was the nodeid -> cpuid mapping stuff, but that was in 4.9 so it's
not a post 4.9 wreckage, but maybe it's related nevertheless.

> Dave, if you're still around, can you send out the kernel config file
> you used...

Config file and kernel command line would be helpful.

Thanks,

tglx


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Chris Leech
On Thu, Dec 22, 2016 at 05:50:12PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
> > >
> > > There may be deeper issues. I just started running scalability tests
> > > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > > directory corruption reported - something I hadn't seen in the dev
> > > cycle at all.
> > 
> > By "in the dev cycle", do you mean your XFS changes, or have you been
> > tracking the merge cycle at least for some testing?
> 
> I mean the three months leading up to the 4.10 merge, when all the
> XFS changes were being tested against 4.9-rc kernels.
> 
> The iscsi problem showed up when I updated the base kernel from
> 4.9 to 4.10-current last week to test the pullreq I was going to
> send you. I've been bust with other stuff until now, so I didn't
> upgrade my working trees again until today in the hope the iscsi
> problem had already been found and fixed.
> 
> > > I unmounted the fs, mkfs'd it again, ran the
> > > workload again and about a minute in this fired:
> > >
> > > [628867.607417] [ cut here ]
> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > shadow_lru_isolate+0x171/0x220
> > 
> > Well, part of the changes during the merge window were the shadow
> > entry tracking changes that came in through Andrew's tree. Adding
> > Johannes Weiner to the participants.
> > 
> > > Now, this workload does not touch the page cache at all - it's
> > > entirely an XFS metadata workload, so it should not really be
> > > affecting the working set code.
> > 
> > Well, I suspect that anything that creates memory pressure will end up
> > triggering the working set code, so ..
> > 
> > That said, obviously memory corruption could be involved and result in
> > random issues too, but I wouldn't really expect that in this code.
> > 
> > It would probably be really useful to get more data points - is the
> > problem reliably in this area, or is it going to be random and all
> > over the place.
> 
> The iscsi problem is 100% reproducable. create a pair of iscsi luns,
> mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
> the filesystems.
> 
> The test machine I'm having all these other problems on? stable and
> steady as a rock using PMEM devices. Moment I go to use /dev/vdc
> (i.e. run load/perf benchmarks) it starts falling over left, right
> and center.

I'm not reproducing any problems with xfstests running over iscsi_tcp
right now.  Two 10G luns exported from an LIO target, attached directly
to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
TEST_DEV and SCRATCH_DEV.

The virtio scatterlist issue that popped right away for me is triggered
by an hdparm ioctl, which is being run by tuned on Fedora.  And that
actually seems to happen back on 4.9 as well :(

Chris



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Chris Leech
On Thu, Dec 22, 2016 at 05:50:12PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
> > >
> > > There may be deeper issues. I just started running scalability tests
> > > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > > directory corruption reported - something I hadn't seen in the dev
> > > cycle at all.
> > 
> > By "in the dev cycle", do you mean your XFS changes, or have you been
> > tracking the merge cycle at least for some testing?
> 
> I mean the three months leading up to the 4.10 merge, when all the
> XFS changes were being tested against 4.9-rc kernels.
> 
> The iscsi problem showed up when I updated the base kernel from
> 4.9 to 4.10-current last week to test the pullreq I was going to
> send you. I've been bust with other stuff until now, so I didn't
> upgrade my working trees again until today in the hope the iscsi
> problem had already been found and fixed.
> 
> > > I unmounted the fs, mkfs'd it again, ran the
> > > workload again and about a minute in this fired:
> > >
> > > [628867.607417] [ cut here ]
> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > > shadow_lru_isolate+0x171/0x220
> > 
> > Well, part of the changes during the merge window were the shadow
> > entry tracking changes that came in through Andrew's tree. Adding
> > Johannes Weiner to the participants.
> > 
> > > Now, this workload does not touch the page cache at all - it's
> > > entirely an XFS metadata workload, so it should not really be
> > > affecting the working set code.
> > 
> > Well, I suspect that anything that creates memory pressure will end up
> > triggering the working set code, so ..
> > 
> > That said, obviously memory corruption could be involved and result in
> > random issues too, but I wouldn't really expect that in this code.
> > 
> > It would probably be really useful to get more data points - is the
> > problem reliably in this area, or is it going to be random and all
> > over the place.
> 
> The iscsi problem is 100% reproducable. create a pair of iscsi luns,
> mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
> the filesystems.
> 
> The test machine I'm having all these other problems on? stable and
> steady as a rock using PMEM devices. Moment I go to use /dev/vdc
> (i.e. run load/perf benchmarks) it starts falling over left, right
> and center.

I'm not reproducing any problems with xfstests running over iscsi_tcp
right now.  Two 10G luns exported from an LIO target, attached directly
to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
TEST_DEV and SCRATCH_DEV.

The virtio scatterlist issue that popped right away for me is triggered
by an hdparm ioctl, which is being run by tuned on Fedora.  And that
actually seems to happen back on 4.9 as well :(

Chris



Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Linus Torvalds
On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner  wrote:
>
> This sort of thing is normally indicative of a memory reclaim or
> lock contention problem. Profile showed unusual spinlock contention,
> but then I realised there was only one kswapd thread running.
> Yup, sure enough, it's caused by a major change in memory reclaim
> behaviour:
>
> [0.00] Zone ranges:
> [0.00]   DMA  [mem 0x1000-0x00ff]
> [0.00]   DMA32[mem 0x0100-0x]
> [0.00]   Normal   [mem 0x0001-0x00083fff]
> [0.00] Movable zone start for each node
> [0.00] Early memory node ranges
> [0.00]   node   0: [mem 0x1000-0x0009efff]
> [0.00]   node   0: [mem 0x0010-0xbffdefff]
> [0.00]   node   0: [mem 0x0001-0x0003bfff]
> [0.00]   node   0: [mem 0x0005c000-0x0005]
> [0.00]   node   0: [mem 0x0008-0x00083fff]
> [0.00] Initmem setup node 0 [mem 
> 0x1000-0x00083fff]
>
> the numa=fake=4 CLI option is broken.

Ok, I think that is independent of anything else. Removing block
people and adding the x86 people.

I'm not seeing anything at all that would change the fake numa stuff,
but maybe the cpu hotplug changes?

Thomas/Ingo/Peter - Dave is going away for several months, so you
won't get feedback from him, but can you look at this? Or maybe point
me towards the right people - I'm seeing no possible relevant changes
at all fir x85 numa since 4.9, so it must be some indirect breakage.

Dave is using fake-numa to do performance testing in a VM, and it's a
big deal for the node optimizations for writeback etc. Do you have any
ideas?

Dave, if you're still around, can you send out the kernel config file
you used...

   Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-22 Thread Linus Torvalds
On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner  wrote:
>
> This sort of thing is normally indicative of a memory reclaim or
> lock contention problem. Profile showed unusual spinlock contention,
> but then I realised there was only one kswapd thread running.
> Yup, sure enough, it's caused by a major change in memory reclaim
> behaviour:
>
> [0.00] Zone ranges:
> [0.00]   DMA  [mem 0x1000-0x00ff]
> [0.00]   DMA32[mem 0x0100-0x]
> [0.00]   Normal   [mem 0x0001-0x00083fff]
> [0.00] Movable zone start for each node
> [0.00] Early memory node ranges
> [0.00]   node   0: [mem 0x1000-0x0009efff]
> [0.00]   node   0: [mem 0x0010-0xbffdefff]
> [0.00]   node   0: [mem 0x0001-0x0003bfff]
> [0.00]   node   0: [mem 0x0005c000-0x0005]
> [0.00]   node   0: [mem 0x0008-0x00083fff]
> [0.00] Initmem setup node 0 [mem 
> 0x1000-0x00083fff]
>
> the numa=fake=4 CLI option is broken.

Ok, I think that is independent of anything else. Removing block
people and adding the x86 people.

I'm not seeing anything at all that would change the fake numa stuff,
but maybe the cpu hotplug changes?

Thomas/Ingo/Peter - Dave is going away for several months, so you
won't get feedback from him, but can you look at this? Or maybe point
me towards the right people - I'm seeing no possible relevant changes
at all fir x85 numa since 4.9, so it must be some indirect breakage.

Dave is using fake-numa to do performance testing in a VM, and it's a
big deal for the node optimizations for writeback etc. Do you have any
ideas?

Dave, if you're still around, can you send out the kernel config file
you used...

   Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
> 
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?

I mean the three months leading up to the 4.10 merge, when all the
XFS changes were being tested against 4.9-rc kernels.

The iscsi problem showed up when I updated the base kernel from
4.9 to 4.10-current last week to test the pullreq I was going to
send you. I've been bust with other stuff until now, so I didn't
upgrade my working trees again until today in the hope the iscsi
problem had already been found and fixed.

> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] [ cut here ]
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > shadow_lru_isolate+0x171/0x220
> 
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
> 
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
> 
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
> 
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
> 
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.

The iscsi problem is 100% reproducable. create a pair of iscsi luns,
mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
the filesystems.

The test machine I'm having all these other problems on? stable and
steady as a rock using PMEM devices. Moment I go to use /dev/vdc
(i.e. run load/perf benchmarks) it starts falling over left, right
and center.

And I just smacked into this in the bulkstat phase of the benchmark
(mkfs, fsmark, xfs_repair, mount, bulkstat, find, grep, rm):

[ 2729.750563] BUG: Bad page state in process bstat  pfn:14945
[ 2729.751863] page:ea525140 count:-1 mapcount:0 mapping:  
(null) index:0x0
[ 2729.753763] flags: 0x4000()
[ 2729.754671] raw: 4000   

[ 2729.756469] raw: dead0100 dead0200  

[ 2729.758276] page dumped because: nonzero _refcount
[ 2729.759393] Modules linked in:
[ 2729.760137] CPU: 7 PID: 25902 Comm: bstat Tainted: GB   
4.9.0-dgc #18
[ 2729.761888] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[ 2729.763943] Call Trace:
[ 2729.764523]  
[ 2729.765004]  dump_stack+0x63/0x83
[ 2729.765784]  bad_page+0xc4/0x130
[ 2729.766552]  free_pages_check_bad+0x4f/0x70
[ 2729.767531]  free_pcppages_bulk+0x3c5/0x3d0
[ 2729.768513]  ? page_alloc_cpu_dead+0x30/0x30
[ 2729.769510]  drain_pages_zone+0x41/0x60
[ 2729.770417]  drain_pages+0x3e/0x60
[ 2729.771215]  drain_local_pages+0x24/0x30
[ 2729.772138]  flush_smp_call_function_queue+0x88/0x160
[ 2729.773317]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 2729.774742]  smp_call_function_single_interrupt+0x27/0x40
[ 2729.776000]  smp_call_function_interrupt+0xe/0x10
[ 2729.777102]  call_function_interrupt+0x8e/0xa0
[ 2729.778147] RIP: 0010:delay_tsc+0x41/0x90
[ 2729.779085] RSP: 0018:c9000f0cf500 EFLAGS: 0202 ORIG_RAX: 
ff03
[ 2729.780852] RAX: 77541291 RBX: 88008b5efe40 RCX: 002e
[ 2729.782514] RDX: 0577 RSI: 05541291 RDI: 0001
[ 2729.784167] RBP: c9000f0cf500 R08: 0007 R09: c9000f0cf678
[ 2729.785818] R10: 0006 R11: 1000 R12: 0061
[ 2729.787480] R13: 0001 R14: 83214e30 R15: 0080
[ 2729.789124]  
[ 2729.789626]  __delay+0xf/0x20
[ 2729.790333]  do_raw_spin_lock+0x8c/0x160
[ 2729.791255]  _raw_spin_lock+0x15/0x20
[ 2729.792112]  list_lru_add+0x1a/0x70
[ 2729.792932]  xfs_buf_rele+0x3e7/0x410
[ 2729.793792]  xfs_buftarg_shrink_scan+0x6b/0x80
[ 2729.794841]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[ 2729.796099]  shrink_node+0x57/0x90
[ 2729.796905]  do_try_to_free_pages+0xdd/0x230
[ 2729.797914]  try_to_free_pages+0xce/0x1a0
[ 2729.798852]  __alloc_pages_slowpath+0x2df/0x960
[ 2729.799908]  __alloc_pages_nodemask+0x24b/0x290
[ 2729.800963]  new_slab+0x2ac/0x380
[ 2729.801743]  ___slab_alloc.constprop.82+0x336/0x440
[ 

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
> 
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?

I mean the three months leading up to the 4.10 merge, when all the
XFS changes were being tested against 4.9-rc kernels.

The iscsi problem showed up when I updated the base kernel from
4.9 to 4.10-current last week to test the pullreq I was going to
send you. I've been bust with other stuff until now, so I didn't
upgrade my working trees again until today in the hope the iscsi
problem had already been found and fixed.

> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] [ cut here ]
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> > shadow_lru_isolate+0x171/0x220
> 
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
> 
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
> 
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
> 
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
> 
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.

The iscsi problem is 100% reproducable. create a pair of iscsi luns,
mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
the filesystems.

The test machine I'm having all these other problems on? stable and
steady as a rock using PMEM devices. Moment I go to use /dev/vdc
(i.e. run load/perf benchmarks) it starts falling over left, right
and center.

And I just smacked into this in the bulkstat phase of the benchmark
(mkfs, fsmark, xfs_repair, mount, bulkstat, find, grep, rm):

[ 2729.750563] BUG: Bad page state in process bstat  pfn:14945
[ 2729.751863] page:ea525140 count:-1 mapcount:0 mapping:  
(null) index:0x0
[ 2729.753763] flags: 0x4000()
[ 2729.754671] raw: 4000   

[ 2729.756469] raw: dead0100 dead0200  

[ 2729.758276] page dumped because: nonzero _refcount
[ 2729.759393] Modules linked in:
[ 2729.760137] CPU: 7 PID: 25902 Comm: bstat Tainted: GB   
4.9.0-dgc #18
[ 2729.761888] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[ 2729.763943] Call Trace:
[ 2729.764523]  
[ 2729.765004]  dump_stack+0x63/0x83
[ 2729.765784]  bad_page+0xc4/0x130
[ 2729.766552]  free_pages_check_bad+0x4f/0x70
[ 2729.767531]  free_pcppages_bulk+0x3c5/0x3d0
[ 2729.768513]  ? page_alloc_cpu_dead+0x30/0x30
[ 2729.769510]  drain_pages_zone+0x41/0x60
[ 2729.770417]  drain_pages+0x3e/0x60
[ 2729.771215]  drain_local_pages+0x24/0x30
[ 2729.772138]  flush_smp_call_function_queue+0x88/0x160
[ 2729.773317]  generic_smp_call_function_single_interrupt+0x13/0x30
[ 2729.774742]  smp_call_function_single_interrupt+0x27/0x40
[ 2729.776000]  smp_call_function_interrupt+0xe/0x10
[ 2729.777102]  call_function_interrupt+0x8e/0xa0
[ 2729.778147] RIP: 0010:delay_tsc+0x41/0x90
[ 2729.779085] RSP: 0018:c9000f0cf500 EFLAGS: 0202 ORIG_RAX: 
ff03
[ 2729.780852] RAX: 77541291 RBX: 88008b5efe40 RCX: 002e
[ 2729.782514] RDX: 0577 RSI: 05541291 RDI: 0001
[ 2729.784167] RBP: c9000f0cf500 R08: 0007 R09: c9000f0cf678
[ 2729.785818] R10: 0006 R11: 1000 R12: 0061
[ 2729.787480] R13: 0001 R14: 83214e30 R15: 0080
[ 2729.789124]  
[ 2729.789626]  __delay+0xf/0x20
[ 2729.790333]  do_raw_spin_lock+0x8c/0x160
[ 2729.791255]  _raw_spin_lock+0x15/0x20
[ 2729.792112]  list_lru_add+0x1a/0x70
[ 2729.792932]  xfs_buf_rele+0x3e7/0x410
[ 2729.793792]  xfs_buftarg_shrink_scan+0x6b/0x80
[ 2729.794841]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[ 2729.796099]  shrink_node+0x57/0x90
[ 2729.796905]  do_try_to_free_pages+0xdd/0x230
[ 2729.797914]  try_to_free_pages+0xce/0x1a0
[ 2729.798852]  __alloc_pages_slowpath+0x2df/0x960
[ 2729.799908]  __alloc_pages_nodemask+0x24b/0x290
[ 2729.800963]  new_slab+0x2ac/0x380
[ 2729.801743]  ___slab_alloc.constprop.82+0x336/0x440
[ 2729.802890]  ? 

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Christoph Hellwig
On Thu, Dec 22, 2016 at 05:30:46PM +1100, Dave Chinner wrote:
> > For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> > so it will be ignored anyway.  That being said both I and the lists
> > got CCed halfway through the thread and I haven't seen the original
> > report, so I'm not really sure what's going on here anyway.
> 
> http://www.gossamer-threads.com/lists/linux/kernel/2587485

This doesn't look like the discard changes, but if Chris wants to test
without them f9d03f96b988 reverts cleanly.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Christoph Hellwig
On Thu, Dec 22, 2016 at 05:30:46PM +1100, Dave Chinner wrote:
> > For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> > so it will be ignored anyway.  That being said both I and the lists
> > got CCed halfway through the thread and I haven't seen the original
> > report, so I'm not really sure what's going on here anyway.
> 
> http://www.gossamer-threads.com/lists/linux/kernel/2587485

This doesn't look like the discard changes, but if Chris wants to test
without them f9d03f96b988 reverts cleanly.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Thu, Dec 22, 2016 at 07:18:27AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Looking around a bit, the only even halfway suspicious scatterlist
> > initialization thing I see is commit f9d03f96b988 ("block: improve
> > handling of the magic discard payload") which used to have a magic
> > hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> > now it does __blk_bvec_map_sg() instead.
> 
> But that check was only for discard (and discard-like) bios which
> had the maic single page that sometimes was unused attached.
> 
> For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> so it will be ignored anyway.  That being said both I and the lists
> got CCed halfway through the thread and I haven't seen the original
> report, so I'm not really sure what's going on here anyway.

http://www.gossamer-threads.com/lists/linux/kernel/2587485

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Thu, Dec 22, 2016 at 07:18:27AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Looking around a bit, the only even halfway suspicious scatterlist
> > initialization thing I see is commit f9d03f96b988 ("block: improve
> > handling of the magic discard payload") which used to have a magic
> > hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> > now it does __blk_bvec_map_sg() instead.
> 
> But that check was only for discard (and discard-like) bios which
> had the maic single page that sometimes was unused attached.
> 
> For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> so it will be ignored anyway.  That being said both I and the lists
> got CCed halfway through the thread and I haven't seen the original
> report, so I'm not really sure what's going on here anyway.

http://www.gossamer-threads.com/lists/linux/kernel/2587485

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Thu, Dec 22, 2016 at 04:13:22PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> > On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > > Hi,
> > > 
> > > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner  wrote:
> > > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > > >> Thanks Dave,
> > > >>
> > > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > > >> modules loaded (virtio block) so there's something else going on in the
> > > >> current merge window.  I'll keep an eye on it and make sure there's
> > > >> nothing iSCSI needs fixing for.
> > > >
> > > > OK, so before this slips through the cracks.
> > > >
> > > > Linus - your tree as of a few minutes ago still panics immediately
> > > > when starting xfstests on iscsi devices. It appears to be a
> > > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > > seem to have bounced it and no-one is looking at it.
> > > 
> > > Hmm. There's not much to go by.
> > > 
> > > Can somebody in iscsi-land please try to just bisect it - I'm not
> > > seeing a lot of clues to where this comes from otherwise.
> > 
> > Yeah, my hopes of this being quickly resolved by someone else didn't
> > work out and whatever is going on in that test VM is looking like a
> > different kind of odd.  I'm saving that off for later, and seeing if I
> > can't be a bisect on the iSCSI issue.
> 
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all. I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
> 
> [628867.607417] [ cut here ]
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> shadow_lru_isolate+0x171/0x220
> [628867.610702] Modules linked in:
> [628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: GW  
>  4.9.0-dgc #18
> [628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Debian-1.8.2-1 04/01/2014
> [628867.616179] Workqueue: events rht_deferred_worker
> [628867.632422] Call Trace:
> [628867.634691]  dump_stack+0x63/0x83
> [628867.637937]  __warn+0xcb/0xf0
> [628867.641359]  warn_slowpath_null+0x1d/0x20
> [628867.643362]  shadow_lru_isolate+0x171/0x220
> [628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
> [628867.645780]  ? __list_lru_init+0x70/0x70
> [628867.646628]  list_lru_walk_one+0x17/0x20
> [628867.647488]  scan_shadow_nodes+0x34/0x50
> [628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
> [628867.649506]  shrink_node+0x57/0x90
> [628867.650233]  do_try_to_free_pages+0xdd/0x230
> [628867.651157]  try_to_free_pages+0xce/0x1a0
> [628867.652342]  __alloc_pages_slowpath+0x2df/0x960
> [628867.653332]  ? __might_sleep+0x4a/0x80
> [628867.654148]  __alloc_pages_nodemask+0x24b/0x290
> [628867.655237]  kmalloc_order+0x21/0x50
> [628867.656016]  kmalloc_order_trace+0x24/0xc0
> [628867.656878]  __kmalloc+0x17d/0x1d0
> [628867.657644]  bucket_table_alloc+0x195/0x1d0
> [628867.658564]  ? __might_sleep+0x4a/0x80
> [628867.659449]  rht_deferred_worker+0x287/0x3c0
> [628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
> [628867.661294]  process_one_work+0x1de/0x4d0
> [628867.662208]  worker_thread+0x4b/0x4f0
> [628867.662990]  kthread+0x10c/0x140
> [628867.663687]  ? process_one_work+0x4d0/0x4d0
> [628867.664564]  ? kthread_create_on_node+0x40/0x40
> [628867.665523]  ret_from_fork+0x25/0x30
> [628867.666317] ---[ end trace 7c38634006a9955e ]---
> 
> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

The system back up, and I haven't reproduced this problem yet.
However, benchmark results are way off where they should be, and at
times the performance is utterly abysmal. The XFS for-next tree
based on the 4.9 kernel shows none of these problems, so I don't
think there's an XFS problem here. Workload is the same 16-way
fsmark workload that I've been using for years as a performance
regression test.

The workload normally averages around 230k files/s - i'm seeing
and average of ~175k files/s on you current kernel. And there are
periods where performance just completely tanks:

#  ./fs_mark  -D  1  -S0  -n  10  -s  0  -L  32  -d  /mnt/scratch/0  -d 
 /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d 
 /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7  -d  /mnt/scratch/8  -d 
 /mnt/scratch/9  -d  /mnt/scratch/10  -d  /mnt/scratch/11  -d  /mnt/scratch/12  
-d  /mnt/scratch/13  -d  /mnt/scratch/14  -d  /mnt/scratch/15
#   Version 3.3, 16 thread(s) starting at Thu Dec 22 16:29:20 2016
#   Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#   Directories:  Time based hash 

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Thu, Dec 22, 2016 at 04:13:22PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> > On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > > Hi,
> > > 
> > > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner  wrote:
> > > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > > >> Thanks Dave,
> > > >>
> > > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > > >> modules loaded (virtio block) so there's something else going on in the
> > > >> current merge window.  I'll keep an eye on it and make sure there's
> > > >> nothing iSCSI needs fixing for.
> > > >
> > > > OK, so before this slips through the cracks.
> > > >
> > > > Linus - your tree as of a few minutes ago still panics immediately
> > > > when starting xfstests on iscsi devices. It appears to be a
> > > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > > seem to have bounced it and no-one is looking at it.
> > > 
> > > Hmm. There's not much to go by.
> > > 
> > > Can somebody in iscsi-land please try to just bisect it - I'm not
> > > seeing a lot of clues to where this comes from otherwise.
> > 
> > Yeah, my hopes of this being quickly resolved by someone else didn't
> > work out and whatever is going on in that test VM is looking like a
> > different kind of odd.  I'm saving that off for later, and seeing if I
> > can't be a bisect on the iSCSI issue.
> 
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all. I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
> 
> [628867.607417] [ cut here ]
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> shadow_lru_isolate+0x171/0x220
> [628867.610702] Modules linked in:
> [628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: GW  
>  4.9.0-dgc #18
> [628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Debian-1.8.2-1 04/01/2014
> [628867.616179] Workqueue: events rht_deferred_worker
> [628867.632422] Call Trace:
> [628867.634691]  dump_stack+0x63/0x83
> [628867.637937]  __warn+0xcb/0xf0
> [628867.641359]  warn_slowpath_null+0x1d/0x20
> [628867.643362]  shadow_lru_isolate+0x171/0x220
> [628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
> [628867.645780]  ? __list_lru_init+0x70/0x70
> [628867.646628]  list_lru_walk_one+0x17/0x20
> [628867.647488]  scan_shadow_nodes+0x34/0x50
> [628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
> [628867.649506]  shrink_node+0x57/0x90
> [628867.650233]  do_try_to_free_pages+0xdd/0x230
> [628867.651157]  try_to_free_pages+0xce/0x1a0
> [628867.652342]  __alloc_pages_slowpath+0x2df/0x960
> [628867.653332]  ? __might_sleep+0x4a/0x80
> [628867.654148]  __alloc_pages_nodemask+0x24b/0x290
> [628867.655237]  kmalloc_order+0x21/0x50
> [628867.656016]  kmalloc_order_trace+0x24/0xc0
> [628867.656878]  __kmalloc+0x17d/0x1d0
> [628867.657644]  bucket_table_alloc+0x195/0x1d0
> [628867.658564]  ? __might_sleep+0x4a/0x80
> [628867.659449]  rht_deferred_worker+0x287/0x3c0
> [628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
> [628867.661294]  process_one_work+0x1de/0x4d0
> [628867.662208]  worker_thread+0x4b/0x4f0
> [628867.662990]  kthread+0x10c/0x140
> [628867.663687]  ? process_one_work+0x4d0/0x4d0
> [628867.664564]  ? kthread_create_on_node+0x40/0x40
> [628867.665523]  ret_from_fork+0x25/0x30
> [628867.666317] ---[ end trace 7c38634006a9955e ]---
> 
> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

The system back up, and I haven't reproduced this problem yet.
However, benchmark results are way off where they should be, and at
times the performance is utterly abysmal. The XFS for-next tree
based on the 4.9 kernel shows none of these problems, so I don't
think there's an XFS problem here. Workload is the same 16-way
fsmark workload that I've been using for years as a performance
regression test.

The workload normally averages around 230k files/s - i'm seeing
and average of ~175k files/s on you current kernel. And there are
periods where performance just completely tanks:

#  ./fs_mark  -D  1  -S0  -n  10  -s  0  -L  32  -d  /mnt/scratch/0  -d 
 /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d 
 /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7  -d  /mnt/scratch/8  -d 
 /mnt/scratch/9  -d  /mnt/scratch/10  -d  /mnt/scratch/11  -d  /mnt/scratch/12  
-d  /mnt/scratch/13  -d  /mnt/scratch/14  -d  /mnt/scratch/15
#   Version 3.3, 16 thread(s) starting at Thu Dec 22 16:29:20 2016
#   Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
#   Directories:  Time based hash between directories 

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Christoph Hellwig
On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Looking around a bit, the only even halfway suspicious scatterlist
> initialization thing I see is commit f9d03f96b988 ("block: improve
> handling of the magic discard payload") which used to have a magic
> hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> now it does __blk_bvec_map_sg() instead.

But that check was only for discard (and discard-like) bios which
had the maic single page that sometimes was unused attached.

For "normal" bios the for_each_segment loop iterates over bi_vcnt,
so it will be ignored anyway.  That being said both I and the lists
got CCed halfway through the thread and I haven't seen the original
report, so I'm not really sure what's going on here anyway.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Christoph Hellwig
On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Looking around a bit, the only even halfway suspicious scatterlist
> initialization thing I see is commit f9d03f96b988 ("block: improve
> handling of the magic discard payload") which used to have a magic
> hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> now it does __blk_bvec_map_sg() instead.

But that check was only for discard (and discard-like) bios which
had the maic single page that sometimes was unused attached.

For "normal" bios the for_each_segment loop iterates over bi_vcnt,
so it will be ignored anyway.  That being said both I and the lists
got CCed halfway through the thread and I haven't seen the original
report, so I'm not really sure what's going on here anyway.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Linus Torvalds
On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
>
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all.

By "in the dev cycle", do you mean your XFS changes, or have you been
tracking the merge cycle at least for some testing?

> I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
>
> [628867.607417] [ cut here ]
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> shadow_lru_isolate+0x171/0x220

Well, part of the changes during the merge window were the shadow
entry tracking changes that came in through Andrew's tree. Adding
Johannes Weiner to the participants.

> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

Well, I suspect that anything that creates memory pressure will end up
triggering the working set code, so ..

That said, obviously memory corruption could be involved and result in
random issues too, but I wouldn't really expect that in this code.

It would probably be really useful to get more data points - is the
problem reliably in this area, or is it going to be random and all
over the place.

That said:

> And worse, on that last error, the /host/ is now going into meltdown
> (running 4.7.5) with 32 CPUs all burning down in ACPI code:

The obvious question here is how much you trust the environment if the
host ends up also showing problems. Maybe you do end up having hw
issues pop up too.

The primary suspect would presumably be the development kernel you're
testing triggering something, but it has to be asked..

 Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Linus Torvalds
On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner  wrote:
>
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all.

By "in the dev cycle", do you mean your XFS changes, or have you been
tracking the merge cycle at least for some testing?

> I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
>
> [628867.607417] [ cut here ]
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
> shadow_lru_isolate+0x171/0x220

Well, part of the changes during the merge window were the shadow
entry tracking changes that came in through Andrew's tree. Adding
Johannes Weiner to the participants.

> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.

Well, I suspect that anything that creates memory pressure will end up
triggering the working set code, so ..

That said, obviously memory corruption could be involved and result in
random issues too, but I wouldn't really expect that in this code.

It would probably be really useful to get more data points - is the
problem reliably in this area, or is it going to be random and all
over the place.

That said:

> And worse, on that last error, the /host/ is now going into meltdown
> (running 4.7.5) with 32 CPUs all burning down in ACPI code:

The obvious question here is how much you trust the environment if the
host ends up also showing problems. Maybe you do end up having hw
issues pop up too.

The primary suspect would presumably be the development kernel you're
testing triggering something, but it has to be asked..

 Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Hi,
> > 
> > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner  wrote:
> > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > >> Thanks Dave,
> > >>
> > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > >> modules loaded (virtio block) so there's something else going on in the
> > >> current merge window.  I'll keep an eye on it and make sure there's
> > >> nothing iSCSI needs fixing for.
> > >
> > > OK, so before this slips through the cracks.
> > >
> > > Linus - your tree as of a few minutes ago still panics immediately
> > > when starting xfstests on iscsi devices. It appears to be a
> > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > seem to have bounced it and no-one is looking at it.
> > 
> > Hmm. There's not much to go by.
> > 
> > Can somebody in iscsi-land please try to just bisect it - I'm not
> > seeing a lot of clues to where this comes from otherwise.
> 
> Yeah, my hopes of this being quickly resolved by someone else didn't
> work out and whatever is going on in that test VM is looking like a
> different kind of odd.  I'm saving that off for later, and seeing if I
> can't be a bisect on the iSCSI issue.

There may be deeper issues. I just started running scalability tests
(e.g. 16-way fsmark create tests) and about a minute in I got a
directory corruption reported - something I hadn't seen in the dev
cycle at all. I unmounted the fs, mkfs'd it again, ran the
workload again and about a minute in this fired:

[628867.607417] [ cut here ]
[628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
shadow_lru_isolate+0x171/0x220
[628867.610702] Modules linked in:
[628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: GW   
4.9.0-dgc #18
[628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[628867.616179] Workqueue: events rht_deferred_worker
[628867.632422] Call Trace:
[628867.634691]  dump_stack+0x63/0x83
[628867.637937]  __warn+0xcb/0xf0
[628867.641359]  warn_slowpath_null+0x1d/0x20
[628867.643362]  shadow_lru_isolate+0x171/0x220
[628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
[628867.645780]  ? __list_lru_init+0x70/0x70
[628867.646628]  list_lru_walk_one+0x17/0x20
[628867.647488]  scan_shadow_nodes+0x34/0x50
[628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[628867.649506]  shrink_node+0x57/0x90
[628867.650233]  do_try_to_free_pages+0xdd/0x230
[628867.651157]  try_to_free_pages+0xce/0x1a0
[628867.652342]  __alloc_pages_slowpath+0x2df/0x960
[628867.653332]  ? __might_sleep+0x4a/0x80
[628867.654148]  __alloc_pages_nodemask+0x24b/0x290
[628867.655237]  kmalloc_order+0x21/0x50
[628867.656016]  kmalloc_order_trace+0x24/0xc0
[628867.656878]  __kmalloc+0x17d/0x1d0
[628867.657644]  bucket_table_alloc+0x195/0x1d0
[628867.658564]  ? __might_sleep+0x4a/0x80
[628867.659449]  rht_deferred_worker+0x287/0x3c0
[628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
[628867.661294]  process_one_work+0x1de/0x4d0
[628867.662208]  worker_thread+0x4b/0x4f0
[628867.662990]  kthread+0x10c/0x140
[628867.663687]  ? process_one_work+0x4d0/0x4d0
[628867.664564]  ? kthread_create_on_node+0x40/0x40
[628867.665523]  ret_from_fork+0x25/0x30
[628867.666317] ---[ end trace 7c38634006a9955e ]---

Now, this workload does not touch the page cache at all - it's
entirely an XFS metadata workload, so it should not really be
affecting the working set code.

And worse, on that last error, the /host/ is now going into meltdown
(running 4.7.5) with 32 CPUs all burning down in ACPI code:

  PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
35074 root  -2   0   0  0  0 R  99.0  0.0  12:38.92 acpi_pad/12
35079 root  -2   0   0  0  0 R  99.0  0.0  12:39.40 acpi_pad/16
35080 root  -2   0   0  0  0 R  99.0  0.0  12:39.29 acpi_pad/17
35085 root  -2   0   0  0  0 R  99.0  0.0  12:39.35 acpi_pad/22
35087 root  -2   0   0  0  0 R  99.0  0.0  12:39.13 acpi_pad/24
35090 root  -2   0   0  0  0 R  99.0  0.0  12:38.89 acpi_pad/27
35093 root  -2   0   0  0  0 R  99.0  0.0  12:38.88 acpi_pad/30
35063 root  -2   0   0  0  0 R  98.1  0.0  12:40.64 acpi_pad/1
35065 root  -2   0   0  0  0 R  98.1  0.0  12:40.38 acpi_pad/3
35066 root  -2   0   0  0  0 R  98.1  0.0  12:40.30 acpi_pad/4
35067 root  -2   0   0  0  0 R  98.1  0.0  12:40.82 acpi_pad/5
35077 root  -2   0   0  0  0 R  98.1  0.0  12:39.65 acpi_pad/14
35078 root  -2   0   0  0  0 R  98.1  0.0  12:39.58 acpi_pad/15
35081 root  -2   0   0  0  0 R  98.1  0.0  12:39.32 acpi_pad/18
35072 root  -2   0   0  0  0 R  96.2  

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Hi,
> > 
> > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner  wrote:
> > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > >> Thanks Dave,
> > >>
> > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > >> modules loaded (virtio block) so there's something else going on in the
> > >> current merge window.  I'll keep an eye on it and make sure there's
> > >> nothing iSCSI needs fixing for.
> > >
> > > OK, so before this slips through the cracks.
> > >
> > > Linus - your tree as of a few minutes ago still panics immediately
> > > when starting xfstests on iscsi devices. It appears to be a
> > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > seem to have bounced it and no-one is looking at it.
> > 
> > Hmm. There's not much to go by.
> > 
> > Can somebody in iscsi-land please try to just bisect it - I'm not
> > seeing a lot of clues to where this comes from otherwise.
> 
> Yeah, my hopes of this being quickly resolved by someone else didn't
> work out and whatever is going on in that test VM is looking like a
> different kind of odd.  I'm saving that off for later, and seeing if I
> can't be a bisect on the iSCSI issue.

There may be deeper issues. I just started running scalability tests
(e.g. 16-way fsmark create tests) and about a minute in I got a
directory corruption reported - something I hadn't seen in the dev
cycle at all. I unmounted the fs, mkfs'd it again, ran the
workload again and about a minute in this fired:

[628867.607417] [ cut here ]
[628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 
shadow_lru_isolate+0x171/0x220
[628867.610702] Modules linked in:
[628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: GW   
4.9.0-dgc #18
[628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
Debian-1.8.2-1 04/01/2014
[628867.616179] Workqueue: events rht_deferred_worker
[628867.632422] Call Trace:
[628867.634691]  dump_stack+0x63/0x83
[628867.637937]  __warn+0xcb/0xf0
[628867.641359]  warn_slowpath_null+0x1d/0x20
[628867.643362]  shadow_lru_isolate+0x171/0x220
[628867.644627]  __list_lru_walk_one.isra.11+0x79/0x110
[628867.645780]  ? __list_lru_init+0x70/0x70
[628867.646628]  list_lru_walk_one+0x17/0x20
[628867.647488]  scan_shadow_nodes+0x34/0x50
[628867.648358]  shrink_slab.part.65.constprop.86+0x1dc/0x410
[628867.649506]  shrink_node+0x57/0x90
[628867.650233]  do_try_to_free_pages+0xdd/0x230
[628867.651157]  try_to_free_pages+0xce/0x1a0
[628867.652342]  __alloc_pages_slowpath+0x2df/0x960
[628867.653332]  ? __might_sleep+0x4a/0x80
[628867.654148]  __alloc_pages_nodemask+0x24b/0x290
[628867.655237]  kmalloc_order+0x21/0x50
[628867.656016]  kmalloc_order_trace+0x24/0xc0
[628867.656878]  __kmalloc+0x17d/0x1d0
[628867.657644]  bucket_table_alloc+0x195/0x1d0
[628867.658564]  ? __might_sleep+0x4a/0x80
[628867.659449]  rht_deferred_worker+0x287/0x3c0
[628867.660366]  ? _raw_spin_unlock_irq+0xe/0x30
[628867.661294]  process_one_work+0x1de/0x4d0
[628867.662208]  worker_thread+0x4b/0x4f0
[628867.662990]  kthread+0x10c/0x140
[628867.663687]  ? process_one_work+0x4d0/0x4d0
[628867.664564]  ? kthread_create_on_node+0x40/0x40
[628867.665523]  ret_from_fork+0x25/0x30
[628867.666317] ---[ end trace 7c38634006a9955e ]---

Now, this workload does not touch the page cache at all - it's
entirely an XFS metadata workload, so it should not really be
affecting the working set code.

And worse, on that last error, the /host/ is now going into meltdown
(running 4.7.5) with 32 CPUs all burning down in ACPI code:

  PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
35074 root  -2   0   0  0  0 R  99.0  0.0  12:38.92 acpi_pad/12
35079 root  -2   0   0  0  0 R  99.0  0.0  12:39.40 acpi_pad/16
35080 root  -2   0   0  0  0 R  99.0  0.0  12:39.29 acpi_pad/17
35085 root  -2   0   0  0  0 R  99.0  0.0  12:39.35 acpi_pad/22
35087 root  -2   0   0  0  0 R  99.0  0.0  12:39.13 acpi_pad/24
35090 root  -2   0   0  0  0 R  99.0  0.0  12:38.89 acpi_pad/27
35093 root  -2   0   0  0  0 R  99.0  0.0  12:38.88 acpi_pad/30
35063 root  -2   0   0  0  0 R  98.1  0.0  12:40.64 acpi_pad/1
35065 root  -2   0   0  0  0 R  98.1  0.0  12:40.38 acpi_pad/3
35066 root  -2   0   0  0  0 R  98.1  0.0  12:40.30 acpi_pad/4
35067 root  -2   0   0  0  0 R  98.1  0.0  12:40.82 acpi_pad/5
35077 root  -2   0   0  0  0 R  98.1  0.0  12:39.65 acpi_pad/14
35078 root  -2   0   0  0  0 R  98.1  0.0  12:39.58 acpi_pad/15
35081 root  -2   0   0  0  0 R  98.1  0.0  12:39.32 acpi_pad/18
35072 root  -2   0   0  0  0 R  96.2  0.0  12:40.14 

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Chris Leech
On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Hi,
> 
> On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner  wrote:
> > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> >> Thanks Dave,
> >>
> >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> >> modules loaded (virtio block) so there's something else going on in the
> >> current merge window.  I'll keep an eye on it and make sure there's
> >> nothing iSCSI needs fixing for.
> >
> > OK, so before this slips through the cracks.
> >
> > Linus - your tree as of a few minutes ago still panics immediately
> > when starting xfstests on iscsi devices. It appears to be a
> > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > seem to have bounced it and no-one is looking at it.
> 
> Hmm. There's not much to go by.
> 
> Can somebody in iscsi-land please try to just bisect it - I'm not
> seeing a lot of clues to where this comes from otherwise.

Yeah, my hopes of this being quickly resolved by someone else didn't
work out and whatever is going on in that test VM is looking like a
different kind of odd.  I'm saving that off for later, and seeing if I
can't be a bisect on the iSCSI issue.

Chris


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Chris Leech
On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Hi,
> 
> On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner  wrote:
> > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> >> Thanks Dave,
> >>
> >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> >> modules loaded (virtio block) so there's something else going on in the
> >> current merge window.  I'll keep an eye on it and make sure there's
> >> nothing iSCSI needs fixing for.
> >
> > OK, so before this slips through the cracks.
> >
> > Linus - your tree as of a few minutes ago still panics immediately
> > when starting xfstests on iscsi devices. It appears to be a
> > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > seem to have bounced it and no-one is looking at it.
> 
> Hmm. There's not much to go by.
> 
> Can somebody in iscsi-land please try to just bisect it - I'm not
> seeing a lot of clues to where this comes from otherwise.

Yeah, my hopes of this being quickly resolved by someone else didn't
work out and whatever is going on in that test VM is looking like a
different kind of odd.  I'm saving that off for later, and seeing if I
can't be a bisect on the iSCSI issue.

Chris


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Linus Torvalds
Hi,

On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner  wrote:
> On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
>> Thanks Dave,
>>
>> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
>> modules loaded (virtio block) so there's something else going on in the
>> current merge window.  I'll keep an eye on it and make sure there's
>> nothing iSCSI needs fixing for.
>
> OK, so before this slips through the cracks.
>
> Linus - your tree as of a few minutes ago still panics immediately
> when starting xfstests on iscsi devices. It appears to be a
> scatterlist corruption and not an iscsi problem, so the iscsi guys
> seem to have bounced it and no-one is looking at it.

Hmm. There's not much to go by.

Can somebody in iscsi-land please try to just bisect it - I'm not
seeing a lot of clues to where this comes from otherwise.

There clearly hasn't been anything done to drivers/scsi/*iscsi* since
4.9, so it's coming from elsewhere, presumably the block layer changes
as you say.

But I think we need iscsi people to pinpoint it a bit more, since
nobody else seems to be complaining.

Looking around a bit, the only even halfway suspicious scatterlist
initialization thing I see is commit f9d03f96b988 ("block: improve
handling of the magic discard payload") which used to have a magic
hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
now it does __blk_bvec_map_sg() instead.

Maybe __blk_bvec_map_sg() would want that same

if (!bio->bi_vcnt)
return 0;

adding Christoph explicitly so that he can tell me why I'm a clueless
weenie. Although I suspect he's already on the scsi and block lists
and would have done so anyway.

> I'm disappearing for several months at the end of tomorrow, so I
> thought I better make sure you know about it.  I've also added
> linux-scsi, linux-block to the cc list

Thanks,

Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Linus Torvalds
Hi,

On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner  wrote:
> On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
>> Thanks Dave,
>>
>> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
>> modules loaded (virtio block) so there's something else going on in the
>> current merge window.  I'll keep an eye on it and make sure there's
>> nothing iSCSI needs fixing for.
>
> OK, so before this slips through the cracks.
>
> Linus - your tree as of a few minutes ago still panics immediately
> when starting xfstests on iscsi devices. It appears to be a
> scatterlist corruption and not an iscsi problem, so the iscsi guys
> seem to have bounced it and no-one is looking at it.

Hmm. There's not much to go by.

Can somebody in iscsi-land please try to just bisect it - I'm not
seeing a lot of clues to where this comes from otherwise.

There clearly hasn't been anything done to drivers/scsi/*iscsi* since
4.9, so it's coming from elsewhere, presumably the block layer changes
as you say.

But I think we need iscsi people to pinpoint it a bit more, since
nobody else seems to be complaining.

Looking around a bit, the only even halfway suspicious scatterlist
initialization thing I see is commit f9d03f96b988 ("block: improve
handling of the magic discard payload") which used to have a magic
hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
now it does __blk_bvec_map_sg() instead.

Maybe __blk_bvec_map_sg() would want that same

if (!bio->bi_vcnt)
return 0;

adding Christoph explicitly so that he can tell me why I'm a clueless
weenie. Although I suspect he's already on the scsi and block lists
and would have done so anyway.

> I'm disappearing for several months at the end of tomorrow, so I
> thought I better make sure you know about it.  I've also added
> linux-scsi, linux-block to the cc list

Thanks,

Linus


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> Thanks Dave,
> 
> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> modules loaded (virtio block) so there's something else going on in the
> current merge window.  I'll keep an eye on it and make sure there's
> nothing iSCSI needs fixing for.

OK, so before this slips through the cracks.

Linus - your tree as of a few minutes ago still panics immediately
when starting xfstests on iscsi devices. It appears to be a
scatterlist corruption and not an iscsi problem, so the iscsi guys
seem to have bounced it and no-one is looking at it.

I'm disappearing for several months at the end of tomorrow, so I
thought I better make sure you know about it.  I've also added
linux-scsi, linux-block to the cc list

Cheers,

Dave.

> On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> > On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > > window kernel to test the XFS merge I am preparing for Linus.
> > > Unfortunately, all my test VMs using iscsi failed pretty much
> > > instantly on the first mount of an iscsi device:
> > > 
> > > [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature 
> > > enabled. Use at your own risk!
> > > [  159.374612] XFS (sdb): Mounting V5 Filesystem
> > > [  159.425710] XFS (sdb): Ending clean mount
> > > [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 
> > > 000c
> > > [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > FYI, crash is here:
> > 
> > (gdb) l *(iscsi_tcp_segment_done+0x20d)
> > 0x81b950bd is in iscsi_tcp_segment_done 
> > (drivers/scsi/libiscsi_tcp.c:102).
> > 97  iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> > 98struct scatterlist *sg, unsigned int offset)
> > 99  {
> > 100 segment->sg = sg;
> > 101 segment->sg_offset = offset;
> > 102 segment->size = min(sg->length - offset,
> > 103 segment->total_size - 
> > segment->total_copied);
> > 104 segment->data = NULL;
> > 105 }
> > 106 
> > 
> > So it looks to be sg = NULL, which means there's probably an issue
> > with the scatterlist...
> > 
> > -Dave.
> > 
> > > [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
> > > PMD 0 [  160.277309]
> > > [  160.277523] Oops:  [#1] PREEMPT SMP
> > > [  160.278004] Modules linked in:
> > > [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > > [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > > BIOS Debian-1.8.2-1 04/01/2014
> > > [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > > [  160.280919] task: 88003e28 task.stack: c908
> > > [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > > [  160.282312] RSP: 0018:c9083c38 EFLAGS: 00010206
> > > [  160.282980] RAX:  RBX: 880039061730 RCX: 
> > > 
> > > [  160.283854] RDX: 1e00 RSI:  RDI: 
> > > 880039061730
> > > [  160.284738] RBP: c9083c90 R08: 0200 R09: 
> > > 05a8
> > > [  160.285627] R10: 9835607d R11:  R12: 
> > > 0200
> > > [  160.286495] R13:  R14: 8800390615a0 R15: 
> > > 880039061730
> > > [  160.287362] FS:  () GS:88003fc0() 
> > > knlGS:
> > > [  160.288340] CS:  0010 DS:  ES:  CR0: 80050033
> > > [  160.289113] CR2: 000c CR3: 31a8d000 CR4: 
> > > 06f0
> > > [  160.290084] Call Trace:
> > > [  160.290429]  ? inet_sendpage+0x4d/0x140
> > > [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> > > [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > > [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> > > [  160.292771]  iscsi_xmit_task+0x53/0xc0
> > > [  160.293282]  iscsi_xmitworker+0x274/0x310
> > > [  160.293835]  process_one_work+0x1de/0x4d0
> > > [  160.294388]  worker_thread+0x4b/0x4f0
> > > [  160.294889]  kthread+0x10c/0x140
> > > [  160.295333]  ? process_one_work+0x4d0/0x4d0
> > > [  160.295898]  ? kthread_create_on_node+0x40/0x40
> > > [  160.296525]  ret_from_fork+0x25/0x30
> > > [  160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da 
> > > e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 
> > > 00 00 <8b
> > > [  160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: 
> > > c9083c38
> > > [  160.301584] CR2: 000c
> > > 
> > > 
> > > Known problem, or something new?
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > da...@fromorbit.com
> > > 
> > 
> > -- 
> > Dave Chinner
> > da...@fromorbit.com
> 

-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-21 Thread Dave Chinner
On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> Thanks Dave,
> 
> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> modules loaded (virtio block) so there's something else going on in the
> current merge window.  I'll keep an eye on it and make sure there's
> nothing iSCSI needs fixing for.

OK, so before this slips through the cracks.

Linus - your tree as of a few minutes ago still panics immediately
when starting xfstests on iscsi devices. It appears to be a
scatterlist corruption and not an iscsi problem, so the iscsi guys
seem to have bounced it and no-one is looking at it.

I'm disappearing for several months at the end of tomorrow, so I
thought I better make sure you know about it.  I've also added
linux-scsi, linux-block to the cc list

Cheers,

Dave.

> On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> > On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > > window kernel to test the XFS merge I am preparing for Linus.
> > > Unfortunately, all my test VMs using iscsi failed pretty much
> > > instantly on the first mount of an iscsi device:
> > > 
> > > [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature 
> > > enabled. Use at your own risk!
> > > [  159.374612] XFS (sdb): Mounting V5 Filesystem
> > > [  159.425710] XFS (sdb): Ending clean mount
> > > [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 
> > > 000c
> > > [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > FYI, crash is here:
> > 
> > (gdb) l *(iscsi_tcp_segment_done+0x20d)
> > 0x81b950bd is in iscsi_tcp_segment_done 
> > (drivers/scsi/libiscsi_tcp.c:102).
> > 97  iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> > 98struct scatterlist *sg, unsigned int offset)
> > 99  {
> > 100 segment->sg = sg;
> > 101 segment->sg_offset = offset;
> > 102 segment->size = min(sg->length - offset,
> > 103 segment->total_size - 
> > segment->total_copied);
> > 104 segment->data = NULL;
> > 105 }
> > 106 
> > 
> > So it looks to be sg = NULL, which means there's probably an issue
> > with the scatterlist...
> > 
> > -Dave.
> > 
> > > [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
> > > PMD 0 [  160.277309]
> > > [  160.277523] Oops:  [#1] PREEMPT SMP
> > > [  160.278004] Modules linked in:
> > > [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > > [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > > BIOS Debian-1.8.2-1 04/01/2014
> > > [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > > [  160.280919] task: 88003e28 task.stack: c908
> > > [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > > [  160.282312] RSP: 0018:c9083c38 EFLAGS: 00010206
> > > [  160.282980] RAX:  RBX: 880039061730 RCX: 
> > > 
> > > [  160.283854] RDX: 1e00 RSI:  RDI: 
> > > 880039061730
> > > [  160.284738] RBP: c9083c90 R08: 0200 R09: 
> > > 05a8
> > > [  160.285627] R10: 9835607d R11:  R12: 
> > > 0200
> > > [  160.286495] R13:  R14: 8800390615a0 R15: 
> > > 880039061730
> > > [  160.287362] FS:  () GS:88003fc0() 
> > > knlGS:
> > > [  160.288340] CS:  0010 DS:  ES:  CR0: 80050033
> > > [  160.289113] CR2: 000c CR3: 31a8d000 CR4: 
> > > 06f0
> > > [  160.290084] Call Trace:
> > > [  160.290429]  ? inet_sendpage+0x4d/0x140
> > > [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> > > [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > > [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> > > [  160.292771]  iscsi_xmit_task+0x53/0xc0
> > > [  160.293282]  iscsi_xmitworker+0x274/0x310
> > > [  160.293835]  process_one_work+0x1de/0x4d0
> > > [  160.294388]  worker_thread+0x4b/0x4f0
> > > [  160.294889]  kthread+0x10c/0x140
> > > [  160.295333]  ? process_one_work+0x4d0/0x4d0
> > > [  160.295898]  ? kthread_create_on_node+0x40/0x40
> > > [  160.296525]  ret_from_fork+0x25/0x30
> > > [  160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da 
> > > e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 
> > > 00 00 <8b
> > > [  160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: 
> > > c9083c38
> > > [  160.301584] CR2: 000c
> > > 
> > > 
> > > Known problem, or something new?
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > da...@fromorbit.com
> > > 
> > 
> > -- 
> > Dave Chinner
> > da...@fromorbit.com
> 

-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-16 Thread Chris Leech
Thanks Dave,

I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
modules loaded (virtio block) so there's something else going on in the
current merge window.  I'll keep an eye on it and make sure there's
nothing iSCSI needs fixing for.

Chris

On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > window kernel to test the XFS merge I am preparing for Linus.
> > Unfortunately, all my test VMs using iscsi failed pretty much
> > instantly on the first mount of an iscsi device:
> > 
> > [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature 
> > enabled. Use at your own risk!
> > [  159.374612] XFS (sdb): Mounting V5 Filesystem
> > [  159.425710] XFS (sdb): Ending clean mount
> > [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 
> > 000c
> > [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
> 
> FYI, crash is here:
> 
> (gdb) l *(iscsi_tcp_segment_done+0x20d)
> 0x81b950bd is in iscsi_tcp_segment_done 
> (drivers/scsi/libiscsi_tcp.c:102).
> 97iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> 98  struct scatterlist *sg, unsigned int offset)
> 99{
> 100   segment->sg = sg;
> 101   segment->sg_offset = offset;
> 102   segment->size = min(sg->length - offset,
> 103   segment->total_size - 
> segment->total_copied);
> 104   segment->data = NULL;
> 105   }
> 106   
> 
> So it looks to be sg = NULL, which means there's probably an issue
> with the scatterlist...
> 
> -Dave.
> 
> > [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
> > PMD 0 [  160.277309]
> > [  160.277523] Oops:  [#1] PREEMPT SMP
> > [  160.278004] Modules linked in:
> > [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > Debian-1.8.2-1 04/01/2014
> > [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > [  160.280919] task: 88003e28 task.stack: c908
> > [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > [  160.282312] RSP: 0018:c9083c38 EFLAGS: 00010206
> > [  160.282980] RAX:  RBX: 880039061730 RCX: 
> > 
> > [  160.283854] RDX: 1e00 RSI:  RDI: 
> > 880039061730
> > [  160.284738] RBP: c9083c90 R08: 0200 R09: 
> > 05a8
> > [  160.285627] R10: 9835607d R11:  R12: 
> > 0200
> > [  160.286495] R13:  R14: 8800390615a0 R15: 
> > 880039061730
> > [  160.287362] FS:  () GS:88003fc0() 
> > knlGS:
> > [  160.288340] CS:  0010 DS:  ES:  CR0: 80050033
> > [  160.289113] CR2: 000c CR3: 31a8d000 CR4: 
> > 06f0
> > [  160.290084] Call Trace:
> > [  160.290429]  ? inet_sendpage+0x4d/0x140
> > [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> > [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> > [  160.292771]  iscsi_xmit_task+0x53/0xc0
> > [  160.293282]  iscsi_xmitworker+0x274/0x310
> > [  160.293835]  process_one_work+0x1de/0x4d0
> > [  160.294388]  worker_thread+0x4b/0x4f0
> > [  160.294889]  kthread+0x10c/0x140
> > [  160.295333]  ? process_one_work+0x4d0/0x4d0
> > [  160.295898]  ? kthread_create_on_node+0x40/0x40
> > [  160.296525]  ret_from_fork+0x25/0x30
> > [  160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 
> > ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 
> > <8b
> > [  160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: c9083c38
> > [  160.301584] CR2: 000c
> > 
> > 
> > Known problem, or something new?
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > da...@fromorbit.com
> > 
> 
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-16 Thread Chris Leech
Thanks Dave,

I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
modules loaded (virtio block) so there's something else going on in the
current merge window.  I'll keep an eye on it and make sure there's
nothing iSCSI needs fixing for.

Chris

On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > window kernel to test the XFS merge I am preparing for Linus.
> > Unfortunately, all my test VMs using iscsi failed pretty much
> > instantly on the first mount of an iscsi device:
> > 
> > [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature 
> > enabled. Use at your own risk!
> > [  159.374612] XFS (sdb): Mounting V5 Filesystem
> > [  159.425710] XFS (sdb): Ending clean mount
> > [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 
> > 000c
> > [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
> 
> FYI, crash is here:
> 
> (gdb) l *(iscsi_tcp_segment_done+0x20d)
> 0x81b950bd is in iscsi_tcp_segment_done 
> (drivers/scsi/libiscsi_tcp.c:102).
> 97iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> 98  struct scatterlist *sg, unsigned int offset)
> 99{
> 100   segment->sg = sg;
> 101   segment->sg_offset = offset;
> 102   segment->size = min(sg->length - offset,
> 103   segment->total_size - 
> segment->total_copied);
> 104   segment->data = NULL;
> 105   }
> 106   
> 
> So it looks to be sg = NULL, which means there's probably an issue
> with the scatterlist...
> 
> -Dave.
> 
> > [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
> > PMD 0 [  160.277309]
> > [  160.277523] Oops:  [#1] PREEMPT SMP
> > [  160.278004] Modules linked in:
> > [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > Debian-1.8.2-1 04/01/2014
> > [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > [  160.280919] task: 88003e28 task.stack: c908
> > [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > [  160.282312] RSP: 0018:c9083c38 EFLAGS: 00010206
> > [  160.282980] RAX:  RBX: 880039061730 RCX: 
> > 
> > [  160.283854] RDX: 1e00 RSI:  RDI: 
> > 880039061730
> > [  160.284738] RBP: c9083c90 R08: 0200 R09: 
> > 05a8
> > [  160.285627] R10: 9835607d R11:  R12: 
> > 0200
> > [  160.286495] R13:  R14: 8800390615a0 R15: 
> > 880039061730
> > [  160.287362] FS:  () GS:88003fc0() 
> > knlGS:
> > [  160.288340] CS:  0010 DS:  ES:  CR0: 80050033
> > [  160.289113] CR2: 000c CR3: 31a8d000 CR4: 
> > 06f0
> > [  160.290084] Call Trace:
> > [  160.290429]  ? inet_sendpage+0x4d/0x140
> > [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> > [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> > [  160.292771]  iscsi_xmit_task+0x53/0xc0
> > [  160.293282]  iscsi_xmitworker+0x274/0x310
> > [  160.293835]  process_one_work+0x1de/0x4d0
> > [  160.294388]  worker_thread+0x4b/0x4f0
> > [  160.294889]  kthread+0x10c/0x140
> > [  160.295333]  ? process_one_work+0x4d0/0x4d0
> > [  160.295898]  ? kthread_create_on_node+0x40/0x40
> > [  160.296525]  ret_from_fork+0x25/0x30
> > [  160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 
> > ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 
> > <8b
> > [  160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: c9083c38
> > [  160.301584] CR2: 000c
> > 
> > 
> > Known problem, or something new?
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > da...@fromorbit.com
> > 
> 
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-14 Thread Dave Chinner
On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> Hi folks,
> 
> Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> window kernel to test the XFS merge I am preparing for Linus.
> Unfortunately, all my test VMs using iscsi failed pretty much
> instantly on the first mount of an iscsi device:
> 
> [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature enabled. 
> Use at your own risk!
> [  159.374612] XFS (sdb): Mounting V5 Filesystem
> [  159.425710] XFS (sdb): Ending clean mount
> [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 
> 000c
> [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0

FYI, crash is here:

(gdb) l *(iscsi_tcp_segment_done+0x20d)
0x81b950bd is in iscsi_tcp_segment_done 
(drivers/scsi/libiscsi_tcp.c:102).
97  iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
98struct scatterlist *sg, unsigned int offset)
99  {
100 segment->sg = sg;
101 segment->sg_offset = offset;
102 segment->size = min(sg->length - offset,
103 segment->total_size - 
segment->total_copied);
104 segment->data = NULL;
105 }
106 

So it looks to be sg = NULL, which means there's probably an issue
with the scatterlist...

-Dave.

> [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
> PMD 0 [  160.277309]
> [  160.277523] Oops:  [#1] PREEMPT SMP
> [  160.278004] Modules linked in:
> [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Debian-1.8.2-1 04/01/2014
> [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> [  160.280919] task: 88003e28 task.stack: c908
> [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> [  160.282312] RSP: 0018:c9083c38 EFLAGS: 00010206
> [  160.282980] RAX:  RBX: 880039061730 RCX: 
> 
> [  160.283854] RDX: 1e00 RSI:  RDI: 
> 880039061730
> [  160.284738] RBP: c9083c90 R08: 0200 R09: 
> 05a8
> [  160.285627] R10: 9835607d R11:  R12: 
> 0200
> [  160.286495] R13:  R14: 8800390615a0 R15: 
> 880039061730
> [  160.287362] FS:  () GS:88003fc0() 
> knlGS:
> [  160.288340] CS:  0010 DS:  ES:  CR0: 80050033
> [  160.289113] CR2: 000c CR3: 31a8d000 CR4: 
> 06f0
> [  160.290084] Call Trace:
> [  160.290429]  ? inet_sendpage+0x4d/0x140
> [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> [  160.292771]  iscsi_xmit_task+0x53/0xc0
> [  160.293282]  iscsi_xmitworker+0x274/0x310
> [  160.293835]  process_one_work+0x1de/0x4d0
> [  160.294388]  worker_thread+0x4b/0x4f0
> [  160.294889]  kthread+0x10c/0x140
> [  160.295333]  ? process_one_work+0x4d0/0x4d0
> [  160.295898]  ? kthread_create_on_node+0x40/0x40
> [  160.296525]  ret_from_fork+0x25/0x30
> [  160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca 
> ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
> [  160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: c9083c38
> [  160.301584] CR2: 000c
> 
> 
> Known problem, or something new?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 

-- 
Dave Chinner
da...@fromorbit.com


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2016-12-14 Thread Dave Chinner
On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> Hi folks,
> 
> Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> window kernel to test the XFS merge I am preparing for Linus.
> Unfortunately, all my test VMs using iscsi failed pretty much
> instantly on the first mount of an iscsi device:
> 
> [  159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature enabled. 
> Use at your own risk!
> [  159.374612] XFS (sdb): Mounting V5 Filesystem
> [  159.425710] XFS (sdb): Ending clean mount
> [  160.274438] BUG: unable to handle kernel NULL pointer dereference at 
> 000c
> [  160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0

FYI, crash is here:

(gdb) l *(iscsi_tcp_segment_done+0x20d)
0x81b950bd is in iscsi_tcp_segment_done 
(drivers/scsi/libiscsi_tcp.c:102).
97  iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
98struct scatterlist *sg, unsigned int offset)
99  {
100 segment->sg = sg;
101 segment->sg_offset = offset;
102 segment->size = min(sg->length - offset,
103 segment->total_size - 
segment->total_copied);
104 segment->data = NULL;
105 }
106 

So it looks to be sg = NULL, which means there's probably an issue
with the scatterlist...

-Dave.

> [  160.276565] PGD 336ed067 [  160.276885] PUD 31b0d067
> PMD 0 [  160.277309]
> [  160.277523] Oops:  [#1] PREEMPT SMP
> [  160.278004] Modules linked in:
> [  160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> [  160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Debian-1.8.2-1 04/01/2014
> [  160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> [  160.280919] task: 88003e28 task.stack: c908
> [  160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> [  160.282312] RSP: 0018:c9083c38 EFLAGS: 00010206
> [  160.282980] RAX:  RBX: 880039061730 RCX: 
> 
> [  160.283854] RDX: 1e00 RSI:  RDI: 
> 880039061730
> [  160.284738] RBP: c9083c90 R08: 0200 R09: 
> 05a8
> [  160.285627] R10: 9835607d R11:  R12: 
> 0200
> [  160.286495] R13:  R14: 8800390615a0 R15: 
> 880039061730
> [  160.287362] FS:  () GS:88003fc0() 
> knlGS:
> [  160.288340] CS:  0010 DS:  ES:  CR0: 80050033
> [  160.289113] CR2: 000c CR3: 31a8d000 CR4: 
> 06f0
> [  160.290084] Call Trace:
> [  160.290429]  ? inet_sendpage+0x4d/0x140
> [  160.290957]  iscsi_sw_tcp_xmit_segment+0x89/0x110
> [  160.291597]  iscsi_sw_tcp_pdu_xmit+0x56/0x180
> [  160.292190]  iscsi_tcp_task_xmit+0xb8/0x280
> [  160.292771]  iscsi_xmit_task+0x53/0xc0
> [  160.293282]  iscsi_xmitworker+0x274/0x310
> [  160.293835]  process_one_work+0x1de/0x4d0
> [  160.294388]  worker_thread+0x4b/0x4f0
> [  160.294889]  kthread+0x10c/0x140
> [  160.295333]  ? process_one_work+0x4d0/0x4d0
> [  160.295898]  ? kthread_create_on_node+0x40/0x40
> [  160.296525]  ret_from_fork+0x25/0x30
> [  160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca 
> ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
> [  160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: c9083c38
> [  160.301584] CR2: 000c
> 
> 
> Known problem, or something new?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 

-- 
Dave Chinner
da...@fromorbit.com