Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Laurence Oberman


- Original Message -
> From: "Laurence Oberman" 
> To: "Jan Kara" 
> Cc: "Johannes Weiner" , "Hugh Dickins" 
> , "Linus Torvalds"
> , "Dave Chinner" , "Chris 
> Leech" , "Linux
> Kernel Mailing List" , "Lee Duncan" 
> , open-iscsi@googlegroups.com,
> "Linux SCSI List" , linux-bl...@vger.kernel.org, 
> "Christoph Hellwig" ,
> "Andrea Arcangeli" 
> Sent: Wednesday, January 4, 2017 10:26:09 AM
> Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at 
> iscsi_tcp_segment_done+0x20d/0x2e0
> 
> 
> 
> - Original Message -
> > From: "Jan Kara" 
> > To: "Johannes Weiner" 
> > Cc: "Hugh Dickins" , "Linus Torvalds"
> > , "Dave Chinner"
> > , "Chris Leech" , "Linux Kernel
> > Mailing List"
> > , "Lee Duncan" ,
> > open-iscsi@googlegroups.com, "Linux SCSI List"
> > , linux-bl...@vger.kernel.org, "Christoph
> > Hellwig" , "Jan Kara"
> > , "Andrea Arcangeli" 
> > Sent: Tuesday, January 3, 2017 7:28:25 AM
> > Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at
> > iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner
> > > > > > > 
> > > > > > > wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at
> > > > > > > > mm/workingset.c:461
> > > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> > > > > > > 
> > > > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > > > affecting the working set code.
> > > > > > > 
> > > > > > > Well, I suspect that anything that creates memory pressure will
> > > > > > > end
> > > > > > > up
> > > > > > > triggering the working set code, so ..
> > > > > > > 
> > > > > > > That said, obviously memory corruption could be involved and
> > > > > > > result
> > > > > > > in
> > > > > > > random issues too, but I wouldn't really expect that in this
> > > > > > > code.
> > > > > > > 
> > > > > > > It would probably be really useful to get more data points - is
> > > > > > > the
> > > > > > > problem reliably in this area, or is it going to be random and
> > > > > > > all
> > > > > > > over the place.
> > > > > > 
> > > > > > Data point: kswapd got WARNING on mm/workingset.c:457 in
> > > > > > shadow_lru_isolate,
> > > > > > soon followed by NULL pointer deref in list_lru_isolate, one time
> > > > > > when
> > > > > > I tried out Sunday's git tree.  Not seen since, I haven't had time
> > > > > > to
> > > > > > investigate, just set it aside as something to worry about if it
> > > > > > happens
> > > > > > again.  But it looks like shadow_lru_isolate() has issues beyond
> > > > > > Dave's
> > > > > > case (I've no XFS and no iscsi), suspect unrelated to his other
> > > > > > problems.
> > > > > 
> > > > > This seems consistent with what Dave observed: we encounter regular
> > > > > pages in radix tree nodes on the shadow LRU that should only contain
> > > > > nodes full of exceptional shadow entries. It could be an issue in the
> > > > > new slot replacement code and the node tracking callback.
> > > > 
> > > > Both encounters seem to indicate use-after-free. Dave's node didn't
> > > > warn about an unexpected node->count / node->exceptional state, but
> > > > had entries that were inconsistent with that. Hugh got the counter
> > > > warning but crashed on a list_head that's not NULLed in a live node.
> > > > 
> > > > workingset_update_node() should be called on page cache radix tree
> > > > leaf nodes that go empty. I must be missing an update_node callback
> > > > where a leaf node gets freed somewhere.
> > > 
> > > Sorry for dropping silent on this. I'm traveling over the holidays
> > > with sporadic access to my emails and no access 

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Laurence Oberman


- Original Message -
> From: "Jan Kara" 
> To: "Johannes Weiner" 
> Cc: "Hugh Dickins" , "Linus Torvalds" 
> , "Dave Chinner"
> , "Chris Leech" , "Linux Kernel 
> Mailing List"
> , "Lee Duncan" , 
> open-iscsi@googlegroups.com, "Linux SCSI List"
> , linux-bl...@vger.kernel.org, "Christoph 
> Hellwig" , "Jan Kara"
> , "Andrea Arcangeli" 
> Sent: Tuesday, January 3, 2017 7:28:25 AM
> Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at 
> iscsi_tcp_segment_done+0x20d/0x2e0
> 
> On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner 
> > > > > > wrote:
> > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > workload again and about a minute in this fired:
> > > > > > >
> > > > > > > [628867.607417] [ cut here ]
> > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461
> > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > 
> > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > Johannes Weiner to the participants.
> > > > > > 
> > > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > > affecting the working set code.
> > > > > > 
> > > > > > Well, I suspect that anything that creates memory pressure will end
> > > > > > up
> > > > > > triggering the working set code, so ..
> > > > > > 
> > > > > > That said, obviously memory corruption could be involved and result
> > > > > > in
> > > > > > random issues too, but I wouldn't really expect that in this code.
> > > > > > 
> > > > > > It would probably be really useful to get more data points - is the
> > > > > > problem reliably in this area, or is it going to be random and all
> > > > > > over the place.
> > > > > 
> > > > > Data point: kswapd got WARNING on mm/workingset.c:457 in
> > > > > shadow_lru_isolate,
> > > > > soon followed by NULL pointer deref in list_lru_isolate, one time
> > > > > when
> > > > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > > > investigate, just set it aside as something to worry about if it
> > > > > happens
> > > > > again.  But it looks like shadow_lru_isolate() has issues beyond
> > > > > Dave's
> > > > > case (I've no XFS and no iscsi), suspect unrelated to his other
> > > > > problems.
> > > > 
> > > > This seems consistent with what Dave observed: we encounter regular
> > > > pages in radix tree nodes on the shadow LRU that should only contain
> > > > nodes full of exceptional shadow entries. It could be an issue in the
> > > > new slot replacement code and the node tracking callback.
> > > 
> > > Both encounters seem to indicate use-after-free. Dave's node didn't
> > > warn about an unexpected node->count / node->exceptional state, but
> > > had entries that were inconsistent with that. Hugh got the counter
> > > warning but crashed on a list_head that's not NULLed in a live node.
> > > 
> > > workingset_update_node() should be called on page cache radix tree
> > > leaf nodes that go empty. I must be missing an update_node callback
> > > where a leaf node gets freed somewhere.
> > 
> > Sorry for dropping silent on this. I'm traveling over the holidays
> > with sporadic access to my emails and no access to real equipment.
> > 
> > The times I managed to sneak away to look at the code didn't turn up
> > anything useful yet.
> > 
> > Andrea encountered the warning as well and I gave him a debugging
> > patch (attached below), but he hasn't been able to reproduce this
> > condition. I've personally never seen the warning trigger, even though
> > the patches have been running on my main development machine for quite
> > a while now. Albeit against an older base; I've updated to Linus's
> > master branch now in case it's an interaction with other new code.
> > 
> > If anybody manages to reproduce this, that would be helpful. Any extra
> > eyes on this would be much appreciated too until I'm back at my desk.
> 
> I was looking into this but I didn't find a way how we could possibly leave
> radix tree node on LRU. So your debug patch looks like a good way forward.
> 
>   Honza
> --
> Jan Kara 
> SUSE Labs, CR
> --
> To unsubscribe from this list: send 

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Linus Torvalds
On Fri, Dec 23, 2016 at 2:00 AM, Christoph Hellwig  wrote:
>
> From: Christoph Hellwig 
> Date: Fri, 23 Dec 2016 10:57:06 +0100
> Subject: virtio_blk: avoid DMA to stack for the sense buffer
>
> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
> so to avoid DMA to the stack copy them to a field in the heap allocated
> virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
> including the SG_IO ioctl from userspace will crash the kernel.  Note that
> this includes running tools like hdparm even when the host does not have
> SCSI passthrough enabled.

Ugh. This patch is nasty.

I think we should just fix blk_execute_rq() instead.

But from a quick look, we also have at least sg_scsi_ioctl() and
sg_io() doing the same thing.

And the SG_IO thing in bsg_ioctl(). And spi_execute() in scsi_transport_spi.c

And resp_requests() in scsi_debug.c.

So I guess ugly it may need to be, and the rule is that the sense
buffer really can be on the stack and you can't DMA to/from it.
Comments from others?

Linus

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to open-iscsi+unsubscr...@googlegroups.com.
To post to this group, send email to open-iscsi@googlegroups.com.
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Jens Axboe
On 12/23/2016 12:42 PM, Linus Torvalds wrote:
> On Fri, Dec 23, 2016 at 2:00 AM, Christoph Hellwig  wrote:
>>
>> From: Christoph Hellwig 
>> Date: Fri, 23 Dec 2016 10:57:06 +0100
>> Subject: virtio_blk: avoid DMA to stack for the sense buffer
>>
>> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
>> so to avoid DMA to the stack copy them to a field in the heap allocated
>> virtblk_req structure.  Without that any attempt at SCSI passthrough I/O,
>> including the SG_IO ioctl from userspace will crash the kernel.  Note that
>> this includes running tools like hdparm even when the host does not have
>> SCSI passthrough enabled.
> 
> Ugh. This patch is nasty.
> 
> I think we should just fix blk_execute_rq() instead.
> 
> But from a quick look, we also have at least sg_scsi_ioctl() and
> sg_io() doing the same thing.
> 
> And the SG_IO thing in bsg_ioctl(). And spi_execute() in scsi_transport_spi.c
> 
> And resp_requests() in scsi_debug.c.

It's not that it's technically hard to fix up, it's more that it's a
pain in the ass to have to do it. For instance, for blk_execute_rq(), we
either should enforce that the caller allocates it dynamically and then
free it, or we need nasty hack where the caller needs to know he has to
free it. Pretty obvious what I would prefer there.

And yes, there would be a good chunk of other places where this would
nede to be fixed up...

> So I guess ugly it may need to be, and the rule is that the sense
> buffer really can be on the stack and you can't DMA to/from it.
> Comments from others?

I'm just wondering why this is being hit now, we have a 4.9 release with
this issue and nobody reported it (that I saw)... Which is pretty sad.

If no one beats me to it, I'll try and get a patch done on Sunday. We're
in the midst of the holidays here.

-- 
Jens Axboe

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to open-iscsi+unsubscr...@googlegroups.com.
To post to this group, send email to open-iscsi@googlegroups.com.
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.


Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Christoph Hellwig
On Sat, Dec 24, 2016 at 02:17:26PM +0100, Hannes Reinecke wrote:
> Christoph, do you have a pointer to your patchset?

Here is a pointer to the current one after splitting it into properly
bisectable chunks.  Besides proper changelogs the biggest item left
is fixing up dm-mpath to not allocate its own request structures.

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-pc-refactor

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to open-iscsi+unsubscr...@googlegroups.com.
To post to this group, send email to open-iscsi@googlegroups.com.
Visit this group at https://groups.google.com/group/open-iscsi.
For more options, visit https://groups.google.com/d/optout.