Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-12-09 Thread Paolo Abeni
On Fri, 2022-12-09 at 08:11 -0800, Jakub Kicinski wrote:
> On Fri, 09 Dec 2022 13:37:08 +0100 Paolo Abeni wrote:
> > I think this is the most feasible way out of the existing issue, and I
> > think this patchset should go via the networking tree, targeting the
> > Linux 6.2.
> 
> FWIW some fields had been moved so this will not longer apply cleanly,
> see b534dc46c8ae016. But I think we can apply it to net since the merge
> window is upon us? Just a heads up.

We will need an additional revision, see my reply to patch 3/3. I think
the -net tree should be the appropriate target.

Thanks,

Paolo



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-12-09 Thread Jakub Kicinski
On Fri, 09 Dec 2022 13:37:08 +0100 Paolo Abeni wrote:
> I think this is the most feasible way out of the existing issue, and I
> think this patchset should go via the networking tree, targeting the
> Linux 6.2.

FWIW some fields had been moved so this will not longer apply cleanly,
see b534dc46c8ae016. But I think we can apply it to net since the merge
window is upon us? Just a heads up.



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-12-09 Thread Paolo Abeni
On Mon, 2022-11-21 at 08:35 -0500, Benjamin Coddington wrote:
> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
> GFP_NOIO flag on sk_allocation which the networking system uses to decide
> when it is safe to use current->task_frag.  The results of this are
> unexpected corruption in task_frag when SUNRPC is involved in memory
> reclaim.
> 
> The corruption can be seen in crashes, but the root cause is often
> difficult to ascertain as a crashing machine's stack trace will have no
> evidence of being near NFS or SUNRPC code.  I believe this problem to
> be much more pervasive than reports to the community may indicate.
> 
> Fix this by having kernel users of sockets that may corrupt task_frag due
> to reclaim set sk_use_task_frag = false.  Preemptively correcting this
> situation for users that still set sk_allocation allows them to convert to
> memalloc_nofs_save/restore without the same unexpected corruptions that are
> sure to follow, unlikely to show up in testing, and difficult to bisect.
> 
> CC: Philipp Reisner 
> CC: Lars Ellenberg 
> CC: "Christoph Böhmwalder" 
> CC: Jens Axboe 
> CC: Josef Bacik 
> CC: Keith Busch 
> CC: Christoph Hellwig 
> CC: Sagi Grimberg 
> CC: Lee Duncan 
> CC: Chris Leech 
> CC: Mike Christie 
> CC: "James E.J. Bottomley" 
> CC: "Martin K. Petersen" 
> CC: Valentina Manea 
> CC: Shuah Khan 
> CC: Greg Kroah-Hartman 
> CC: David Howells 
> CC: Marc Dionne 
> CC: Steve French 
> CC: Christine Caulfield 
> CC: David Teigland 
> CC: Mark Fasheh 
> CC: Joel Becker 
> CC: Joseph Qi 
> CC: Eric Van Hensbergen 
> CC: Latchesar Ionkov 
> CC: Dominique Martinet 
> CC: "David S. Miller" 
> CC: Eric Dumazet 
> CC: Jakub Kicinski 
> CC: Paolo Abeni 
> CC: Ilya Dryomov 
> CC: Xiubo Li 
> CC: Chuck Lever 
> CC: Jeff Layton 
> CC: Trond Myklebust 
> CC: Anna Schumaker 
> CC: drbd-...@lists.linbit.com
> CC: linux-bl...@vger.kernel.org
> CC: linux-ker...@vger.kernel.org
> CC: n...@other.debian.org
> CC: linux-n...@lists.infradead.org
> CC: open-is...@googlegroups.com
> CC: linux-s...@vger.kernel.org
> CC: linux-...@vger.kernel.org
> CC: linux-...@lists.infradead.org
> CC: linux-c...@vger.kernel.org
> CC: samba-techni...@lists.samba.org
> CC: cluster-devel@redhat.com
> CC: ocfs2-de...@oss.oracle.com
> CC: v9fs-develo...@lists.sourceforge.net
> CC: net...@vger.kernel.org
> CC: ceph-de...@vger.kernel.org
> CC: linux-...@vger.kernel.org
> 
> Suggested-by: Guillaume Nault 
> Signed-off-by: Benjamin Coddington 

I think this is the most feasible way out of the existing issue, and I
think this patchset should go via the networking tree, targeting the
Linux 6.2.

If someone has disagreement with the above, please speak! 

Thanks,

Paolo



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-30 Thread Guillaume Nault
On Tue, Nov 29, 2022 at 11:47:47AM -0500, Benjamin Coddington wrote:
> On 29 Nov 2022, at 9:02, Christoph Hellwig wrote:
> 
> > Hmm.  Having to set a flag to not accidentally corrupt per-task
> > state seems a bit fragile.  Wouldn't it make sense to find a way to opt
> > into the feature only for sockets created from the syscall layer?
> 
> It's totally fragile, and that's why it's currently broken in production.
> The fragile ship sailed when networking decided to depend on users setting
> the socket's GFP_ flags correctly to avoid corruption.
> 
> Meantime, this problem needs fixing in a way that makes everyone happy.
> This fix doesn't make it less fragile, but it may (hopefully) address the
> previous criticisms enough that something gets done to fix it.

Also, let's remember that while we're discussing how the kernel sould
work in an ideal world, the reality is that production NFS systems
crash randomly upon memory reclaim since commit a1231fda7e94 ("SUNRPC:
Set memalloc_nofs_save() on all rpciod/xprtiod jobs"). Fixing that is
just a matter of re-introducing GFP_NOFS on SUNRPC sockets (which has
been proposed several times already). Then we'll have plenty of time
to argue about how networking should use the per-task page_frag and
how to remove GFP_NOFS in the long term.



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-30 Thread Guillaume Nault
On Tue, Nov 29, 2022 at 03:02:42PM +0100, Christoph Hellwig wrote:
> Hmm.  Having to set a flag to not accidentally corrupt per-task
> state seems a bit fragile.  Wouldn't it make sense to find a way to opt
> into the feature only for sockets created from the syscall layer?

That's something I originally considered. But, as far as I can see, nbd
needs this flag _and_ uses sockets created in user space. So it'd still
need to opt out manually.



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-29 Thread Jeff Layton
On Tue, 2022-11-29 at 15:02 +0100, Christoph Hellwig wrote:
> Hmm.  Having to set a flag to not accidentally corrupt per-task
> state seems a bit fragile.  Wouldn't it make sense to find a way to opt
> into the feature only for sockets created from the syscall layer?

I agree that that would be cleaner. task_frag should have been an opt-in
thing all along. That change regressed all of the in-kernel users of
sockets.

Where would be the right place to set that flag for only userland
sockets? A lot of the in-kernel socket users hook into the socket API at
a fairly high-level. 9P and CIFS, for instance, call __sock_create.

We could set it in the syscall handlers (and maybe in iouring) I
suppose, but that seems like the wrong thing to do too.

In the absence of a clean place to do this, I think we're going to be
stuck doing it the way Ben has proposed...
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-29 Thread Benjamin Coddington
On 29 Nov 2022, at 9:02, Christoph Hellwig wrote:

> Hmm.  Having to set a flag to not accidentally corrupt per-task
> state seems a bit fragile.  Wouldn't it make sense to find a way to opt
> into the feature only for sockets created from the syscall layer?

It's totally fragile, and that's why it's currently broken in production.
The fragile ship sailed when networking decided to depend on users setting
the socket's GFP_ flags correctly to avoid corruption.

Meantime, this problem needs fixing in a way that makes everyone happy.
This fix doesn't make it less fragile, but it may (hopefully) address the
previous criticisms enough that something gets done to fix it.

Ben



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-29 Thread Christoph Hellwig
Hmm.  Having to set a flag to not accidentally corrupt per-task
state seems a bit fragile.  Wouldn't it make sense to find a way to opt
into the feature only for sockets created from the syscall layer?



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-22 Thread Benjamin Coddington
On 21 Nov 2022, at 17:32, Shuah Khan wrote:

> On 11/21/22 15:01, Benjamin Coddington wrote:
>> On 21 Nov 2022, at 16:43, Shuah Khan wrote:
>>
>>> On 11/21/22 14:40, Shuah Khan wrote:
 On 11/21/22 07:34, Benjamin Coddington wrote:
> On 21 Nov 2022, at 8:56, David Howells wrote:
>
>> Benjamin Coddington  wrote:
>>
>>> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting 
>>> the
>>> GFP_NOIO flag on sk_allocation which the networking system uses to 
>>> decide
>>> when it is safe to use current->task_frag.
>>
>> Um, what's task_frag?
>
> Its a per-task page_frag used to coalesce small writes for networking -- 
> see:
>
> 5640f7685831 net: use a per task frag allocator
>
> Ben
>
>

 I am not seeing this in the mainline. Where can find this commit?

>>>
>>> Okay. I see this commit in the mainline. However, I don't see the
>>> sk_use_task_frag in mainline.
>>
>> sk_use_task_frag is in patch 1/3 in this posting.
>>
>> https://lore.kernel.org/netdev/26d98c8f-372b-b9c8-c29f-096cddaff...@linuxfoundation.org/T/#m3271959c4cf8dcff1c0c6ba023b2b3821d9e7e99
>>
>
> Aha. I don't have 1/3 in my Inbox - I think it would make
> sense to cc people on the first patch so we can understand
> the premise for the change.

Yeah, I can do that if it goes to another version, I was just trying to be
considerate of all the noise this sort of posting generates.

Ben



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-22 Thread Shuah Khan

On 11/21/22 07:34, Benjamin Coddington wrote:

On 21 Nov 2022, at 8:56, David Howells wrote:


Benjamin Coddington  wrote:


Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag.


Um, what's task_frag?


Its a per-task page_frag used to coalesce small writes for networking -- see:

5640f7685831 net: use a per task frag allocator

Ben




I am not seeing this in the mainline. Where can find this commit?

thanks,
-- Shuah



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-22 Thread Shuah Khan

On 11/21/22 15:01, Benjamin Coddington wrote:

On 21 Nov 2022, at 16:43, Shuah Khan wrote:


On 11/21/22 14:40, Shuah Khan wrote:

On 11/21/22 07:34, Benjamin Coddington wrote:

On 21 Nov 2022, at 8:56, David Howells wrote:


Benjamin Coddington  wrote:


Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag.


Um, what's task_frag?


Its a per-task page_frag used to coalesce small writes for networking -- see:

5640f7685831 net: use a per task frag allocator

Ben




I am not seeing this in the mainline. Where can find this commit?



Okay. I see this commit in the mainline. However, I don't see the
sk_use_task_frag in mainline.


sk_use_task_frag is in patch 1/3 in this posting.

https://lore.kernel.org/netdev/26d98c8f-372b-b9c8-c29f-096cddaff...@linuxfoundation.org/T/#m3271959c4cf8dcff1c0c6ba023b2b3821d9e7e99



Aha. I don't have 1/3 in my Inbox - I think it would make
sense to cc people on the first patch so we can understand
the premise for the change.

thanks,
-- Shuah
 



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-22 Thread Shuah Khan

On 11/21/22 14:40, Shuah Khan wrote:

On 11/21/22 07:34, Benjamin Coddington wrote:

On 21 Nov 2022, at 8:56, David Howells wrote:


Benjamin Coddington  wrote:


Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag.


Um, what's task_frag?


Its a per-task page_frag used to coalesce small writes for networking -- see:

5640f7685831 net: use a per task frag allocator

Ben




I am not seeing this in the mainline. Where can find this commit?



Okay. I see this commit in the mainline. However, I don't see the
sk_use_task_frag in mainline.

thanks,
-- Shuah



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-21 Thread Benjamin Coddington
On 21 Nov 2022, at 16:43, Shuah Khan wrote:

> On 11/21/22 14:40, Shuah Khan wrote:
>> On 11/21/22 07:34, Benjamin Coddington wrote:
>>> On 21 Nov 2022, at 8:56, David Howells wrote:
>>>
 Benjamin Coddington  wrote:

> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
> GFP_NOIO flag on sk_allocation which the networking system uses to decide
> when it is safe to use current->task_frag.

 Um, what's task_frag?
>>>
>>> Its a per-task page_frag used to coalesce small writes for networking -- 
>>> see:
>>>
>>> 5640f7685831 net: use a per task frag allocator
>>>
>>> Ben
>>>
>>>
>>
>> I am not seeing this in the mainline. Where can find this commit?
>>
>
> Okay. I see this commit in the mainline. However, I don't see the
> sk_use_task_frag in mainline.

sk_use_task_frag is in patch 1/3 in this posting.

https://lore.kernel.org/netdev/26d98c8f-372b-b9c8-c29f-096cddaff...@linuxfoundation.org/T/#m3271959c4cf8dcff1c0c6ba023b2b3821d9e7e99

Ben



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-21 Thread Benjamin Coddington
On 21 Nov 2022, at 8:56, David Howells wrote:

> Benjamin Coddington  wrote:
>
>> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
>> GFP_NOIO flag on sk_allocation which the networking system uses to decide
>> when it is safe to use current->task_frag.
>
> Um, what's task_frag?

Its a per-task page_frag used to coalesce small writes for networking -- see:

5640f7685831 net: use a per task frag allocator

Ben



Re: [Cluster-devel] [PATCH v1 2/3] Treewide: Stop corrupting socket's task_frag

2022-11-21 Thread David Howells


Benjamin Coddington  wrote:

> Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
> GFP_NOIO flag on sk_allocation which the networking system uses to decide
> when it is safe to use current->task_frag.

Um, what's task_frag?

David