On Thu, 6 Mar 2014, Or Gerlitz wrote:

> > This was originally a patch from Matthew Finlay <m...@mellanox.com> that
> > addressed a problem whereby NFS writes would enter uninterruptible sleep
> > forever.  The issue happened when using NFS over IPoIB. This is not a
> > recommended configuration as RDMA is preferred but it is still a valid
> > configuration and is important to have in situations where the NFS server
> > does not support RDMA. The problem encountered was described as follows:
> > 
> >     It's not memory reclamation that is the problem as such. There is
> >     an indirect dependency between network filesystems writing back
> >     pages and ipoib_cm_tx_init() due to how a kworker is used. Page
> >     reclaim cannot make forward progress until ipoib_cm_tx_init()
> >     succeeds and it is stuck in page reclaim itself waiting for network
> >     transmission. Ordinarily this situation may be avoided by having
> >     the caller use GFP_NOFS but ipoib_cm_tx_init() does not have that
> >     information.
> > 
> 
> Hi Jiri,
>
> Having re-read (*) the problem description, the team here would be happy
> to clarify some details with you (possibly a few MM newbie questions, but
> it will help us):

Hi Or,

thanks for getting back to me. I am sure there are better people to ask 
MM-related questions, but here we go.

Oh, and by the way, the original version of the patch came from a
Mellanox employee, Matthew Finlay, so it might be more efficient to
contact him and discuss the details with him directly.

> 1. just to make sure, the problem happens on the NFS client, not the NFS
> server, right? so writing back means the client writing over the NFS mount
> --> network

Yes, that is the case.

> 2. you wrote "due to how a kworker is used", can you clarify if/why things go
> wrong b/c of the kworker usage, or is this a matter of phrasing?

The mlx kworker that tries to allocate memory with GFP_KERNEL will
eventually get stuck: if the system is under memory pressure, memory
reclaim has to run in order to free occupied memory for the GFP_KERNEL
allocation.

Writeback, however, cannot proceed, because the mlx kworker is stuck
waiting for exactly that writeback to happen.
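
To illustrate the cycle, here is a toy userspace model. It is not kernel
code: the TOY_* names, their values and both helper functions are made up
for this mail. The flag composition only mirrors how I understand the gfp
masks of that era to be built (GFP_KERNEL = wait + IO + FS, GFP_NOFS = wait
+ IO, GFP_NOIO = wait), so treat it as a sketch under those assumptions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's gfp modifier bits. */
#define TOY_GFP_WAIT 0x1u /* allocation may sleep and enter direct reclaim */
#define TOY_GFP_IO   0x2u /* reclaim may start I/O */
#define TOY_GFP_FS   0x4u /* reclaim may call into a filesystem (e.g. NFS) */

#define TOY_GFP_KERNEL (TOY_GFP_WAIT | TOY_GFP_IO | TOY_GFP_FS)
#define TOY_GFP_NOFS   (TOY_GFP_WAIT | TOY_GFP_IO)
#define TOY_GFP_NOIO   (TOY_GFP_WAIT)

/* True if an allocation with these flags may end up sleeping until
 * filesystem writeback makes progress. */
static bool alloc_may_wait_on_fs_writeback(unsigned int flags)
{
	return (flags & TOY_GFP_WAIT) && (flags & TOY_GFP_FS);
}

/* True if connection setup can deadlock: the allocation waits for
 * writeback, while writeback waits for this very connection. */
static bool tx_setup_can_deadlock(unsigned int flags,
				  bool writeback_needs_this_tx)
{
	return writeback_needs_this_tx &&
	       alloc_may_wait_on_fs_writeback(flags);
}
```

In this model, tx_setup_can_deadlock(TOY_GFP_KERNEL, true) is true, while
the same call with TOY_GFP_NOFS or TOY_GFP_NOIO is false, which is the
point of passing a more restrictive mask down to the allocation.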

> in an earlier post in this thread you wrote "There was a problem with swapping
> over NFS, as writeback was deadlocked with memory reclaim (memory needs to be
> allocated so that swap could be accessed to reclaim memory). That's fixed by
> allocating the buffers from PF_MEMALLOC reserve, introduced by Mel's and
> Peter's patchset back in 3.9 or so. Oh, and the same has been done for
> swapping over NBD, btw", in that respect:
>
> 3. you mentioned that the memory allocations in ipoib_cm_tx_init() and
> ib_create_qp() --> mlx4 driver require page reclaim and wait for
> network transmission, so did this client node put its swap on that NFS
> partition?

They need memory reclaim to happen in low-memory situations. A GFP_KERNEL
allocation is allowed to go to sleep and wait for the reclaim to succeed.

> 4. Can you shed more light on why the problem also hits kmalloc-based
> allocations and not only vmalloc-based ones, i.e. not only b/c of the
> vzalloc call in ipoib_cm_tx_init but also b/c of the misc kmalloc calls
> within the HW (here mlx4) driver?

GFP_KERNEL is the key here -- an allocation using GFP_KERNEL is allowed
to sleep until memory reclamation has succeeded.

Thanks again,

-- 
Jiri Kosina
SUSE Labs