On 12/01/2011 02:55 PM, Jeff Mahoney wrote:
> On 11/28/2011 07:04 PM, Jeff Mahoney wrote:
>> On 11/28/2011 06:53 PM, Andi Kleen wrote:
>>> Jeff Mahoney <je...@suse.com> writes:
>
>>>> The extent_state structure is used at the core of the extent i/o
>>>> code for managing flags, locking, etc. It requires allocations
>>>> deep in the write code, and if failures occur there, they are
>>>> difficult to recover from.
>>>>
>>>> We avoid most of the failures by using a mempool, which can sleep
>>>> when required, to honor the allocations. This allows future
>>>> patches to convert most of the {set,clear,convert}_extent_bit
>>>> functions and their derivatives to return void.
>
>>> Is this really safe?
>
>>> IIRC, if there's any operation that needs multiple mempool
>>> objects, you can get into the following ABBA-style deadlock:
>
>>>   thread 1                      thread 2
>
>>>   get object A from pool 1
>>>                                 get object C from pool 2
>>>   get object B from pool 2
>>>   pool 2 full -> block
>>>                                 get object D from pool 2
>>>                                 pool 1 full -> block
>> ^ pool 1, I assume (for object D)
>
>>> Now for thread 1 to make progress, it needs thread 2 to free its
>>> object first, but that in turn needs thread 1 to free its object
>>> first, which is a deadlock.
>
>>> It would still work if there are other users which eventually make
>>> progress, but if you're really unlucky, either pool 1 or pool 2 is
>>> completely used up by threads doing multiple-object operations. So
>>> you've got a nasty, rare deadlock...
>
>> Yes, I think you're right. The risk is probably there, and I'm not
>> sure it can be totally mitigated. We'd be stuck if there were a
>> string of pathological cases and both mempools were empty. The only
>> way I can see to help the situation is to make the mempool fairly
>> substantial, which is what I did here. I'd prefer to make the
>> mempools per-fs, but that would require pretty heavy modifications
>> in order to pass around a per-fs struct. In any case, the risk
>> isn't totally eliminated.
>
> The more I look into it, the more I don't think this is an uncommon
> scenario in the kernel. Device mapper draws from a number of
> mempools that can be interspersed with allocations from the bio
> pool. Even without stacking different types of dm devices, I think
> we can run into this scenario. The DM case is probably even worse,
> since those allocs are GFP_NOIO instead of the (relatively) more
> relaxed GFP_NOFS.
>
> I'm not saying it's okay, just that there are similar sites already
> that seem to be easier to hit, and we haven't heard of issues there
> either. It seems to be a behavior that is largely mitigated by
> establishing mempools of sufficient size.
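To make the interleaving above concrete, here is a minimal sketch in
kernel-style C. The pool names (pool_a, pool_b) and the two worker
functions are hypothetical, invented only for illustration; they are
not symbols from the patch set:

#include <linux/mempool.h>
#include <linux/gfp.h>

/* Hypothetical pools for illustration only. */
static mempool_t *pool_a;	/* each backed by its own slab cache */
static mempool_t *pool_b;

/* thread 1: takes from pool A first, then pool B */
static void thread1_work(void)
{
	void *a = mempool_alloc(pool_a, GFP_NOFS);	/* succeeds */
	void *b = mempool_alloc(pool_b, GFP_NOFS);	/* pool B exhausted: sleeps */

	/* ... use a and b, then release ... */
	mempool_free(b, pool_b);
	mempool_free(a, pool_a);
}

/* thread 2: takes from pool B first, then pool A */
static void thread2_work(void)
{
	void *c = mempool_alloc(pool_b, GFP_NOFS);	/* succeeds */
	void *d = mempool_alloc(pool_a, GFP_NOFS);	/* pool A exhausted: sleeps */

	mempool_free(d, pool_a);
	mempool_free(c, pool_b);
}

With __GFP_WAIT set, mempool_alloc() never fails; it sleeps until
someone calls mempool_free() on that pool. If every reserved object in
pool A is held by sleeping thread1-style tasks and every object in
pool B by thread2-style tasks, no free ever arrives and both sides
wait forever. Enlarging the pools only makes that coincidence rarer.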
... and the next installment: My understanding of mempools was
inaccurate, and the pathological case can be hit much more quickly
than I anticipated, or at least have a performance impact I'd rather
not see. It turns out that the pool will be depleted *before* the
other pathological cases are hit. In fact, the pool is hit before
reclaim starts at all.

Rather than:

  normal alloc with gfp_t
  (normal alloc fails after trying reclaim)
  pool alloc
  (pool alloc fails)
  (wait, timeout, retry)

it is:

  normal alloc with gfp_t & ~(__GFP_WAIT|__GFP_IO)
  (which is GFP_NOFS & ~GFP_NOFS, i.e. no flags at all, for the
  common case)

.. so:

  normal alloc with GFP_NOWAIT
  (normal alloc fails, much more often)
  pool alloc
  (pool alloc fails)
  (wait until an object is freed, or timeout, and retry with GFP_NOFS)

(A simplified paraphrase of this mempool_alloc() ordering is sketched
below.)

So, using mempools for all cases as I'd hoped isn't what I want. I'll
work up another set of patches that use a straight kmem_cache_alloc
where possible. The good news is that slab-backed mempools can be
replenished with objects from the slab that weren't allocated from the
pool. So there's that.

-Jeff

-- 
Jeff Mahoney
SUSE Labs
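For reference, the ordering described above follows mempool_alloc() in
mm/mempool.c. This is a simplified paraphrase of that era's logic, a
sketch only, with the spinlock, wait-queue, and gfp-flag-sanitizing
details omitted:

#include <linux/mempool.h>
#include <linux/gfp.h>

void *mempool_alloc_sketch(mempool_t *pool, gfp_t gfp_mask)
{
	void *element;
	/* first pass: no sleeping, no I/O, hence no reclaim */
	gfp_t gfp_temp = gfp_mask & ~(__GFP_WAIT | __GFP_IO);

	for (;;) {
		/* 1) try the underlying allocator with the stripped mask */
		element = pool->alloc(gfp_temp, pool->pool_data);
		if (element)
			return element;

		/* 2) fall back to the reserved pool (locking omitted) */
		if (pool->curr_nr)
			return pool->elements[--pool->curr_nr];

		/* 3) a caller that can't sleep gets NULL here */
		if (!(gfp_mask & __GFP_WAIT))
			return NULL;

		/*
		 * 4) the real code sleeps on the pool's wait queue here
		 * until an element is freed back (or a timeout); this
		 * sketch just retries with the full mask so reclaim can
		 * finally run.
		 */
		gfp_temp = gfp_mask;
	}
}

This is why dipping into the reserve isn't really the fallback path:
with __GFP_WAIT and __GFP_IO masked off on the first pass, the pool is
consumed as soon as the free lists run dry, well before reclaim gets a
chance to replenish anything.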