So, the commit you referenced is the one that introduced the behavior of "in these error cases, we can't reliably figure out what's in use, so 'leak the space', log that we leaked it, and that way we don't corrupt things, at the cost of space".
So I would posit that the problem may be one of a leaking/corruption condition being triggered that simply wasn't noticed prior to said commit, rather than a logical flaw in the commit itself. You've helpfully included all the resources needed for someone to replicate this on their own, which should help a lot, presuming that people don't immediately find counter-examples to your train of logic (which doesn't seem flawed to my eyes, at the moment). Have you tried this on any non-FreeBSD platforms yet, to see if it's as reliably reproducible there? I'm honestly kind of terrified by the prospect of a poisonous dataset that can cause your pool to leak space until it's gone, so I'll probably test this as soon as I set up a testbed with nothing else on it that I'd mind losing...

- Rich

On Wed, Oct 15, 2014 at 12:45 AM, Steven Hartland via illumos-zfs <z...@lists.illumos.org> wrote:
>
> ----- Original Message ----- From: "Steven Hartland"
>>
>> I've been investigating an issue for a user who was seeing
>> his pool import hang after upgrading on FreeBSD. After
>> digging around it turned out the issue was due to lack
>> of free space on the pool.
>>
>> As the pool imports it writes, hence requiring space, but the
>> pool has so little space this was failing. The I/O being
>> a required I/O, it retries, but obviously fails again,
>> resulting in the pool being suspended, hence the hang.
>>
>> With the pool suspended during import it still holds the
>> pool lock, so all attempts to query the status also hang,
>> which is one problem as the user can't tell why the hang
>> has occurred.
>>
>> During the debugging I mounted the pool read only and
>> sent a copy to another empty pool, which resulted in ~1/2
>> capacity being recovered. This seemed odd but I dismissed
>> it at the time.
>>
>> The machine was then left, with the pool not being accessed,
>> however I just received an alert from our monitoring for
>> a pool failure.
>> On looking I now see the new pool I created
>> with 2 write errors and no free space. So just having the
>> pool mounted, with no access happening, has managed to use
>> the remaining 2GB of the 4GB pool.
>>
>> Has anyone seen this before or has any ideas what might
>> be going on?
>>
>> zdb -m -m -m -m <pool> shows allocation to transactions, e.g.:
>>
>> metaslab    100   offset  c8000000   spacemap   1453   free        0
>>             segments        0   maxsize       0   freepct    0%
>> In-memory histogram:
>> On-disk histogram:   fragmentation 0
>> [     0] ALLOC: txg 417, pass 2
>> [     1]    A  range: 00c8000000-00c8001600  size: 001600
>> [     2] ALLOC: txg 417, pass 3
>> [     3]    A  range: 00c8001600-00c8003a00  size: 002400
>> [     4] ALLOC: txg 418, pass 2
>> [     5]    A  range: 00c8003a00-00c8005000  size: 001600
>> [     6] ALLOC: txg 418, pass 3
>> [     7]    A  range: 00c8005000-00c8006600  size: 001600
>> [     8] ALLOC: txg 419, pass 2
>> [     9]    A  range: 00c8006600-00c8007c00  size: 001600
>> [    10] ALLOC: txg 419, pass 3
>>
>> I tried destroying the pool and that hung, presumably due to
>> IO being suspended after the out of space errors.
>
> After bisecting the kernel changes, the commit which seems
> to be causing this is:
> https://svnweb.freebsd.org/base?view=revision&revision=268650
> https://github.com/freebsd/freebsd/commit/91643324a9009cb5fbc8c00544b7781941f0d5d1
> which correlates to:
> https://github.com/illumos/illumos-gate/commit/7fd05ac4dec0c343d2f68f310d3718b715ecfbaf
>
> I've checked that the two make the same changes, so there doesn't
> seem to have been a downstream merge issue, at least not on
> this specific commit.
>
> My test now consists of:
> 1. mdconfig -t malloc -s 4G -S 512
> 2. zpool create tpool md0
> 3. zfs recv -duF tpool < test.zfs
> 4. zpool list -p -o free tpool 5
>
> With this commit present, free reduces every 5 seconds until
> the pool is out of space. Without it, after at most 3 reductions
> the pool settles and no further free space reduction is seen.
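For anyone wanting to automate step 4 of that test, the two outcomes (leaks until exhaustion vs. settles after a few reductions) can be told apart mechanically from a series of `zpool list -p -o free` samples. A minimal sketch, assuming you've already collected the samples as byte counts; the function name and the settle threshold of 3 are illustrative, taken from the behavior described above, not from any existing tool:

```python
def classify_free_samples(samples, settle_after=3):
    """Classify a series of free-space samples (bytes) from a pool.

    Returns "leaking" if free space drops more than `settle_after`
    times across consecutive samples, "settled" otherwise -- matching
    the "at most 3 reductions then stable" behavior described above.
    (Hypothetical helper; thresholds are illustrative.)
    """
    reductions = 0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            reductions += 1
            if reductions > settle_after:
                return "leaking"
    return "settled"


# Example: a pool that stops shrinking vs. one that keeps shrinking.
print(classify_free_samples([4096, 3900, 3800, 3700, 3700, 3700]))  # settled
print(classify_free_samples([4096, 3900, 3800, 3700, 3600, 3500]))  # leaking
```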
> I've also found that creating the pool without async_destroy
> enabled also prevents the issue.
>
> An image that shows the final result of the leak can be found
> here:
> http://www.ijs.si/usr/mark/bsd/
>
> On FreeBSD this image stalls on import unless imported read-only.
> Once imported I used the following to create the test image
> used above:
> zfs send -R zfs/ROOT@auto-2014-09-19_22.30 > test.zfs
>
> Copying in the zfs illumos list to get more eyeballs, given it
> seems to be a quite serious issue.
>
> Regards
> Steve
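P.S. If anyone wants to total up what a `zdb -m -m -m -m` spacemap dump like the one quoted above claims is allocated, the `A range: start-end size: NNN` lines are easy to sum. A quick sketch; the line format and the fact that offsets/sizes are hexadecimal are assumptions based on the output quoted in this thread:

```python
import re

# Matches lines like "A  range: 00c8000000-00c8001600  size: 001600"
# (format assumed from the zdb dump quoted above; values are hex).
RANGE_RE = re.compile(r"A\s+range:\s*([0-9a-f]+)-([0-9a-f]+)\s+size:\s*([0-9a-f]+)")

def total_allocated(dump_text):
    """Sum the sizes of all allocated ranges in a spacemap dump, in bytes."""
    total = 0
    for m in RANGE_RE.finditer(dump_text):
        start, end, size = (int(g, 16) for g in m.groups())
        assert end - start == size  # sanity-check the dump's own arithmetic
        total += size
    return total
```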