This patchset changes how we do space reservations for delayed refs.  We were
hitting probably 20-40 enospc abort's per day in production while running
delayed refs at transaction commit time.  This means we ran out of space in the
global reserve and couldn't easily get more space in use_block_rsv().

The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc hueristics to know if
we're running short on space and when it's time to force a commit.  A failure
rate of 20-40 file systems when we run hundreds of thousands of them isn't super
high, but cleaning up this code will make things less ugly and more predictible.

Thus the delayed refs rsv.  We always know how many delayed refs we have
outstanding, and although running them generates more we can use the global
reserve for that spill over, which fits better into it's desired use than a full
blown reservation.  This first approach is to simply take how many times we're
reserving space for and multiply that by 2 in order to save enough space for the
delayed refs that could be generated.  This is a niave approach and will
probably evolve, but for now it works.

With this patchset we've gone down to 2-8 failures per week.  It's not perfect,
there are some corner cases that still need to be addressed, but is
significantly better than what we had.  Thanks,

Josef

Reply via email to