On 04/09/2018 04:22 PM, Anthony Iliopoulos wrote:
> On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:
>>
>> We already have dirty_bytes and dirty_background_bytes, for example. I
>> don't see why there couldn't be another limit defining how much dirty
>> data to allow before blocking writes altogether. I'm sure it's not that
>> simple, but you get the general idea - do not allow using all available
>> memory because of writeback issues, but don't throw the data away in
>> case it's just a temporary issue.
>
> Sure, there could be knobs for limiting how much memory such "zombie"
> pages may occupy. Not sure how helpful it would be in the long run,
> since this tends to be highly application-specific, and for something
> with a large data footprint one would end up tuning this accordingly
> in a system-wide manner. This has the potential to leave other
> applications running in the same system with very little memory, in
> cases where for example the original application crashes and never
> clears the error. Apart from that, further interfaces would need to
> be provided for actually dealing with the error (again assuming
> non-transient issues that may not be fixed transparently and that
> temporary issues are taken care of by lower layers of the stack).
>

I don't quite see how this is any different from other possible issues
when running multiple applications on the same system. One application
can generate a lot of dirty data, reaching dirty_bytes and forcing the
other applications on the same host to do synchronous writes.

Of course, you might argue that's a temporary condition - it will
resolve itself once the dirty pages get written to storage. In case of
an I/O issue the impact is permanent, because it will not resolve
itself unless the I/O problem gets fixed.

I'm not sure what further interfaces would need to be provided, though.
Possibly something that says "drop dirty pages for these files" after
the application gets killed, or something like that. That would make
sense, of course.
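Just to be explicit about which knobs I mean: the existing
vm.dirty_bytes and vm.dirty_background_bytes sysctls. A trivial sketch
(Linux-specific, nothing to do with Pg itself; the procfs paths are the
standard ones) that merely prints the current values - tuning them is
the same thing with a write instead of a read, given enough privileges:

/*
 * Print the current vm.dirty_bytes and vm.dirty_background_bytes
 * limits from procfs. A value of 0 means the kernel is using the
 * corresponding *_ratio knob instead.
 */
#include <stdio.h>

static void print_limit(const char *path)
{
    FILE       *f = fopen(path, "r");
    long long   value;

    if (f == NULL || fscanf(f, "%lld", &value) != 1)
    {
        fprintf(stderr, "could not read %s\n", path);
        if (f)
            fclose(f);
        return;
    }

    printf("%s = %lld\n", path, value);
    fclose(f);
}

int main(void)
{
    print_limit("/proc/sys/vm/dirty_bytes");
    print_limit("/proc/sys/vm/dirty_background_bytes");
    return 0;
}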
>> Well, there seem to be kernels that do exactly that already. At least
>> that's how I understand what this thread says about FreeBSD and
>> Illumos, for example. So it's not an entirely insane design, apparently.
>
> It is reasonable, but even FreeBSD has a big fat comment right
> there (since 2017), mentioning that there can be no recovery from
> EIO at the block layer and this needs to be done differently. No
> idea how an application running on top of either FreeBSD or Illumos
> would actually recover from this error (and clear it out), other
> than remounting the fs in order to force dropping of the relevant
> pages. It does indeed provide a persistent error indication that
> would allow Pg to simply and reliably panic. But again this does not
> necessarily play well with other applications that may be using
> the filesystem reliably at the same time, and are now faced with
> EIO while their own writes are successfully persisted.
>

In my experience, when you have a persistent I/O error on a device, it
likely affects all applications using that device. So unmounting the fs
to clear the dirty pages seems like an acceptable solution to me - I
don't see what else the application could do, anyway.

In a way I'm suggesting applications don't really want to be
responsible for the recovery (cleanup of dirty pages etc.). We're more
than happy to hand that over to the kernel, not least because each
kernel will do it differently. What we do want, however, is reliable
information about the fsync outcome, which we need to properly manage
the WAL, checkpoints etc.

> Ideally, you'd want a (potentially persistent) indication of error
> localized to a file region (mapping the corresponding failed writeback
> pages). NetBSD is already implementing fsync_ranges(), which could
> be a step in the right direction.
>

>> One has to wonder how many applications actually use this correctly,
>> considering PostgreSQL cares about data durability/consistency so much
>> and yet we've been misunderstanding how it works for 20+ years.
>
> I would expect it would be very few, potentially those that have
> a very simple process model (e.g. embedded DBs that can abort a
> txn on fsync() EIO). I think that durability is a rather complex
> cross-layer issue which has been grossly misunderstood similarly
> in the past (e.g. see [1]). It seems that both the OS and DB
> communities would greatly benefit from a periodic reality check, and
> I see this as an opportunity for strengthening the IO stack in
> an end-to-end manner.
>

Right. What I was getting at is that perhaps the current fsync()
behavior is not very practical for building actual applications.

> Best regards,
> Anthony
>
> [1]
> https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
>

Thanks, the paper looks interesting.
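FWIW, the conclusion we seem to be converging on for Pg is "check
fsync() and treat any failure as fatal, never retry it". In plain POSIX
C that boils down to something like the sketch below - just an
illustration, not actual Postgres code; the function name and the path
are made up, and exit() stands in for our PANIC path:

/*
 * Durably write a buffer and treat any fsync() failure as fatal.
 * Retrying fsync() is pointless: once it has failed, the kernel may
 * already have marked the dirty pages clean (or dropped them) and
 * cleared the error, so a second call can "succeed" without the data
 * ever having reached stable storage.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void write_durably(const char *path, const void *buf, size_t len)
{
    int     fd = open(path, O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
    {
        fprintf(stderr, "open(%s) failed: %s\n", path, strerror(errno));
        exit(EXIT_FAILURE);
    }

    if (write(fd, buf, len) != (ssize_t) len)
    {
        fprintf(stderr, "write failed: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }

    if (fsync(fd) != 0)
    {
        /* EIO, ENOSPC, ...: give up and recover from the WAL on restart. */
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));
        exit(EXIT_FAILURE);     /* stand-in for PANIC */
    }

    close(fd);
}

int main(void)
{
    const char buf[] = "some data\n";

    write_durably("/tmp/example.dat", buf, sizeof(buf) - 1);
    return 0;
}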
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services