On Tue, Nov 1, 2016 at 8:56 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmonc...@gmail.com> writes:
>> On Mon, Oct 31, 2016 at 10:32 AM, Oskari Saarenmaa <o...@ohmu.fi> wrote:
>>> Your production system's postgres backends probably have a lot more open
>>> files associated with them than the simple test case does. Since Postgres
>>> likes to keep files open as long as possible and only closes them when you
>>> need to free up fds to open new files, it's possible that your production
>>> backends have almost all allowed fds used when you execute your pl/sh
>>> function.
>>>
>>> If that's the case, the sqsh process that's executed may not have enough fds
>>> to do what it wanted to do and because of busted error handling could end up
>>> writing to fds that were opened by Postgres and point to $PGDATA files.
>
>> Does that apply? the mechanics are a sqsh function that basically does:
>> cat foo.sql | sqsh <args>
>> pipe redirection opens a new process, right?
>
> Yeah, but I doubt that either level of the shell would attempt to close
> inherited file handles.
>
> The real problem with Oskari's theory is that it requires not merely
> busted, but positively brain-dead error handling in the shell and/or
> sqsh, ie ignoring open() failures altogether. That seems kind of
> unlikely. Still, I suspect he might be onto something --- there must
> be some reason you can reproduce the issue in production and not in
> your test bed, and number-of-open-files is as good a theory as I've
> heard.
>
> Maybe the issue is not with open() failures, but with the resulting
> FD numbers being much larger than sqsh is expecting. It would be
> weird to try to store an FD in something narrower than int, but
> I could see a use of select() being unprepared for large FDs.
> Still, it's hard to translate that idea into scribbling on the
> wrong file...
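For the archives, the select() hazard is easy to show: FD_SET() sets a
bit in a fixed FD_SETSIZE-bit array (1024 on most platforms), so a
descriptor at or above that limit writes past the end of the fd_set with
no error reported. A toy C sketch of the mechanism (nothing below is
taken from sqsh, it just demonstrates the failure mode Tom describes):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/select.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Raise the soft fd limit so we can actually reach FD_SETSIZE;
         * a busy Postgres backend gets there on its own. */
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0
            && rl.rlim_cur < (rlim_t) FD_SETSIZE + 16)
        {
            rl.rlim_cur = FD_SETSIZE + 16;
            if (rl.rlim_cur > rl.rlim_max)
                rl.rlim_cur = rl.rlim_max;
            (void) setrlimit(RLIMIT_NOFILE, &rl);
        }

        /* Burn descriptors until the kernel hands out a big one, the
         * way a backend with ~1000 files open would to a child. */
        int fd;
        do {
            fd = open("/dev/null", O_RDONLY);
            if (fd < 0) {
                perror("open");
                return 1;
            }
        } while (fd < FD_SETSIZE);

        printf("got fd %d, FD_SETSIZE is %d\n", fd, FD_SETSIZE);

        fd_set rset;
        FD_ZERO(&rset);
        if (fd >= FD_SETSIZE) {
            /* FD_SET(fd, &rset) here would set a bit past the end of
             * the FD_SETSIZE-bit array inside rset: silent memory
             * corruption, not a reported error.  Code that skips this
             * check and calls select() anyway is exactly the case Tom
             * is talking about. */
            fprintf(stderr, "fd %d too big for select(), need poll()\n", fd);
            return 1;
        }
        FD_SET(fd, &rset);   /* only safe because of the check above */
        return 0;
    }

Robust code uses poll() or checks fd < FD_SETSIZE first. That said, if
sqsh skipped the check somewhere, the corruption would land in its own
stack, which still doesn't obviously explain writes ending up in
$PGDATA files.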
Looking at the sqsh code, nothing really stands out. It's highly
developed and all the obvious errors are checked. There certainly could
be a freak bug in there (or in FreeTDS, which sqsh links against) doing
the damage, though. In the meantime I'll continue trying to work up a
reliable reproduction; this particular routine only gets called in
batches on a quarterly basis, so things have settled down.

Just a thought: could COPY be tricked into writing into the wrong file
descriptor? For example, if a file was removed with rm -rf and the
fd-pressured backend immediately reopened the same descriptor number for
something else? (Sketch of what I mean below my sig.)

merlin
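To make that concrete: POSIX guarantees open() returns the
lowest-numbered free descriptor, so a close() immediately followed by an
open() recycles the same number, and anything still writing through a
stale copy of the old number scribbles into the new file with no error
at all. A toy sketch (file names made up; this is obviously not the
COPY code):

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Open a scratch file and remember its descriptor number. */
        int stale = open("/tmp/victim_a", O_CREAT | O_WRONLY | O_TRUNC, 0600);
        printf("first file got fd %d\n", stale);

        /* Something closes it (fd-cache pressure, an error path,
         * whatever), but a stale copy of the number survives. */
        close(stale);

        /* The next open() reuses the lowest free number: the same one. */
        int fresh = open("/tmp/victim_b", O_CREAT | O_WRONLY | O_TRUNC, 0600);
        printf("second file got fd %d\n", fresh);  /* same number as stale */

        /* A write through the stale number now lands in the *new* file.
         * Nothing fails; the data just goes to the wrong place. */
        const char msg[] = "scribble meant for victim_a\n";
        if (write(stale, msg, strlen(msg)) < 0)
            perror("write");

        close(fresh);
        return 0;
    }

Note the rm itself isn't the dangerous part (an open fd stays valid
after unlink); the danger is a stale descriptor number surviving a
close/reopen cycle.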