On Fri, Oct 28, 2016 at 3:16 PM, Jim Nasby <jim.na...@bluetreble.com> wrote:
> On 10/28/16 8:23 AM, Merlin Moncure wrote:
>> On Thu, Oct 27, 2016 at 6:39 PM, Greg Stark <st...@mit.edu> wrote:
>>> On Thu, Oct 27, 2016 at 9:53 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
>>>> I think we can rule out faulty storage
>>>
>>> Nobody ever expects the faulty storage
>
> LOL
>
>> Believe me, I know. But the evidence points elsewhere in this case;
>> this is clearly application driven.
>
> FWIW, just because it's triggered by specific application behavior doesn't
> mean there isn't a storage bug. That's what makes data corruption bugs such
> a joy to figure out.
>
> BTW, if you haven't already, I would reset all your storage related options
> and GUCs to safe defaults... plain old fsync, no cute journal / FS / mount
> options, etc. Maybe this is related to the app, but the most helpful thing
> right now is to find some kind of safe config so you can start bisecting.
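(For anyone following along, "safe defaults" here boils down to roughly the
following. This is only a sketch: the parameter list isn't exhaustive,
connection options are omitted, and ALTER SYSTEM assumes a 9.4+ server.)

    # look at the storage-safety GUCs as they currently stand
    for guc in fsync full_page_writes synchronous_commit wal_sync_method; do
        echo -n "$guc = "; psql -Atc "SHOW $guc"
    done

    # put anything suspicious back to the safe values, then reload
    psql -c "ALTER SYSTEM SET fsync = on"
    psql -c "ALTER SYSTEM SET full_page_writes = on"
    psql -c "ALTER SYSTEM SET synchronous_commit = on"
    psql -c "SELECT pg_reload_conf()"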
Upthread, you might have noticed that I already did that. Here is the other
evidence:

*) server running fine for 5+ years
*) other database on the same cluster, with 10x the write activity, not impacted
*) no interesting logs reported in /var/log/messages, dmesg, etc.
*) SAN fabric turns over petabytes/day with no corruption. 100+ postgres
   clusters, 1000+ SQL Server clusters (and that's not production)
*) storage/network teams have been through everything. nothing
   interesting/unusual to report
*) we have an infrequently run routine (posted upthread) that, when run,
   crashed the database within minutes
*) after turning on checksums, 30% of invocations of the routine resulted in
   checksum errors
*) the problem re-occurred after a dump-restore and full cluster rebuild
*) the checksum error caused the routine to roll back. FWICT this prevented
   the damage
*) everything is fine now that the routine is not being run anymore

You can come up with your conclusion; I've come up with mine. The only
frustrating thing here is that I can't reproduce this outside of the
production environment. If this database goes down, I have 30 people sitting
around, so I can't take downtime lightly.

> I would also consider alternatives to plsh, just to rule it out if nothing
> else. I'd certainly look at some way to get sqsh out of the loop (again,
> just to get something that doesn't crash). First idea that comes to mind is
> a stand-alone shell script that watches a named pipe for a filename; when it
> gets that file it runs it with sqsh and does something to signal completion.

I do a lot of ETL to/from SQL Server and it's all sqsh based (a rough sketch
of that pipe-watcher idea is below the sig). If I can figure out how to
reproduce this in a better way, I'll zero in on the problem in about 10
minutes.

merlin
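P.S. For reference, the pipe-watcher idea would look something like the
following. Sketch only: the fifo path, connection details, and the .done
marker convention are made up, and error handling is omitted.

    #!/bin/bash
    # Watch a named pipe for filenames; run each file through sqsh outside
    # the postgres backend, then drop a .done marker the caller can poll.
    PIPE=/var/run/etl/sqsh.fifo            # hypothetical location
    [ -p "$PIPE" ] || mkfifo "$PIPE"

    while true; do
        # the read blocks until a writer (e.g. the plsh caller) sends a filename
        if read -r sqlfile < "$PIPE"; then
            sqsh -S mssql_host -U etl_user -P "$ETL_PASSWORD" -i "$sqlfile" \
                > "$sqlfile.out" 2>&1
            touch "$sqlfile.done"          # signal completion to the caller
        fi
    done

The point being that the backend only ever echoes a filename into the pipe
and waits for the .done file, so sqsh never runs inside the server process.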