Tom Lane wrote:
> Marco Colombo <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> However this would seem to imply disk drive misfeasance above and
>>> beyond your motherboard problem.
>>
>> Well, no. How about this theory:
>>
>> 1) everything is ok: the backend executes write()/fsync() for
>>    transactions 1-5
>> 2) hardware fails somehow at MB level (imagine CPU/RAM overheating):
>>    RAM gets corrupted - kernel starts oopsing (but goes on);
>>    meanwhile, the backend executes write()/fsync() for transactions
>>    6-10, but randomly corrupted data gets written to disk.
>> 3) unrecoverable kernel error occurs, the show stops.
>>
>> On recovery, transactions 6-9 don't even look like valid log entries,
>> while 10, for some reason, does (maybe only data is corrupted).
>>
>> I'm not familiar with the details of WAL files and post-crash
>> recovery, but is that possible? Or does the process stop at the
>> first failure?
>
> Recovery will stop at the first corrupted record, so it would not
> happen like that. But you are right, the MB failure alone might have
> been enough to corrupt the outgoing WAL log data and thus produce the
> scenario I described. Once Postgres *thinks* transactions 1-10 are
> safely down to disk in the WAL log, it will feel free to update the
> data files in any random order that seems convenient. So the write of
> record 10 could have occurred before the rest, and if that happened
> not to get corrupted by the MB problem, we could see the result lec
> describes.
>
> Of course this is all guesswork since we have no direct evidence to
> look at, but it seems fairly plausible.
>
>> Anyway, if your CPU/RAM is failing, no DB technology can save you.
>
> Agreed. Software certainly cannot make any guarantees if it can't
> even execute correctly ...

Same here. I don't even want to have to prove anything if the hardware isn't reliable, but "management" asks about the lost transactions and blames the system/software/database. I could prove to them that the lost transactions were due to the system hang, but transaction #10 being there makes my reasoning doubtful.
Thanks for all your feedback and reasoning.

--lec
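
For readers of the archive, the point that recovery stops at the first corrupted record can be illustrated with a toy model. This is only a sketch under loud assumptions: the record layout, `make_record`, `corrupt` and `replay` are hypothetical names invented here, and the per-record CRC32 check merely stands in for whatever validation the real WAL code performs. It is not PostgreSQL's actual format or recovery logic.

```python
import zlib
from typing import List, Tuple

# Hypothetical toy record: (transaction id, payload, CRC32 stored at write time).
Record = Tuple[int, bytes, int]


def make_record(txid: int, payload: bytes) -> Record:
    """Build a log record whose checksum matches its payload."""
    return (txid, payload, zlib.crc32(payload))


def corrupt(record: Record) -> Record:
    """Flip one bit of the payload but keep the old checksum,
    mimicking data damaged on its way to disk."""
    txid, payload, crc = record
    damaged = bytes([payload[0] ^ 0x01]) + payload[1:]
    return (txid, damaged, crc)


def replay(log: List[Record]) -> List[int]:
    """Replay records in order and stop at the first one that fails
    its checksum; later records are never examined."""
    applied = []
    for txid, payload, crc in log:
        if zlib.crc32(payload) != crc:
            break
        applied.append(txid)
    return applied


# Transactions 1-10 are logged; 6-9 are damaged, 10 happens to be intact.
log = [make_record(i, f"txn {i}".encode()) for i in range(1, 11)]
clean_log = list(log)
for idx in range(5, 9):
    log[idx] = corrupt(log[idx])

print(replay(log))
```

In this model, replay from the log alone never reaches record 10, even though that record is intact. That is consistent with the explanation quoted above: a surviving transaction 10 would have to come from a data-file write that happened before the crash, not from log replay.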