Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption
Correcting typos.

Michael,
Thanks for your prompt reply.

In my environment those two parameters are enabled. To give you a brief overview of the PG database environment:

Version 9.2.4.1
Windows 7 Professional SP1
fsync=on
full_page_writes=on
wal_sync_method=open_datasync

My customer builds cancer-related systems, and we ship Dell systems with our software image, which contains PG. A few of the customers, around 5%, are facing corruption issues.

We are in the process of reproducing the issue. Since many variables are involved (Dell hardware, software image versions, application versions, RAID/disk write-cache settings, RAID controllers with no battery backup, power failures, etc.), I am trying to understand whether PG can end up with corrupted blocks after a system crash even though we set these parameters.

a) As I understand it, fsync will write the block from memory to disk, and just after step 4) the block would have been written to disk, assuming the disk cache did not lie.
b) Assume that full_page_writes=on has dumped the whole 8k block into the WAL before it updates the block, i.e. after step 2) and before step 3).
c) If a crash happens after step 4), since there is no PageHeader data, after the system restarts PG will complain that it is a corrupted block or an invalid header.

Please correct me if my understanding of how fsync and full_page_writes come into play is wrong. If it is right, I see a possibility of getting corruption whenever PG extends a relation and a crash happens just after step 4).

I am not sure whether the same applies to an existing page (not a new page), and how it is handled if the PageHeader is available as part of full_page_writes. Can the same corruption happen, or can PG recover the database? I am not sure whether the recovery process can update the PageHeader from the WAL records, using the recptr it wrote as part of step 4).
-Sreekanth

On Fri, Dec 9, 2016 at 2:09 PM, Sreekanth Palluru wrote:
> [...]
Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption
Michael,
Thanks for your prompt reply.

In my environment those two parameters are enabled. To give you a brief overview of the PG database environment:

Version 9.2.4.1
Windows 7 Professional SP1
fsync=on
full_page_writes=on
wal_sync_method=open_datasync

My customer builds cancer-related systems, and we ship Dell systems with our software image, which contains PG. A few of the customers, around 5%, are facing corruption issues.

We are in the process of reproducing the issue. Since many variables are involved (Dell hardware, software image versions, application versions, RAID/disk write-cache settings, RAID controllers with no backup, power failures, etc.), I am trying to understand whether PG can end up with corrupted blocks after a system crash.

1) As I understand it, fsync will write the block from memory to disk, and just after step 4) the block would have been written to disk, assuming the disk cache did not lie.
2) Assume that full_page_writes=on has dumped the whole 8k block into the WAL before it updates the block, i.e. after step 2) and before step 3).
3) If a crash happens after step 4), since there is no PageHeader data, after the system restarts PG will complain that it is a corrupted block or an invalid header.

Please correct me if my understanding of how fsync and full_page_writes come into play is wrong. If it is right, I see a possibility of getting corruption whenever PG extends a relation and a crash happens just after step 4).

I am not sure whether the same applies to an existing page (not a new page), and how it is handled if the PageHeader is available as part of full_page_writes. Can the same corruption happen, or can PG recover the database? I am not sure whether the recovery process can update the PageHeader from the WAL records, using the recptr it wrote as part of step 4).
-Sreekanth

On Fri, Dec 9, 2016 at 12:44 PM, Michael Paquier wrote:
> [...]
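Since the shipped images vary between customers, it may help to confirm that every deployed postgresql.conf really has the durability settings listed above. The following is only a rough sketch: the function names `parse_conf` and `risky_settings` are made up for illustration, the parser is a simplification (it ignores include files and quoted values containing `#`), and it flags any boolean spelling it does not recognise. It relies on the fact that both fsync and full_page_writes default to on when they are not set.

```python
# Boolean spellings treated as "enabled" in this sketch.
TRUE_VALUES = {"on", "true", "yes", "1"}


def parse_conf(text):
    """Parse 'name = value' lines from postgresql.conf-style text (simplified)."""
    settings = {}
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if "=" in line:
            name, value = line.split("=", 1)
        else:
            # The '=' is optional; fall back to whitespace separation.
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            name, value = parts
        settings[name.strip().lower()] = value.strip().strip("'").lower()
    return settings


def risky_settings(text):
    """Return the durability settings that are not recognisably enabled."""
    settings = parse_conf(text)
    return sorted(
        name
        for name in ("fsync", "full_page_writes")
        if settings.get(name, "on") not in TRUE_VALUES
    )


sample = """
fsync = on                      # flush data to disk
full_page_writes = off          # hypothetical misconfigured image
wal_sync_method = open_datasync
"""
print(risky_settings(sample))
```

A check like this only covers the PostgreSQL side. It says nothing about whether the disk or RAID controller write cache honours the flush requests.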
Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption
(Please top-post that's annoying)

On Fri, Dec 9, 2016 at 10:28 AM, Sreekanth Palluru wrote:
> Can I generalize that if, after step 4), the page (new or old) got
> written to disk from the buffer and a crash happens between steps 4)
> and 5), we always get block corruption in Postgres, which can only be
> recovered from by setting zero_damaged_pages if we have only pg_dump
> backups and are OK with losing the data in the affected blocks?
>
> I am also looking at ways of reproducing the issue. I would appreciate
> your advice on that.

Postgres is designed to avoid such corruption problems if
full_page_writes and fsync are enabled; that's a cornerstone of its
reliability. If you can create a self-contained scenario able to
reproduce a failure, that could be treated as a Postgres bug, but you
are giving no evidence that this is the case.
--
Michael

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption
Michael,

Can I generalize that if, after step 4), the page (new or old) got written to disk from the buffer and a crash happens between steps 4) and 5), we always get block corruption in Postgres, which can only be recovered from by setting zero_damaged_pages if we have only pg_dump backups and are OK with losing the data in the affected blocks?

I am also looking at ways of reproducing the issue. I would appreciate your advice on that.

On Fri, Dec 9, 2016 at 12:01 PM, Michael Paquier wrote:
> [...]

--
Regards
Sreekanth
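When trying to reproduce the issue, it can be useful to find out which blocks of a copied relation file would trigger the "invalid page header" error before deciding whether zero_damaged_pages is acceptable. The sketch below is modelled loosely on the sanity checks PostgreSQL applies to a page header. It is not the server's exact logic: it assumes the default 8 kB block size and a little-endian machine, it skips the flag-bit and alignment checks, and the function names are made up for illustration. Note that an all-zero page passes the check, as it does in PostgreSQL, since that is what a freshly extended but never-written block looks like.

```python
import struct

BLCKSZ = 8192  # assumed default block size

# Page header layout: pd_lsn (two uint32), pd_tli in 9.2 (pd_checksum in
# later releases), pd_flags, pd_lower, pd_upper, pd_special,
# pd_pagesize_version, pd_prune_xid.
HEADER_FORMAT = "<IIHHHHHHI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


def page_header_looks_valid(page):
    """Simplified header sanity check for one block (a sketch, not the real code)."""
    if len(page) != BLCKSZ:
        return False
    if page == bytes(BLCKSZ):
        # All-zero pages are treated as valid new pages.
        return True
    fields = struct.unpack_from(HEADER_FORMAT, page)
    pd_lower, pd_upper, pd_special, pagesize_version = fields[4:8]
    page_size = pagesize_version & 0xFF00
    return (
        page_size == BLCKSZ
        and HEADER_SIZE <= pd_lower <= pd_upper <= pd_special <= BLCKSZ
    )


def find_suspect_blocks(data):
    """Return the block numbers in a relation file image that fail the check."""
    return [
        offset // BLCKSZ
        for offset in range(0, len(data), BLCKSZ)
        if not page_header_looks_valid(data[offset:offset + BLCKSZ])
    ]


# Usage on a copy of a relation file (never on a live data directory):
# with open("16900.copy", "rb") as f:   # hypothetical file name
#     print(find_suspect_blocks(f.read()))
```

A block that passes this check can still contain damaged tuples; the check only explains which blocks would be rejected at read time.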
Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption
On Fri, Dec 9, 2016 at 9:46 AM, Sreekanth Palluru wrote:
> Hi,
> I am working on a page corruption issue and want to know if the
> scenario below is possible:
>
> 1) An insert command arrives from the client; I understand heap_insert
>    is called from heapam.c.
> 2) Let us say the table is full, so the relation is extended and a new
>    block is added.
> 3) The tuple is inserted into the new page for the block
>    (RelationPutHeapTuple in hio.c).
> 4) Later, the WAL record is inserted through
>    recptr = XLogInsert(RM_HEAP_ID, info);
> 5) Then the backend updates the PageHeader with the WAL LSN details:
>    PageSetLSN(page, recptr);
>
> If my server crashed after step 4), is there a possibility that after
> the Postgres database restarts I get the error below when I access the
> relation, when vacuum is run on this relation, or when taking a backup
> through pg_dump?
> ERROR: invalid page header in block 204 of relation base/16413/16900

So the block is corrupted. You may want to move to another server.

> or
> Postgres can automatically recover the page without throwing any error?

At crash recovery, Postgres would redo things from a point where
everything was consistent on disk. If this corrupted page made it to
disk, there is not much that can be done except restoring from a
backup. You could also set zero_damaged_pages to help here, but you
would lose the data on this page; still, you would be able to run
pg_dump and get back as much data as you can. At the same time,
corruption can spread as well if that's a hardware problem, so you
are just seeing the beginning of a series of problems.
--
Michael
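The redo behaviour described above can be illustrated with a toy model. This is not PostgreSQL code: `ToyStore` and its methods are hypothetical names, and the model assumes the WAL record with a full-page image is truly durable before the data page is written, which is what fsync=on and full_page_writes=on are meant to guarantee when the storage does not lie about flushes. Under that assumption, a torn data page left by a crash is simply overwritten by the logged image during recovery.

```python
BLOCK_SIZE = 8192


class ToyStore:
    """Toy model of a data file plus a write-ahead log (illustration only)."""

    def __init__(self):
        self.disk_pages = {}  # block number -> bytes currently on "disk"
        self.wal = []         # durable records: (lsn, block, full-page image)
        self.next_lsn = 1

    def log_full_page(self, block, image):
        """Analogue of step 4: the WAL record becomes durable first."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.wal.append((lsn, block, bytes(image)))
        return lsn

    def write_page(self, block, image):
        """Write a data page; a crash may leave only part of it on disk."""
        self.disk_pages[block] = bytes(image)

    def recover(self):
        """Crash recovery: replay every logged page image in LSN order."""
        for _lsn, block, image in self.wal:
            self.disk_pages[block] = image


def simulate_crash_after_wal_insert():
    store = ToyStore()
    good_page = b"tuple-data".ljust(BLOCK_SIZE, b"\x00")
    store.log_full_page(block=204, image=good_page)
    # Crash during the data-file write: only the first 512 bytes are intact.
    store.write_page(204, good_page[:512] + b"\xff" * (BLOCK_SIZE - 512))
    torn_before = store.disk_pages[204] != good_page
    store.recover()
    repaired_after = store.disk_pages[204] == good_page
    return torn_before, repaired_after


print(simulate_crash_after_wal_insert())
```

If the WAL record itself never reached stable storage, for example because a write cache without battery backup acknowledged the flush and then lost power, the assumption behind this model no longer holds and replay has nothing to repair the page from.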