Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption

2016-12-08 Thread Sreekanth Palluru
Correcting typos
Michael,
Thanks for your prompt reply

In my environment those two parameters are enabled. To give you a brief
overview of the PG database environment:
Version 9.2.4.1
Windows 7 Professional SP1
fsync=on
full_page_writes=on
wal_sync_method=open_datasync

My customer builds cancer-related systems, and we ship Dell systems with
our software image, which contains PG. A few of the customers, say around
5%, are facing corruption issues.
We are in the process of reproducing the issue. Since many variables are
involved in reproducing it (Dell HW, software image versions, application
versions, write-cache settings on the RAID/disk, RAID controllers with no
battery backup, power failures, etc.), I am trying to understand whether
PG can end up with corrupted blocks after a system crash even though we
set these parameters.

a) As I understand it, fsync will write the block from memory to disk, and
the block as it was just after step 4) would have been written to disk,
assuming the disk cache did not lie.
b) Assume that full_page_writes=on has dumped the whole 8k block into WAL
before it updates the block, i.e. after step 2) and before step 3).
c) If a crash happens after step 4), since there is no PageHeader data,
after the system restarts PG will complain that the block is corrupted or
has an invalid header.
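
A minimal sketch in Python (not PostgreSQL source) of the kind of header
sanity check that sits behind the "invalid page header" error. The 24-byte
header layout and the constants are my reading of the 9.2-era page format,
so treat them as assumptions. The point it illustrates for c): as far as I
understand, a block that is entirely zeros, which is how a freshly extended
block looks on disk, is treated as an empty page rather than rejected, so
the error usually means the header bytes themselves hold garbage.

```python
import struct

BLCKSZ = 8192                # default block size
PAGE_HEADER_SIZE = 24        # assumed size of the fixed page header
PG_PAGE_LAYOUT_VERSION = 4   # assumed layout version for 9.2
VALID_FLAG_BITS = 0x0007     # assumed set of defined pd_flags bits
MAXALIGN = 8                 # assumed maximum alignment of the platform

# Assumed layout: pd_lsn (2 x uint32), then pd_tli, pd_flags, pd_lower,
# pd_upper, pd_special, pd_pagesize_version (uint16 each), pd_prune_xid.
HEADER = struct.Struct("<IIHHHHHHI")


def init_page() -> bytes:
    """Build an initialized, empty page (no tuples, no special space)."""
    header = HEADER.pack(0, 0, 0, 0, PAGE_HEADER_SIZE, BLCKSZ, BLCKSZ,
                         BLCKSZ | PG_PAGE_LAYOUT_VERSION, 0)
    return header + bytes(BLCKSZ - PAGE_HEADER_SIZE)


def page_header_looks_valid(page: bytes) -> bool:
    """Simplified sanity check; the real check may differ in detail."""
    if len(page) != BLCKSZ:
        return False
    # An all-zero block is accepted: it is a new page nobody initialized yet.
    if page == bytes(BLCKSZ):
        return True
    (_, _, _, flags, lower, upper, special,
     size_version, _) = HEADER.unpack_from(page)
    return ((size_version & 0xFF00) == BLCKSZ
            and (size_version & 0x00FF) == PG_PAGE_LAYOUT_VERSION
            and (flags & ~VALID_FLAG_BITS) == 0
            and PAGE_HEADER_SIZE <= lower <= upper <= special <= BLCKSZ
            and special % MAXALIGN == 0)
```

With this model, `bytes(BLCKSZ)` and `init_page()` both pass, while a block
whose first sector is zeroed but whose tail still holds data fails.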

Please correct me if my understanding of the roles of fsync and
full_page_writes is wrong. If it is correct, I see a possibility of getting
corruption whenever PG extends a relation and a crash happens just after
step 4).

I am not sure whether the same applies to an existing page (not a new
page), and how PG handles it if the PageHeader is available as part of
full_page_writes. Can the same corruption happen, or can PG recover the
database? I am not sure whether the recovery process can update the
PageHeader from the WAL record it wrote (recptr) as part of step 4).

-Sreekanth





-- 
Regards
Sreekanth


Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption

2016-12-08 Thread Sreekanth Palluru
Michael,
Thanks for your prompt reply

In my environment those two parameters are enabled. To give you a brief
overview of the PG database environment:
Version 9.2.4.1
Windows 7 Professional SP1
fsync=on
full_page_writes=on
wal_sync_method=open_datasync

My customer builds cancer-related systems, and we ship Dell systems with
our software image, which contains PG. A few of the customers, say around
5%, are facing corruption issues.
We are in the process of reproducing the issue. Since many variables are
involved in reproducing it (Dell HW, software image versions, application
versions, write-cache settings on the RAID/disk, RAID controllers with no
backup, power failures, etc.), I am trying to understand whether PG can
end up with corrupted blocks after a system crash.

1) As I understand it, fsync will write the block from memory to disk, and
the block as it was just after step 4) would have been written to disk,
assuming the disk cache did not lie.
2) Assume that full_page_writes=on has dumped the whole 8k block into WAL
before it updates the block, i.e. after step 2) and before step 3).
3) If a crash happens after step 4), since there is no PageHeader data,
after the system restarts PG will complain that the block is corrupted or
has an invalid header.

Please correct me if my understanding of the roles of fsync and
full_page_writes is wrong. If it is correct, I see a possibility of getting
corruption whenever PG extends a relation and a crash happens just after
step 4).

I am not sure whether the same applies to an existing page (not a new
page), and how PG handles it if the PageHeader is available as part of
full_page_writes. Can the same corruption happen, or can PG recover the
database? I am not sure whether the recovery process can update the
PageHeader from the WAL record it wrote (recptr) as part of step 4).


-Sreekanth






-- 
Regards
Sreekanth


Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption

2016-12-08 Thread Michael Paquier
(Please don't top-post, that's annoying.)

On Fri, Dec 9, 2016 at 10:28 AM, Sreekanth Palluru  wrote:
> Can I generalize that if, after step 4), the page (new or old) gets
> written to disk from the buffer and a crash happens between steps 4) and
> 5), we always get block corruption in Postgres, which can only be
> recovered from by setting zero_damaged_pages if we just have pg_dump
> backups and are OK with losing the data in the affected blocks?
>
> I am also looking at ways of reproducing the issue. I would appreciate
> your advice on it.

Postgres is designed to avoid such corruption problems if
full_page_writes and fsync are enabled; that's a cornerstone of its
reliability. If you can create a self-contained scenario able to
reproduce a failure, that could be treated as a Postgres bug, but you
are giving no evidence that this is the case.
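
To make the role of full_page_writes concrete, here is a toy model in
Python (not PostgreSQL code) of a torn page being repaired at recovery. All
names (Page, WalRecord, torn_write, redo, build_wal) are hypothetical, and a
page is reduced to an LSN plus four "sectors". It assumes, as a
simplification, that redo skips a record when the page's LSN says the change
is already there, and that the first change after a checkpoint carries an
image of the whole page when full_page_writes is on.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Page:
    lsn: int                   # lives in the page header, i.e. in sector 0
    sectors: Tuple[str, ...]   # toy contents, one entry per disk sector


@dataclass(frozen=True)
class WalRecord:
    lsn: int
    sector_no: int             # which sector the change touches
    value: str                 # new contents of that sector
    image: Optional[Page]      # full-page image, if one was logged


def apply(page: Page, rec: WalRecord) -> Page:
    """Apply one change and stamp the page with the record's LSN."""
    sectors = list(page.sectors)
    sectors[rec.sector_no] = rec.value
    return Page(rec.lsn, tuple(sectors))


def torn_write(old: Page, new: Page, sectors_written: int) -> Page:
    """What the disk holds if power fails after only the first sectors land."""
    if sectors_written == 0:
        return old
    return Page(new.lsn,
                new.sectors[:sectors_written] + old.sectors[sectors_written:])


def redo(disk_page: Page, wal: List[WalRecord]) -> Page:
    """Replay WAL in order; an image replaces the on-disk page outright."""
    page = disk_page
    for rec in wal:
        if rec.image is not None:
            page = rec.image
        elif page.lsn < rec.lsn:
            page = apply(page, rec)
        # otherwise the header claims the change is already on the page
    return page


def build_wal(start: Page, changes, full_page_writes: bool):
    """Log each change; the first one carries an image if enabled."""
    wal, page = [], start
    for i, (sector_no, value) in enumerate(changes):
        rec = WalRecord(page.lsn + 1, sector_no, value, None)
        page = apply(page, rec)
        if full_page_writes and i == 0:
            rec = WalRecord(rec.lsn, sector_no, value, page)
        wal.append(rec)
    return wal, page


checkpointed = Page(10, ("a", "b", "c", "d"))
changes = [(3, "D"), (0, "A")]
wal_fpw, final = build_wal(checkpointed, changes, full_page_writes=True)
wal_plain, _ = build_wal(checkpointed, changes, full_page_writes=False)
on_disk = torn_write(checkpointed, final, sectors_written=1)

print(redo(on_disk, wal_fpw) == final)    # True
print(redo(on_disk, wal_plain) == final)  # False
```

In the torn case the header (sector 0) already carries the newest LSN while
the last sector is stale. Without the image, redo trusts the header and
skips both records; with it, the page is rebuilt from a known-good copy.
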
-- 
Michael


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption

2016-12-08 Thread Sreekanth Palluru
Michael,
Can I generalize that if, after step 4), the page (new or old) gets
written to disk from the buffer and a crash happens between steps 4) and
5), we always get block corruption in Postgres, which can only be
recovered from by setting zero_damaged_pages if we just have pg_dump
backups and are OK with losing the data in the affected blocks?

I am also looking at ways of reproducing the issue. I would appreciate
your advice on it.





-- 
Regards
Sreekanth


Re: [GENERAL] Would like to below scenario is possible for getting page/block corruption

2016-12-08 Thread Michael Paquier
On Fri, Dec 9, 2016 at 9:46 AM, Sreekanth Palluru  wrote:
> Hi,
> I am working on a page corruption issue and want to know if the scenario
> below is possible:
>
> 1) An INSERT command arrives from a client; I understand heap_insert in
>    heapam.c is called.
> 2) Let us say the table is full, so the relation is extended and a new
>    block is added.
> 3) The tuple is inserted into the new page for the block
>    (RelationPutHeapTuple in hio.c).
> 4) Later, a WAL record is inserted through
>    recptr = XLogInsert(RM_HEAP_ID, info);
> 5) Then the backend updates the PageHeader with the WAL LSN details:
>    PageSetLSN(page, recptr);
>
> If my server crashed after step 4), is there a possibility that after the
> Postgres database restarts I get the error below when I access the
> relation, when vacuum is run on this relation, or when taking a backup
> through pg_dump?
> ERROR:  invalid page header in block 204 of relation base/16413/16900

So the block is corrupted. You may want to move to another server.

> Or can Postgres automatically recover the page without throwing any
> error?

At crash recovery, Postgres would redo things from a point where
everything was consistent on disk. If this corrupted page made it to
disk, there is not much that can be done except restoring from a
backup. You could also set zero_damaged_pages to help here, but you
would lose the data on this page; still, you would be able to perform
pg_dump and get back as much data as you can. At the same time,
corruption can spread as well if that's a hardware problem, in which
case you are just seeing the beginning of a series of problems.
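
As a rough illustration of "redo from a consistent point", here is a toy
model in Python (not PostgreSQL code) of steps 3) to 5) and the
WAL-before-data rule. ToyCluster and its methods are hypothetical names. It
assumes that steps 3) to 5) only touch memory, that a data page is never
written before the WAL up to its LSN is on disk, and that checkpoints can be
ignored for the purpose of the sketch.

```python
class ToyCluster:
    """Toy model: 'disk_*' state survives a crash, the rest is lost."""

    def __init__(self):
        self.disk_pages = {}   # block -> (lsn, [tuples]) as written to disk
        self.disk_wal = []     # flushed WAL records: (lsn, block, tuple)
        self.buffers = {}      # block -> [lsn, [tuples]] in shared memory
        self.wal_buffer = []   # WAL records not yet flushed
        self.next_lsn = 1

    def insert(self, block, tup):
        if block not in self.buffers:
            lsn, tuples = self.disk_pages.get(block, (0, []))
            self.buffers[block] = [lsn, list(tuples)]
        buf = self.buffers[block]
        buf[1].append(tup)                         # step 3: tuple on the page
        lsn = self.next_lsn
        self.next_lsn += 1
        self.wal_buffer.append((lsn, block, tup))  # step 4: WAL record
        buf[0] = lsn                               # step 5: stamp page LSN
        return lsn

    def flush_wal(self, upto_lsn):
        while self.wal_buffer and self.wal_buffer[0][0] <= upto_lsn:
            self.disk_wal.append(self.wal_buffer.pop(0))

    def flush_page(self, block):
        lsn, tuples = self.buffers[block]
        self.flush_wal(lsn)                        # WAL first, then the page
        self.disk_pages[block] = (lsn, list(tuples))

    def crash(self):
        self.buffers.clear()
        self.wal_buffer.clear()

    def recover(self):
        for lsn, block, tup in self.disk_wal:
            page_lsn, tuples = self.disk_pages.get(block, (0, []))
            if page_lsn < lsn:                     # skip changes already there
                self.disk_pages[block] = (lsn, list(tuples) + [tup])


db = ToyCluster()
lsn = db.insert(204, "row-1")
db.flush_wal(lsn)            # commit makes the WAL record durable
db.crash()                   # the data page itself never reached disk
db.recover()
print(db.disk_pages[204])    # (1, ['row-1'])
```

In this model a crash anywhere between steps 3) and 5) loses only memory
state: either the WAL record was flushed and redo rebuilds the page, or it
was not and the insert simply never happened.
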
-- 
Michael

