Re: [HACKERS] Page Checksums

2012-01-24 Thread jesper
 * Robert Treat:

 Would it be unfair to assert that people who want checksums but aren't
 willing to pay the cost of running a filesystem that provides
 checksums aren't going to be willing to make the cost/benefit trade
 off that will be asked for? Yes, it is unfair of course, but it's
 interesting how small the camp of those using checksummed filesystems
 is.

 Don't checksumming file systems currently come bundled with other
 features you might not want (such as certain vendors)?

I would chip in and say that I would prefer sticking to well-known, proven
filesystems like xfs/ext4 and letting the application do the checksumming.

I don't foresee fully production-ready checksumming filesystems being readily
available in the standard Linux distributions in the near future.

And yes, I would for sure turn such functionality on if it were present.

-- 
Jesper




Re: [HACKERS] Page Checksums

2012-01-24 Thread Florian Weimer
 I would chip in and say that I would prefer sticking to well-known, proven
 filesystems like xfs/ext4 and letting the application do the checksumming.

Yes, that's a different way of putting my concern.  If you want a proven
file system with checksumming (and an fsck), options are really quite
limited.

 And yes, I would for sure turn such functionality on if it were present.

Same here.  I already use page-level checksum with Berkeley DB.

-- 
Florian Weimer            fwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99



Re: [HACKERS] Page Checksums

2012-01-24 Thread Robert Treat
On Tue, Jan 24, 2012 at 3:02 AM,  jes...@krogh.cc wrote:
 * Robert Treat:

 Would it be unfair to assert that people who want checksums but aren't
 willing to pay the cost of running a filesystem that provides
 checksums aren't going to be willing to make the cost/benefit trade
 off that will be asked for? Yes, it is unfair of course, but it's
 interesting how small the camp of those using checksummed filesystems
 is.

 Don't checksumming file systems currently come bundled with other
 features you might not want (such as certain vendors)?

 I would chip in and say that I would prefer sticking to well-known, proven
 filesystems like xfs/ext4 and letting the application do the checksumming.


*shrug* You could use Illumos or BSD and you'd get generally vendor-free
systems using ZFS, which I'd say offers more well-known and proven
checksumming than anything cooking in Linux land, or than the
as-yet-to-be-written checksumming in Postgres.

 I don't foresee fully production-ready checksumming filesystems being readily
 available in the standard Linux distributions in the near future.

 And yes, I would for sure turn such functionality on if it were present.


That's nice to say, but most people aren't willing to take a 50%
performance hit. Not saying what we end up with will be that bad, but
I've seen people get upset about performance hits much lower than
that.


Robert Treat
conjecture: xzilla.net
consulting: omniti.com



Re: [HACKERS] Page Checksums

2012-01-24 Thread Simon Riggs
On Tue, Jan 24, 2012 at 2:49 PM, Robert Treat r...@xzilla.net wrote:
 And yes, I would for sure turn such functionality on if it were present.


 That's nice to say, but most people aren't willing to take a 50%
 performance hit. Not saying what we end up with will be that bad, but
 I've seen people get upset about performance hits much lower than
 that.

When we talk about a 50% hit, are we discussing (1) checksums that are
checked on each I/O, or (2) checksums that are checked each time we
re-pin a shared buffer?  The 50% hit was my estimate of (2) and has
not yet been measured, so shouldn't be used unqualified when
discussing checksums. The same is true of the "I would use it" comments,
since we're not sure whether you're voting for (1) or (2).

As to whether people will actually use (1), I have no clue. But I do
know that many people request that feature, including people who
run heavy duty Postgres production systems and who also know about
filesystems. Do people need (2)? It's easy enough to add as an option,
once we have (1) and there is real interest.
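
To make the two options concrete, here is a rough sketch of where they differ;
the function names are hypothetical stand-ins, not actual PostgreSQL code:

/* Rough sketch only: hypothetical hook points, not PostgreSQL's actual code. */
#include <stdint.h>
#include <stdbool.h>

/* Stand-in: recompute any standard checksum over the page and compare it
 * with the value stored in the page header. */
static bool
verify_page_checksum(const uint8_t *page)
{
    (void) page;
    return true;
}

/* Option (1): pay the verification cost once, when a block is read from disk. */
static void
after_read_from_disk(const uint8_t *page)
{
    if (!verify_page_checksum(page))
        { /* report corruption */ }
}

/* Option (2): run the same check every time a backend (re)pins the buffer,
 * which happens far more often -- hence the much larger estimated overhead. */
static void
on_buffer_pin(const uint8_t *page)
{
    if (!verify_page_checksum(page))
        { /* report corruption */ }
}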

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Page Checksums

2012-01-24 Thread Jim Nasby
On Jan 24, 2012, at 9:15 AM, Simon Riggs wrote:
 On Tue, Jan 24, 2012 at 2:49 PM, Robert Treat r...@xzilla.net wrote:
 And yes, I would for sure turn such functionality on if it were present.
 
 
 That's nice to say, but most people aren't willing to take a 50%
 performance hit. Not saying what we end up with will be that bad, but
 I've seen people get upset about performance hits much lower than
 that.
 When we talk about a 50% hit, are we discussing (1) checksums that are
 checked on each I/O, or (2) checksums that are checked each time we
 re-pin a shared buffer?  The 50% hit was my estimate of (2) and has
 not yet been measured, so shouldn't be used unqualified when
  discussing checksums. The same is true of the "I would use it" comments,
  since we're not sure whether you're voting for (1) or (2).
 
 As to whether people will actually use (1), I have no clue. But I do
  know that many people request that feature, including people who
 run heavy duty Postgres production systems and who also know about
 filesystems. Do people need (2)? It's easy enough to add as an option,
 once we have (1) and there is real interest.

Some people will be able to take a 50% hit and will happily turn on 
checksumming every time a page is pinned. But I suspect a lot of folks can't 
afford that kind of hit but would really like to have their filesystem cache 
protected (we're certainly in the latter camp).

As for checksumming filesystems, I didn't see any answers about whether the 
filesystem *cache* was also protected by the filesystem checksum. Even if it 
is, the choice of checksumming filesystems is certainly limited... ZFS is the 
only one that seems to have real traction, but that forces you off of Linux, 
which is a problem  for many shops.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Page Checksums

2012-01-23 Thread Robert Treat
On Sat, Jan 21, 2012 at 6:12 PM, Jim Nasby j...@nasby.net wrote:
 On Jan 10, 2012, at 3:07 AM, Simon Riggs wrote:
 I think we could add an option to check the checksum immediately after
 we pin a block for the first time but it would be very expensive and
 sounds like we're re-inventing hardware or OS features again. Work on
 50% performance drain, as an estimate.

 That is a level of protection no other DBMS offers, so that is either
 an advantage or a warning. Jim, if you want this, please do the
 research and work out what the probability of losing shared buffer
 data in your ECC RAM really is so we are doing it for quantifiable
 reasons (via old Google memory academic paper) and to verify that the
 cost/benefit means you would actually use it if we built it. Research
 into requirements is at least as important and time consuming as
 research on possible designs.

 Maybe I'm just dense, but it wasn't clear to me how you could use the 
 information in the google paper to extrapolate data corruption probability.

 I can say this: we have seen corruption from bad memory, and our Postgres 
 buffer pool (8G) is FAR smaller than
 available memory on all of our servers (192G or 512G). So at least in our 
 case, CRCs that protect the filesystem
 cache would protect the vast majority of our memory (96% or 98.5%).

Would it be unfair to assert that people who want checksums but aren't
willing to pay the cost of running a filesystem that provides
checksums aren't going to be willing to make the cost/benefit trade
off that will be asked for? Yes, it is unfair of course, but it's
interesting how small the camp of those using checksummed filesystems
is.

Robert Treat
conjecture: xzilla.net
consulting: omniti.com



Re: [HACKERS] Page Checksums

2012-01-23 Thread Florian Weimer
* Robert Treat:

 Would it be unfair to assert that people who want checksums but aren't
 willing to pay the cost of running a filesystem that provides
 checksums aren't going to be willing to make the cost/benefit trade
 off that will be asked for? Yes, it is unfair of course, but it's
 interesting how small the camp of those using checksummed filesystems
 is.

Don't checksumming file systems currently come bundled with other
features you might not want (such as certain vendors)?

-- 
Florian Weimer            fwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99



Re: [HACKERS] Page Checksums

2012-01-22 Thread Jim Nasby
On Jan 10, 2012, at 3:07 AM, Simon Riggs wrote:
 I think we could add an option to check the checksum immediately after
 we pin a block for the first time but it would be very expensive and
 sounds like we're re-inventing hardware or OS features again. Work on
 50% performance drain, as an estimate.
 
 That is a level of protection no other DBMS offers, so that is either
 an advantage or a warning. Jim, if you want this, please do the
 research and work out what the probability of losing shared buffer
 data in your ECC RAM really is so we are doing it for quantifiable
 reasons (via old Google memory academic paper) and to verify that the
 cost/benefit means you would actually use it if we built it. Research
 into requirements is at least as important and time consuming as
 research on possible designs.

Maybe I'm just dense, but it wasn't clear to me how you could use the 
information in the google paper to extrapolate data corruption probability.

I can say this: we have seen corruption from bad memory, and our Postgres 
buffer pool (8G) is FAR smaller than available memory on all of our servers 
(192G or 512G). So at least in our case, CRCs that protect the filesystem cache 
would protect the vast majority of our memory (96% or 98.5%).
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Page Checksums

2012-01-10 Thread Heikki Linnakangas

On 10.01.2012 02:12, Jim Nasby wrote:

Filesystem CRCs very likely will not be applied to data that's in the cache. For 
some users, that's a huge amount of data to leave unprotected.


You can repeat that argument ad infinitum. Even if the CRC covers all 
the pages in the OS buffer cache, it still doesn't cover the pages in 
the shared_buffers, CPU caches, in-transit from one memory bank to 
another etc. You have to draw the line somewhere, and it seems 
reasonable to draw it where the data moves between long-term storage, 
ie. disk, and RAM.



Filesystem bugs do happen... though presumably most of those would be caught by 
the filesystem's CRC check... but you never know!


Yeah. At some point we have to just have faith in the underlying system. 
It's reasonable to provide protection or make recovery easier from bugs 
or hardware faults that happen fairly often in the real world, but a 
can't-trust-no-one attitude is not very helpful.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Page Checksums

2012-01-10 Thread Simon Riggs
On Tue, Jan 10, 2012 at 8:04 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 10.01.2012 02:12, Jim Nasby wrote:

 Filesystem CRCs very likely will not be applied to data that's in the cache.
 For some users, that's a huge amount of data to leave unprotected.


 You can repeat that argument ad infinitum. Even if the CRC covers all the
 pages in the OS buffer cache, it still doesn't cover the pages in the
 shared_buffers, CPU caches, in-transit from one memory bank to another etc.
 You have to draw the line somewhere, and it seems reasonable to draw it
 where the data moves between long-term storage, ie. disk, and RAM.

We protect each change with a CRC when we write WAL, so doing the same
thing doesn't sound entirely unreasonable, especially if your database
fits in RAM and we aren't likely to be doing I/O anytime soon. The
long term storage argument may no longer apply in a world with very
large memory.

The question is, when exactly would we check the checksum? When we
lock the block, when we pin it? We certainly can't do it on every
access to the block since we don't even track where that happens in
the code.

I think we could add an option to check the checksum immediately after
we pin a block for the first time but it would be very expensive and
sounds like we're re-inventing hardware or OS features again. Work on
50% performance drain, as an estimate.

That is a level of protection no other DBMS offers, so that is either
an advantage or a warning. Jim, if you want this, please do the
research and work out what the probability of losing shared buffer
data in your ECC RAM really is so we are doing it for quantifiable
reasons (via old Google memory academic paper) and to verify that the
cost/benefit means you would actually use it if we built it. Research
into requirements is at least as important and time consuming as
research on possible designs.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Page Checksums

2012-01-10 Thread Benedikt Grundmann
On 10/01/12 09:07, Simon Riggs wrote:
  You can repeat that argument ad infinitum. Even if the CRC covers all the
  pages in the OS buffer cache, it still doesn't cover the pages in the
  shared_buffers, CPU caches, in-transit from one memory bank to another etc.
  You have to draw the line somewhere, and it seems reasonable to draw it
  where the data moves between long-term storage, ie. disk, and RAM.
 
 We protect each change with a CRC when we write WAL, so doing the same
 thing doesn't sound entirely unreasonable, especially if your database
 fits in RAM and we aren't likely to be doing I/O anytime soon. The
 long term storage argument may no longer apply in a world with very
 large memory.
 
I'm not so sure about that.  The experience we have is that storage
and memory don't grow as fast as demand.  Maybe we are in a minority 
but at Jane Street memory size < database size is sadly true for most 
of the important databases.

Concretely, the two most important databases are 

715 GB

and

473 GB 

in size (the second used to be much closer to the first one in size but
we recently archived a lot of data).

In both databases there is a small set of tables that use the majority of
the disk space.  Those tables are also the most used tables.  Typically
the size of one of those tables is between 1-3x the size of memory.  And the
cumulative size of all indices on the table is normally roughly the same
size as the table.

Cheers,

Bene



Re: [HACKERS] Page Checksums

2012-01-09 Thread Jim Nasby
On Jan 8, 2012, at 5:25 PM, Simon Riggs wrote:
 On Mon, Dec 19, 2011 at 8:18 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 
 Double-writes would be a useful option also to reduce the size of WAL that
 needs to be shipped in replication.
 
 Or you could just use a filesystem that does CRCs...
 
 Double writes would reduce the size of WAL and we discussed many times
 we want that.
 
 Using a filesystem that does CRCs is basically saying let the
 filesystem cope. If that is an option, why not just turn full page
 writes off and let the filesystem cope?

I don't think that just because a filesystem does CRCs you can't have a torn 
write.

Filesystem CRCs very likely will not be applied to data that's in the cache. For 
some users, that's a huge amount of data to leave unprotected.

Filesystem bugs do happen... though presumably most of those would be caught by 
the filesystem's CRC check... but you never know!
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Page Checksums

2012-01-08 Thread Simon Riggs
On Mon, Dec 19, 2011 at 8:18 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 Double-writes would be a useful option also to reduce the size of WAL that
 needs to be shipped in replication.

 Or you could just use a filesystem that does CRCs...

Double writes would reduce the size of WAL and we discussed many times
we want that.

Using a filesystem that does CRCs is basically saying let the
filesystem cope. If that is an option, why not just turn full page
writes off and let the filesystem cope?

Do we really need double writes or even checksums in Postgres? What
use case are we covering that isn't covered by using the right
filesystem for the job? Or is that the problem? Are we implementing a
feature we needed 5 years ago but don't need now? Yes, other databases
have some of these features, but do we need them? Do we still need
them now?

Tell me we really need some or all of this and I will do my best to
make it happen.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Page Checksums + Double Writes

2012-01-05 Thread Florian Pflug
On Jan4, 2012, at 21:27 , Robert Haas wrote:
 I think the first thing we need to look at is increasing the number of
 CLOG buffers.

What became of the idea to treat the stable (i.e. earlier than the oldest
active xid) and the unstable (i.e. the rest) parts of the CLOG differently?

On 64-bit machines at least, we could simply mmap() the stable parts of the
CLOG into the backend address space, and access it without any locking at all.

I believe that we could also compress the stable part by 50% if we use one
instead of two bits per txid. AFAIK, we need two bits because we

  a) Distinguish between transactions which were ABORTED and those which never
     completed (due to, e.g., a backend crash), and

  b) Mark transactions as SUBCOMMITTED to achieve atomic commits.

Neither of which is strictly necessary for the stable parts of the clog. Note that
we could still keep the uncompressed CLOG around for debugging purposes - the
additional compressed version would require only 2^32/8 bytes = 512 MB in the
worst case, which people who're serious about performance can very probably
spare.

The fly in the ointment are 32-bit machines, of course - but then, those could
still fall back to the current way of doing things.
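
A minimal sketch of the bit-twiddling this implies; the names and the exact
bit layout are illustrative only, not the actual clog code:

#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Regular clog: 2 bits per xid (0 = in progress, 1 = committed,
 * 2 = aborted, 3 = subcommitted), i.e. four xids per byte. */
static int
clog_status(const uint8_t *clog, TransactionId xid)
{
    return (clog[xid / 4] >> ((xid % 4) * 2)) & 0x03;
}

/* Hypothetical compressed form for the stable part (xids older than the
 * oldest active xid): 1 bit per xid, set = committed, clear = aborted or
 * never completed.  Eight xids per byte instead of four is where the 50%
 * saving -- and the 2^32 / 8 bytes = 512 MB worst case -- comes from. */
static bool
stable_clog_committed(const uint8_t *stable, TransactionId xid)
{
    return ((stable[xid / 8] >> (xid % 8)) & 0x01) != 0;
}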

best regards,
Florian Pflug




Re: [HACKERS] Page Checksums + Double Writes

2012-01-05 Thread Merlin Moncure
On Thu, Jan 5, 2012 at 5:15 AM, Florian Pflug f...@phlo.org wrote:
 On Jan4, 2012, at 21:27 , Robert Haas wrote:
 I think the first thing we need to look at is increasing the number of
 CLOG buffers.

 What became of the idea to treat the stable (i.e. earlier than the oldest
 active xid) and the unstable (i.e. the rest) parts of the CLOG differently?


I'm curious -- anyone happen to have an idea how big the unstable CLOG
xid space is in the typical case?  What would be the main driver
of making it bigger?  What are the main tradeoffs in terms of trying
to keep the unstable area compact?

merlin



Re: [HACKERS] Page Checksums + Double Writes

2012-01-05 Thread Robert Haas
On Thu, Jan 5, 2012 at 6:15 AM, Florian Pflug f...@phlo.org wrote:
 On 64-bit machines at least, we could simply mmap() the stable parts of the
 CLOG into the backend address space, and access it without any locking at all.

True.  I think this could be done, but it would take some fairly
careful thought and testing because (1) we don't currently use mmap()
anywhere else in the backend AFAIK, so we might run into portability
issues (think: Windows) and perhaps unexpected failure modes (e.g.
mmap() fails because there are too many mappings already).  Also, it's
not completely guaranteed to be a win.  Sure, you save on locking, but
now you are doing an mmap() call in every backend instead of just one
read() into shared memory.  If concurrency isn't a problem that might
be more expensive on net.  Or maybe no, but I'm kind of inclined to
steer clear of this whole area at least for 9.2.  So far, the only
test result I have supports the notion that we run into trouble
when NUM_CPUS > NUM_CLOG_BUFFERS, and people have to wait before they can
even start their I/Os.  That can be fixed with a pretty modest
reengineering.  I'm sure there is a second-order effect from the cost
of repeated I/Os per se, which a backend-private cache of one form or
another might well help with, but it may not be very big.  Test
results are welcome, of course.

 I believe that we could also compress the stable part by 50% if we use one
 instead of two bits per txid. AFAIK, we need two bits because we

  a) Distinguish between transactions which were ABORTED and those which never
     completed (due to, e.g., a backend crash), and

  b) Mark transactions as SUBCOMMITTED to achieve atomic commits.

  Neither of which is strictly necessary for the stable parts of the clog.

Well, if we're going to do compression at all, I'm inclined to think
that we should compress by more than a factor of two.  Jim Nasby's
numbers (the worst we've seen so far) show that 18% of 1k blocks of
XIDs were all commits.  Presumably if we reduced the chunk size to,
say, 8 transactions, that percentage would go up, and even that would
be enough to get 16x compression rather than 2x.  Of course, then
keeping the uncompressed CLOG files becomes required rather than
optional, but that's OK.  What bothers me about compressing by only 2x
is that the act of compressing is not free.  You have to read all the
chunks and then write out new chunks, and those chunks then compete
with each other for cache space.  Who is to say that we're not better off just
reading the uncompressed data at that point?  At least then we have
only one copy of it.

 Note that
 we could still keep the uncompressed CLOG around for debugging purposes - the
 additional compressed version would require only 2^32/8 bytes = 512 MB in the
 worst case, which people who're serious about performance can very probably
 spare.

I don't think it'd be even that much, because we only ever use half
the XID space at a time, and often probably much less: the default
value of vacuum_freeze_table_age is only 150 million transactions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Page Checksums + Double Writes

2012-01-05 Thread Benedikt Grundmann
For what it's worth, here are the numbers on one of our biggest databases
(same system as I posted about separately wrt seq_scan_cost vs
random_page_cost).


0053 1001
00BA 1009
0055 1001
00B9 1020
0054 983
00BB 1010
0056 1001
00BC 1019
0069 0
00BD 1009
006A 224
00BE 1018
006B 1009
00BF 1008
006C 1008
00C0 1006
006D 1004
00C1 1014
006E 1016
00C2 1023
006F 1003
00C3 1012
0070 1011
00C4 1000
0071 1011
00C5 1002
0072 1005
00C6 982
0073 1009
00C7 996
0074 1013
00C8 973
0075 1002
00D1 987
0076 997
00D2 968
0077 1007
00D3 974
0078 1012
00D4 964
0079 994
00D5 981
007A 1013
00D6 964
007B 999
00D7 966
007C 1000
00D8 971
007D 1000
00D9 956
007E 1008
00DA 976
007F 1010
00DB 950
0080 1001
00DC 967
0081 1009
00DD 983
0082 1008
00DE 970
0083 988
00DF 965
0084 1007
00E0 984
0085 1012
00E1 1004
0086 1004
00E2 976
0087 996
00E3 941
0088 1008
00E4 960
0089 1003
00E5 948
008A 995
00E6 851
008B 1001
00E7 971
008C 1003
00E8 954
008D 982
00E9 938
008E 1000
00EA 931
008F 1008
00EB 956
0090 1009
00EC 960
0091 1013
00ED 962
0092 1006
00EE 933
0093 1012
00EF 956
0094 994
00F0 978
0095 1017
00F1 292
0096 1004
0097 1005
0098 1014
0099 1012
009A 994
0035 1003
009B 1007
0036 1004
009C 1010
0037 981
009D 1024
0038 1002
009E 1009
0039 998
009F 1011
003A 995
00A0 1015
003B 996
00A1 1018
003C 1013
00A5 1007
003D 1008
00A3 1016
003E 1007
00A4 1020
003F 989
00A7 375
0040 989
00A6 1010
0041 975
00A9 3
0042 994
00A8 0
0043 1010
00AA 1
0044 1007
00AB 1
0045 1008
00AC 0
0046 991
00AF 4
0047 1010
00AD 0
0048 997
00AE 0
0049 1002
00B0 5
004A 1004
00B1 0
004B 1012
00B2 0
004C 999
00B3 0
004D 1008
00B4 0
004E 1007
00B5 807
004F 1010
00B6 1007
0050 1004
00B7 1007
0051 1009
00B8 1006
0052 1005
0057 1008
00C9 994
0058 991
00CA 977
0059 1000
00CB 978
005A 998
00CD 944
005B 971
00CC 972
005C 1005
00CF 969
005D 1010
00CE 988
005E 1006
00D0 975
005F 1015
0060 989
0061 998
0062 1014
0063 1000
0064 991
0065 990
0066 1000
0067 947
0068 377
00A2 1011


On 23/12/11 14:23, Kevin Grittner wrote:
 Jeff Janes jeff.ja...@gmail.com wrote:
  
  Could we get some major OLTP users to post their CLOG for
  analysis?  I wouldn't think there would be much
  security/propietary issues with CLOG data.
  
 FWIW, I got the raw numbers to do my quick check using this Ruby
 script (put together for me by Peter Brant).  If it is of any use to
 anyone else, feel free to use it and/or post any enhanced versions
 of it.
  
 #!/usr/bin/env ruby
 
 Dir.glob("*") do |file_name|
   contents = File.read(file_name)
   total = contents.enum_for(:each_byte).enum_for(:each_slice, 256).inject(0) do |count, chunk|
     if chunk.all? { |b| b == 0x55 }
       count + 1
     else
       count
     end
   end
   printf "%s %d\n", file_name, total
 end
  
 -Kevin
 



Re: [HACKERS] Page Checksums + Double Writes

2012-01-05 Thread Kevin Grittner
Benedikt Grundmann bgrundm...@janestreet.com wrote:
 
 For what it's worth, here are the numbers on one of our biggest
 databases (same system as I posted about separately wrt
 seq_scan_cost vs random_page_cost).
 
That would be an 88.4% hit rate on the summarized data.
 
-Kevin



Re: [HACKERS] Page Checksums + Double Writes

2012-01-04 Thread Jim Nasby

On Dec 23, 2011, at 2:23 PM, Kevin Grittner wrote:

 Jeff Janes jeff.ja...@gmail.com wrote:
 
 Could we get some major OLTP users to post their CLOG for
 analysis?  I wouldn't think there would be much
 security/propietary issues with CLOG data.
 
 FWIW, I got the raw numbers to do my quick check using this Ruby
 script (put together for me by Peter Brant).  If it is of any use to
 anyone else, feel free to use it and/or post any enhanced versions
 of it.

Here's output from our largest OLTP system... not sure exactly how to interpret 
it, so I'm just providing the raw data. This spans almost exactly 1 month.

I have a number of other systems I can profile if anyone's interested.

063A 379
063B 143
063C 94
063D 94
063E 326
063F 113
0640 122
0641 270
0642 81
0643 390
0644 183
0645 76
0646 61
0647 50
0648 275
0649 288
064A 126
064B 53
064C 59
064D 125
064E 357
064F 92
0650 54
0651 83
0652 267
0653 328
0654 118
0655 75
0656 104
0657 280
0658 414
0659 105
065A 74
065B 153
065C 303
065D 63
065E 216
065F 169
0660 113
0661 405
0662 85
0663 52
0664 44
0665 78
0666 412
0667 116
0668 48
0669 61
066A 66
066B 364
066C 104
066D 48
066E 68
066F 104
0670 465
0671 158
0672 64
0673 62
0674 115
0675 452
0676 296
0677 65
0678 80
0679 177
067A 316
067B 86
067C 87
067D 270
067E 84
067F 295
0680 299
0681 88
0682 35
0683 67
0684 66
0685 456
0686 146
0687 52
0688 33
0689 73
068A 147
068B 345
068C 107
068D 67
068E 50
068F 97
0690 473
0691 156
0692 47
0693 57
0694 97
0695 550
0696 224
0697 51
0698 80
0699 280
069A 115
069B 426
069C 241
069D 395
069E 98
069F 130
06A0 523
06A1 296
06A2 92
06A3 97
06A4 122
06A5 524
06A6 256
06A7 118
06A8 111
06A9 157
06AA 553
06AB 166
06AC 106
06AD 103
06AE 200
06AF 621
06B0 288
06B1 95
06B2 107
06B3 227
06B4 92
06B5 447
06B6 210
06B7 364
06B8 119
06B9 113
06BA 384
06BB 319
06BC 45
06BD 68
06BE 2
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Page Checksums + Double Writes

2012-01-04 Thread Kevin Grittner
Jim Nasby j...@nasby.net wrote:
 
 Here's output from our largest OLTP system... not sure exactly how
 to interpret it, so I'm just providing the raw data. This spans
 almost exactly 1 month.
 
Those numbers wind up meaning that 18% of the 256-byte blocks (1024
transactions each) were all commits.  Yikes.  That pretty much
shoots down Robert's idea of summarized CLOG data, I think.
 
-Kevin



Re: [HACKERS] Page Checksums + Double Writes

2012-01-04 Thread Robert Haas
On Wed, Jan 4, 2012 at 3:02 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Jim Nasby j...@nasby.net wrote:
 Here's output from our largest OLTP system... not sure exactly how
 to interpret it, so I'm just providing the raw data. This spans
 almost exactly 1 month.

 Those numbers wind up meaning that 18% of the 256-byte blocks (1024
 transactions each) were all commits.  Yikes.  That pretty much
 shoots down Robert's idea of summarized CLOG data, I think.

I'm not *totally* certain of that... another way to look at it is that
I have to be able to show a win even if only 18% of the probes into
the summarized data are successful, which doesn't seem totally out of
the question given how cheap I think lookups could be.  But I'll admit
it's not real encouraging.

I think the first thing we need to look at is increasing the number of
CLOG buffers.  Even if hypothetical summarized CLOG data had a 60% hit
rate rather than 18%, 8 CLOG buffers is probably still not going to be
enough for a 32-core system, let alone anything larger.  I am aware of
two concerns here:

1. Unconditionally adding more CLOG buffers will increase PostgreSQL's
minimum memory footprint, which is bad for people suffering under
default shared memory limits or running a database on a device with
less memory than a low-end cell phone.

2. The CLOG code isn't designed to manage a large number of buffers,
so adding more might cause a performance regression on small systems.

On Nate Boley's 32-core system, running pgbench at scale factor 100,
the optimal number of buffers seems to be around 32.  I'd like to get
some test results from smaller systems - any chance you (or anyone)
have, say, an 8-core box you could test on?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Page Checksums + Double Writes

2012-01-04 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 2. The CLOG code isn't designed to manage a large number of
 buffers, so adding more might cause a performance regression on
 small systems.
 
 On Nate Boley's 32-core system, running pgbench at scale factor
 100, the optimal number of buffers seems to be around 32.  I'd
 like to get some test results from smaller systems - any chance
 you (or anyone) have, say, an 8-core box you could test on?
 
Hmm.  I can think of a lot of 4-core servers I could test on.  (We
have a few poised to go into production where it would be relatively
easy to do benchmarking without distorting factors right now.) 
After that we jump to 16 cores, unless I'm forgetting something. 
These are currently all in production, but some of them are
redundant machines which could be pulled for a few hours here and
there for benchmarks.  If either of those seem worthwhile, please
spec the useful tests so I can capture the right information.
 
-Kevin



Re: [HACKERS] Page Checksums + Double Writes

2012-01-04 Thread Jim Nasby
On Jan 4, 2012, at 2:02 PM, Kevin Grittner wrote:
 Jim Nasby j...@nasby.net wrote:
 Here's output from our largest OLTP system... not sure exactly how
 to interpret it, so I'm just providing the raw data. This spans
 almost exactly 1 month.
 
  Those numbers wind up meaning that 18% of the 256-byte blocks (1024
 transactions each) were all commits.  Yikes.  That pretty much
 shoots down Robert's idea of summarized CLOG data, I think.

Here's another data point. This is for a londiste slave of what I posted 
earlier. Note that this slave has no users on it.
054A 654
054B 835
054C 973
054D 1020
054E 1012
054F 1022
0550 284


And these clog files are from Sep 15-30... I believe that's the period when we 
were building this slave, but I'm not 100% certain.

04F0 194
04F1 253
04F2 585
04F3 243
04F4 176
04F5 164
04F6 358
04F7 505
04F8 168
04F9 180
04FA 369
04FB 318
04FC 236
04FD 437
04FE 242
04FF 625
0500 222
0501 139
0502 174
0503 91
0504 546
0505 220
0506 187
0507 151
0508 199
0509 491
050A 232
050B 170
050C 191
050D 414
050E 557
050F 231
0510 173
0511 159
0512 436
0513 789
0514 354
0515 157
0516 187
0517 333
0518 599
0519 483
051A 300
051B 512
051C 713
051D 422
051E 291
051F 596
0520 785
0521 825
0522 484
0523 238
0524 151
0525 190
0526 256
0527 403
0528 551
0529 757
052A 837
052B 418
052C 256
052D 161
052E 254
052F 423
0530 469
0531 757
0532 627
0533 325
0534 224
0535 295
0536 290
0537 352
0538 561
0539 565
053A 833
053B 756
053C 485
053D 276
053E 241
053F 270
0540 334
0541 306
0542 700
0543 821
0544 402
0545 199
0546 226
0547 250
0548 354
0549 587


This is for a slave of that database that does have user activity:

054A 654
054B 835
054C 420
054D 432
054E 852
054F 666
0550 302
0551 243
0552 600
0553 295
0554 617
0555 504
0556 232
0557 304
0558 580
0559 156

--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Page Checksums + Double Writes

2012-01-04 Thread Robert Haas
On Wed, Jan 4, 2012 at 4:02 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:

 2. The CLOG code isn't designed to manage a large number of
 buffers, so adding more might cause a performance regression on
 small systems.

 On Nate Boley's 32-core system, running pgbench at scale factor
 100, the optimal number of buffers seems to be around 32.  I'd
 like to get some test results from smaller systems - any chance
 you (or anyone) have, say, an 8-core box you could test on?

 Hmm.  I can think of a lot of 4-core servers I could test on.  (We
 have a few poised to go into production where it would be relatively
 easy to do benchmarking without distorting factors right now.)
 After that we jump to 16 cores, unless I'm forgetting something.
 These are currently all in production, but some of them are
 redundant machines which could be pulled for a few hours here and
 there for benchmarks.  If either of those seem worthwhile, please
 spec the useful tests so I can capture the right information.

Yes, both of those seem useful.  To compile, I do this:

./configure --prefix=$HOME/install/$BRANCHNAME --enable-depend --enable-debug ${EXTRA_OPTIONS}
make
make -C contrib/pgbench
make check
make install
make -C contrib/pgbench install

In this case, the relevant builds would probably be (1) master, (2)
master with NUM_CLOG_BUFFERS = 16, (3) master with NUM_CLOG_BUFFERS =
32, and (4) master with NUM_CLOG_BUFFERS = 48.  (You could also try
intermediate numbers if it seems warranted.)

Basic test setup:

rm -rf $PGDATA
~/install/master/bin/initdb
cat > $PGDATA/postgresql.conf <<EOM
shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms
EOM

I'm attaching a driver script you can modify to taste.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


runtestw
Description: Binary data



Re: [HACKERS] Page Checksums

2012-01-03 Thread Jim Nasby
On Dec 28, 2011, at 3:31 AM, Simon Riggs wrote:
 On Wed, Dec 28, 2011 at 9:00 AM, Robert Haas robertmh...@gmail.com wrote:
 
  What I'm not too clear
 about is whether a 16-bit checksum meets the needs of people who want
 checksums.
 
 We need this now, hence the gymnastics to get it into this release.
 
 16-bits of checksum is way better than zero bits of checksum, probably
 about a million times better (numbers taken from papers quoted earlier
 on effectiveness of checksums).
 
 The strategy I am suggesting is 16-bits now, 32/64 later.

What about allowing for an initdb option? That means that if you want binary 
compatibility so you can pg_upgrade then you're stuck with 16 bit checksums. If 
you can tolerate replicating all your data then you can get more robust 
checksumming.

In either case, it seems that we're quickly approaching the point where we need 
to start putting resources into binary page upgrading...
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Page Checksums

2011-12-28 Thread Robert Haas
On Tue, Dec 27, 2011 at 1:39 PM, Jeff Davis pg...@j-davis.com wrote:
 On Mon, 2011-12-19 at 07:50 -0500, Robert Haas wrote:
 I
 think it would be regrettable if everyone had to give up 4 bytes per
 page because some people want checksums.

 I can understand that some people might not want the CPU expense of
 calculating CRCs; or the upgrade expense to convert to new pages; but do
 you think 4 bytes out of 8192 is a real concern?

 (Aside: it would be MAXALIGNed anyway, so probably more like 8 bytes.)

Yeah, I do.  Our on-disk footprint is already significantly greater
than that of some other systems, and IMHO we should be looking for a
way to shrink our overhead in that area, not make it bigger.
Admittedly, most of the fat is probably in the tuple header rather
than the page header, but at any rate I don't consider burning up 1%
of our available storage space to be a negligible overhead.  I'm not
sure I believe it should need to be MAXALIGN'd, since it is followed
by item pointers which IIRC only need 2-byte alignment, but then again
Heikki also recently proposed adding 4 bytes per page to allow each
page to track its XID generation, to help mitigate the need for
anti-wraparound vacuuming.

I think Simon's approach of stealing the 16-bit page version field is
reasonably clever in this regard, although I also understand why Tom
objects to it, and I certainly agree with him that we need to be
careful not to back ourselves into a corner.  What I'm not too clear
about is whether a 16-bit checksum meets the needs of people who want
checksums.  If we assume that flaky hardware is going to corrupt pages
steadily over time, then it seems like it might be adequate, because
in the unlikely event that the first corrupted page happens to still
pass its checksum test, well, another will come along and we'll
probably spot the problem then, likely well before any significant
fraction of the data gets eaten.  But I'm not sure whether that's the
right mental model.  I, and I think some others, initially assumed
we'd want a 32-bit checksum, but I'm not sure I can justify that
beyond "well, I think that's what people usually do."  It could be
that even if we add new page header space for the checksum (as opposed
to stuffing it into the page version field) we still want to add only
2 bytes.  Not sure...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Page Checksums

2011-12-28 Thread Simon Riggs
On Wed, Dec 28, 2011 at 9:00 AM, Robert Haas robertmh...@gmail.com wrote:

 What I'm not too clear
 about is whether a 16-bit checksum meets the needs of people who want
 checksums.

We need this now, hence the gymnastics to get it into this release.

16-bits of checksum is way better than zero bits of checksum, probably
about a million times better (numbers taken from papers quoted earlier
on effectiveness of checksums).

The strategy I am suggesting is 16-bits now, 32/64 later.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Page Checksums + Double Writes

2011-12-28 Thread Greg Stark
On Tue, Dec 27, 2011 at 10:43 PM, Merlin Moncure mmonc...@gmail.com wrote:
  I bet if you kept a judicious number of
 clog pages in each local process with some smart invalidation you
 could cover enough cases that scribbling the bits down would become
 unnecessary.

I don't understand how any cache can completely remove the need for
hint bits. Without hint bits the xids in the tuples will be in-doubt
forever. No matter how large your cache you'll always come across
tuples that are arbitrarily old and are from an unbounded size set of
xids.

We could replace the xids with a frozen xid sooner but that just
amounts to nearly the same thing as the hint bits only with page
locking and wal records.


-- 
greg



Re: [HACKERS] Page Checksums + Double Writes

2011-12-28 Thread Merlin Moncure
On Wed, Dec 28, 2011 at 8:45 AM, Greg Stark st...@mit.edu wrote:
 On Tue, Dec 27, 2011 at 10:43 PM, Merlin Moncure mmonc...@gmail.com wrote:
  I bet if you kept a judicious number of
 clog pages in each local process with some smart invalidation you
 could cover enough cases that scribbling the bits down would become
 unnecessary.

 I don't understand how any cache can completely remove the need for
 hint bits. Without hint bits the xids in the tuples will be in-doubt
 forever. No matter how large your cache you'll always come across
 tuples that are arbitrarily old and are from an unbounded size set of
 xids.

well, hint bits aren't needed strictly speaking, they are an
optimization to guard against clog lookups.   but is marking bits on
the tuple the only way to get that effect?

I'm conjecturing that some process local memory could be laid on top
of the clog slru that would be fast enough such that it could take the
place of the tuple bits in the visibility check.  Maybe this could
reduce clog contention as well -- or maybe the idea is unworkable.
That said, it shouldn't be that much work to make a proof of concept
to test the idea.
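
A minimal sketch of what such a proof of concept could look like, assuming a
tiny direct-mapped, backend-local cache of clog pages; the names, sizes, and
the stubbed-out "shared" reader are all hypothetical:

#include <stdint.h>
#include <string.h>

#define CLOG_PAGE_SIZE   8192
#define XIDS_PER_PAGE    (CLOG_PAGE_SIZE * 4)   /* 2 bits per xid */
#define LOCAL_CLOG_PAGES 4

typedef uint32_t TransactionId;

typedef struct
{
    int     page_no;                  /* which clog page is cached; -1 = empty */
    uint8_t data[CLOG_PAGE_SIZE];     /* private copy of that page */
} LocalClogPage;

static LocalClogPage local_cache[LOCAL_CLOG_PAGES] = {
    {-1, {0}}, {-1, {0}}, {-1, {0}}, {-1, {0}}
};

/* Stub standing in for the real shared-clog read (locking, possible I/O). */
static void
shared_clog_read_page(int page_no, uint8_t *dest)
{
    (void) page_no;
    memset(dest, 0x55, CLOG_PAGE_SIZE);   /* pretend: every xid committed */
}

/* Visibility-check path: consult the local copy, refilling from the shared
 * clog on a miss.  Only pages whose xids are all older than our xmin are
 * safe to cache, since their contents can no longer change -- that rule is
 * the "smart invalidation" part and is elided here. */
static int
local_clog_status(TransactionId xid)
{
    int page_no = (int) (xid / XIDS_PER_PAGE);
    LocalClogPage *p = &local_cache[page_no % LOCAL_CLOG_PAGES];

    if (p->page_no != page_no)
    {
        shared_clog_read_page(page_no, p->data);
        p->page_no = page_no;
    }
    /* 2 bits per xid: 0 = in progress, 1 = committed, 2 = aborted, 3 = subcommitted */
    return (p->data[(xid % XIDS_PER_PAGE) / 4] >> ((xid % 4) * 2)) & 0x03;
}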

 We could replace the xids with a frozen xid sooner but that just
 amounts to nearly the same thing as the hint bits only with page
 locking and wal records.

right -- I don't think that helps.

merlin



Re: [HACKERS] Page Checksums

2011-12-28 Thread Heikki Linnakangas

On 28.12.2011 11:00, Robert Haas wrote:

Admittedly, most of the fat is probably in the tuple header rather
than the page header, but at any rate I don't consider burning up 1%
of our available storage space to be a negligible overhead.


8 / 8192 = 0.1%.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Page Checksums

2011-12-27 Thread Jeff Davis
On Mon, 2011-12-19 at 07:50 -0500, Robert Haas wrote:
 I
 think it would be regrettable if everyone had to give up 4 bytes per
 page because some people want checksums.

I can understand that some people might not want the CPU expense of
calculating CRCs; or the upgrade expense to convert to new pages; but do
you think 4 bytes out of 8192 is a real concern?

(Aside: it would be MAXALIGNed anyway, so probably more like 8 bytes.)

I was thinking we'd go in the other direction: expanding the header
would take so much effort, why not expand it a little more to give some
breathing room for the future?

Regards,
Jeff Davis






Re: [HACKERS] Page Checksums

2011-12-27 Thread Jeff Davis
On Mon, 2011-12-19 at 01:55 +, Greg Stark wrote:
 On Sun, Dec 18, 2011 at 7:51 PM, Jesper Krogh jes...@krogh.cc wrote:
  I don't know if it would be seen as a half-baked feature, or similar,
  and I don't know if the hint bit problem is solvable at all, but I could
  easily imagine checksumming just skipping the hint bits entirely.
 
 That was one approach discussed. The problem is that the hint bits are
 currently in each heap tuple header which means the checksum code
 would have to know a fair bit about the structure of the page format.

Which is actually a bigger problem, because it might not be the backend
that's reading the page. It might be your backup script taking a new
base backup.

The kind of person to care about CRCs would also want the base backup
tool to verify them during the copy so that you don't overwrite your
previous (good) backup with a bad one. The more complicated we make the
verification process, the less workable that becomes.

I vote for a simple way to calculate the checksum -- fixed offsets of
each page (of course, it would need to know the page size), and a
standard checksum algorithm.
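
A sketch of how simple that verification can be kept: a standard CRC-32 over
the whole page except the bytes holding the checksum itself, stored at an
assumed fixed offset (the offset and field width here are illustrative, not a
proposed on-disk format):

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

#define PAGE_SIZE        8192
#define CHECKSUM_OFFSET  8                 /* assumed fixed position in the header */

static uint32_t
crc32_update(uint32_t crc, uint8_t b)
{
    crc ^= b;
    for (int i = 0; i < 8; i++)
        crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    return crc;
}

static uint32_t
page_crc(const uint8_t *page)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < PAGE_SIZE; i++)
    {
        if (i >= CHECKSUM_OFFSET && i < CHECKSUM_OFFSET + sizeof(uint32_t))
            continue;                      /* don't checksum the checksum itself */
        crc = crc32_update(crc, page[i]);
    }
    return ~crc;
}

static bool
page_checksum_ok(const uint8_t *page)
{
    uint32_t stored;

    memcpy(&stored, page + CHECKSUM_OFFSET, sizeof(stored));
    return stored == page_crc(page);
}

A base-backup verifier could run the same loop over the raw data files without
knowing anything about tuple layout, which is the point of keeping the rule simple.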

Regards,
Jeff Davis




Re: [HACKERS] Page Checksums

2011-12-27 Thread Jeff Davis
On Mon, 2011-12-19 at 22:18 +0200, Heikki Linnakangas wrote:
 Or you could just use a filesystem that does CRCs...

That just moves the problem. Correct me if I'm wrong, but I don't think
there's anything special that the filesystem can do that we can't.

The filesystems that support CRCs are more like ZFS than ext3. They do
all writes to a new location, thus fragmenting the files. That may be a
good trade-off for some people, but it's not free.

Regards,
Jeff Davis




Re: [HACKERS] Page Checksums

2011-12-27 Thread Jeff Davis
On Sun, 2011-12-25 at 22:18 +, Greg Stark wrote:
 2) The i/o system was in the process of writing out blocks and the
 system lost power or crashed as they were being written out. In this
 case there will probably only be 0 or 1 torn pages -- perhaps as many
 as the scsi queue depth if there's some weird i/o scheduling going on.

That would also depend on how many disks you have and what configuration
they're in, right?

Regards,
Jeff Davis




Re: [HACKERS] Page Checksums + Double Writes

2011-12-27 Thread Jeff Davis
On Thu, 2011-12-22 at 03:50 -0600, Kevin Grittner wrote:
 Now, on to the separate-but-related topic of double-write.  That
 absolutely requires some form of checksum or CRC to detect torn
 pages, in order for the technique to work at all.  Adding a CRC
 without double-write would work fine if you have a storage stack
 which prevents torn pages in the file system or hardware driver.  If
 you don't have that, it could create a damaged page indication after
 a hardware or OS crash, although I suspect that would be the
 exception, not the typical case.  Given all that, and the fact that
 it would be cleaner to deal with these as two separate patches, it
 seems the CRC patch should go in first.

I think it could be broken down further.

Taking a step back, there are several types of HW-induced corruption,
and checksums only catch some of them. For instance, the disk losing
data completely and just returning zeros won't be caught, because we
assume that a zero page is just fine.

From a development standpoint, I think a better approach would be:

1. Investigate if there are reasonable ways to ensure that (outside of
recovery) pages are always initialized; and therefore zero pages can be
treated as corruption.

2. Make some room in the page header for checksums and maybe some other
simple sanity information (like file and page number). It will be a big
project to sort out the pg_upgrade issues (as Tom and others have
pointed out).

3. Attack hint bits problem.

If (1) and (2) were complete, we would catch many common types of
corruption, and we'd be in a much better position to think clearly about
hint bits, double writes, etc.
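
Steps (1) and (2) amount to checks along these lines, kept separate from the
checksum itself; the header fields here are hypothetical additions used only
to illustrate the idea:

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 8192

typedef struct
{
    uint32_t relfile_id;    /* hypothetical: which relation file the page belongs to */
    uint32_t block_no;      /* hypothetical: which block within that file */
    /* ... checksum, rest of header, line pointers, tuples ... */
} PageStamp;

static bool
page_is_all_zero(const uint8_t *page)
{
    for (size_t i = 0; i < PAGE_SIZE; i++)
        if (page[i] != 0)
            return false;
    return true;
}

/* Returns false for pages that look like lost or misdirected writes. */
static bool
page_identity_ok(const uint8_t *page, uint32_t expected_relfile, uint32_t expected_block)
{
    const PageStamp *stamp = (const PageStamp *) page;

    if (page_is_all_zero(page))
        return false;       /* step (1): only legitimate where a brand-new page is expected */

    return stamp->relfile_id == expected_relfile &&
           stamp->block_no == expected_block;
}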

Regards,
Jeff Davis





Re: [HACKERS] Page Checksums + Double Writes

2011-12-27 Thread Merlin Moncure
On Tue, Dec 27, 2011 at 1:24 PM, Jeff Davis pg...@j-davis.com wrote:
 3. Attack hint bits problem.

A large number of problems would go away if the current hint bit
system could be replaced with something that did not require writing
to the tuple itself.  FWIW, moving the bits around seems like a
non-starter -- you're trading a problem with a much bigger problem
(locking, wal logging, etc).  But perhaps a clog caching strategy
would be a win.  You get a full nibble back in the tuple header,
significant i/o reduction for some workloads, crc becomes relatively
trivial, etc etc.

My first attempt at a process local cache for hint bits wasn't perfect
but proved (at least to me) that you can sneak a tight cache in there
without significantly impacting the general case.  Maybe the angle of
attack was wrong anyways -- I bet if you kept a judicious number of
clog pages in each local process with some smart invalidation you
could cover enough cases that scribbling the bits down would become
unnecessary.  Proving that is a tall order of course, but IMO merits
another attempt.

merlin



Re: [HACKERS] Page Checksums + Double Writes

2011-12-27 Thread Jeff Davis
On Tue, 2011-12-27 at 16:43 -0600, Merlin Moncure wrote:
 On Tue, Dec 27, 2011 at 1:24 PM, Jeff Davis pg...@j-davis.com wrote:
  3. Attack hint bits problem.
 
 A large number of problems would go away if the current hint bit
 system could be replaced with something that did not require writing
 to the tuple itself.

My point was that neither the zero page problem nor the upgrade problem
are solved by addressing the hint bits problem. They can be solved
independently, and in my opinion, it seems to make sense to solve those
problems before the hint bits problem (in the context of detecting
hardware corruption).

Of course, don't let that stop you from trying to get rid of hint bits,
that has numerous potential benefits.

Regards,
Jeff Davis





Re: [HACKERS] Page Checksums

2011-12-25 Thread Greg Stark
On Mon, Dec 19, 2011 at 7:16 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 It seems to me that on a typical production system you would
 probably have zero or one such page per OS crash

Incidentally I don't think this is right. There are really two kinds
of torn pages:

1) The kernel vm has many dirty 4k pages and decides to flush one 4k
page of a Postgres 8k buffer but not the other one. It doesn't sound
very logical for it to do this but it has the same kind of tradeoffs
to make that Postgres does and there could easily be cases where the
extra book-keeping required to avoid it isn't deemed worthwhile. The
two memory pages might not even land on the same part of the disk
anyways so flushing one and not the other might be reasonable.

In this case there could be an unbounded number of such torn pages and
they can stay torn on disk for a long period of time so the torn pages
may not have been actively being written when the crash occurred. On
Linux these torn pages will always be on memory page boundaries -- ie
4k blocks on x86.

2) The i/o system was in the process of writing out blocks and the
system lost power or crashed as they were being written out. In this
case there will probably only be 0 or 1 torn pages -- perhaps as many
as the scsi queue depth if there's some weird i/o scheduling going on.
In this case the torn page could be on a hardware block boundary --
often 512 byte boundaries (or if the drives don't guarantee otherwise
it could corrupt a disk block).

-- 
greg



Re: [HACKERS] Page Checksums + Double Writes

2011-12-24 Thread Simon Riggs
On Thu, Dec 22, 2011 at 9:58 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, Dec 22, 2011 at 9:50 AM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:

 Simon, does it sound like I understand your proposal?

 Yes, thanks for restating.

I've implemented that proposal, posting patch on a separate thread.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Page Checksums + Double Writes

2011-12-23 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 I would suggest you examine how to have an array of N bgwriters,
 then just slot the code for hinting into the bgwriter. That way a
 bgwriter can set hints, calc CRC and write pages in sequence on a
 particular block. The hinting needs to be synchronised with the
 writing to give good benefit.
 
 I'll think about that.  I see pros and cons, and I'll have to see
 how those balance out after I mull them over.
 
I think maybe the best solution is to create some common code to use
from both.  The problem with *just* doing it in bgwriter is that it
would not help much with workloads like Robert has been using for
most of his performance testing -- a database which fits entirely in
shared buffers and starts thrashing on CLOG.  For a background
hinter process my goal would be to deal with xids as they are passed
by the global xmin value, so that you have a cheap way to know that
they are ripe for hinting, and you can frequently hint a bunch of
transactions that are all in the same CLOG page which is recent
enough to likely be already loaded.
 
Now, a background hinter isn't going to be a net win if it has to
grovel through every tuple on every dirty page every time it sweeps
through the buffers, so the idea depends on having a sufficiently
efficient way to identify interesting buffers.  I'm hoping to
improve on this, but my best idea so far is to add a field to the
buffer header for earliest unhinted xid for the page.  Whenever
this background process wakes up and is scanning through the buffers
(probably just in buffer number order), it does a quick check,
without any pin or lock, to see if the buffer is dirty and the
earliest unhinted xid is below the global xmin.  If it passes both
of those tests, there is definitely useful work which can be done if
the page doesn't get evicted before we can do it.  We pin the page,
recheck those conditions, and then we look at each tuple and hint
where possible.  As we go, we remember the earliest xid that we see
which is *not* being hinted, to store back into the buffer header
when we're done.  Of course, we would also update the buffer header
for new tuples or when an xmax is set if the xid involved precedes
what we have in the buffer header.
 
This would not only help avoid multiple page writes as unhinted
tuples on the page are read, it would minimize thrashing on CLOG and
move some of the hinting work from the critical path of reading a
tuple into a background process.
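 
As a standalone C sketch only (invented names such as HintBufferHdr and
hint_page, and none of the real pin/lock machinery), the sweep described
above has roughly this shape; it is an illustration, not proposed code:

#include <stddef.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct HintBufferHdr
{
    bool          is_dirty;
    TransactionId earliest_unhinted_xid;    /* the proposed extra header field */
} HintBufferHdr;

/* Stand-in for xid comparison; ignores wraparound entirely. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return a < b;
}

/*
 * One sweep in buffer-number order.  hint_page() stands in for the
 * per-tuple hinting work; it returns the earliest xid it could NOT hint,
 * or 0 if everything on the page is now hinted.
 */
static void
background_hinter_pass(HintBufferHdr *bufs, size_t nbufs,
                       TransactionId global_xmin,
                       TransactionId (*hint_page)(size_t bufno))
{
    for (size_t i = 0; i < nbufs; i++)
    {
        /* cheap pre-check, no pin or lock */
        if (!bufs[i].is_dirty ||
            !xid_precedes(bufs[i].earliest_unhinted_xid, global_xmin))
            continue;

        /* the real thing would pin the buffer here and recheck both tests */

        /* hint what we can, remember the earliest xid we could not hint */
        bufs[i].earliest_unhinted_xid = hint_page(i);
    }
}

static TransactionId
pretend_hint_page(size_t bufno)
{
    (void) bufno;
    return 0;                               /* pretend everything got hinted */
}

int
main(void)
{
    HintBufferHdr bufs[3] = {
        { true, 100 },                      /* dirty and hintable */
        { true, 900 },                      /* dirty, but xid not yet past xmin */
        { false, 100 }                      /* clean, skipped */
    };

    background_hinter_pass(bufs, 3, 500, pretend_hint_page);
    return 0;
}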
 
Thoughts?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-23 Thread Robert Haas
On Fri, Dec 23, 2011 at 11:14 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Thoughts?

Those are good thoughts.

Here's another random idea, which might be completely nuts.  Maybe we
could consider some kind of summarization of CLOG data, based on the
idea that most transactions commit.  We introduce the idea of a CLOG
rollup page.  On a CLOG rollup page, each bit represents the status of
N consecutive XIDs.  If the bit is set, that means all XIDs in that
group are known to have committed.  If it's clear, then we don't know,
and must fall through to a regular CLOG lookup.

If you let N = 1024, then 8K of CLOG rollup data is enough to
represent the status of 64 million transactions, which means that just
a couple of pages could cover as much of the XID space as you probably
need to care about.  Also, you would need to replace CLOG summary
pages in memory only very infrequently.  Backends could test the bit
without any lock.  If it's set, they do pg_read_barrier(), and then
check the buffer label to make sure it's still the summary page they
were expecting.  If so, no CLOG lookup is needed.  If the page has
changed under us or the bit is clear, then we fall through to a
regular CLOG lookup.
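
As a rough standalone sketch of just the bit arithmetic (made-up names,
none of the barrier or page-label machinery): one bit per N xids, set
only when the whole group is known committed.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define XIDS_PER_BIT   1024                         /* N in the idea above */
#define ROLLUP_BYTES   8192                         /* one 8K summary page */
#define XIDS_PER_PAGE  (ROLLUP_BYTES * 8 * XIDS_PER_BIT)   /* about 64 million */

typedef uint32_t TransactionId;

/* True only if the whole group containing xid is known committed. */
static bool
rollup_known_committed(const uint8_t *rollup, TransactionId xid)
{
    uint32_t group = (xid % XIDS_PER_PAGE) / XIDS_PER_BIT;

    return (rollup[group / 8] >> (group % 8)) & 1;
}

/* Set the group's bit once every xid in it has been seen to commit. */
static void
rollup_mark_group_committed(uint8_t *rollup, TransactionId xid)
{
    uint32_t group = (xid % XIDS_PER_PAGE) / XIDS_PER_BIT;

    rollup[group / 8] |= (uint8_t) (1 << (group % 8));
}

int
main(void)
{
    static uint8_t rollup[ROLLUP_BYTES];            /* all clear: "don't know" */

    rollup_mark_group_committed(rollup, 5000);      /* group for xids 4096..5119 */
    printf("xid 4100: %s\n", rollup_known_committed(rollup, 4100)
           ? "known committed, skip CLOG" : "unknown, check CLOG");
    printf("xid 9000: %s\n", rollup_known_committed(rollup, 9000)
           ? "known committed, skip CLOG" : "unknown, check CLOG");
    return 0;
}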

An obvious problem is that, if the abort rate is significantly
different from zero, and especially if the aborts are randomly mixed
in with commits rather than clustered together in small portions of
the XID space, the CLOG rollup data would become useless.  On the
other hand, if you're doing 10k tps, you only need to have a window of
a tenth of a second or so where everything commits in order to start
getting some benefit, which doesn't seem like a stretch.

Perhaps the CLOG rollup data wouldn't even need to be kept on disk.
We could simply have bgwriter (or bghinter) set the rollup bits in
shared memory for new transactions, as it becomes possible to do so,
and let lookups for XIDs prior to the last shutdown fall through to
CLOG.  Or, if that's not appealing, we could reconstruct the data in
memory by groveling through the CLOG pages - or maybe just set summary
bits only for CLOG pages that actually get faulted in.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-23 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 An obvious problem is that, if the abort rate is significantly
 different from zero, and especially if the aborts are randomly mixed
 in with commits rather than clustered together in small portions of
 the XID space, the CLOG rollup data would become useless.

Yeah, I'm afraid that with N large enough to provide useful
acceleration, the cases where you'd actually get a win would be too thin
on the ground to make it worth the trouble.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-23 Thread Robert Haas
On Fri, Dec 23, 2011 at 12:42 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 An obvious problem is that, if the abort rate is significantly
 different from zero, and especially if the aborts are randomly mixed
 in with commits rather than clustered together in small portions of
 the XID space, the CLOG rollup data would become useless.

 Yeah, I'm afraid that with N large enough to provide useful
 acceleration, the cases where you'd actually get a win would be too thin
 on the ground to make it worth the trouble.

Well, I don't know: something like pgbench is certainly going to
benefit, because all the transactions commit.  I suspect that's true
for many benchmarks.  Whether it's true of real-life workloads is more
arguable, of course, but if the benchmarks aren't measuring things
that people really do with the database, then why are they designed
the way they are?

I've certainly written applications that relied on the database for
integrity checking, so rollbacks were an expected occurrence, but then
again those were very low-velocity systems where there wasn't going to
be enough CLOG contention to matter anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-23 Thread Jeff Janes
On 12/23/11, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Dec 23, 2011 at 11:14 AM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
 Thoughts?

 Those are good thoughts.

 Here's another random idea, which might be completely nuts.  Maybe we
 could consider some kind of summarization of CLOG data, based on the
 idea that most transactions commit.

I had a perhaps crazier idea. Aren't CLOG pages older than global xmin
effectively read only?  Could backends that need these bypass locking
and shared memory altogether?

 An obvious problem is that, if the abort rate is significantly
 different from zero, and especially if the aborts are randomly mixed
 in with commits rather than clustered together in small portions of
 the XID space, the CLOG rollup data would become useless.  On the
 other hand, if you're doing 10k tps, you only need to have a window of
 a tenth of a second or so where everything commits in order to start
 getting some benefit, which doesn't seem like a stretch.

Could we get some major OLTP users to post their CLOG for analysis?  I
wouldn't think there would be many security/proprietary issues with
CLOG data.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-23 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 An obvious problem is that, if the abort rate is significantly
 different from zero, and especially if the aborts are randomly
 mixed in with commits rather than clustered together in small
 portions of the XID space, the CLOG rollup data would become
 useless.
 
 Yeah, I'm afraid that with N large enough to provide useful
 acceleration, the cases where you'd actually get a win would be
 too thin on the ground to make it worth the trouble.
 
Just to get a real-life data point, I checked the pg_clog directory
for the Milwaukee County Circuit Courts.  They have about 300 OLTP
users, plus replication feeds to the central servers.  Looking at
the now-present files, there are 19,104 blocks of 256 bytes (which
should support N of 1024, per Robert's example).  Of those, 12,644
(just over 66%) contain 256 bytes of hex 55.
 
Last modified dates on the files go back to the 4th of October, so
this represents roughly three months worth of real-life
transactions.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-23 Thread Kevin Grittner
Jeff Janes jeff.ja...@gmail.com wrote:
 
 Could we get some major OLTP users to post their CLOG for
 analysis?  I wouldn't think there would be much
 security/propietary issues with CLOG data.
 
FWIW, I got the raw numbers to do my quick check using this Ruby
script (put together for me by Peter Brant).  If it is of any use to
anyone else, feel free to use it and/or post any enhanced versions
of it.
 
#!/usr/bin/env ruby

# Count the 256-byte CLOG chunks (1024 transactions each, 2 bits per
# transaction) that are entirely 0x55, i.e. every transaction committed.
Dir.glob("*") do |file_name|
  contents = File.read(file_name)
  total =
    contents.enum_for(:each_byte).enum_for(:each_slice, 256).inject(0) do |count, chunk|
      if chunk.all? { |b| b == 0x55 }
        count + 1
      else
        count
      end
    end
  printf "%s %d\n", file_name, total
end
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-23 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 I had a perhaps crazier idea. Aren't CLOG pages older than global xmin
 effectively read only?  Could backends that need these bypass locking
 and shared memory altogether?

Hmm ... once they've been written out from the SLRU arena, yes.  In fact
you don't need to go back as far as global xmin --- *any* valid xmin is
a sufficient boundary point.  The only real problem is to know whether
the data's been written out from the shared area yet.

This idea has potential.  I like it better than Robert's, mainly because
I do not want to see us put something in place that would lead people to
try to avoid rollbacks.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-22 Thread Leonardo Francalanci

Agreed.

I do agree with Heikki that it really ought to be the OS problem, but
then we thought that about dtrace and we're still waiting for that or
similar to be usable on all platforms (+/- 4 years).



My point is that it looks like this is going to take 1-2 years in 
PostgreSQL, so it looks like a huge job... but at the same time I 
understand we can't just hope other filesystems will catch up!



I guess this feature will be tunable (off/on)?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Florian Weimer
* David Fetter:

 The issue is that double writes needs a checksum to work by itself,
 and page checksums more broadly work better when there are double
 writes, obviating the need to have full_page_writes on.

How desirable is it to disable full_page_writes?  Doesn't it cut down
recovery time significantly because it avoids read-modify-write cycles
with a cold cache?

-- 
Florian Weimerfwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Simon Riggs
On Thu, Dec 22, 2011 at 7:44 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 22.12.2011 01:43, Tom Lane wrote:

 A utility to bump the page version is equally a whole lot easier said
 than done, given that the new version has more overhead space and thus
 less payload space than the old.  What does it do when the old page is
 too full to be converted?  Move some data somewhere else might be
 workable for heap pages, but I'm less sanguine about rearranging indexes
 like that.  At the very least it would imply that the utility has full
 knowledge about every index type in the system.


 Remembering back the old discussions, my favorite scheme was to have an
 online pre-upgrade utility that runs on the old cluster, moving things
 around so that there is enough spare room on every page. It would do normal
 heap updates to make room on heap pages (possibly causing transient
 serialization failures, like all updates do), and split index pages to make
 room on them. Yes, it would need to know about all index types. And it would
 set a global variable to indicate that X bytes must be kept free on all
 future updates, too.

 Once the pre-upgrade utility has scanned through the whole cluster, you can
 run pg_upgrade. After the upgrade, old page versions are converted to new
 format as pages are read in. The conversion is straightforward, as the
 pre-upgrade utility ensured that there is enough spare room on every page.

That certainly works, but we're still faced with pg_upgrade rewriting
every page, which will take a significant amount of time and with no
backout plan or rollback facility. I don't like that at all, hence why
I think we need an online upgrade facility if we do have to alter page
headers.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Simon Riggs
On Thu, Dec 22, 2011 at 8:42 AM, Florian Weimer fwei...@bfk.de wrote:
 * David Fetter:

 The issue is that double writes needs a checksum to work by itself,
 and page checksums more broadly work better when there are double
 writes, obviating the need to have full_page_writes on.

 How desirable is it to disable full_page_writes?  Doesn't it cut down
 recovery time significantly because it avoids read-modify-write cycles
 with a cold cache?

It's way too late in the cycle to suggest removing full page writes or
code them. We're looking to add protection, not swap out existing
ones.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Jesper Krogh

On 2011-12-22 09:42, Florian Weimer wrote:

* David Fetter:


The issue is that double writes needs a checksum to work by itself,
and page checksums more broadly work better when there are double
writes, obviating the need to have full_page_writes on.

How desirable is it to disable full_page_writes?  Doesn't it cut down
recovery time significantly because it avoids read-modify-write cycles
with a cold cache

What are the downsides of having full_page_writes enabled, apart from
log volume? The manual mentions something about speed, but it is
a bit unclear where that would come from, since the full pages must
be somewhere in memory when being worked on anyway.

Anyway, I have an archive_command that looks like:
archive_command = 'test ! -f /data/wal/%f.gz && gzip --fast < %p > /data/wal/%f.gz'


It brings along somewhere between a 50 and 75% reduction in log volume,
with no cost on the production system (since gzip just occupies one of the
many cores on the system), and can easily keep up even during
quite heavy writes.

Recovery is a bit more tricky, because hooking gunzip into the restore
command there will cause the system to run a replay log, gunzip, read
data, replay log cycle, where the gunzip could easily be done on the
other logfiles while replay is being done on one.


So a straightforward recovery will cost some recovery time, but that can
be dealt with.
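
For reference (this is not from Jesper's mail, just the matching
recovery-side setting under the same /data/wal layout), the
straightforward variant would be something like:

restore_command = 'gunzip < /data/wal/%f.gz > %p'

with decompressing upcoming segments in parallel left as the separate
optimization mentioned above.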


Jesper
--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Kevin Grittner
Simon Riggs  wrote:
 
 So overall, I do now think it's still possible to add an optional
 checksum in the 9.2 release and am willing to pursue it unless
 there are technical objections.
 
Just to restate Simon's proposal, to make sure I'm understanding it,
we would support a new page header format number and the old one in
9.2, both to be the same size and carefully engineered to minimize
what code would need to be aware of the version.  PageHeaderIsValid()
and PageInit() certainly would, and we would need some way to set,
clear (maybe), and validate a CRC.  We would need a GUC to indicate
whether to write the CRC, and if present we would always test it on
read and treat it as a damaged page if it didn't match.  (Perhaps
other options could be added later, to support recovery attempts, but
let's not complicate a first cut.)  This whole idea would depend on
either (1) trusting your storage system never to tear a page on write
or (2) getting the double-write feature added, too.
 
I see some big advantages to this over what I suggested to David. 
For starters, using a flag bit and putting the CRC somewhere other
than the page header would require that each AM deal with the CRC,
exposing some function(s) for that.  Simon's idea doesn't require
that.  I was also a bit concerned about shifting tuple images to
convert non-protected pages to protected pages.  No need to do that,
either.  With the bit flags, I think there might be some cases where
we would be unable to add a CRC to a converted page because space was
too tight; that's not an issue with Simon's proposal.
 
Heikki was talking about a pre-convert tool.  Neither approach really
needs that, although with Simon's approach it would be possible to
have a background *post*-conversion tool to add CRCs, if desired. 
Things would continue to function if it wasn't run; you just wouldn't
have CRC protection on pages not updated since pg_upgrade was run.
 
Simon, does it sound like I understand your proposal?
 
Now, on to the separate-but-related topic of double-write.  That
absolutely requires some form of checksum or CRC to detect torn
pages, in order for the technique to work at all.  Adding a CRC
without double-write would work fine if you have a storage stack
which prevents torn pages in the file system or hardware driver.  If
you don't have that, it could create a damaged page indication after
a hardware or OS crash, although I suspect that would be the
exception, not the typical case.  Given all that, and the fact that
it would be cleaner to deal with these as two separate patches, it
seems the CRC patch should go in first.  (And, if this is headed for
9.2, *very soon*, so there is time for the double-write patch to
follow.)
 
It seems to me that the full_page_writes GUC could become an
enumeration, with off having the current meaning, wal meaning
what on now does, and double meaning that the new double-write
technique would be used.  (It doesn't seem to make any sense to do
both at the same time.)  I don't think we need a separate GUC to tell
us *what* to protect against torn pages -- if not off we should
always protect the first write of a page after checkpoint, and if
double and write_page_crc (or whatever we call it) is on, then we
protect hint-bit-only writes.  I think.  I can see room to argue that
with CRCs on we should do a full-page write to the WAL for a
hint-bit-only change, or that we should add another GUC to control
when we do this.
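 
Spelled out as a tiny sketch (names invented, not a patch), the proposed
three-way setting would amount to something like this:

#include <stdio.h>

typedef enum FullPageWritesMode
{
    FPW_OFF,        /* no torn-page protection at all               */
    FPW_WAL,        /* today's "on": full page images go to WAL     */
    FPW_DOUBLE      /* proposed: first write goes via double-write  */
} FullPageWritesMode;

/* Torn-page protection exists in either of the non-off modes. */
static int
protects_torn_pages(FullPageWritesMode mode)
{
    return mode != FPW_OFF;
}

int
main(void)
{
    printf("double protects torn pages: %d\n", protects_torn_pages(FPW_DOUBLE));
    return 0;
}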
 
I'm going to take a shot at writing a patch for background hinting
over the holidays, which I think has benefit alone but also boosts
the value of these patches, since it would reduce double-write
activity otherwise needed to prevent spurious error when using CRCs.
 
This whole area has some overlap with spreading writes, I think.  The
double-write approach seems to count on writing a bunch of pages
(potentially from different disk files) sequentially to the
double-write buffer, fsyncing that, and then writing the actual pages
-- which must be fsynced before the related portion of the
double-write buffer can be reused.  The simple implementation would
be to simply fsync the files just written to if they required a prior
write to the double-write buffer, although fancier techniques could
be used to try to optimize that.  Again, setting hint bits before
the write when possible would help reduce the impact of that.
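 
A minimal sketch of that ordering using plain POSIX calls (the file
names, the batch layout and flush_batch are invented for illustration;
this is not the VMware patch):

#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 8192

/*
 * 1. append each dirty page to the double-write file
 * 2. fsync the double-write file
 * 3. write each page to its real location and fsync that file
 * 4. only then may this portion of the double-write file be reused
 */
static int
flush_batch(int dw_fd, int data_fd, const char *pages,
            const off_t *offsets, int npages)
{
    for (int i = 0; i < npages; i++)
        if (write(dw_fd, pages + (size_t) i * PAGE_SIZE, PAGE_SIZE) != PAGE_SIZE)
            return -1;
    if (fsync(dw_fd) != 0)                  /* batch durable in the DW area */
        return -1;

    for (int i = 0; i < npages; i++)
        if (pwrite(data_fd, pages + (size_t) i * PAGE_SIZE, PAGE_SIZE,
                   offsets[i]) != PAGE_SIZE)
            return -1;
    if (fsync(data_fd) != 0)                /* real pages durable too */
        return -1;

    return 0;                               /* DW slots may now be recycled */
}

int
main(void)
{
    char  *pages = calloc(2, PAGE_SIZE);
    off_t  offsets[2] = { 0, PAGE_SIZE };
    int    dw_fd   = open("doublewrite.tmp", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    int    data_fd = open("datafile.tmp",    O_CREAT | O_RDWR   | O_TRUNC, 0600);

    if (!pages || dw_fd < 0 || data_fd < 0)
        return 1;
    memset(pages, 0x55, 2 * PAGE_SIZE);     /* fake page contents */
    if (flush_batch(dw_fd, data_fd, pages, offsets, 2) != 0)
        perror("flush_batch");
    close(dw_fd);
    close(data_fd);
    free(pages);
    return 0;
}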
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Jignesh Shah
On Thu, Dec 22, 2011 at 4:00 AM, Jesper Krogh jes...@krogh.cc wrote:
 On 2011-12-22 09:42, Florian Weimer wrote:

 * David Fetter:

 The issue is that double writes needs a checksum to work by itself,
 and page checksums more broadly work better when there are double
 writes, obviating the need to have full_page_writes on.

 How desirable is it to disable full_page_writes?  Doesn't it cut down
 recovery time significantly because it avoids read-modify-write cycles
 with a cold cache

 What are the downsides of having full_page_writes enabled, apart from
 log volume? The manual mentions something about speed, but it is
 a bit unclear where that would come from, since the full pages must
 be somewhere in memory when being worked on anyway.



I thought I will share some of my perspective on this checksum +
doublewrite from a performance point of view.

Currently, what I see in our tests based on dbt2, DVDStore, etc. is
that checksums do not impact scalability or total measured throughput.
They do increase CPU cycles, depending on the algorithm used, but not
really by anything that causes problems. The doublewrite change will be
the big win for performance compared to full_page_write.  For example,
compared to other databases our WAL traffic is one of the highest, and
most of it is attributed to full_page_write. The reason full_page_write
is necessary in production (at least without worrying about replication
impact) is that if a write fails, we can recover that whole page from
the WAL logs as it is and just put it back out there. (In fact I believe
that's what recovery does.) However, during high OLTP the runtime impact
on WAL is high due to the high traffic, and compared to other databases
the utilization is correspondingly high. This also has a huge impact on
transaction response time the first time a page is changed, which in
OLTP environments matters a lot because by nature the transactions are
all on random pages.

When we use Doublewrite with checksums, we can safely disable
full_page_write, causing a HUGE reduction to the WAL traffic without
loss of reliability due to a write fault, since there are two writes
always. (Implementation detail discussable.) Since the double writes
themselves are sequential, bundling multiple such writes further reduces
the write time. The biggest improvement is that these writes are now
done not during TRANSACTION COMMIT but during CHECKPOINT WRITES, which
improves transaction performance drastically for OLTP applications,
and you still get the reliability that is needed.

Typically, performance in terms of system throughput (tps) looks like:
tps(full_page_write=on) < tps(no full page write)
With the double write and CRC we see:
tps(full_page_write=on) < tps(doublewrite) < tps(no full page write)
which is a big win for production systems, since they get the
reliability of full_page_write.

Also, the side effect for response times is that they are more level,
unlike full page writes where the response time varies from about 0.5ms
to 5ms depending on whether the transaction needs to write a full page
into WAL or not.  With doublewrite it can always be around 0.5ms rather
than having a huge deviation in transaction performance. With this,
folks measuring the 90th-percentile response time will see a huge relief
in trying to meet their SLAs.

Also, from a WAL perspective, I like to put the WAL on its own
LUN/spindle/VMDK etc. The net result is that with the reduced WAL
traffic my utilization drops, which means the same hardware can now
handle more WAL traffic in terms of IOPS, so WAL itself becomes less of
a bottleneck. Typically this is observed as a reduction in transaction
response times and an increase in tps until some other bottleneck
becomes the gating factor.

So overall this is a big win.

Regards,
Jignesh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Kevin Grittner
Jignesh Shah jks...@gmail.com wrote:
 
 When we use Doublewrite with checksums, we can safely disable
 full_page_write causing a HUGE reduction to the WAL traffic
 without loss of reliability due to a write fault since there are
 two writes always. (Implementation detail discussable).
 
The "always" there surprised me.  It seemed to me that we only need
to do the double-write where we currently do full page writes or
unlogged writes.  In thinking about your message, it finally struck
me that this might require a WAL record to be written with the
checksum (or CRC; whatever we use).  Still, writing a WAL record
with a CRC prior to the page write would be less data than the full
page.  Doing double-writes instead for situations without the torn
page risk seems likely to be a net performance loss, although I have
no benchmarks to back that up (not having a double-write
implementation to test).  And if we can get correct behavior without
doing either (the checksum WAL record or the double-write), that's
got to be a clear win.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Jignesh Shah
On Thu, Dec 22, 2011 at 11:16 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Jignesh Shah jks...@gmail.com wrote:

 When we use Doublewrite with checksums, we can safely disable
 full_page_write causing a HUGE reduction to the WAL traffic
 without loss of reliability due to a write fault since there are
 two writes always. (Implementation detail discussable).

 The always there surprised me.  It seemed to me that we only need
 to do the double-write where we currently do full page writes or
 unlogged writes.  In thinking about your message, it finally struck

Currently PG only does a full page write for the first change that makes
a page dirty after a checkpoint. This scheme works because all later
changes are relative to that first full-page image, so when a checkpoint
write fails it can recreate the page by using the full page write plus
all the delta changes from WAL.

In the double write implementation, every checkpoint write is double
written, so if the first doublewrite page write fails then the
original page is not corrupted, and if the second write to the actual
datapage fails, then one can recover it from the earlier write. Now,
while it is true that there are 2X writes during checkpoint, I can
argue that there are the same 2X writes right now, except that 1X of
the writes goes to WAL DURING TRANSACTION COMMIT.  Also, since
doublewrite is generally written to its own file it is essentially
sequential, so it doesn't have the same write latencies as the actual
checkpoint write. So if you look at the net amount of the writes it is
the same. For unlogged tables, even if you do doublewrite it is not
much of a penalty, even though they were not being logged in the WAL
before.  By doing the double write for them, they are still safe and
gain resilience even though it is not required. The net result is that
the underlying page is never irrecoverable due to failed writes.
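
A hypothetical sketch of that recovery rule, with a toy checksum
standing in for whatever CRC the real patch would use (none of these
names come from the patch):

#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define PAGE_SIZE 8192

/* Toy page: checksum stored in the first 4 bytes, payload in the rest. */
static uint32_t
page_checksum(const uint8_t *page)
{
    uint32_t sum = 0;

    for (size_t i = 4; i < PAGE_SIZE; i++)
        sum = sum * 31 + page[i];
    return sum;
}

static bool
page_is_intact(const uint8_t *page)
{
    uint32_t stored;

    memcpy(&stored, page, sizeof(stored));
    return stored == page_checksum(page);
}

/*
 * If the double-write copy is intact but the data-file copy is torn,
 * restore from the double-write copy.  If the double-write copy is itself
 * torn, the data-file copy was never overwritten and is left alone.
 */
static void
recover_page(const uint8_t *dw_copy, uint8_t *data_copy)
{
    if (page_is_intact(dw_copy) && !page_is_intact(data_copy))
        memcpy(data_copy, dw_copy, PAGE_SIZE);
}

int
main(void)
{
    static uint8_t good[PAGE_SIZE], torn[PAGE_SIZE];
    uint32_t sum;

    memset(good + 4, 0xAB, PAGE_SIZE - 4);
    sum = page_checksum(good);
    memcpy(good, &sum, sizeof(sum));        /* intact double-write copy */

    memcpy(torn, good, PAGE_SIZE);
    memset(torn + PAGE_SIZE / 2, 0, 512);   /* simulate a torn data page */

    recover_page(good, torn);
    printf("data page intact after recovery: %s\n",
           page_is_intact(torn) ? "yes" : "no");
    return 0;
}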


 me that this might require a WAL record to be written with the
 checksum (or CRC; whatever we use).  Still, writing a WAL record
 with a CRC prior to the page write would be less data than the full
 page.  Doing double-writes instead for situations without the torn
 page risk seems likely to be a net performance loss, although I have
 no benchmarks to back that up (not having a double-write
 implementation to test).  And if we can get correct behavior without
 doing either (the checksum WAL record or the double-write), that's
 got to be a clear win.

I am not sure why one would want to write the checksum to WAL.
As for the double writes, in fact there is not a net loss, because
(a) the writes to the doublewrite area are sequential, so the write
calls are relatively fast and in fact do not cause any latency increase
to any transactions, unlike full_page_write; and
(b) it can be moved to a different location so there is no stress on the
default tablespace, if you are worried about that spindle handling 2X
writes (which for full_page_writes is mitigated by moving pg_xlog to a
different spindle).

And my own tests support that the net result is almost as fast as
full_page_write=off (not quite the same, due to the extra write, which
gives you the desired reliability) but way better than
full_page_write=on.


Regards,
Jignesh






 -Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Robert Haas
On Thu, Dec 22, 2011 at 1:50 PM, Jignesh Shah jks...@gmail.com wrote:
 In the double write implementation, every checkpoint write is double
 written,

Unless I'm quite thoroughly confused, which is possible, the double
write will need to happen the first time a buffer is written following
each checkpoint.  Which might mean the next checkpoint, but it could
also be sooner if the background writer kicks in, or in the worst case
a buffer has to do its own write.

Furthermore, we can't *actually* write any pages until they are
written *and fsync'd* to the double-write buffer.  So the penalty for
the background writer failing to do the right thing is going to go up
enormously.  Think about VACUUM or COPY IN, using a ring buffer and
kicking out its own pages.  Every time it evicts a page, it is going
to have to doublewrite the buffer, fsync it, and then write it for
real.  That is going to make PostgreSQL 6.5 look like a speed demon.
The background writer or checkpointer can conceivably dump a bunch of
pages into the doublewrite area and then fsync the whole thing in
bulk, but a backend that needs to evict a page only wants one page, so
it's pretty much screwed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Jignesh Shah
On Thu, Dec 22, 2011 at 3:04 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Dec 22, 2011 at 1:50 PM, Jignesh Shah jks...@gmail.com wrote:
 In the double write implementation, every checkpoint write is double
 written,

 Unless I'm quite thoroughly confused, which is possible, the double
 write will need to happen the first time a buffer is written following
 each checkpoint.  Which might mean the next checkpoint, but it could
 also be sooner if the background writer kicks in, or in the worst case
 a buffer has to do its own write.



Logically the double write happens for every checkpoint write and it
gets fsynced. Implementation-wise you can do a chunk of those pages,
like we do with sets of pages, and sync them once, and yes, it still
performs better than full_page_write. As long as you compare with
full_page_write=on, the scheme is always much better. If you compare
it with the performance of full_page_write=off it is slightly slower,
but then you lose the reliability. So performance testers like me, who
always turn off full_page_write anyway during benchmark runs, will not
see any impact. However, folks in production who are rightly scared to
turn off full_page_write will have the ability to increase performance
without being scared of failed writes.

 Furthermore, we can't *actually* write any pages until they are
 written *and fsync'd* to the double-write buffer.  So the penalty for
 the background writer failing to do the right thing is going to go up
 enormously.  Think about VACUUM or COPY IN, using a ring buffer and
 kicking out its own pages.  Every time it evicts a page, it is going
 to have to doublewrite the buffer, fsync it, and then write it for
 real.  That is going to make PostgreSQL 6.5 look like a speed demon.

Like I said, implementation-wise it depends on how many such pages you
sync simultaneously, and real tests show that it is actually much
faster than one would expect.

 The background writer or checkpointer can conceivably dump a bunch of
 pages into the doublewrite area and then fsync the whole thing in
 bulk, but a backend that needs to evict a page only wants one page, so
 it's pretty much screwed.


Generally, at what point you pay the penalty is a trade-off. I would
argue that you are making me pay for the full page write on the first
transaction commit that changes the page, which I can never avoid, and
the result is a transaction response time that is unacceptable, since a
similar transaction that modifies an already-dirty page costs a lot
less. However, I can avoid page evictions if I select a bigger
bufferpool (not that I necessarily want to do that, but I have a
choice, without losing reliability).

Regards,
Jignesh



 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Simon Riggs
On Thu, Dec 22, 2011 at 9:50 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:

 Simon, does it sound like I understand your proposal?

Yes, thanks for restating.

 Now, on to the separate-but-related topic of double-write.  That
 absolutely requires some form of checksum or CRC to detect torn
 pages, in order for the technique to work at all.  Adding a CRC
 without double-write would work fine if you have a storage stack
 which prevents torn pages in the file system or hardware driver.  If
 you don't have that, it could create a damaged page indication after
 a hardware or OS crash, although I suspect that would be the
 exception, not the typical case.  Given all that, and the fact that
 it would be cleaner to deal with these as two separate patches, it
 seems the CRC patch should go in first.  (And, if this is headed for
 9.2, *very soon*, so there is time for the double-write patch to
 follow.)

It could work that way, but I seriously doubt that a technique only
mentioned in dispatches one month before the last CF is likely to
become trustable code within one month. We've been discussing CRCs for
years, so assembling the puzzle seems much easier, when all the parts
are available.

 It seems to me that the full_page_writes GUC could become an
 enumeration, with off having the current meaning, wal meaning
 what on now does, and double meaning that the new double-write
 technique would be used.  (It doesn't seem to make any sense to do
 both at the same time.)  I don't think we need a separate GUC to tell
 us *what* to protect against torn pages -- if not off we should
 always protect the first write of a page after checkpoint, and if
 double and write_page_crc (or whatever we call it) is on, then we
 protect hint-bit-only writes.  I think.  I can see room to argue that
 with CRCs on we should do a full-page write to the WAL for a
 hint-bit-only change, or that we should add another GUC to control
 when we do this.

 I'm going to take a shot at writing a patch for background hinting
 over the holidays, which I think has benefit alone but also boosts
 the value of these patches, since it would reduce double-write
 activity otherwise needed to prevent spurious error when using CRCs.

I would suggest you examine how to have an array of N bgwriters, then
just slot the code for hinting into the bgwriter. That way a bgwriter
can set hints, calc CRC and write pages in sequence on a particular
block. The hinting needs to be synchronised with the writing to give
good benefit.

If we want page checksums in 9.2, I'll need your help, so the hinting
may be a sidetrack.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-22 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 
 It could work that way, but I seriously doubt that a technique
 only mentioned in dispatches one month before the last CF is
 likely to become trustable code within one month. We've been
 discussing CRCs for years, so assembling the puzzle seems much
 easier, when all the parts are available.
 
Well, double-write has been mentioned on the lists for years,
sometimes in conjunction with CRCs, and I get the impression this is
one of those things which has been worked on out of the community's
view for a while and is just being posted now.  That's often not
viewed as the ideal way for development to proceed from a community
standpoint, but it's been done before with some degree of success --
particularly when a feature has been bikeshedded to a standstill. 
;-)
 
 I would suggest you examine how to have an array of N bgwriters,
 then just slot the code for hinting into the bgwriter. That way a
 bgwriter can set hints, calc CRC and write pages in sequence on a
 particular block. The hinting needs to be synchronised with the
 writing to give good benefit.
 
I'll think about that.  I see pros and cons, and I'll have to see
how those balance out after I mull them over.
 
 If we want page checksums in 9.2, I'll need your help, so the
 hinting may be a sidetrack.
 
Well, VMware posted the initial patch, and that was the first I
heard of it.  I just had some off-line discussions with them after
they posted it.  Perhaps the engineers who wrote it should take your
comments as a review and post a modified patch?  It didn't seem like
that pot of broth needed any more cooks, so I was going to go work
on a nice dessert; but I agree that any way I can help along either
of the $Subject patches should take priority.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-21 Thread Leonardo Francalanci

I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such as 
the one available in ZFS and btrfs?



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-21 Thread Stephen Frost
* Leonardo Francalanci (m_li...@yahoo.it) wrote:
 I can't help in this discussion, but I have a question:
 how different would this feature be from filesystem-level CRC, such
 as the one available in ZFS and btrfs?

Depends on how much you trust the filesystem. :)

Stephen




Re: [HACKERS] Page Checksums

2011-12-21 Thread Kevin Grittner
Greg Smith g...@2ndquadrant.com wrote:
  Some people think I border on the paranoid on this issue.
 
 Those people are also out to get you, just like the hardware.
 
Hah!  I *knew* it!
 
 Are you arguing that autovacuum should be disabled after crash
 recovery?  I guess if you are arguing that a database VACUUM
 might destroy recoverable data when hardware starts to fail, I
 can't argue.
 
 A CRC failure suggests to me a significantly higher possibility
 of hardware likely to lead to more corruption than a normal crash
 does though.
 
Yeah, the discussion has me coming around to the point of view
advocated by Andres: that it should be treated the same as corrupt
pages detected through other means.  But that can only be done if
you eliminate false positives from hint-bit-only updates.  Without
some way to handle that, I guess that means the idea is dead.
 
Also, I'm not sure that our shop would want to dedicate any space
per page for this, since we're comparing between databases to ensure
that values actually match, row by row, during idle time.  A CRC or
checksum is a lot weaker than that.  I can see where it would be
very valuable where more rigorous methods aren't in use; but it
would really be just extra overhead with little or no benefit for
most of our database clusters.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-21 Thread Andres Freund
On Wednesday, December 21, 2011 04:21:53 PM Kevin Grittner wrote:
 Greg Smith g...@2ndquadrant.com wrote:
   Some people think I border on the paranoid on this issue.
  
  Those people are also out to get you, just like the hardware.
 
 Hah!  I *knew* it!
 
  Are you arguing that autovacuum should be disabled after crash
  recovery?  I guess if you are arguing that a database VACUUM
  might destroy recoverable data when hardware starts to fail, I
  can't argue.
  
  A CRC failure suggests to me a significantly higher possibility
  of hardware likely to lead to more corruption than a normal crash
  does though.
 
 Yeah, the discussion has me coming around to the point of view
 advocated by Andres: that it should be treated the same as corrupt
 pages detected through other means.  But that can only be done if
 you eliminate false positives from hint-bit-only updates.  Without
 some way to handle that, I guess that means the idea is dead.
 
 Also, I'm not sure that our shop would want to dedicate any space
 per page for this, since we're comparing between databases to ensure
 that values actually match, row by row, during idle time.  A CRC or
 checksum is a lot weaker than that.  I can see where it would be
 very valuable where more rigorous methods aren't in use; but it
 would really be just extra overhead with little or no benefit for
 most of our database clusters.
Comparing between databases will by far not catch failures in all data, 
because you surely will not use all indexes. With index-only scans the 
likelihood of unnoticed heap corruption also increases.
E.g. I have seen disk-level corruption silently corrupting a unique index so 
that it didn't cover all data anymore, which led to rather big problems.
Not everyone can do regular dump+restore tests to protect against such 
scenarios...

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-21 Thread Leonardo Francalanci

On 21/12/2011 16.19, Stephen Frost wrote:

* Leonardo Francalanci (m_li...@yahoo.it) wrote:

I can't help in this discussion, but I have a question:
how different would this feature be from filesystem-level CRC, such
as the one available in ZFS and btrfs?


Depends on how much you trust the filesystem. :)



Ehm I hope that was a joke...


I think what I meant was: isn't this going to be useless in a couple of 
years (if, say, btrfs will be available)? Or it actually gives something 
that FS will never be able to give?


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-21 Thread Heikki Linnakangas

On 21.12.2011 17:21, Kevin Grittner wrote:

Also, I'm not sure that our shop would want to dedicate any space
per page for this, since we're comparing between databases to ensure
that values actually match, row by row, during idle time.


4 bytes out of an 8k block is just under 0.05%. I don't think anyone is 
going to notice the extra disk space consumed by this. There's all those 
other issues like the hint bits that make this a non-starter, but disk 
space overhead is not one of them.


IMHO we should just advise that you should use a filesystem with CRCs if 
you want that extra level of safety. It's the hardware's and operating 
system's job to ensure that data doesn't get corrupt after we hand it 
over to the OS with write()/fsync().


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-21 Thread Robert Haas
On Tue, Dec 20, 2011 at 12:12 PM, Christopher Browne cbbro...@gmail.com wrote:
 This seems to be a frequent problem with this whole doing CRCs on pages 
 thing.

 It's not evident which problems will be real ones.

That depends on the implementation.  If we have a flaky, broken
implementation such as the one proposed, then, yes, it will be
unclear.  But if we properly guard against a torn page invalidating
the CRC, then it won't be unclear at all: any CRC mismatch means
something bad happened.

Of course, that may be fairly expensive in terms of performance.  But
the only way I can see to get around that problem is to rewrite our
heap AM or our MVCC implementation in some fashion that gets rid of
hint bits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-21 Thread Stephen Frost
* Leonardo Francalanci (m_li...@yahoo.it) wrote:
 Depends on how much you trust the filesystem. :)
 
 Ehm I hope that was a joke...

It certainly wasn't..

 I think what I meant was: isn't this going to be useless in a couple
 of years (if, say, btrfs will be available)? Or it actually gives
 something that FS will never be able to give?

Yes, it will help you find/address bugs in the filesystem.  These things
are not unheard of...

Thanks,

Stephen




Re: [HACKERS] Page Checksums

2011-12-21 Thread Leonardo Francalanci

I think what I meant was: isn't this going to be useless in a couple
of years (if, say, btrfs will be available)? Or it actually gives
something that FS will never be able to give?


Yes, it will help you find/address bugs in the filesystem.  These things
are not unheard of...


It sounds to me like a huge job just to address some issues that are not unheard of...

My point is: if we are trying to fix misbehaving drives/controllers 
(something that is more common than one might think), that's already 
done by ZFS on Solaris and FreeBSD, and will be done in btrfs for linux.


I understand not trusting drives/controllers; but not trusting a 
filesystem...



What am I missing? (I'm far from being an expert... I just don't 
understand...)






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-21 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 4 bytes out of a 8k block is just under 0.05%. I don't think anyone is 
 going to notice the extra disk space consumed by this. There's all those 
 other issues like the hint bits that make this a non-starter, but disk 
 space overhead is not one of them.

The bigger problem is that adding a CRC necessarily changes the page
format and therefore breaks pg_upgrade.  As Greg and Simon already
pointed out upthread, there's essentially zero chance of this getting
applied before we have a solution that allows pg_upgrade to cope with
page format changes.  A CRC feature is not compelling enough to justify
a non-upgradable release cycle.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-21 Thread Greg Smith

On 12/21/2011 10:49 AM, Stephen Frost wrote:

* Leonardo Francalanci (m_li...@yahoo.it) wrote:
   

I think what I meant was: isn't this going to be useless in a couple
of years (if, say, btrfs will be available)? Or it actually gives
something that FS will never be able to give?
 

Yes, it will help you find/address bugs in the filesystem.  These things
are not unheard of...
   


There was a spike in data recovery business here after people started 
migrating to ext4.  New filesystems are no fun to roll out; some bugs 
will only get shaken out when brave early adopters deploy them.


And there's even more radical changes in btrfs, since it wasn't starting 
with a fairly robust filesystem as a base.  And putting my tin foil hat 
on, I don't feel real happy about assuming *the* solution for this issue 
in PostgreSQL is the possibility of a filesystem coming one day when 
that work is being steered by engineers who work at Oracle.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Page Checksums + Double Writes

2011-12-21 Thread David Fetter
Folks,

One of the things VMware is working on is double writes, per previous
discussions of how, for example, InnoDB does things.   I'd initially
thought that introducing just one of the features in $Subject at a
time would help, but I'm starting to see a mutual dependency.

The issue is that double writes needs a checksum to work by itself,
and page checksums more broadly work better when there are double
writes, obviating the need to have full_page_writes on.

If submitting these things together seems like a better idea than
having them arrive separately, I'll work with my team here to make
that happen soonest.

There's a separate issue we'd like to get clear on, which is whether
it would be OK to make a new PG_PAGE_LAYOUT_VERSION.

If so, there's less to do, but pg_upgrade as it currently stands is
broken.

If not, we'll have to do some extra work on the patch as described
below.  Thanks to Kevin Grittner for coming up with this :)

- Use a header bit to say whether we've got a checksum on the page.
  We're using 3/16 of the available bits as described in
  src/include/storage/bufpage.h.

- When that bit is set, place the checksum somewhere convenient on the
  page.  One way to do this would be to have an optional field at the
  end of the special space based on the new bit.  Rows from pg_upgrade
  would have the bit clear, and would have the shorter special
  structure without the checksum.
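
As an illustration only, a toy version of that layout; the constants and
the struct are invented and do not match src/include/storage/bufpage.h.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define PAGE_SIZE        8192
#define PD_HAS_CHECKSUM  0x0008             /* hypothetical new flag bit */

typedef struct ToyPageHeader
{
    uint16_t pd_flags;
    uint16_t pd_special;                    /* start of the special space */
} ToyPageHeader;

/* Bit clear (e.g. a page carried over by pg_upgrade): no checksum field. */
static bool
page_has_checksum(const uint8_t *page)
{
    ToyPageHeader hdr;

    memcpy(&hdr, page, sizeof(hdr));
    return (hdr.pd_flags & PD_HAS_CHECKSUM) != 0;
}

/* Bit set: the last 4 bytes of the special space hold the checksum. */
static uint32_t
page_read_checksum(const uint8_t *page)
{
    uint32_t sum;

    memcpy(&sum, page + PAGE_SIZE - sizeof(sum), sizeof(sum));
    return sum;
}

int
main(void)
{
    static uint8_t page[PAGE_SIZE];
    ToyPageHeader  hdr = { PD_HAS_CHECKSUM, PAGE_SIZE - 16 };
    uint32_t       sum = 0xC0FFEEu;

    memcpy(page, &hdr, sizeof(hdr));
    memcpy(page + PAGE_SIZE - sizeof(sum), &sum, sizeof(sum));

    if (page_has_checksum(page))
        printf("stored checksum: 0x%08X\n", (unsigned) page_read_checksum(page));
    else
        printf("old-format page, no checksum\n");
    return 0;
}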

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-21 Thread Alvaro Herrera

Excerpts from David Fetter's message of Wed Dec 21 18:59:13 -0300 2011:

 If not, we'll have to do some extra work on the patch as described
 below.  Thanks to Kevin Grittner for coming up with this :)
 
 - Use a header bit to say whether we've got a checksum on the page.
   We're using 3/16 of the available bits as described in
   src/include/storage/bufpage.h.
 
 - When that bit is set, place the checksum somewhere convenient on the
   page.  One way to do this would be to have an optional field at the
   end of the special space based on the new bit.  Rows from pg_upgrade
   would have the bit clear, and would have the shorter special
   structure without the checksum.

If you get away with a new page format, let's make sure and coordinate
so that we can add more info into the header.  One thing I wanted was to
have an ID struct on each file, so that you know what
DB/relation/segment the file corresponds to.  So the first page's
special space would be a bit larger than the others.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-21 Thread Kevin Grittner
Alvaro Herrera alvhe...@commandprompt.com wrote:
 
 If you get away with a new page format, let's make sure and
 coordinate so that we can add more info into the header.  One
 thing I wanted was to have an ID struct on each file, so that you
 know what DB/relation/segment the file corresponds to.  So the
 first page's special space would be a bit larger than the others.
 
Couldn't that also be done by burning a bit in the page header
flags, without a page layout version bump?  If that were done, you
wouldn't have the additional information on tables converted by
pg_upgrade, but you would get them on new tables, including those
created by pg_dump/psql conversions.  Adding them could even be made
conditional, although I don't know whether that's a good idea
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-21 Thread Martijn van Oosterhout
On Wed, Dec 21, 2011 at 09:32:28AM +0100, Leonardo Francalanci wrote:
 I can't help in this discussion, but I have a question:
 how different would this feature be from filesystem-level CRC, such
 as the one available in ZFS and btrfs?

Hmm, filesystems are not magical. If they implement this then they will
have the same issues with torn pages as Postgres would.  Which I
imagine they solve by doing a transactional update by writing the new
page to a new location, with checksum and updating a pointer.  They
can't even put the checksum on the same page, like we could.  How that
interacts with seqscans I have no idea.

Certainly I think we could look to them for implementation ideas, but I
don't imagine they've got something that can't be specialised for
better performance.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Page Checksums + Double Writes

2011-12-21 Thread Simon Riggs
On Wed, Dec 21, 2011 at 10:19 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Alvaro Herrera alvhe...@commandprompt.com wrote:

 If you get away with a new page format, let's make sure and
 coordinate so that we can add more info into the header.  One
 thing I wanted was to have an ID struct on each file, so that you
 know what DB/relation/segment the file corresponds to.  So the
 first page's special space would be a bit larger than the others.

 Couldn't that also be done by burning a bit in the page header
 flags, without a page layout version bump?  If that were done, you
 wouldn't have the additional information on tables converted by
 pg_upgrade, but you would get them on new tables, including those
 created by pg_dump/psql conversions.  Adding them could even be made
 conditional, although I don't know whether that's a good idea

These are good thoughts because they overcome the major objection to
doing *anything* here for 9.2.

We don't need to use any flag bits at all. We add
PG_PAGE_LAYOUT_VERSION to the control file, so that CRC checking
becomes an initdb option. All new pages can be created with
PG_PAGE_LAYOUT_VERSION from the control file. All existing pages must
be either the layout version from this release (4) or the next version
(5). Page validity then becomes version dependent.

pg_upgrade still works.

Layout 5 is where we add CRCs, so it's basically optional.
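
A minimal sketch of how version-dependent page validity could look,
assuming the control file records which layout newly created pages get
(all names below are made up for illustration, not real PostgreSQL code):

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_LAYOUT_V4 4                /* current layout, no CRC      */
    #define PAGE_LAYOUT_V5 5                /* proposed layout with CRC    */

    /* Accept layout 4 unconditionally; accept 5 only if this cluster
     * was initdb'd (or later switched) to create such pages. */
    static bool
    page_layout_is_valid(uint8_t page_version, uint8_t controlfile_version)
    {
        if (page_version == PAGE_LAYOUT_V4)
            return true;                    /* pre-existing pages always OK */
        return page_version == PAGE_LAYOUT_V5 &&
               controlfile_version >= PAGE_LAYOUT_V5;
    }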

We can also have a utility that allows you to bump the page version
for all new pages, even after you've upgraded, so we may end with a
mix of page layout versions in the same relation. That's more
questionable but I see no problem with it.

Do we need CRCs as a table level option? I hope not. That complicates
many things.

All of this allows us to have another more efficient page version (6)
in future without problems, so it's good infrastructure.

I'm now personally game on to make something work here for 9.2.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-21 Thread Tom Lane
David Fetter da...@fetter.org writes:
 There's a separate issue we'd like to get clear on, which is whether
 it would be OK to make a new PG_PAGE_LAYOUT_VERSION.

If you're not going to provide pg_upgrade support, I think there is no
chance of getting a new page layout accepted.  The people who might want
CRC support are pretty much exactly the same people who would find lack
of pg_upgrade a showstopper.

Now, given the hint bit issues, I rather doubt that you can make this
work without a page format change anyway.  So maybe you ought to just
bite the bullet and start working on the pg_upgrade problem, rather than
imagining you will find an end-run around it.

 The issue is that double writes needs a checksum to work by itself,
 and page checksums more broadly work better when there are double
 writes, obviating the need to have full_page_writes on.

Um.  So how is that going to work if checksums are optional?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-21 Thread Simon Riggs
On Wed, Dec 21, 2011 at 7:35 PM, Greg Smith g...@2ndquadrant.com wrote:

 And there's even more radical changes in btrfs, since it wasn't starting
 with a fairly robust filesystem as a base.  And putting my tin foil hat on,
 I don't feel real happy about assuming *the* solution for this issue in
 PostgreSQL is the possibility of a filesystem coming one day when that work
 is being steered by engineers who work at Oracle.

Agreed.

I do agree with Heikki that it really ought to be the OS problem, but
then we thought that about dtrace and we're still waiting for that or
similar to be usable on all platforms (+/- 4 years).

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-21 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 We don't need to use any flag bits at all. We add
 PG_PAGE_LAYOUT_VERSION to the control file, so that CRC checking
 becomes an initdb option. All new pages can be created with
 PG_PAGE_LAYOUT_VERSION from the control file. All existing pages must
 be either the layout version from this release (4) or the next version
 (5). Page validity then becomes version dependent.

 We can also have a utility that allows you to bump the page version
 for all new pages, even after you've upgraded, so we may end with a
 mix of page layout versions in the same relation. That's more
 questionable but I see no problem with it.

It seems like you've forgotten all of the previous discussion of how
we'd manage a page format version change.

Having two different page formats running around in the system at the
same time is far from free; in the worst case it means that every single
piece of code that touches pages has to know about and be prepared to
cope with both versions.  That's a rather daunting prospect, from a
coding perspective and even more from a testing perspective.  Maybe
the issues can be kept localized, but I've seen no analysis done of
what the impact would be or how we could minimize it.  I do know that
we considered the idea and mostly rejected it a year or two back.

A utility to bump the page version is equally a whole lot easier said
than done, given that the new version has more overhead space and thus
less payload space than the old.  What does it do when the old page is
too full to be converted?  Move some data somewhere else might be
workable for heap pages, but I'm less sanguine about rearranging indexes
like that.  At the very least it would imply that the utility has full
knowledge about every index type in the system.

 I'm now personally game on to make something work here for 9.2.

If we're going to freeze 9.2 in the spring, I think it's a bit late
for this sort of work to be just starting.  What you've just described
sounds to me like possibly a year's worth of work.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-21 Thread Simon Riggs
On Wed, Dec 21, 2011 at 11:43 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 It seems like you've forgotten all of the previous discussion of how
 we'd manage a page format version change.

Maybe I've had too much caffeine. It's certainly late here.

 Having two different page formats running around in the system at the
 same time is far from free; in the worst case it means that every single
 piece of code that touches pages has to know about and be prepared to
 cope with both versions.  That's a rather daunting prospect, from a
 coding perspective and even more from a testing perspective.  Maybe
 the issues can be kept localized, but I've seen no analysis done of
 what the impact would be or how we could minimize it.  I do know that
 we considered the idea and mostly rejected it a year or two back.

I'm looking at that now.

My feeling is it probably depends upon how different the formats are,
so given we are discussing a 4 byte addition to the header, it might
be doable.

I'm investing some time on the required analysis.

 A utility to bump the page version is equally a whole lot easier said
 than done, given that the new version has more overhead space and thus
 less payload space than the old.  What does it do when the old page is
 too full to be converted?  Move some data somewhere else might be
 workable for heap pages, but I'm less sanguine about rearranging indexes
 like that.  At the very least it would imply that the utility has full
 knowledge about every index type in the system.

I agree, rewriting every page is completely out and I never even considered it.

 I'm now personally game on to make something work here for 9.2.

 If we're going to freeze 9.2 in the spring, I think it's a bit late
 for this sort of work to be just starting.

I agree with that. If this goes adrift it will have to be killed for 9.2.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-21 Thread Rob Wultsch
On Wed, Dec 21, 2011 at 1:59 PM, David Fetter da...@fetter.org wrote:
 One of the things VMware is working on is double writes, per previous
 discussions of how, for example, InnoDB does things.

The world is moving to flash, and the lifetime of flash is measured
writes. Potentially doubling the number of writes is potentially
halving the life of the flash.

Something to think about...

-- 
Rob Wultsch
wult...@gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-21 Thread Robert Haas
On Wed, Dec 21, 2011 at 7:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
 My feeling is it probably depends upon how different the formats are,
 so given we are discussing a 4 byte addition to the header, it might
 be doable.

I agree.  When thinking back on Zoltan's patches, it's worth
remembering that he had a number of pretty bad ideas mixed in with the
good stuff - such as taking a bunch of things that are written as
macros for speed, and converting them to function calls.  Also, he
didn't make any attempt to isolate the places that needed to know
about both page versions; everybody knew about everything, everywhere,
and so everything needed to branch in places where it had not needed
to do so before.  I don't think we should infer from the failure of
those patches that no one can do any better.

On the other hand, I also agree with Tom that the chances of getting
this done in time for 9.2 are virtually zero, assuming that (1) we
wish to ship 9.2 in 2012 and (2) we don't wish to be making
destabilizing changes beyond the end of the last CommitFest.  There is
a lot of work here, and I would be astonished if we could wrap it all
up in the next month.  Or even the next four months.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-21 Thread David Fetter
On Wed, Dec 21, 2011 at 04:18:33PM -0800, Rob Wultsch wrote:
 On Wed, Dec 21, 2011 at 1:59 PM, David Fetter da...@fetter.org wrote:
  One of the things VMware is working on is double writes, per
  previous discussions of how, for example, InnoDB does things.
 
 The world is moving to flash, and the lifetime of flash is measured
 writes.  Potentially doubling the number of writes is potentially
 halving the life of the flash.
 
 Something to think about...

Modern flash drives let you have more write cycles than modern
spinning rust, so while yes, there is something happening, it's also
happening to spinning rust, too.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums + Double Writes

2011-12-21 Thread Simon Riggs
On Thu, Dec 22, 2011 at 12:06 AM, Simon Riggs si...@2ndquadrant.com wrote:

 Having two different page formats running around in the system at the
 same time is far from free; in the worst case it means that every single
 piece of code that touches pages has to know about and be prepared to
 cope with both versions.  That's a rather daunting prospect, from a
 coding perspective and even more from a testing perspective.  Maybe
 the issues can be kept localized, but I've seen no analysis done of
 what the impact would be or how we could minimize it.  I do know that
 we considered the idea and mostly rejected it a year or two back.

 I'm looking at that now.

 My feeling is it probably depends upon how different the formats are,
 so given we are discussing a 4 byte addition to the header, it might
 be doable.

 I'm investing some time on the required analysis.

We've assumed up to now that adding a CRC to the Page Header would add 4
bytes, meaning we are assuming a CRC-32 check field. This will change
the size of the header and thus break pg_upgrade in a straightforward
implementation. Breaking pg_upgrade is not acceptable. We can get
around this by making code dependent upon page version, allowing mixed
page versions in one executable. That causes the PageGetItemId() macro
to be page version dependent. After review, altering the speed of
PageGetItemId() is not acceptable either (show me microbenchmarks if
you doubt that). In a large minority of cases the line pointer and the
page header will be in separate cache lines.
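
To make that objection concrete, here is the kind of branch a
version-dependent lookup would put on a very hot path. The names and
header sizes below are invented; the real PageGetItemId() is a
branch-free offset calculation:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t lp_word; } ItemIdLike;  /* stand-in for ItemIdData */

    /*
     * Illustrative only: if the header size differed between layout
     * versions, every line-pointer lookup would have to test the page
     * version first, as below.
     */
    static ItemIdLike *
    page_get_item_id(char *page, int offnum, uint8_t page_version)
    {
        size_t header_size = (page_version >= 5) ? 28 : 24;  /* made-up sizes */

        return (ItemIdLike *) (page + header_size +
                               (offnum - 1) * sizeof(ItemIdLike));
    }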

As Kevin points out, we have 13 bits spare on the pd_flags of
PageHeader, so we have a little wiggle room there. In addition to that
I notice that the version part of pd_pagesize_version is 8 bits (the
page size is packed into the other 8 bits), yet we currently use just
one bit of it, since the version is 4. Version 3 was last seen in
Postgres 8.2, now de-supported.

Since we don't care too much about backwards compatibility with data
in Postgres 8.2 and below, we can just assume that all pages are
version 4 unless marked otherwise with additional flags. We then use
two separate bits in pd_flags to show PD_HAS_CRC (0x0008 and 0x8000),
and we completely replace the 16-bit pd_pagesize_version field with a
16-bit CRC value, rather than a 32-bit one. Why two flag bits? If
either CRC bit is set we assume the page's CRC is supposed to be
valid. This ensures that a single bit error doesn't switch off CRC
checking when it was supposed to be active. I suggest we remove the
page size data completely; if we need to keep it we should mark 8192
bytes as the default and set bits for 16kB and 32kB respectively.
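
A sketch of the reorganised header being described (field names, field
widths, and flag values here are illustrative only, not actual
PostgreSQL definitions):

    #include <stdint.h>

    #define PD_HAS_CRC_LO 0x0008            /* illustrative flag bits; if   */
    #define PD_HAS_CRC_HI 0x8000            /* either is set, CRC is in use */

    /* Sketch: the 16-bit pd_pagesize_version slot reused as a checksum. */
    typedef struct PageHeaderSketch
    {
        uint64_t    pd_lsn;                 /* stand-in for the LSN field   */
        uint16_t    pd_tli;
        uint16_t    pd_flags;
        uint16_t    pd_lower;
        uint16_t    pd_upper;
        uint16_t    pd_special;
        uint16_t    pd_checksum;            /* was pd_pagesize_version      */
    } PageHeaderSketch;

    static int
    page_has_crc(const PageHeaderSketch *hdr)
    {
        /* two bits so a single bit flip cannot silently disable checking */
        return (hdr->pd_flags & (PD_HAS_CRC_LO | PD_HAS_CRC_HI)) != 0;
    }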

With those changes, we are able to re-organise the page header so that
we can add a 16-bit checksum (CRC), yet retain the same size of
header. Thus, we don't need to change PageGetItemId(). We would
require changes to PageHeaderIsValid() and PageInit() only. Making
these changes means we are reducing the number of bits used to
validate the page header, though we are providing a much better way of
detecting page validity, so the change is of positive benefit.

Adding a CRC was a performance concern because of the hint bit
problem, so making the value 16 bits long gives performance where it
is needed. Note that we do now have a separation of bgwriter and
checkpointer, so we have more CPU bandwidth to address the problem.
Adding multiple bgwriters is also possible.

Notably, this proposal makes CRC checking optional, so if performance
is a concern it can be disabled completely.

Which CRC algorithm to choose?
"A study of error detection capabilities for random independent bit
errors and burst errors reveals that XOR, two's complement addition,
and Adler checksums are suboptimal for typical network use. Instead,
one's complement addition should be used for networks willing to
sacrifice error detection effectiveness to reduce compute cost,
Fletcher checksum for networks looking for a balance of error
detection and compute cost, and CRCs for networks willing to pay a
higher compute cost for significantly improved error detection."
-- "The Effectiveness of Checksums for Embedded Control Networks",
Maxino, T.C. & Koopman, P.J., IEEE Transactions on Dependable and
Secure Computing, Jan.-March 2009.
Available here - http://www.ece.cmu.edu/~koopman/pubs/maxino09_checksums.pdf

Based upon that paper, I suggest we use Fletcher-16. The overall
concept is not sensitive to the choice of checksum algorithm, however,
and the algorithm itself could be another option: F16 or CRC. My poor
understanding of the difference is that F16 is about 20 times cheaper
to calculate, at the expense of about 1000 times worse error detection
(but still pretty good).
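
For reference, a self-contained Fletcher-16 over the page bytes looks
roughly like this. It is a sketch only; a real version would skip or
zero the checksum field itself while summing:

    #include <stddef.h>
    #include <stdint.h>

    /* Fletcher-16: two running sums modulo 255, packed into 16 bits. */
    static uint16_t
    fletcher16(const uint8_t *data, size_t len)
    {
        uint32_t sum1 = 0;
        uint32_t sum2 = 0;

        for (size_t i = 0; i < len; i++)
        {
            sum1 = (sum1 + data[i]) % 255;
            sum2 = (sum2 + sum1) % 255;
        }
        return (uint16_t) ((sum2 << 8) | sum1);
    }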

16-bit CRCs are not the strongest available, but still support
excellent error detection rates - better than 1 failure in a million,
possibly much better depending on which algorithm and block size.
That's easily good enough to detect our kind of errors.

This idea doesn't 

Re: [HACKERS] Page Checksums + Double Writes

2011-12-21 Thread Heikki Linnakangas

On 22.12.2011 01:43, Tom Lane wrote:

A utility to bump the page version is equally a whole lot easier said
than done, given that the new version has more overhead space and thus
less payload space than the old.  What does it do when the old page is
too full to be converted?  "Move some data somewhere else" might be
workable for heap pages, but I'm less sanguine about rearranging indexes
like that.  At the very least it would imply that the utility has full
knowledge about every index type in the system.


Remembering back the old discussions, my favorite scheme was to have an 
online pre-upgrade utility that runs on the old cluster, moving things 
around so that there is enough spare room on every page. It would do 
normal heap updates to make room on heap pages (possibly causing 
transient serialization failures, like all updates do), and split index 
pages to make room on them. Yes, it would need to know about all index 
types. And it would set a global variable to indicate that X bytes must 
be kept free on all future updates, too.


Once the pre-upgrade utility has scanned through the whole cluster, you 
can run pg_upgrade. After the upgrade, old page versions are converted 
to new format as pages are read in. The conversion is straightforward, as 
the pre-upgrade utility ensured that there is enough spare room on 
every page.
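
A sketch of that read-in conversion step. All three helpers declared
below are hypothetical, not existing functions; the point is only that
the conversion can be unconditional once free space has been guaranteed:

    #define OLD_LAYOUT 4
    #define NEW_LAYOUT 5

    extern int  page_layout_version(const char *page);        /* hypothetical */
    extern void enlarge_header_in_place(char *page);           /* hypothetical */
    extern void set_page_layout_version(char *page, int v);    /* hypothetical */

    /* Convert an old-layout page as it is read into shared buffers. */
    static void
    convert_page_on_read(char *page)
    {
        if (page_layout_version(page) == OLD_LAYOUT)
        {
            /* shuffle line pointers into the reserved space, then stamp
             * the new layout version */
            enlarge_header_in_place(page);
            set_page_layout_version(page, NEW_LAYOUT);
        }
    }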


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Christopher Browne
On Tue, Dec 20, 2011 at 8:36 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Dec 19, 2011 at 2:44 PM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
 I was thinking that we would warn when such was found, set hint bits
 as needed, and rewrite with the new CRC.  In the unlikely event that
 it was a torn hint-bit-only page update, it would be a warning about
 something which is a benign side-effect of the OS or hardware crash.

 But that's terrible.  Surely you don't want to tell people:

 WARNING:  Your database is corrupted, or maybe not.  But don't worry,
 I modified the data block so that you won't get this warning again.

 OK, I guess I'm not sure that you don't want to tell people that.  But
 *I* don't!

This seems to be a frequent problem with this whole doing CRCs on pages thing.

It's not evident which problems will be real ones.  And in such
cases, is the answer to turf the database and recover from backup,
because of a single busted page?  For a big database, I'm not sure
that's less scary than the possibility of one page having a
corruption.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Alvaro Herrera

Excerpts from Christopher Browne's message of Tue Dec 20 14:12:56 -0300 2011:

 It's not evident which problems will be real ones.  And in such
 cases, is the answer to turf the database and recover from backup,
 because of a single busted page?  For a big database, I'm not sure
 that's less scary than the possibility of one page having a
 corruption.

I don't think the problem is having one page of corruption.  The problem
is *not knowing* that random pages are corrupted, and living in the fear
that they might be.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 On Mon, Dec 19, 2011 at 2:44 PM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
 I was thinking that we would warn when such was found, set hint
 bits as needed, and rewrite with the new CRC.  In the unlikely
 event that it was a torn hint-bit-only page update, it would be a
 warning about something which is a benign side-effect of the OS
 or hardware crash.
 
 But that's terrible.  Surely you don't want to tell people:
 
 WARNING:  Your database is corrupted, or maybe not.  But don't
 worry, I modified the data block so that you won't get this
 warning again.
 
 OK, I guess I'm not sure that you don't want to tell people that. 
 But *I* don't!
 
Well, I would certainly change that to comply with standard message
style guidelines.  ;-)
 
But the alternatives I've heard so far bother me more.  It sounds
like the most-often suggested alternative is:
 
ERROR (or stronger?):  page checksum failed in relation 999 page 9
DETAIL:  This may not actually affect the validity of any tuples,
since it could be a flipped bit in the checksum itself or dead
space, but we're shutting you down just in case.
HINT:  You won't be able to read anything on this page, even if it
appears to be well-formed, without stopping your database and using
some arcane tool you've never heard of before to examine and
hand-modify the page.  Any query which accesses this table may fail
in the same way.
 
The warning level message will be followed by something more severe
if the page or a needed tuple is mangled in a way that it would not
be used.  I guess the biggest risk here is that there is real damage
to data which doesn't generate a stronger response, and the users
are ignoring warning messages.  I'm not sure what to do about that,
but the above error doesn't seem like the right solution.
 
Assuming we do something about the torn page on hint-bit only
write issue, by moving the hint bits to somewhere else or logging
their writes, what would you suggest is the right thing to do when a
page is read with a checksum which doesn't match page contents?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Kevin Grittner
Alvaro Herrera alvhe...@commandprompt.com wrote:
 Excerpts from Christopher Browne's message of Tue Dec 20 14:12:56
 -0300 2011:
 
 It's not evident which problems will be real ones.  And in such
 cases, is the answer to turf the database and recover from
 backup, because of a single busted page?  For a big database, I'm
 not sure that's less scary than the possibility of one page
 having a corruption.
 
 I don't think the problem is having one page of corruption.  The
 problem is *not knowing* that random pages are corrupted, and
 living in the fear that they might be.
 
What would you want the server to do when a page with a mismatching
checksum is read?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Andres Freund
On Tuesday, December 20, 2011 06:38:44 PM Kevin Grittner wrote:
 Alvaro Herrera alvhe...@commandprompt.com wrote:
  Excerpts from Christopher Browne's message of Tue Dec 20 14:12:56
  
  -0300 2011:
  It's not evident which problems will be real ones.  And in such
  cases, is the answer to turf the database and recover from
  backup, because of a single busted page?  For a big database, I'm
  not sure that's less scary than the possibility of one page
  having a corruption.
  
  I don't think the problem is having one page of corruption.  The
  problem is *not knowing* that random pages are corrupted, and
  living in the fear that they might be.
 
 What would you want the server to do when a page with a mismatching
 checksum is read?
Follow the behaviour of zero_damaged_pages.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Aidan Van Dyk
On Tue, Dec 20, 2011 at 12:38 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:

 I don't think the problem is having one page of corruption.  The
 problem is *not knowing* that random pages are corrupted, and
 living in the fear that they might be.

 What would you want the server to do when a page with a mismatching
 checksum is read?

But that's exactly the problem.  I don't know what I want the server
to do, because I don't know if the page with the checksum mismatch is
one of the 10GB of pages in the page cache that were dirty but pose no
risk (i.e. hint-bit-only changes made them dirty), a page that was
really messed up in the kernel panic that last happened causing this
whole mess, or an even older page that really is suffering bitrot...

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On Tuesday, December 20, 2011 06:38:44 PM Kevin Grittner wrote:
 What would you want the server to do when a page with a mismatching
 checksum is read?

 Follow the behaviour of zero_damaged_pages.

Surely not.  Nobody runs with zero_damaged_pages turned on in
production; or at least, nobody with any semblance of a clue.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Andres Freund
On Tuesday, December 20, 2011 07:08:56 PM Tom Lane wrote:
 Andres Freund and...@anarazel.de writes:
  On Tuesday, December 20, 2011 06:38:44 PM Kevin Grittner wrote:
  What would you want the server to do when a page with a mismatching
  checksum is read?
  
  Follow the behaviour of zero_damaged_pages.
 
 Surely not.  Nobody runs with zero_damaged_pages turned on in
 production; or at least, nobody with any semblance of a clue.
That's my point. There is no automated solution for page errors. So it should 
ERROR (not PANIC) out in normal operation and be fixable via 
zero_damaged_pages.
I personally wouldn't even have a problem with making zero_damaged_pages only 
applicable in single-backend mode.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Simon Riggs
On Mon, Dec 19, 2011 at 11:10 AM, Simon Riggs si...@2ndquadrant.com wrote:

 The only sensible way to handle this is to change the page format as
 discussed. IMHO the only sensible way that can happen is if we also
 support an online upgrade feature. I will take on the online upgrade
 feature if others work on the page format issues, but none of this is
 possible for 9.2, ISTM.

I've had another look at this just to make sure.

Doing this for 9.2 will change the page format, causing every user to
do an unload/reload, with no provided mechanism to do that, whether or
not they use this feature.

If we do that, the hints are all in the wrong places, meaning any hint
set will need to change the CRC.

Currently, setting hints can be done while holding a share lock on the
buffer. Preventing that would require us to change the way buffer
manager works to make it take an exclusive lock while writing out,
since a hint would change the CRC and so allowing hints to be set
while we write out would cause invalid CRCs. So we would need to hold
exclusive lock on buffers while we calculate CRCs.

Overall, this will cause a much bigger performance hit than we planned
for. But then we have SSI as an option, so why not this?

So, do we have enough people in the house that are willing to back
this idea, even with a severe performance hit?  Are we willing to
change the page format now, with plans to change it again in the
future? Are we willing to change the page format for a feature many
people will need to disable anyway? Do we have people willing to spend
time measuring the performance in enough cases to allow educated
debate?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Andres Freund
On Tuesday, December 20, 2011 06:44:48 PM Simon Riggs wrote:
 Currently, setting hints can be done while holding a share lock on the
 buffer. Preventing that would require us to change the way buffer
 manager works to make it take an exclusive lock while writing out,
 since a hint would change the CRC and so allowing hints to be set
 while we write out would cause invalid CRCs. So we would need to hold
 exclusive lock on buffers while we calculate CRCs.
While hint bits are a problem, that specific problem is actually handled by 
copying the buffer into a separate buffer and calculating the CRC on that copy. 
Given that we already rely on the fact that the flags can be read consistently 
from the individual backends, that's fine.
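
Roughly, the write path then looks like this. It is a sketch only;
compute_page_crc() and the checksum offset are placeholders, not an
existing API:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 8192

    extern uint16_t compute_page_crc(const char *page);       /* hypothetical */

    /*
     * "Checksum a copy": the shared buffer can still receive hint-bit
     * updates under a share lock, so the CRC is computed over a private
     * copy, and that copy is what gets written out.
     */
    static void
    write_buffer_with_checksum(const char *shared_buf, char *write_buf)
    {
        uint16_t    crc;

        memcpy(write_buf, shared_buf, PAGE_SIZE);   /* snapshot the page */
        crc = compute_page_crc(write_buf);          /* real code would exclude
                                                     * the checksum field */
        memcpy(write_buf + 8, &crc, sizeof(crc));   /* placeholder offset */

        /* write_buf, not shared_buf, is what gets handed to the write call */
    }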

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Jesper Krogh

On 2011-12-20 18:44, Simon Riggs wrote:

On Mon, Dec 19, 2011 at 11:10 AM, Simon Riggs si...@2ndquadrant.com wrote:


The only sensible way to handle this is to change the page format as
discussed. IMHO the only sensible way that can happen is if we also
support an online upgrade feature. I will take on the online upgrade
feature if others work on the page format issues, but none of this is
possible for 9.2, ISTM.

I've had another look at this just to make sure.

Doing this for 9.2 will change the page format, causing every user to
do an unload/reload, with no provided mechanism to do that, whether or
not they use this feature.


How about only calculating the checksum and setting it in the bgwriter
just before flying the buffer off to disk?  Perhaps even let autovacuum
do the same if it flushes pages to disk as a part of the process.

If someone comes along and sets a hint bit, changes data, etc., its only
job is to clear the checksum to a value meaning "we don't have a
checksum for this page".

Unless the bgwriter becomes bottlenecked by doing it, the impact on
foreground work should be fairly limited.
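
A tiny sketch of that scheme (the field name and the sentinel value are
illustrative only):

    #include <stdint.h>

    #define NO_CHECKSUM 0                   /* illustrative sentinel value */

    /* Anyone dirtying the page under a share lock just resets the field. */
    static void
    page_clear_checksum(uint16_t *pd_checksum_field)
    {
        *pd_checksum_field = NO_CHECKSUM;
    }

    /* The bgwriter stamps a real checksum just before the write. */
    static void
    page_stamp_checksum(uint16_t *pd_checksum_field, uint16_t computed)
    {
        /* avoid colliding with the "no checksum" sentinel */
        *pd_checksum_field = (computed == NO_CHECKSUM) ? 1 : computed;
    }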


Jesper .. just throwing in random thoughts ..
--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Jesper Krogh

On 2011-12-19 02:55, Greg Stark wrote:

On Sun, Dec 18, 2011 at 7:51 PM, Jesper Krogh jes...@krogh.cc wrote:

I don't know if it would be seen as a half-baked feature or similar,
and I don't know if the hint bit problem is solvable at all, but I could
easily imagine checksumming just skipping the hint bits entirely.

That was one approach discussed. The problem is that the hint bits are
currently in each heap tuple header which means the checksum code
would have to know a fair bit about the structure of the page format.
Also the closer people looked the more hint bits kept turning up
because the coding pattern had been copied to other places (the page
header has one, and index pointers have a hint bit indicating that the
target tuple is deleted, etc). And to make matters worse skipping
individual bits in varying places quickly becomes a big consumer of
cpu time since it means injecting logic into each iteration of the
checksum loop to mask out the bits.

I do know it is a valid and really relevant point (the CPU time spent),
but here in late 2011 it is really a damn irritating limitation, since if
there is any resource I have plenty of in the production environment,
then it is CPU time, just not on the single core currently serving the
client.


Jesper
--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Greg Smith

On 12/19/2011 06:14 PM, Kevin Grittner wrote:

But if you need all that infrastructure just to get the feature
launched, that's a bit hard to stomach.
 


Triggering a vacuum or some hypothetical scrubbing feature?
   


What you were suggesting doesn't require triggering just a vacuum 
though--it requires triggering some number of vacuums, for all impacted 
relations.  You said yourself that "all tables, if there's no way to 
rule any of them out" was a possibility.  I'm just pointing out that 
scheduling that level of work is a logistics headache, and it would be 
reasonable for people to expect some help with that were it to become a 
necessary thing falling out of the implementation.



Some people think I border on the paranoid on this issue.


Those people are also out to get you, just like the hardware.


Are you arguing that autovacuum should be disabled after crash
recovery?  I guess if you are arguing that a database VACUUM might
destroy recoverable data when hardware starts to fail, I can't
argue.


A CRC failure suggests to me a significantly higher possibility of 
hardware problems likely to lead to more corruption than a normal crash 
does, though.



The main way I expect to validate this sort of thing is with an as
yet unwritten function to grab information about a data block from
a standby server for this purpose, something like this:

Master:  Computed CRC A, Stored CRC B; error raised because A!=B
Standby:  Computed CRC C, Stored CRC D

If C==D && A==C, the corruption is probably overwritten bits of
the CRC B.
 


Are you arguing we need *that* infrastructure to get the feature
launched?
   


No; just pointing out the things I'd eventually expect people to want, 
because they help answer questions about what to do when CRC failures 
occur.  The most reasonable answer to what should I do about suspected 
corruption on a page? in most of the production situations I worry 
about is see if it's recoverable from the standby.  I see this as 
being similar to how RAID-1 works:  if you find garbage on one drive, 
and you can get a clean copy of the block from the other one, use that 
to recover the missing data.  If you don't have that capability, you're 
stuck with no clear path forward when a CRC failure happens, as you 
noted downthread.


This obviously gets troublesome if you've recently written a page out, 
so there's some concern about whether you are checking against the 
correct version of the page or not, based on where the standby's replay 
is at.  I see that as being a case that's also possible to recover from 
though, because then the page you're trying to validate on the master is 
likely sitting in the recent WAL stream.  This is already the sort of 
thing companies doing database recovery work (of which we are one) deal 
with, and I doubt any proposal will cover every possible situation.  In 
some cases there may be no better answer than "show all the known 
versions and ask the user to sort it out".  The method I suggested would 
sometimes kick out an automatic fix.
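
The cross-check described above could be reduced to a small decision
function, something like this purely illustrative sketch (not an
existing tool):

    #include <stdint.h>

    typedef enum
    {
        VERDICT_OK,                 /* no checksum failure on the master    */
        VERDICT_BAD_STORED_CRC,     /* page data matches the standby's copy */
        VERDICT_CORRUPT_PAGE,       /* stored CRCs agree; page data differs */
        VERDICT_UNDECIDED
    } CrcVerdict;

    /* A/B: computed and stored CRC on the master; C/D: same on the standby. */
    static CrcVerdict
    cross_check_crc(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
    {
        if (a == b)
            return VERDICT_OK;
        if (c != d)
            return VERDICT_UNDECIDED;       /* standby copy is suspect too */
        if (a == c)
            return VERDICT_BAD_STORED_CRC;  /* only the stored CRC B looks bad */
        if (b == d)
            return VERDICT_CORRUPT_PAGE;    /* master page bits likely flipped */
        return VERDICT_UNDECIDED;
    }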


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-20 Thread Robert Haas
On Mon, Dec 19, 2011 at 2:44 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 I was thinking that we would warn when such was found, set hint bits
 as needed, and rewrite with the new CRC.  In the unlikely event that
 it was a torn hint-bit-only page update, it would be a warning about
 something which is a benign side-effect of the OS or hardware crash.

But that's terrible.  Surely you don't want to tell people:

WARNING:  Your database is corrupted, or maybe not.  But don't worry,
I modified the data block so that you won't get this warning again.

OK, I guess I'm not sure that you don't want to tell people that.  But
*I* don't!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Page Checksums

2011-12-19 Thread Simon Riggs
On Mon, Dec 19, 2011 at 4:21 AM, Josh Berkus j...@agliodbs.com wrote:
 On 12/18/11 5:55 PM, Greg Stark wrote:
 There is another way to look at this problem. Perhaps it's worth
 having a checksum *even if* there are ways for the checksum to be
 spuriously wrong. Obviously having an invalid checksum can't be a
 fatal error then but it might still be useful information. Rright now
 people don't really know if their system can experience torn pages or
 not and having some way of detecting them could be useful. And if you
 have other unexplained symptoms then having checksum errors might be
 enough evidence that the investigation should start with the hardware
 and get the sysadmin looking at hardware logs and running memtest
 sooner.

 Frankly, if I had torn pages, even if it was just hint bits missing, I
 would want that to be logged.  That's expected if you crash, but if you
 start seeing bad CRC warnings when you haven't had a crash?  That means
 you have a HW problem.

 As long as the CRC checks are by default warnings, then I don't see a
 problem with this; it's certainly better than what we have now.

It is an important problem, and also a big one, which is why it still exists.

Throwing WARNINGs for normal events would not help anybody; thousands
of false positives would just make Postgres appear to be less robust
than it really is. That would be a credibility disaster. VMware
already have their own distro, so if they like this patch they can use
it.

The only sensible way to handle this is to change the page format as
discussed. IMHO the only sensible way that can happen is if we also
support an online upgrade feature. I will take on the online upgrade
feature if others work on the page format issues, but none of this is
possible for 9.2, ISTM.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

