Re: [HACKERS] production server down

Joe Conway Tue, 14 Dec 2004 21:50:20 -0800

Tom Lane wrote:

> ...
> pg_control last modified:             Tue Dec 14 15:39:26 2004
> ...
> Time of latest checkpoint:            Tue Nov  2 17:05:32 2004
>
> [ blink... ]  That seems like an unreasonable gap between checkpoints,
> especially for a production server.  Can you see an explanation?

Hmmm, this is even more scary. We have two database clusters on this server, one on /replica/pgdata, and one on /production/pgdata (ignore the names -- /replica is actually the "production" instance at the moment).

# pg_controldata /replica/pgdata
pg_control version number:            72
Catalog version number:               200310211
Database cluster state:               shutting down
pg_control last modified:             Tue Dec 14 15:39:26 2004
Current log file ID:                  0
Next log file segment:                1
Latest checkpoint location:           0/9B0B8C
Prior checkpoint location:            0/9AA1B4
Latest checkpoint's REDO location:    0/9B0B8C
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        12
Latest checkpoint's NextXID:          536
Latest checkpoint's NextOID:          17142
Time of latest checkpoint:            Tue Nov  2 17:05:32 2004
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               64-bit integers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C

# pg_controldata /production/pgdata
pg_control version number:            72
Catalog version number:               200310211
Database cluster state:               shutting down
pg_control last modified:             Tue Nov  2 21:57:49 2004
Current log file ID:                  0
Next log file segment:                1
Latest checkpoint location:           0/9B0B8C
Prior checkpoint location:            0/9AA1B4
Latest checkpoint's REDO location:    0/9B0B8C
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        12
Latest checkpoint's NextXID:          536
Latest checkpoint's NextOID:          17142
Time of latest checkpoint:            Tue Nov  2 17:05:32 2004
Database block size:                  8192
Blocks per segment of large relation: 131072
Maximum length of identifiers:        64
Maximum number of function arguments: 32
Date/time type storage:               64-bit integers
Maximum length of locale name:        128
LC_COLLATE:                           C
LC_CTYPE:                             C
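
Comparing two dumps like these by eye is error-prone, so here is a minimal sketch of a field-by-field comparison. It assumes the text has the "name: value" layout shown above; the function names parse_controldata and diff_fields are made up for illustration, and the sample strings are a shortened copy of the two outputs, not the full dumps.

```python
# Sketch: compare two pg_controldata text dumps field by field.
# parse_controldata and diff_fields are hypothetical helper names.

def parse_controldata(text):
    """Turn 'name:   value' lines into a dict, splitting on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def diff_fields(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Shortened samples copied from the two outputs above (an assumption:
# in practice the full output of each command would be used instead).
REPLICA = """\
Database cluster state:               shutting down
pg_control last modified:             Tue Dec 14 15:39:26 2004
Latest checkpoint location:           0/9B0B8C
Latest checkpoint's NextXID:          536
Time of latest checkpoint:            Tue Nov  2 17:05:32 2004
"""

PRODUCTION = """\
Database cluster state:               shutting down
pg_control last modified:             Tue Nov  2 21:57:49 2004
Latest checkpoint location:           0/9B0B8C
Latest checkpoint's NextXID:          536
Time of latest checkpoint:            Tue Nov  2 17:05:32 2004
"""

differences = diff_fields(parse_controldata(REPLICA), parse_controldata(PRODUCTION))
for name, (left, right) in differences.items():
    print(f"{name}: {left!r} != {right!r}")
```

On these samples the only field reported is "pg_control last modified". Splitting on the first colon only is deliberate, since the timestamp values themselves contain colons.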

I have no idea how this happened, but the two outputs are identical except for the "last modified" date. The disk usage, at least, is what I'd expect:

# du -h --max-depth=1 /replica
403G    /replica/pgdata

# du -h --max-depth=1 /production
201G    /production/pgdata

The "/production/pgdata" cluster has not been in use since Nov 2. But we've been loading data aggressively into "/replica/pgdata".

Any theories on how we screwed up?

Joe

