Re: [HACKERS] Could not open file pg_multixact/offsets/0000 ERROR on 9.3.4
On Wed, Jun 4, 2014 at 5:10 PM, Andres Freund wrote:
> Looks like you're hitting the issue described in
>
> http://archives.postgresql.org/message-id/20140530121631.GE25431%40alap3.anarazel.de

Aha, so the problem looks like this:

- all multixact contents were purged (9.2 to 9.3 incompatibility)
- the only offsets file left was the 0000 file created by initdb during bootstrap
- the truncation code in vacuum assumed that the page related to the multixact it tried to truncate actually exists
- that assumption was based on the fact that there exists a page earlier than the one it tried to truncate
- due to the wrong assumption, there was an attempt to truncate a non-existing page, resulting in the error.

Now that makes sense, thank you very much. Vacuum runs without issues once I removed that file.

I think it should be safe to recommend removing the file for everyone with the same problem. The 2 cases where the 0000 file is actually used are:

- when there are no more than 2 pages of offsets (0000 and 0001) since the server creation
- when the multixacts wrapped around and 0000 is not the first page.

I think in both cases there will be no gaps in the offsets segments, so the problem won't be there in the first place.

Cheers,
--
Alexey Klyukin
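As a concrete illustration of the segment arithmetic behind the explanation above, here is a small standalone C sketch. The constants assume a default build (8 kB blocks, 4-byte multixact offsets, 32 SLRU pages per segment file); the names are illustrative and this is not the server's actual multixact.c code:

#include <stdio.h>
#include <stdint.h>

/* Default-build constants: BLCKSZ = 8192, sizeof(MultiXactOffset) = 4,
 * and 32 SLRU pages per segment file. */
#define OFFSETS_PER_PAGE  (8192 / 4)   /* 2048 multixact offsets per page */
#define PAGES_PER_SEGMENT 32

/* Map a MultiXactId to the pg_multixact/offsets segment that holds its
 * offset entry; the file name is this number printed as 4 hex digits. */
static unsigned offsets_segment(uint32_t multi)
{
    return (multi / OFFSETS_PER_PAGE) / PAGES_PER_SEGMENT;
}

int main(void)
{
    /* The multixact vacuum complained about maps to the missing file... */
    printf("%u -> %04X\n", 2035987419u, offsets_segment(2035987419u)); /* 795A */
    /* ...while the cluster's live multixacts map to the file that exists. */
    printf("%u -> %04X\n", 2040987417u, offsets_segment(2040987417u)); /* 79A6 */
    /* Only multixacts 0..65535 ever live in the 0000 segment. */
    printf("last multi in 0000: %u\n",
           (unsigned) (OFFSETS_PER_PAGE * PAGES_PER_SEGMENT - 1));
    return 0;
}

This also matches the hex breakdown in the original report: 0x795AB3DB >> 16 = 0x795A.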
Re: [HACKERS] Could not open file pg_multixact/offsets/0000 ERROR on 9.3.4
Andres Freund wrote:
> On 2014-06-04 16:59:10 +0200, Alexey Klyukin wrote:
> > I've recently discovered a peculiar problem on one of our big databases
> > (more than 1TB). The database has been upgraded from 9.2 to 9.3.4 (using
> > hardlinks to speed up the process) on April 7th around 22:00 local time.
> > When doing vacuum on any table, the system fails with the following error:
> >
> > ERROR: could not access status of transaction 2035987419
> > DETAIL: Could not open file "pg_multixact/offsets/795A": No such file or
> > directory.
>
> Looks like you're hitting the issue described in
> http://archives.postgresql.org/message-id/20140530121631.GE25431%40alap3.anarazel.de

Will look into this. Thanks.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Could not open file pg_multixact/offsets/0000 ERROR on 9.3.4
Hi,

On 2014-06-04 16:59:10 +0200, Alexey Klyukin wrote:
> I've recently discovered a peculiar problem on one of our big databases
> (more than 1TB). The database has been upgraded from 9.2 to 9.3.4 (using
> hardlinks to speed up the process) on April 7th around 22:00 local time.
> When doing vacuum on any table, the system fails with the following error:
>
> ERROR: could not access status of transaction 2035987419
> DETAIL: Could not open file "pg_multixact/offsets/795A": No such file or
> directory.
>
> The error doesn't depend on the table being vacuumed, or even the database, i.e.:
>
> postgres=# create database x;
> CREATE DATABASE
> postgres=# \c x
> You are now connected to database "x" as user "postgres".
> x=# create table test();
> CREATE TABLE
> x=# vacuum test;
> ERROR: could not access status of transaction 2035987419
> DETAIL: Could not open file "pg_multixact/offsets/795A": No such file or
> directory.
>
> The content of pg_multixact/offsets is:
>
> pg_multixact$ ls -lR
> ./members:
> -rw------- 1 postgres postgres 8192 Apr 16 18:20 0000
>
> ./offsets:
> -rw------- 1 postgres postgres 8192 Apr 7 22:51 0000
> -rw------- 1 postgres postgres 262144 Apr 16 18:20 79A6
>
> the select of datminmxid from pg_database gives me:
> and the 2035987419 = 0x795AB3DB belongs to the 795A segment.
> The 0000 file just contains a single page of all zeroes. Neither the 9.3.4
> replica of this database, nor the original 9.2 cluster data directory
> contains this file.

Looks like you're hitting the issue described in
http://archives.postgresql.org/message-id/20140530121631.GE25431%40alap3.anarazel.de

> I'm tempted to just remove the file from master and restart the
> database, since it's effectively impossible to run vacuum now, but I'd like
> to understand what's happening first. Below is the result of pg_controldata
> for the master:

Yes, that's fine in this specific case. Note that the 0000 segment isn't
used by anything between the oldest and newest multixact:

> Latest checkpoint's NextMultiXactId:  2040987419
> Latest checkpoint's NextMultiOffset:  3
> Latest checkpoint's oldestXID:        1038291920
> Latest checkpoint's oldestXID's DB:   16415
> Latest checkpoint's oldestActiveXID:  1655189767
> Latest checkpoint's oldestMultiXid:   2040987417
> Latest checkpoint's oldestMulti's DB: 0

Greetings,

Andres Freund
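To make "isn't used by anything between the oldest and newest multixact" concrete, here is a simplified sketch under the same default-build assumptions as before (65536 multixacts per offsets segment); it ignores the wraparound-aware comparisons the server itself performs and just enumerates the segments covering the live range from the pg_controldata output:

#include <stdio.h>

#define MULTIS_PER_SEGMENT 65536u   /* 2048 offsets/page * 32 pages/segment */

int main(void)
{
    unsigned oldest = 2040987417u;  /* Latest checkpoint's oldestMultiXid */
    unsigned next   = 2040987419u;  /* Latest checkpoint's NextMultiXactId */

    /* The only offsets files vacuum legitimately needs are those covering
     * the live range [oldest, next). */
    for (unsigned seg = oldest / MULTIS_PER_SEGMENT;
         seg <= (next - 1) / MULTIS_PER_SEGMENT;
         seg++)
        printf("needed segment: %04X\n", seg);  /* prints only: 79A6 */

    return 0;
}

Since neither 0000 nor 795A falls inside that range, removing the stray 0000 file cannot lose any live multixact data in this specific case.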
[HACKERS] Could not open file pg_multixact/offsets/0000 ERROR on 9.3.4
Greetings,

I've recently discovered a peculiar problem on one of our big databases (more than 1TB). The database has been upgraded from 9.2 to 9.3.4 (using hardlinks to speed up the process) on April 7th around 22:00 local time. When doing vacuum on any table, the system fails with the following error:

ERROR: could not access status of transaction 2035987419
DETAIL: Could not open file "pg_multixact/offsets/795A": No such file or directory.

The error doesn't depend on the table being vacuumed, or even the database, i.e.:

postgres=# create database x;
CREATE DATABASE
postgres=# \c x
You are now connected to database "x" as user "postgres".
x=# create table test();
CREATE TABLE
x=# vacuum test;
ERROR: could not access status of transaction 2035987419
DETAIL: Could not open file "pg_multixact/offsets/795A": No such file or directory.

The content of pg_multixact/offsets is:

pg_multixact$ ls -lR
./members:
-rw------- 1 postgres postgres 8192 Apr 16 18:20 0000

./offsets:
-rw------- 1 postgres postgres 8192 Apr 7 22:51 0000
-rw------- 1 postgres postgres 262144 Apr 16 18:20 79A6

the select of datminmxid from pg_database gives me:

postgres=# select oid, datminmxid from pg_database;
  oid  | datminmxid
-------+------------
 12078 | 2040987419
 12073 | 2040987419
 16414 | 2035987419
 16415 | 2040987418
     1 | 2040987419
(5 rows)

and the 2035987419 = 0x795AB3DB belongs to the 795A segment. The 0000 file just contains a single page of all zeroes. Neither the 9.3.4 replica of this database, nor the original 9.2 cluster data directory contains this file.

I guess what happens is that at the end of vacuum PostgreSQL checks the status of the earliest available multixact (perhaps in order to truncate it), and fails because the file has already been removed.

I also have a replica of this database (which, as I said, doesn't have the segment). When I shut down this replica, copied the data directory, promoted the copied directory (effectively making a copy of the master) and issued vacuum, I observed no errors. However, when I manually created this pg_multixact/offsets/0000 file with dd (populating 8192 bytes with 0), exactly the same problem re-appeared as on the master.

I'm tempted to just remove the file from master and restart the database, since it's effectively impossible to run vacuum now, but I'd like to understand what's happening first. Below is the result of pg_controldata for the master:

data$ pg_controldata .
pg_control version number:            937
Catalog version number:               201306121
Database system identifier:           5999656352803688379
Database cluster state:               in production
pg_control last modified:             Wed 04 Jun 2014 04:49:45 PM CEST
Latest checkpoint location:           25D0/D38A8DF0
Prior checkpoint location:            25D0/BFA7B068
Latest checkpoint's REDO location:    25D0/C27A54A0
Latest checkpoint's REDO WAL file:    00000001000025D0000000C2
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/1655213806
Latest checkpoint's NextOID:          16795470
Latest checkpoint's NextMultiXactId:  2040987419
Latest checkpoint's NextMultiOffset:  3
Latest checkpoint's oldestXID:        1038291920
Latest checkpoint's oldestXID's DB:   16415
Latest checkpoint's oldestActiveXID:  1655189767
Latest checkpoint's oldestMultiXid:   2040987417
Latest checkpoint's oldestMulti's DB: 0
Time of latest checkpoint:            Wed 04 Jun 2014 04:45:45 PM CEST
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            hot_standby
Current max_connections setting:      500
Current max_prepared_xacts setting:   0
Current max_locks_per_xact setting:   64
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

There were no wraparound-related messages in the logs of either the old or the new cluster, nor did we observe any other errors apart from this one (we originally discovered it in the server logs, likely because autovacuum was also failing). Any hints on what's going on (and whether data corruption is a possibility)?

Cheers,
--
Alexey Klyukin
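For anyone in the same situation who wants to verify the "single page of all zeroes" observation before removing anything, a minimal standalone checker could look like the following. This is a hypothetical helper, not a PostgreSQL tool; the default path is simply the one from this report:

#include <stdio.h>

#define BLCKSZ 8192

/* Scan an SLRU segment file page by page and report whether any page
 * contains non-zero bytes; the exit code is non-zero if so. */
int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "pg_multixact/offsets/0000";
    unsigned char buf[BLCKSZ];
    size_t n, i;
    long pages = 0, nonzero_pages = 0;

    FILE *f = fopen(path, "rb");
    if (f == NULL)
    {
        perror(path);
        return 1;
    }

    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
    {
        pages++;
        for (i = 0; i < n; i++)
        {
            if (buf[i] != 0)
            {
                nonzero_pages++;
                break;
            }
        }
    }
    fclose(f);

    printf("%s: %ld page(s), %ld with non-zero content\n",
           path, pages, nonzero_pages);
    return nonzero_pages > 0 ? 2 : 0;
}

On a cluster affected as described in this thread it should report exactly one all-zero page; any non-zero content would be a reason to investigate further rather than delete the file.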