Re: [GENERAL] Serializable Isolation and read/write conflict with index and different keys

2017-08-29 Thread Thomas Munro
On Wed, Aug 30, 2017 at 4:28 AM, Luca Looz  wrote:
> I'm trying to use the serializable isolation but I'm getting read/write
> dependency errors even if I have a unique index on the column used and the
> transactions are using different keys.
> For an example see this gist:
> https://gist.github.com/nathanl/f98450014f62dcaf0405394a0955e18e
> Is this an expected behavior?

Take a look at the EXPLAIN plan for the SELECT.  You'll get
fine-grained SIREAD locks if it's doing a btree index scan, but table-level
SIREAD locks for other kinds of plans.  Sometimes simple SSI tests can
show a lot of false positives just because of empty tables or missing
statistics (ANALYZE).
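
For example, you could run the query inside a SERIALIZABLE transaction
and then look at the SIREAD locks it took in pg_locks -- just a sketch,
with a made-up table t(k int) and an index on k:

  BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  EXPLAIN SELECT * FROM t WHERE k = 42;   -- Index Scan or Seq Scan?
  SELECT * FROM t WHERE k = 42;
  SELECT locktype, relation::regclass, page, tuple
    FROM pg_locks
   WHERE mode = 'SIReadLock';             -- tuple/page locks vs whole-relation locks
  COMMIT;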

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Isolation of schema renames

2017-08-09 Thread Thomas Munro
On Thu, Aug 10, 2017 at 2:14 PM, Tom Lane  wrote:
> Ben Leslie  writes:
>> I'm wondering if I can/should expect schema renames to be isolated.
>
> Nope, you should not.
>
> This is not an especially easy thing to fix, because to have the system
> behave as you wish it did, your second transaction would have to be
> ignoring already-committed DDL changes, and it is very easy to show
> examples where that would be fatal.

As mentioned on bug #14691[1], I think it might make theoretical sense
to do SSI checks on system catalogue tables + associated caches
(though I don't claim that's easy).  Then a SERIALIZABLE transaction
would abort if you created a conflict cycle with some other session
that has moved your cheese and it was also running in SERIALIZABLE
isolation.

[1] 
https://www.postgresql.org/message-id/flat/20170605191104.1442.24999%40wrigleys.postgresql.org

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Multixact members limit exceeded

2017-08-09 Thread Thomas Munro
On Thu, Aug 10, 2017 at 10:26 AM, Thomas Munro
 wrote:
> eaten a total of n! member space with an average size of n/2 per

Erm, math fail: not n!, but 1 + 2 + ... + n = n(n+1)/2.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Multixact members limit exceeded

2017-08-09 Thread Thomas Munro
On Wed, Aug 9, 2017 at 10:06 PM, Peter Hunčár  wrote:
> SELECT relname, age(relminmxid) as mxid_age,

I'm pretty sure you can't use age(xid) to compute the age of a
multixact ID.  Although they have type xid in pg_class and the
age(xid) function will happily subtract your multixact ID from the
next xid, the answer is meaningless.  mxid_age is the function you
want, though I can't remember if it's present in 9.4.
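
Something along these lines should give a meaningful age (untested, and
only if mxid_age() exists in your release; it is there in 9.5):

  SELECT relname,
         mxid_age(relminmxid) AS mxid_age,
         pg_size_pretty(pg_total_relation_size(oid)) AS table_size
    FROM pg_class
   WHERE relkind = 'r'
   ORDER BY 2 DESC
   LIMIT 10;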

On Thu, Aug 10, 2017 at 7:31 AM, Peter Hunčár  wrote:
> Thank you, yes those are the 'urgent' tables, I'd talk to the developers
> regarding the locks. I too think there's something 'fishy' going on.
>
> Anyway, could it be that autovacuum blocks manual vacuum? Because I ran
> vacuum (full, verbose) and some tables finished quite fast, with huge amount
> of io recorded in the monitoring, but some of them are kind of stuck?
> Which brings me to the second question, how can I cancel autovacuum?
>
> And the last thing I don't understand:
>
> One particular table before vacuum full:
>
>relname| relminmxid | table_size
> --++
>  delayed_jobs | 1554151198 | 21 GB
>
> And after vacuum full:
>
>relname| relminmxid | table_size
> --++
>  delayed_jobs | 1554155465 | 6899 MB
>
> Shouldn't the relminmxid be changed after vacuum full, or am I not
> understanding something?

How can delayed_jobs have relminmxid = 1,554,155,465?  According to
your pg_controldata output, the active range of multixact IDs in your
cluster is 11,604 -> 129,346,573, so there is some confusion here.
Perhaps this is the output of an expression involving
age(multixactid)?

One thing I noticed is that there are ~4 billion members (that's how
many you have when you run out of member space), but only ~128 million
multixacts, so I think the average multixact has ~32 members.
Considering the way that multixacts grow by copying and extending by
one, by the time you've created a multixact with n members you've
eaten a total of n! member space with an average size of n/2 per
multixact...  So one way to hit that average would be to repeatedly
build ~64 member multixacts, or if mixed with smaller ones then you'd
need to be intermittently building even larger ones.  A thundering
herd of worker processes repeatedly share-locking the same row or
something like that?
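
As a rough illustration of how that adds up (made-up table, not
necessarily your workload): if N sessions each do this and hold their
transactions open, every new locker copies the row's existing multixact
and adds itself, so the k-th locker writes a k-member multixact and the
total member space consumed is 1 + 2 + ... + N.

  -- run in each of N concurrent sessions, leaving the transaction open
  BEGIN;
  SELECT * FROM jobs WHERE id = 1 FOR SHARE;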

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Concurrency and UPDATE [...] FROM

2017-07-10 Thread Thomas Munro
On Tue, Jul 11, 2017 at 6:23 AM, Seamus Abshere  wrote:
> The purpose is to concat new data onto existing values of c:
>
> UPDATE tbl
> SET c = c || new_data.c
> FROM ( [...] ) AS new_data
> WHERE
>   tbl.id = new_data.id
>
> It appears to have a race condition:
>
> t0: Query A starts subquery
> t1: Query A starts self-join
> t2. Query A starts UPDATE with data from self-join and subquery
> t3. Query B starts subquery
> t4. Query B starts self-join (note: data from t1!)
> [...]
> tN. Query A finishes UPDATE
> tN+1. Query B finishes UPDATE, missing any new_data from Query A
>
> My assumption is that t1 and t4 (the self-joins) use SELECT but not
> SELECT FOR UPDATE. If they did, I think the race condition would go
> away.
>
> Did I analyze that right?

Yeah, I think so.  There is no EvalPlanQual[1] behaviour on "new_data"
meaning that you can finish up self-joining versions of "tbl" from two
different times.  SELECT FOR UPDATE in "new_data" (as a subselect or
CTE etc) would activate that, or you could use SERIALIZABLE isolation
to abort transactions where the race would change the outcome, or some
other serialisation scheme like table or advisory locks.
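
For example, something like this -- just a sketch, assuming a table
along the lines of tbl(id int primary key, c text) and that new_data
reads from tbl itself as in your self-join -- would make the second
UPDATE block and then re-evaluate against the committed values:

  UPDATE tbl
     SET c = tbl.c || new_data.c
    FROM (SELECT id, c
            FROM tbl
             FOR UPDATE) AS new_data
   WHERE tbl.id = new_data.id;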

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/executor/README#L297

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] FreeBSD 10 => 11: Dump and reload your PostgreSQL database unless you like it broken

2016-12-07 Thread Thomas Munro
On Wed, Dec 7, 2016 at 10:40 PM, Maeldron T.  wrote:
> Anyway, ICU is turned on for PostgreSQL 9.6 even in the pkg version. Hurray.

Hmm, a curious choice, considering that FreeBSD finally has built-in
collations that work!

Using the port's ICU patch doesn't change anything about the risks
here as far as I know.  ICU continually pulls corrections and
improvements from CLDR, and FreeBSD's libc is now doing the same, so
actually both ICU and libc might be getting the very same stream of
collation changes, just at different speeds.

IMHO we can't continue to ignore this problem: we need to teach
Postgres to recognise when collations change.  That could mean
adopting something like ICU and then tracking when every index is potentially
invalidated by a version change (see Peter Eisentraut's proposal), or
adopting non-portable implementation-specific techniques.  For the
latter, we'd probably not want to litter core Postgres with
assumptions about how each OS does things.  One fairly blunt approach
I came up with goes like this:

1.  Add a new optional GUC system_collation_version_command =
'/some/user/supplied/script.sh'.
2.  When postmaster starts, run it and remember the output in memory.
3.  When a database is created, store it for this database.
4.  When connecting to a database, complain loudly if version at
startup doesn't match the stored version.
5.  Update the stored value to the startup value when you REINDEX
DATABASE (ie that's how to clear the warning).

Then package maintainers could supply a script that knows how to do the
right thing on each OS.  For example it could be the package version
string from the currently installed locales package, or an MD5 hash of
the contents of all files in /usr/share/locales/, or whatever suits.
The reason for only running the script at postmaster startup is that
libc may cache collation data: a REINDEX could be working from old data
while a fresh run of the command would see new files on disk.  So we
want a cluster restart to be required after upgrading your OS before
the warning can be cleared.

That's horribly blunt: it makes you reindex the whole database even if
you don't use a collation that changed, or don't even use btrees, etc.
You could do something more nuanced and complicated that works at the
level of individual locales and indexes (see the link I posted earlier
to check_pg_collations for some ideas), but the idea would be
basically the same.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] FreeBSD 10 => 11: Dump and reload your PostgreSQL database unless you like it broken

2016-12-02 Thread Thomas Munro
On Sat, Dec 3, 2016 at 2:52 PM, Maeldron T.  wrote:
> Upgrading FreeBSD from 10 to 11 might break your database. It probably won’t
> be corrupted but it will be useless until you dump-import it, and you might
> need to do manual fixing.

Sorry to hear about your arm and your database.

The collation support was changed in FreeBSD 11 (in collaboration with
Illumos and DragonflyBSD[1]).  From FreeBSD's UPDATING:

Databases administrators will need to reindex their databases given
collation results will be different.

This is a problem that faces users of Postgres on all operating
systems, too; for example, some GNU/Linux users recently had indexes
silently corrupted because of a change to the glibc locale data for
German.  One approach is to use "amcheck"[2] (proposed for future
inclusion in PostgreSQL) to check if indexes are still sane, any time
the collation definition files or libc itself might have changed.
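
For example -- hedged, since amcheck isn't in core yet and the
extension/function names below come from the proposed patch and could
still change:

  CREATE EXTENSION amcheck;
  -- check that an index's ordering still agrees with the current collation
  SELECT bt_index_check('some_text_index'::regclass);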

> (Maybe database clusters should have a header that wouldn’t allow
> incompatible server versions to process the existing data. I wonder if it
> would take more than 8 bytes per server. But I guess it was not know to be
> incompatible. Even my two CIs didn’t show it.)

I had some thoughts along those lines too[3].  I thought about
checksumming libc and all relevant collation files (ie OS specific
files) so you could notice when anything that could bite you changes
(that is just some sketched out ideas, not production code).  Some
have suggested that PostgreSQL should stop using OS collations and
adopt ICU[4] and then use its versioning metadata.  Of course some
people already use ICU on FreeBSD because the old strcoll
implementation didn't play well with Unicode, but those users are
still presumably exposed to silent corruption when ICU changes because
AFAIK that patch didn't keep track of ICU versions.

[1] http://blog.etoilebsd.net/post/This_is_how_I_like_opensource
[2] 
https://www.postgresql.org/message-id/flat/cam3swzqzlmhmwmbqjzk+prkxrnuz4w90wymuwfkev8mz3de...@mail.gmail.com
[3] https://github.com/macdice/check_pg_collations
[4] 
https://www.postgresql.org/message-id/flat/85364fde-091f-bbc0-fec2-e3ede3984...@2ndquadrant.com

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Immutable datastore library?

2016-10-17 Thread Thomas Munro
On Tue, Oct 18, 2016 at 9:39 AM, Guyren Howe  wrote:
> I would like to use Postgres as an immutable data store. A subject table 
> would have a timestamp column, and that would be added to what would 
> otherwise be the primary key.
>
> Trap updates and turn them into inserts. Have an associated _deleted table. 
> Trap deletes and turn them into inserts of the primary key into that table.
>
> Create a view that only shows the non-deleted records with the most recent 
> timestamps.
>
> Stored procedure to do all that to a table. Event trigger to drop and 
> re-create the view on changes to the table columns.
>
> I think that's about it. Seems an obvious and nice general-purpose thing to 
> do. It also seems like something someone might have already done, but 
> googling "immutable postgres" pulls up things about writing functions.
>
> Does anyone know of a project that has already done this?

The term you need is 'temporal' and searching for that you'll find
several projects providing something like that for PostgreSQL.

I've done some projects using temporal (and bitemporal) models based
on Richard Snodgrass's excellent book (which I hear is widely read at
utility companies among others), without any special library support:

http://www.cs.arizona.edu/~rts/tdbbook.pdf

His work influenced the SQL standard which I expect/hope is inspiring
those projects.  SQL:2011 has a temporal feature that does this sort
of thing automatically.
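
If you want to experiment without a library, the core of the pattern
you describe can be sketched in a few lines (illustrative names only;
the update/delete-trapping triggers and the event trigger are left as
an exercise):

  CREATE TABLE account_history (
    id integer NOT NULL,
    valid_from timestamptz NOT NULL DEFAULT now(),
    deleted boolean NOT NULL DEFAULT false,
    balance numeric,
    PRIMARY KEY (id, valid_from)
  );

  -- current state = latest version of each id that isn't marked deleted
  CREATE VIEW account AS
  SELECT id, balance
    FROM (SELECT DISTINCT ON (id) id, balance, deleted
            FROM account_history
           ORDER BY id, valid_from DESC) latest
   WHERE NOT deleted;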

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] SERIALIZABLE and INSERTs with multiple VALUES

2016-10-13 Thread Thomas Munro
On Fri, Oct 14, 2016 at 2:04 AM, Kevin Grittner  wrote:
> On Wed, Oct 12, 2016 at 8:06 PM, Thomas Munro  
> wrote:
>> The "higher isolation levels" probably shouldn't be treated the same way.
>>
>> I think Peter's right about REPEATABLE READ.  We should definitely
>> raise the error immediately as we do in that level, because our RR
>> (SI) doesn't care about write skew and all that stuff, it just
>> promises that you can only see data in your snapshot.
>
> But the whole point of the special code for both RI and INSERT/ON
> CONFLICT is to get "underneath" that and provide a "primitive" that
> can see things an application statement can't, for better
> performance and error handling.  What SERIALIZABLE promises is that
> it runs exactly the same as REPEATABLE READ, but with some
> additional monitoring for serialization failure errors in some
> places that REPEATABLE READ does not generate them -- this would be
> the first and only place that SERIALIZABLE would break that model.
> The idea seems completely wrong and arbitrary.

Ugh, yeah.  Thanks for this reminder of the relationship between SI
and SSI, which I somehow temporarily lost sight of.

> Where do you see a problem if REPEATABLE READ handles INSERT/ON
> CONFLICT without error?  In many cases it would actually be
> providing a result consistent with a serial execution of the
> transactions; and where it doesn't, it would be the same anomalies
> that are possible with anything else under REPEATABLE READ.

I thought that there was something fishy about the idea of not running
Peter's check in the case of ON CONFLICT DO NOTHING in RR, because
then there isn't an opportunity to detect serialization failure that
the DO UPDATE variant has.  Upon reflection, DO NOTHING is not very
different from INSERT with an exception handler for unique_violation
that does nothing, and that doesn't cause RR to raise an error.  I see
now that you are right, and the check is probably bogus for RR.

>> But yeah, the existing code raises false positive serialization
>> failures under SERIALIZABLE, and that's visible in the isolation test
>> I posted: there is actually a serial order of those transactions with
>> the same result.
>
> Exactly.  The error based on the write conflict with ON CONFLICT DO
> NOTHING in your patch is really a false positive.  That doesn't
> break correctness, but it hurts performance, so it should be
> avoided if possible.

Agreed.  The check is bogus for SERIALIZABLE too, if we have proper SSI checks.

>> When working on commit fcff8a57 I became suspicious of the way ON
>> CONFLICT interacts with SSI, as I mentioned in passing back then[1],
>> thinking mainly of false negatives.  I failed to find a
>> non-serializable schedule involving ON CONFLICT that was allowed to
>> run, though I didn't spend much time on it.   One thing that worries
>> me is the final permutation of read-write-unique-4.spec, which
>> produces an arguably spurious UCV, that is, a transaction that doesn't
>> commit but raises a UCV instead of the serialization failure you might
>> expect.  The ON CONFLICT equivalent might be a transaction that takes
>> the ON CONFLICT path and then commits, even though it should be
>> considered non-serializable.  I would really like to understand that
>> case better, and until then I wouldn't bet my boots that it isn't
>> possible to commit anomalies using ON CONFLICT under SERIALIZABLE
>> without Peter's check (or even with it), despite the fact that it
>> reaches predicate locking code via heap_fetch etc.
>
> Hm.  With the duplicate key error I fail to see how any anomaly
> could make it to a committed state in the database, although I
> agree it is unfortunate that there is that one case where it really
> should be considered a serialization failure that we haven't yet
> coerced to yield that instead of the duplicate key error.

Right, in the unique_violation case it can't commit so there's no
problem (it would just be nicer to users if we could catch that case;
you might call it a false negative but it is harmless because a
unique_violation saves the day).  What I'm wondering about though is
whether a similar ON CONFLICT schedule suffers a similar problem, but
would allow you to commit.  For example, I think the ON CONFLICT
equivalent might be something like the following (rather contrived)
schedule, which happily commits if you comment out Peter's check:

(1)  postgres=# create table bank_account (id int primary key, cash int);
(1)  CREATE TABLE
(1)  postgres=# begin transaction isolation level serializable ;
(1)  BEGIN

(2)  postgres=# begin transaction isolation level 

Re: [GENERAL] SERIALIZABLE and INSERTs with multiple VALUES

2016-10-12 Thread Thomas Munro
On Thu, Oct 13, 2016 at 2:32 PM, Peter Geoghegan  wrote:
> On Wed, Oct 12, 2016 at 6:06 PM, Thomas Munro
>  wrote:
>> But yeah, the existing code raises false positive serialization
>> failures under SERIALIZABLE, and that's visible in the isolation test
>> I posted: there is actually a serial order of those transactions with
>> the same result.
>
> I was under the impression that false positives of this kind are
> allowed by SSI. Why focus on this false positive scenario in
> particular?

Sure, they're allowed.  Of course the ones caused by your own command
are questionable because there is no concurrent transaction and
retrying the transaction will never succeed, as discussed, but it
seems we all agree on that.  The question is just whether INSERT ...
ON CONFLICT should generate more false positives than plain old
INSERT.  Two overlapping conflicting plain old INSERTs without any
other entanglement to create a cycle will result in one succeeding and
the other getting a UCV, as if one ran after the other with no
overlap.  It would be nice if the ON CONFLICT case used the same
smarts to take the ON CONFLICT path, unless there is some theoretical
problem I'm overlooking.  Otherwise concurrency is reduced.

I wonder if we should fix the same-command problem reported by the OP,
and then study larger questions of ON CONFLICT/SERIALIZABLE
interaction as a separate project.  I may be imagining problems where
there are none...

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] SERIALIZABLE and INSERTs with multiple VALUES

2016-10-12 Thread Thomas Munro
On Thu, Oct 13, 2016 at 10:06 AM, Kevin Grittner  wrote:
> On Wed, Oct 12, 2016 at 3:02 PM, Peter Geoghegan  wrote:
>
>> I agree that the multi-value case is a bug.
>
>> I think that it should be pretty obvious to you why the check exists
>> at all, Kevin. It exists because it would be improper to decide to
>> take the DO NOTHING path on the basis of some other committed tuple
>> existing that is not visible to the original MVCC snapshot, at higher
>> isolation levels.
>
> That's only true if it causes a cycle in the apparent order of
> execution.  If we rip out the check what we have is behavior
> completely consistent with the other transaction executing first;
> in other words, it creates a read-write dependency (a/k/a
> rw-conflict) from the current transaction to the concurrent
> transaction which succeeds in its insert.  That may or may not
> cause a cycle, depending on what else is happening.

The "higher isolation levels" probably shouldn't be treated the same way.

I think Peter's right about REPEATABLE READ.  We should definitely
raise the error immediately as we do in that level, because our RR
(SI) doesn't care about write skew and all that stuff, it just
promises that you can only see data in your snapshot.  We can't allow
you to take a different course of action based on data that your
snapshot can't see, so the only reasonable thing to do is abandon
ship.

But yeah, the existing code raises false positive serialization
failures under SERIALIZABLE, and that's visible in the isolation test
I posted: there is actually a serial order of those transactions with
the same result.

When working on commit fcff8a57 I became suspicious of the way ON
CONFLICT interacts with SSI, as I mentioned in passing back then[1],
thinking mainly of false negatives.  I failed to find a
non-serializable schedule involving ON CONFLICT that was allowed to
run, though I didn't spend much time on it.   One thing that worries
me is the final permutation of read-write-unique-4.spec, which
produces an arguably spurious UCV, that is, a transaction that doesn't
commit but raises a UCV instead of the serialization failure you might
expect.  The ON CONFLICT equivalent might be a transaction that takes
the ON CONFLICT path and then commits, even though it should be
considered non-serializable.  I would really like to understand that
case better, and until then I wouldn't bet my boots that it isn't
possible to commit anomalies using ON CONFLICT under SERIALIZABLE
without Peter's check (or even with it), despite the fact that it
reaches predicate locking code via heap_fetch etc.

[1] 
https://www.postgresql.org/message-id/CAEepm%3D2kYCegxp9qMR5TM1X3oXHj16iYzLPj_go52R2R07EvnA%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] SERIALIZABLE and INSERTs with multiple VALUES

2016-10-12 Thread Thomas Munro
On Thu, Oct 13, 2016 at 8:45 AM, Kevin Grittner  wrote:
> On Wed, Oct 12, 2016 at 10:06 AM, Kevin Grittner  wrote:
>
>> The test in ExecCheckHeapTupleVisible() seems wrong to me.  It's
>> not immediately obvious what the proper fix is.
>
> To identify what cases ExecCheckHeapTupleVisible() was meant to
> cover I commented out the body of the function to see which
> regression tests failed.  None did.  The failures shown on this
> thread are fixed by doing so.  If there is really a need for this
> function, speak up now and provide a test case showing what is
> broken without it; otherwise if I can't find some justification for
> this function I will rip it (and the calls to it) out of the code.
> If you do have some test case showing what breaks without the
> function, let's get it added to the regression tests!

Here's an isolation test that shows the distinction between a
transaction that reports a serialization failure because it crashed
into its own invisible tuples, and one that reports a serialization
failure because it crashed into a concurrent transaction's invisible
tuples.  Surely Peter intended the latter to report an error, but the
former seems like an oversight.

Here's a patch that shows one way to fix it.  I think it does make
sense to change this, because otherwise automatic
retry-on-serialization-failure strategies will be befuddled by this
doomed transaction.  And as you and Vitaly have said, there is
literally no concurrent update.

-- 
Thomas Munro
http://www.enterprisedb.com


isolation-test.patch
Description: Binary data


check-self-inserted.patch
Description: Binary data



Re: [GENERAL] SERIALIZABLE and INSERTs with multiple VALUES

2016-10-12 Thread Thomas Munro
On Wed, Oct 12, 2016 at 8:50 PM, Albe Laurenz  wrote:
> Kevin Grittner wrote:
>> On Tue, Oct 11, 2016 at 2:29 PM, Jason Dusek  wrote:
>>> I notice the following oddity:
>>
>>>  =# CREATE TABLE with_pk (i integer PRIMARY KEY);
>>> CREATE TABLE
>>
>>>  =# BEGIN;
>>> BEGIN
>>>  =# INSERT INTO with_pk VALUES (2), (2) ON CONFLICT DO NOTHING;
>>> ERROR:  could not serialize access due to concurrent update
>>>  =# END;
>>> ROLLBACK
>>
>> I don't see that on development HEAD.  What version are you
>> running?  What is your setting for default_transaction_isolation?
>
> The subject says SERIALIZABLE, and I can see it on my 9.5.4 database:
>
> test=> CREATE TABLE with_pk (i integer PRIMARY KEY);
> CREATE TABLE
> test=> START TRANSACTION ISOLATION LEVEL SERIALIZABLE;
> START TRANSACTION
> test=> INSERT INTO with_pk VALUES (2), (2) ON CONFLICT DO NOTHING;
> ERROR:  could not serialize access due to concurrent update

This happens in both SERIALIZABLE and REPEATABLE READ when a single
command inserts conflicting rows with an ON CONFLICT clause, and it
comes from the check in ExecCheckHeapTupleVisible whose comment says:

/*
 * ExecCheckHeapTupleVisible -- verify heap tuple is visible
 *
 * It would not be consistent with guarantees of the higher isolation levels to
 * proceed with avoiding insertion (taking speculative insertion's alternative
 * path) on the basis of another tuple that is not visible to MVCC snapshot.
 * Check for the need to raise a serialization failure, and do so as necessary.
 */

So it seems to be working as designed.  Perhaps someone could argue
that you should make an exception for tuples inserted by the current
command.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] postgresql-8.1.18-2.1 information

2016-10-06 Thread Thomas Munro
On Fri, Oct 7, 2016 at 6:19 PM, Adrian Klaver  wrote:
> On 10/06/2016 10:05 PM, Yogesh Sharma wrote:
>>
>> Dear All,
>>
>>
>>
>> Thanks for your support.
>>
>>
>>
>> I need to take source rpm for below version for rhel.
>>
>> postgresql-8.1.18-2.1
>>
>>
>>
>> Could you please share link, so that I will take.
>>
>> I have tried to search but could not found.
>
>
> Probably because it is 6 years past End-of-Life:
>
> https://www.postgresql.org/support/versioning/
>
> Close as I can come is the source version:
>
> https://www.postgresql.org/ftp/source/v8.1.18/

Ancient Red Hat source RPMs are apparently still available for
archeology projects though:

ftp://ftp.redhat.com/pub/redhat/linux/enterprise/5Server/en/os/SRPMS/

(That's "5Server", they have other flavours too.)

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Time travel?

2016-09-29 Thread Thomas Munro
On Fri, Sep 30, 2016 at 8:17 AM, Kevin Grittner  wrote:
> On Thu, Sep 29, 2016 at 2:06 PM, Melvin Davidson  wrote:
>
>> I am using PostgreSQL 9.4.0 (Yes, I know 9.4.9 is most recent but
>> it's out of my control)
>
> As long as the decision-maker is aware that 9.4.0 has known bugs
> (fixed in later minor releases) that can render the database
> unusable without warning...
>
>> How is it possible for the WAL file to be accessed BEFORE it was
>> created?
>
> Perhaps renaming it counts as "creation" without affecting access
> time.

Apparently some filesystems change the ctime for rename and others
don't, and POSIX tolerates both.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Allowing multiple versions of PG under NetBSD

2016-07-29 Thread Thomas Munro
On Sat, Jul 30, 2016 at 8:51 AM, Stephen Frost  wrote:
> * Larry Rosenman (l...@lerctr.org) wrote:
>> On 2016-07-29 15:14, Bruce Momjian wrote:
>> >On Fri, Jul 29, 2016 at 03:09:59PM -0500, Larry Rosenman wrote:
>> >>>Data Directory naming, as well as keeping the init-scripts straight.
>> >>>
>> >>And who gets 5432, and Unix socket naming, it starts to get messy.
>> >
>> >Well, pg_upgrade uses a socket created in the current run directory, so
>> >that should be fine.
>> if we're talking JUST for pg_upgrade, that's one thing.  Peaceful
>> co-existence on an ongoing basis
>> is quite another.
>
> It's not an insurmountable problem, though it's a bit painful.  Still,
> both the Debian-based and RedHat-based distributions demonstrate how it
> can be done.

As a user of FreeBSD (as of recently) and Debian (since forever), I
have often thought that I'd really like to see a FreeBSD port of the
excellent postgresql-common package[1], and then
postgresql[major][minor] ports that can be installed in parallel.  The
binaries would have to be installed somewhere like
/usr/local/libexec/postgresql/[major].[minor], the default pg_data
location would have to be somewhere like
/usr/local/pg_data/[major].[minor]/[cluster-name], the config files
somewhere like 
/usr/local/etc/postgresql/[major].[minor]/[cluster-name]/postgresql.conf
(or they could stay inside the pg_data dirs, as long as they can be
found by the postgresql-common cluster management scripts), and the
postgresql-common port would provide wrapper commands like
pg_createcluster, pg_lsclusters, ... and that psql wrapper that takes
an extra optional switch like --cluster 9.5/main.  The client library
on the other hand would not be versioned in that way: there would be
just the latest major version's libpq5[2], and that is what other
things like py-psycopg2 etc would depend on (instead of depending on a
specific client major version like postgresql93-client).

[1] https://packages.debian.org/jessie/postgresql-common
[2] https://packages.debian.org/jessie/libpq5

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Locking issue

2016-07-27 Thread Thomas Munro
On Thu, Jul 28, 2016 at 10:18 AM, Josh Berkus  wrote:
> On 07/26/2016 01:16 AM, David Harrison wrote:
>   where sps.sessionid = ses and
> sps.rankid = rank and
>   ...
> pg_try_advisory_lock(seatid)
>   order by s.row_number, s.seat_number_in_row
>   limit 1
>   for update of sps;

Don't you want pg_try_advisory_xact_lock (note "xact") here?
Otherwise you hold the advisory lock for the rest of the session,
unless you explicitly release it later.

> You appear to be trying to implement your own "SKIP LOCKED" (which you
> should maybe use instead).

+1

One difference between WHERE foo = 42 AND
pg_try_advisory_xact_lock(...) ... FOR UPDATE and WHERE foo = 42 ...
FOR UPDATE SKIP LOCKED is that the order of evaluation of the bits of
the WHERE clause linked by AND is probably undefined, so you could in
theory be acquiring advisory locks corresponding to rows that don't have
foo = 42, depending on how the evaluator finished up processing that.
That could mess things up a bit for concurrent sessions.  SKIP LOCKED
on the other hand operates after any WHERE clause filtering.

> I'm not sure this works as-is; SELECT FOR UPDATE with LIMIT/ORDER is
> always fairly tricky, and tends to block the whole set, not just the
> LIMITed row.

There are complications with ORDER BY ... FOR UPDATE, because row locking happens later
than ordering and causes you to see newer values that still match the
WHERE clause.  It gives you an isolation level more similar to READ
COMMITTED in non-snapshot based databases, except that the plan
implementing the ORDER BY didn't get the memo, and you'd have to fix
that with an outer query that sorts.  But I don't see why it would
block the whole set: LockRows sits on top and only gets its hands on
tuples emitted by nodes below it, so if there is a LIMIT then how
could it lock anything outside the limited set of rows that are
returned?
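
To make that concrete, a SKIP LOCKED version of the query (needs 9.5 or
later) would look roughly like this -- the table names behind the sps
and s aliases are my guesses, so adjust to the real schema:

  SELECT sps.seatid
    FROM session_prices sps
    JOIN seats s ON s.seatid = sps.seatid
   WHERE sps.sessionid = $1
     AND sps.rankid = $2
   ORDER BY s.row_number, s.seat_number_in_row
   LIMIT 1
     FOR UPDATE OF sps SKIP LOCKED;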

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Transaction serialization

2016-06-23 Thread Thomas Munro
On Fri, Jun 24, 2016 at 4:13 AM, Dusan Milanov  wrote:
> Hi,
>
> A question about transactions: does postgres report serialization errors
> before a transaction is committed? Obviously, it does on commit, but how
> about previous statements? Can there be an ERRCODE_T_R_SERIALIZATION_FAILURE
> error as a response to anything else but the commit?

Yes.  See src/test/isolation/expected/project-manager.out, which shows
how to reach this error:

postgres=# UPDATE person SET is_project_manager = false WHERE person_id = 1;
ERROR:  could not serialize access due to read/write dependencies
among transactions
DETAIL:  Reason code: Canceled on identification as a pivot, during write.
HINT:  The transaction might succeed if retried.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Streaming replication - slave server

2016-05-02 Thread Thomas Munro
On Tue, May 3, 2016 at 8:23 AM, Tony Nelson  wrote:
> I have a nicely working 3 server, 1 master, 2 slave setup.  All servers are 
> running on Ubuntu 12.04.  I was considering building a new slave server on 
> 16.04.
>
> The master is currently running 9.1.13, the slave I'm going to replace is 
> running 9.1.20.
>
> Does the new slave have to be running 9.1?  Or can use the latest, which 
> appears to be 9.5 on 16.04?

For binary streaming replication, the major version has to match.  But
you aren't constrained to using the major version that ships with
Ubuntu: you could also use the community packages from here:

https://wiki.postgresql.org/wiki/Apt

The packages work the same way and as far as I know the same people
are involved in maintaining them.

BTW, the latest 9.1.x is actually 9.1.21, and that major version
reaches EOL pretty soon:
http://www.postgresql.org/support/versioning/

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] How to manually force a transaction wraparound

2016-04-29 Thread Thomas Munro
On Sat, Apr 30, 2016 at 10:48 AM, Eric Ridge  wrote:
> I want to force my database to wraparound, just to see what happens.  How
> can I do this without consuming a few billion transactions?

Take a look at the script repro-bogus-subtrans-error-wraparound.sh
from this email:

http://www.postgresql.org/message-id/CAEepm=3z0eolpo5wtuwsem38kbq+gjp8xxiuljkuqpm-sw7...@mail.gmail.com

That used pg_resetxlog -x $XID $PGDATA, but needed several
stop/pg_resetxlog/start hops to get all the way around the xid clock.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Proper relational database?

2016-04-24 Thread Thomas Munro
On Sun, Apr 24, 2016 at 2:56 PM,   wrote:
>> One of the people involved in that was Hugh Darwen, who is one of the authors
>> of The Third Manifesto, which is an attempt to define what a properly
>> relational language and system should look like. So you could say the
>> experience of ISBL vs SQL has been folded into that effort.
>
> See http://www.thethirdmanifesto.com/.

So what incremental improvements could we steal from "properly
relational" query languages?

Here's one I've thought about, trivial as it may be.  I noticed that
Tutorial D (and apparently Andl too) includes a [NOT] MATCHING
operator (alternatively spelled SEMIJOIN and SEMIMINUS) corresponding
to the ⋉ (semi-join) and ▷ (anti-join) operators from relational
algebra.  In SQL you write [NOT] EXISTS or [NOT] IN in the WHERE
clause, rather than something explicit in a join clause,
though experienced users often talk explicitly about semi- and
anti-joins, both because of the theory and because the terms show up
in query plans.

A recent blog post[1] argues that SQL should have a SQL92-style JOIN
syntax for this and points at a couple of products that have added
one[2][3].  I guess it might be hard to convince the Postgres
community to add support for a non-standard syntax that doesn't give
you anything you can't already do, but as an idea I find it
interesting and it seems to be in the spirit of the part of the Third
Manifesto that says: "support[ing] the usual operators of the
relational algebra [..].  All such operators shall be expressible
without excessive circumlocution."

For example, say we want all students who have one or more exam today:

  SELECT s.student_id, s.name
    FROM student s
   WHERE EXISTS (SELECT 1
                   FROM exam e
                  WHERE e.student_id = s.student_id
                    AND e.exam_date = CURRENT_DATE)

I don't know Tutorial D, but I think it might express that with something like:

  ( student MATCHING exam
      WHERE exam_date = CURRENT_DATE )
  { student_id, name }

With 10 minutes of prototype hacking I convinced Postgres to accept
SEMI and ANTI like this:

  SELECT s.student_id, s.name
    FROM student s SEMI JOIN exam e USING (student_id)
   WHERE e.exam_date = CURRENT_DATE

I guess a real version should accept (or require?) LEFT or RIGHT
before SEMI/ANTI.  When using this hypothetical syntax I think you
should be allowed to refer to e.exam_date in the WHERE clause but not
in the SELECT list (Impala apparently does allow you to see data from
exam, and returns values from an arbitrary matching row, but that
doesn't seem right to me).  But the EXISTS syntax is correspondingly
strange in that it requires you to provide a SELECT list which is
entirely discarded, so people often write "*" or "1" (erm, OK, I guess
you can use an empty select list in recent Postgres).
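
For completeness, the ▷ (anti-join) counterpart -- students with no
exam today -- is spelled with NOT EXISTS in standard SQL:

  SELECT s.student_id, s.name
    FROM student s
   WHERE NOT EXISTS (SELECT 1
                       FROM exam e
                      WHERE e.student_id = s.student_id
                        AND e.exam_date = CURRENT_DATE)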

[1] 
https://blog.jooq.org/2015/10/13/semi-join-and-anti-join-should-have-its-own-syntax-in-sql/
[2] 
http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_joins.html
[3] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Proper relational database?

2016-04-22 Thread Thomas Munro
On Sat, Apr 23, 2016 at 2:04 PM,   wrote:
>> Why is starting at a low level important?  A database is truly relational to
>> the extent that it implements the relational model. If you don't want the
>> database to allow tables without keys, or to allow null values, don't let
>> people create them. If the underlying machinery allows them, that seems like
>> a mere performance issue; worrying about that from the outset seems like a
>> perfect example of premature optimization. If PostgreSQL's performance is
>> acceptable now, why wouldn't it be acceptable with a different interface
>> language?
>
> Agreed.
>
>> There are other aspects of what would make a truly relational database, of
>> course. Codd's 0th rule, for instance, that the "system must be able to
>> manage data bases entirely through its relational capabilities" to me says
>> that there should be no data definition language, except as syntactic sugar
>> for relational operations. So you'd create users (thousands in one command,
>> if you liked) by adding tuples to a base relation.
>
> Yes, maybe, but that makes it not part of the language. You can't apply rule 
> 0 to things like creating a type or operator in a language.
>
>> But which things are important? I think a good many of the things one might
>> would be lower-hanging fruit than that. Just having a clean query language
>> would alleviate a lot of (my) discomfort.
>
> Andl is that.
>
>> > I don't know if Postgres exposes the lower-level stuff to plugins or
>> > not — it would be nice if this could be an alternative query language
>> > for Postgres itself,
>>
>> Well, the parser doesn't, but as best I can tell it's also somewhat loosely
>> coupled from the system. It doesn't do table access, for instance.  It builds
>> and returns a parse tree.  There's no reason you couldn't parse a different
>> language and return a tree of the same type.  Or you could just translate
>> your input language into SQL, and pass it along to the existing parser.
>
> I looked into that, and it's too hard as a starting place. There is too much 
> of the tree and the subsequent query planning that is hooked into specific 
> features of SQL. Instead, Andl generates a small subset of SQL. Once each 
> query has been generated and parsed, the prepared statements can be cached 
> and you get most of the benefits.
>
>> > but the assumptions about the two worlds (SQL vs a properly relational
>> > store) are probably too different.
>
>> Are there relational algebra expressions, or other operations necessary to a
>> truly relational database, which cannot be translated into SQL?  I'm not
>> aware that there are, but I'd be interested to hear of it.  If there were,
>> there's a good chance you wouldn't be able to translate them into the parse
>> tree, either.
>
> Absolutely not. SQL is a (nearly) full implementation of the relational 
> algebra, plus other non-relational stuff.  The only thing it really can't 
> handle is a table with no columns! (I have to fake that)

FWIW standard SQL may not allow it but Postgres does, and it's even
possible to exclude duplicates by using an expression that references
the whole row.

postgres=# select;
┌──┐
├──┤
└──┘
(1 row)
postgres=# create table dum ();
CREATE TABLE
postgres=# select * from dum;
┌──┐
├──┤
└──┘
(0 rows)
postgres=# create unique index dum_unique on dum((dum));
CREATE INDEX
postgres=# insert into dum select;
INSERT 0 1
postgres=# select * from dum;
┌──┐
├──┤
└──┘
(1 row)
postgres=# insert into dum select;
ERROR:  duplicate key value violates unique constraint "dum_unique"
DETAIL:  Key ((dum.*))=(()) already exists.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Multixacts wraparound monitoring

2016-03-31 Thread Thomas Munro
On Fri, Apr 1, 2016 at 4:31 AM, Pavlov, Vladimir
 wrote:
> Do I understand correctly that the number of members cannot be more than 2^32
> (it also uses a 32-bit counter)?

Correct.

> I had 69640 files in main/pg_multixact/members/, 69640*32*2045 = 4557241600 
> members, this is normal?

Where did you get 2045 from?  I thought it was like this:

number of members = number of member segment files * 1636 * 32
number of multixacts = number of offsets segment files * 2048 * 32

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] pg_dump crashing

2016-03-20 Thread Thomas Munro
On Mon, Mar 21, 2016 at 4:18 AM, Matthias Schmitt
 wrote:
> Hello,
>
> sorry for the late response.
>
>> On 15 Mar 2016, at 18:59, rob stone  wrote:
>>
>> I'm running Debian 4.3.5-1 (2016-02-06) x86_64 GNU/Linux.
>>
>> Backups are done via a shell script using double hyphen syntax, as in
>> e.g.:-
>>
>> pg_dump --port=5433 --dbname=mydatabase_1_0_0 --username=mmppostgres
>>  --file=/my_backup_path/mydatabase_1_0_0.dump
>>
>> We do it twice. First with --data_only and the second time with
>>  --format=custom
>>
>> Hasn't failed yet. Don't know if this helps at all but worth a try.
>
> Thank you for your answer. But sorry, no, this does not change anything. Same 
> server crash when executed in a cron job. It runs perfectly when executed 
> manually from the shell.

Is this related?

http://www.postgresql.org/message-id/cak7teys9-o4bterbs3xuk2bffnnd55u2sm9j5r2fi7v6bhj...@mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] How to Qualifying or quantify risk of loss in asynchronous replication

2016-03-16 Thread Thomas Munro
On Wed, Mar 16, 2016 at 9:59 PM, otheus uibk  wrote:
>> In asynchronous replication,
>> the primary writes to the WAL and flushes the disk.  Then, for any
>> standbys that happen to be connected, a WAL sender process trundles
>> along behind feeding new WAL down the socket as soon as it can, but
>> it can be running arbitrarily far behind or not running at all (the
>> network could be down or saturated, the standby could be temporarily
>> down or up but not reading the stream fast enough, etc etc).
>
>
>
> This is the *process* I want more detail about. The question is the same as
> above:
>> (is it true that) PG async guarantees that the WAL
>> is *sent* to the receivers, but not that they are received, before the
>> client receives acknowledgement?

The primary writes WAL to disk, and then wakes up walsender processes,
and they read the WAL from disk (presumably straight out of the OS
page cache) in the background and send it down the network some time
later.  Async replication doesn't guarantee anything about the WAL
being sent.

Look for WalSndWakeupRequest() in xlog.c, which expands to a call to
WalSndWakeup in walsender.c which sets latches (= a mechanism for
waking processes) on all walsenders, and see the WaitLatchOrSocket
calls in walsender.c which wait for that to happen.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] How to Qualifying or quantify risk of loss in asynchronous replication

2016-03-15 Thread Thomas Munro
;).
3.  Your primary computer is destroyed by a meteor, and its WAL sender
hadn't yet got around to sending that transaction to the standby.
4.  You recover using the standby.
5.  The transaction has been forgotten ("I'm sorry Dave, I'm afraid we
have no record of booking JUPITER123, and the rocket is full.  The
next rocket leaves in 7 years, would you like to book a seat?").

If you enable synchronous replication, and you are careful to recover
in step 4 using the correct standby, then you can't lose a transaction
that you reported to external systems *after* (because) COMMIT
returned.  If your primary is destroyed after you executed COMMIT, but
*before* it returned, it is possible that the current synchronous
standby's WAL contains the transaction or doesn't contain the
transaction, but not for you to have taken any external action based
on the commit having returned, because it didn't.  (If your primary
crashes and restarts before COMMIT returns, and it had got as far as
flushing locally but not yet heard from the standby, then things may
be slightly more complicated).

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] synch streaming replication question

2016-02-22 Thread Thomas Munro
On Tue, Feb 23, 2016 at 3:09 PM, Michael Paquier
 wrote:
> On Tue, Feb 23, 2016 at 6:43 AM, John Wiencek  wrote:
>
>> I have a three node cluster using streaming replication configured as
>> follows:
>> One synch node and one asynch node.
>> If my synch node is down the master node is hung until that node is brought
>> back on line.
>>
>> Two questions:
>> 1.  Is this the expected action on my master?  It makes sense if it is since
>> the master is waiting for an ack from the synch node.
>
> Yes, when synchronous_commit = on, the default value that you should
> have, the master will wait for confirmation from the standby that the WAL
> record for the commit has been flushed;
> http://www.postgresql.org/docs/devel/static/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-MASTER
>
>> 2. Is there anyway to configure my cluster so I have a synch node but not
>> have my master node “HANG” if the synch node is down?
>
> You can do that at transaction level for example by disabling
> synchronous_commit.

Or you could configure both of your standbys as synchronous standbys.
Only one of them will actually be a synchronous standby at a time, and
the other one will take over that role if the first one is down, so
your system won't hang but you'll still have the sync standby
guarantee.
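
A sketch of that configuration, assuming your standbys report
application_name values standby1 and standby2 (placeholders):

  ALTER SYSTEM SET synchronous_standby_names = 'standby1, standby2';
  SELECT pg_reload_conf();
  -- the first listed standby that is connected and streaming is the
  -- synchronous one; the other shows up as 'potential':
  SELECT application_name, sync_state FROM pg_stat_replication;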

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] pg_multixact issues

2016-02-10 Thread Thomas Munro
On Thu, Feb 11, 2016 at 11:05 AM, Alvaro Herrera
 wrote:
> Thomas Munro wrote:
>
>> 4.  You could look into whether all those multixacts with many member
>> are really expected.  (Large numbers of concurrent FK checks or
>> explicit share locks on the same rows perhaps?  A good proportion of
>> this happened on one day this week I see.)
>
> I think maybe this patch
> https://www.postgresql.org/message-id/20150620043650.GX133018%40postgresql.org
> should help with this.  I expect to come back to this and get it pushed
> to 9.3 and 9.4 sometime ...

Ah, so the OP might get some improvement today by moving to 9.5, which
has the optimization already.

BTW in my message above I said vacuum_multixact_freeze_table_age when
I meant to say autovacuum_multixact_freeze_max_age (and its default is
400 million).

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] pg_multixact issues

2016-02-10 Thread Thomas Munro
Feb  8 15:32 001C
> -rw---   1 postgres dba   262144 Feb  8 17:01 001D
> -rw---   1 postgres dba   262144 Feb  8 19:19 001E
> -rw---   1 postgres dba   262144 Feb  8 22:11 001F
> -rw---   1 postgres dba   262144 Feb  9 01:44 0020
> -rw---   1 postgres dba   262144 Feb  9 05:57 0021
> -rw---   1 postgres dba   262144 Feb  9 10:45 0022
> -rw---   1 postgres dba98304 Feb 10 13:35 0023
>
> the members directory has 15723 files:
> ls -l|wc -l
>15723

Maybe you didn't vacuum freeze your template0 database and it's
probably holding back the oldest multixact ID -- take a look at the
datminmxid column in pg_database.
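
For example (this only reads the catalog, so it's safe to run):

  SELECT datname, datminmxid, datallowconn FROM pg_database;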

You seem to be generating fairly large multixacts, and they use disk
space until the cluster-wide oldest multixact ID is advanced by vacuum
freeze.  You can estimate the average number of members per multixact
like this (assuming default page size):

number of members = number of member segment files * 1636 * 32
number of multixacts = number of offsets segment files * 2048 * 32

You have about 2.3 million multixacts, and about 823 million members,
so your multixacts have an average of around 358 members each.

There are two ways for autovacuum to be automatically triggered to
clean up old multixact state:

1.  When you have more than vacuum_multixact_freeze_table_age (default
150 million) multixacts, it will perform a wraparound vacuum to
advance the cluster's oldest multixact ID.  The oldest is 1 for you,
meaning this has probably never happened, and you're only up to around
2.3 million so it's not going to happen for a long time with the
default setting.

2.  When it thinks that the member address space is in danger of
wrapping, which kicks in before the above for clusters with large
average multixact size.  This happens after
MULTIXACT_MEMBER_SAFE_THRESHOLD is exceeded, which is when you get to
~2.2 billion members, and you're not quite half way there yet.

I'm not sure what the best course of action is for you, but here are
some thoughts:

1.  You could manually freeze all databases with vacuumdb --freeze
--all, but this requires first setting datallowconn to true for your
template0 database (see the sketch after this list), because otherwise
it would not be vacuum-frozen by any command you can issue manually,
and would therefore prevent the oldest multixact ID from advancing.

2.  You could change the vacuum_multixact_freeze_table_age to a value
that will cause it to do vacuum freezes more often.  Assuming your
workload continues with the same average multixact size, you could
work backwards from the numbers above: since 2.3 million multixacts
occupy 4GB, if you set vacuum_multixact_freeze_table_age = 50 you
should expect a size under 1GB.  But note that the freeze operations
can be IO intensive and take a long time depending on your database
size so you don't want them too often.

3.  You could do nothing and wait for autovacuum to detect that you
are using more than half the member address space and trigger a
freeze, which will happen some time after you have around 41k member
segments (occupying around 10GB of disk since they hold 256kb each).

4.  You could look into whether all those multixacts with many members
are really expected.  (Large numbers of concurrent FK checks or
explicit share locks on the same rows perhaps?  A good proportion of
this happened on one day this week I see.)
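
For option 1, the dance would look roughly like this (the vacuumdb call
is a shell command, shown here as a comment; remember to flip
datallowconn back afterwards):

  UPDATE pg_database SET datallowconn = true WHERE datname = 'template0';
  -- then, from the shell:  vacuumdb --freeze --all
  UPDATE pg_database SET datallowconn = false WHERE datname = 'template0';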

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] BSD initdb without ICU support and switch later

2016-01-29 Thread Thomas Munro
On Sat, Jan 30, 2016 at 4:39 AM, Achilleas Mantzios
 wrote:
> On 29/01/2016 16:46, Maeldron T. wrote:
>>
>>  Hello,
>>
>> the ICU patch isn’t ready for PostgreSQL on FreeBSD.
>
>
> https://people.freebsd.org/~girgen/postgresql-icu/readme.html

If the goal is to get Unicode collation support, note also that
FreeBSD 11 (due some time this year) supports that in libc.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Synchronous replication

2016-01-13 Thread Thomas Munro
On Thu, Jan 14, 2016 at 9:34 AM, Joshua D. Drake  wrote:
> On 01/13/2016 12:28 PM, jwienc...@comcast.net wrote:
>>
>> Is it possible to set up synchronous replication to more that one node
>> in a cluster?  Or, am I limited to one synchronous node and one
>> asynchronous node?
>
>
> Yes you can have N number of synchronous slaves. However, be careful. It can
> be a huge performance hit.

Note that only one of the listed standbys is a synchronous standby at
any given time though.  That is, when you commit, the primary server
will wait just for that one server to report that it has fsync'ed the
WAL.  (There is a patch being developed to change that so that you
might be able to wait for more than one in a future release).

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] planner does not detect same-as-default collation.

2016-01-04 Thread Thomas Munro
On Tue, Jan 5, 2016 at 6:39 PM, Jeff Janes  wrote:
> I don't know if this is a bug, or a surprising feature, or just a
> misunderstanding.
>
> If I specify an explicit collation which happens to be the same as the
> default collation, it is not recognized as being the same as the
> default.
>
> select version();
> version
> ---
>  PostgreSQL 9.4.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC)
> 4.4.7 20120313 (Red Hat 4.4.7-16), 64-bit
>
> (But also in at least 9.5rc1 and 9.6dev)
>
> create database test2 encoding='utf8' lc_collate='en_US.utf8'
> template=template0;
> \c test2
> create table foo as select generate_series::text as bar from
> generate_series(1,10);
> create index on foo (bar );
> cluster foo USING foo_bar_idx ;
> set enable_sort TO off;
> explain select * from foo order by bar;
>
> 
>
> explain select * from foo order by bar collate "en_US.utf8";
>
> 
>
> Am I the only one surprised by this?

It works if you say: order by bar collate "default".  "default" and
"en_US.utf8" are different collations with different OIDs even if
'en_US.utf8' is the collation name listed in pg_database.datcollate.
I suppose if you knew the OID of that collation, you could teach
indxpath.c and (and I don't know what other planner machinery) to
consider that OID to be equivalent to DEFAULT_COLLATION_OID when
comparing them to consider an index path.
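
For illustration, with the example table above (untested sketch):

-- should be able to use foo_bar_idx:
EXPLAIN SELECT * FROM foo ORDER BY bar COLLATE "default";
-- treated as a different collation, so the index path isn't considered:
EXPLAIN SELECT * FROM foo ORDER BY bar COLLATE "en_US.utf8";

-- the two collations really are separate catalog entries:
SELECT oid, collname FROM pg_collation
WHERE collname IN ('default', 'en_US.utf8');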

There was another email somewhere talking about constraint exclusion's
treatment of explicit and default collations.  Maybe predtest.c could
use the same type of logic.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Replication with 9.4

2015-12-01 Thread Thomas Munro
On Tue, Oct 6, 2015 at 12:27 PM, Thomas Munro  wrote:

> On Sun, Oct 4, 2015 at 11:47 PM, Michael Paquier
>  wrote:
> > (Seems like you forgot to push the Reply-all button)
> >
> > On Sun, Oct 4, 2015 at 7:01 PM, Madovsky wrote:
> >> On 10/3/2015 3:30 PM, Michael Paquier wrote:
> >>>  and no reason is given to justify *why* this would be needed in your
> case
> >> reason for a choice can be often an issue for other :D
> >>
> >> I thought that postgresql 9.4  user could change on the fly with
> >> synchronous_commit from local to on for ex
> >> which hotstandby would become in sync and which in async to avoid a big
> >> latency in case of let's say 100 hot standby.
> >> it was an idea, a concept to let the master write and update the nodes,
> like
> >> a queen bee ;)
> >> but I'm afraid it's not possible, so maybe future version of pg will do
> it,
> >> for now  read from the master is my only solution.
> >
> > Well, Thomas Munro (adding him in CC) has sent for integration with
> > 9.6 a patch that would cover your need, by adding to
> > synchronous_commit a mode called 'apply', in which case a master would
> > wait for the transaction to be applied on standby before committing
> > locally:
> >
> http://www.postgresql.org/message-id/CAEepm=1fqkivl4v-otphwsgw4af9hcogimrcw-ybtjipx9g...@mail.gmail.com
> > Perhaps you could help with the review of the patch, this has stalled
> > a bit lately.
>
> That patch (or something more sophisticated along those lines) is a
> small piece of a bigger puzzle, though it might be enough if you only
> have one standby, are prepared to block until manual intervention if
> that standby fails, and don't mind potentially lumpy apply
> performance.  See also the work being done to separate wal writing
> from wal applying for smoother performance[1], and handle multiple
> synchronous standbys[2].  But there is another piece of the puzzle
> IMHO: how to know reliably that the standby that you are talking to
> guarantees causal consistency, while also allowing standbys to
> fail/drop out gracefully, and I'm currently working on an idea for
> that.
>

FYI I posted the resulting proposal and patch over on the -hackers list.
Feedback, ideas, flames welcome as always.

http://www.postgresql.org/message-id/flat/CAEepm=0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk=z...@mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [GENERAL] Issues with german locale on CentOS 5,6,7

2015-11-12 Thread Thomas Munro
On Thu, Oct 8, 2015 at 12:49 PM, Thomas Munro  wrote:

> On Thu, Oct 8, 2015 at 5:52 AM, Peter Geoghegan
>  wrote:
> > On Wed, Oct 7, 2015 at 6:25 AM, Tom Lane  wrote:
> >> The only real way out of such a situation is to REINDEX affected
> indexes.
> >> Refusing to start the server not only doesn't contribute to a solution,
> >> but makes it impossible to fix manually.
> >
> > I agree that that would be almost as bad as carrying on, because there
> > is no reason to think that the locale thing can easily be rolled back.
> > That was my point, in fact.
>
> I have contemplated a maintenance script that would track either the
> md5 checksums of the /usr/lib/locale/*/LC_COLLATE files or the version
> of installed locale packages and automatically reindex things when
> they change (I guess after restarting the cluster to clear any glibc
> caches that might be lurking in long running backends).  Or at least
> tell me that's needed.  Obviously completely OS-specific...
>

I got around to trying this on a recent rainy day:

https://github.com/macdice/check_pg_collations/blob/master/check_pg_collations

It needs Python and psycopg2.  It notices when LC_COLLATE files' checksums
change, and generates REINDEX commands on stdout.  It seems to work OK on
Debian, but I haven't tested much.  Pull requests welcome :-)  But I guess
it would be much cooler as a pure plpgsql extension.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [GENERAL] A question about PL/pgSQL DECLAREd variable behavior

2015-10-22 Thread Thomas Munro
On Fri, Oct 23, 2015 at 12:33 PM, Dane Foster  wrote:
> On Thu, Oct 22, 2015 at 6:54 PM, Thomas Munro
>  wrote:
>>
>> On Fri, Oct 23, 2015 at 10:27 AM, Dane Foster  wrote:
>> > On Thu, Oct 22, 2015 at 2:00 PM, Jim Nasby 
>> > wrote:
>> >> On 10/21/15 9:32 PM, Dane Foster wrote:
>> >>>
>> >>> "If STRICT is not specified in the INTO clause, then target will
>> >>> be
>> >>> set to the first row returned by the query, or to nulls if the
>> >>> query
>> >>> returned no rows."
>> >>>
>> >>> Foot removed from mouth.
>> >>
>> >> Note however that there's some unexpected things when checking whether
>> >> a
>> >> record variable IS (NOT) NULL. It's not as simple as 'has the variable
>> >> been
>> >> set or not'.
>> >
>> > Please elaborate. I'm entirely new to PL/pgSQL so the more details you
>> > can
>> > provide the better.
>> > Thanks,
>>
>> The surprising thing here, required by the standard, is that this
>> expression is true:
>>
>>   ROW(NULL, NULL) IS NULL
>>
>> So "r IS NULL" is not a totally reliable way to check if your row
>> variable was set or not by the SELECT INTO, if there is any chance
>> that r is a record full of NULL.  "r IS NOT DISTINCT FROM NULL" would
>> work though, because it's only IS [NOT] NULL that has that strange
>> special case.  Other constructs that have special behaviour for NULL
>> don't consider a composite type composed of NULLs to be NULL.  For
>> example IS DISTINCT FROM, COALESCE, COUNT, STRICT functions.
>
> Someone should include your explanation in the [fine] manual.

The quirky standard behaviour of IS [NOT] NULL with rows is described
in a 'Note' section here:

http://www.postgresql.org/docs/9.4/static/functions-comparison.html

But I do think we should consider pointing out explicitly that "IS
NULL" doesn't mean the same thing as, erm, "is null" where it appears
throughout the documentation, and I proposed a minor tweak:

http://www.postgresql.org/message-id/CAEepm=1wW4MGBS6Hwteu6B-OMZiX6_FM=wfyn7otehycfkg...@mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] A question about PL/pgSQL DECLAREd variable behavior

2015-10-22 Thread Thomas Munro
On Fri, Oct 23, 2015 at 10:27 AM, Dane Foster  wrote:
> On Thu, Oct 22, 2015 at 2:00 PM, Jim Nasby  wrote:
>> On 10/21/15 9:32 PM, Dane Foster wrote:
>>>
>>> "If STRICT is not specified in the INTO clause, then target will be
>>> set to the first row returned by the query, or to nulls if the query
>>> returned no rows."
>>>
>>> Foot removed from mouth.
>>
>> Note however that there's some unexpected things when checking whether a
>> record variable IS (NOT) NULL. It's not as simple as 'has the variable been
>> set or not'.
>
> Please elaborate. I'm entirely new to PL/pgSQL so the more details you can
> provide the better.
> Thanks,

The surprising thing here, required by the standard, is that this
expression is true:

  ROW(NULL, NULL) IS NULL

So "r IS NULL" is not a totally reliable way to check if your row
variable was set or not by the SELECT INTO, if there is any chance
that r is a record full of NULL.  "r IS NOT DISTINCT FROM NULL" would
work though, because it's only IS [NOT] NULL that has that strange
special case.  Other constructs that have special behaviour for NULL
don't consider a composite type composed of NULLs to be NULL.  For
example IS DISTINCT FROM, COALESCE, COUNT, STRICT functions.
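
A minimal illustration of that special case in plain SQL:

-- a row whose fields are all NULL is reported as "IS NULL" ...
SELECT ROW(NULL, NULL) IS NULL;       -- true
-- ... but it is not "IS NOT NULL" either, because that form requires
-- every field to be non-null:
SELECT ROW(NULL, NULL) IS NOT NULL;   -- false
-- a partially-null row satisfies neither test:
SELECT ROW(1, NULL) IS NULL, ROW(1, NULL) IS NOT NULL;   -- false, false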

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] A question about PL/pgSQL DECLAREd variable behavior

2015-10-21 Thread Thomas Munro
On Thu, Oct 22, 2015 at 2:48 PM, Dane Foster  wrote:
> I wrote the following simple function to try to learn what happens to a
> DECLAREd variable whose assignment comes from an INTO statement where the
> query being executed does not return a result.
>
> CREATE OR REPLACE FUNCTION _test() RETURNS BOOLEAN AS $$
> DECLARE r RECORD;
> BEGIN SELECT 1 AS one INTO r WHERE false; RETURN r IS NULL; END;
> $$ LANGUAGE plpgsql;
>
> The function returns true. Given that I can't find any explicit reference in
> the documentation about the behavior I've just described is it safe to
> assume that the current behavior is the expected behavior but it's just not
> documented?

http://www.postgresql.org/docs/9.4/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW

"If STRICT is not specified in the INTO clause, then target will be
set to the first row returned by the query, or to nulls if the query
returned no rows."
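
For comparison, a sketch of the STRICT variant (function name made up),
which raises an error instead of quietly assigning nulls:

CREATE OR REPLACE FUNCTION _test_strict() RETURNS BOOLEAN AS $$
DECLARE r RECORD;
BEGIN
  -- with STRICT, zero rows raises "query returned no rows" (NO_DATA_FOUND)
  SELECT 1 AS one INTO STRICT r WHERE false;
  RETURN r IS NULL;
END;
$$ LANGUAGE plpgsql;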

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Issues with german locale on CentOS 5,6,7

2015-10-07 Thread Thomas Munro
On Thu, Oct 8, 2015 at 1:16 PM, Peter Geoghegan
 wrote:
> On Wed, Oct 7, 2015 at 4:49 PM, Thomas Munro
>  wrote:
>>> I agree that that would be almost as bad as carrying on, because there
>>> is no reason to think that the locale thing can easily be rolled back.
>>> That was my point, in fact.
>>
>> I have contemplated a maintenance script that would track either the
>> md5 checksums of the /usr/lib/locale/*/LC_COLLATE files or the version
>> of installed locale packages and automatically reindex things when
>> they change (I guess after restarting the cluster to clear any glibc
>> caches that might be lurking in long running backends).  Or at least
>> tell me that's needed.  Obviously completely OS-specific...
>
> I think we should bite the bullet and adopt ICU, without abandoning
> support for OS locales for users that really need it (certainly, many
> will need it initially when using pg_upgrade to get on to the first
> version that happens to have ICU support). I don't like suggesting a
> solution that I myself am unlikely to find the time to work on, but in
> the long run that's the only sensible approach IMV.

How would you handle changes in ICU's collation definitions?

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Issues with german locale on CentOS 5,6,7

2015-10-07 Thread Thomas Munro
On Thu, Oct 8, 2015 at 5:52 AM, Peter Geoghegan
 wrote:
> On Wed, Oct 7, 2015 at 6:25 AM, Tom Lane  wrote:
>> The only real way out of such a situation is to REINDEX affected indexes.
>> Refusing to start the server not only doesn't contribute to a solution,
>> but makes it impossible to fix manually.
>
> I agree that that would be almost as bad as carrying on, because there
> is no reason to think that the locale thing can easily be rolled back.
> That was my point, in fact.

I have contemplated a maintenance script that would track either the
md5 checksums of the /usr/lib/locale/*/LC_COLLATE files or the version
of installed locale packages and automatically reindex things when
they change (I guess after restarting the cluster to clear any glibc
caches that might be lurking in long running backends).  Or at least
tell me that's needed.  Obviously completely OS-specific...

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Replication with 9.4

2015-10-05 Thread Thomas Munro
On Sun, Oct 4, 2015 at 11:47 PM, Michael Paquier
 wrote:
> (Seems like you forgot to push the Reply-all button)
>
> On Sun, Oct 4, 2015 at 7:01 PM, Madovsky wrote:
>> On 10/3/2015 3:30 PM, Michael Paquier wrote:
>>>  and no reason is given to justify *why* this would be needed in your case
>> reason for a choice can be often an issue for other :D
>>
>> I thought that postgresql 9.4  user could change on the fly with
>> synchronous_commit from local to on for ex
>> which hotstandby would become in sync and which in async to avoid a big
>> latency in case of let's say 100 hot standby.
>> it was an idea, a concept to let the master write and update the nodes, like
>> a queen bee ;)
>> but I'm afraid it's not possible, so maybe future version of pg will do it,
>> for now  read from the master is my only solution.
>
> Well, Thomas Munro (adding him in CC) has sent for integration with
> 9.6 a patch that would cover your need, by adding to
> synchronous_commit a mode called 'apply', in which case a master would
> wait for the transaction to be applied on standby before committing
> locally:
> http://www.postgresql.org/message-id/CAEepm=1fqkivl4v-otphwsgw4af9hcogimrcw-ybtjipx9g...@mail.gmail.com
> Perhaps you could help with the review of the patch, this has stalled
> a bit lately.

That patch (or something more sophisticated along those lines) is a
small piece of a bigger puzzle, though it might be enough if you only
have one standby, are prepared to block until manual intervention if
that standby fails, and don't mind potentially lumpy apply
performance.  See also the work being done to separate wal writing
from wal applying for smoother performance[1], and handle multiple
synchronous standbys[2].  But there is another piece of the puzzle
IMHO: how to know reliably that the standby that you are talking to
guarantees causal consistency, while also allowing standbys to
fail/drop out gracefully, and I'm currently working on an idea for
that.

Of course you can make your own causal consistency today if you are
prepared to have your clients explicitly wait for WAL to be applied.
You can call pg_current_xlog_location() on the master after
committing, and then wait until pg_last_xlog_replay_location() reports
that that LSN has been applied on any standby you talk to at the start
of any transaction that wants causal reads.  You could wrap the
waiting up in a user defined function
wait_for_xlog_replay_location(lsn, timeout) which could do a naive
poll/sleep loop (or do something more efficient with latches in core
code).  For example, imagine a client that inserts some new accounts
and then causes a large number of workers to regenerate some reports
that must include the new accounts against a pool of standbys.  It
just needs to give them the LSN they should wait for first.  And if
you don't want to pass LSNs around but don't mind introducing some
extra conservative lag, those workers could call
pg_current_xlog_location() on the master themselves to get some
arbitrary recent LSN and then wait for that to be applied before they
start their work on the standbys.
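
A minimal sketch of such a helper, assuming the 9.4 function names and
pg_lsn type, and using the naive poll/sleep approach (a real version
would want latches and better error handling):

CREATE OR REPLACE FUNCTION wait_for_xlog_replay_location(
    target_lsn pg_lsn,
    timeout interval)
RETURNS boolean AS $$
DECLARE
    deadline timestamptz := clock_timestamp() + timeout;
BEGIN
    LOOP
        -- has this standby replayed up to the target LSN yet?
        IF pg_last_xlog_replay_location() >= target_lsn THEN
            RETURN true;
        END IF;
        IF clock_timestamp() >= deadline THEN
            RETURN false;   -- gave up waiting
        END IF;
        PERFORM pg_sleep(0.01);   -- naive polling
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- on the master, after committing:
--   SELECT pg_current_xlog_location();
-- on a standby, before reading (the LSN here is made up):
--   SELECT wait_for_xlog_replay_location('0/3000060', interval '5 seconds');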

The explicit wait-for-LSN approach pushes the waiting over to readers
who want causal reads, instead of writer (the master), which still
might be interesting for some cases even if we do finish up with a
good optional master-waits system.  But a master-waits system will
allow naive clients to see up to date data no matter where they run
their queries (or be kicked off by standbys that can't guarantee that)
without having to think about LSNs and replication machinery, and I
think that would be a very useful feature.

[1] 
http://www.postgresql.org/message-id/flat/CA+U5nMJifauXvVbx=v3UbYbHO3Jw2rdT4haL6CCooEDM5=4...@mail.gmail.com
[2] 
http://www.postgresql.org/message-id/flat/caog9aphycpmtypaawfd3_v7svokbnecfivmrc1axhb40zbs...@mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-16 Thread Thomas Munro
On Wed, Jun 17, 2015 at 6:58 AM, Alvaro Herrera
 wrote:
> Thomas Munro wrote:
>
>> Thanks.   As mentioned elsewhere in the thread, I discovered that the
>> same problem exists for page boundaries, with a different error
>> message.  I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1
>> and master with the same results:
>>
>> FATAL:  could not access status of transaction 2048
>> DETAIL:  Could not read from file "pg_multixact/offsets/" at
>> offset 8192: Undefined error: 0.
>>
>> FATAL:  could not access status of transaction 131072
>> DETAIL:  Could not open file "pg_multixact/offsets/0002": No such file
>> or directory.
>
> So I checked this bug against current master, because it's claimed to be
> closed.  The first script doesn't emit a message at all; the second
> script does emit a message:
>
> LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound
>
> If you start and stop again, there's no more noise in the logs.  That's
> pretty innocuous -- great.

Right, I included a fix for this in
https://commitfest.postgresql.org/5/265/ which handles both
pg_subtrans and pg_multixact, since it was lost in the noise in this
thread...  Hopefully someone can review that.

> But then I modified your script to do two segments instead of one.  Then
> after the second cycle is done, start the server and stop it again.  The
> end result is a bit surprising: you end up with no files in
> pg_multixact/offsets at all!

Ouch.  I see why: latest_page_number gets initialised to a different
value when you restart (computed from oldest multixact ID, whereas
during normal running it remembers the last created page number), so
in this case (next == oldest, next % 2048 == 0), restarting the server
moves latest_page_number forwards by one, so SimpleLruTruncate no
longer bails out with the above error message and it happily deletes
all files.  That is conceptually OK (there are no multixacts, so no
files should be OK), but see below...  Applying the patch linked above
prevents this problem (it always keeps at least one multixact and
therefore at least one page and therefore at least one segment,
because it steps back one multixact to avoid boundary problems when
oldest == next).

As for whether it's actually OK to have no files in
pg_multixact/offsets, it seems that if you restart *twice* after
running checkpoint-segment-boundary.sh, you finish up with earliest =
4294965248 in TruncateMultiXact, because this code assumes that there
was at least one file found and then proceeds to assign (-1 * 2048) to
earliest (which is unsigned).

trunc.earliestExistingPage = -1;
SlruScanDirectory(MultiXactOffsetCtl,
                  SlruScanDirCbFindEarliest, &trunc);
earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
if (earliest < FirstMultiXactId)
    earliest = FirstMultiXactId;

I think this should bail out if earliestExistingPage is still -1 after
the call to SlruScanDirectory.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Thomas Munro
On Fri, Jun 5, 2015 at 1:47 PM, Thomas Munro
 wrote:
> On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro
>  wrote:
>> On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas  wrote:
>>> Here's a new version with some more fixes and improvements:
>>> [...]
>>
>> With this patch, when I run the script
>> "checkpoint-segment-boundary.sh" from
>> http://www.postgresql.org/message-id/CAEepm=1_KbHGbmPVmkUGE5qTP+B4efoCJYS0unGo-Mc5NV=u...@mail.gmail.com
>> I see the following during shutdown checkpoint:
>>
>> LOG:  could not truncate directory "pg_multixact/offsets": apparent 
>> wraparound
>>
>> That message comes from SimpleLruTruncate.
>
> Suggested patch attached.

Is it a problem that we don't drop/forget page buffers from the
members SLRU (unlike SimpleLruTruncate, which is used for the offsets
SLRU)?

I may be missing something but it seems to me that it isn't, because
(1) CheckPointMultiXact is called to flush any dirty pages to disk
before TruncateMultiXact is called and (2) no pages older than the one
holding the oldest offset should be dirtied after CheckPointMultiXact
runs (member space is 'append only', at least until it is recycled),
so any pages in the SLRU whose underlying file has been truncated
should just naturally fall out of the LRU slots.  So they can't create
problems by being written to disk after the unlink.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Thomas Munro
On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro
 wrote:
> On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas  wrote:
>> Here's a new version with some more fixes and improvements:
>>
>> - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset
>> when the oldest offset became known if the now-known value happened to
>> be zero.  Fixed.
>>
>> - SetOffsetVacuumLimit now logs useful information at the DEBUG1
>> level, so that you can see that it's doing what it's supposed to.
>>
>> - TruncateMultiXact now calls DetermineSafeOldestOffset to adjust the
>> offsetStopLimit even if it can't truncate anything.  This seems
>> useless, but it's not, because it may be that the last checkpoint
>> advanced lastCheckpointedOldest from a bogus value (i.e. 1) to a real
>> value, and now we can actually set offsetStopLimit properly.
>>
>> - TruncateMultiXact no longer calls find_multixact_start when there
>> are no remaining multixacts.  This is actually a completely separate
>> bug that goes all the way back to 9.3.0 and can potentially cause
>> TruncateMultiXact to remove every file in pg_multixact/offsets.
>> Restarting the cluster becomes impossible because TrimMultiXact barfs.
>>
>> - TruncateMultiXact now logs a message if the oldest multixact does
>> not precede the earliest one on disk and is not equal to the next
>> multixact and yet does not exist.  The value of the log message is
>> that it discovered the bug mentioned in the previous line, so I think
>> it's earning its keep.
>>
>> With this version, I'm able to see that when you start up a
>> 9.3.latest+this patch with a cluster that has a bogus value of 1 in
>> relminmxid, datminmxid, and the control file, autovacuum vacuums
>> everything in sight, all the values get set back to the right thing,
>> and the next checkpoint enables the member-wraparound guards.  This
>> works with both autovacuum=on and autovacuum=off; the emergency
>> mechanism kicks in as intended.  We'll want to warn people with big
>> databases who upgrade to 9.3.0 - 9.3.4 via pg_upgrade that they may
>> want to pre-vacuum those tables before upgrading to avoid a vacuum
>> storm.  But generally I'm pretty happy with this: forcing those values
>> to get fixed so that we can guard against member-space wraparound
>> seems like the right thing to do.
>>
>> So, to summarize, this patch does the following:
>>
>> - Fixes the failure-to-start problems introduced in 9.4.2 in
>> complicated pg_upgrade scenarios.
>> - Prevents the new calls to find_multixact_start we added in 9.4.2
>> from happening during recovery, where they can only create failure
>> scenarios.  The call in TruncateMultiXact that has been there all
>> along is not eliminated, but now handles failure more gracefully.
>> - Fixes possible incorrect removal of every single
>> pg_multixact/offsets file when no multixacts exist; one file should be
>> kept.
>> - Forces aggressive autovacuuming when the control file's
>> oldestMultiXid doesn't point to a valid MultiXact and enables member
>> wraparound at the next checkpoint following the correction of that
>> problem.
>
> With this patch, when I run the script
> "checkpoint-segment-boundary.sh" from
> http://www.postgresql.org/message-id/CAEepm=1_KbHGbmPVmkUGE5qTP+B4efoCJYS0unGo-Mc5NV=u...@mail.gmail.com
> I see the following during shutdown checkpoint:
>
> LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound
>
> That message comes from SimpleLruTruncate.

Suggested patch attached.

-- 
Thomas Munro
http://www.enterprisedb.com


fence-post.patch
Description: Binary data



Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Thomas Munro
On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas  wrote:
> Here's a new version with some more fixes and improvements:
>
> - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset
> when the oldest offset became known if the now-known value happened to
> be zero.  Fixed.
>
> - SetOffsetVacuumLimit now logs useful information at the DEBUG1
> level, so that you can see that it's doing what it's supposed to.
>
> - TruncateMultiXact now calls DetermineSafeOldestOffset to adjust the
> offsetStopLimit even if it can't truncate anything.  This seems
> useless, but it's not, because it may be that the last checkpoint
> advanced lastCheckpointedOldest from a bogus value (i.e. 1) to a real
> value, and now we can actually set offsetStopLimit properly.
>
> - TruncateMultiXact no longer calls find_multixact_start when there
> are no remaining multixacts.  This is actually a completely separate
> bug that goes all the way back to 9.3.0 and can potentially cause
> TruncateMultiXact to remove every file in pg_multixact/offsets.
> Restarting the cluster becomes impossible because TrimMultiXact barfs.
>
> - TruncateMultiXact now logs a message if the oldest multixact does
> not precede the earliest one on disk and is not equal to the next
> multixact and yet does not exist.  The value of the log message is
> that it discovered the bug mentioned in the previous line, so I think
> it's earning its keep.
>
> With this version, I'm able to see that when you start up a
> 9.3.latest+this patch with a cluster that has a bogus value of 1 in
> relminmxid, datminmxid, and the control file, autovacuum vacuums
> everything in sight, all the values get set back to the right thing,
> and the next checkpoint enables the member-wraparound guards.  This
> works with both autovacuum=on and autovacuum=off; the emergency
> mechanism kicks in as intended.  We'll want to warn people with big
> databases who upgrade to 9.3.0 - 9.3.4 via pg_upgrade that they may
> want to pre-vacuum those tables before upgrading to avoid a vacuum
> storm.  But generally I'm pretty happy with this: forcing those values
> to get fixed so that we can guard against member-space wraparound
> seems like the right thing to do.
>
> So, to summarize, this patch does the following:
>
> - Fixes the failure-to-start problems introduced in 9.4.2 in
> complicated pg_upgrade scenarios.
> - Prevents the new calls to find_multixact_start we added in 9.4.2
> from happening during recovery, where they can only create failure
> scenarios.  The call in TruncateMultiXact that has been there all
> along is not eliminated, but now handles failure more gracefully.
> - Fixes possible incorrect removal of every single
> pg_multixact/offsets file when no multixacts exist; one file should be
> kept.
> - Forces aggressive autovacuuming when the control file's
> oldestMultiXid doesn't point to a valid MultiXact and enables member
> wraparound at the next checkpoint following the correction of that
> problem.

With this patch, when I run the script
"checkpoint-segment-boundary.sh" from
http://www.postgresql.org/message-id/CAEepm=1_KbHGbmPVmkUGE5qTP+B4efoCJYS0unGo-Mc5NV=u...@mail.gmail.com
I see the following during shutdown checkpoint:

LOG:  could not truncate directory "pg_multixact/offsets": apparent wraparound

That message comes from SimpleLruTruncate.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Thomas Munro
On Mon, Jun 1, 2015 at 4:55 PM, Noah Misch  wrote:
> While testing this (with inconsistent-multixact-fix-master.patch applied,
> FWIW), I noticed a nearby bug with a similar symptom.  TruncateMultiXact()
> omits the nextMXact==oldestMXact special case found in each other
> find_multixact_start() caller, so it reads the offset of a not-yet-created
> MultiXactId.  The usual outcome is to get rangeStart==0, so we truncate less
> than we could.  This can't make us truncate excessively, because
> nextMXact==oldestMXact implies no table contains any mxid.  If nextMXact
> happens to be the first of a segment, an error is possible.  Procedure:
>
> 1. Make a fresh cluster.
> 2. UPDATE pg_database SET datallowconn = true
> 3. Consume precisely 131071 mxids.  Number of offsets per mxid is unimportant.
> 4. vacuumdb --freeze --all
>
> Expected state after those steps:
> $ pg_controldata | grep NextMultiXactId
> Latest checkpoint's NextMultiXactId:  131072
>
> Checkpoint will fail like this:
> 26699 2015-05-31 17:22:33.134 GMT LOG:  statement: checkpoint
> 26661 2015-05-31 17:22:33.134 GMT DEBUG:  performing replication slot 
> checkpoint
> 26661 2015-05-31 17:22:33.136 GMT ERROR:  could not access status of 
> transaction 131072
> 26661 2015-05-31 17:22:33.136 GMT DETAIL:  Could not open file 
> "pg_multixact/offsets/0002": No such file or directory.
> 26699 2015-05-31 17:22:33.234 GMT ERROR:  checkpoint request failed
> 26699 2015-05-31 17:22:33.234 GMT HINT:  Consult recent messages in the 
> server log for details.
> 26699 2015-05-31 17:22:33.234 GMT STATEMENT:  checkpoint
>
> This does not block startup, and creating one mxid hides the problem again.
> Thus, it is not a top-priority bug like some other parts of this thread.  I
> mention it today mostly so it doesn't surprise hackers testing other fixes.

Thanks.   As mentioned elsewhere in the thread, I discovered that the
same problem exists for page boundaries, with a different error
message.  I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1
and master with the same results:

FATAL:  could not access status of transaction 2048
DETAIL:  Could not read from file "pg_multixact/offsets/" at
offset 8192: Undefined error: 0.

FATAL:  could not access status of transaction 131072
DETAIL:  Could not open file "pg_multixact/offsets/0002": No such file
or directory.

But, yeah, this isn't the bug we're looking for.

-- 
Thomas Munro
http://www.enterprisedb.com


checkpoint-page-boundary.sh
Description: Bourne shell script


checkpoint-segment-boundary.sh
Description: Bourne shell script



Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Thomas Munro
On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera  wrote:
> Thomas Munro wrote:
>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera  
>> wrote:
>> > My guess is that the file existed, and perhaps had one or more pages,
>> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
>> > back.  read() returns 0 in this case but doesn't set errno.
>> >
>> > I didn't find a way to set things so that the file exists but is of
>> > shorter contents than oldestMulti by the time the checkpoint record is
>> > replayed.
>>
>> I'm just starting to learn about the recovery machinery, so forgive me
>> if I'm missing something basic here, but I just don't get this.  As I
>> understand it, offsets/0046 should either have been copied with that
>> page present in it if it existed before the backup started (apparently
>> not in this case), or extended to contain it by WAL records that come
>> after the backup label but before the checkpoint record that
>> references it (also apparently not in this case).
>
> Exactly --- that's the spot at which I am, also.  I have had this
> spinning in my head for three days now, and tried every single variation
> that I could think of, but like you I was unable to reproduce the issue.
> However, our customer took a second base backup and it failed in exactly
> the same way, module some changes to the counters (the file that
> didn't exist was 004B rather than 0046).  I'm still at a loss at what
> the failure mode is.  We must be missing some crucial detail ...

I have finally reproduced that error!  See attached repro shell script.

The conditions are:

1.  next multixact == oldest multixact (no active multixacts, pointing
past the end)
2.  next multixact would be the first item on a new page (multixact % 2048 == 0)
3.  the page must not be the first in a segment (or we'd get the
read-zeroes case)

That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
vacuum followed by no multixact creations right before your backup
checkpoint).  That seems like reasonably low odds... if it happened
twice in a row, maybe I'm missing something here and there is some
other way to get this...

I realise now that this is actually a symptom of a problem spotted by
Noah recently:

http://www.postgresql.org/message-id/20150601045534.gb23...@tornado.leadboat.com

He noticed the problem for segment boundaries, when not in recovery.
In recovery, segment boundaries don't raise an error (the read-zeroes
case applies), but page boundaries do.  The fix is probably to do
nothing if they are the same, as we do elsewhere, like in the attached
patch.

-- 
Thomas Munro
http://www.enterprisedb.com


fix-truncate-none.patch
Description: Binary data


copy-page-boundary.sh
Description: Bourne shell script



Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Thomas Munro
On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera  wrote:
> My guess is that the file existed, and perhaps had one or more pages,
> but the wanted page doesn't exist, so we tried to read but got 0 bytes
> back.  read() returns 0 in this case but doesn't set errno.
>
> I didn't find a way to set things so that the file exists but is of
> shorter contents than oldestMulti by the time the checkpoint record is
> replayed.

I'm just starting to learn about the recovery machinery, so forgive me
if I'm missing something basic here, but I just don't get this.  As I
understand it, offsets/0046 should either have been copied with that
page present in it if it existed before the backup started (apparently
not in this case), or extended to contain it by WAL records that come
after the backup label but before the checkpoint record that
references it (also apparently not in this case).  If neither of these
things happened then that is completely different from the
segment-does-not-exist case where we read zeroes if in recovery on the
assumption that later WAL records must be about to delete the file.
There is no way that future WAL records will make an existing segment
file shorter! So at this point don't we know that there is something
wrong with the backup itself?

Put another way, if you bring this up under 9.4.1, won't it also be
unable to access multixact 4624559 at this point?  Of course it won't
try to do so during recovery like 9.4.2 does, but I'm just trying to
understand how this is supposed to work for 9.4.1 if it needs to
access that multixact for other reasons once normal running is reached
(say you recover up to that checkpoint, and then run
pg_get_multixact_members, or a row has that xmax and its members to be
looked up by a vacuum or any normal transaction).  In other words,
isn't this a base backup that is somehow broken, not at all like the
pg_upgrade corruption case which involved the specific case of
multixact 1 and an entirely missing segment file, and 9.4.2 just tells
you about it sooner than 9.4.1?

For what it's worth, I've also spent a lot of time trying to reproduce
basebackup problems with multixact creation, vacuums and checkpoints
injected at various points between copying backup label, pg_multixact,
and pg_control.  I have so far failed to produce anything more
interesting than the 'reading zeroes' case (see attached
"copy-after-trunction.sh") and a case where the control file points at
a segment that doesn't exist, but it doesn't matter because the backup
label points at a checkpoint from a time when it did and
oldestMultiXactId is updated from there, and then procedes exactly as
it should (see "copy-before-truncation.sh").  I updated my scripts to
look a bit more like your nicely automated example (though mine use a
different trick to create small quantities of multixacts so they run
against unpatched master).  I have also been considering a scenario
where multixact ID wraparound occurs during basebackup with some
ordering that causes trouble, but I don't yet see why it would break
if you replay the WAL from the backup label checkpoint (and I think
the repro would take days/weeks to run...)

-- 
Thomas Munro
http://www.enterprisedb.com


copy-after-truncation.sh
Description: Bourne shell script


copy-before-truncation.sh
Description: Bourne shell script



Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Sat, May 30, 2015 at 1:46 PM, Andres Freund  wrote:
> On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
>> It seems pretty clear that we can't effectively determine anything
>> about member wraparound until the cluster is consistent.
>
> I wonder if this doesn't actually hints at a bigger problem.  Currently,
> to determine where we need to truncate SlruScanDirectory() is
> used. That, afaics, could actually be a problem during recovery when
> we're not consistent.
>
> Consider the scenario where a very large database is copied while
> running. At the start of the backup we'll determine at which checkpoint
> recovery will start and store it in the label. After that the copy will
> start, copying everything slowly. That works because we expect recovery
> to fix things up.  The problem I see WRT multixacts is that the copied
> state of pg_multixact could be wildly different from the one at the
> label's checkpoint. During recovery, before reaching the first
> checkpoint, we'll create multixact files as used at the time of the
> checkpoint. But the rest of pg_multixact may be ahead 2**31 xacts.

Yes, I think the code in TruncateMultiXact that scans for the earliest
multixact only works when the segment files span at most 2^31 of
multixact space. If they span more than that, MultiXactIdPrecedes is
no longer able to provide a total ordering, so the result of the scan
may be wrong, depending on the order in which it encounters the files.

Incidentally, your description of that scenario gave me an idea for
how to reproduce a base backup that 9.4.2 or master can't start.  I
tried this first:

1.  Set up with max_wal_senders = 1, wal_level = hot_standby, initdb
2.  Create enough multixacts to fill a couple of segments in
pg_multixacts/offsets using "explode_mxact_members 99 1000" (create
foo table first)
3.  Start a base backup with logs, but break in
src/backend/replication/basebackup.c after
sendFileWithContent(BACKUP_LABEL_FILE, labelfile); and before sending
the contents of the data dir (including pg_multixacts)... (or just put
a big sleep in there)
4.  UPDATE pg_database SET datallowconn = true; vacuumdb --freeze
--all; CHECKPOINT;, see that offsets/ is now gone and
oldestMultiXid is 98001 in pg_control
5.  ... allow the server backend to continue; the basebackup completes.

Inspecting the new data directory, I see that offsets/ is not
present as expected, and pg_control contains the oldestMultiXid 98001.

Since pg_control was copied after pg_multixacts and my database didn't
move between those copies, it points to a valid multixact (unlike the
pg_upgrade scenario) and is able to start up, but does something
different again which may or may not be good, I'm not sure:

LOG:  database system was interrupted; last known up at 2015-05-30 14:30:23 NZST
LOG:  file "pg_multixact/offsets/" doesn't exist, reading as zeroes
LOG:  redo starts at 0/728
LOG:  consistent recovery state reached at 0/70C8898
LOG:  redo done at 0/70C8898
LOG:  last completed transaction was at log time 2015-05-30 14:30:17.261436+12
LOG:  database system is ready to accept connections

My next theory about how to get a FATAL during startup is something
like this:  Break in basebackup.c in between copying pg_multixacts and
copying pg_control (simulating a very large/slow file copy, perhaps if
'base' happens to get copied after 'pg_multixacts', though I don't
know if that's possible), and while it's stopped, generate some
offsets segments, vacuumdb --freeze --all, checkpoint and then create a
few more multixacts, then checkpoint again (so that oldestMultiXact is
not equal to nextMultiXact).  Continue.  Now pg_control's
oldestMultiXactId now points at a segment file that didn't exist when
pg_multixacts was copied.  I haven't managed to get this to work (ie
produce a FATAL) and I'm out of time for a little while, but wanted to
share this idea in case it helps someone.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
here too, in this patch.

> C. I think we should also change TruncateMultiXact() to truncate
> offsets first, and then members.  As things stand, if we truncate
> members first, we increase the risk of seeing an offset that will fail
> when passed to find_multixact_start(), because TruncateMultiXact()
> might get interrupted before it finishes.  That seem like an
> unnecessary risk.

I don't see why the order matters.  find_multixact_start() doesn't
read the members, only the offsets SLRU (ie the index into members,
not the contents of members).  As I understand it, the only time we
need to access the members themselves is when we encounter multixacts
in tuple headers (updating, locking or vacuuming).  If you have
truncated multixacts referenced in your tuples then you have a
different form of corruption than the
pg_upgrade-tramples-on-oldestMultiXactId case we're trying to handle
gracefully here.

-- 
Thomas Munro
http://www.enterprisedb.com


tolerate-missing-offset-segments-wip.patch
Description: Binary data



Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Thomas Munro
On Fri, May 29, 2015 at 7:56 AM, Robert Haas  wrote:
> On Thu, May 28, 2015 at 8:51 AM, Robert Haas  wrote:
>> [ speculation ]
>
> [...]  However, since
> the vacuum did advance relfrozenxid, it will call vac_truncate_clog,
> which will call SetMultiXactIdLimit, which will propagate the bogus
> datminmxid = 1 setting into shared memory.

Ah!

> [...]
>
> - There's a third possible problem related to boundary cases in
> SlruScanDirCbRemoveMembers, but I don't understand that one well
> enough to explain it.  Maybe Thomas can jump in here and explain the
> concern.

I noticed something in passing which is probably not harmful, and not
relevant to this bug report, it was just a bit confusing while
testing:  SlruScanDirCbRemoveMembers never deletes any files if
rangeStart == rangeEnd.  In practice, if you have an idle cluster with
a lot of multixact data and you VACUUM FREEZE all databases and then
CHECKPOINT, you might be surprised to see no member files going away
quite yet, but they'll eventually be truncated by a future checkpoint,
once rangeEnd has had a chance to advance to the next page due to more
multixacts being created.

If we want to fix this one day, maybe the right thing to do is to
treat the rangeStart == rangeEnd case the same way we treat rangeStart
< rangeEnd, that is, to assume that the range of pages isn't
wrapped/inverted in this case.  Although we don't have the actual
start and end offset values to compare here, we know that for them to
fall on the same page, the start offset index must be <= the end
offset index (since we added the new error to prevent member space
wrapping, we never allow the end to get close enough to the start to
fall on the same page).  Like this (not tested):

diff --git a/src/backend/access/transam/multixact.c
b/src/backend/access/transam/multixact.c
index 9568ff1..4d0bcc4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2755,7 +2755,7 @@ SlruScanDirCbRemoveMembers(SlruCtl ctl, char
*filename, int segpage,
  /* Recheck the deletion condition.  If it still holds, perform deletion */
  if ((range->rangeStart > range->rangeEnd &&
  segpage > range->rangeEnd && segpage < range->rangeStart) ||
- (range->rangeStart < range->rangeEnd &&
+ (range->rangeStart <= range->rangeEnd &&
  (segpage < range->rangeStart || segpage > range->rangeEnd)))
  SlruDeleteSegment(ctl, filename);

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [GENERAL] Roadmap for Postgres on AIX

2013-04-17 Thread Thomas Munro
On 19 March 2013 01:00, Tom Lane  wrote:

> Wasim Arif  writes:
> > What is the road map for Postgres on the AIX platform? I understand that
> > the pg build farm contains an AIX 5.3 server; are there any plans to
> > upgrade to 6.1 and 7.1?
>
> The reason there's an AIX 5.3 buildfarm member is that someone cares
> enough about preserving portability to that platform to provide a
> buildfarm member.  If you're worried about other AIX releases, I suggest
> you do likewise.  It's not a huge burden.  Basic info about it is here:
> http://buildfarm.postgresql.org/cgi-bin/register-form.pl
>
> By and large, our approach to AIX is the same as to any other platform:
> we'll support it as long as users of the platform provide testing and
> any necessary portability fixes.  That burden might fall more directly
> on you, as one of a relatively small number of users, than it would on
> somebody using say Linux or BSD.  But as long as you are willing to do
> some of the work you need not fear that we're planning to abandon you.
>

For those organisations running PostgreSQL on AIX, I wonder if it would be
possible for a member of 'IBM PartnerWorld' to use the free 'Virtual
Loaner' programme[1] to set up build farm members on a range of AIX
versions and POWER chips.

My employer is in this category, but I am not in a position to contribute
company time patches or resources currently (although that might change).
From memory we've had to make a couple of local tweaks to makefiles for
our particular versions of things, although we are mostly focussed on
clients on AIX, not the server.  But perhaps some of the other (rare!) AIX
users from the list might be able to look into VPL farm animals?

[1] http://www-304.ibm.com/partnerworld/wps/pub/systems/vlp/index


Re: [GENERAL] Money casting too liberal?

2013-03-29 Thread Thomas Munro
On 28 March 2013 13:52, Shaun Thomas  wrote:

> On 03/28/2013 07:43 AM, Gavan Schneider wrote:
>
>  Personally I have ignored the money type in favour of numeric. Money
>> seemed to do too much behind the scenes for my taste, but, that's me
>> being lazy as well, I haven't spend much time trying to understand its
>> features.
>>
>
> You're not the only one. In the financial industry, we can't even use the
> money type for a few reasons:
>
> [... snip ... ]
>

Speaking as another finance/trading industry techie who works with
various kinds of price data, I also find the money type interesting but
useless.  I am interested in scaled integers in general though, be they of
fixed scale (per column, part of the type) or of floating scale (floating
point decimal).  I have run into those all over the place in software and
protocols.  They can be stored and computed more efficiently than the more
general variable-sized BCD string representation, where scale and precision
are more like check constraints than limits of representation; real limits
of representation allow fixed-size bitfields to be used.

For floating point decimal, IEEE 754 2008 decimal32, decimal64, decimal128
types would make interesting additions (the scale travels with each
number... it's essentially a bitfield of sign + exponent/scale + significand,
which is efficient for software implementations, or an isomorphic BCD-like
fixed-size encoding which is used by IBM's POWER DFP hardware).  But that can be
implemented as custom types outside core PostgreSQL (I've done some initial
experimentation with this, defining a type DECIMAL64, and not encountered
any obstacles, using IBM decNumber, which is available under the liberal
ICU license or the GPL license, and is used by many projects; there is also
an Intel library with a BSD license IIRC).

For fixed point decimal, a new scaled integer type with fixed scale and
precision could be made that uses a different representation depending on
the parameters, much like the way Common Lisp implementations use fixnums
based on word size where possible, and fall back to arbitrarily sized
representations if needed.  That would of course be implementable outside
core too.

Even the built-in NUMERIC could in theory use multiple encodings, whenever
the scale and precision are provided, since it can work out whether they
are within the limits that are implementable with different binary
representations (in other words, when you ask for NUMERIC(*, 2), do what
MONEY for US locales does, otherwise fall back to the more general case).
But that would change the rules about when rewrites are required if you
change scale/precision, so it wouldn't be reasonable.


[GENERAL] Exclusion constraints with time expressions

2012-11-05 Thread Thomas Munro
Hi

I am using 9.1.6, and I've set up a partitioned table as described in the
manual, with partitions based on a timestamptz column called 'time'.
Constraint exclusion works nicely when I select ranges of times with
literal constants.  But why would a WHERE clause like the following not
benefit from constraint exclusion?

  time > TIMESTAMPTZ '2012-11-03 00:00:00Z' + INTERVAL '24 hours'

Isn't that expression on the right reducible to a constant up front?
 Obviously I can use a host language to do the arithmetic and provide a
constant, but I am curious to understand why that wouldn't be
constant-folded.

Thanks
Thomas


Re: [GENERAL] migrating/spliting Postgres data directory on OpenIndiana

2012-10-25 Thread Thomas Munro
On 25 October 2012 19:46, Boris Epstein  wrote:
> And if I want to split the storage - i.e., put databases into different
> directories - can I do that?

Take a look at the tablespace feature:

http://www.postgresql.org/docs/current/static/sql-createtablespace.html

You can move existing tables with ALTER TABLE (likewise for indexes),
and change the default for new tables created in future in a database
with ALTER DATABASE:

http://www.postgresql.org/docs/9.2/static/sql-altertable.html
http://www.postgresql.org/docs/9.2/static/sql-alterdatabase.html

Before the tablespace feature was added, people used to create
symlinks in the data directory if they wanted to use different storage
for different objects (say, a special fast disk array for certain
data).  The tablespace feature actually works the same way, but it's
managed for you via DDL commands.
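
A sketch of those commands (the path and names are made up):

CREATE TABLESPACE fast_space LOCATION '/mnt/fastdisk/pgdata';

-- move an existing table and an index onto it
ALTER TABLE big_table SET TABLESPACE fast_space;
ALTER INDEX big_table_pkey SET TABLESPACE fast_space;

-- change the default tablespace for a database (this also moves the
-- objects currently in its old default tablespace, and the database
-- must have no other active connections)
ALTER DATABASE mydb SET TABLESPACE fast_space;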

Hope that helps,
Thomas




Re: [GENERAL] Curosity question regarding "LOCK" NOWAIT

2012-09-22 Thread Thomas Munro
On 22 September 2012 05:08, John R Pierce  wrote:
> On 09/21/12 7:43 PM, David Johnston wrote:
>>
>> Has there been any discussion regarding adding a time-limited version of
>> NOWAIT, say: “WAITONLY 50” (milliseconds), when dealing the explicit LOCK
>> TABLE or the SELECT…FOR(SHARE|UPDATE) commands?
>
>
> is this a feature in any other major databases?

There may be other/better ways to do this, I'm only a casual user of
at least one of these, but here is what I could find for the big
three:

Oracle:
Error if locks can't be obtained immediately:
  SELECT ... FOR UPDATE NOWAIT
Error if locks can't be obtained in <timeout>:
  SELECT ... FOR UPDATE WAIT <seconds>
Skip rows that can't be locked immediately:
  SELECT ... FOR UPDATE SKIP LOCKED

Microsoft SQL Server:
Error if locks can't be obtained immediately:
  SELECT ... FOR UPDATE WITH (NOWAIT)
Error if locks can't be obtained in <timeout>:
  SET LOCK_TIMEOUT <milliseconds>
  SELECT ... FOR UPDATE
Skip rows that can't be locked immediately:
  SELECT ... FOR UPDATE WITH (READPAST)

IBM DB2:
Error if locks can't be obtained immediately:
  SET CURRENT LOCK TIMEOUT NOWAIT
  SELECT ... FOR UPDATE
Error if locks can't be obtained in <timeout>:
  SET CURRENT LOCK TIMEOUT WAIT <seconds>
  SELECT ... FOR UPDATE
Skip rows that can't be locked immediately:
  -- currently z/OS version only
  SELECT ... FOR UPDATE SKIP LOCKED ROWS
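
For comparison, the closest things available in PostgreSQL right now
(sketch only; 'accounts' is a made-up table); as far as I know there is
no separate lock wait timeout, which is the gap WAITONLY would fill:

-- error immediately if the rows can't be locked:
SELECT * FROM accounts WHERE id = 42 FOR UPDATE NOWAIT;

-- a coarse substitute: bound the whole statement, not just the lock wait
SET statement_timeout = '50ms';
SELECT * FROM accounts WHERE id = 42 FOR UPDATE;
RESET statement_timeout;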

> is this in the sql spec?

My understanding is that the SQL 1992 spec doesn't talk about
locking directly, it talks about isolation levels and
updatability (and I don't have access to the 2003 spec to check
if that has changed).  Although it does standardise FOR
UPDATE (as an optional updatability clause in a cursor
declaration), it's not for explicit locking, it
specifies that a cursor is updatable.  Locking is an
implementation matter (and the use of FOR UPDATE outside of a
cursor declaration, ie in a plain SELECT statement, may be
non-standard anyway).  NOWAIT is not an ANSI SQL keyword, and
WAIT is a keyword reserved for future use.

Regards,
Thomas Munro




Re: [GENERAL] PostgreSQL server embedded in NAS firmware?

2012-09-08 Thread Thomas Munro
On 6 September 2012 23:40, Andrew Barnham  wrote:
> Scratch that. An immediate show stopping pitfall occurs to me: the necessity
> to match CPU/OS Architecture between primary server and replicate target.
> Doubtful that there are any consumer NAS products out there running linux on
> 64bit/intel

Hi

I have a super cheapskate rig along those lines at home, doing
replication among other things: I used an HP Microserver (not
marketed as a 'consumer NAS' exactly, but the same general idea:
a low cost black cube with drives bays, SATA ports, a small
amount of ECC RAM and a low power dual core amd64 CPU).  I run
Debian GNU/Linux and have a bunch of PostgreSQL databases,
backups and virtual machines on it.  My goals were: cheap to buy,
cheap to run, reasonably reliable, quiet, small, inoffensive to
the eye.  I filled it up with 'green' 5400RPM drives that I had
spare from another project, configured software RAID arrays with
XFS on top, and put it on a shelf to run headless.  A friend has the
same box but runs FreeNAS on it so he can use ZFS and swears by
it (he also added a 4 x 2.5" adaptor to be able to reach the
maximum of 8 drives, which I think requires adding a controller
card, whereas I used the 5.25" bay for a 5th 3.5" drive).  The
machines were going for around 150 GBP when I bought, and I added
some RAM.  Last time I measured it it was drawing around 50W (a
bit more when busy, a bet less when idle), which works out to
under 50 quid a year to run at London retail electricity prices,
comparable to a light bulb.  This is surely about the slowest
database hardware money can buy, but handles my hobbiest
databases (~1TB for the largest) and a bunch of streaming
replicas and backups from remote servers just fine.  I haven't
checked, but I would expect it to be the slowest build farm
member...

Thomas Munro




[GENERAL] Statistical aggregates with intervals

2012-08-22 Thread Thomas Munro
Hi

I noticed that 'avg' works on 'interval', but 'stddev' and 'variance' don't:

hack=> create table test (start_time timestamptz, end_time timestamptz);
CREATE TABLE
hack=> insert into test values (now(), now() + interval '1 second');
INSERT 0 1
hack=> insert into test values (now(), now() + interval '1 second');
INSERT 0 1
hack=> insert into test values (now(), now() + interval '4 second');
INSERT 0 1
hack=> select avg(end_time - start_time) from test;
   avg
----------
 00:00:02
(1 row)

hack=> select stddev(end_time - start_time) from test;
ERROR:  function stddev(interval) does not exist
LINE 1: select stddev(end_time - start_time) from test;
   ^
HINT:  No function matches the given name and argument types. You
might need to add explicit type casts.

Sure enough the standard deviation of time intervals can be computed
by following that hint:

hack=> select interval '1 second' * stddev(extract(epoch from end_time
- start_time)) as stddev from test;
     stddev
-----------------
 00:00:01.732051
(1 row)

But is there some way I can use CREATE AGGREGATE to define stddev for
intervals in terms of the built-in stddev aggregate, just transforming
the inputs and output?  Or am I missing something fundamental that
explains why stddev(interval) isn't supported?
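
For concreteness, here is an untested sketch of the sort of thing I
mean, leaning on the float8_accum and float8_stddev_samp support
functions behind the built-in stddev(float8) (so quite possibly not
the right way to go about it):

  CREATE FUNCTION interval_stddev_accum(float8[], interval)
  RETURNS float8[]
  LANGUAGE sql IMMUTABLE
  AS $$ SELECT float8_accum($1, extract(epoch FROM $2)) $$;

  CREATE FUNCTION interval_stddev_final(float8[])
  RETURNS interval
  LANGUAGE sql IMMUTABLE
  AS $$ SELECT float8_stddev_samp($1) * interval '1 second' $$;

  CREATE AGGREGATE stddev(interval) (
    sfunc = interval_stddev_accum,
    stype = float8[],
    finalfunc = interval_stddev_final,
    initcond = '{0,0,0}'
  );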

Thanks!




Re: [GENERAL] index update

2012-07-18 Thread Thomas Munro


Pawel Veselov  wrote:

>Hi.
>
>If I have a lot (10k) tables, and each table has a btree index, and all the
>tables are being constantly inserted into, would all the indexes have to be
>in memory, and would effectively start fighting for space?
>
>Thank you,
>  Pawel.



Re: [GENERAL] ERROR: function crosstab(unknown, unknown) does not exist

2012-07-13 Thread Thomas Munro


Adrian Klaver  wrote:

>On 07/05/2012 07:46 AM, Stefan Schwarzer wrote:
>>> Now, when I launch a query which includes "crosstab()" as a postgres 
>>> user, everything works fine. However, if I launch it as user XXX, it 
>>> complains:
>

 The search path is indicated as:

   "$user", public, metadata, admin, gis, tablefunc, postgis, topology

 I added a GRANT USAGE ON SCHEMA tablefunc TO XXX. Just in case, but this 
 didn't solve the problem. Still wondering why it works for user Postgres, 
 but not for user XXX...

>>>
>>> Using psql log in as user XXX and run SELECT current_schemas('f');
>>> Do same as the postgres user.
>>> Let us know the results.
>>
>> Ahh…. Indeed, no "tablefunc" schema in there. So, what do I need to do? And 
>> what does the command mean?
>
>See here for info:
>http://www.postgresql.org/docs/9.1/interactive/functions-info.html
>
>Per Tom's suggestion you need to check the permissions on the schema.
>One way to do that is, from psql type the following and enter:
>  \dn+
>>
>
>
>-- 
>Adrian Klaver
>adrian.kla...@gmail.com
>
>
>



[GENERAL] ctid ranges

2012-06-08 Thread Thomas Munro
Hi

In 9.1.3, this is fast, handled with a tid scan using the physical address:

SELECT ... FROM ... WHERE ctid = ...;

This is slow, handled with a seq scan (as are various rephrasing with
<, <=, etc):

SELECT ... FROM ... WHERE ctid BETWEEN ... AND ...;
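
To make that concrete (table name, sizes and ctid values invented for
illustration):

  CREATE TABLE t AS SELECT generate_series(1, 100000) AS i;

  EXPLAIN SELECT * FROM t WHERE ctid = '(100,1)';
  -- expect a Tid Scan node

  EXPLAIN SELECT * FROM t WHERE ctid BETWEEN '(100,1)' AND '(200,1)';
  -- expect a Seq Scan node with a filter on ctid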

Is there a way to retrieve the rows in a physical range quickly?

(I realise this is a pretty odd thing to want to do... I was
experimenting with a crackpot idea for storing some data in a known
physical order and finding the beginnings and ends of ranges by
binary chop, instead of using a btree.)

Thanks
Thomas Munro



[GENERAL] IEEE 754-2008 decimal numbers

2010-10-10 Thread Thomas Munro
Hi

Has anyone done any work on IEEE 754-2008 decimal types for PostgreSQL?

I couldn't find anything, so I was thinking it might be a fun exercise
for learning about extending PostgreSQL with user defined types.  My
first goal is to be able to store decimal numbers with a smaller disk
footprint than NUMERIC.  I was thinking I would start out by defining
types DECIMAL32 and DECIMAL64 and some casts between those types and
NUMERIC.  (A more ambitious project for later would be defining
arithmetic operators etc using compiler/hardware support).
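
To illustrate the footprint point, pg_column_size shows what NUMERIC
values cost today (example values picked arbitrarily); an IEEE
decimal64 value would be a fixed 8 bytes:

  SELECT pg_column_size(123456.78::numeric) AS numeric_bytes,
         pg_column_size(123456.78::float8)  AS float8_bytes;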

Thanks
Thomas
