Re: [GENERAL] mild modification to pg_dump

2017-11-17 Thread Andres Freund
On 2017-11-17 18:56:45 -0300, marcelo wrote:
> Truly, I'm caught up in a very big app, so I have no time to read all
> the docs.

People on this list also have jobs.

- Andres




Re: [GENERAL] postmaster deadlock while logging after syslogger exited

2017-11-16 Thread Andres Freund


On November 16, 2017 7:06:23 PM PST, Tom Lane  wrote:
>Andres Freund  writes:
>> On 2017-11-16 21:39:49 -0500, Tom Lane wrote:
>>> What might be worth thinking about is allowing the syslogger process
>to
>>> inherit the postmaster's OOM-kill-proofness setting, instead of
>dropping
>>> down to the same vulnerability as the postmaster's other child
>processes.
>
>> Hm. I'm a bit scared about that - it doesn't seem that inconceivable
>> that various backends log humongous multi-line messages, leading to
>> syslogger *actually* taking up a fair amount of memory. Note that
>we're
>> using plain stringinfos that ereport(ERROR) in out-of-memory situations,
>> rather than failing more gracefully.
>
>True, but there's no hard limits on the postmaster's memory consumption
>either ... 

Is there a credible scenario where it'd allocate many gigabytes of memory?

> and if the syslogger does get killed on such a basis, we
>have at the least lost a bunch of log output.  On the whole I think we'd be
>better off trying to prevent OOM kills on the syslogger.  (That doesn't
>preclude other mitigation measures.)

It doesn't seem impossible to get into a situation where syslogger is the 
source of the OOM. Just enabling a lot of logging in a workload with many large 
query strings might do it.  So making it less likely to be killed might make 
the problem worse...

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [GENERAL] postmaster deadlock while logging after syslogger exited

2017-11-16 Thread Andres Freund
On 2017-11-16 21:39:49 -0500, Tom Lane wrote:
> > We could work around a situation like that if we made postmaster use a
> > *different* pipe as stderr than the one we're handing to normal
> > backends. If postmaster created a new pipe and closed the read end
> > whenever forking a syslogger, we should get EPIPEs when writing after
> > syslogger died and could fall back to proper stderr or such.
> 
> I think that's nonsense, unfortunately.

Nice phrasing.


> If the postmaster had its own pipe, that would reduce the risk of this
> deadlock because only the postmaster would be filling that pipe, not
> the postmaster and all its other children --- but it wouldn't
> eliminate the risk.

The deadlock happens because postmaster is waiting for syslogger to
accept a message, and syslogger waits for postmaster to restart it. To
resolve the deadlock, postmaster needs to not wait for a dead syslogger,
even if it hasn't yet received & processed the SIGCLD - what other
postmaster children do or don't do doesn't matter for resolving that
cycle. The reason postmaster currently blocks on writing to the pipe,
instead of getting EPIPE, is that both ends of the pipe are still
open. Which in turn is the case because we need to be able to restart
syslogger without passing a new file descriptor to all subprocesses.  If
postmaster instead uses a different pipe to write to, it'll not block
anymore, instead getting EPIPE, and can continue towards starting a new
syslogger.  So I don't think the described deadlock exists if we were to
apply my proposed fix.
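
A minimal sketch of that scheme in C (POSIX calls; the function and
variable names are hypothetical, this is not the actual postmaster
code):

/* Hypothetical sketch: give the postmaster a private stderr pipe so a
 * dead syslogger yields EPIPE instead of a blocking write. */
#include <signal.h>
#include <unistd.h>

static int pm_log_fd = -1;      /* write end of the postmaster-only pipe */

static void
start_syslogger_with_private_pipe(void)
{
    int fds[2];

    if (pipe(fds) < 0)
        return;                 /* fall back to plain stderr */

    /* ... fork the syslogger here; the child keeps fds[0] and reads
     * from it, analogous to the existing shared syslogger pipe ... */

    /* The postmaster keeps only the write end.  Once the syslogger
     * dies, its read end goes away, so write() fails with EPIPE
     * rather than blocking on a full pipe buffer. */
    close(fds[0]);
    pm_log_fd = fds[1];
    signal(SIGPIPE, SIG_IGN);   /* get EPIPE from write(), not a signal */
}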


What this obviously would *not* guarantee is being able to start a new
syslogger, but it seems fairly impossible to guarantee that. So sure,
other processes would still block until syslogger has successfully
restarted - but it's a resolvable situation rather than a hard deadlock,
which the described situation appears to be.


Note that there are plenty of cases where you could run into this even
without being unable to fork new processes. You could, e.g., also run
into this while logging the exit of some other subprocess or such;
there are enough ereports in the postmaster.


> I doubt the increase in reliability would be enough to justify the
> extra complexity and cost.

I'm doubtful about that too.


> What might be worth thinking about is allowing the syslogger process to
> inherit the postmaster's OOM-kill-proofness setting, instead of dropping
> down to the same vulnerability as the postmaster's other child processes.
> That presumes that this was an otherwise-unjustified OOM kill, which
> I'm not quite sure of ... but it does seem like a situation that could
> arise from time to time.

Hm. I'm a bit scared about that - it doesn't seem that inconceivable
that various backends log humongous multi-line messages, leading to
syslogger *actually* taking up a fair amount of memory. Note that we're
using plain stringinfos that ereport(ERROR) in out-of-memory situations,
rather than failing more gracefully.

- Andres




Re: [GENERAL] postmaster deadlock while logging after syslogger exited

2017-11-16 Thread Andres Freund
On 2017-11-17 11:09:56 +0900, Michael Paquier wrote:
> On Fri, Nov 17, 2017 at 10:50 AM, Andres Freund  wrote:
> > On 2017-11-06 15:35:03 -0500, Tom Lane wrote:
> >> David Pacheco  writes:
> >> > I ran into what appears to be a deadlock in the logging subsystem.  It
> >> > looks like what happened was that the syslogger process exited because it
> >> > ran out of memory.  But before the postmaster got a chance to handle the
> >> > SIGCLD to restart it, it handled a SIGUSR1 to start an autovacuum worker.
> >> > That also failed, and the postmaster went to log a message about it, but
> >> > it's blocked on the pipe that's normally connected to the syslogger,
> >> > presumably because the pipe is full because the syslogger is gone and
> >> > hasn't read from it.
> >>
> >> Ugh.
> >
> > I'm somewhat inclined to say that one has to live with this if the
> > system is so resource constrained that processes barely using memory
> > get killed.
> >
> > We could work around a situation like that if we made postmaster use a
> > *different* pipe as stderr than the one we're handing to normal
> > backends. If postmaster created a new pipe and closed the read end
> > whenever forking a syslogger, we should get EPIPEs when writing after
> > syslogger died and could fall back to proper stderr or such.
> 
> I don't have the code on top of my mind, but isn't a custom fd causing
> a small penalty

Sure, there's some minor overhead because every process would need to
close another fd after forking.


> when redirection_done is switched to true because the first process
> generating a message to the syslogger pipe needs to open it first if
> not done yet?

I can't follow. The syslogger pipe is created when the first syslogger
is started (before it's forked!). Which happens before other processes
are created, because they all need to inherit that file descriptor.


> So you'd need proper locking to save from race conditions.

I completely fail to see why this'd be the case. All I'm talking about
is using another pipe between syslogger and postmaster than between
other-processes and syslogger.

Greetings,

Andres Freund




Re: [GENERAL] postmaster deadlock while logging after syslogger exited

2017-11-16 Thread Andres Freund
On 2017-11-06 15:35:03 -0500, Tom Lane wrote:
> David Pacheco  writes:
> > I ran into what appears to be a deadlock in the logging subsystem.  It
> > looks like what happened was that the syslogger process exited because it
> > ran out of memory.  But before the postmaster got a chance to handle the
> > SIGCLD to restart it, it handled a SIGUSR1 to start an autovacuum worker.
> > That also failed, and the postmaster went to log a message about it, but
> > it's blocked on the pipe that's normally connected to the syslogger,
> > presumably because the pipe is full because the syslogger is gone and
> > hasn't read from it.
> 
> Ugh.

I'm somewhat inclined to say that one has to live with this if the
system is so resource constrained that processes barely using memory
get killed.

We could work around a situation like that if we made postmaster use a
*different* pipe as stderr than the one we're handing to normal
backends. If postmaster created a new pipe and closed the read end
whenever forking a syslogger, we should get EPIPEs when writing after
syslogger died and could fall back to proper stderr or such.

Greetings,

Andres Freund




Re: [GENERAL] Postgres 10.1 fails to start: server did not start in time

2017-11-12 Thread Andres Freund
On 2017-11-12 14:26:42 -0500, Tom Lane wrote:
> Christoph Berg  writes:
> > The default systemd timeout seems to be 90s. I have already changed
> > the systemd timeout to infinity (start) and 1h (stop), so only the
> > default pg_ctl timeout remains (60s), which I'd rather not override
> > unilaterally.
> 
> > That said, isn't 60s way too small for shutting down larger clusters?
> > And likewise for starting?
> 
> Well, that's tied into the fact that pg_ctl doesn't disturb the server's
> state if it gives up waiting.  If it did, we would certainly use a larger
> timeout or none at all.

Hm. So partially that's also related to the fact that we didn't have a
good way to know whether the server reacted to the shutdown request or
not. With the infrastructure from

commit f13ea95f9e473a43ee4e1baeb94daaf83535d37c
Author: Tom Lane 
Date:   2017-06-28 17:31:24 -0400

Change pg_ctl to detect server-ready by watching status in postmaster.pid.

we could really do better than just wonder whether our signal to
shutdown was received or not.  There probably should be a quite short
timeout for the server to change status, and then a much longer one for
that shutdown to finish.
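
With that status line in postmaster.pid, the wait logic can distinguish
"signal not yet acted on" from "shutdown in progress". A rough sketch of
peeking at it from a shell (the line position and status tokens are my
reading of that commit, so treat them as assumptions):

# postmaster.pid carries a status word (line 8 here) -- one of
# "starting", "ready", "standby", "stopping" -- that pg_ctl polls.
sed -n '8p' "$PGDATA/postmaster.pid"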


> I don't feel a big need to change that default,
> but if you have a surrounding script that is going to take adverse action
> after a timeout then you need to use a larger value ...

Didn't we have to fiddle with this a bunch in the regression tests, to
get things to work properly on slow animals? If we had to do that, other
people had to do so as well. Not the friendliest experience...

Greetings,

Andres Freund




Re: [GENERAL] xmin increasing within a transaction block?

2017-11-06 Thread Andres Freund
Hi,

On 2017-11-06 10:36:06 +0100, Luca Ferrari wrote:
> Hi all,
> I suspect this has a trivial explanation, but this is what I'm experiencing:
> 
> > CREATE TABLE foo( i int );
> > BEGIN;
> * > INSERT INTO foo(i) VALUES( 1 );
> * > INSERT INTO foo(i) VALUES( 2 );
> * > SELECT xmin, cmin, xmax, cmax, i FROM foo;
>  xmin | cmin | xmax | cmax | i
> --+--+--+--+---
>  2466 |0 |0 |0 | 1
>  2467 |1 |0 |1 | 2
> (2 rows)
> 
> * > SELECT txid_current();
>  txid_current
> --
>  2465
> (1 row)
> 
> 
> Why is xmin greater than the current transaction id (and most notably
> not "fixed")?
> What am I missing here?
> I'm running 9.6.5.

That doesn't look like plain postgres behaviour to me. Any chance you're
using a pooler in statement mode in front of postgres?
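
For comparison, on a direct connection (no pooler) the same sequence
should show a single, fixed xmin - a quick sketch:

BEGIN;
INSERT INTO foo(i) VALUES (1);
INSERT INTO foo(i) VALUES (2);
SELECT xmin, cmin, i FROM foo;  -- both rows share the same xmin
SELECT txid_current();          -- and it matches that xmin
COMMIT;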

Greetings,

Andres Freund




Re: [GENERAL] checkpoint and recovering process use too much memory

2017-11-02 Thread Andres Freund
Hi,

On 2017-11-03 01:43:32 +, tao tony wrote:
> I had an asynchronous streaming replication HA cluster. Each node had 64G
> memory. pg is 9.6.2 and deployed on centos 6.
> 
> 
> Last month the database was killed by the OS kernel for OOM; the checkpoint
> process was killed.
> 
> 
> I noticed the checkpoint process occupied more than 20GB of memory, and it
> was growing every day. In the hot-standby node, the recovering process
> occupied as much memory as the checkpoint process.
> 
> 
> I tuned the checkpoint parameters, but it didn't work.
> 
> Now in the standby node, the checkpoint and recovering processes use more
> than 50GB of memory as below, and I worry that someday the cluster will be
> killed by the OS again.
> 
>PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
> 167158 postgres  20   0 34.9g  25g  25g S  0.0 40.4  46:36.86 postgres: 
> startup process   recovering 00040855004B
> 167162 postgres  20   0 34.9g  25g  25g S  0.0 40.2  17:58.38 postgres: 
> checkpointer process
> 
> 
> shared_buffers = 32GB

I think it's more likely that what you're seeing is just that both
checkpointer and startup process touch most shared buffers and thus show
up as having touched all that memory.

Regards,

Andres Freund




Re: [GENERAL] pgcon2015, what happened to SMR disk technolgy ?

2017-10-17 Thread Andres Freund
On 2017-10-18 06:50:19 +0200, Laurent Laborde wrote:
> On Tue, Oct 17, 2017 at 1:38 PM, Geoff Winkless  wrote:
> 
> > On 17 October 2017 at 11:59, Laurent Laborde  wrote:
> >
> >> What's the point of the seagate archive now ?
> >> Ironwolf, for the same public price, have better performance (obviously)
> >> and, more surprising, a better MTBF.
> >>
> >
> > ​I have no real insight into whether Seagate are still pursuing the
> > product design, but I'm not really surprised that the MTBF is worse: if the
> > shingled disk must write some tracks twice for each individual track write,
> > it seems logical that there will be more write stress and therefore
> > shortened lifespan, no?
> >
> 
> I contacted Seagate and just got a reply: they don't have strategic
> information to share about SMR technology at the moment.
> I guess I saw it coming ^^

What I heard as rumours, not super trustworthy ones but not entirely
uninformed, is that SMR drives are currently pretty much entirely sold
to companies doing online data storage and such.

Greetings,

Andres Freund




Re: [GENERAL] BDR, wal sender, high system cpu, mutex_lock_common

2017-10-11 Thread Andres Freund
On 2017-10-12 10:25:43 +0800, Craig Ringer wrote:
> On 4 October 2017 at 00:21, milist ujang  wrote:
> > On Tue, Oct 3, 2017 at 8:49 PM, Craig Ringer  wrote:
> >>
> >>
> >> Can you get stacks please?
> >>
> >> Use -g
> >
> >
> > # Events: 2K cpu-clock
> > #
> > # Overhead   Command   Shared Object      Symbol
> > # ........   .......   .............      ......
> > #
> > 86.96%  postgres  [kernel.kallsyms]  [k] __mutex_lock_common.isra.5
> > |
> > --- __mutex_lock_common.isra.5
> > read
> 
> 
> Unfortunately it looks like you're using a postgres built with
> -fomit-frame-pointers (the default) on x64, with an older perf not
> built with libunwind. This produces useless stacks.

Just read this mail, but for libunwind to work you'd have to specify
"--call-graph dwarf", no?

- Andres




Re: [GENERAL] core system is getting unresponsive because over 300 cpu load

2017-10-10 Thread Andres Freund
Hi,

On 2017-10-10 13:40:07 -0700, pinker wrote:
> and the total number of connections are increasing very fast (but I suppose
> it's the symptom not the root cause of cpu load) and exceed max_connections
> (1000).

Others mentioned already that that's worth improving.

> System:
> * CentOS Linux release 7.2.1511 (Core) 
> * Linux 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016
> x86_64 x86_64 x86_64 GNU/Linux

Some versions of this kernel have had serious problems with transparent
hugepages. I'd try turning that off. I think it defaults to off even in
that version, but also make sure zone_reclaim_mode is disabled.
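
Something along these lines checks and changes both (paths as on stock
RHEL/CentOS 7 kernels):

# transparent hugepages: inspect, then disable for the running kernel
cat /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# zone reclaim: should report 0 (disabled)
sysctl vm.zone_reclaim_mode
sysctl -w vm.zone_reclaim_mode=0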


> * postgresql95-9.5.5-1PGDG.rhel7.x86_64
> * postgresql95-contrib-9.5.5-1PGDG.rhel7.x86_64
> * postgresql95-docs-9.5.5-1PGDG.rhel7.x86_64
> * postgresql95-libs-9.5.5-1PGDG.rhel7.x86_64
> * postgresql95-server-9.5.5-1PGDG.rhel7.x86_64
> 
> * 4 sockets/80 cores

9.6 has quite some scalability improvements over 9.5. I don't know
whether it's feasible for you to update, but if so, it's worth trying.

How about taking a perf profile to investigate?


> * vm.dirty_background_bytes = 0
> * vm.dirty_background_ratio = 2
> * vm.dirty_bytes = 0
> * vm.dirty_expire_centisecs = 3000
> * vm.dirty_ratio = 20
> * vm.dirty_writeback_centisecs = 500

I'd suggest monitoring /proc/meminfo for the amount of Dirty and
Writeback memory, and see whether rapid changes therein coincide with
periods of slowdown.
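
E.g. something like this, sampled alongside the workload (a sketch):

# print Dirty/Writeback once a second; spikes lining up with stalls
# point at writeback storms
while sleep 1; do
    printf '%s ' "$(date +%T)"
    awk '/^(Dirty|Writeback):/ {printf "%s %s kB  ", $1, $2} END {print ""}' /proc/meminfo
done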


Greetings,

Andres Freund




Re: [GENERAL] Strange checkpoint behavior - checkpoints take a longtime

2017-10-05 Thread Andres Freund
Hi,

On 2017-10-06 05:53:39 +0300, Vladimir Nicolici wrote:
> Hello, it’s postgres 9.6.

Consider setting checkpoint_flush_after to 16MB or something large like that.


> I will probably try the compression on Monday or Tuesday, I can only
> experiment with a single set of changes in a day, and I plan to test
> something else tomorrow.
> 
> Thanks for the suggestions, and sorry for the reply style, but my mail
> client is not best suited for replying inline to individual points.

You should consider getting a new mail client then...

- Andres




Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

2017-10-05 Thread Andres Freund
Hi,

On 2017-10-05 22:58:31 +0300, Vladimir Nicolici wrote:
> I changed some configuration parameters during the night to the values I was 
> considering yesterday:
> 
> - shared_buffers = 144GB #previously 96 GB
> - bgwriter_lru_maxpages = 100  #previously 400
> - checkpoint_timeout = 30min  #previously 5min
> - checkpoint_completion_target = 0.83 #previously 0.85; 0.83 means 25 minutes 
> writes out of 30 minutes.
> - max_wal_size = 96GB #previously 16GB

Which version of postgres is this?

> - wal_buffers = 16MB  #previously 32 MB

That seems quite unlikely to help.

You might want to try also enabling wal_compression, sometimes the WAL
volume is a considerable problem.

I'd suggest reporting some "pidstat -dl 1" output, so we can see which
processes are doing how much IO.

Regards,

Andres





Re: [GENERAL] Logical Replication - test_decoding - unchanged-toast-datum

2017-09-27 Thread Andres Freund
On 2017-09-28 08:19:08 +0800, Craig Ringer wrote:
> This is one of the MANY reasons test_decoding isn't suitable as the base
> for a replication solution. It has "test" in its name for a reason.

FWIW, I don't see why the unchanged-toast handling would be one of those
reasons. It's clearly discernible from actual datums, so ...

I agree that test_decoding isn't a great base of a replication tool, but
I don't think it's completely unsuitable, and I also think that ship has
sailed.

Greetings,

Andres Freund




Re: [GENERAL] Up to date conventional wisdom re max shared_buffer size?

2017-09-20 Thread Andres Freund
On 2017-09-20 13:00:34 -0500, Jerry Sievers wrote:
> >> Pg 9.3 on monster 2T/192 CPU Xenial thrashing
> >
> > Not sure what the word "thrashing" in that sentence means.
> 
> Cases of dozens or hundreds of sessions running typical statements for
> this system but running 100% on their CPUs.  Seems to be triggered by
> certain heavy weight batch jobs kicking off on this generally OLTP
> system.
> 
> ISTM there might be LW lock contention happening around some sort of
> shared resource where the lock wait implementation is a CPU spinner.

Yes, we improved that a lot in 9.5, 9.6 and 10.  The really bad
scenarios - I've seen 95% cpu time spent in locking - should all be
fixed.

I'd try to make sure that both transparent hugepages and zone reclaim
mode are disabled - the latter probably is already, but the former might
still cause some problems.


> > Things have improved a lot since 9.3 WRT to scalability, so I'd not
> > infer too much from 9.3 performance on a larger box.
> 
> Understood.  The situation got worse when we moved to the even bigger
> box also running a 4.x kernel which I presume was no where near existent
> when 9.3 was our current Pg version.

I suspect it's more the bigger box than the newer kernel. The more
sockets and cores you have, the more lock contention bites you. That's
because inter-socket / cpu transfers get more expensive with more cores.


> >> Upgrade pending but we recently started having $interesting performance
> >> issues at times looking like I/O slowness and other times apparently
> >> causing CPU spins.
> >
> > That's not something we can really usefully comment on given the amount
> > of information.
> 
> Ack'd.
> 
> I'd like to strace some of the spinning backends when/if we get another
> opportunity to observe the problem to see if by syscall or libfunc name
> we can learn more about what's the cause.

I think the causes are known, and fixed - don't think there's much you
can do besides upgrading, unless you want to backport a number of
complex patches yourself.

FWIW, usually perf gives better answers than strace in this type of
scenario.


> >> Anyway, shared_buffer coherency generally high but does take big dips
> >> that are sometimes sustained for seconds or even minutes.
> >
> > "shared_buffer coherency"?
> 
> As measured by querying pg_stat_database and comparing total reads to read
> hits.  Run frequently, such as once per 5 seconds, and factored into a hit
> percentage.  It may stay up around 100% for several ticks but then go way
> down, which may or may not sustain.
> 
> This is an OLTP app using Rails with hundreds of tables, both trivial
> in structure as well as having partitions, large payloads... TOAST and
> the like.
> 
> TPS can measure in the ~5-10k range.

That's cache hit rate, not coherency ;)
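
(For reference, that hit rate is usually computed from pg_stat_database
along these lines - a sketch; the counters are cumulative, so diff
successive samples to get a rate:)

SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) AS hit_pct
FROM pg_stat_database
ORDER BY blks_hit + blks_read DESC;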

- Andres




Re: [GENERAL] Up to date conventional wisdom re max shared_buffer size?

2017-09-19 Thread Andres Freund
Hi,

On 2017-09-19 17:00:05 -0500, Jerry Sievers wrote:
> Briefly, just curious if legacy max values for shared_buffers have
> scaled up since 8G was like 25% of RAM?

It's very workload dependent. I've successfully used PG with roughly 1TB
of shared buffers, where that performed better than lower
settings.


> Pg 9.3 on monster 2T/192 CPU Xenial thrashing

Not sure what the word "thrashing" in that sentence means.

Things have improved a lot since 9.3 WRT to scalability, so I'd not
infer too much from 9.3 performance on a larger box.


> Upgrade pending but we recently started having $interesting performance
> issues at times looking like I/O slowness and other times apparently
> causing CPU spins.

That's not something we can really usefully comment on given the amount
of information.

> Anyway, shared_buffer coherency generally high but does take big dips
> that are sometimes sustained for seconds or even minutes.

"shared_buffer coherency"?


Greetings,

Andres Freund




Re: [GENERAL] [HACKERS] pgjdbc logical replication client throwing exception

2017-09-15 Thread Andres Freund
On 2017-09-15 20:00:34 +, Vladimir Sitnikov wrote:
> ++pgjdbc dev list.
> 
> >I am facing unusual connection breakdown problem. Here is the simple code
> that I am using to read WAL file:
> 
> Does it always fail?
> Can you create a test case? For instance, if you file a pull request with
> the test, it will get automatically tested across various PG versions, so
> it would be easier to reason about.
> 
> Have you tried "withStatusInterval(20, TimeUnit.SECONDS)" instead of 20
> millis? I don't think it matters much; however, 20ms seems to be overkill.

Also, have you checked the server log?

- Andres




Re: [GENERAL] Postgresql init and cleanup module functions

2017-09-05 Thread Andres Freund
On 2017-09-05 20:15:57 +0300, Yan Pas wrote:
> I'm writing a C postgresql module with some psql functions and global state.
> Does the module API provide any init and cleanup functions? If yes, it would
> be nice to see them in the "35.9. C-Language Functions" help page.

The relevant doc page is at
https://www.postgresql.org/docs/current/static/xfunc-c.html

What you're looking for is _PG_init(). There effectively is no cleanup
logic, as modules cannot be unloaded anymore.
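
A minimal skeleton of a module using it (a sketch):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

void _PG_init(void);

/*
 * Called once when the shared library is loaded; set up global state,
 * register hooks, define custom GUCs, etc. here.
 */
void
_PG_init(void)
{
}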

- Andres




Re: [GENERAL] Unlogged Crash Detection

2017-08-29 Thread Andres Freund
On 2017-08-29 20:19:52 +0900, Michael Paquier wrote:
> On Tue, Aug 29, 2017 at 6:06 PM, Gersner  wrote:
> > I see, interesting.
> 
> Please do not top-post. This is not the recommended way of dealing
> with threads on this mailing list.
> 
> > We have lots of unlogged tables, upon a crash we want to create a
> > feedback/alert that data disappeared.
> >
> > Not very familiar with the internal structure, but is it possible to
> > identify if the current table is the INIT_FORKNUM?
> 
> Using pg_relation_filepath, you can know the path to a relation file
> on disk. So a simple idea would be to use pg_read_binary_file with the
> path of the file and the path of the init fork, which is suffixed with
> "_init", and then a comparison between both. If the data read is the
> same, the relation has been untouched.

Huh, but that's not particularly meaningful, is it? That'll just as well
be the case for a freshly created relation, no?

- Andres




Re: [GENERAL] streaming replication - crash on standby

2017-08-09 Thread Andres Freund
Hi,

Please quote properly on postgres mailing lists.

On 2017-08-09 22:31:23 +, Seong Son (US) wrote:
> I see.  Thank you.  
> 
> But the Postgresql process had crashed at that time so the streaming 
> replication was no longer working.  Why would it crash and is that normal?

You've given us absolutely zero information to be able to diagnose the
problem.  If you want somebody to help you, you'll have to describe
exactly what happened, and what the problem you're facing is.

- Andres

> This email and any files transmitted with it are intended solely for the use 
> of the individual or entity to whom they are addressed. If you have received 
> this email in error please notify the system manager. This message contains 
> information that is intended only for the individual named. If you are not 
> the named addressee you should not disseminate, distribute or copy this 
> e-mail. Please notify the sender immediately by e-mail if you have received 
> this e-mail by mistake and delete this e-mail from your system. If you are 
> not the intended recipient you are notified that disclosing, copying, 
> distributing or taking any action in reliance on the contents of this 
> information is strictly prohibited.

This footer makes no sense on a public list.




Re: [GENERAL] streaming replication - crash on standby

2017-08-09 Thread Andres Freund
Hi,

On 2017-08-09 22:03:43 +, Seong Son (US) wrote:
> The last line from pg_xlogdump of the last WAL file on the crashed standby 
> server shows the following.
> 
> pg_xlogdump: FATAL:  error in WAL record at DF/4CB95FD0: unexpected pageaddr 
> DB/62B96000 in log segment 00DF004C, offset 12148736
> 
> I believe this means the standby server received WAL file out of order?  But 
> why did it crash?  Is crashing normal behavior in case like this?

This likely just means that that's the end of the WAL.

- Andres




Re: [GENERAL] Interesting streaming replication issue

2017-08-09 Thread Andres Freund
Hi,

On 2017-07-27 13:00:17 +1000, James Sewell wrote:
> Hi all,
> 
> I've got two servers (A,B) which are part of a streaming replication pair.
> A is the master, B is a hot standby. I'm sending archived WAL to a
> directory on A, B is reading it via SCP.
> 
> This all works fine normally. I'm on Redhat 7.3, running EDB 9.6.2 (I'm
> currently working to reproduce with standard 9.6)
> 
> We have recently seen a situation where B does not catch up when taken
> offline for maintenance.
> 
> When B is started we see the following in the logs:
> 
> 2017-07-27 11:56:03 AEST [21432]: [990-1] user=,db=,client=
> (0:0)LOG:  restored log file "000C005A00B5" from
> archive
> scp: /archive/xlog//000C005A00B6: No such file or directory
> 2017-07-27 11:56:03 AEST [46191]: [1-1] user=,db=,client=
> (0:0)LOG:  started streaming WAL from primary at 5A/B500 on
> timeline 12
> 2017-07-27 11:56:03 AEST [46191]: [2-1] user=,db=,client=
> (0:XX000)FATAL:  could not receive data from WAL stream: ERROR:
> requested WAL segment 000C005A00B5 has already been
> removed
> 
> scp: /archive/xlog//000D.history: No such file or directory
> scp: /archive/xlog//000C005A00B6: No such file or directory
> 2017-07-27 11:56:04 AEST [46203]: [1-1] user=,db=,client=
> (0:0)LOG:  started streaming WAL from primary at 5A/B500 on
> timeline 12
> 2017-07-27 11:56:04 AEST [46203]: [2-1] user=,db=,client=
> (0:XX000)FATAL:  could not receive data from WAL stream: ERROR:
> requested WAL segment 000C005A00B5 has already been
> removed
> 
> This will loop indefinitely. At this stage the master reports no connected
> standbys in pg_stat_replication, and the standby has no running WAL
> receiver process.
> 
> This can be 'fixed' by running pg_switch_xlog() on the master, at which
> time a connection is seen from the standby and the logs show the following:
> 
> scp: /archive/xlog//000D.history: No such file or directory
> 2017-07-27 12:03:19 AEST [21432]: [1029-1] user=,db=,client=  (0:0)LOG:
>  restored log file "000C005A00B5" from archive
> scp: /archive/xlog//000C005A00B6: No such file or directory
> 2017-07-27 12:03:19 AEST [63141]: [1-1] user=,db=,client=  (0:0)LOG:
>  started streaming WAL from primary at 5A/B500 on timeline 12
> 2017-07-27 12:03:19 AEST [63141]: [2-1] user=,db=,client=  (0:XX000)FATAL:
>  could not receive data from WAL stream: ERROR:  requested WAL segment
> 000C005A00B5 has already been removed
> 
> scp: /archive/xlog//000D.history: No such file or directory
> 2017-07-27 12:03:24 AEST [21432]: [1030-1] user=,db=,client=  (0:0)LOG:
>  restored log file "000C005A00B5" from archive
> 2017-07-27 12:03:24 AEST [21432]: [1031-1] user=,db=,client=  (0:0)LOG:
>  restored log file "000C005A00B6" from archive

FWIW, I don't see a bug here. Archiving on its own doesn't guarantee
that replication progresses in increments smaller than 16MB, unless you
use archive_timeout (or as you do manually switch segments). Streaming
replication doesn't guarantee that WAL is retained unless you use
replication slots - which you don't appear to be. You can make SR retain
more with approximate methods like wal_keep_segments too, but that's not
a guarantee.  From what I can see you're just seeing the combination of
these two limitations, because you don't use the methods to address them
(archive_timeout, replication slots and/or wal_keep_segments).
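
For the slot-based part, a sketch (the slot name is made up):

-- on the primary: WAL is then retained until the attached standby
-- has consumed it
SELECT pg_create_physical_replication_slot('standby_b');

-- on the standby, in recovery.conf:
--   primary_slot_name = 'standby_b'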

Greetings,

Andres Freund




Re: [GENERAL] Multixact members limit exceeded

2017-08-09 Thread Andres Freund
Hi,

On 2017-08-09 16:30:03 -0400, Alvaro Herrera wrote:

> > One particular table before vacuum full:
> > 
> >    relname    | relminmxid | table_size
> > --------------+------------+------------
> >  delayed_jobs | 1554151198 | 21 GB
> > 
> > And after vacuum full:
> > 
> >    relname    | relminmxid | table_size
> > --------------+------------+------------
> >  delayed_jobs | 1554155465 | 6899 MB
> > 
> > Shouldn't the relminmxid have changed after vacuum full, or am I not
> > understanding something?

FWIW, VACUUM FULL isn't a good tool to use here. It's commonly way more
expensive than a normal vacuum (it rewrites the entire table, and
rebuilds indexes).


> But it did change ... the problem is that it didn't change enough (only
> 4000 multixacts).  Maybe your multixact freeze min age is too high?
> Getting rid of 15 GB of bloat is a good side effect, though, I'm sure.

I wonder if there's longrunning transactions preventing cleanup. I
suggest checking pg_stat_activity, pg_prepared_xacts,
pg_replication_slot for older stuff.

Greetings,

Andres Freund




Re: [GENERAL] Multixact members limit exceeded

2017-08-09 Thread Andres Freund
Hi,

On 2017-08-09 10:06:48 +, Peter Hunčár wrote:
> We started feeding it several weeks ago and everything went smoothly until
> we hit this issue:
> 
> 2017-08-09 05:21:50.946 WIB >DETAIL:  This command would create a multixact
> with 2 members, but the remaining space is only enough for 0 members.
> 2017-08-09 05:21:50.946 WIB >HINT:  Execute a database-wide VACUUM in
> database with OID 20101 with reduced vacuum_multixact_freeze_min_age and
> vacuum_multixact_freeze_table_age settings.

Ugh, that's not good.


> I did what the hint proposes, but on such a large database a vacuum takes
> several days.
> We are currently in data loading phase and we are sending only INSERT
> statements, there should be very little UPDATEs or DELETEs.
> Yesterday, the message disappeared shortly, but today it's back (vacuum is
> still running)

I suggest manually vacuuming the tables with the oldest xids first, that
ought to move the limits ahead. A query like
SELECT oid::regclass, relminmxid, mxid_age(relminmxid) FROM pg_class WHERE 
relminmxid <> '0' ORDER BY mxid_age(relminmxid) DESC;
might help you make that determination.


> *Is there a way how to prevent/fix this so we can finish the loading (97%
> done), because the performance went down from 100 ops/sec to 15ops/min.*
> 
> Most tables have around 150 M rows with toast data.
> There are several huge tables with toast data, currently autovacuumed, I
> guess this is the reason for the performance drop:
> 
>  usename  | application_name | state  | backend_xmin | query
> ----------+------------------+--------+--------------+------------------------------------------------------------------
>  postgres |                  | active |   1683428686 | autovacuum: VACUUM pg_toast.pg_toast_14548803 (to prevent wraparound)
>  postgres |                  | active |   1683428693 | autovacuum: VACUUM pg_toast.pg_toast_14548821 (to prevent wraparound)
>  postgres |                  | active |   1683428705 | autovacuum: VACUUM pg_toast.pg_toast_14548828 (to prevent wraparound)
>  postgres |                  | active |   1683428719 | autovacuum: VACUUM pg_toast.pg_toast_14548835 (to prevent wraparound)
>  postgres |                  | active |   1683428732 | autovacuum: VACUUM pg_toast.pg_toast_14549150 (to prevent wraparound)

If these are the most "urgent" tables from the query above, I'd let
autovac finish, otherwise I'd cancel them and manually choose the most
urgent ones.


> After changing the vacuum_cost_limit to 1 for one night, I saw 200MB/s
> of writes the whole night, but I had to change it back to 2000, because the
> insert perf went to 0.

You have a fair amount of autovacuum workers, perhaps that's also a bit
too much?  I'd consider making autovacuum's cleanup duties *less*
aggressive, but trying to keep the rate of cleanup high.  E.g. by
increasing vacuum_freeze_min_age and vacuum_multixact_freeze_min_age, so
only urgently old stuff gets cleaned up.
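
Roughly, in postgresql.conf (the values here are purely illustrative -
tune them to your workload; the defaults are 50M and 5M respectively):

vacuum_freeze_min_age = 200000000
vacuum_multixact_freeze_min_age = 50000000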


Do you know how you come to have a lot of large multixacts?  That's
often indicative of way too many sessions trying to lock the same rows
and such; it might be worthwhile to work on that too, to reduce the
frequency at which new ones are needed.


Greetings,

Andres Freund




Re: [GENERAL] pglogical vs. built-in logical replication in pg-10

2017-06-22 Thread Andres Freund
On 2017-06-22 18:10:40 +0300, Achilleas Mantzios wrote:
> > Once again having pg_largeobject as a system-catalog prevents LOs
> > from working smoothly. Neither replication nor having LOs on a
> > different tablespace (by moving pg_largeobject) works.

> I think logical decoding was designed for supporting DML SQL commands
> (i.e. a finite set of commands) and not specific functions (lo_*)
> which by nature can be arbitrary, infinite and version specific.

That's not really the reason. The first reason it's currently unsupported
is that LOs are stored in a system catalog, and currently all system
catalogs are excluded from the change stream.  The second problem is how
exactly to represent the changes - we can't represent it as the whole LO
being changed, as that'd increase the volume of WAL and replicated
writes dramatically.  Thus we need to invent an API that can represent
creation, deletion, and writes to arbitrary offsets, for output plugins.


> > I wish PG in some future version will address these quirks so one can 
> > operate on LOs more smoothly.

You're welcome to help...


Greetings,

Andres Freund




Re: [GENERAL] pglogical vs. built-in logical replication in pg-10

2017-06-22 Thread Andres Freund
On 2017-06-22 12:43:02 +0300, Achilleas Mantzios wrote:
> On 22/06/2017 11:21, Andreas Joseph Krogh wrote:
> > Hi.
> > 1. Why should one prefer built-in logical replication in pg-10 to 
> > pglogical, does it do anything pglogical doesn't?
> > It seems pglogical is more feature-rich...
> > 2. As I understand built-in logical replication in pg-10 doesn't support
> > large-objects, which we use a lot. Does pglogical replicate large
> > objects? I cannot find any notes about large-objects under "Limitations
> > and Restrictions":
> > https://www.2ndquadrant.com/en/resources/pglogical/pglogical-docs/

> You may do a simple test, create a table with a largeobject and try to read
> the logical stream, if it cannot represent the lo_import, lo_open, lowrite,
> lo_close (and I'd bet they can't be encoded) then neither pglogical (being
> based on the same logical decoding technology) will support them.

There's nothing fundamental preventing us from supporting large objects, but
indeed logical decoding at the moment doesn't support it.  The technical
bits of extracting changes should actually be easy, it's a bit more
difficult to decide how to represent it to output plugins...

Greetings,

Andres Freund




Re: [GENERAL] effective_io_concurrency increasing

2017-06-19 Thread Andres Freund
On 2017-06-19 15:21:20 -0700, Peter Geoghegan wrote:
> On Mon, Jun 19, 2017 at 8:36 AM, Jeff Janes  wrote:
> > Unfortunately, it is only implemented in very narrow circumstances.  You
> > have to be doing bitmap index scans of many widely scattered rows to make it
> > useful.  I don't think that this is all that common of a situation.  The
> > problem is that at every point in the scan, it has to be possible to know
> > what data block it is going to want N iterations in the future, so you can
> > inform the kernel to pre-fetch it.  That is only easy to know for bitmap
> > scans.
> 
> I think that you could prefetch in index scans by using the
> pointers/downlinks in the immediate parent page of the leaf page that
> the index scan currently pins. The sibling pointer in the leaf itself
> is no good for this, because there is only one block to prefetch
> available at a time.
> 
> I think that this is the way index scan prefetch is normally
> implemented. Index scans will on average have a much more random
> access pattern than what is typical for bitmap heap scans, making this
> optimization more compelling, so hopefully someone will get around to
> this.

I think for index based merge and nestloop joins, it'd be hugely
beneficial to do prefetching on the index, but more importantly on the
heap level.  Not entirely trivial to do however.

- Andres




Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables

2017-05-10 Thread Andres Freund
Hi Tom, Mathieu,

On 2017-05-10 17:02:11 -0400, Tom Lane wrote:
> Mathieu Fenniak  writes:
> > Andres, it seems like the problem is independent of having large data
> > manipulations mixed with schema changes.  The below test case demonstrates
> > it with just schema changes.
> 
> > Tom, I've tested your patch, and it seems to have a positive impact for
> > sure.  I've documented a test case to reproduce this issue (below), and
> > your avoid-search-in-catcache-invalidate.patch reduces the test case time
> > from 63 seconds per run to 27 seconds per run.
> 
> Thanks for the test case.

Indeed!


> +  59.28%  59.05%  150349  postmaster  postgres  [.] hash_seq_search
> +   8.40%   8.36%   21250  postmaster  postgres  [.] CallSyscacheCallbacks
> +   6.37%   6.34%   16115  postmaster  postgres  [.] LocalExecuteInvalidationMessage
> +   5.69%   5.67%   14418  postmaster  postgres  [.] CatCacheInvalidate
> +   3.14%   3.12%    7942  postmaster  postgres  [.] SysCacheInvalidate
> +   1.72%   1.71%    4354  postmaster  postgres  [.] ReorderBufferCommit
> +   1.37%   1.33%    3512  postmaster  postgres  [.] hash_search_with_hash_value
> +   1.15%   1.15%    2913  postmaster  postgres  [.] InvalidateCatalogSnapshot
> 
> I looked at the hash_seq_search time a bit more, and realized that
> actually there's a pretty easy fix for that, which is to reduce
> the initial size of RelfilenodeMapHash from 1024 entries to 64.

Heh, that's not a bad idea. Given the table resizes automatically, there
seems little reason not to go there.

> You could quibble about where to set that exactly, but 1024 is
> just way too many --- in your example, there are never more than
> 5 entries in the hash, despite the presence of 1 tables in
> the database.

That's likely because there's no DML.


> We're at a point of diminishing returns here; I think any further
> improvement would require reducing the number of invalidation calls,
> as Andres was speculating about upthread.  Still, this shows that
> it doesn't take very much work to get a 10X improvement in the
> overhead associated with inval activities.  We've never seen this
> overhead stick out quite this much before, and maybe logical replication
> will always be an atypical workload, but I think this may be worth
> committing even if Andres does managed to cut the number of calls.

I've seen these callsites prominently in profiles not using logical
decoding.  Never quite as big as here, but still.  Temp table heavy
workloads IIRC are one of the easy ways to trigger it.


> It would be interesting to see how much these patches help for your real
> use-case, as opposed to this toy example.  Assuming that the results are
> positive, I would advocate for back-patching these changes as far as 9.4
> where logical decoding came in.

+1.


> BTW, I also noticed that we're getting scarily close to exceeding
> MAX_SYSCACHE_CALLBACKS.  There are 30-some calls to
> CacheRegisterSyscacheCallback in our code, and while I think not all of
> them can be reached in a single process, we demonstrably get as high as 21
> registered callbacks in some regression test runs.  That's not leaving a
> lot of daylight for add-on modules.  The second patch attached includes
> increasing MAX_SYSCACHE_CALLBACKS from 32 to 64.  I think we'd be well
> advised to apply and back-patch that, even if we don't use the rest of
> the patch.

+1 - I'd advocate for doing so all the way.

Greetings,

Andres Freund




Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables

2017-05-05 Thread Andres Freund
On 2017-05-05 21:32:27 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-05-05 14:24:07 -0600, Mathieu Fenniak wrote:
> >> It appears that most of the time is spent in the
> >> RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
> >> invalidation callbacks, both of which appear to be invalidating caches
> >> based upon the cache value.
> 
> > I think optimizing those has some value (and I see Tom is looking at
> > that aspect), but the bigger thing would probably be to do fewer lookups.
> 
> I'm confused --- the lookup side of things is down in the noise in
> Mathieu's trace.

Err, sorry. Completely mangled that sentence.  Executing fewer
invalidations.  We currently are likely re-executing the same set of
invalidations constantly in Mathieu's case.

Background: When decoding a transaction during logical decoding we're
currently re-executing *all* a transaction's own cache invalidations, if
it has any, at every new command-id observed in the WAL stream.  That's
because currently invalidations are only sent at commit, so we don't
know from "when" they are.  But I think there's some very low-hanging
fruits reducing the frequency at which those are executed.

In many cases where there's just a few schema changes in a transaction,
this doesn't hurt badly.  But if you have a transaction that does a
bootload of schema changes *and* a has a lot of other changes, it gets
expensive.

Mathieu: The above also indicates a possible workaround: you can try
separating larger amounts of data manipulation from schema changes,
into separate transactions.
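
I.e., roughly this shape (a hypothetical example):

-- DDL in its own short transaction ...
BEGIN;
ALTER TABLE orders ADD COLUMN note text;
COMMIT;

-- ... and the bulk data change in a separate one, so the big
-- transaction carries no invalidations that decoding would have to
-- re-execute at every command id.
BEGIN;
UPDATE orders SET note = '';
COMMIT;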

Greetings,

Andres Freund




Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables

2017-05-05 Thread Andres Freund
Hi,

On 2017-05-05 20:59:09 -0400, Tom Lane wrote:
> Hmm ... as for RelfilenodeMapInvalidateCallback, the lack of calls to
> hash_search() from it in your trace says that it usually isn't doing
> anything useful.  All the time is being spent in hash_seq_search,
> uselessly iterating over the hashtable.  I'm inclined to think that
> we need a smarter data structure there, maybe an independent hashtable
> tracking the reverse map from relation OID to filenode map entry.

Yea, that might be worthwhile.  Let me try to address the issue that we
do way too much invalidation, then we can check whether this is still
exercised hotly.  On the other hand, it's still a dumb invalidation
approach, so if somebody feels like working on this...


> As for CatalogCacheIdInvalidate, I wonder how many of those cycles
> are doing something useful, and how many are being wasted in the outer
> loop that just iterates over the cache list.  We could trivially get
> rid of that outer search by using syscache.c's array, as in the
> attached patch.  It'd be interesting to see if this patch helps your
> scenario #1.  (Patch is against HEAD but seems to apply cleanly to 9.5)

I've seen this be a significant fraction of CPU time completely
independent of logical decoding, so I'd guess this is worthwhile
independently.  Not sure what a good benchmark for this would be.

Greetings,

Andres Freund




Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables

2017-05-05 Thread Andres Freund
Hi,

On 2017-05-05 14:24:07 -0600, Mathieu Fenniak wrote:
> The stalls occur unpredictably on my production system, but generally seem
> to be correlated with schema operations.  My source database has about
> 100,000 tables; it's a one-schema-per-tenant multi-tenant SaaS system.

I'm unfortunately not entirely surprised you're seeing some issues in
that case.  We're invalidating internal caches a bit
overjudiciously, and that invalidation is triggered by schema changes.


> I've performed a CPU sampling with the OSX `sample` tool based upon
> reproduction approach #1:
> https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8  It
> appears that most of the time is spent in the
> RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
> invalidation callbacks, both of which appear to be invalidating caches
> based upon the cache value.

I think optimizing those has some value (and I see Tom is looking at
that aspect), but the bigger thing would probably be to do fewer lookups.


> Has anyone else run into this kind of performance problem?  Any thoughts on
> how it might be resolved?  I don't mind putting in the work if someone
> could describe what is happening here, and have a discussion with me about
> what kind of changes might be necessary to improve the performance.

If you could provide an easily runnable sql script that reproduces the
issue, I'll have a look.  I think I have a rough idea what to do.


Greetings,

Andres Freund




Re: [GENERAL] [PERFORM] Questionaire: Common WAL write rates on busy servers.

2017-04-27 Thread Andres Freund
On 2017-04-28 01:29:14 +0200, Tomas Vondra wrote:
> I can confirm this observation. I bought the Intel 750 NVMe SSD last year,
> the device has 1GB DDR3 cache on it (power-loss protected), can do ~1GB/s of
> sustained O_DIRECT sequential writes. But when running pgbench, I can't push
> more than ~300MB/s of WAL to it, no matter what I do because of
> WALWriteLock.

Hm, interesting.  Even if you up wal_buffers to 128MB, use
synchronous_commit = off, and play with wal_writer_delay/flush_after?
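
I.e., roughly this in postgresql.conf (the wal_writer values here are
illustrative, not a recommendation):

wal_buffers = 128MB
synchronous_commit = off
wal_writer_delay = 10ms        # default 200ms
wal_writer_flush_after = 1MB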

- Andres




Re: [GENERAL] [PERFORM] Questionaire: Common WAL write rates on busy servers.

2017-04-27 Thread Andres Freund
On 2017-04-27 10:29:48 -0700, Joshua D. Drake wrote:
> On 04/27/2017 09:34 AM, Andres Freund wrote:
> > On 2017-04-27 09:31:34 -0700, Joshua D. Drake wrote:
> > > On 04/27/2017 08:59 AM, Andres Freund wrote:
> > > 
> 
> > > I would agree it isn't yet a widespread issue.
> > 
> > I'm not yet sure about that actually.  I suspect a large percentage of
> > people with such workloads aren't lingering lots on the lists.
> 
> That would probably be true. I was thinking of it more as the "most new
> users are in the cloud" and the "cloud" is going to be rare that a cloud
> user is going to be able to hit that level of writes. (at least not without
> spending LOTS of money)

You can get pretty decent NVMe SSD drives on several cloud providers
these days, without immediately bankrupting you.  Sure, it's instance
storage, but with a decent replication and archival setup, that's not
necessarily an issue.

It's not that hard to get to the point where postgres can't keep up with
storage, at least for some workloads.

- Andres




Re: [GENERAL] [PERFORM] Questionaire: Common WAL write rates on busy servers.

2017-04-27 Thread Andres Freund
On 2017-04-27 09:31:34 -0700, Joshua D. Drake wrote:
> On 04/27/2017 08:59 AM, Andres Freund wrote:
> 
> > 
> > Ok, based on the (few) answers I've got so far, my experience is indeed
> > skewed.  A number of the PG users I interacted with over the last couple
> > years had WAL write rates somewhere in the range of 500MB/s to 2.2GB/s
> > (max I've seen).  At that point WAL insertion became a major bottleneck,
> > even if storage was more than fast enough to keep up.  To address these
> > we'd need some changes, but the feedback so far suggests that it's not
> > yet a widespread issue...
> 
> I would agree it isn't yet a widespread issue.

I'm not yet sure about that actually.  I suspect a large percentage of
people with such workloads aren't lingering lots on the lists.


> The only people that are likely going to see this are going to be on bare
> metal. We should definitely plan on that issue for say 11.

"plan on that issue" - heh. We're talking about major engineering
projects here ;)


> I do have a question though, where you have seen this issue is it with
> synchronous_commit on or off?

Both. Whether that matters or not really depends on the workload. If you
have bulk writes, it doesn't really matter much.

- Andres




Re: [GENERAL] [PERFORM] Questionaire: Common WAL write rates on busy servers.

2017-04-27 Thread Andres Freund
Hi,

On 2017-04-24 21:17:43 -0700, Andres Freund wrote:
> I've lately seen more and more installations where the generation of
> write-ahead-log (WAL) is one of the primary bottlenecks.  I'm curious
> whether that's primarily a "sampling error" of mine, or whether that's
> indeed more common.
> 
> The primary reason I'm curious is that I'm pondering a few potential
> optimizations, and would like to have some guidance which are more and
> which are less important.
> 
> Questions (answer as many you can comfortably answer):
> - How many MB/s, segments/s do you see on busier servers?
> - What generates the bulk of WAL on your servers (9.5+ can use
>   pg_xlogdump --stats to compute that)?
> - Are you seeing WAL writes being a bottleneck?
> - What kind of backup methods are you using and is the WAL volume a
>   problem?
> - What kind of replication are you using and is the WAL volume a
>   problem?
> - What are your settings for wal_compression, max_wal_size (9.5+) /
>   checkpoint_segments (< 9.5), checkpoint_timeout and wal_buffers?
> - Could you quickly describe your workload?

Ok, based on the (few) answers I've got so far, my experience is indeed
skewed.  A number of the PG users I interacted with over the last couple
years had WAL write rates somewhere in the range of 500MB/s to 2.2GB/s
(max I've seen).  At that point WAL insertion became a major bottleneck,
even if storage was more than fast enough to keep up.  To address these
we'd need some changes, but the feedback so far suggests that it's not
yet a widespread issue...

- Andres




[GENERAL] Questionaire: Common WAL write rates on busy servers.

2017-04-24 Thread Andres Freund
Hi,

I've lately seen more and more installations where the generation of
write-ahead-log (WAL) is one of the primary bottlenecks.  I'm curious
whether that's primarily a "sampling error" of mine, or whether that's
indeed more common.

The primary reason I'm curious is that I'm pondering a few potential
optimizations, and would like to have some guidance which are more and
which are less important.

Questions (answer as many you can comfortably answer):
- How many MB/s, segments/s do you see on busier servers?
- What generates the bulk of WAL on your servers (9.5+ can use
  pg_xlogdump --stats to compute that)?
- Are you seeing WAL writes being a bottleneck?
- What kind of backup methods are you using and is the WAL volume a
  problem?
- What kind of replication are you using and is the WAL volume a
  problem?
- What are your settings for wal_compression, max_wal_size (9.5+) /
  checkpoint_segments (< 9.5), checkpoint_timeout and wal_buffers?
- Could you quickly describe your workload?
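
To put a number on the first question, one rough approach is to sample
the WAL insert position twice and diff it. A minimal psql sketch,
assuming a 9.4/9.5 server (the functions were renamed to
pg_current_wal_* in 10):

SELECT pg_current_xlog_insert_location() AS loc \gset
SELECT pg_sleep(10);
SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), :'loc')
       / 10 / 1024 / 1024 AS avg_mb_per_sec;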

Feel free to add any information you think is pertinent ;)

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Protocol 2 and query parameters support

2017-04-24 Thread Andres Freund
On 2017-04-24 16:07:01 -0400, Rader, David wrote:
> As Tom mentioned, it sounds like the issue is that Presto expects to only
> use simple query, not extended query (no server-side prepared statements).
> The JDBC driver supports setting the prepare threshold to 0 to disable
> using server-side prepares.
> 
> https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/PGStatement.html#setPrepareThreshold-int-
> 
> "If threshold is zero, server-side prepare will not be used."

Note that you can use the extended protocol without server-side
prepares...


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Protocol 2 and query parameters support

2017-04-23 Thread Andres Freund
On 2017-04-23 12:08:51 -0700, Konstantin Izmailov wrote:
> Does Postgres protocol 2 support query parameters? e.g.:
>   SELECT colA FROM tbl WHERE colB=$1
> 
> Some systems (Presto) are still using protocol 2, and I need to understand
> the scope of changes in the middleware to support protocol 2.

Could you provide a reference for Presto using the v2 protocol?  A quick
search didn't turn anything up.  Presto seems a bit too new to rely on
v2, given how long ago v3 was introduced.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] options for logical replication plugins?

2017-03-15 Thread Andres Freund
Hi,


On 2017-03-15 18:29:06 +, Chris Withers wrote:
> Shame the decoding has to be done on the server-side rather than the client
> side.

Why?  You can't filter on the client side. You don't have any catalog
information available, so you'd have to transport a lot of metadata
and/or decode to a verbose default format.

- Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Potential Bug: Frequent Unnecessary Degeneration

2017-02-15 Thread Andres Freund
Hi,

On 2017-02-15 20:00:11 -0330, David O'Mahony wrote:
> We're running two nodes with replication enabled.
> 
> pgpool routinely (every day) performs a failover with the following
> statements appearing in the log:

This list is about bugs in postgresql.org-maintained projects; pgpool
isn't one of those.  Check 
http://www.pgpool.net/mediawiki/index.php/Main_Page#Contacts

- Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5 (EDB binary).

2017-02-03 Thread Andres Freund
On 2017-02-03 22:17:55 +0300, Nikolai Zhubr wrote:
> 03.02.2017 20:29, Andres Freund:
> [...]
> > > > Could you use process monitor or such to see what the process is doing
> > > > while using a lot of CPU?
> > > 
> > > I'm not sure how to do this, especially considering that the process in
> > > question is running as a service?
> > 
> > I don't think that stops you, if you have admin privileges.
> 
> Well, profiling postgres.exe is still beyond my capability at this time
> anyway. Instead, I'll try to prepare a most simple client application
> example for testing the behaviour in question so that anyone could run it
> easily. (And while working on such an example, maybe I'll get some more
> understanding of what is actually going on here.)

Am I missing something? All you need to do is start
processexplorer.exe and filter out other applications.

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5 (EDB binary).

2017-02-03 Thread Andres Freund
On 2017-02-01 01:02:11 +0300, Nikolai Zhubr wrote:
> 31.01.2017 19:51, Andres Freund:
> [...]
> > > Exactly same trouble with 9.4 (Specifically EDB 9.4.10-1 win32)
> > 
> > That's good to know, less because of 519b0757, more because of the latch
> > changes - but they also went in in 9.5...
> > 
> > > Would it make sense to check some even older ones?
> > 
> > Could you use process monitor or such to see what the process is doing
> > while using a lot of CPU?
> 
> I'm not sure how to do this, especially considering that the process in
> question is running as a service?

I don't think that stops you, if you have admin privileges.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Causeless CPU load waves in backend, on windows, 9.5.5 (EDB binary).

2017-01-31 Thread Andres Freund
On 2017-01-31 11:45:33 +0300, Nikolai Zhubr wrote:
> 31.01.2017 10:37, I wrote:
> > [...]
> > > > 1. "select localtimestamp" 40 times (As separate requests, one by
> > > > one, but
> > > > no delay inserted in between)
> > > > 2. wait 1/2 second.
> > > > 3. goto 1
> > > 
> > > Craig, could this be a side-effect of 519b0757? That's new in 9.5, and
> > > that's directly related to the code paths discussed here.
> > 
> > Meanwhile I'll redo my tests using 9.4 instead of 9.5 and report back.
> 
> Exactly same trouble with 9.4 (Specifically EDB 9.4.10-1 win32)

That's good to know, less because of 519b0757, more because of the latch
changes - but they also went in in 9.5...

> Would it make sense to check some even older ones?

Could you use process monitor or such to see what the process is doing
while using a lot of CPU?

Regards,

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] logical decoding output plugin

2016-12-10 Thread Andres Freund
Hi,

On 2016-12-10 22:34:02 +0100, Torsten Förtsch wrote:
> I am working on a logical decoding output plugin. I am only interested in
> the changes in one particular table.
> 
> One way to check for the table would be to compare class_form->relname
> and get_namespace_name(class_form->relnamespace). But I'd much prefer to
> just compare the OID of the table.
> 
> Is there a cheap way to do something like "SELECT
> 'schema.table'::regclass::oid" that can be called in the startup callback?

You shouldn't do it in the startup callback (no catalog access there),
but you can do it the first time through the change callback.

The most complete way to look up the oid is RangeVarGetRelid(). There
are other variants, but that's the easiest approach.

Do you have to care about the table being renamed?

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Is is safe to use SPI in multiple threads?

2016-12-09 Thread Andres Freund
On 2016-12-09 16:52:05 +0800, Qiu Xiafei wrote:
> 1. Is there a way to use SPI in multi-thread style?

No.

> 2. Another option is to use libpq, like normal clients do. Is libpq as
> efficient as SPI?

No.

- Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] hot_standby_feedback

2016-11-28 Thread Andres Freund
On 2016-11-28 22:14:55 +0100, Torsten Förtsch wrote:
> Hi,
> 
> I am in the process of reviewing our configs for a number of 9.3 databases
> and found a replica with hot_standby_feedback=on. I remember when we set it
> long ago we were fighting cancelled queries. I also remember that it never
> really worked for us. In the end we set up 2 replicas, one suitable for
> short queries where we prefer low replication lag, and another one where we
> allow for long running queries but sacrifice timeliness
> (max_standby_*_delay=-1).

There are a few kinds of conflicts against which hs_feedback doesn't
protect, e.g. exclusive locks on tables that are in use (say, by vacuum
truncating a table, or an explicit DROP TABLE).

There's a view with some information about the causes of cancellations,
pg_stat_database_conflicts - did you check that?
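
E.g. something like this, run on the standby, shows the per-cause
counters (just reading the standard view):

SELECT datname, confl_tablespace, confl_lock, confl_snapshot,
       confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts
WHERE datname = current_database();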

> I have a hunch why hot_standby_feedback=on didn't work. But I never
> verified it. So, here it is. The key is this sentence:
> 
> "Feedback messages will not be sent more frequently than once per
> wal_receiver_status_interval."
> 
> That interval is 10 sec. So, assuming a transaction on the replica uses a
> row right after the message has been sent. Then there is a 10 sec window in
> which the master cannot know that the row is needed on the replica and can
> vacuum it. If then the transaction on the replica takes longer than
> max_standby_*_delay, the only option is to cancel it.
> 
> Is that explanation correct?

No. That just means that we don't update the value more frequently. The
value reported is a "horizon" meaning that nothing older than the
reported value can be accessed.
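
On 9.4+ you can also see the horizon a standby most recently reported by
querying the master; a sketch (the 9.3 servers discussed here don't
expose this column yet):

SELECT application_name, backend_xmin FROM pg_stat_replication;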

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] out-of-order XID insertion in KnownAssignedXids

2016-10-20 Thread Andres Freund
On 2016-10-20 22:37:15 +0900, Michael Paquier wrote:
> On Thu, Oct 20, 2016 at 10:21 PM,   wrote:
> > - remove a file called backup_label, but I am not certain that this file is
> > in fact there (any more).
> 
> It is never a good idea to remove it when you are trying to restore from a
> backup; backup_label contains critical information when restoring, so you
> may end up with a corrupted data folder.

And this actually seems like a likely source of these errors.  Removing
a backup label unfortunately causes hard-to-diagnose errors, because
everything appears to be ok as long as there are no checkpoints while
taking the base backup (or when the control file was copied early
enough). But as soon as a second checkpoint happens before the control
file is copied...

Fredrik, how did you end up removing the label?

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] out-of-order XID insertion in KnownAssignedXids

2016-10-18 Thread Andres Freund
Hi,

On 2016-10-18 14:57:52 +0200, fred...@huitfeldt.com wrote:
> we are running many postgresql master/slave setups. The slaves are
> initialised from a pg_basebackup from the master and are sync
> streaming from the master. When we determine the master has failed,
> the slave is promoted. Some time after that, the old master is again
> initialised with a pg_basebackup and starts streaming from the new
> master.

Could you describe in a bit more detail how exactly you're setting up
the standbys? E.g. the exact recovery.conf used, and whether you remove
any files when starting a standby. Also, how exactly are you promoting
standbys?

> Recently, we have gotten this error a fair amount of times: "out-of-order XID 
> insertion in KnownAssignedXids" when postgresql attempts to start after being 
> initialised with a pg_basebackup from the current master.


Which version are you encountering this on precisely?


> Once the issue has occurred, a subsequent re-initialisation (with a 
> completely new pg_basebackup) does not resolve the issue.

How have you recovered from this so far?


> I have a setup in the failing state, so I can produce any kind of log mesages 
> / details that would be helpful.

Could you use pg_xlogdump to dump the WAL file on which replay failed?
And then attach the output, compressed?

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] LOG: munmap(0x7fff80000000) failed: Invalid argument

2016-10-10 Thread Andres Freund
On 2016-10-10 18:21:48 -0400, Tom Lane wrote:
> Chris Richards  writes:
> > Setting up postgresql-9.5 (9.5.4-1.pgdg14.04+2) ...
> > Creating new cluster 9.5/main ...
> >   config /etc/postgresql/9.5/main
> >   data   /var/lib/postgresql/9.5/main
> >   locale en_US.UTF-8
> > LOG:  munmap(0x7fff8000) failed: Invalid argument
> > [... snip 14 or so repeats ...]
> > LOG:  munmap(0x7fff8000) failed: Invalid argument
> >   socket /var/run/postgresql
> >   port   5433
> > update-alternatives: using
> > /usr/share/postgresql/9.5/man/man1/postmaster.1.gz to provide
> > /usr/share/man/man1/postmaster.1.gz (postmaster.1.gz) in auto mode
> >  * Starting PostgreSQL 9.5 database server
> >   [ OK ]
> > Processing triggers for libc-bin (2.19-0ubuntu6.6) ...
> 
> > I'm able to connect and I dumped a few default relations.
> 
> > Is the munmap error of concern? It remains upon rebooting / restarting the
> > server.
> 
> Seems pretty fishy to me; I don't know what would be causing it.
> 
> [ digs in code... ]  One theory is that PGSharedMemoryDetach is getting
> called more than once, but I'm not sure how that would happen.  Can you
> characterize where this happens more precisely?  What nondefault settings
> have you got in postgresql.conf?

Hm. Could that be from the DSM code?

- Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Lock contention in TransactionIdIsInProgress()

2016-10-08 Thread Andres Freund
On 2016-10-07 08:36:12 -0500, Merlin Moncure wrote:
> Won't happen. Only bugs get ported back

Hopefully we don't port bugs back all that often.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Nonblocking libpq + openssl = ?

2016-09-16 Thread Andres Freund
On 2016-09-17 03:12:53 +0300, Nikolai Zhubr wrote:
> 17.09.2016 2:05, Andres Freund:
> [...]
> > Well, it's not pretty. I quite dislike this bit, and I've complained
> > about it before.  But it is noteworthy that it's nearly impossible to
> > hit these days, due to ssl-renegotiation support having been ripped out.
> > That's what could trigger openssl to require writes upon reads.
> 
> It looks like it _usually_ works out such that these interdependent reads and
> writes are unnecessary in the absence of renegotiations. But still, [1] instructs
> to always check for both SSL_ERROR_WANT_READ and SSL_ERROR_WANT_WRITE in all
> cases. Supposedly that is for a reason. The way it is implemented in
> fe-secure-openssl.c just looks somewhat unfinished.
> I'm wondering: is there really something that prevents doing it properly?

The relevant user-level API of libpq (PQisBusy) doesn't have a way to
return "waiting for write". So we'd have to break API compatibility.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Nonblocking libpq + openssl = ?

2016-09-16 Thread Andres Freund
Hi,

> So going PQconsumeInput()->pqReadData()->pqsecure_read()->pgtls_read() in a
> supposedly non-blocking operation we finally come to a tight busy-loop
> waiting for SSL_ERROR_WANT_WRITE to go down! How could such a thing ever be,
> 
> - with not even a sleep(1),
> - no timeout,
> - no diagnostics of any sort,
> - a comment implying that getting stuck in a (potentially) infinite
> sleepless loop deep inside a library is OK.

> And looking more into this pgtls_read() function, it seems it just has an
> inadequate interface, so that it really has no way to reliably indicate some
> important details to its caller, namely the need to wait for
> write-readiness. It's as if ssl support were a quick-n-dirty hack rather
> than a consistently integrated feature. Or am I reading it all wrong?
> Any thoughts?

Well, it's not pretty. I quite dislike this bit, and I've complained
about it before.  But it is noteworthy that it's nearly impossible to
hit these days, due to ssl-renegotiation support having been ripped out.
That's what could trigger openssl to require writes upon reads.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] What limits Postgres performance when the whole database lives in cache?

2016-09-02 Thread Andres Freund
On 2016-09-02 11:10:35 -0600, Scott Marlowe wrote:
> On Fri, Sep 2, 2016 at 4:49 AM, dandl  wrote:
> > Re this talk given by Michael Stonebraker:
> >
> > http://slideshot.epfl.ch/play/suri_stonebraker
> >
> >
> >
> > He makes the claim that in a modern ‘big iron’ RDBMS such as Oracle, DB2, MS
> > SQL Server, Postgres, given enough memory that the entire database lives in
> > cache, the server will spend 96% of its memory cycles on unproductive
> > overhead. This includes buffer management, locking, latching (thread/CPU
> > conflicts) and recovery (including log file reads and writes).

I think those numbers are overblown, and more PR than reality.

But there certainly are some things that can be made more efficient if
you don't care about durability and replication.


> > I wondered if there are any figures or measurements on Postgres performance
> > in this ‘enough memory’ environment to support or contest this point of
> > view?

I don't think that's really answerable without individual use-cases in
mind.  Answering that question for analytics, operational, ... workloads
is going to look different, and the overheads are elsewhere.

I personally think that each implementation's restrictions are more
likely to be an issue than anything "fundamental".


> What limits postgresql when everything fits in memory? The fact that
> it's designed to survive a power outage and not lose all your data.
> 
> Stonebraker's new stuff is cool, but it is NOT designed to survive
> total power failure.
> 
> Two totally different design concepts. It's apples and oranges to compare 
> them.

I don't think they're that fundamentally different.


Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] what change in postgres 9.5 improvements for multi-CPU machines

2016-08-16 Thread Andres Freund
On 2016-03-25 18:11:21 +0800, 657985...@qq.com wrote:
> hello everyone:
>   I was bothered by the postgres spinlock for a long time. How should I 
> understand this sentence: "postgres 9.5 performance improvements for multi-CPU 
> machines"?
> At present my database is 9.3.5. Can upgrading to 9.5 solve the spinlock 
> problem?

It's quite possible that the upgrade helps, but without additional data
it's hard to say. The change is that postgres' internal reader/writer
locks now, in many cases, avoid the use of a spinlock, relying on atomic
operations instead.

Regards,

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Confusing with commit time usage in logical decoding

2016-03-02 Thread Andres Freund
Hi,

On 2016-02-29 11:12:14 +0100, Weiping Qu wrote:
> If you received this message twice, sorry for annoying since I did not
> subscribe successfully previously due to conflicting email domain.
> 
> Dear postgresql general mailing list,
> 
> I am currently using the logical decoding feature (version 9.6 I think as
> far as I found in the source, wal_level: logical, max_replication_slot: > 1,
> track_commit_timestamp: on, I am not sure whether this will help or not).
> Following the online documentation, everything works fine until I input
> 
> SELECT * FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL,
> 'include-timestamp', 'on');
> 
> 
> I always got 1999-12-31 16:00 as the commit time for arbitrary transactions
> with DML statements.
> After several tries, I realized that the txn->commit_time returned was always
> 0.
> Could you help me by indicating me what could be wrong in my case? Any
> missing parameters set?

That was a bug introduced recently (9.5).  The issue was discussed in
http://archives.postgresql.org/message-id/56D42918.1010108%40postgrespro.ru
, and a fix has now been pushed.

Thanks for the report!

Regards,

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] 2x Time difference between first and subsequent run of the same query on fresh established connection (on linux, with perf data included, all query data in the shared buffers) on postgre

2016-02-25 Thread Andres Freund
Hi,

On 2016-02-25 13:50:11 +1100, Maxim Boguk wrote:
> The first run (something fishy with kernel calls):
>   19,60%  postgres  [kernel.kallsyms]  [k] filemap_map_pages
>   15,86%  postgres  postgres   [.] hash_search_with_hash_value
>8,20%  postgres  postgres   [.] heap_hot_search_buffer
>8,20%  postgres  postgres   [.] heap_page_prune_opt
>5,72%  postgres  postgres   [.] PinBuffer
>4,38%  postgres  [kernel.kallsyms]  [k] page_fault
>4,04%  postgres  [kernel.kallsyms]  [k] page_waitqueue
>3,55%  postgres  [kernel.kallsyms]  [k] __wake_up_bit
>2,95%  postgres  postgres   [.] LWLockAcquire
>2,31%  postgres  [kernel.kallsyms]  [k] unlock_page
>1,96%  postgres  [vdso] [.] __vdso_gettimeofday
>1,83%  postgres  [kernel.kallsyms]  [k] radix_tree_next_chunk
>1,77%  postgres  [kernel.kallsyms]  [k] page_add_file_rmap
>1,66%  postgres  postgres   [.] _bt_checkkeys
>1,27%  postgres  postgres   [.] LWLockRelease

Looks like page faults are part of the problem here. Upgrading to 9.4
and using huge pages unfortunately is the best suggestion I have.

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] A motion

2016-01-23 Thread Andres Freund
On 2016-01-23 15:31:02 -0800, Joshua D. Drake wrote:
> With respect Adrian, that is a motion that never stands a chance. If you
> don't want to read it, set up a filter that sends it right to the round
> file.

It'd help if there weren't six threads, but just one...


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Shared system resources

2016-01-04 Thread Andres Freund
On 2015-12-23 11:55:36 -0500, George Neuner wrote:
> With sufficient privileges, a debugger-like process can attach and
> examine the memory of a running - or just terminated - process, but it
> won't have access to discarded (unmapped) memory.

You just have to load a kernel module to do so - it's just a few dozen
lines of C.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Old source code needed

2015-11-27 Thread Andres Freund
On 2015-11-27 10:31:03 +0100, NTPT wrote:
> Thanx for the help, I grabbed the source code that matches the old cluster fs
> backup.
> 
> However: should it run fine compiled with a recent gcc 4.9.3?
> 
> While compiled with this gcc, I got a lot of strange errors like
> 
> ERROR:  could not identify an ordering operator for type name at character 
> 3336
> HINT:  Use an explicit ordering operator or modify the query.
> 
> with \dt or \list or other commands and selects.
> 
> However, compiling it with gcc 3.4.6, everything works.
> 
> Is it intended (expected) behavior or a compiler bug (being on Gentoo, a
> compiler bug scares me a lot)?

This has been fixed later in the 9.0 branch - but as you want to
check out a specific tag, that's not going to help you... What are you
actually trying to do?

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] full_page_writes on SSD?

2015-11-24 Thread Andres Freund
On 2015-11-24 13:09:58 -0600, Kevin Grittner wrote:
> On Tue, Nov 24, 2015 at 12:48 PM, Marcin Mańk  wrote:
> 
> > if SSDs have 4kB/8kB sectors, and we'd make the Postgres page
> > size equal to the SSD page size, do we still need full_page_writes?
> 
> If an OS write of the PostgreSQL page size has no chance of being
> partially persisted (a/k/a torn), I don't think full page writes
> are needed.  That seems likely to be true if pg page size matches
> SSD sector size.

At the very least it also needs to match the page size used by the OS
(4KB on x86).

But be generally wary of turning off fpws if you use replication. Not
having them often turns an asynchronously batched write workload into one
containing a lot of synchronous, single-threaded reads. Even with SSDs
that can very quickly lead to not being able to keep up with replay
anymore.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: tablespace "archive2" is not empty

2015-10-19 Thread Andres Freund
On 2015-10-19 11:14:33 +0200, Josip Rodin wrote:
> On Mon, Oct 19, 2015 at 11:06:59AM +0200, Andres Freund wrote:
> > Hi,
> > 
> > On 2015-10-19 10:49:11 +0200, Josip Rodin wrote:
> > > % sudo -H -u postgres psql mydb -c "SELECT oid, relname, relkind FROM 
> > > pg_catalog.pg_class where oid IN (7877054, 7877056);"
> > >  oid | relname | relkind
> > > -+-+-
> > > (0 rows)
> > 
> > That's the wrong query. The files on disk are relfilenodes, not
> > oids. Try WHERE pg_relation_filenode(oid) IN ...
> 
> Oh, okay, but still:
> 
> % sudo -H -u postgres psql mydb -c "SELECT pg_relation_filenode(7877054);"
>  pg_relation_filenode
> --
> 
> (1 row)

Please actually run the query I suggested
above. pg_relation_filenode(oid) returns the relfilenode of the table
with the passed in oid - which you then compare with the relfilenode you
saw on disk.
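
I.e. something like:

SELECT oid, relname, relkind
FROM pg_catalog.pg_class
WHERE pg_relation_filenode(oid) IN (7877054, 7877056);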

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] postgresql 9.4 streaming replication

2015-10-19 Thread Andres Freund
On 2015-10-19 08:34:51 +, Sven Löschner wrote:
> My client recovery.conf looks this way:
> 
> standby_mode = 'on'
> primary_conninfo = 'host=arcserver1 port=5432 user=postgres pass=postgres'
> restore_command = 'pg_standby /db/pg_archived %f %p >> /var/log/standby.log'
> primary_slot_name='standby1'

pg_standby is for a "warm standby" - instead of signalling an error if
an archive file does not exist it'll sleep. Thus this node will never
enter streaming replication. Use cp or something instead.
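
E.g., building on the recovery.conf above (a sketch, adjust the path as
needed):

restore_command = 'cp /db/pg_archived/%f %p'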

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ERROR: tablespace "archive2" is not empty

2015-10-19 Thread Andres Freund
Hi,

On 2015-10-19 10:49:11 +0200, Josip Rodin wrote:
> % sudo -H -u postgres psql mydb -c "SELECT oid, relname, relkind FROM 
> pg_catalog.pg_class where oid IN (7877054, 7877056);"
>  oid | relname | relkind
> -+-+-
> (0 rows)

That's the wrong query. The files on disk are relfilenodes, not
oids. Try WHERE pg_relation_filenode(oid) IN ...

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] BDR workers exiting?

2015-10-12 Thread Andres Freund
On 2015-10-12 14:37:07 +, Steve Pribyl wrote:
> I am loading a 60G database into a BDR database and these "ERRORS" are in my 
> logs.  Is this normal behavior, or is something going bad?
> 
> 2015-10-12 09:28:59.389 CDT,,,30371,,561bc17d.76a3,1,,2015-10-12 09:19:41 
> CDT,5/0,0,ERROR,XX000,"data stream ended","bdr 
> (6204748238611542317,1,16494,): apply"
> 2015-10-12 09:28:59.390 CDT,,,12693,,561bb1ae.3195,20,,2015-10-12 08:12:14 
> CDT,,0,LOG,0,"worker process: bdr (6204748238611542317,1,16494,)->bdr 
> (6204748255428234532,1, (PID 30371) exited with exit code 1",""
> 2015-10-12 09:29:04.395 CDT,,,12693,,561bb1ae.3195,21,,2015-10-12 08:12:14 
> CDT,,0,LOG,0,"starting background worker process ""bdr 
> (6204748238611542317,1,16494,)->bdr (6204748255428234532,1,""",""

There'll possibly be an error message on the other node about ending the
connection.

Do you use SSL? If so, try disabling renegotiation.
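
On 9.4 that can be done without a restart, e.g. (a sketch, assuming
superuser access; would need to be applied on the relevant nodes):

ALTER SYSTEM SET ssl_renegotiation_limit = 0;
SELECT pg_reload_conf();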

Regards,

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Drop or alter column under load give ERROR #42804 structure of query does not match function result type:

2015-10-12 Thread Andres Freund
On 2015-10-12 06:42:52 -0700, Adrian Klaver wrote:
> >My guess is that the problem here is that table level locking prevents
> >modification of the "users" type when the table is used, but there's no
> >locking preventing the columns from being dropped while the function is
> >used. So what happens is: 1) the function is parsed & planned, 2)
> >DROP COLUMN is executed, 3) the contained statement is executed, 4) a
> >mismatch between the contained statement and the function definition is
> >detected.
> 
> Except per Albe's[1] example, the above sequence of events does not fail. It
> fails in Victor's case when the server is under load, so it seems there is
> another factor in play.

The above sequence is only problematic if 2) happens exactly between 1)
and 3), which is not particularly likely given that 1) is a very quick
operation.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Drop or alter column under load give ERROR #42804 structure of query does not match function result type:

2015-10-12 Thread Andres Freund
Hi,

On 2015-10-09 14:32:44 +0800, Victor Blomqvist wrote:
> Typically I have user defined functions for all operations, and my table
> and functions follow this pattern:
> 
> CREATE TABLE users (
>   id integer PRIMARY KEY,
>   name varchar NOT NULL,
>   to_be_removed integer NOT NULL
> );
> 
> CREATE FUNCTION select_users(id_ integer) RETURNS SETOF users AS
> $$
> BEGIN
>   RETURN QUERY SELECT * FROM users WHERE id = id_;
> END;
> $$ LANGUAGE plpgsql;
> 
> Then the actual queries are run by our application as
> 
> SELECT id, name FROM select_users(18);
> 
> As you can see the column to_be_removed is not selected. Then to remove the
> column I use:
> 
> ALTER TABLE users DROP COLUMN to_be_removed;
> 
> However, while the system is under load sometimes (more frequently and
> persistently the more load the system is experiencing) I get errors like
> these:
> 
> ERROR #42804 structure of query does not match function result type:
> Number of returned columns (2) does not match expected column count (3).

My guess is that the problem here is that table level locking prevents
modification of the "users" type when the table is used, but there's no
locking preventing the columns from being dropped while the function is
used. So what happens is: 1) the function is parsed & planned, 2)
DROP COLUMN is executed, 3) the contained statement is executed, 4) a
mismatch between the contained statement and the function definition is
detected.
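
One way to sidestep this particular mismatch (a sketch, not a fix for
the underlying race) is to declare an explicit result column list
instead of SETOF users, so that dropping an unrelated column doesn't
change the function's result shape:

CREATE OR REPLACE FUNCTION select_users(id_ integer)
RETURNS TABLE (id integer, name varchar) AS
$$
BEGIN
  -- select only the columns named in the declared result type
  RETURN QUERY SELECT u.id, u.name FROM users u WHERE u.id = id_;
END;
$$ LANGUAGE plpgsql;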

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Drop or alter column under load give ERROR #42804 structure of query does not match function result type:

2015-10-12 Thread Andres Freund
On 2015-10-12 08:07:54 +, Albe Laurenz wrote:
> Victor Blomqvist wrote:
> [race condition causes errors due to stale plans immediately after ALTER 
> TABLE DROP]
> > Note that these errors most of the time only happen very briefly at the 
> > same time as the ALTER is
> > run. When I did some experiments today the server in total had around 3k 
> > req/s with maybe 0.1% of them
> > touching the table being updated, and the error then happens maybe 1-10% of 
> > the times I try this
> > operation. If I do the operation on a table with more load the error will 
> > happen more frequently.
> 
> As far as I gleaned from reading the source, plan cache invalidation happens 
> by signals
> sent to the other backends, so I can see why there can be small delays.
> I wonder if there is any good way to improve this.

The signal based part is only relevant for idle backends, to wake them
up to process pending invalidations. The aim is to shrink the size of
the invalidation queue.

Normal invalidations are performed whenever a relation is locked:
void
LockRelationOid(Oid relid, LOCKMODE lockmode)
{
	LOCKTAG		tag;
	LockAcquireResult res;

	SetLocktagRelationOid(&tag, relid);

	res = LockAcquire(&tag, lockmode, false, false);

	/*
	 * Now that we have the lock, check for invalidation messages, so that we
	 * will update or flush any stale relcache entry before we try to use it.
	 * RangeVarGetRelid() specifically relies on us for this.  We can skip
	 * this in the not-uncommon case that we already had the same type of lock
	 * being requested, since then no one else could have modified the
	 * relcache entry in an undesirable way.  (In the case where our own xact
	 * modifies the rel, the relcache update happens via
	 * CommandCounterIncrement, not here.)
	 */
	if (res != LOCKACQUIRE_ALREADY_HELD)
		AcceptInvalidationMessages();
}

I've not investigated what the OP's problem is.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] in defensive of zone_reclaim_mode on linux

2015-09-06 Thread Andres Freund
Hi,

On 2015-09-04 15:37:47 -0700, Ben Chobot wrote:
> So our load would hover under 10 most of the time, then spike to over 100 for 
> a minute or two. Connections would get refused, the system would freeze up... 
> and then everything would go back to normal. The solution? Turning on 
> zone_reclaim_mode.
> 
> It appears that connection churn is far more manageable to Linux with 
> zone_reclaim_mode enabled. I suspect that our dearth of large, complex 
> queries helps us out as well. Regardless, our systems no longer desperately 
> seek free memory when many idle backends wake up while others are getting 
> torn down and and replaced. Babies and puppies rejoice. 
> 
> Our situation might not apply to you. But if it does, give zone_reclaim_mode 
> a chance. It's not (always) as bad as others have made it out to be.

To me that sounds like the negative impact of transparent hugepages
being mitigated to some degree by zone reclaim mode (which'll avoid some
cross-node transfers).

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] FDW and BDR

2015-09-02 Thread Andres Freund
On 2015-09-02 20:27:40 +0800, Craig Ringer wrote:
> The reason for this is that BDR replicates at a database level, but
> CREATE SERVER and CREATE USER MAPPING are global, affecting all
> databases on a PostgreSQL install. BDR can't therefore guarantee to
> replicate CREATE SERVER to other nodes, since it might get run on a
> non-BDR-enabled database.

As Tom explained that's not the case for either of those two. To my
knowledge the only reason those two commands aren't implemented is that
either nobody implemented the required ddl deparsing or, actually
somewhat likely, nobody removed the error check. Either way it should be
simple to implement.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] bdr download

2015-08-31 Thread Andres Freund
On 2015-08-31 18:41:19 +0200, Andres Freund wrote:
> http://archives.postgresql.org/message-id/%2053A2AA64.9040709%402ndquadrant.com

http://archives.postgresql.org/message-id/53A2AA64.9040709%402ndquadrant.com

There was an extra space in the link...

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] bdr download

2015-08-31 Thread Andres Freund
On 2015-08-31 12:37:11 -0400, Bruce Momjian wrote:
> On Mon, Aug 31, 2015 at 12:30:52PM -0300, Alvaro Herrera wrote:
> > > BDR is not currently part of community Postgres so you will need to
> > > report it to them directly.
> > 
> > As discussed a year ago or so, this list is what to use for BDR reports
> > and discussions, so this report is in the right place.
> 
> Huh, why did we decide that when the community doesn't control any of
> it?  That doesn't make any sense.

http://archives.postgresql.org/message-id/%2053A2AA64.9040709%402ndquadrant.com

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Grouping sets, cube and rollup

2015-08-26 Thread Andres Freund
On 2015-08-26 17:09:26 -0500, Merlin Moncure wrote:
> On Tue, Aug 25, 2015 at 7:04 PM, Edson Richter  wrote:
> > Any chance to get those amazing wonderful features backported to 9.4?
> 
> you might have some luck merging in the feature yourself if you're so 
> inclined.

It's imo too large a feature for that:
 63 files changed, 5255 insertions(+), 618 deletions(-)
In addition there's a bunch of follow-up commits fixing
things. There'll be a significant number of conflicts when applying this
to 9.4.

More importantly, it breaks catalog compatibility, i.e. you can't
patch postgres and then continue with an existing data directory.

- Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Logical decoding off of a replica?

2015-07-30 Thread Andres Freund
On 2015-07-28 17:54:57 +, Curt Micol wrote:
> Hello,
> 
> I've been working to get a replica setup to perform logical decoding
> and haven't been able to get the right configuration. I've tried
> everything I can think of. While researching I found this post on
> Stack Overflow:
> http://stackoverflow.com/questions/30351628/logical-decoding-on-a-standby-node
> 
> Mostly I want to verify that logical decoding is not currently
> supported on a replica.

Correct, logical decoding is currently not supported on a replica. I'm
not aware of recent progress towards allowing that, but I still consider
it not to be too hard to implement.
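
I.e. trying to create a logical slot on a standby just errors out;
something like this (error wording from memory):

-- on a standby:
SELECT * FROM pg_create_logical_replication_slot('test_slot', 'test_decoding');
-- ERROR:  logical decoding cannot be used while in recovery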

Regards,

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Delete rule does not prevent truncate

2015-07-24 Thread Andres Freund
On 2015-07-24 10:29:21 +0100, Tim Smith wrote:
> That's not the point.  Backups are important, but so is the concept of
> various layers of anti-fat-finger protection.  Restoring off backups
> should be a last resort, not a first.

Oh, come on. Install a TRUNCATE trigger and let this thread die.
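
For the archives, a minimal sketch of such a trigger (table name
hypothetical):

CREATE FUNCTION forbid_truncate() RETURNS trigger AS $$
BEGIN
    RAISE EXCEPTION 'TRUNCATE on % is not allowed', TG_TABLE_NAME;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER protect_from_truncate
    BEFORE TRUNCATE ON important_table
    FOR EACH STATEMENT EXECUTE PROCEDURE forbid_truncate();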

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Delete rule does not prevent truncate

2015-07-23 Thread Andres Freund
On 2015-07-23 12:57:20 +0100, Tim Smith wrote:
> Thus, I should not have to use a trigger for TRUNCATE because the "each
> row" concept does not apply. Plus it makes perfect sense to want to
> transform the truncate command and transform into ignore

That'd entirely defeat the point of TRUNCATE being fast.


Either way, this isn't going to change, so it seems a bit pointless to
continue arguing around it circles.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Lots of stuck queries after upgrade to 9.4

2015-07-20 Thread Andres Freund
On 2015-07-20 17:00:52 +0300, Spiros Ioannou wrote:
> FYI we have a 9.3.5 with commit_delay = 4000 and commit_siblings = 5 with
> an 8TB dataset which seems fine. (It runs on different - faster - hardware
> though.)

9.4 has a different xlog insertion algorithm (scaling much better), so
that unfortunately doesn't say very much...


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Lots of stuck queries after upgrade to 9.4

2015-07-20 Thread Andres Freund
Hi,

On 2015-07-20 15:58:33 +0300, Spiros Ioannou wrote:
> Happened again, another backtrace from a COMMIT process. I changed the
> commit_delay to 0 (it was 4000 to help with our storage) and will report
> back.

What hardware & OS is this happening on?

Regards,

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Lots of stuck queries after upgrade to 9.4

2015-07-20 Thread Andres Freund
Heikki,

On 2015-07-20 13:27:12 +0200, Andres Freund wrote:
> On 2015-07-20 13:22:42 +0200, Andres Freund wrote:
> > Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
> > XLogFlush().
> 
> These are the relevant stack traces:
> db9lock/debuglog-commit.txt
> #2  0x7f7405bd44f4 in LWLockWaitForVar (l=0x7f70f2ab6680, 
> valptr=0x7f70f2ab66a0, oldval=<optimized out>, newval=0x) at 
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011
> #3  0x7f7405a0d3e6 in WaitXLogInsertionsToFinish (upto=121713318915952) 
> at 
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755
> #4  0x7f7405a0e1d3 in XLogFlush (record=121713318911056) at 
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849
> 
> db9lock/debuglog-insert-8276.txt
> #1  0x7f7405b77d91 in PGSemaphoreLock (sema=0x7f73ff6531d0, interruptOK=0 
> '\000') at pg_sema.c:421
> #2  0x7f7405bd4849 in LWLockAcquireCommon (val=<optimized out>, 
> valptr=<optimized out>, mode=<optimized out>, l=<optimized out>) at 
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:626
> #3  LWLockAcquire (l=0x7f70ecaaa1a0, mode=LW_EXCLUSIVE) at 
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:467
> #4  0x7f7405a0dcca in AdvanceXLInsertBuffer (upto=<optimized out>, 
> opportunistic=<optimized out>) at 
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2161
> #5  0x7f7405a0e301 in GetXLogBuffer (ptr=121713318928384) at 
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1848
> #6  0x7f7405a0e9c9 in CopyXLogRecordToWAL (EndPos=<optimized out>, 
> StartPos=<optimized out>, rdata=0x71c21b90, isLogSwitch=<optimized out>, 
> write_len=<optimized out>) at 
> /tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1494
> #7  XLogInsert (rmid=<optimized out>, info=<optimized out>, rdata=<optimized out>) at /tmp/buildd/postgre


XLogFlush() has the following comment:
		/*
		 * Re-check how far we can now flush the WAL. It's generally not
		 * safe to call WaitXLogInsertionsToFinish while holding
		 * WALWriteLock, because an in-progress insertion might need to
		 * also grab WALWriteLock to make progress. But we know that all
		 * the insertions up to insertpos have already finished, because
		 * that's what the earlier WaitXLogInsertionsToFinish() returned.
		 * We're only calling it again to allow insertpos to be moved
		 * further forward, not to actually wait for anyone.
		 */
		insertpos = WaitXLogInsertionsToFinish(insertpos);

but I don't think that's valid reasoning. WaitXLogInsertionsToFinish()
calls LWLockWaitForVar(oldval = InvalidXLogRecPtr), which will block if
there's an exclusive locker and some backend hasn't yet set
initializedUpto. Which seems like a possible state?


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Lots of stuck queries after upgrade to 9.4

2015-07-20 Thread Andres Freund
On 2015-07-20 13:22:42 +0200, Andres Freund wrote:
> Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
> XLogFlush().

These are the relevant stack traces:
db9lock/debuglog-commit.txt
#2  0x7f7405bd44f4 in LWLockWaitForVar (l=0x7f70f2ab6680, 
valptr=0x7f70f2ab66a0, oldval=<optimized out>, newval=0x) at 
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:1011
#3  0x7f7405a0d3e6 in WaitXLogInsertionsToFinish (upto=121713318915952) at 
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1755
#4  0x7f7405a0e1d3 in XLogFlush (record=121713318911056) at 
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2849

db9lock/debuglog-insert-8276.txt
#1  0x7f7405b77d91 in PGSemaphoreLock (sema=0x7f73ff6531d0, interruptOK=0 
'\000') at pg_sema.c:421
#2  0x7f7405bd4849 in LWLockAcquireCommon (val=<optimized out>, 
valptr=<optimized out>, mode=<optimized out>, l=<optimized out>) at 
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:626
#3  LWLockAcquire (l=0x7f70ecaaa1a0, mode=LW_EXCLUSIVE) at 
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/storage/lmgr/lwlock.c:467
#4  0x7f7405a0dcca in AdvanceXLInsertBuffer (upto=<optimized out>, 
opportunistic=<optimized out>) at 
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:2161
#5  0x7f7405a0e301 in GetXLogBuffer (ptr=121713318928384) at 
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1848
#6  0x7f7405a0e9c9 in CopyXLogRecordToWAL (EndPos=<optimized out>, 
StartPos=<optimized out>, rdata=0x71c21b90, isLogSwitch=<optimized out>, 
write_len=<optimized out>) at 
/tmp/buildd/postgresql-9.4-9.4.4/build/../src/backend/access/transam/xlog.c:1494
#7  XLogInsert (rmid=<optimized out>, info=<optimized out>, rdata=<optimized out>) at /tmp/buildd/postgre



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Lots of stuck queries after upgrade to 9.4

2015-07-20 Thread Andres Freund
On 2015-07-20 13:06:51 +0200, Alvaro Herrera wrote:
> Spiros Ioannou wrote:
> > Hi Tom,
> > thank you for your input. The DB was stuck again, I attach all logs and
> > stack traces.
> > 
> > A stack trace from a COMMIT, an INSERT, an UPDATE, the wal writer, the
> > writer, and a sequence.
> > 
> > Stracing the commit was stuck at: semop(3145761, {{12, -1, 0}}, 1
> 
> Hmm, isn't this an indication of problems in the WAL insertion logic?
> Looks to me like there are deadlocks using lwlocks.

Hm. The problem seems to be the WaitXLogInsertionsToFinish() call in
XLogFlush().

Spiros, I guess you have commit_delay enabled? If so, does disabling it
"fix" the issue?

Regards,

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] cascading replication and replication slots.

2015-06-23 Thread Andres Freund
On 2015-06-22 13:32:23 +, Leif Gunnar Erlandsen wrote:
> Is it possible to use a replication slot for a downstream server when setting 
> up cascading replication on 9.4?

yes.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] CLOG read problem after pg_basebackup

2015-06-21 Thread Andres Freund
Hi,

On 2015-01-23 15:31:15 +0100, Petr Novak wrote:
> I'd like to ask for help clarifying an issue I'm having.
> 
> I've recently prepared new servers in another datacenter for some of our
> databases which I want to set up as a streaming replicas. There are several
> instances(clusters) with size ranging from 50-150GB. Some of them were set
> up with no issue. Three of them failed to start after pg_basebackup
> completed with:
> 
> FATAL:  could not access status of transaction 923709700
> DETAIL:  Could not read from file "pg_clog/0370" at offset 237568: Success.
> 
> (the clog file differed in each case of course..)

How exactly are you starting the standby after the basebackup? Any
chance you removed backup_label? This sounds like the typical symptom
of doing that.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On 2015-06-08 14:23:32 -0300, Alvaro Herrera wrote:
> Sure.  My concern is just that we might be putting excessive trust in
> emergency workers being launched at a high pace.

I'm not sure what to do about that. I mean, it'd not be hard to simply
ignore naptime upon wraparound, but I'm not sure that'd be well
received.

> (My personal alarm bells go off when I see autovac_naptime=15min or
> more, but apparently not everybody sees things that way.)

Understandably so. I'd be alarmed at much lower values than that
actually.

> > --- 
> > Please excuse brevity and formatting - I am writing this on my mobile phone.
> 
> I wonder if these notices are useful at all.

I only know that I'm less annoyed at reading an untrimmed/badly wrapped
email if it's sent from a mobile phone, where it's hard to impossible to
write a well-formatted email, than when it's sent from a full desktop.
That's why I added the notice...

Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On June 8, 2015 7:06:31 PM GMT+02:00, Alvaro Herrera  
wrote:
>Andres Freund wrote:
>
>> A first version to address this problem can be found appended to this
>> email.
>> 
>> Basically it does:
>> * Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal
>>   autovacuum once per members segment
>> * For both members and offsets, once hitting the hard limits, signal
>>   autovacuum every time. Otherwise we lose the information when
>>   restarting the database, or when autovac is killed. I ran into this a
>>   bunch of times while testing.
>
>I might be misreading the code, but PMSIGNAL_START_AUTOVAC_LAUNCHER only
>causes things to happen (i.e. a new worker to be started) when
>autovacuum is disabled.  If autovacuum is enabled, postmaster receives
>the signal and doesn't do anything about it, because the launcher is
>already running.  Of course, regularly scheduled autovac workers will
>eventually start running, but perhaps this is not good enough.

Well, that's just the same as for the plain-xid precedent? I'd not mind 
improving further, but that seems like a separate thing.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On 2015-06-08 15:15:04 +0200, Andres Freund wrote:
> 1) the autovacuum trigger logic isn't perfect yet. I.e. especially with
>   autovacuum=off you can get into situations where emergency vacuums
>   aren't started when necessary. This is particularly likely to happen
>   if either very large multixacts are used, or if the server has been
>   shut down while emergency autovacuums were happening. No corruption
>   ensues, but it's not easy to get out of.

A first version to address this problem can be found appended to this
email.

Basically it does:
* Whenever more than MULTIXACT_MEMBER_SAFE_THRESHOLD are used, signal
  autovacuum once per members segment
* For both members and offsets, once hitting the hard limits, signal
  autovacuum every time. Otherwise we lose the information when
  restarting the database, or when autovac is killed. I ran into this a
  bunch of times while testing.

Regards,

Andres
From 9949d8ce4b69b4fd693da08d8e1854fd259a33a9 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 8 Jun 2015 13:41:42 +0200
Subject: [PATCH] Improve multixact emergency autovacuum logic.

Previously autovacuum was not necessarily triggered if space in the
members slru got tight. The first problem was that the signalling was
tied to values in the offsets slru, but members can advance much
faster. That's especially a problem if old sessions had been around that
previously prevented the multixact horizon from increasing. Secondly the
skipping logic doesn't work if the database was restarted after
autovacuum was triggered - that knowledge is not preserved across
restart. This is especially a problem because it's a common
panic reaction to restart the database if it gets slow due to
anti-wraparound vacuums.

Fix the first problem by separating the logic for members from
offsets. Trigger autovacuum whenever a multixact crosses a segment
boundary; as the current member offset increases in irregular
increments, we can't use simple modulo logic as for offsets.  Add a
stopgap for the second problem, by signalling autovacuum whenever
ERRORing out because of the boundaries.

Backpatch into 9.3, where it became more likely that multixacts wrap
around.
---
 src/backend/access/transam/multixact.c | 61 +-
 1 file changed, 45 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index d3336a8..3bc170d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -980,10 +980,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * Note these are pretty much the same protections in GetNewTransactionId.
 	 *--
 	 */
-	if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) ||
-		!MultiXactState->oldestOffsetKnown ||
-		(MultiXactState->nextOffset - MultiXactState->oldestOffset
-		 > MULTIXACT_MEMBER_SAFE_THRESHOLD))
+	if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit))
 	{
 		/*
 		 * For safety's sake, we release MultiXactGenLock while sending
@@ -999,19 +996,17 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 
 		LWLockRelease(MultiXactGenLock);
 
-		/*
-		 * To avoid swamping the postmaster with signals, we issue the autovac
-		 * request only once per 64K multis generated.  This still gives
-		 * plenty of chances before we get into real trouble.
-		 */
-		if (IsUnderPostmaster && (result % 65536) == 0)
-			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
-
 		if (IsUnderPostmaster &&
 			!MultiXactIdPrecedes(result, multiStopLimit))
 		{
 			char	   *oldest_datname = get_database_name(oldest_datoid);
 
+			/*
+			 * Immediately kick autovacuum into action as we're already
+			 * in ERROR territory.
+			 */
+			SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
+
 			/* complain even if that DB has disappeared */
 			if (oldest_datname)
 				ereport(ERROR,
@@ -1032,6 +1027,14 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 		{
 			char	   *oldest_datname = get_database_name(oldest_datoid);
 
+			/*
+			 * To avoid swamping the postmaster with signals, we issue the autovac
+			 * request only once per 64K multis generated.  This still gives
+			 * plenty of chances before we get into real trouble.
+			 */
+			if (IsUnderPostmaster && (result % 65536) == 0)
+				SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
+
 			/* complain even if that DB has disappeared */
 			if (oldest_datname)
 				ereport(WARNING,
@@ -1099,6 +1102,10 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	if (MultiXactState->offsetStopLimitKnown &&
 		MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit, nextOffset,
 								 nmembers))
+	{
+		/* see comment in the corresponding offsets wraparound case */
+		SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);
+
 		ereport(ERROR,
 (errcode(ERRCODE_PRO

Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-08 Thread Andres Freund
On 2015-06-05 16:56:18 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas  
> > wrote:
> >> I think we would be foolish to rush that part into the tree.  We
> >> probably got here in the first place by rushing the last round of
> >> fixes too much; let's try not to double down on that mistake.
>
> > My problem with that approach is that I think the code has gotten 
> > significantly more complex in the last few weeks. I have very little trust 
> > that the interactions between vacuum, the deferred truncations in the 
> > checkpointer, the state management in shared memory and recovery are 
> > correct. There are just too many non-local subtleties here.
>
> > I don't know what the right thing to do here is.
>
> My gut feeling is that rushing to make a release date is the wrong thing.
>
> If we have confidence that we can ship something on Monday that is
> materially more trustworthy than the current releases, then let's aim to
> do that; but let's ship only patches we are confident in.  We can do
> another set of releases later that incorporate additional fixes.  (As some
> wise man once said, there's always another bug.)

I've tortured the hardware a fair bit with HEAD. So far it looks much better
than 9.4.2+ et al. I've noticed a bunch of issues that are, to me at least, new:

1) the autovacuum trigger logic isn't perfect yet. I.e. especially with
  autovacuum=off you can get into situations where emergency vacuums
  aren't started when necessary. This is particularly likely to happen
  if either very large multixacts are used, or if the server has been
  shut down while emergency autovacuums were happening. No corruption
  ensues, but it's not easy to get out of.

2) I've managed to corrupt a cluster when a standby performed
  restartpoints less frequently than the master performed
  checkpoints. Because truncations happen in the checkpointer it's not
  that hard to end up with entirely full multixact slrus. This is a
  problem on several fronts. We can IIUC end up truncating away the
  wrong data, and we can be in a bad state upon promotion.  None of that
  is new.

3) It's really confusing that truncation (and thus the limits in shared
  memory) happens in checkpoints. If you hit a limit and manually do all
  the necessary vacuums you'll see a "good" limit in
  pg_database.datminmxid, but you'll still run into the error. You
  manually have to force a checkpoint (i.e. issue CHECKPOINT) for the
  truncation to actually happen. That's particularly problematic because
  larger installations, where I presume wraparound issues are more
  likely, often have a large checkpoint_timeout setting.

Since none of these are really new, I don't think they should prevent us
from doing a back branch release. While I'm still not convinced we're
better off with 9.4.4 than with 9.4.1, we're certainly better off than
with 9.4.[23] et al.

If we want to go ahead with the release I plan to do a bit more testing
today and tomorrow. If not I'm first going to continue working on fixing
the above.

I have a "good" fix for 1). I'm not 100% sure I'll feel confident
pushing it if we wrap today. I am wondering if we shouldn't at least apply
the portion that unconditionally sends a signal in the ERROR
case. That's still an improvement.


One more thing:
Our testing infrastructure sucks. Without writing C code it's basically
impossible to test wraparounds and such. Even if not particularly useful
for non-devs, I really think we should have functions for creating and
burning xids/multixacts in core. Or at least in some extension.
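
For illustration, here's a minimal sketch of what such an extension
function could look like. This is hypothetical code, nothing that exists
today: it creates multis whose members are made-up xids, so it'd be
strictly for throwaway test clusters, and it varies the second member
xid because MultiXactIdCreate() caches identical member sets per
backend.

#include "postgres.h"

#include "access/multixact.h"
#include "access/transam.h"
#include "access/xact.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(burn_multixacts);

/* Hypothetical test helper: consume the next N multixact IDs. */
Datum
burn_multixacts(PG_FUNCTION_ARGS)
{
	int64		nmultis = PG_GETARG_INT64(0);
	TransactionId xid = GetCurrentTransactionId();
	TransactionId other = FirstNormalTransactionId;
	int64		i;

	/* MultiXactIdCreate() requires this to have been called first */
	MultiXactIdSetOldestMember();

	for (i = 0; i < nmultis; i++)
	{
		/* keep the two member xids distinct */
		if (TransactionIdEquals(other, xid))
			TransactionIdAdvance(other);

		(void) MultiXactIdCreate(xid, MultiXactStatusForShare,
								 other, MultiXactStatusForShare);
		TransactionIdAdvance(other);
	}

	PG_RETURN_VOID();
}

Hooked up via a plain CREATE FUNCTION ... LANGUAGE C declaration, that
would already make the scenarios from 1) and 2) above scriptable.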


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-07 Thread Andres Freund
On 2015-06-05 20:47:33 +0200, Andres Freund wrote:
> On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
> > Robert Haas  writes:
> > > 1. The problem that we might truncate an SLRU members page away when
> > > it's in the buffers, but not drop it from the buffers, leading to a
> > > failure when we try to write it later.
> 
> I've got a fix for this, and about three other issues I found during
> development of the new truncation codepath.
> 
> I'll commit the fix tomorrow.

I've looked through multixact.c/slru.c and afaics there currently is, as
observed by Thomas, no codepath that exercises the broken behaviour. Due
to the way checkpoints and SLRU truncation are linked, "problematic"
pages will have been flushed beforehand.

I think we should fix this either way as it seems like a bad trap, but
I'd rather commit it after the next minor releases are out.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Andres Freund
On June 5, 2015 10:02:37 PM GMT+02:00, Robert Haas  
wrote:
>On Fri, Jun 5, 2015 at 2:47 PM, Andres Freund 
>wrote:
>> On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
>>> Robert Haas  writes:
>>> > 1. The problem that we might truncate an SLRU members page away when
>>> > it's in the buffers, but not drop it from the buffers, leading to a
>>> > failure when we try to write it later.
>>
>> I've got a fix for this, and about three other issues I found during
>> development of the new truncation codepath.
>>
>> I'll commit the fix tomorrow.
>
>OK.  Then I think we should release next week, so we get the fixes we
>have out before PGCon.  The current situation is not good.
>
>>> > I think we might want to try to fix one or both of those before
>>> > cutting a new release.  I'm less sold on the idea of installing
>>> > WAL-logging in this minor release.  That probably needs to be done,
>>> > but right now we've got stuff that worked in early 9.3.X release and
>>> > is now broken, and I'm in favor of fixing that first.
>>
>> I've implemented this, and so far it removes more code than it
>> adds. It's imo also a pretty clear win in how understandable the code
>> is.  The remaining work, besides testing, is primarily going over lots
>> of comments and updating them. Some of them are outdated by the patch,
>> and some already were.
>>
>> Will post tonight, together with the other fixes, after I get back from
>> climbing.
>>
>> My gut feeling right now is that it's a significant improvement, and
>> that it'll be reasonable to include it. But I'd definitely like some
>> independent testing for it, and I'm not sure if that's doable in time
>> for the wrap.
>
>I think we would be foolish to rush that part into the tree.  We
>probably got here in the first place by rushing the last round of
>fixes too much; let's try not to double down on that mistake.

My problem with that approach is that I think the code has gotten significantly
more complex in the last few weeks. I have very little trust that the
interactions between vacuum, the deferred truncations in the checkpointer, the
state management in shared memory and recovery are correct. There are just too
many non-local subtleties here.

I don't know what the right thing to do here is.



--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Andres Freund
On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > 1. The problem that we might truncate an SLRU members page away when
> > it's in the buffers, but not drop it from the buffers, leading to a
> > failure when we try to write it later.

I've got a fix for this, and about three other issues I found during
development of the new truncation codepath.

I'll commit the fix tomorrow.

> > I think we might want to try to fix one or both of those before
> > cutting a new release.  I'm less sold on the idea of installing
> > WAL-logging in this minor release.  That probably needs to be done,
> > but right now we've got stuff that worked in early 9.3.X release and
> > is now broken, and I'm in favor of fixing that first.

I've implemented this, and so far it removes more code than it
adds. It's imo also a pretty clear win in how understandable the code
is.  The remaining work, besides testing, is primarily going over lots
of comments and updating them. Some of them are outdated by the patch,
and some already were.

Will post tonight, together with the other fixes, after I get back from
climbing.

My gut feeling right now is that it's a significant improvement, and
that it'll be reasonable to include it. But I'd definitely like some
independent testing for it, and I'm not sure if that's doable in time
for the wrap.

> Okay, but if we're not committing today to a release wrap on Monday,
> I don't see it happening till after PGCon.

I wonder if, with all the recent, err, training, we could wrap it on
Tuesday instead. Independent of the truncation rework going in or not,
an additional work day to go over all the changes and do some more
testing would be good from my POV.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Andres Freund
On 2015-06-05 11:43:45 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch  wrote:
> >> I read through this version and found nothing to change.  I encourage other
> >> hackers to study the patch, though.  The surrounding code is challenging.
> 
> > Andres tested this and discovered that my changes to
> > find_multixact_start() were far more creative than intended.
> > Committed and back-patched with a trivial fix for that stupidity and a
> > novel-length explanation of the changes.
> 
> So where are we on this?  Are we ready to schedule a new set of
> back-branch releases?  If not, what issues remain to be looked at?

We're currently still doing bad things while the database is in an
inconsistent state (i.e. reading from SLRUs and truncating based on the
results of that). It's quite easy to reproduce base backup startup
failures.

On the other hand, that's not new. And the fix requires, afaics, a new
type of WAL record (issued very infrequently). I'll post a first version
of the patch, rebased on top of Robert's commit, tonight or tomorrow. I
guess we can then decide what we'd like to do.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Andres Freund
Hi,

On 2015-06-04 12:57:42 -0400, Robert Haas wrote:
> + /*
> +  * Do we need an emergency autovacuum?  If we're not sure, assume yes.
> +  */
> + return !oldestOffsetKnown ||
> + (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);

I think without teaching autovac about those rules, this might just lead
to lots of autovac processes starting without knowing they should do
something? They know about autovacuum_multixact_freeze_age, but they
know neither about !oldestOffsetKnown nor, afaics, about pending offset
wraparounds?

> -static MultiXactOffset
> -find_multixact_start(MultiXactId multi)
> +static bool
> +find_multixact_start(MultiXactId multi, MultiXactOffset *result)
>  {
>   MultiXactOffset offset;
>   int pageno;
> @@ -2630,6 +2741,9 @@ find_multixact_start(MultiXactId multi)
>   pageno = MultiXactIdToOffsetPage(multi);
>   entryno = MultiXactIdToOffsetEntry(multi);
>  
> + if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
> + return false;
> +
>   /* lock is acquired by SimpleLruReadPage_ReadOnly */
>   slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
>   offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
> @@ -2642,25 +2756,31 @@ find_multixact_start(MultiXactId multi)
>

I think it'd be a good idea to also return false in case of an
InvalidMultiXactId - that'll be returned if the page has been zeroed.


Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > That's not necessarily the case though, given how the code currently
> > works. In a bunch of places the SLRUs are accessed *before* having been
> > made consistent by WAL replay. Especially if several checkpoints/vacuums
> > happened during the base backup, the assumed state of the data directory
> > soon after the initial start (i.e. the mxacts the checkpoints refer to)
> > and the actual state of pg_multixact/ won't necessarily match at all.
> 
> Well, the log extract is quite simple; it says that as soon as the
> standby starts replay WAL, it fetches exactly one WAL segment, which
> contains exactly one online checkpoint record; this record contains
> oldestMulti=4624559, which belongs in offset file 0046.  But the
> basebackup only contains files starting from 004B onwards.  The base
> backup takes a long time, so my guess of what happened is that the
> backup label points to the start of the WAL stream (obviously), then the
> data files are copied to the standby; during this long process, a
> multixact truncation happens.  So by the time the base backup reaches
> the pg_multixact directory, the 0046 file has been removed.

Yes. That's precisely what I was concerned about above and elsewhere in
the thread. It's simply not ok to access a SLRU while not yet
consistent...

> One idea I had was: what if the oldestMulti pointed to another multi
> earlier in the same 0046 file, so that it is read-as-zeroes (and the
> file is created), and then a subsequent multixact truncate tries to read
> a later page in the file.  In SlruPhysicalReadPage() this would give a
> chance for open() to not fail, and then read() can fail as above.
> However, we would have an earlier LOG message about "reading as zeroes".
> 
> Really, the whole question of how this code goes past the open() failure
> in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> the file to be created ...

Wouldn't a previous WAL record zeroing another part of that segment
explain this? A zero-sized segment would pretty much lead to this error,
right? Or were you able to check how things look after the failure?

> 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
> 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file 
> "000173DD00AD" from archive
> 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access 
> status of transaction 4624559
> 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from 
> file "pg_multixact/offsets/0046" at offset 147456: Success.
> 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 
> 4782) exited with exit code 1
> 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to 
> startup process failure

From this it's not entirely clear where this error was triggered.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
> Thomas Munro wrote:
> > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera  
> > wrote:
> > > My guess is that the file existed, and perhaps had one or more pages,
> > > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > > back.  read() returns 0 in this case but doesn't set errno.
> > >
> > > I didn't find a way to set things so that the file exists but is of
> > > shorter contents than oldestMulti by the time the checkpoint record is
> > > replayed.
> > 
> > I'm just starting to learn about the recovery machinery, so forgive me
> > if I'm missing something basic here, but I just don't get this.  As I
> > understand it, offsets/0046 should either have been copied with that
> > page present in it if it existed before the backup started (apparently
> > not in this case), or extended to contain it by WAL records that come
> > after the backup label but before the checkpoint record that
> > references it (also apparently not in this case).

That's not necessarily the case though, given how the code currently
works. In a bunch of places the SLRUs are accessed *before* having been
made consistent by WAL replay. Especially if several checkpoints/vacuums
happened during the base backup, the assumed state of the data directory
soon after the initial start (i.e. the mxacts the checkpoints refer to)
and the actual state of pg_multixact/ won't necessarily match at all.

> Exactly --- that's the spot at which I am, also.  I have had this
> spinning in my head for three days now, and tried every single variation
> that I could think of, but like you I was unable to reproduce the issue.
> However, our customer took a second base backup and it failed in exactly
> the same way, modulo some changes to the counters (the file that
> didn't exist was 004B rather than 0046).  I'm still at a loss at what
> the failure mode is.  We must be missing some crucial detail ...

I might have missed it in this already long thread. Could you share a
bunch of details about the case? It'd be very interesting to see the
contents of the backup label (to see where start/end are), the contents
of the initial checkpoint (to see which mxacts we assume to exist at
start) and what the initial contents of pg_multixact are (to match up).

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
> > the disk it'll always get one at a segment boundary, right? I'm not sure
> > that's actually ok, because the value at the beginning of the segment
> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
> > filled the page with zeros.
> >
> > I think that should be harmless; the worst that can happen is that
> > oldestOffset erroneously is 0, which should be correct, even though
> > possibly overly conservative, in these cases.
> 
> Uh oh.  That seems like a real bad problem for this approach.  What
> keeps that from being the opposite of too conservative?  There's no
> "safe" value in a circular numbering space.

I think it *might* (I'm really jetlagged) be fine because that'll only
happen after an upgrade from < 9.3. And in that case we initialize
nextOffset to 0. That ought to save us?

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-01 14:22:32 -0400, Robert Haas wrote:

> commit d33b4eb0167f465edb00bd6c0e1bcaa67dd69fe9
> Author: Robert Haas 
> Date:   Fri May 29 14:35:53 2015 -0400
> 
> foo

Hehe!

> diff --git a/src/backend/access/transam/multixact.c 
> b/src/backend/access/transam/multixact.c
> index 9568ff1..aca829d 100644
> --- a/src/backend/access/transam/multixact.c
> +++ b/src/backend/access/transam/multixact.c
> @@ -199,8 +199,9 @@ typedef struct MultiXactStateData
>   MultiXactOffset nextOffset;
>  
>   /*
> -  * Oldest multixact that is still on disk.  Anything older than this
> -  * should not be consulted.  These values are updated by vacuum.
> +  * Oldest multixact that may still be referenced from a relation.
> +  * Anything older than this should not be consulted.  These values are
> +  * updated by vacuum.
>*/
>   MultiXactId oldestMultiXactId;
>   Oid oldestMultiXactDB;
> @@ -213,6 +214,18 @@ typedef struct MultiXactStateData
>*/
>   MultiXactId lastCheckpointedOldest;
>  
> + /*
> +  * This is the oldest file that actually exists on the disk.  This value
> +  * is initialized by scanning pg_multixact/offsets, and subsequently
> +  * updated each time we complete a truncation.  We need a flag to
> +  * indicate whether this has been initialized yet.
> +  */
> + MultiXactId oldestMultiXactOnDisk;
> + bool		oldestMultiXactOnDiskValid;
> +
> + /* Has TrimMultiXact been called yet? */
> + bool		didTrimMultiXact;

I'm not really convinced tying things closer to having done trimming is
easier to understand than tying things to recovery having finished.

E.g.
	if (did_trim)
		oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);
imo is harder to understand than if (!InRecovery).

Maybe we could just name it finishedStartup and rename the functions 
accordingly?

> @@ -1956,12 +1971,6 @@ StartupMultiXact(void)
>*/
>   pageno = MXOffsetToMemberPage(offset);
>   MultiXactMemberCtl->shared->latest_page_number = pageno;
> -
> - /*
> -  * compute the oldest member we need to keep around to avoid old member
> -  * data overrun.
> -  */
> - DetermineSafeOldestOffset(MultiXactState->oldestMultiXactId);
>  }

Maybe it's worthwhile to add a 'NB: At this stage the data directory is
not yet necessarily consistent' note to StartupMultiXact's comments, to avoid
reintroducing problems like this?

>   /*
> +  * We can read this without a lock, because it only changes when nothing
> +  * else is running.
> +  */
> + did_trim = MultiXactState->didTrimMultiXact;

Err, Hot Standby? It might be ok to not lock, but the comment is
definitely wrong. I'm inclined to simply use locking; this doesn't look
sufficiently performance-critical.
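
I.e. something like this ought to be plenty cheap:

	LWLockAcquire(MultiXactGenLock, LW_SHARED);
	did_trim = MultiXactState->didTrimMultiXact;
	LWLockRelease(MultiXactGenLock);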

> +static MultiXactOffset
> +GetOldestReferencedOffset(MultiXactId oldestMXact)
> +{
> + MultiXactId earliest;
> + MultiXactOffset oldestOffset;
> +
> + /*
> +  * Because of bugs in early 9.3.X and 9.4.X releases (see comments in
> +  * TrimMultiXact for details), oldest_datminmxid might point to a
> +  * nonexistent multixact.  If so, use the oldest one that actually 
> +  * exists.  Anything before this can't be successfully used anyway.
> +  */
> + earliest = GetOldestMultiXactOnDisk();
> + if (MultiXactIdPrecedes(oldestMXact, earliest))
> + oldestMXact = earliest;

Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
the disk it'll always get one at a segment boundary, right? I'm not sure
that's actually ok, because the value at the beginning of the segment
can very well end up being a 0, as MaybeExtendOffsetSlru() will have
filled the page with zeros.

I think that should be harmless; the worst that can happen is that
oldestOffset erroneously is 0, which should be correct, even though
possibly overly conservative, in these cases.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:49:56 -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund  wrote:
> > On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
> >> The exact circumstances under which we're willing to replace a
> >> relminmxid with a newly-computed one that differs are not altogether
> >> clear to me, but there's an "if" statement protecting that logic, so
> >> there are some circumstances in which we'll leave the existing value
> >> intact.
> >
> > I guess we'd have to change that then.
> 
> Yeah, but first we'd need to assess why it's like that.  Tom was the
> one who installed the current logic, but I haven't been able to fully
> understand it.

We're talking about:
	/* Similarly for relminmxid */
	if (MultiXactIdIsValid(minmulti) &&
		pgcform->relminmxid != minmulti &&
		(MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
		 MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
	{
		pgcform->relminmxid = minmulti;
		dirty = true;
	}

right? Before that change (78db307bb/87f830e0ce) we only updated
relminmxid if the new value was newer than the old one. That's to keep
values from going backwards, e.g. when a relation is first VACUUM
FREEZEd and then a normal VACUUM is performed (these values are
unfortunately based on the cutoff values instead of the observed
minima). The new part is the
	|| MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)
which is there to recognize values from the future, e.g. the 1
erroneously left in place by pg_upgrade.

Makes sense?

> >> It would similarly do so when the oldest MXID reference in the
> >> relation is in fact 1, but that value can't be vacuumed away yet.
> >
> > I'd thought of just forcing consumption of one multixact in that
> > case. Not pretty, but imo acceptable.
> 
> What if multixact 1 still has living members?

I think we should just trigger the logic if 1 is below the multi
freezing horizon - in which case a) the 1 will automatically be
replaced, because the horizon is newer, and b) it can't have a living
member anyway.
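
In code that'd be little more than the following sketch, where
mxactCutoff and scan_all stand in for vacuum's multixact freeze cutoff
and its full-scan flag:

	/*
	 * A relminmxid of 1 is presumed to be a pg_upgrade leftover; once 1
	 * is below the freeze horizon, force a full scan so the value gets
	 * recomputed.
	 */
	if (relminmxid == FirstMultiXactId &&
		MultiXactIdPrecedes(FirstMultiXactId, mxactCutoff))
		scan_all = true;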

- Andres


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
> The exact circumstances under which we're willing to replace a
> relminmxid with a newly-computed one that differs are not altogether
> clear to me, but there's an "if" statement protecting that logic, so
> there are some circumstances in which we'll leave the existing value
> intact.

I guess we'd have to change that then.

> It would similarly do so when the oldest MXID reference in the
> relation is in fact 1, but that value can't be vacuumed away yet.

I'd thought of just forcing consumption of one multixact in that
case. Not pretty, but imo acceptable.

> Also, the database might be really big.  Even if it were true that a
> full scan of every table would get us out of this state, describing
> the time that it would take to do that as "relatively short" seems to
> me to be considerably understating the impact of a full-cluster
> VACUUM.

Well. We're dealing with a corrupted cluster. Having a way out that's
done automatically, even if it takes a long while, is pretty good
already. In many cases the corruption (i.e. via pg_upgrade) happened long
ago, so the table's relminmxid will already have been recomputed.  I
think that's acceptable.

> With regard to the more general question of WAL-logging this, are you
> going to work on that?  Are you hoping Alvaro or I will work on that?
> Should we draw straws?  It seems like somebody needs to do it.

I'm willing to invest the time to develop an initial version, but I'll
need help evaluating it. I don't have many testing resources available
atm, and I'm not going to trust stuff I developed while travelling by
just looking at the code.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:29:24 -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund  wrote:
> > But what *definitely* looks wrong to me is that a TruncateMultiXact() in
> > this scenario now (since a couple weeks ago) does a
> > SimpleLruReadPage_ReadOnly() in the members slru via
> > find_multixact_start(). That just won't work acceptably when we're not
> > yet consistent. There very well could not be a valid members segment at
> > that point?  Am I missing something?
> 
> Yes: that code isn't new.

Good point.

> TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
> 9.3.0 and every subsequent release until 9.3.7/9.4.2.

But back then TruncateMultiXact() wasn't called during recovery. But
you're right in that it doesn't seem to have produced attributable
bug reports since 9.3.2, where vacuuming during recovery was
introduced. So it indeed doesn't seem as urgent as fixing the new
callsites.

> That would be a departure from the behavior of every existing release
> that includes this code based on, to my knowledge, zero trouble
> reports.

On the other hand we're now at about bug #5 attributable to the odd way
truncation works for multixacts. It's obviously complex and hard to get
right. It makes it harder to cope with the wrong values left in
datminmxid etc. So I'm still wondering whether fixing this for good isn't
the better approach.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
> I'm having trouble figuring out what to do about this.  I mean, the
> essential principle of this patch is that if we can't count on
> relminmxid, datminmxid, or the control file to be accurate, we can at
> least look at what is present on the disk.  If we also cannot count on
> that to be accurate, we are left without any reliable source of
> information.  Consider a hypothetical cluster where all our stored
> minmxids of whatever form are corrupted (say, all change to 1) and in
> addition there are stray files in pg_multixact.  I don't think there's
> really any way to get ourselves out of trouble in that scenario.

If we were to truncate after vacuum, and only on the primary (via WAL
logging), we could, afaics, just rely on all the values to be
recomputed. I mean relminmxid will be recomputed after a vacuum, and
thus, after some time, so will datminmxid and the control file value.  We
could just force a value of 1 to always trigger anti-wraparound vacuums
(or wait for that to happen implicitly, to delay the impact?). That
should then fix the problem in a relatively short amount of time?


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-01 14:22:32 -0400, Robert Haas wrote:
> On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund  wrote:
> > The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> > era, where we didn't do any truncation during recovery...
> 
> Right, but now we're piggybacking on the checkpoint records, and I
> don't have any evidence that this approach can't be made robust.  It's
> possible that it can't be made robust, but that's not currently clear.

Well, it's possible that it can be made to work without problems. But I
think robust is something different. Having to look at slrus, during
recovery, to find out what to truncate puts more intelligence/complexity
in the recovery path than I'm comfortable with.

> >> By the time we've reached the minimum recovery point, they will have
> >> been recreated by the same WAL records that created them in the first
> >> place.
> >
> > I'm not sure that's true. I think we could end up erroneously removing
> > files that were included in the base backup. Anyway, let's focus on your
> > patch for now.
> 
> OK, but I am interested in discussing the other thing too.  I just
> can't piece together the scenario myself - there may well be one.  The
> base backup will begin replay from the checkpoint caused by
> pg_start_backup() and remove anything that wasn't there at the start
> of the backup.  But all of that stuff should get recreated by the time
> we reach the minimum recovery point (end of backup).

I'm not sure if it's reproducibly borked. What about this scenario:
1) pg_start_backup() is called, creates a checkpoint.
2) 2**31 multixacts are created, possibly with several checkpoints
   inbetween
3) pg_multixact is copied
4) basebackup finishes

Unless I'm missing something this will lead to a crash recovery startup
where the first call to TruncateMultiXact() will have
MultiXactState->lastCheckpointedOldest wildly inconsistent with the
GetOldestMultiXactOnDisk() return value, possibly leading to truncation
being skipped erroneously.  Whether that's a problem I'm not yet
entirely sure.

But what *definitely* looks wrong to me is that a TruncateMultiXact() in
this scenario now (since a couple weeks ago) does a
SimpleLruReadPage_ReadOnly() in the members slru via
find_multixact_start(). That just won't work acceptably when we're not
yet consistent. There very well could not be a valid members segment at
that point?  Am I missing something?

> > I'm more worried about the cases where we didn't ever actually "badly
> > wrap around" (i.e. overwrite needed data); but where that's not clear on
> > the standby because the base backup isn't in a consistent state.
> 
> I agree. The current patch tries to make it so that we never call
> find_multixact_start() while in recovery, but it doesn't quite
> succeed: the call in TruncateMultiXact still happens during recovery,
> but only once we're sure that the mxact we plan to call it on actually
> exists on disk.  That won't be called until we replay the first
> checkpoint, but that might still be prior to consistency.

It'll pretty much *always* be before we reach consistency, right? It'll be
called on the checkpoint created by pg_start_backup()?

I don't think the presence check (that's GetOldestMultiXactOnDisk() in
TruncateMultiXact(), right) helps very much. There's no guarantee at all
that offsets and members are in any way consistent with each other. Or
in themselves for that matter; the copy could very well have been taken in
the middle of a write of an SLRU page.

I think at the very least we'll have to skip this step while not yet
consistent. That really sucks, because we'll possibly end up with
multixacts that are completely filled by the time we've reached
consistency.
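
The minimal stopgap would presumably be a guard at the top of
TruncateMultiXact(), along these lines (a sketch, using the
reachedConsistency flag xlog.c already exports):

	/* Don't trust the SLRU contents until the cluster is consistent. */
	if (RecoveryInProgress() && !reachedConsistency)
		return;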

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

