from:"Vladimir Borodin"

Re: [HACKERS] Proposal: pg_rewind to skip config files

2017-09-05 Thread Vladimir Borodin

> 5 сент. 2017 г., в 15:48, Michael Paquier  
> написал(а):
> 
> On Tue, Sep 5, 2017 at 9:40 PM, Vladimir Borodin  wrote:
>> We do compress WALs and send them over network. Doing it via archive_command
>> in single thread is sometimes slower than new WALs are written under heavy
>> load.
> 
> Ah, yeah, true. I do use pg_receivexlog --compress for that locally
> and do a bulk copy of only the compressed WALs needed, when needed...
> So there is a guarantee that completed segments are durable locally,
> which is very useful.

It seems that option --compress appeared only in postgres 10 which is not ready 
for production yet. BTW I assume that pg_receivexlog is single-threaded too? So 
it still may be the bottleneck when 3-5 WALs per second are written.

> You should definitely avoid putting that in
> PGDATA though, the same counts for tablespaces within PGDATA for
> example.

I would love to but there might be some problems with archiving and in many 
cases the only partition with enough space to accumulate WALs is partition for 
PGDATA.

--
May the force be with you…
https://simply.name

Re: [HACKERS] Proposal: pg_rewind to skip config files

2017-09-05 Thread Vladimir Borodin


> 5 сент. 2017 г., в 15:42, Chris Travers  написал(а):
> 
> On Tue, Sep 5, 2017 at 2:40 PM, Vladimir Borodin  <mailto:r...@simply.name>> wrote:
> 
>> 5 сент. 2017 г., в 14:04, Michael Paquier > <mailto:michael.paqu...@gmail.com>> написал(а):
>> 
>>> For example, in archive_command we put WALs for archiving from
>>> pg_xlog/pg_wal into another directory inside PGDATA and than another cron
>>> task makes real archiving. This directory ideally should be skipped by
>>> pg_rewind, but it would not be handled by proposed change.
>> 
>> I would be curious to follow the reasoning for such a two-phase
>> archiving (You basically want to push it in two places, no? But why
>> not just use pg_receivexlog then?). This is complicated to handle from
>> the point of view of availability and backup reliability + durability.
> 
> We do compress WALs and send them over network. Doing it via archive_command 
> in single thread is sometimes slower than new WALs are written under heavy 
> load.
> 
> How would this work when it comes to rewinding against a file directory? 

Very bad, of course. Sometimes we get 'could not remove file 
"/var/lib/postgresql/9.6/data/wals/000100C300C6": No such file or 
directory’ while running pg_rewind ($PGDATA/wals is a directory where 
archive_command copies WALs). That’s why I want to solve the initial problem. 
Both proposed solutions (using only needed files and skipping files through 
glob/regex) are fine for me, but not the initial patch.

--
May the force be with you…
https://simply.name

Re: [HACKERS] Proposal: pg_rewind to skip config files

2017-09-05 Thread Vladimir Borodin


> 5 сент. 2017 г., в 14:04, Michael Paquier  
> написал(а):
> 
>> For example, in archive_command we put WALs for archiving from
>> pg_xlog/pg_wal into another directory inside PGDATA and than another cron
>> task makes real archiving. This directory ideally should be skipped by
>> pg_rewind, but it would not be handled by proposed change.
> 
> I would be curious to follow the reasoning for such a two-phase
> archiving (You basically want to push it in two places, no? But why
> not just use pg_receivexlog then?). This is complicated to handle from
> the point of view of availability and backup reliability + durability.

We do compress WALs and send them over network. Doing it via archive_command in 
single thread is sometimes slower than new WALs are written under heavy load.

--
May the force be with you…
https://simply.name

Re: [HACKERS] Proposal: pg_rewind to skip config files

2017-09-05 Thread Vladimir Borodin


> 5 сент. 2017 г., в 12:31, Chris Travers  написал(а):
> 
> I think the simplest solution for now is to skip any files ending in .conf, 
> .log, and serverlog.

Why don’t you want to solve the problem once? It is a bit harder to get 
consensus on a way how to do it, but it seems that there are no reasons to make 
temporary solution here.

For example, in archive_command we put WALs for archiving from pg_xlog/pg_wal 
into another directory inside PGDATA and than another cron task makes real 
archiving. This directory ideally should be skipped by pg_rewind, but it would 
not be handled by proposed change.

> 
> Long run, it would be nice to change pg_rewind from an opt-out approach to an 
> approach of processing the subdirectories we know are important.

While it is definitely an awful idea the user can easily put something strange 
(i.e. logs) to any important directory in PGDATA (i.e. into base or pg_wal). Or 
how for example pg_replslot should be handled (I asked about it a couple of 
years ago [1])? It seems that a glob/regexp for things to skip is a more 
universal solution.

[1] 
https://www.postgresql.org/message-id/flat/8DDCCC9D-450D-4CA2-8CF6-40B382F1F699%40simply.name


--
May the force be with you…
https://simply.name

Re: [HACKERS] Funny WAL corruption issue

2017-08-10 Thread Vladimir Borodin

Hi, Chris.

> 10 авг. 2017 г., в 15:09, Chris Travers  написал(а):
> 
> Hi;
> 
> I ran into a funny situation today regarding PostgreSQL replication and wal 
> corruption and wanted to go over what I think happened and what I wonder 
> about as a possible solution.
> 
> Basic information is custom-build PostgreSQL 9.6.3 on Gentoo, on a ~5TB 
> database with variable load.  Master database has two slaves and generates 
> 10-20MB of WAL traffic a second.  The data_checksum option is off.
> 
> 
> The problem occurred when I attempted to restart the service on the slave 
> using pg_ctl (I believe the service had been started with sys V init 
> scripts).  On trying to restart, it gave me a nice "Invalid memory allocation 
> request" error and promptly stopped.
> 
> The main logs showed a lot of messages like before the restart:
> 2017-08-02 11:47:33 UTC LOG:  PID 19033 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19032 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19024 in cancel request did not match any 
> process
> 2017-08-02 11:47:33 UTC LOG:  PID 19034 in cancel request did not match any 
> process
> 
> On restart, the following was logged to stderr:
> LOG:  entering standby mode
> LOG:  redo starts at 1E39C/8B77B458
> LOG:  consistent recovery state reached at 1E39C/E1117FF8
> FATAL:  invalid memory alloc request size 3456458752
> LOG:  startup process (PID 18167) exited with exit code 1
> LOG:  terminating any other active server processes
> LOG:  database system is shut down
> 
> After some troubleshooting I found that the wal segment had become corrupt, I 
> copied the correct one from the master and everything came up to present.
> 
> So It seems like somewhere something crashed big time on the back-end and 
> when we tried to restart, the wal ended in an invalid way.

We have reported the same thing [1] nearly a year ago. Could you please check 
with pg_xlogdump that both WALs (normal from master and corrupted) are exactly 
the same until some certain LSN?

[1] 
https://www.postgresql.org/message-id/20160614103415.5796.6885%40wrigleys.postgresql.org

> 
> I am wondering what can be done to prevent these sorts of things from 
> happening in the future if, for example, a replica dies in the middle of a 
> wal fsync. 
> -- 
> Best Wishes,
> Chris Travers
> 
> Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor 
> lock-in.
> http://www.efficito.com/learn_more 

--
May the force be with you…
https://simply.name

Re: [HACKERS] SCRAM auth and Pgpool-II

2017-07-14 Thread Vladimir Borodin


> 14 июля 2017 г., в 1:33, Stephen Frost  написал(а):
> 
> What would be really nice for such cases is support for Kerberos and
> delegated Kerberos credentials.  Having pgpool support that would remove
> the need to deal with passwords at all.

Since nearly all systems with some kind of load nowadays use connection poolers 
(pgpool-II or pgbouncer) between applications and postgres, it is a pretty big 
pain to re-implement all authentication methods supported by postgres in such 
poolers. Kerberos is cool but not the only thing that should be supported by 
FDWs or connection poolers. I.e. many users would want to have support for LDAP 
and SCRAM. And every time when there would be some changes in postgres auth 
methods, exactly the same work (or even worse) should be done in many (at least 
two) other products widely used by people.

It seems that postgres either should provide connection pooling feature in core 
or give external poolers a kind of generic mechanism to transparently proxy 
auth requests/responses, so that authentication would be fully managed by 
postgres and that would be the only place where changes in auth methods should 
be done. Yes, in this case connection pooler actually behaves like man in the 
middle so it should be done very carefully but it seems that there is no other 
way.


--
May the force be with you…
https://simply.name

Re: [HACKERS] Broken hint bits (freeze)

2017-06-12 Thread Vladimir Borodin


> 12 июня 2017 г., в 13:19, Amit Kapila  написал(а):
> 
> On Sun, Jun 11, 2017 at 11:59 PM, Vladimir Borodin  <mailto:r...@simply.name>> wrote:
>> 
>> 8 июня 2017 г., в 17:03, Amit Kapila  написал(а):
>> 
>> On Thu, Jun 8, 2017 at 6:49 PM, Dmitriy Sarafannikov
>>  wrote:
>> 
>> 
>> Why didn't rsync made the copies on master and replica same?
>> 
>> 
>> Because rsync was running with —size-only flag.
>> 
>> 
>> IIUC the situation, the new WAL and updated pg_control file has been
>> copied, but not updated data files due to which the WAL has not been
>> replayed on replicas?  If so, why the pg_control file is copied, it's
>> size shouldn't have changed?
>> 
>> 
>> Because on master pg_upgrade moves $prefix/9.5/data/global/pg_control to
>> $prefix/9.5/data/global/pg_control.old and creates new
>> $prefix/9.6/data/global/pg_control without making hardlink. When running
>> rsync from master to replica rsync sees $prefix/9.6/data/global/pg_control
>> on master and checks if it is a hardlink. Since it is not a hardlink and
>> $prefix/9.6/data/global/pg_control does not exist on replica rsync copies
>> it. For data files the logic is different since they are hardlinks,
>> corresponding files exist on replica and they are the same size.
>> 
> 
> Okay, in that case, I guess it is better to run Analyze on master
> after the upgrade is complete (including an upgrade for replicas).  If
> you are worried about the performance of read-only replicas till the
> time Analyze on the master in completed, you might want to use
> --analyze-in-stages of vaccumdb and or use (-j njobs) along with it to
> parallelize the operation.

What about the following sequence?

1. Run pg_upgrade on master,
2. Start it in single-user mode and stop (to get right wal_level in pg_control),
3. Copy pg_control somewhere,
4. Start master, run analyze and stop.
5. Put the control file from step 3 to replicas and rsync them according to the 
documentation.

And I think that step 10.f in the documentation [1] should be fixed to mention 
starting in single-user mode or with disabled autovacuum.

[1] https://www.postgresql.org/docs/devel/static/pgupgrade.html

> 
> 
> -- 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com <http://www.enterprisedb.com/>

--
May the force be with you…
https://simply.name

Re: [HACKERS] Broken hint bits (freeze)

2017-06-11 Thread Vladimir Borodin

> 8 июня 2017 г., в 17:03, Amit Kapila  написал(а):
> 
> On Thu, Jun 8, 2017 at 6:49 PM, Dmitriy Sarafannikov
>  wrote:
>> 
>>> Why didn't rsync made the copies on master and replica same?
>> 
>> Because rsync was running with —size-only flag.
>> 
> 
> IIUC the situation, the new WAL and updated pg_control file has been
> copied, but not updated data files due to which the WAL has not been
> replayed on replicas?  If so, why the pg_control file is copied, it's
> size shouldn't have changed?

Because on master pg_upgrade moves $prefix/9.5/data/global/pg_control to 
$prefix/9.5/data/global/pg_control.old and creates new 
$prefix/9.6/data/global/pg_control without making hardlink. When running rsync 
from master to replica rsync sees $prefix/9.6/data/global/pg_control on master 
and checks if it is a hardlink. Since it is not a hardlink and 
$prefix/9.6/data/global/pg_control does not exist on replica rsync copies it. 
For data files the logic is different since they are hardlinks, corresponding 
files exist on replica and they are the same size.

> 
> -- 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
May the force be with you…
https://simply.name

Re: [HACKERS] Broken hint bits (freeze)

2017-06-07 Thread Vladimir Borodin


> 6 июня 2017 г., в 23:30, Sergey Burladyan  написал(а):
> 
> Dmitriy Sarafannikov  writes:
> 
>> Starting and stopping master after running pg_upgrade but before rsync to 
>> collect statistics
>> was a bad idea.
> 
> But, starting and stopping master after running pg_upgrade is *required*
> by documentation:
> https://www.postgresql.org/docs/9.6/static/pgupgrade.html
>> f. Start and stop the new master cluster
>> In the new master cluster, change wal_level to replica in the 
>> postgresql.conf file and then start and stop the cluster.
> 
> and there is no any suggestion to disable autovacuum for it.


Yep. This should probably be fixed in the documentation?


--
May the force be with you…
https://simply.name

Re: [HACKERS] Broken hint bits (freeze)

2017-05-27 Thread Vladimir Borodin


> 27 мая 2017 г., в 19:56, Andres Freund  написал(а):
> 
> On 2017-05-27 19:48:24 +0300, Vladimir Borodin wrote:
>> Well, actually clean shutdown of master with exit code 0 from `pg_ctl
>> stop -m fast` guarantees that all WAL has been replicated to standby.
> 
> It does not.  It makes it likely, but the connection to the standby
> could be not up just then, you could run into walsender timeout, and a
> bunch of other scenarios.

AFAIK in this case exit code would not be zero. Even if archiver has not been 
able to archive all WALs before timeout for shutting down happened, exit code 
will not be zero.

> 
> 
>> But just in case we also check that "Latest checkpoint's REDO
>> location" from control file on old master after shutdown is less than
>> pg_last_xlog_replay_location() on standby to be promoted.
> 
> The *redo* location? Or the checkpoint location itself?  Because the
> latter is what needs to be *equal* than the replay location not less
> than.  Normally there won't be other records inbetween, but that's not
> guaranteed.

I've asked about it some time ago [1]. In that case checkpoint location and 
redo location were equal after shutdown and last replay location on standby was 
higher on 104 bytes (the size of shutdown checkpoint record).

But we do check exactly redo location. Should we change it for checking 
checkpoint location?

[1] 
https://www.postgresql.org/message-id/A7683985-2EC2-40AD-AAAC-B44BD0F29723%40simply.name

> 
> 
>> And if something would go wrong in above logic, postgres will not let you 
>> attach old master as a standby of new master. So it is highly probable not a 
>> setup problem.
> 
> There's no such guarantee.  There's a bunch of checks that'll somewhat
> likely trigger, but nothing more than that.
> 
> - Andres


--
May the force be with you…
https://simply.name

Re: [HACKERS] Broken hint bits (freeze)

2017-05-27 Thread Vladimir Borodin


> 26 мая 2017 г., в 21:39, Amit Kapila  написал(а):
> 
>> And LSN on replica is greater that LSN on master (838D/C4A0D280 > 
>> 8092/6A26DD08)
>> How can this be possible?
>> 
> 
> Yeah, I think this is quite suspicious.  This seems to indicate that
> not all WAL records are replicated before the switchover.  What is the
> value of "synchronous_commit" you are using?

synchronous_commit = on.

>  I think you somehow need
> to ensure before switchover that all the WAL is replicated to ensure
> this is not a setup problem.

Well, actually clean shutdown of master with exit code 0 from `pg_ctl stop -m 
fast` guarantees that all WAL has been replicated to standby. But just in case 
we also check that "Latest checkpoint's REDO location" from control file on old 
master after shutdown is less than pg_last_xlog_replay_location() on standby to 
be promoted.

And if something would go wrong in above logic, postgres will not let you 
attach old master as a standby of new master. So it is highly probable not a 
setup problem.

--
May the force be with you…
https://simply.name

Re: [HACKERS] Broken hint bits (freeze)

2017-05-26 Thread Vladimir Borodin

> 26 мая 2017 г., в 23:04, Michael Paquier  
> написал(а):
> 
> On Fri, May 26, 2017 at 2:39 PM, Amit Kapila  wrote:
>> Yeah, I think this is quite suspicious.  This seems to indicate that
>> not all WAL records are replicated before the switchover.  What is the
>> value of "synchronous_commit" you are using?  I think you somehow need
>> to ensure before switchover that all the WAL is replicated to ensure
>> this is not a setup problem.
> 
> It is so easy to corrupt a server because of an incorrect base backup
> flow or an incorrect switchover that it would be good first to
> understand how you are doing your switchover. Any corruption happening
> after a promotion, a failover or a switchover may be the top of the
> iceberg of what's on the data pages, and you may just see one symptom
> among other problems. Particularly, did you kill the master in any
> violent way after promoting the standby? Has the former master been
> able to perform at least once a clean shutdown checkpoint and has it
> been rewound?

At first we cleanly stop the former master and then we extra check that all 
data has been received by the standby to be promoted. Old master is returned by 
simply generating recovery.conf and starting it, all other standbys are done in 
the same way. WAL history on all hosts and in archive remains linear, no 
pg_rewind is needed. This procedure has been well tested on 9.3 (when the 
ability to do such switchover without restoring standbys from backup appeared), 
automated and has not changed for all these years.

Actually we have already found that LSNs of all corrupted tuples are somewhere 
near the time of upgrade from 9.5 to 9.6. There is still no evidence but it 
seems that it is mostly related to upgrade procedure. We now extract backups of 
9.5 and 9.6 for a database where we now have corrupted pages to check this 
version. But it is still not obvious if it could be a pg_upgrade bug or our 
mistake in a way we did upgrade.

> -- 
> Michael

--
May the force be with you…
https://simply.name

Re: [HACKERS] Broken hint bits (freeze)

2017-05-24 Thread Vladimir Borodin


> 24 мая 2017 г., в 15:44, Robert Haas  написал(а):
> 
> On Wed, May 24, 2017 at 7:27 AM, Dmitriy Sarafannikov
>  wrote:
>> It seems like replica did not replayed corresponding WAL records.
>> Any thoughts?
> 
> heap_xlog_freeze_page() is a pretty simple function.  It's not
> impossible that it could have a bug that causes it to incorrectly skip
> records, but it's not clear why that wouldn't affect many other replay
> routines equally, since the pattern of using the return value of
> XLogReadBufferForRedo() to decide what to do is widespread.
> 
> Can you prove that other WAL records generated around the same time as
> the freeze record *were* replayed on the master?  If so, that proves
> that this isn't just a case of the WAL never reaching the standby.
> Can you look at the segment that contains the relevant freeze record
> with pg_xlogdump?  Maybe that record is messed up somehow.

Not yet. Most of such cases are long before our recovery window so 
corresponding WALs have been deleted. We have already tuned retention policy 
and we are now looking for a fresh case.

> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


--
May the force be with you…
https://simply.name

Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-02 Thread Vladimir Borodin

Hi.

> 25 апр. 2017 г., в 18:13, Dmitriy Sarafannikov  
> написал(а):
> 
> I'd like to propose to search min and max value in index with SnapshotAny in 
> get_actual_variable_range function.
> Current implementation scans index with SnapshotDirty which accepts 
> uncommitted rows and rejects dead rows.

The patch is already being discussed here [1].

[1] 
https://www.postgresql.org/message-id/05C72CF7-B5F6-4DB9-8A09-5AC897653113%40yandex.ru



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PROPOSAL] Use SnapshotAny in get_actual_variable_range

2017-05-02 Thread Vladimir Borodin

> 29 апр. 2017 г., в 17:34, Tom Lane  написал(а):
> 
> Dmitriy Sarafannikov  writes:
>>> Maybe we need another type of snapshot that would accept any
>>> non-vacuumable tuple.  I really don't want SnapshotAny semantics here,
> 
>> If I understood correctly, this new type of snapshot would help if
>> there are long running transactions which can see this tuples.
>> But if there are not long running transactions, it will be the same.
>> Am i right?
> 
> Right.  You haven't shown us much about the use-case you're concerned
> with, so it's not clear what's actually needed.

The use case is nearly the same as the way to reproduce the problem described 
in the first letter. It’s an OLTP database with short mostly read-only queries 
(~ 6k rps). Every 10 minutes new data is inserted (~5-10% of rows in 
polygon_table) and old is deleted (~ 5-10% or rows in polygon_table). Insert 
and delete are made in different transactions. Until the vacuum after delete is 
finished the planning time is two orders of magnitude is higher than usually. 
If we use prepared statements the problem doesn’t reproduce since planning is 
not actually done.

> 
>> And what about don’t fetch actual min and max values from indexes
>> whose columns doesn’t involved in join? 
> 
> We don't fetch that info unless we need it.

We used to think that it’s actually not so (there was a problem in our test 
case), but we rechecked and it is, the planner doesn’t find min and max in 
unneeded for join indexes.

> 
> I'm not entirely certain, but there could be cases where a single
> planning cycle ends up fetching that data more than once.  (There's
> caching at the RestrictInfo level, but that might not be enough.)
> So a line of thought that might be worth looking into is adding a
> lower layer of caching to make sure it's not done more than once per
> plan.  Again, whether this saves anything would depend a lot on
> specific use-cases.
> 
>   regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Vladimir Borodin


> 6 февр. 2017 г., в 4:57, Peter Geoghegan  написал(а):
> 
> I meant that I find the fact that there were no user reports in all
> these years to be a good reason to not proceed for now in this
> instance.

Well, we had some strange situations with indexes (see below for example) but I 
couldn’t even think that CIC is the problem.

And it is really difficult to give diagnostics for problems of such kind. 
Because 1. you may see the problem several days after last major change in the 
database and 2. you don’t even know how to start investigating the problem.

xdb314g/maildb M # show enable_indexscan ;
 enable_indexscan
--
 off
(1 row)

Time: 0.120 ms
xdb314g/maildb M #  select * from mail.folders where uid=448300241 and fid=1;
-[ RECORD 1 ]---+--
uid | 448300241
fid | 1
<...>

Time: 30398.637 ms
xdb314g/maildb M # set enable_indexscan to true;
SET
Time: 0.111 ms
xdb314g/maildb M #  select * from mail.folders where uid=448300241 and fid=1;
(0 rows)

Time: 0.312 ms

xdb314g/maildb M #

The row of course hasn’t been deleted.

--
May the force be with you…
https://simply.name

Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-01-30 Thread Vladimir Borodin

> 31 янв. 2017 г., в 9:50, Michael Paquier  
> написал(а):
> 
> On Mon, Jan 30, 2017 at 4:28 PM, Andrew Borodin  wrote:
>> I'll summarize here my state of studying concurrent methods of page 
>> unlinking.
>> 
>> GIN B-tree does not have "high key". That means, that rightmost key on
>> a page is maximal for the non-leaf page.
>> But I do not see anything theoretical in a way of implementation of
>> Lanin and Shasha`s methods of page merging, with slight modifications.
>> Their paper does not even mention high key(high fence key in papers by
>> Goetz Graefe).
>> 
>> But it's not a simple task due to large portions of shared code
>> between entry tree and posting tree.
>> 
>> Also, I do not see a reason why this method can be practically
>> superior to proposed patch.
>> 
>> Currently, I do not have resources to implement a proof of concept for
>> fully concurrent page unlinking to make benchmarking.
> 
> I am marking this patch as returned with feedback.

Michael, sorry, but why? If I understood everything right, the main question 
from Jeff was why is it implemented in such way? And Jeff wanted to see other 
ways of solving the problem. Andrew wrote about them above and it seems that 
implementing them would be quite expensive and without any obvious win. I would 
rather expect some reaction from Jeff or may be someone else, so shouldn’t it 
be marked as «Ready for committer» or at least «Moved to next CF»?

> -- 
> Michael

--
May the force be with you…
https://simply.name

Re: [HACKERS] Checksums by default?

2017-01-22 Thread Vladimir Borodin

> 21 янв. 2017 г., в 18:18, Petr Jelinek  
> написал(а):
> 
> On 21/01/17 11:39, Magnus Hagander wrote:
>> Is it time to enable checksums by default, and give initdb a switch to
>> turn it off instead?

+1

> 
> I'd like to see benchmark first, both in terms of CPU and in terms of
> produced WAL (=network traffic) given that it turns on logging of hint bits.

Here are the results of my testing for 9.4 in December 2014. The benchmark was 
done on a production like use case when all the data fits in memory (~ 20 GB) 
and doesn’t fit into shared_buffers (it was intentionally small - 128 MB on 
stress stend), so that shared_blk_read / shared_blk_hit = 1/4. The workload was 
a typical OLTP but with mostly write queries (80% writes, 20% reads).

Here are the number of WALs written during the test:
Defaults263
wal_log_hints   410
checksums   367

I couldn’t find the answer why WAL write amplification is even worse for 
wal_log_hints than for checksums but I checked several times and it always 
reproduced.

As for CPU I couldn’t see the overhead [1]. And perf top showed me less then 2% 
in calculating CRC.

For all new DBs we now enable checksums at initdb and several dozens of our 
shards use checksums now. I don’t see any performance difference for them 
comparing with non-checksumed clusters. And we have already had one case when 
we caught data corruption with checksums.

[1] https://yadi.sk/i/VAiWjv6t3AQCs2?lang=en

--
May the force be with you…
https://simply.name

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] tzdata 2016j

2016-12-06 Thread Vladimir Borodin


> 6 дек. 2016 г., в 19:19, David Fetter  написал(а):
> 
> On Tue, Dec 06, 2016 at 10:52:47AM -0500, Tom Lane wrote:
>> Vladimir Borodin  writes:
>>> Any chance to get tzdata 2016j in supported branches?
>> 
>> When the next scheduled releases come around (February), we'll update
>> to whatever tzdata is current at that time.
> 
> I'm guessing that request came through because Vladimir is actually
> affected by the change.
> 
> Apparently, you can replace the tzdata file and restart the server per
> https://wiki.postgresql.org/wiki/FAQ#Will_PostgreSQL_handle_recent_daylight_saving_time_changes_in_various_countries.3F
>  
> <https://wiki.postgresql.org/wiki/FAQ#Will_PostgreSQL_handle_recent_daylight_saving_time_changes_in_various_countries.3F>

Yep, both packages for RHEL and Ubuntu are built with 
--with-system-tzdata=/usr/share/zoneinfo configure option so updating system 
package and restarting postgres is sufficient. Thanks!


--
May the force be with you…
https://simply.name

[HACKERS] tzdata 2016j

2016-12-06 Thread Vladimir Borodin

Hi all.

Any chance to get tzdata 2016j in supported branches?

--
May the force be with you…
https://simply.name

Re: [HACKERS] GIN non-intrusive vacuum of posting tree

2016-11-28 Thread Vladimir Borodin


> 28 нояб. 2016 г., в 20:31, Andrew Borodin  написал(а):
> 
> This patch solved a problem encountered by Evgeniy Efimkin and
> Vladimir Borodin from Yandex.Mail.
> 
> and eventually deleting some of data. This testbed showed VACUUM
> holding inserts for up to tenths of seconds. They claim that proposed
> patch made vacuum invisible in this test.
> 
> Evgeniy, Vladimir, if I missed something or you have something to add,
> please join discussion.

Yep, in our load environment the table is not so big (~ 50 GB), GIN index size 
is ~ 1 GB. And under our load profile we have seen 90 seconds of impossibility 
to do an insert into the table because of vacuuming this index. I confirm that 
with this patch we now don’t see any spikes of errors as it was previously.

--
May the force be with you…
https://simply.name

Re: [HACKERS] [PERFORM] 9.4 -> 9.5 regression with queries through pgbouncer on RHEL 6

2016-07-04 Thread Vladimir Borodin


> 13 июня 2016 г., в 21:58, Vladimir Borodin  написал(а):
> 
>> 
>> 13 июня 2016 г., в 0:51, Andres Freund > <mailto:and...@anarazel.de>> написал(а):
>> 
>> Hi Vladimir,
>> 
>> Thanks for these reports.
>> 
>> On 2016-06-13 00:42:19 +0300, Vladimir Borodin wrote:
>>> perf report -g -i pg9?_all.data >/tmp/pg9?_perf_report.txt
>> 
>> Any chance you could redo the reports with --no-children --call-graph=fractal
>> added? The mode that includes child overheads unfortunately makes the
>> output hard to interpet/compare.
> 
> Of course. Not sure if that is important but I upgraded perf for that 
> (because --no-children option was introduced in ~3.16), so perf record and 
> perf report were done with different perf versions.
> 
> 
> 
> 
> 
> Also I’ve done the same test on same host (RHEL 6) but with 4.6 kernel/perf 
> and writing perf data to /dev/shm for not loosing events. Perf report output 
> is also attached but important thing is that the regression is not so 
> significant:
> 
> root@pgload05g ~ # uname -r
> 4.6.0-1.el6.elrepo.x86_64
> root@pgload05g ~ # cat /proc/sys/kernel/sched_autogroup_enabled
> 1
> root@pgload05g ~ # /tmp/run.sh
> RHEL 69.4 71634   0.893
> RHEL 69.5 54005   1.185
> RHEL 69.6 65550   0.976
> root@pgload05g ~ # echo 0 >/proc/sys/kernel/sched_autogroup_enabled
> root@pgload05g ~ # /tmp/run.sh
> RHEL 69.4 73041   0.876
> RHEL 69.5 60105   1.065
> RHEL 69.6 67984   0.941
> root@pgload05g ~ #
> 
> 
> 
> 

Andres, is there any chance that you would find time to look at those results? 
Are they actually useful?

> 
> 
>> 
>>> The results from pg9?_perf_report.txt are attached. Note that in all cases 
>>> some events were lost, i.e.:
>>> 
>>> root@pgload05g ~ # perf report -g -i pg94_all.data 
>>> >/tmp/pg94_perf_report.txt
>>> Failed to open [vsyscall], continuing without symbols
>>> Warning:
>>> Processed 537137 events and lost 7846 chunks!
>> 
>> You can reduce the overhead by reducing the sampling frequency, e.g. by
>> specifying -F 300.
>> 
>> Greetings,
>> 
>> Andres Freund
>> 
>> 
>> -- 
>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org 
>> <mailto:pgsql-hackers@postgresql.org>)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers 
>> <http://www.postgresql.org/mailpref/pgsql-hackers>
> 
> 
> --
> May the force be with you…
> https://simply.name <https://simply.name/>

--
May the force be with you…
https://simply.name

Re: [HACKERS] 9.4 -> 9.5 regression with queries through pgbouncer on RHEL 6

2016-05-31 Thread Vladimir Borodin

27 мая 2016 г., в 19:57, Vladimir Borodin <r...@simply.name> написал(а):-performance+hackers25 мая 2016 г., в 17:33, Vladimir Borodin <r...@simply.name> написал(а):Hi all.We have found that queries through PgBouncer 1.7.2 (with transaction pooling) to local PostgreSQL are almost two times slower in 9.5.3 than in 9.4.8 on RHEL 6 hosts (all packages are updated to last versions). Meanwhile the problem can’t be reproduced i.e. on Ubuntu 14.04 (also fully-updated).Here is how the results look like for 9.4, 9.5 and 9.6. All are built from latest commits on yesterday in	* REL9_4_STABLE (a0cc89a28141595d888d8aba43163d58a1578bfb),	* REL9_5_STABLE (e504d915bbf352ecfc4ed335af934e799bf01053),	* master (6ee7fb8244560b7a3f224784b8ad2351107fa55d).All of them are build on the host where testing is done (with stock gcc versions). Sysctls, pgbouncer config and everything we found are the same, postgres configs are default, PGDATA is in tmpfs. All numbers are reproducible, they are stable between runs.Shortly:OS			PostgreSQL version	TPS			Avg. latencyRHEL 6		9.4	44898		1.425 msRHEL 6		9.5	26199		2.443 msRHEL 6		9.5	43027		1.487 msUbuntu 14.04	9.4	67458		0.949 msUbuntu 14.04	9.5	64065		0.999 msUbuntu 14.04	9.6	64350		0.995 msThe results above are not really fair, pgbouncer.ini was a bit different on Ubuntu host (application_name_add_host was disabled). Here are the right results with exactly the same configuration:OS			PostgreSQL version	TPS			Avg. latencyRHEL 6		9.4	44898		1.425 msRHEL 6		9.5	26199		2.443 msRHEL 6		9.5	43027		1.487 msUbuntu 14.04	9.4	45971		1.392 msUbuntu 14.04	9.5	40282		1.589 msUbuntu 14.04	9.6	45410		1.409 msIt can be seen that there is a regression for 9.5 in Ubuntu also, but not so significant. We first thought that the reason is 38628db8d8caff21eb6cf8d775c0b2d04cf07b9b (Add memory barriers for PgBackendStatus.st_changecount protocol), but in that case the regression should also be seen in 9.6 also.There also was a bunch of changes in FE/BE communication (like 387da18874afa17156ee3af63766f17efb53c4b9 or 98a64d0bd713cb89e61bef6432befc4b7b5da59e) and that may answer the question of regression in 9.5 and normal results in 9.6. Probably the right way to find the answer is to do bisect. I’ll do it but if some more diagnostics information can help, feel free to ask about it.Yep, bisect confirms that the first bad commit in REL9_5_STABLE is 387da18874afa17156ee3af63766f17efb53c4b9. Full output is attached.And bisect for master branch confirms that the situation became much better after 98a64d0bd713cb89e61bef6432befc4b7b5da59e. Output is also attached.On Ubuntu performance degradation is ~15% and on RHEL it is ~100%. I don’t know what is the cause for different numbers on RHEL and Ubuntu but certainly there is a regression when pgbouncer is connected to postgres through localhost. When I try to connect pgbouncer to postgres through unix-socket performance is constantly bad on all postgres versions.Both servers are for testing but I can easily provide you SSH access only to Ubuntu host if necessary. I can also gather more diagnostics if needed.

bisect95.out
Description: Binary data


bisect96.out
Description: Binary data
You could see that the difference between major versions on Ubuntu is not significant, but on RHEL 9.5 is 70% slower than 9.4 and 9.6.Below are more details.RHEL 6:postgres@pgload05g ~ $ /usr/lib/postgresql/9.4/bin/pgbench -U postgres -T 60 -j 64 -c 64 -S -n 'host=localhost port=6432 dbname=pg94'transaction type: SELECT onlyscaling factor: 100query mode: simplenumber of clients: 64number of threads: 64duration: 60 snumber of transactions actually processed: 2693962latency average: 1.425 mstps = 44897.461518 (including connections establishing)tps = 44898.763258 (excluding connections establishing)postgres@pgload05g ~ $ /usr/lib/postgresql/9.4/bin/pgbench -U postgres -T 60 -j 64 -c 64 -S -n 'host=localhost port=6432 dbname=pg95'transaction type: SELECT onlyscaling factor: 100query mode: simplenumber of clients: 64number of threads: 64duration: 60 snumber of transactions actually processed: 1572014latency average: 2.443 mstps = 26198.928627 (including connections establishing)tps = 26199.803363 (excluding connections establishing)postgres@pgload05g ~ $ /usr/lib/postgresql/9.4/bin/pgbench -U postgres -T 60 -j 64 -c 64 -S -n 'host=localhost port=6432 dbname=pg96'transaction type: SELECT onlyscaling factor: 100query mode: simplenumber of clients: 64number of threads: 64duration: 60 snumber of transactions actually processed: 2581645latency average: 1.487 mstps = 43025.676995 (including connections establishing)tps = 43027.038275 (excluding connections establishing)postgres@pgload05g ~ $Ubuntu 14.04 (the same hardware):postgres@pgloadpublic02:~$ /usr/lib/postgresql/9.4/bin/pgbench -U postgres -T 60 -j 64 -c 64 -S -n 'host=localhost port=6432 dbname=pg94'transaction type: SELECT onlyscaling factor:

Re: [HACKERS] 9.4 -> 9.5 regression with queries through pgbouncer on RHEL 6

2016-05-30 Thread Vladimir Borodin


> 28 мая 2016 г., в 0:56, Andres Freund  написал(а):
> 
> On 2016-05-27 19:57:34 +0300, Vladimir Borodin wrote:
>> 
>> OS   PostgreSQL version  TPS Avg. 
>> latency
>> RHEL 6   9.4 44898   
>> 1.425 ms
>> RHEL 6   9.5 26199   
>> 2.443 ms
>> RHEL 6   9.5 43027   
>> 1.487 ms
> 
> Hm. I'm a bit confused. You show one result for 9.5 with bad and one
> with good performance. I suspect the second one is supposed to be a 9.6?

Sorry, I misunderstood. Yes, the last line above is for 9.6, that was a typo.

> 
> Greetings,
> 
> Andres Freund


--
May the force be with you…
https://simply.name

Re: [HACKERS] 9.4 -> 9.5 regression with queries through pgbouncer on RHEL 6

2016-05-30 Thread Vladimir Borodin


> 28 мая 2016 г., в 0:56, Andres Freund  написал(а):
> 
> Hi,
> 
> 
> On 2016-05-27 19:57:34 +0300, Vladimir Borodin wrote:
>> -performance
>>> Here is how the results look like for 9.4, 9.5 and 9.6. All are built from 
>>> latest commits on yesterday in
>>> * REL9_4_STABLE (a0cc89a28141595d888d8aba43163d58a1578bfb),
>>> * REL9_5_STABLE (e504d915bbf352ecfc4ed335af934e799bf01053),
>>> * master (6ee7fb8244560b7a3f224784b8ad2351107fa55d).
>>> 
>>> All of them are build on the host where testing is done (with stock gcc 
>>> versions). Sysctls, pgbouncer config and everything we found are the same, 
>>> postgres configs are default, PGDATA is in tmpfs. All numbers are 
>>> reproducible, they are stable between runs.
>>> 
>>> Shortly:
>>> 
>>> OS  PostgreSQL version  TPS Avg. 
>>> latency
>>> RHEL 6  9.4 44898   
>>> 1.425 ms
>>> RHEL 6  9.5 26199   
>>> 2.443 ms
>>> RHEL 6  9.5 43027   
>>> 1.487 ms
>>> Ubuntu 14.049.4 67458   
>>> 0.949 ms
>>> Ubuntu 14.049.5 64065   
>>> 0.999 ms
>>> Ubuntu 14.049.6 64350   
>>> 0.995 ms
>> 
>> The results above are not really fair, pgbouncer.ini was a bit different on 
>> Ubuntu host (application_name_add_host was disabled). Here are the right 
>> results with exactly the same configuration:
>> 
>> OS   PostgreSQL version  TPS Avg. 
>> latency
>> RHEL 6   9.4 44898   
>> 1.425 ms
>> RHEL 6   9.5 26199   
>> 2.443 ms
>> RHEL 6   9.5 43027   
>> 1.487 ms
>> Ubuntu 14.04 9.4 45971   1.392 ms
>> Ubuntu 14.04 9.5 40282   1.589 ms
>> Ubuntu 14.04 9.6 45410   1.409 ms
> 
> Hm. I'm a bit confused. You show one result for 9.5 with bad and one
> with good performance. I suspect the second one is supposed to be a 9.6?

No, they are both for 9.5. One of them is on RHEL 6 host, another one on Ubuntu 
14.04.

> 
> Am I understanding correctly that the performance near entirely
> recovered with 9.6?

Yes, 9.6 is much better than 9.5.

> If so, I suspect we might be dealing with a memory
> alignment issue. Do the 9.5 results change if you increase
> max_connections by one or two (without changing anything else)?

Results with max_connections=100:

OS  Version TPS Avg. latency
RHEL 6  9.4 69810   0.917
RHEL 6  9.5 35303   1.812
RHEL 6  9.6 71827   0.891
Ubuntu 14.049.4 76829   0.833
Ubuntu 14.049.5 67574   0.947
Ubuntu 14.049.6 79200   0.808

Results with max_connections=101:

OS  Version TPS Avg. latency
RHEL 6  9.4 70059   0.914
RHEL 6  9.5 35979   1.779
RHEL 6  9.6 71183   0.899
Ubuntu 14.049.4 78934   0.811
Ubuntu 14.049.5 67803   0.944
Ubuntu 14.049.6 79624   0.804


Results with max_connections=102:

OS  Version TPS Avg. latency
RHEL 6  9.4 70710   0.905
RHEL 6  9.5 36615   1.748
RHEL 6  9.6 69742   0.918
Ubuntu 14.049.4 76356   0.838
Ubuntu 14.049.5 66814   0.958
Ubuntu 14.049.6 78528   0.815

Doesn’t seem that it is a memory alignment issue. Also please note that there 
is no performance degradation when connections from pgbench to postgres are 
established without pgbouncer:

OS  Version TPS Avg. latency
RHEL 6  9.4 167427  0.382
RHEL 6  9.5 223674  0.286
RHEL 6  9.6 215580  0.297
Ubuntu 14.049.4 176659  0.362
Ubuntu 14.049.5 248277  0.258
Ubuntu 14.049.6 2458

Re: [HACKERS] 9.4 -> 9.5 regression with queries through pgbouncer on RHEL 6

2016-05-27 Thread Vladimir Borodin

-performance
+hackers

> 25 мая 2016 г., в 17:33, Vladimir Borodin  написал(а):
> 
> Hi all.
> 
> We have found that queries through PgBouncer 1.7.2 (with transaction pooling) 
> to local PostgreSQL are almost two times slower in 9.5.3 than in 9.4.8 on 
> RHEL 6 hosts (all packages are updated to last versions). Meanwhile the 
> problem can’t be reproduced i.e. on Ubuntu 14.04 (also fully-updated).
> 
> Here is how the results look like for 9.4, 9.5 and 9.6. All are built from 
> latest commits on yesterday in
>   * REL9_4_STABLE (a0cc89a28141595d888d8aba43163d58a1578bfb),
>   * REL9_5_STABLE (e504d915bbf352ecfc4ed335af934e799bf01053),
>   * master (6ee7fb8244560b7a3f224784b8ad2351107fa55d).
> 
> All of them are build on the host where testing is done (with stock gcc 
> versions). Sysctls, pgbouncer config and everything we found are the same, 
> postgres configs are default, PGDATA is in tmpfs. All numbers are 
> reproducible, they are stable between runs.
> 
> Shortly:
> 
> OSPostgreSQL version  TPS Avg. 
> latency
> RHEL 69.4 44898   
> 1.425 ms
> RHEL 69.5 26199   
> 2.443 ms
> RHEL 69.5 43027   
> 1.487 ms
> Ubuntu 14.04  9.4 67458   0.949 ms
> Ubuntu 14.04  9.5 64065   0.999 ms
> Ubuntu 14.04  9.6 64350   0.995 ms

The results above are not really fair, pgbouncer.ini was a bit different on 
Ubuntu host (application_name_add_host was disabled). Here are the right 
results with exactly the same configuration:

OS  PostgreSQL version  TPS Avg. 
latency
RHEL 6  9.4 44898   1.425 ms
RHEL 6  9.5 26199   2.443 ms
RHEL 6  9.5 43027   1.487 ms
Ubuntu 14.049.4 45971   1.392 ms
Ubuntu 14.049.5 40282   1.589 ms
Ubuntu 14.049.6 45410   1.409 ms

It can be seen that there is a regression for 9.5 in Ubuntu also, but not so 
significant. We first thought that the reason is 
38628db8d8caff21eb6cf8d775c0b2d04cf07b9b (Add memory barriers for 
PgBackendStatus.st_changecount protocol), but in that case the regression 
should also be seen in 9.6 also.

There also was a bunch of changes in FE/BE communication (like 
387da18874afa17156ee3af63766f17efb53c4b9 or 
98a64d0bd713cb89e61bef6432befc4b7b5da59e) and that may answer the question of 
regression in 9.5 and normal results in 9.6. Probably the right way to find the 
answer is to do bisect. I’ll do it but if some more diagnostics information can 
help, feel free to ask about it.

> 
> You could see that the difference between major versions on Ubuntu is not 
> significant, but on RHEL 9.5 is 70% slower than 9.4 and 9.6.
> 
> Below are more details.
> 
> RHEL 6:
> 
> postgres@pgload05g ~ $ /usr/lib/postgresql/9.4/bin/pgbench -U postgres -T 60 
> -j 64 -c 64 -S -n 'host=localhost port=6432 dbname=pg94'
> transaction type: SELECT only
> scaling factor: 100
> query mode: simple
> number of clients: 64
> number of threads: 64
> duration: 60 s
> number of transactions actually processed: 2693962
> latency average: 1.425 ms
> tps = 44897.461518 (including connections establishing)
> tps = 44898.763258 (excluding connections establishing)
> postgres@pgload05g ~ $ /usr/lib/postgresql/9.4/bin/pgbench -U postgres -T 60 
> -j 64 -c 64 -S -n 'host=localhost port=6432 dbname=pg95'
> transaction type: SELECT only
> scaling factor: 100
> query mode: simple
> number of clients: 64
> number of threads: 64
> duration: 60 s
> number of transactions actually processed: 1572014
> latency average: 2.443 ms
> tps = 26198.928627 (including connections establishing)
> tps = 26199.803363 (excluding connections establishing)
> postgres@pgload05g ~ $ /usr/lib/postgresql/9.4/bin/pgbench -U postgres -T 60 
> -j 64 -c 64 -S -n 'host=localhost port=6432 dbname=pg96'
> transaction type: SELECT only
> scaling factor: 100
> query mode: simple
> number of clients: 64
> number of threads: 64
> duration: 60 s
> number of transactions actually processed: 2581645
> latency average: 1.487 ms
> tps = 43025.676995 (including connections establishing)
> tps = 43027.038275 (excluding connections establishing)
> postgres@pgload05g ~ $
> 
> Ubuntu 1

Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2016-03-25 Thread Vladimir Borodin

> 25 марта 2016 г., в 19:11, David Steele  написал(а):
> 
> Hi Vladimir,
> 
> On 3/14/16 2:15 PM, Vladimir Borodin wrote:
> 
>> JFYI, I’m preparing the stand to reproduce the initial problem and I
>> hope to finish testing this week.
> 
> Do you know when you'll have the results from the testing you were going to 
> do?  It seems this patch is currently waiting on that to be finished.

I couldn’t reproduce the problem on pgbench database with scale factor 5 
last week. The test case was quite simple:
1. On master I was adding data to pgbench_accounts table.
2. On standby I was doing the following:
postgres@pgload01d ~ $ cat /tmp/query
\set naccounts 10 * :scale
SELECT aid FROM pgbench_accounts WHERE aid = :naccounts;
postgres@pgload01d ~ $ /usr/pgsql-9.6/bin/pgbench -M prepared -f /tmp/query -c 
1 -j 1 -T 3600 -P 10 -S -n pgbench
3. On master I was sometimes calling VACUUM pgbench_accounts.

Without applying patches there weren’t huge replication lags on standbys. 
Seems, that I'm doing something wrong… I’m doing my best right now to find the 
reason but I can’t give you any time evaluation :(

> 
> Thanks,
> -- 
> -David
> da...@pgmasters.net
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
May the force be with you…
https://simply.name

Re: [HACKERS] Background Processes and reporting

2016-03-15 Thread Vladimir Borodin


> 15 марта 2016 г., в 19:57, Oleg Bartunov  написал(а):
> 
> 
> 
> On Tue, Mar 15, 2016 at 7:43 PM, Alexander Korotkov 
> mailto:a.korot...@postgrespro.ru>> wrote:
> On Tue, Mar 15, 2016 at 12:57 AM, Robert Haas  > wrote:
> On Mon, Mar 14, 2016 at 4:42 PM, Andres Freund  > wrote:
> > On 2016-03-14 16:16:43 -0400, Robert Haas wrote:
> >> > I have already shown [0, 1] the overhead of measuring timings in linux on
> >> > representative workload. AFAIK, these tests were the only one that showed
> >> > any numbers. All other statements about terrible performance have been 
> >> > and
> >> > remain unconfirmed.
> >>
> >> Of course, those numbers are substantial regressions which would
> >> likely make it impractical to turn this on on a heavily-loaded
> >> production system.
> >
> > A lot of people operating production systems are fine with trading a <=
> > 10% impact for more insight into the system; especially if that
> > configuration can be changed without a restart.  I know a lot of systems
> > that use pg_stat_statements, track_io_timing = on, etc; just to get
> > that. In fact there's people running perf more or less continuously in
> > production environments; just to get more insight.
> >
> > I think it's important to get as much information out there without
> > performance overhead, so it can be enabled by default. But I don't think
> > it makes sense to not allow features in that cannot be enabled by
> > default, *if* we tried to make them cheap enough beforehand.
> 
> Hmm, OK.  I would have expected you to be on the other side of this
> question, so maybe I'm all wet.  One point I am concerned about is
> that, right now, we have only a handful of types of wait events.  I'm
> very interested in seeing us add more, like I/O and client wait.  So
> any overhead we pay here is likely to eventually be paid in a lot of
> places - thus it had better be extremely small.
> 
> OK. Let's start to produce light, not heat.
> 
> As I get we have two features which we suspect to introduce overhead:
> 1) Recording parameters of wait events which requires some kind of 
> synchronization protocol.
> 2) Recording time of wait events because time measurements might be expensive 
> on some platforms.
> 
> Simultaneously there are machines and workloads where both of these features 
> doesn't produce measurable overhead.  And, we're talking not about toy 
> databases. Vladimir is DBA from Yandex which is in TOP-20 (by traffic) 
> internet companies in the world.  They do run both of this features in 
> production highload database without noticing any overhead of them. 
> 
> It would be great progress, if we decide that we could add both of these 
> features controlled by GUC (off by default).
> 
> enable_waits_statistics ?
>  
> 
> If we decide so, then let's start working on this. At first, we should 
> construct list of machines and workloads for testing. Each list of machines 
> and workloads would be not comprehensive. But let's find something that would 
> be enough for testing of GUC controlled, off by default features.  Then we 
> can turn our conversation from theoretical thoughts to particular benchmarks 
> which would be objective and convincing to everybody. 
> 
> Vladimir, could you provide a test suite, so other people could measure 
> overhead on their machines ?

I have somehow described it here [0]. Since the majority of concerns were 
around LWLocks, the plan was to reconstruct a workload under heavy LWLocks 
pressure. This can easily be done even with pgbench in two following scenarios:
1. Put all the data in shared buffers and on tmpfs and run read/write 
test. Contention would be around ProcArrayLock.
2. Put all the data in RAM but not all in shared buffers and run 
read-only test. Contention would be around buffer manager.

IMHO, these two tests are good to be representative and not depend much on 
hardware.

[0] 
http://www.postgresql.org/message-id/eee78e40-0e48-411a-9f90-cf9339da9...@simply.name

> 
> 
>  
> 
> Otherwise, let's just add these features to the list of unwanted 
> functionality and close this question.
> 
> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com 
> 
> The Russian Postgres Company 


--
May the force be with you…
https://simply.name

Re: [HACKERS] Background Processes and reporting

2016-03-14 Thread Vladimir Borodin

> 14 марта 2016 г., в 22:21, Robert Haas  написал(а):
> 
> On Sat, Mar 12, 2016 at 6:05 AM, Oleg Bartunov  wrote:
>>> So?
>> 
>> So, Robert already has experience with the subject, probably,  he has bad
>> experience with edb implementation and he'd like to see something better in
>> community version. That's fair and I accept his position.
> 
> Bingo - though maybe "bad" experience is not quite as accurate as
> "could be better".
> 
>> Wait monitoring is one of the popular requirement of russian companies, who
>> migrated from Oracle. Overwhelming majority of them use Linux, so I suggest
>> to have configure flag for including wait monitoring at compile time
>> (default is no wait monitoring), or have GUC variable, which is also off by
>> default, so we have zero to minimal overhead of monitoring. That way we'll
>> satisfy many enterprises and help them to choose postgres, will get feedback
>> from production use and have time for feature improving.
> 
> So, right now we can only display the wait information in
> pg_stat_activity.  There are a couple of other things that somebody
> might want to do:
> 
> 1. Sample the wait state information across all backends in the
> system.  On a large, busy system, this figures to be quite cheap, and
> the sampling interval could be configurable.
> 
> 2. Count every instance of every wait event in every backend, and roll
> that up either via shared memory or additional stats messges.
> 
> 3. Like #2, but with timing information.
> 
> 4. Like #2, but on a per-query basis, somehow integrated with
> pg_stat_statements.

5. Show extra information about wait event (i.e. exclusive of shared mode for 
LWLocks, relation/forknum/blknum for I/O operations, etc.).

> 
> The challenge with any of these except #1 is that they are going to
> produce a huge volume of data, and, whether you believe it or not, #3
> is going to sometimes be crushingly slow.  Really.  I tend to think
> that #1 might be better than #2 or #3, but I'm not unwilling to listen
> to contrary arguments, especially if backed up by careful benchmarking
> showing that the performance hit is negligible.

I have already shown [0, 1] the overhead of measuring timings in linux on 
representative workload. AFAIK, these tests were the only one that showed any 
numbers. All other statements about terrible performance have been and remain 
unconfirmed.

As for the size of such information it of course should be configurable. I.e. 
in Oracle there is a GUC for the size of ring buffer to store history of 
sampling with extra information about each wait event.

[0] 
http://www.postgresql.org/message-id/eee78e40-0e48-411a-9f90-cf9339da9...@simply.name
[1] 
http://www.postgresql.org/message-id/5f3dd73a-2a85-44bf-9f47-54049a81c...@simply.name

>  My reason for wanting
> to get the stuff we already had committed first is because I have
> found that it is best to proceed with these kinds of problems
> incrementally, not trying to solve too much in a single commit.  Now
> that we have the basics, we can build on it, adding more wait events
> and possibly more recordkeeping for the ones we have already - but
> anything that regresses performance for people not using the feature
> is a dead end in my book, as is anything that introduces overall
> stability risks.

Ok, doing it in short steps seems to be a good plan. Any objections against 
giving people an ability to turn some feature (i.e. notorious measuring 
timings) even if it makes some performance degradation? Of course, it should be 
turned off by default.

> 
> I think the way forward from here is that Postgres Pro should (a)
> rework their implementation to work with what has already been
> committed, (b) consider carefully whether they've done everything
> possible to contain the performance loss, (c) benchmark it on several
> different machines and workloads to see how much performance loss
> there is, and (d) stop accusing me of acting in bad faith.

If anything, I’m not from PostgresPro and I’m not «accusing you». But to be 
honest current committed implementation has been tested exactly on one machine 
with two workloads. And I think, it is somehow unfair to demand more from 
others. Although it doesn’t mean that testing on exactly one machine with only 
one OS is enough, of course. I suppose, you should ask the authors to test it 
on some representative hardware and workload but if authors don’t have them, it 
would be nice to help them with that.

Also it would be really interesting to hear your opinion about the initial 
Andres’s question. Any thoughts about changing current committed implementation?

> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
May the force be with you…
https://simply.name

Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2016-03-14 Thread Vladimir Borodin


> 10 марта 2016 г., в 14:38, Simon Riggs  написал(а):
> 
> On 10 March 2016 at 09:22, Michael Paquier  <mailto:michael.paqu...@gmail.com>> wrote:
> On Thu, Mar 10, 2016 at 10:00 AM, Vladimir Borodin  <mailto:r...@simply.name>> wrote:
> > Let’s do immediately after you will send a new version of your patch? Or
> > even better after testing your patch? Don’t get me wrong, but rejecting my
> > patch without tangible work on your patch may lead to forgiving about the
> > problem before 9.6 freeze.
> 
> This makes sense. Let's not reject this patch yet if the alternative
> approach is not committed.
> 
> I attach 2 patches.
> 
> avoid_pin_scan_always.v1.patch 
> Takes the approach that we generate the same WAL records as in 9.5, we just 
> choose not to do anything with that information. This is possible because we 
> don't care anymore whether it is toast or other relations. So it effectively 
> reverts parts of the earlier patch.
> This could be easily back-patched more easily.
> 
> toast_recheck.v1.patch
> Adds recheck code for toast access. I'm not certain this is necessary, but 
> here it is. No problems found with it.

JFYI, I’m preparing the stand to reproduce the initial problem and I hope to 
finish testing this week.

> 
> -- 
> Simon Riggshttp://www.2ndQuadrant.com/ 
> <http://www.2ndquadrant.com/>
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 


--
May the force be with you…
https://simply.name

Re: [HACKERS] Background Processes and reporting

2016-03-12 Thread Vladimir Borodin


> 12 марта 2016 г., в 13:59, Amit Kapila  написал(а):
> 
> On Sat, Mar 12, 2016 at 3:10 AM, Andres Freund  > wrote:
> >
> >
> > > Similarly for the wait event stuff - checkpointer, wal writer,
> > > background writer are in many cases processes that very often are
> > > blocked on locks, IO and such.  Thus restricting the facility to
> > > database connected processes seems like a loss.
> >
> > I think one way to address this would be to not only report
> > PgBackendStatus type processes in pg_stat_activity. While that'd
> > obviously be a compatibility break, I think it'd be an improvement.
> >
> 
> I think here another point which needs more thoughts is that many of the 
> pg_stat_activity fields are not relevant for background processes, ofcourse 
> one can say that we can keep those fields as NULL, but still I think that 
> indicates it is not the most suitable way to expose such information.
> 
> Another way could be to have new view like pg_stat_background_activity with 
> only relevant fields or try expose via individual views like pg_stat_bgwriter.

From the DBA point of view it is much more convenient to see all wait events in 
one view. I don’t know if it is right to break compability even more, but IMHO 
exposing this data in different views is a bad plan.

> 
> Do you intend to get this done for 9.6 considering an add-on patch for wait 
> event information displayed in pg_stat_activity?
> 
> 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com 

--
May the force be with you…
https://simply.name

Re: [HACKERS] Background Processes and reporting

2016-03-12 Thread Vladimir Borodin


> 12 марта 2016 г., в 2:45, Andres Freund  написал(а):
> 
> On 2016-03-12 02:24:33 +0300, Alexander Korotkov wrote:
>> Idea of individual time measurement of every wait event met criticism
>> because it might have high overhead [1].
> 
> Right. And that's actually one of the point which I meant with "didn't
> listen to criticism". There've been a lot of examples, on an off list,
> where taking timings trigger significant slowdowns.  Yes, in some
> bare-metal environments, which a coherent tsc, the overhead can be
> low. But that doesn't make it ok to have a high overhead on a lot of
> other systems.

That’s why proposal included GUC for that with a default to turn timings 
measuring off. I don’t remember any objections against that.

And I’m absolutely sure that a real highload production (which of course 
doesn’t use virtualization and windows) can’t exist without measuring timings. 
Oracle guys have written several chapters (!) about that [0]. Long story short, 
sampling doesn’t give enough precision. I have shown overhead [1] on bare metal 
linux with high stressed lwlocks worload. BTW Oracle doesn’t give you any ways 
to turn timings measurement off, even with hidden parameters. All other 
commercial databases have waits monitoring with timings measurement. Let’s do 
it and turn it off by default so that all other platforms don’t suffer from it.

[0] 
http://www.amazon.com/Optimizing-Oracle-Performance-Cary-Millsap/dp/059600527X
[1] 
http://www.postgresql.org/message-id/eee78e40-0e48-411a-9f90-cf9339da9...@simply.name

> 
> Just claiming that that's not a problem will only lead to your position
> not being taken serious.
> 
> 
>> This is really so at least for Windows [2].
> 
> Measuring timing overhead for a simplistic workload on a single system
> doesn't mean that.  Try doing such a test on a vmware esx virtualized
> windows machine, on a multi-socket server; in a lot of instances you'll
> see two-three orders of magnitude longer average times; with peaks going
> into 4-5 orders of magnitude.  And, as sad it is, realistically most
> postgres instances will run in virtualized environments.
> 
> 
>> But accessing only current values wouldn't be very useful.  We
>> anyway need to gather some statistics.  Gathering it by sampling would be
>> both more expensive and less accurate for majority of systems.  This is why
>> I proposed hooks to make possible platform dependent extensions.  Robert
>> rejects hook because he is "not a big fan of hooks as a way of resolving
>> disagreements about the design" [3].
> 
> I think I agree with Robert here. Providing hooks into very low level
> places tends to lead to problems in my experience; tight control over
> what happens is often important - I certainly don't want any external
> code to run while we're waiting for an lwlock.
> 
> 
>> Besides that is actually not design issues but platform issues...
> 
> I don't see how that's the case.
> 
> 
>> Another question is wait parameters.  We want to expose wait event with
>> some parameters.  Robert rejects that because it *might* add additional
>> overhead [3]. When I proposed to fit something useful into hard-won
>> 4-bytes, Roberts claims that it is "too clever" [4].
> 
> I think stopping to treat this as "Robert/EDB vs. pgpro" would be a good
> first step to make progress here.
> 
> 
> It seems entirely possible to extend the current API in an incremental
> fashion, either allowing to disable the individual pieces, or providing
> sufficient measurements that it's not needed.
> 
> 
>> So, situation looks like dead-end.  I have no idea how to convince Robert
>> about any kind of advanced functionality of wait monitoring to PostgreSQL.
>> I'm thinking about implementing sampling extension over current
>> infrastructure just to make community see that it sucks. Andres, it would
>> be very nice if you have any idea how to move this situation forward.
> 
> I've had my share of conflicts with Robert. But if I were in his shoes,
> targeted by this kind of rhetoric, I'd be very tempted to just ignore
> any further arguments from the origin.  So I think the way forward is
> for everyone to cool off, and to see how we can incrementally make
> progress from here on.
> 
> 
>> Another aspect is that EnterpriseDB offers waits monitoring in proprietary
>> fork [5].
> 
> So?
> 
> Greetings,
> 
> Andres Freund
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
May the force be with you…
https://simply.name

Re: [HACKERS] Background Processes and reporting

2016-03-11 Thread Vladimir Borodin

> 12 марта 2016 г., в 0:22, Andres Freund  написал(а):
> 
> On 2016-03-11 23:53:15 +0300, Vladimir Borodin wrote:
>> It was many times stated in threads about waits monitoring [0, 1, 2]
>> and supported by different people, but ultimately waits information
>> was stored in PgBackendStatus.
> 
> Only that it isn't. It's stored in PGPROC.  

Sorry, I missed that. So monitoring of wait events for auxiliary processes 
still could be implemented?

> This criticism is true of
> the progress reporting patch, but a quick scan of the thread doesn't
> show authors of the wait events patch participating there.
> 
> 
>> Can’t we think one more time about implementation provided by Ildus
>> and Alexander here [3]?
> 
> I don't think so. Afaics the proposed patch tried to do too many things
> at once, and it's authors didn't listen well to criticism.  Trying to go
> back to that seems like a surefire way to have nothing in 9.6.

The idea is not to try implement all that at once (and more in 9.6) but give an 
ability to implement all that features eventually. If it is still possible, 
it’s great.

> 
> 
>> Seems that current implementation doesn’t give reasonable ways to
>> implement all that features and it is really sad.
> 
> Why is that?

Storing information about wait event in 4 bytes gives an ability to store only 
wait type and event. No way to store duration or extra information (i.e. buffer 
number for I/O events or buffer manager LWLocks). Maybe I’m missing something...

> 
> 
> Andres Freund

--
May the force be with you…
https://simply.name

Re: [HACKERS] Background Processes and reporting

2016-03-11 Thread Vladimir Borodin


> 11 марта 2016 г., в 22:16, Andres Freund  написал(а):
> 
> Hi,
> 
> We now have "Provide much better wait information in pg_stat_activity"
> and "Add a generic command progress reporting facility" making it easier
> to provide insight into the system.
> 
> 
> While working on the writeback control / checkpoint sorting patch I'd
> the following statement in BufferSync()'s main loop:
> 
>fprintf(stderr, "\33[2K\rto_scan: %d, scanned: %d, %%processed: %.2f, 
> %%writeouts: %.2f",
>num_to_scan, num_processed,
>(((double) num_processed) / num_to_scan) * 100,
>((double) num_written / num_processed) * 100);
> 
> which basically printed the progress of a checkpoint, and some
> additional detail to stderr. Quite helpful to see whether progress is
> "unsteady".
> 
> Obviously that's not something that could be committed.
> 
> So I'm wondering how we can make it possible to use the aforementioned
> "progress reporting facility" to monitor checkpoint progress. To which
> Robert replied on IM:
> "it wouldn't quite help with that because the checkpointer doesn't show
> up as a regular backend"
> 
> 
> It seems rather worthwhile to think about how we can expand the coverage
> of progress tracking to other types of background processes.
> 
> 
> Similarly for the wait event stuff - checkpointer, wal writer,
> background writer are in many cases processes that very often are
> blocked on locks, IO and such.  Thus restricting the facility to
> database connected processes seems like a loss.

It was many times stated in threads about waits monitoring [0, 1, 2] and 
supported by different people, but ultimately waits information was stored in 
PgBackendStatus. Can’t we think one more time about implementation provided by 
Ildus and Alexander here [3]? That implementation included 1. waits monitoring 
for all process, 2. ability to trace waits of a particular process to file, 3. 
wait events history with sampling every N ms and 4. configurable measurement of 
wait timings. It has much more features, has been used in production on 100+ 
databases (patched 9.4) and gives wide opportunities for further development. 
Of course, huge work should be done to rebase across current master and cleanup 
but IMHO it is much better approach. Seems that current implementation doesn’t 
give reasonable ways to implement all that features and it is really sad.

[0] http://www.postgresql.org/message-id/55a3acc3.6020...@postgrespro.ru
[1] http://www.postgresql.org/message-id/55c3306b.5010...@postgrespro.ru
[2] 
http://www.postgresql.org/message-id/9a99c2a7-1760-419f-bdc9-a2cf99ecd...@simply.name
[3] http://www.postgresql.org/message-id/559d4729.9080...@postgrespro.ru

> 
> 
> Greetings,
> 
> Andres Freund
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
May the force be with you…
https://simply.name

Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2016-03-10 Thread Vladimir Borodin


> 10 марта 2016 г., в 11:50, Simon Riggs  написал(а):
> 
> On 10 March 2016 at 06:27, Michael Paquier  > wrote:
> On Thu, Mar 10, 2016 at 1:29 AM, David Steele  > wrote:
> > On 1/8/16 9:34 AM, Alvaro Herrera wrote:
> >> Simon Riggs wrote:
> >>>
> >>> On 8 January 2016 at 13:36, Alvaro Herrera  >>> >
> >>> wrote:
> 
>  I would agree except for the observation on toast indexes.  I think
>  that's an important enough use case that perhaps we should have both.
> >>>
> >>> The exclusion of toast indexes is something we can remove also, I have
> >>> recently discovered. When we access toast data we ignore MVCC, but we
> >>> still
> >>> have the toast pointer and chunkid to use for rechecking our scan
> >>> results.
> >>> So a later patch will add some rechecks.
> >>
> >> Ah, interesting, glad to hear.  I take it you're pushing your patch
> >> soon, then?
> >
> > ISTM that this patch should be "returned with feedback" or "rejected" based
> > on the thread.  I'm marking it "waiting for author" for the time being.
> 
> I think that we are still waiting for some input from Simon here...
> Simon, are you going to finish wrapping up your other patch?
> 
> Yes, I have done the research, so think patch should be rejected now.

Let’s do immediately after you will send a new version of your patch? Or even 
better after testing your patch? Don’t get me wrong, but rejecting my patch 
without tangible work on your patch may lead to forgiving about the problem 
before 9.6 freeze.

> 
> Thanks to everyone for their input. It's good to have alternate approaches.
> 
> -- 
> Simon Riggshttp://www.2ndQuadrant.com/ 
> 
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
May the force be with you…
https://simply.name

Re: [HACKERS] [WIP] ALTER ... OWNER TO ... CASCADE

2016-02-16 Thread Vladimir Borodin


> 16 февр. 2016 г., в 18:20, Alvaro Herrera  
> написал(а):
> 
> Vladimir Borodin wrote:
> 
>>> Moreover, the use case you've sketched (ie, change ownership of all
>>> objects inside a database) doesn't actually have anything to do with
>>> following dependencies.  It's a lot closer to REASSIGN OWNED ... in
>>> fact, it's not clear to me why REASSIGN OWNED doesn't solve that
>>> use-case already.
>> 
>> Sometimes I hit the following. You have created a database and schema
>> inside it from the superuser (i.e. postgres). Than you want to change
>> ownership of whole database to another user (i.e. alice), but only
>> this database, not all other objects in all other databases. It seems
>> that REASSIGN OWNED doesn’t solve this already.
> 
> So essentially you want to change all the objects in the database except
> those that were created together with the database itself (i.e. those
> that were copied from the template database).  

Yes. Without such syntax it is now done in a really awful way now, i.e. [0].

[0] 
https://github.com/saltstack/salt/blob/405d0aef1cf11bb56b5d2320b176f6992e6cdf3b/salt/modules/postgres.py#L1806-L1847

> That seems a reasonable
> use-case, but I'm not sure that this ALTER .. OWNER CASCADE is the right
> thing for that -- What object would you start with?  Each schema other
> than pg_catalog, pg_toast, information_schema?  As I recall, the problem
> is that REASSIGN OWNED refuses to work on pinned objects.  Maybe what
> you want is something like
>  REASSIGN OWNED BY xyz IN SCHEMA public TO xyzxxz
> i.e., an extension of the current REASSIGN OWNED BY command?

Well, I don’t know what syntax and implementation would be correct. I just want 
to give a specific user all rights to manage all objects in a specific database 
(which was created from postgres user earlier). It would be really useful.

> 
> -- 
> Álvaro Herrerahttp://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Да пребудет с вами сила…
https://simply.name/ru

Re: [HACKERS] [WIP] ALTER ... OWNER TO ... CASCADE

2016-02-16 Thread Vladimir Borodin


> 15 февр. 2016 г., в 19:25, Tom Lane  написал(а):
> 
> Teodor Sigaev  writes:
>>> So basically, a generic CASCADE facility sounds like a lot of work to
>>> produce something that would seldom be anything but a foot-gun.
> 
>> DELETE FROM  or TRUNCATE could be a foot-gun too, but it's not a reason to 
>> remove tham. I faced with problem when I tried to change owner of datadase 
>> with 
>> all objects inside. Think, this feature could be useful although it should 
>> restricted to superuser obly.
> 
> That's a pretty weak argument, and I do not think you have thought through
> all the consequences.  It is not hard at all to imagine cases where using
> this sort of thing could be a security vulnerability.  Are you familiar
> with the reasons why Unix systems don't typically allow users to "give
> away" ownership of files?  The same problem exists here.
> 
> To be concrete about it:
> 
> 1. Alice does, say, "CREATE EXTENSION cube".
> 
> 2. Bob creates a security-definer function owned by himself, using a
>   "cube"-type parameter so that it's dependent on the extension.
>   (It needn't actually do anything with that parameter.)
> 
> 3. Alice does ALTER EXTENSION cube OWNER TO charlie CASCADE.
> 
> 4. Bob now has a security-definer function owned by (and therefore
>   executing as) Charlie, whose contents were determined by Bob.
>   Game over for Charlie ... and for everyone else too, if Charlie is
>   a superuser, which is not unlikely for an extension owner.
> 
> The only way Alice can be sure that the ALTER EXTENSION is safe is if
> she manually inspects every dependent object, in which case she might
> as well not use CASCADE.

Seems to be a problem that should be addressed.

> 
> Moreover, the use case you've sketched (ie, change ownership of all
> objects inside a database) doesn't actually have anything to do with
> following dependencies.  It's a lot closer to REASSIGN OWNED ... in
> fact, it's not clear to me why REASSIGN OWNED doesn't solve that
> use-case already.

Sometimes I hit the following. You have created a database and schema inside it 
from the superuser (i.e. postgres). Than you want to change ownership of whole 
database to another user (i.e. alice), but only this database, not all other 
objects in all other databases. It seems that REASSIGN OWNED doesn’t solve this 
already.

> 
> I remain of the opinion that this is a terrible idea.
> 
>   regards, tom lane
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
May the force be with you…
https://simply.name

Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2016-01-08 Thread Vladimir Borodin


> 7 янв. 2016 г., в 5:26, Michael Paquier  
> написал(а):
> 
> On Thu, Jan 7, 2016 at 12:20 AM, Alvaro Herrera
> mailto:alvhe...@2ndquadrant.com>> wrote:
>> Vladimir Borodin wrote:
>> 
>>> There are situations in which vacuuming big btree index causes stuck
>>> in WAL replaying on hot standby servers for quite a long time. I’ve
>>> described the problem in more details in this thread [0]. Below in
>>> that thread Kevin Grittner proposed a good way for improving btree
>>> scans so that btree vacuuming logic could be seriously simplified.
>>> Since I don’t know when that may happen I’ve done a patch that makes
>>> some improvement right now. If Kevin or someone else would expand [1]
>>> for handling all types of btree scans, I suppose, my patch could be
>>> thrown away and vacuuming logic should be strongly rewritten.
>> 
>> You submitted this patch in May 2015 -- and 7 months later, Simon came
>> up with another patch that's supposed to fix the underlying problem, so
>> that this shouldn't be a problem anymore.
>> 
>> Would you please have a look at Simon's patch, in particular verify
>> whether it solves the performance dip in your testing environment?
>> https://www.postgresql.org/message-id/CANP8%2BjJuyExr1HnTAdZraWsWkfc-octhug7YPtzPtJcYbyi4pA%40mail.gmail.com
>> (Note there's an updated patch a few emails down the thread.)
>> 
>> If it seems to fix the problem for you, I think we should mark yours
>> rejected and just apply Simon’s.

Ok, I’ll try this patch with my use case. Basically, it’s not so easy now since 
I’ve partitioned that big table to not have such problems but there is a way to 
reproduce it once again. If it helps, I agree that my should be rejected in 
favor of the Simon’s patch because my patch just reduces replication lag but 
Simon’s seems to remove lag at all.

> 
> Simon's patch (justly) does not update lastBlockVacuumed in the case
> of toast indexes, but Vladimir's patch would still optimize this case,
> no?

I suppose, in case of _not_ toast indexes. But yes, seems that my patch should 
help in that case too.

> -- 
> Michael


--
May the force be with you…
https://simply.name

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-11-17 Thread Vladimir Borodin


> 14 нояб. 2015 г., в 10:50, Amit Kapila  написал(а):
> 
> On Wed, Sep 16, 2015 at 11:22 PM, Robert Haas  > wrote:
> > On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov
> > mailto:aekorot...@gmail.com>> wrote:
> >
> > >> I think it's reasonable to consider reporting this data in the PGPROC
> > >> using a 4-byte integer rather than reporting it through a singe byte
> > >> in the backend status structure.  I believe that addresses the
> > >> concerns about reporting from auxiliary processes, and it also allows
> > >> a little more data to be reported.  For anything in excess of that, I
> > >> think we should think rather harder.  Most likely, such addition
> > >> detail should be reported only for certain types of wait events, or on
> > >> a delay, or something like that, so that the core mechanism remains
> > >> really, really fast.
> > >
> > > That sounds reasonable. There are many pending questions, but it seems 
> > > like
> > > step forward to me.
> >
> > Great, let's do it.  I think we should probably do the work to
> > separate the non-individual lwlocks into tranches first, though.
> >
> 
> One thing that occurred to me in this context is that if we store the wait
> event information in PGPROC, then can we think of providing the info
> about wait events in a separate view pg_stat_waits (or pg_stat_wait_info or
> any other better name) where we can display wait information about
> all-processes rather than only backends?  This will avoid the confusion
> about breaking the backward compatibility for the current 'waiting' column
> in pg_stat_activity.
> 
> pg_stat_waits can have columns:
> pid  - Process Id
> wait_class_name - Name of the wait class
> wait class_event  - name of the wait event
> 
> We can extend it later with the information about timing for wait event.
> 
> Also, if we follow this approach, I think we don't need to store this
> information in PgBackendStatus.

Sounds like exactly the same that was proposed by Ildus in this thead [0]. 
Great to be thinking in the same direction. And on the rights of advertisements 
I’ve somehow described using all those views here [1].

[0] http://www.postgresql.org/message-id/559d4729.9080...@postgrespro.ru
[1] https://simply.name/pg-stat-wait.html

>   
> 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com 

--
May the force be with you…
https://simply.name

[HACKERS] pg_rewind and replication slots

2015-11-13 Thread Vladimir Borodin

Hi.

Should pg_rewind ignore pg_replslot dir at all? As it does pg_basebackup, for 
example.

--
May the force be with you…
https://simply.name

Re: [HACKERS] [ADMIN] Replication slots and isolation levels

2015-11-03 Thread Vladimir Borodin


> 3 нояб. 2015 г., в 11:38, Andres Freund  написал(а):
> 
> On 2015-11-02 15:37:57 -0500, Robert Haas wrote:
>> On Fri, Oct 30, 2015 at 9:49 AM, Vladimir Borodin  wrote:
>>> I’ve tried two ways - bare SELECT in autocommit mode and BEGIN; SELECT;
>>> ROLLBACK. I first described the problem in thread on pgsql-admin@ [0], there
>>> is copy-paste from psql there, but during conversation initial description
>>> was lost.
>>> 
>>> [0]
>>> http://www.postgresql.org/message-id/7f74c5ea-6741-44fc-b6c6-e96f18d76...@simply.name
>> 
>> Hmm.  That behavior seems unexpected to me, but I might be missing something.
> 
> The conflict is because of a relation lock, not because of
> visibility. Hot-Standby feedback changes nothing about that.
> 
> I presume all the other conflicts are all because of relation level
> locks? Check pg_stat_database_conflicts and the server logs to verify.

Oh, good point, thank you, it gives the answer. Actually I’ve already done a 
switchover in this cluster, so pg_stat_database_conflicts started from scratch 
:( But the logs haven’t been rotated yet:

root@rpopdb01e ~ # fgrep -e 562f9ef0.23df,6 -e 562fa107.451a -e 562fa1d9.5146 
-e 562f9ef0.23df,10 -e 562fa259.56d1 
/var/lib/pgsql/9.4/data/pg_log/postgresql-2015-10-27_185736.csv
2015-10-27 19:06:28.656 MSK,,,9183,,562f9ef0.23df,6,,2015-10-27 18:57:36 
MSK,,0,LOG,0,"parameter ""hot_standby_feedback"" changed to 
""off""",""
2015-10-27 19:10:05.039 
MSK,"postgres","rpopdb",17690,"[local]",562fa107.451a,1,"",2015-10-27 19:06:31 
MSK,12/54563,0,ERROR,40001,"canceling statement due to conflict with 
recovery","User query might have needed to see row versions that must be 
removed.","select count(*) from rpop.rpop_imap_uidls;",,,"psql"
2015-10-27 19:10:05.995 
MSK,"monitor","rpopdb",20806,"localhost:51794",562fa1d9.5146,1,"",2015-10-27 
19:10:01 MSK,15/24192,0,ERROR,40001,"canceling statement due to conflict with 
recovery","User was holding shared buffer pin for too long.""SQL function 
""to_timestamp"" statement 1","select cnt from monitor.bad_rpop_total",,,""
2015-10-27 19:12:06.878 MSK,,,9183,,562f9ef0.23df,10,,2015-10-27 18:57:36 
MSK,,0,LOG,0,"parameter ""hot_standby_feedback"" changed to 
""on""",""
2015-10-27 19:17:57.056 
MSK,"postgres","rpopdb",5,"[local]",562fa259.56d1,1,"",2015-10-27 19:12:09 
MSK,3/35442,0,FATAL,40001,"terminating connection due to conflict with 
recovery","User was holding a relation lock for too long.","select count(*) 
from rpop.rpop_imap_uidls;",,,"psql"
root@rpopdb01e ~ #

So FATAL is due to relation lock and one ERROR is due to pinned buffers (this 
is actually from another user) but there is also one ERROR due to old snapshots 
(first line). But I actually turned off hs_feedback before first ERROR and 
turned it on after it. So it seems to work expectedly.

Does it actually mean that I could get such conflicts (due to relation locks, 
for example) even in repeatable read or serializable? I mean, is there any 
dependency between transaction isolation level on standby and conflicts with 
recovery?

And am I right that the only way not to have confl_lock is to increase 
max_standby_streaming_delay?


> 
> Andres
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
May the force be with you…
https://simply.name

Re: [HACKERS] [ADMIN] Replication slots and isolation levels

2015-11-02 Thread Vladimir Borodin

> 2 нояб. 2015 г., в 23:37, Robert Haas  написал(а):
> 
> On Fri, Oct 30, 2015 at 9:49 AM, Vladimir Borodin  wrote:
>> I’ve tried two ways - bare SELECT in autocommit mode and BEGIN; SELECT;
>> ROLLBACK. I first described the problem in thread on pgsql-admin@ [0], there
>> is copy-paste from psql there, but during conversation initial description
>> was lost.
>> 
>> [0]
>> http://www.postgresql.org/message-id/7f74c5ea-6741-44fc-b6c6-e96f18d76...@simply.name
> 
> Hmm.  That behavior seems unexpected to me, but I might be missing something.

Me too. That’s why I started the thread. One small detail that might have a 
value is that the big table being queried is partitioned into 64 inhereted 
tables. Now I’m trying to write a simple script to reproduce the problem, but 
that is not so easy because AFAIK VACUUM on master should happen while single 
query on standby is running and it should vacuum those rows that have not been 
accessed by the query on standby yet.

> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
May the force be with you…
https://simply.name

Re: [HACKERS] [ADMIN] Replication slots and isolation levels

2015-10-30 Thread Vladimir Borodin


> 30 окт. 2015 г., в 16:04, Robert Haas  написал(а):
> 
> On Fri, Oct 30, 2015 at 12:40 PM, Vladimir Borodin  wrote:
>> I still don’t fully understand why is it so (the problem occurs while
>> running only one SELECT-statement in READ COMMITED so only one snapshot is
>> taken), but if is expected behavior shouldn’t the documentation mention that
>> using READ COMMITED (which is the default) you may still get conflicts with
>> recovery while using replication slots?
> 
> Are you doing BEGIN / one or more SELECT statements / END?
> 
> Or just a bare SELECT with no explicit transaction control?

I’ve tried two ways - bare SELECT in autocommit mode and BEGIN; SELECT; 
ROLLBACK. I first described the problem in thread on pgsql-admin@ [0], there is 
copy-paste from psql there, but during conversation initial description was 
lost.

[0] 
http://www.postgresql.org/message-id/7f74c5ea-6741-44fc-b6c6-e96f18d76...@simply.name

> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 
> 
> -- 
> Sent via pgsql-admin mailing list (pgsql-ad...@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-admin


--
Да пребудет с вами сила…
https://simply.name/ru

Re: [HACKERS] [ADMIN] Replication slots and isolation levels

2015-10-30 Thread Vladimir Borodin

> 30 окт. 2015 г., в 14:30, Robert Haas  написал(а):
> 
> On Thu, Oct 29, 2015 at 3:29 PM, Oleksii Kliukin  wrote:
>> Could it be a consequence of how REPEATABLE READ transactions handle
>> snapshots? With REPEATABLE READ the snapshot is acquired only once at the
>> beginning of a transaction; a READ COMMITTED transaction re-evaluates its
>> snapshot with each new command.
> 
> I bet that's exactly it.

I still don’t fully understand why is it so (the problem occurs while running 
only one SELECT-statement in READ COMMITED so only one snapshot is taken), but 
if is expected behavior shouldn’t the documentation mention that using READ 
COMMITED (which is the default) you may still get conflicts with recovery while 
using replication slots?

> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
May the force be with you…
https://simply.name

Re: [HACKERS] [ADMIN] Replication slots and isolation levels

2015-10-29 Thread Vladimir Borodin

> 29 окт. 2015 г., в 15:29, Michael Paquier  
> написал(а):
> 
> On Thu, Oct 29, 2015 at 12:35 PM, Vladimir Borodin wrote:
>> 29 окт. 2015 г., в 14:03, Michael Paquier написал(а):
>>> Standby will receive the record but not replay it until the
>>> transaction doing REPEATABLE READ transactions that needs those rows
>>> commits on the standby. The WAL flush position on the standby
>>> continues to move on.
>> 
>> By replication lag on standby I mean exactly replay_location, not
>> flush_location.
>> Well, the initial problem is that in read commited mode heavy
>> SELECT-statement hits max_standby_streaming_delay but in repeatable read
>> mode doesn’t. My question is if it is expected behavior? If yes, why is it
>> so?
> 
> Er, well. If I enforce on master the deletion then VACUUM-cleanup of a
> page with a REPEATABLE READ transaction on standby still expecting to
> have this page items visible until its commit the startup process puts
> itself in waiting state when trying to replay the cleanup record, and
> the replay_location does not move on, still the wal receiver gets WAL
> in parallel, so it continues to flush things and flush_position
> progresses. With a READ COMMITTED transaction running on the standby,
> this transaction considers as visible stuff that has been committed,
> so WAL replay can move on, and indeed there is a risk to face a
> recovery conflict. So this behavior as-is is correct, based on how
> isolation levels should behave when a node performs recovery.

Everything you describe is exactly true for setups without replication slots. 
And the ability to run heavy SELECT statements on hot standby without 
replication lag and recovery conflicts was the reason why I tried to use them. 
And the documentation [0] directly says that «Replication slots provide an 
automated way to ensure ... that the master does not remove rows which could 
cause a recovery conflict even when the standby is disconnected». My question 
is why is it true for REPEATABLE READ transactions but it doesn’t work for READ 
COMMITED queries? Seems, that «even when the standby is disconnected» is much 
stronger limitation and READ COMMITED should work fine, but it doesn’t.

If I understand right, with hot_standby_feedback = on standby tells the master 
xmin of the earliest transaction on standby. And autovacuum worker on master 
takes it into account when doing vacuum cleanup (because it can see it from 
pg_replications_slots view), AFAIK with GetOldestXmin function. But I don’t 
understand why with READ COMMITED transactions xmin in pg_replication_slots 
view on master continues to increase while with REPEATABLE READ xmin freezes 
until this transaction finishes.

[0] 
http://www.postgresql.org/docs/9.4/static/warm-standby.html#STREAMING-REPLICATION-SLOTS

<http://www.postgresql.org/docs/9.4/static/warm-standby.html#STREAMING-REPLICATION-SLOTS>
> -- 
> Michael
> 
> 
> -- 
> Sent via pgsql-admin mailing list (pgsql-ad...@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-admin

--
Да пребудет с вами сила…
https://simply.name/ru

Re: [HACKERS] [ADMIN] Replication slots and isolation levels

2015-10-29 Thread Vladimir Borodin


> 29 окт. 2015 г., в 14:03, Michael Paquier  
> написал(а):
> 
> On Thu, Oct 29, 2015 at 11:33 AM, Vladimir Borodin wrote:
>> 29 окт. 2015 г., в 13:12, Michael Paquier написал(а):
>>> In the case of repeatable read the standby will wait before applying
>>> the VACUUM WAL record cleaning up a relation page. Hence you won't get
>>> conflicts in this case.
>> 
>> Standby will receive but will not apply? Or master will not vacuum needed by
>> standby pages? It seems that the second one is happening because replication
>> lag on standby does not increase while issuing such repeatable read
>> transaction.
> 
> Standby will receive the record but not replay it until the
> transaction doing REPEATABLE READ transactions that needs those rows
> commits on the standby. The WAL flush position on the standby
> continues to move on.

By replication lag on standby I mean exactly replay_location, not 
flush_location.

> This depends of course on
> max_standby_streaming_delay which may decide or not to force the
> transaction to cancel if it takes too long. Someone feel free to
> correct me if I am missing something here.

Well, the initial problem is that in read commited mode heavy SELECT-statement 
hits max_standby_streaming_delay but in repeatable read mode doesn’t. My 
question is if it is expected behavior? If yes, why is it so?

Thanks for your response!

> -- 
> Michael


--
Да пребудет с вами сила…
https://simply.name/ru

Re: [HACKERS] [ADMIN] Replication slots and isolation levels

2015-10-29 Thread Vladimir Borodin


> 29 окт. 2015 г., в 13:12, Michael Paquier  
> написал(а):
> 
> On Thu, Oct 29, 2015 at 9:42 AM, Vladimir Borodin wrote:
>> I’m wondering why do I get conflicts with recovery on hot standby using
>> replication slots and read commited isolation level? And if I start
>> repeatable read transaction I don’t get any errors. Below is some
>> diagnostics.
> 
> In the case of repeatable read the standby will wait before applying
> the VACUUM WAL record cleaning up a relation page. Hence you won't get
> conflicts in this case.

Standby will receive but will not apply? Or master will not vacuum needed by 
standby pages? It seems that the second one is happening because replication 
lag on standby does not increase while issuing such repeatable read transaction.

> -- 
> Michael
> 
> 
> -- 
> Sent via pgsql-admin mailing list (pgsql-ad...@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-admin


--
Да пребудет с вами сила…
https://simply.name/ru

Re: [HACKERS] [ADMIN] Replication slots and isolation levels

2015-10-29 Thread Vladimir Borodin


> 27 окт. 2015 г., в 19:45, Vladimir Borodin  написал(а):
> 
> Hi all.
> 
> I’m wondering why do I get conflicts with recovery on hot standby using 
> replication slots and read commited isolation level? And if I start 
> repeatable read transaction I don’t get any errors. Below is some diagnostics.

+hackers@

Could anybody explain, why this is happening?

> 
> I’m using 9.4.4 (upgrade is planned) from yum.postgresql.org 
> <http://yum.postgresql.org/> packages on both master and standby. Configs are 
> the same on both master and standby:
> 
> rpopdb01d/postgres M # SELECT name, setting FROM pg_settings
> WHERE category LIKE 'Replication%' or category LIKE 'Write-Ahead Log';
>  name | setting
> --+-
>  hot_standby  | on
>  hot_standby_feedback | on
>  max_replication_slots| 1
>  max_standby_archive_delay| 3
>  max_standby_streaming_delay  | 3
>  max_wal_senders  | 10
>  synchronous_standby_names|
>  vacuum_defer_cleanup_age | 20
>  wal_keep_segments| 64
>  wal_receiver_status_interval | 1
>  wal_receiver_timeout | 6
>  wal_sender_timeout   | 3000
> (12 rows)
> 
> Time: 1.583 ms
> rpopdb01d/postgres M #
> 
> On the master I’ve created a physical replication slot and attached standby 
> to it, I do see changing xmin and restart_lsn fields in pg_replication_slots 
> view.
> 
> rpopdb01d/postgres M # select * from pg_replication_slots ;
>  slot_name| plugin | slot_type | datoid | database | active |
> xmin| catalog_xmin |  restart_lsn
> --++---++--+++--+---
>  rpopdb01e_domain_com | [null] | physical  | [null] | [null]   | t  | 
> 2127399287 |   [null] | 960B/415C79C8
> (1 row)
> 
> Time: 0.463 ms
> rpopdb01d/postgres M #
> 
> When I start a read commited transaction on standby (or use autocommit mode, 
> doesn’t matter) I still see that xmin in pg_replication_slots view on master 
> increases. If I do run a heavy SELECT statement, at some point of time 
> (presumably after vacuum_defer_cleanup_age expires) standby starts to lag 
> replication apply and when it hits max_standby_streaming_delay I get 40001 
> sql code, either ERROR or FATAL:
> 
> rpopdb01e/rpopdb R # SHOW transaction_isolation ;
>  transaction_isolation
> ---
>  read committed
> (1 row)
> 
> Time: 0.324 ms
> rpopdb01e/rpopdb R # SELECT count(*) FROM big_table;
> ERROR:  40001: canceling statement due to conflict with recovery
> DETAIL:  User query might have needed to see row versions that must be 
> removed.
> LOCATION:  ProcessInterrupts, postgres.c:2990
> Time: 199791.339 ms
> rpopdb01e/rpopdb R #
> 
> 
> 
> rpopdb01e/rpopdb R # SHOW transaction_isolation ;
>  transaction_isolation
> ---
>  read committed
> (1 row)
> 
> Time: 0.258 ms
> rpopdb01e/rpopdb R # BEGIN;
> BEGIN
> Time: 0.067 ms
> rpopdb01e/rpopdb R # SELECT count(*) FROM big_table;
> FATAL:  40001: terminating connection due to conflict with recovery
> DETAIL:  User was holding a relation lock for too long.
> LOCATION:  ProcessInterrupts, postgres.c:2857
> server closed the connection unexpectedly
>   This probably means the server terminated abnormally
>   before or while processing the request.
> The connection to the server was lost. Attempting reset: Succeeded.
> Time: 307864.830 ms
> rpopdb01e/rpopdb R #
> 
> The behavior is the same as expected to be without using replication slots.
> 
> But when I start repeatable read transaction xmin field in 
> pg_replication_slots view on master freezes (while restart_lsn is still 
> increasing) and I don’t get any replication lag and conflicts with recovery. 
> When I end this transaction, xmin starts increasing again.
> 
> rpopdb01e/rpopdb R # begin transaction isolation level repeatable read;
> BEGIN
> Time: 0.118 ms
> rpopdb01e/rpopdb R # SELECT count(*) FROM big_table;
>count
> 
>  3106222429
> (1 row)
> 
> Time: 411944.889 ms
> rpopdb01e/rpopdb R # ROLLBACK;
> ROLLBACK
> Time: 0.269 ms
> rpopdb01e/rpopdb R #
> 
>  And that is what I expect. Am I missing something or is it expected behavior 
> in read commited mode?
> 
> Thanks in advance.
> 
> --
> May the force be with you…
> https://simply.name <https://simply.name/>


--
May the force be with you…
https://simply.name

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-18 Thread Vladimir Borodin


> 18 сент. 2015 г., в 20:16, Robert Haas  написал(а):
> 
> On Fri, Sep 18, 2015 at 4:08 AM, Vladimir Borodin  wrote:
>> For both scenarios on linux we got approximately the same results - version
>> with timings was faster then version with sampling (sampling was done every
>> 10 ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps and version
>> with timings gave ~14500 tps while version with sampling gave ~13800 tps. In
>> all cases processor was 100% utilized. Comparing vanilla PostgreSQL and
>> version with timings on constant workload (12000 tps) gave the following
>> results in latencies for queries:
> 
> If the timing is speeding things up, that's most likely a sign that
> the spinlock contention on that workload is so severe that you are
> spending a lot of time in s_lock.  Adding more things for the system
> to do that don't require that lock will speed the system up by
> reducing the contention.  Instead of inserting gettimeofday() calls,
> you could insert a for loop that counts to some large number without
> doing any useful work, and that would likely have a similar effect.
> 
> In any case, I think your experiment clearly proves that the presence
> or absence of this instrumentation *is* performance-relevant and that
> we *do* need to worry about what it costs. If the system gets 20%
> faster when you call gettimeofday() a lot, does that mean we should
> insert gettimeofday() calls all over the system in random places to
> speed it up?

No, probably you misunderstood the results, let me explain one more time. 
Unpatched PostgreSQL from REL9_4_STABLE gave 15500 tps. Version with timings - 
14500 tps which is 6,5% worse. Version with sampling wait events every 10 ms 
gave 13800 tps (11% worse than unpatched and 5% worse than with timings).

We also made a test with a stable workload of 12000 tps for unpatched version 
and version with timings. In thas test we saw that response times are a bit 
worse in version with timings as shown in the table below. You should read this 
table as follows: 99% of all queries in unpatched version fits in 79 ms while 
in version with timings 99% of all queries fits in 97 ms which is 18 ms slower, 
and so on. That test also showed that version with timings consumes extra 7% of 
CPU to handle the same workload as unpatched version.

So this is the cost of waits monitoring with timings on lwlock stress workload 
- 6,5% less throughput, a bit worse timings and extra 7% of CPU. If you will 
insert gettimeofday() calls all over the system in random places, you 
expectedly will not speed up, you will be getting slower.

q'thvanilla timing
99% 79.097.0(+18.0)
98% 64.076.0(+12.0)
95% 38.047.0(+9.0)
90% 16.021.0(+5.0)
85% 7.0 11.0(+4.0)
80% 5.0 7.0 (+2.0)
75% 4.0 5.0 (+1.0)
50% 2.0 3.0 (+1.0)

> 
> I do agree that if we're going to include support for timings, having
> them be controlled by a GUC is a good idea.
> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
May the force be with you…
https://simply.name

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-18 Thread Vladimir Borodin

> 16 сент. 2015 г., в 20:52, Robert Haas  написал(а):
> 
> On Wed, Sep 16, 2015 at 12:29 PM, Alexander Korotkov
>  wrote:
>> Yes, the major question is cost. But I think we should validate our thoughts
>> by experiments assuming there are more possible synchronization protocols.
>> Ildus posted implemention of double buffering approach that showed quite low
>> cost.
> 
> I'm not sure exactly which email you are referring to, but I don't
> believe that anyone has done experiments that are anywhere near
> comprehensive enough to convince ourselves that this won't be a
> problem.  If a particular benchmark doesn't show an issue, that can
> just mean that the benchmark isn't hitting the case where there is a
> problem.  For example, EDB has had customers who have severe
> contention apparently on the buffer content lwlocks, resulting in big
> slowdowns.  You don't see that in, say, a pgbench run.  But for people
> who have certain kinds of queries, it's really bad.  Those sort of
> loads, where the lwlock system really gets stressed, are cases where
> adding overhead seems likely to pinch.

Alexander and Ildus gave us two patches for REL9_4_STABLE - one with sampling 
every N milliseconds and one with measuring timings. We have tested both of 
them on two kinds of our workload besides pgbench runs. One workload is OLTP 
when all the data fits in shared buffers, synchronous_commit if off and the 
bottleneck is ProcArrayLock (all backtraces look something like the following):

[Thread debugging using libthread_db enabled]
0x7f10e01d4f27 in semop () from /lib64/libc.so.6
#0  0x7f10e01d4f27 in semop () from /lib64/libc.so.6
#1  0x0061fe27 in PGSemaphoreLock (sema=0x7f11f2d4f430, interruptOK=0 
'\000') at pg_sema.c:421
#2  0x006769ba in LWLockAcquireCommon (l=0x7f10e1e00120, 
mode=LW_EXCLUSIVE) at lwlock.c:626
#3  LWLockAcquire (l=0x7f10e1e00120, mode=LW_EXCLUSIVE) at lwlock.c:467
#4  0x00667862 in ProcArrayEndTransaction (proc=0x7f11f2d4f420, 
latestXid=182562881) at procarray.c:404
#5  0x004b579b in CommitTransaction () at xact.c:1957
#6  0x004b6ae5 in CommitTransactionCommand () at xact.c:2727
#7  0x006819d9 in finish_xact_command () at postgres.c:2437
#8  0x00684f05 in PostgresMain (argc=, argv=, dbname=0x21e1a70 "xivadb", username=) at 
postgres.c:4270
#9  0x00632d7d in BackendRun (argc=, argv=) at postmaster.c:4155
#10 BackendStartup (argc=, argv=) at 
postmaster.c:3829
#11 ServerLoop (argc=, argv=) at 
postmaster.c:1597
#12 PostmasterMain (argc=, argv=) at 
postmaster.c:1244
#13 0x005cadb8 in main (argc=3, argv=0x21e0aa0) at main.c:228

Another is when all the data fits in RAM but does not fit in shared buffers and 
the bottleneck is mostly BufFreelistLock with sensible contention around buffer 
partitions locking and buffers locking. Taking several backtraces with GDB of 
several backends and doing some bash magic gives the following:

root@pgload01g ~ # grep '^#4 ' /tmp/bt | awk '{print $2, $4, $NF}' | sort | 
uniq -c | sort -rn
126 0x0065db61 BufferAlloc bufmgr.c:591
 67 0x0065e03a BufferAlloc bufmgr.c:760
 43 0x005c8c3b pq_getbyte pqcomm.c:899
 39 0x0065dd93 BufferAlloc bufmgr.c:765
  6 0x004b52bb RecordTransactionCommit xact.c:1194
  4 0x0065da0e ReadBuffer_common bufmgr.c:476
  1 ReadBuffer_common relpersistence=112 bufmgr.c:340
  1 exec_eval_expr expr=0x166e908, pl_exec.c:4796
  1 0x7f78b8cb217b ?? /usr/pgsql-9.4/lib/pg_stat_statements.so
  1 0x005d4cbb _copyList copyfuncs.c:3849
root@pgload01g ~ #

For both scenarios on linux we got approximately the same results - version 
with timings was faster then version with sampling (sampling was done every 10 
ms). Vanilla PostgreSQL from REL9_4_STABLE gave ~15500 tps and version with 
timings gave ~14500 tps while version with sampling gave ~13800 tps. In all 
cases processor was 100% utilized. Comparing vanilla PostgreSQL and version 
with timings on constant workload (12000 tps) gave the following results in 
latencies for queries:

q'thvanilla timing
99% 79.097.0(+18.0)
98% 64.076.0(+12.0)
95% 38.047.0(+9.0)
90% 16.021.0(+5.0)
85% 7.0 11.0(+4.0)
80% 5.0 7.0 (+2.0)
75% 4.0 5.0 (+1.0)
50% 2.0 3.0 (+1.0)

And it that test version with timings consumed about 7% more of CPU. Does it 
seem to be the results on workload where lwlock system is stressed?

And when the data does not fit in RAM you really don’t see much difference 
between all three version because your contention is moved from lwlock system 
to I/O, even with newest NVMe SSDs, or at least is divided between lwlocks and 
other waits events.

> 
>> Yes, but some competing products also provides comprehensive waits
>> monitoring too. That makes me think

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-09-14 Thread Vladimir Borodin


> 12 сент. 2015 г., в 14:05, Amit Kapila  написал(а):
> 
> On Thu, Aug 6, 2015 at 3:31 PM, Ildus Kurbangaliev 
> mailto:i.kurbangal...@postgrespro.ru>> wrote:
> >
> > On 08/05/2015 09:33 PM, Robert Haas wrote:
> >>
> >>
> >> You're missing the point.  Those multi-byte fields have additional
> >> synchronization requirements, as I explained in some detail in my
> >> previous email. You can't just wave that away.
> >
> > I see that now. Thank you for the point.
> >
> > I've looked deeper and I found PgBackendStatus to be not a suitable
> > place for keeping information about low level waits. Really, PgBackendStatus
> > is used to track high level information about backend. This is why auxiliary
> > processes don't have PgBackendStatus, because they don't have such 
> > information
> > to expose. But when we come to the low level wait events then auxiliary
> > processes are as useful for monitoring as backends are. WAL writer,
> > checkpointer, bgwriter etc are using LWLocks as well. This is certainly 
> > unclear
> > why they can't be monitored.
> >
> 
> I think the chances of background processes stuck in LWLock is quite less
> as compare to backends as they do the activities periodically.  As an example
> WALWriter will take WALWriteLock to write the WAL, but actually there will 
> never
> be any much contention for WALWriter. In synchronous_commit = on, the
> backends themselves write the WAL so WALWriter won't do much in that
> case and for synchronous_commit = off, backends won't write the WAL so
> WALWriter won't face any contention unless some buffers have to be written
> by bgwriter or checkpoint for which WAL is not flushed which I don't think
> would lead to any contention. 

WALWriter is not a good example, IMHO. And monitoring LWLocks is not the only 
thing that waits monitoring brings to us. Here [0] is an example when 
understanding of what is happening inside the startup process took some long 
time and led to GDB usage. With waits monitoring I could do a couple of SELECTs 
and use oid2name to understand the reason of a problem.

Also we should consider that PostgreSQL has a good infrastructure to 
parallelize many auxilary processes. Can we be sure that we will always have 
exactly one wal writer process? Perhaps, some time in the future we would need 
several of them and there would be contention for WALWriteLock between them. 
Perhaps, wal writer is not a good example here too, but having multiple 
checkpointer or bgwriter processes on the near future seems very likely, no?

[0] 
http://www.postgresql.org/message-id/fe82a9a7-0d52-41b5-a9ed-967f6927c...@simply.name
 

> 
> I am not denying from the fact that there could be some contention in rare
> scenarios for background processes, but I think tracking them is not as
> important as tracking the LWLocks for backends.
> 
> Also as we are planning to track the wait_event information in 
> pg_stat_activity
> along with other backends information, it will not make sense to include
> information about backend processes in this variable as pg_stat_activity
> just displays information of backend processes.
> 
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com 

--
May the force be with you…
https://simply.name

Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2015-09-02 Thread Vladimir Borodin

25 авг. 2015 г., в 16:03, Michael Paquier  написал(а):On Sun, Jul 26, 2015 at 9:46 PM, Andres Freund wrote:On 2015-07-24 09:53:49 +0300, Heikki Linnakangas wrote:To me it sounds like this shouldn't go through the full ReadBuffer()rigamarole. That code is already complex enough, and here it's reallynot needed. I think it'll be much easier to review - and actually fasterin many cases to simply have something likeboolBufferInCache(Relation, ForkNumber, BlockNumber){    /* XXX: setup tag, hash, partition */    LWLockAcquire(newPartitionLock, LW_SHARED);    buf_id = BufTableLookup(&newTag, newHash);    LWLockRelease(newPartitionLock);    return buf_id != -1;}and then fall back to the normal ReadBuffer() when it's in cache.Yep, sounds good. Patch with implementation attached.Patch marked as returned with feedback as input from the author hasbeen waited for some time now.Sorry for delay, I will link it to the current commitfest.

btree_vacuum_v3.patch
Description: Binary data
-- Michael
--May the force be with you…https://simply.name

Re: [HACKERS] creating extension including dependencies

2015-07-10 Thread Vladimir Borodin


> 10 июля 2015 г., в 16:09, Heikki Linnakangas  написал(а):
> 
> On 07/09/2015 07:05 PM, Petr Jelinek wrote:
>> On 2015-07-07 15:41, Andres Freund wrote:
>>> On 2015-07-07 22:36:29 +0900, Fujii Masao wrote:
 On Mon, Jun 15, 2015 at 7:50 AM, Petr Jelinek  wrote:
> Hi,
> 
> I am getting tired installing manually required extensions manually. I was
> wondering if we might want to add option to CREATE SEQUENCE that would 
> allow
> automatic creation of the extensions required by the extension that is 
> being
> installed by the user.
 
 I'm wondering how much helpful this feature is. Because, even if we can 
 save
 some steps for CREATE EXTENSION by using the feature, we still need to
 manually find out, download and install all the extensions that the target
 extension depends on. So isn't it better to implement the tool like yum, 
 i.e.,
 which performs all those steps almost automatically, rather than the 
 proposed
 feature? Maybe it's outside PostgreSQL core.
>>> 
>>> That doesn't seem to make much sense to me. Something like yum can't
>>> install everything in all relevant databases. Sure, yum will be used to
>>> install dependencies between extensions on the filesystem level.
>>> 
>>> At the minimum I'd like to see that CREATE EXTENSION foo; would install
>>> install extension 'bar' if foo dependended on 'bar' if CASCADE is
>>> specified. Right now we always error out saying that the dependency on
>>> 'bar' is not fullfilled - not particularly helpful.
>> 
>> That's what the proposed patch does (with slightly different syntax but
>> syntax is something that can be changed easily).
> 
> This seems quite reasonable, but I have to ask: How many extensions are there 
> out there that depend on another extension? Off the top of my head, I can't 
> think of any..

pg_stat_kcache depends on pg_stat_statements, for example.

> 
> - Heikki
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org 
> )
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers 
> 

--
May the force be with you…
https://simply.name

Re: [HACKERS] pg_rewind failure by file deletion in source server

2015-06-28 Thread Vladimir Borodin


> 28 июня 2015 г., в 21:46, Heikki Linnakangas  написал(а):
> 
> On 06/24/2015 09:43 AM, Michael Paquier wrote:
>> Attached is a new set of patches. Except for the last ones that
>> addresses one issue of pg_rewind (symlink management when streaming
>> PGDATA), all the others introduce if_not_exists options for the
>> functions of genfile.c. The pg_rewind stuff could be more polished
>> though. Feel free to comment.
> 
> I've committed the additional option to the functions in genfile.c (I renamed 
> it to "missing_ok", for clarity), and the pg_rewind changes to use that 
> option.

And since it changes API it would not be back-ported to 9.4, right?

> 
> I ended up refactoring the patch quite a bit, so if you could double-check 
> what I committed to make sure I didn't break anything, that would be great.
> 
> I didn't commit the tablespace or symlink handling changes yet, will review 
> those separately.
> 
> I also didn't commit the new regression test yet. It would indeed be nice to 
> have one, but I think it was a few bricks shy of a load. It should work in a 
> freshly initdb'd system, but not necessarily on an existing installation. 
> First, it relied on the fact that postgresql.conf.auto exists, but a DBA 
> might remove that if he wants to make sure the feature is not used. Secondly, 
> it relied on the fact that pg_twophase is empty, but there is no guarantee of 
> that either. Third, the error messages included in the expected output, e.g 
> "No such file or directory", depend on the operating system and locale. And 
> finally, it'd be nice to test more things, in particular the behaviour of 
> different offsets and lengths to pg_read_binary_file(), although an 
> incomplete test would be better than no test at all.
> 
> - Heikki
> 
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
May the force be with you…
https://simply.name

Re: [HACKERS] pg_rewind and xlogtemp files

2015-06-17 Thread Vladimir Borodin

> 17 июня 2015 г., в 9:48, Michael Paquier  
> написал(а):
> 
> On Wed, Jun 17, 2015 at 3:17 PM, Michael Paquier
>  wrote:
>> As pointed by dev1ant on the original bug report, process_remote_file
>> should ignore files named as pg_xlog/xlogtemp.*, and I think that this
>> is the right thing to do. Any objections for a patch that at the same
>> time makes "xlogtemp." a define declaration in xlog_internal.h?

Declaration seems to be the right thing.

Another problem I’ve caught twice already in the same test:

error reading xlog record: record with zero length at 0/7890
unexpected result while fetching remote files: ERROR:  could not open file 
"base/13003/t6_2424967" for reading: No such file or directory
The servers diverged at WAL position 0/76BADD50 on timeline 303.
Rewinding from Last common checkpoint at 0/7651F870 on timeline 303

I don’t know if this problem could be solved the same way (by skipping such 
files)… Should I start a new thread for that?

> 
> And attached is a patch following those lines.
> -- 
> Michael
> <20150617_rewind_xlogtemp.patch>
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
May the force be with you…
https://simply.name

Re: [HACKERS] Broken --dry-run mode in pg_rewind

2015-05-08 Thread Vladimir Borodin

8 мая 2015 г., в 16:39, Vladimir Borodin <r...@simply.name> написал(а):8 мая 2015 г., в 16:11, Stephen Frost <sfr...@snowman.net> написал(а):* Heikki Linnakangas (hlinn...@iki.fi) wrote:On 05/08/2015 03:39 PM, Michael Paquier wrote:On Fri, May 8, 2015 at 9:34 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:On 05/08/2015 03:25 PM, Vladimir Borodin wrote:Seems, that pg_rewind does not account --dry-run option properly. A simplefixfor that is attached.No, the --dry-run takes effect later. It performs all the actions itnormally would, including reading files from the source, except for actuallywriting anything in the target. See the dry-run checks in file_ops.cUrgh, my test script had an error and I made grep only in pg_rewind.c. Sorry for noise.Even if the patch sent is incorrect, shouldn't there be some processbypass in updateControlFile() and createBackupLabel() in case of a--dry-run?They both use open_target_file() and write_target_file(), whichcheck for --dry-run and do nothing if it's set.Hmm, I wonder it we should print something else than "Done!" at theend, if run in --dry-run mode. Or give some indication around thetime it says "Rewinding from last common checkpoint at ...", thatit's running in dry-run mode and won't actually modify anything. Theprogress messages are a bit alarming if you don't realize that it'sskipping all the writes.That would be really nice.Added comments in all places mentioned in this thread.

pg_rewind_dry_run_comments.patch
Description: Binary data
Wouldn't hurt to also augment that rather doom-looking "point of noreturn" comment with a note that says writes won't happen if indry-run. :)For my 2c anyway.	Thanks!		Stephen--May the force be with you…https://simply.name
--May the force be with you…https://simply.name

Re: [HACKERS] Broken --dry-run mode in pg_rewind

2015-05-08 Thread Vladimir Borodin


> 8 мая 2015 г., в 16:11, Stephen Frost  написал(а):
> 
> * Heikki Linnakangas (hlinn...@iki.fi <mailto:hlinn...@iki.fi>) wrote:
>> On 05/08/2015 03:39 PM, Michael Paquier wrote:
>>> On Fri, May 8, 2015 at 9:34 PM, Heikki Linnakangas  wrote:
>>>> On 05/08/2015 03:25 PM, Vladimir Borodin wrote:
>>>>> Seems, that pg_rewind does not account --dry-run option properly. A simple
>>>>> fix
>>>>> for that is attached.
>>>> 
>>>> 
>>>> No, the --dry-run takes effect later. It performs all the actions it
>>>> normally would, including reading files from the source, except for 
>>>> actually
>>>> writing anything in the target. See the dry-run checks in file_ops.c

Urgh, my test script had an error and I made grep only in pg_rewind.c. Sorry 
for noise.

>>> 
>>> Even if the patch sent is incorrect, shouldn't there be some process
>>> bypass in updateControlFile() and createBackupLabel() in case of a
>>> --dry-run?
>> 
>> They both use open_target_file() and write_target_file(), which
>> check for --dry-run and do nothing if it's set.
>> 
>> Hmm, I wonder it we should print something else than "Done!" at the
>> end, if run in --dry-run mode. Or give some indication around the
>> time it says "Rewinding from last common checkpoint at ...", that
>> it's running in dry-run mode and won't actually modify anything. The
>> progress messages are a bit alarming if you don't realize that it's
>> skipping all the writes.

That would be really nice.

> 
> Wouldn't hurt to also augment that rather doom-looking "point of no
> return" comment with a note that says writes won't happen if in
> dry-run. :)
> 
> For my 2c anyway.
> 
>   Thanks!
> 
>   Stephen


--
May the force be with you…
https://simply.name

[HACKERS] Broken --dry-run mode in pg_rewind

2015-05-08 Thread Vladimir Borodin

Hi all.Seems, that pg_rewind does not account --dry-run option properly. A simple fix for that is attached.

pg_rewind_dry_run_fix.patch
Description: Binary data

--May the force be with you…https://simply.name

Re: [HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2015-05-03 Thread Vladimir Borodin

Hi, Jim.Thanks for review.2 мая 2015 г., в 2:10, Jim Nasby <jim.na...@bluetreble.com> написал(а):On 5/1/15 11:19 AM, Vladimir Borodin wrote:There are situations in which vacuuming big btree index causes stuck inWAL replaying on hot standby servers for quite a long time. I’vedescribed the problem in more details in this thread [0]. Below in thatthread Kevin Grittner proposed a good way for improving btree scans sothat btree vacuuming logic could be seriously simplified. Since I don’tknow when that may happen I’ve done a patch that makes some improvementright now. If Kevin or someone else would expand [1] for handling alltypes of btree scans, I suppose, my patch could be thrown away andvacuuming logic should be strongly rewritten.This looks like a good way to address this until the more significant work can be done.I'm not a fan of "RBM_ZERO_NO_BM_VALID"; how about RBM_ZERO_BM_INVALID? or BM_NOT_VALID? Or maybe I'm just trying to impose too much English on the code; I see the logic to NO_BM_VALID…Perhaps, RBM_ZERO_NO_BM_VALID is not so good (it makes more difficult to grep BM_VALID in code), but I don’t actually like BM_INVALID and BM_NOT_VALID, sorry :( But I also don’t insist on NO_BM_VALID, any other suggestions?+ * RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK, but does not set+ * BM_VALID bit before returning buffer so that noone could pin it.It would be better to explain why we want that mode. How about:RBM_ZERO_NO_BM_VALID is the same as RBM_ZERO_AND_LOCK but does not set BM_VALID before returning the buffer. This is done to ensure that no one can pin the buffer without actually reading the buffer contents in. This is necessary while replying XLOG_BTREE_VACUUM records in hot standby.Good point, fixed in attached patch.+		if (mode == RBM_ZERO_NO_BM_VALID)+			TerminateBufferIO(bufHdr, false, 0);+		else+			TerminateBufferIO(bufHdr, false, BM_VALID);Simply passing in a 0 seems a bit odd to me; is there anywhere else we do that?Yes, it is done the same way in FlushBuffer function [0]. Also comments before TerminateBufferIO say that 0 is expected value for third argument.[0] http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/storage/buffer/bufmgr.c;h=a68eae81695f3f3b711d35d6c910e46b031f1cbc;hb=HEAD#l2426

btree_vacuum_v2.patch
Description: Binary data
-- Jim Nasby, Data Architect, Blue Treble ConsultingData in Trouble? Get it in Treble! http://BlueTreble.com
--May the force be with you…https://simply.name

[HACKERS] Improving replay of XLOG_BTREE_VACUUM records

2015-05-01 Thread Vladimir Borodin

Hi all.There are situations in which vacuuming big btree index causes stuck in WAL replaying on hot standby servers for quite a long time. I’ve described the problem in more details in this thread [0]. Below in that thread Kevin Grittner proposed a good way for improving btree scans so that btree vacuuming logic could be seriously simplified. Since I don’t know when that may happen I’ve done a patch that makes some improvement right now. If Kevin or someone else would expand [1] for handling all types of btree scans, I suppose, my patch could be thrown away and vacuuming logic should be strongly rewritten.The idea of the patch is not to read pages from disk to make sure they are unpinned (like btree_xlog_vacuum does it right now). This is done with creating a new ReadBufferMode which returns locked buffer without setting BM_VALID flag on it. I don’t know if that is the right way of doing that but it seems to work well.Testing it my environment gives a good win [2] - green is unpatched replica, blue is replica with a patch, two spikes are results of calling manual vacuuming of big table. Since the picture could be unavailable I’ll write here that: 1. replication delay reduced from ~1250 MB to 200 MB of replay_location lag, 2. patched replica caught master in less than a minute against 12 minutes of unpatched replica, 3. Physical I/O load on patched replica didn’t change compared with the normal workload while unpatched replica did lots of reads from PGDATA during spikes.There is still a stuck in WAL replay but much smaller that right now. Also this change seems not to affect any other scenarios.I’ll add it to 2015-06 commitfest.[0] http://www.postgresql.org/message-id/058c9d59-9200-45fd-a565-0e4431a6f...@simply.name[1] http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ed5b87f96d473962ec5230fd820abfeaccb2069[2] https://yadi.sk/i/l13PZUNhgNB8u

btree_vacuum_v1.patch
Description: Binary data

--May the force be with you…https://simply.name

[HACKERS] ONLY in queries by RI triggers

2015-04-17 Thread Vladimir Borodin

Hi all.

A long time ago in 04b31609b63ce77fb9273193f07cf21b2a7176af ONLY keyword was 
added to all queries in src/backend/utils/adt/ri_triggers.c. Since that time 
foreign keys do not work with inheritance trees and it is mentioned in the 
documentation for all versions since at least 7.3. 

I wonder what are the pitfalls of removing ONLY keyword from those queries? I 
have made such change, it passes all tests and foreign keys for partitioned 
tables do work, but I suppose that there are lots of places where it could 
break something.

Thanks in advance.

--
May the force be with you…
https://simply.name

Re: [HACKERS] Vacuuming big btree indexes without pages with deleted items

2015-04-01 Thread Vladimir Borodin


> 31 марта 2015 г., в 23:33, Kevin Grittner  написал(а):
> 
> Jim Nasby  wrote:
>> On 3/27/15 5:15 AM, Vladimir Borodin wrote:
> 
>>> Master writes this record to xlog in btvacuumscan function after
>>> vacuuming of all index pages. And in case of no pages with
>>> deleted items xlog record would contain lastBlockVacuumed 0.
>>> 
>>> In btree_xlog_vacuum replica reads all blocks from
>>> lastBlockVacuumed to last block of the index while applying this
>>> record because there is no api in the buffer manager to
>>> understand if the page is unpinned.
>>> 
>>> So if the index is quite big (200+ GB in described case) it
>>> takes much time to do it.
> 
>>> 2. Is it possible not to write to xlog record with
>>> lastBlockVacuumed 0 in some cases? For example, in case of not
>>> deleting any pages.
> 
>> Possibly, but that's much higher risk. Without studying it, if we
>> wanted to mess around with that it might actually make more sense
>> to XLOG a set of blkno's that got vacuumed, but I suspect that
>> wouldn't be a win.
> 
> I feel pretty confident that it would be a win in some significant
> cases, but it could be worse in some cases by changing sequential
> access to random, unless we use heuristics to protect against
> that.  But...
> 
>>> Or maybe there are some better ways of improving this situation?
> 
> This is a start of a better way:
> 
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ed5b87f96d473962ec5230fd820abfeaccb2069
> 
> If we expand on that commit to cover non-MVCC index scans,
> index-only scans, and index scans of non-WAL-logged indexes, then
> this whole aspect of btree vacuum can be eliminated.  It seems
> extremely dubious that all of that could be done for 9.5, and it's
> certainly not material for back-patching to any stable branches,
> but it would be a more complete and better-performing fix than the
> alternatives being discussed here.

Kevin, thanks for your work in this direction.

This way seems to be definitely better. It doesn’t matter that it would not be 
included in 9.5 and back-patched to stable versions. This thread is mostly 
about what could be done in the future. If other cases (including index-only 
scans) would be addressed in 9.6, for example, that would be really cool.

> 
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


--
May the force be with you…
https://simply.name

[HACKERS] Vacuuming big btree indexes without pages with deleted items

2015-03-27 Thread Vladimir Borodin

Hi all.

I have described [0] a problem with delaying replicas after vacuuming a 
relation with big btree index. It stucks in replaying WAL record of type 
XLOG_BTREE_VACUUM like that (with lastBlockVacuumed 0):

rmgr: Btree   len (rec/tot): 20/52, tx:  0, lsn: 
4115/56126DC0, prev 4115/56126D90, bkp: , desc: vacuum: rel 
1663/16420/16796; blk 31222118, lastBlockVacuumed 0

Master writes this record to xlog in btvacuumscan [1] function after vacuuming 
of all index pages. And in case of no pages with deleted items xlog record 
would contain lastBlockVacuumed 0.

In btree_xlog_vacuum [2] replica reads all blocks from lastBlockVacuumed to 
last block of the index while applying this record because there is no api in 
the buffer manager to understand if the page is unpinned. 

So if the index is quite big (200+ GB in described case) it takes much time to 
do it. So the questions are:

1. Aren’t there still any api in buffer manager to understand that the page is 
not in shared_buffers without reading it?
2. Is it possible not to write to xlog record with lastBlockVacuumed 0 in some 
cases? For example, in case of not deleting any pages.

Or maybe there are some better ways of improving this situation?

[0] 
http://www.postgresql.org/message-id/fe82a9a7-0d52-41b5-a9ed-967f6927c...@simply.name
[1] 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/nbtree.c;hb=refs/heads/REL9_4_STABLE#l813
[2] 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/nbtxlog.c;hb=refs/heads/REL9_4_STABLE#l482

--
May the force be with you…
https://simply.name

Re: [HACKERS] pg_rewind in contrib

2015-03-26 Thread Vladimir Borodin


> 26 марта 2015 г., в 7:32, Michael Paquier  
> написал(а):
> 
> On Thu, Mar 26, 2015 at 12:23 PM, Venkata Balaji N  wrote:
>> Test 1 :
>> 
>> [...]
>> 
>> If the master is crashed or killed abruptly, it may not be possible to do a
>> rewind. Is my understanding correct ?
> 
> Yep. This is mentioned in the documentation:
> http://www.postgresql.org/docs/devel/static/app-pgrewind.html
> "The target server must shut down cleanly before running pg_rewind».

You can start old master, wait for crash recovery to complete, stop it cleanly 
and then use pg_rewind. It works.

> 
>> Test 2 :
>> 
>> - On a successfully running streaming replication with one master and one
>> slave, i did a clean shutdown of master
>> - promoted slave
>> - performed some operations (data changes) on newly promoted slave and did a
>> clean shutdown
>> - Executed pg_rewind on the old master to sync with the latest changes on
>> new master. I got the below message
>> 
>> The servers diverged at WAL position 0/A298 on timeline 1.
>> No rewind required.
>> 
>> I am not getting this too.
> 
> In this case the master WAL visibly did not diverge from the slave WAL
> line. A rewind is done if the master touches new relation pages after
> the standby has been promoted, and before the master is shutdown.
> -- 
> Michael
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
Да пребудет с вами сила…
https://simply.name/ru

Re: [HACKERS] [GENERAL] [pgadmin-support] Issue with a hanging apply process on the replica db after vacuum works on primary

2015-03-23 Thread Vladimir Borodin


> 20 марта 2015 г., в 18:00, Vladimir Borodin  написал(а):
> 
>> 
>> 19 марта 2015 г., в 20:30, Sergey Shchukin > <mailto:shchukin@gmail.com>> написал(а):
>> 
>> 17.03.2015 13:22, Sergey Shchukin пишет:
>>> 05.03.2015 11:25, Jim Nasby пишет:
>>>> On 2/27/15 5:11 AM, Sergey Shchukin wrote: 
>>>>> 
>>>>> show max_standby_streaming_delay; 
>>>>>   max_standby_streaming_delay 
>>>>> - 
>>>>>   30s 
>>>> 
>>>> We both need to be more clear about which server we're talking about 
>>>> (master or replica). 
>>>> 
>>>> What are max_standby_streaming_delay and max_standby_archive_delay set to 
>>>> *on the replica*? 
>>>> 
>>>> My hope is that one or both of those is set to somewhere around 8 minutes 
>>>> on the replica. That would explain everything. 
>>>> 
>>>> If that's not the case then I suspect what's happening is there's 
>>>> something running on the replica that isn't checking for interrupts 
>>>> frequently enough. That would also explain it. 
>>>> 
>>>> When replication hangs, is the replication process using a lot of CPU? Or 
>>>> is it just sitting there? What's the process status for the replay process 
>>>> show? 
>>>> 
>>>> Can you get a trace of the replay process on the replica when this is 
>>>> happening to see where it's spending all it's time? 
>>>> 
>>>> How are you generating these log lines? 
>>>>  Tue Feb 24 15:05:07 MSK 2015 Stream: MASTER-masterdb:79607161592048 
>>>> SLAVE:79607161550576 Replay:79607160986064 :: REPLAY 592 KBytes 
>>>> (00:00:00.398376 seconds) 
>>>> 
>>>> Do you see the confl_* fields in pg_stat_database_conflicts on the 
>>>> *replica* increasing? 
>>> 
>>> Hi Jim,
>>> 
>>> max_standby_streaming_delay and max_standby_archive_delay  both are 30s on 
>>> master and replica dbs
>>> 
>>> I don't see any specific or heavy workload during this issue with a hanging 
>>> apply process. Just a normal queries as usual. 
>>> 
>>> But I see an increased disk activity during the time when the apply issue 
>>> is ongoing
>>> 
>>> DSK |  sdc  |  | busy 61%  | read   11511 | 
>>>   | write   4534 | KiB/r 46  |  |  KiB/w  4 | MBr/s 
>>>  52.78 |   | MBw/s   1.88 |  avq 1.45 |  |  
>>> avio 0.38 ms |
>>> DSK |  sde  |  | busy 60%  | read   11457 | 
>>>   | write   4398 | KiB/r 46  |  |  KiB/w  4 | MBr/s 
>>>  51.97 |   | MBw/s   1.83 |  avq 1.47 |  |  
>>> avio 0.38 ms |
>>> DSK |  sdd  |  | busy 60%  | read9673 | 
>>>   | write   4538 | KiB/r 61  |  |  KiB/w  4 | MBr/s 
>>>  58.24 |   | MBw/s   1.88 |  avq 1.47 |  |  
>>> avio 0.42 ms |
>>> DSK |  sdj  |  | busy 59%  | read9576 | 
>>>   | write   4177 | KiB/r 63  |  |  KiB/w  4 | MBr/s 
>>>  59.30 |   | MBw/s   1.75 |  avq 1.48 |  |  
>>> avio 0.43 ms |
>>> DSK |  sdh  |  | busy 59%  | read9615 | 
>>>   | write   4305 | KiB/r 63  |  |  KiB/w  4 | MBr/s 
>>>  59.23 |   | MBw/s   1.80 |  avq 1.48 |  |  
>>> avio 0.42 ms |
>>> DSK |  sdf  |  | busy 59%  | read9483 | 
>>>   | write   4404 | KiB/r 63  |  |  KiB/w  4 | MBr/s 
>>>  59.11 |   | MBw/s   1.83 |  avq 1.47 |  |  
>>> avio 0.42 ms |
>>> DSK |  sdi  |  | busy 59%  | read   11273 | 
>>>   | write   4173 | KiB/r 46  |  |  KiB/w  4 | MBr/s 
>>>  51.50 |   | MBw/s   1.75 |  avq 1.43 |  |  
>>> avio 0.38 ms |
>>> DSK |  sdg  |  | busy 59%  | read   11406 | 
>>>   | write   4297 | KiB/r 46  |  |  KiB/w  4 | MBr/s 
>>>  51.66 |   | MBw/s   1.80 |  avq 1.46 |  |  
>>> av

Re: [HACKERS] pg_upgrade and rsync

2015-03-05 Thread Vladimir Borodin


> 6 марта 2015 г., в 6:11, Bruce Momjian  написал(а):
> 
> On Thu, Mar  5, 2015 at 10:55:28AM +0300, Vladimir Borodin wrote:
>>You are correct that a pg_controldata file is copied over that has
>>wal_level=minimal, but that should not be a problem.
>> 
>> 
>> I suppose, this is the root cause of why replica does not start as hot 
>> standby.
>> It it enough to start it as warm standby, but not hot standby.
>> See CheckRequiredParameterValues function in xlog.c which is called inside of
>> StartupXLOG function.
> 
> Yes, you are correct.  I spent all day building a test harness so I
> could automate this setup and test various failures.  I was able to
> reproduce your failure, and you are correct that the proper fix is to
> set wal_level=hot_standby on the new master, and then start and stop the
> new cluster just before rsync.
> 
> The root cause is that pg_upgrade calls pg_resetxlog -o on the new
> cluster _after_ the new cluster stopped for the final time, so rsync is
> copying the incorrect pg_controldata wal_level value.  Also, even if
> pg_resetxlog preserved wal_level in the control file, there is no
> guarantee that the user configured the new cluster's wal_level for
> hot_standby anyway.
> 
> What I have done is to update the pg_upgrade instructions to add this
> required step.  Updated doc patch attached.  (I also added the --delete
> flag to rsync.)  Thanks so much for your detailed report.

It seems to work fine now. The only thing that would be nice to change is to 
explicitly write that these instructions are correct for hot standby 
installations too.

+ 
+  If you have Log-Shipping Standby Servers (), follow these steps to upgrade them (before
+  starting any servers):
+ 

Actually, I’ve entered this thread because it is not obvious from the paragraph 
above or any other places.

> 
>>But it could not be done with --size-only key, because control-file is
>>of fixed
>>size and rsync would skip it. Or may be everything should be copied
>>with
>>--size-only and control-file should be copied without this option.
>> 
>> 
>>Well, what happens is that there is no _new_ standby pg_controldata
>>file, so it is copied fully from the new master.  Are you running initdb
>>to create the new standby --- you shouldn't be doing that as the rsync
>>will do that for you.  
>> 
>> 
>> No, I don’t. The scenario of the problem with copying control-file was in 
>> case
>> when I:
>> 1. ran pg_upgrade on master and got control-file with "wal_level = minimal",
>> 2. did rsync --size-only to replica (and it got this control-file with
>> "wal_level = minimal"),
>> 3. started and stopped postgres on master to get «good» control-file with
>> "wal_level = hot_standby»,
>> 4. did rsync --size-only to replica one more time. And this time control-file
>> is not copied because of the same size of control-file.
>> 
>> Actually, if you don’t do step 2, everything works as expected. Sorry for
>> bothering you.
> 
> Ah, yes, I think doing rsync twice is never a good suggestion.  It can
> lead to too many failures.  Doing the start/stop before rsync seems like
> the best solution.
> 
> -- 
>  Bruce Momjian  http://momjian.us
>  EnterpriseDB http://enterprisedb.com
> 
>  + Everyone has their own god. +
> 


--
May the force be with you…
https://simply.name

Re: [HACKERS] pg_upgrade and rsync

2015-03-04 Thread Vladimir Borodin

> 4 марта 2015 г., в 19:28, Bruce Momjian  написал(а):
> 
> On Wed, Mar  4, 2015 at 01:53:47PM +0300, Vladimir Borodin wrote:
>>After running initdb to create the new master, but before running
>>pg_upgrade, modify the new master's postgresql.conf and change wal_level
>>= hot_standby.  (Don't modify pg_hba.conf at this stage.)
>> 
>> 
>> 
>> That does not help. The reason is that pg_upgrade sets 'Current wal_level
>> setting: minimal' in control-file, and it does not depend on what is set in
>> postgresql.conf before running pg_upgrade. Details could be seen here - 
>> http://
>> pastie.org/9998671.
> 
> Well, what is happening is that the pg_resetxlog commands we run inside
> pg_upgrade set the pg_controldata file's wal_level to minimal, but as
> you saw, starting the server sets the pg_controldata properly. 
> pg_resetxlog is not changing the WAL files at all, just the control
> file.
> 
>> The workaround for this is to start  and cleanly shut down postgres on master
>> after running pg_upgrade but before running rsync. After that there would be 
>> a
>> good control-file for streaming replication and rsync to replica can be done.
> 
> You are correct that a pg_controldata file is copied over that has
> wal_level=minimal, but that should not be a problem.

I suppose, this is the root cause of why replica does not start as hot standby. 
It it enough to start it as warm standby, but not hot standby. See 
CheckRequiredParameterValues function in xlog.c which is called inside of 
StartupXLOG function.

> 
>> But it could not be done with --size-only key, because control-file is of 
>> fixed
>> size and rsync would skip it. Or may be everything should be copied with
>> --size-only and control-file should be copied without this option.
> 
> Well, what happens is that there is no _new_ standby pg_controldata
> file, so it is copied fully from the new master.  Are you running initdb
> to create the new standby --- you shouldn't be doing that as the rsync
> will do that for you.  

No, I don’t. The scenario of the problem with copying control-file was in case 
when I:
1. ran pg_upgrade on master and got control-file with "wal_level = minimal",
2. did rsync --size-only to replica (and it got this control-file with 
"wal_level = minimal"),
3. started and stopped postgres on master to get «good» control-file with  
"wal_level = hot_standby»,
4. did rsync --size-only to replica one more time. And this time control-file 
is not copied because of the same size of control-file.

Actually, if you don’t do step 2, everything works as expected. Sorry for 
bothering you.

> Also, are you cleanly shutting down all the
> servers, or using pg_ctl -m immediate?

I use init-script, it shuts down cleanly with "-m fast".

> 
> -- 
>  Bruce Momjian  http://momjian.us
>  EnterpriseDB http://enterprisedb.com
> 
>  + Everyone has their own god. +

--
May the force be with you…
https://simply.name

Re: [HACKERS] pg_upgrade and rsync

2015-03-04 Thread Vladimir Borodin


> 3 марта 2015 г., в 18:01, Bruce Momjian  написал(а):
> 
> On Tue, Mar  3, 2015 at 04:55:56PM +0300, Vladimir Borodin wrote:
>>OK, hmmm.  Thanks for testing.  It feels like you didn't have your new
>>master set up for streaming replication when you ran pg_upgrade.  Is
>>that correct?  Should I specify that specifically in the instructions?
>> 
>> 
>> After running pg_upgrade I do put in new PGDATA on master old pg_hba.conf and
>> postgresql.conf with wal_level = hot_standby. The full content of
>> postgresql.conf could be seen here - http://pastie.org/9995902. Then I do 
>> rsync
>> to replica, put recovery.conf and try to start both - first master, then
>> replica. If I turn off hot_standby in replica configuration, it starts. What 
>> am
>> I doing wrong?
> 
> After running initdb to create the new master, but before running
> pg_upgrade, modify the new master's postgresql.conf and change wal_level
> = hot_standby.  (Don't modify pg_hba.conf at this stage.)
> 

That does not help. The reason is that pg_upgrade sets 'Current wal_level 
setting: minimal' in control-file, and it does not depend on what is set in 
postgresql.conf before running pg_upgrade. Details could be seen here - 
http://pastie.org/9998671 <http://pastie.org/9998671>.

The workaround for this is to start  and cleanly shut down postgres on master 
after running pg_upgrade but before running rsync. After that there would be a 
good control-file for streaming replication and rsync to replica can be done. 
But it could not be done with --size-only key, because control-file is of fixed 
size and rsync would skip it. Or may be everything should be copied with 
--size-only and control-file should be copied without this option.

> I didn't think that was necessary, but this might be some 9.3-specific
> problem, but let's get it working first.
> 
> -- 
>  Bruce Momjian  http://momjian.us
>  EnterpriseDB http://enterprisedb.com
> 
>  + Everyone has their own god. +


--
Да пребудет с вами сила…
https://simply.name/ru

Re: [HACKERS] pg_upgrade and rsync

2015-03-03 Thread Vladimir Borodin


> 3 марта 2015 г., в 16:38, Bruce Momjian  написал(а):
> 
> On Tue, Mar  3, 2015 at 11:38:58AM +0300, Vladimir Borodin wrote:
>>No, you would not need to take a full backup if you use these 
>> instructions.
>> 
>> 
>> Although it would be applied to documentation for 9.5 only, are these
>> instructions applicable for upgrading from 9.3.6 to 9.4.1?
> 
> Yes.  They work all the way back to 9.0.
> 
>> Following the instructions from patch I’ve got following errors in
>> postgresql.log of replica after trying to start it with hot_standby = on:
>> 
>> < 2015-02-24 11:47:22.861 MSK >WARNING:  WAL was generated with wal_level=
>> minimal, data may be missing
>> < 2015-02-24 11:47:22.861 MSK >HINT:  This happens if you temporarily set
>> wal_level=minimal without taking a new base backup.
>> < 2015-02-24 11:47:22.861 MSK >FATAL:  hot standby is not possible because
>> wal_level was not set to "hot_standby" or higher on the master server
>> < 2015-02-24 11:47:22.861 MSK >HINT:  Either set wal_level to "hot_standby" 
>> on
>> the master, or turn off hot_standby here.
>> < 2015-02-24 11:47:22.862 MSK >LOG:  startup process (PID 28093) exited with
>> exit code 1
>> < 2015-02-24 11:47:22.862 MSK >LOG:  aborting startup due to startup process
>> failure
> 
> OK, hmmm.  Thanks for testing.  It feels like you didn't have your new
> master set up for streaming replication when you ran pg_upgrade.  Is
> that correct?  Should I specify that specifically in the instructions?

After running pg_upgrade I do put in new PGDATA on master old pg_hba.conf and 
postgresql.conf with wal_level = hot_standby. The full content of 
postgresql.conf could be seen here - http://pastie.org/9995902 
<http://pastie.org/9995902>. Then I do rsync to replica, put recovery.conf and 
try to start both - first master, then replica. If I turn off hot_standby in 
replica configuration, it starts. What am I doing wrong?

> 
> -- 
>  Bruce Momjian  http://momjian.us
>  EnterpriseDB http://enterprisedb.com
> 
>  + Everyone has their own god. +


--
Да пребудет с вами сила…
https://simply.name/ru

Re: [HACKERS] pg_upgrade and rsync

2015-03-03 Thread Vladimir Borodin

> 2 марта 2015 г., в 21:28, Bruce Momjian  написал(а):
> 
> On Tue, Feb 24, 2015 at 12:13:17PM +0300, Vladimir Borodin wrote:
>> 
>>20 февр. 2015 г., в 18:21, Bruce Momjian  написал(а):
>> 
>>On Fri, Feb 20, 2015 at 09:45:08AM -0500, Bruce Momjian wrote:
>> 
>>#3 bothered me as well because it was not specific enough.  I like
>>what
>>you've added to clarify the procedure.
>> 
>> 
>>Good.  It took me a while to understand why they have to be in sync 
>> ---
>>because we are using rsync in size-only-comparison mode, if they are
>>not
>>in sync we might update some files whose sizes changed, but not 
>> others,
>>and the old slave would be broken.  The new slave is going to get all
>>new files or hard links for user files, so it would be fine, but we
>>should be able to fall back to the old slaves, and having them in sync
>>allows that.
>> 
>> 
>>Also, since there was concern about the instructions, I am thinking of
>>applying the patch only to head for 9.5, and then blog about it if
>>people want to test it.
>> 
>> 
>> Am I right that if you are using hot standby with both streaming replication
>> and WAL shipping you do still need to take full backup of master after using
>> pg_upgrade?
> 
> No, you would not need to take a full backup if you use these instructions.

Although it would be applied to documentation for 9.5 only, are these 
instructions applicable for upgrading from 9.3.6 to 9.4.1?

Following the instructions from patch I’ve got following errors in 
postgresql.log of replica after trying to start it with hot_standby = on:

< 2015-02-24 11:47:22.861 MSK >WARNING:  WAL was generated with 
wal_level=minimal, data may be missing
< 2015-02-24 11:47:22.861 MSK >HINT:  This happens if you temporarily set 
wal_level=minimal without taking a new base backup.
< 2015-02-24 11:47:22.861 MSK >FATAL:  hot standby is not possible because 
wal_level was not set to "hot_standby" or higher on the master server
< 2015-02-24 11:47:22.861 MSK >HINT:  Either set wal_level to "hot_standby" on 
the master, or turn off hot_standby here.
< 2015-02-24 11:47:22.862 MSK >LOG:  startup process (PID 28093) exited with 
exit code 1
< 2015-02-24 11:47:22.862 MSK >LOG:  aborting startup due to startup process 
failure

> 
> -- 
>  Bruce Momjian  http://momjian.us
>  EnterpriseDB http://enterprisedb.com
> 
>  + Everyone has their own god. +

--
Да пребудет с вами сила…
https://simply.name/ru

Re: [HACKERS] pg_upgrade and rsync

2015-02-24 Thread Vladimir Borodin


> 20 февр. 2015 г., в 18:21, Bruce Momjian  написал(а):
> 
> On Fri, Feb 20, 2015 at 09:45:08AM -0500, Bruce Momjian wrote:
>>> #3 bothered me as well because it was not specific enough.  I like what
>>> you've added to clarify the procedure.
>> 
>> Good.  It took me a while to understand why they have to be in sync ---
>> because we are using rsync in size-only-comparison mode, if they are not
>> in sync we might update some files whose sizes changed, but not others,
>> and the old slave would be broken.  The new slave is going to get all
>> new files or hard links for user files, so it would be fine, but we
>> should be able to fall back to the old slaves, and having them in sync
>> allows that.
> 
> Also, since there was concern about the instructions, I am thinking of
> applying the patch only to head for 9.5, and then blog about it if
> people want to test it.

Am I right that if you are using hot standby with both streaming replication 
and WAL shipping you do still need to take full backup of master after using 
pg_upgrade?

> 
> -- 
>  Bruce Momjian  http://momjian.us
>  EnterpriseDB http://enterprisedb.com
> 
>  + Everyone has their own god. +
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


--
May the force be with you...
https://simply.name

Re: [HACKERS] Check that streaming replica received all data after master shutdown

2015-01-13 Thread Vladimir Borodin


05 янв. 2015 г., в 18:15, Vladimir Borodin  написал(а):

> Hi all.
> 
> I have a simple script for planned switchover of PostgreSQL (9.3 and 9.4) 
> master to one of its replicas. This script checks a lot of things before 
> doing it and one of them is that all data from master has been received by 
> replica that is going to be promoted. Right now the check is done like below:
> 
> On the master:
> 
> postgres@pgtest03d ~ $ psql -t -A -c 'select pg_current_xlog_location();'
> 0/3390
> postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_ctl stop -m fast
> waiting for server to shut down done
> server stopped
> postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_controldata | head
> pg_control version number:937
> Catalog version number:   201306121
> Database system identifier:   6061800518091528182
> Database cluster state:   shut down
> pg_control last modified: Mon 05 Jan 2015 06:47:57 PM MSK
> Latest checkpoint location:   0/3428
> Prior checkpoint location:0/3328
> Latest checkpoint's REDO location:0/3428
> Latest checkpoint's REDO WAL file:001B0034
> Latest checkpoint's TimeLineID:   27
> postgres@pgtest03d ~ $
> 
> On the replica (after shutdown of master):
> 
> postgres@pgtest03g ~ $ psql -t -A -c "select 
> pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/3428');"
> 104
> postgres@pgtest03g ~ $
> 
> These 104 bytes seems to be the size of shutdown checkpoint record (as I can 
> understand from pg_xlogdump output).
> 
> postgres@pgtest03g ~/9.3/data/pg_xlog $ /usr/pgsql-9.3/bin/pg_xlogdump -s 
> 0/3390 -t 27
> rmgr: XLOGlen (rec/tot):  0/32, tx:  0, lsn: 
> 0/3390, prev 0/3328, bkp: , desc: xlog switch
> rmgr: XLOGlen (rec/tot): 72/   104, tx:  0, lsn: 
> 0/3428, prev 0/3390, bkp: , desc: checkpoint: redo 0/3428; 
> tli 27; prev tli 27; fpw true; xid 0/6010; oid 54128; multi 1; offset 0; 
> oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; 
> shutdown
> pg_xlogdump: FATAL:  error in WAL record at 0/3428: record with zero 
> length at 0/3490
> 
> postgres@pgtest03g ~/9.3/data/pg_xlog $
> 
> I’m not sure that these 104 bytes will always be 104 bytes to have a strict 
> equality while checking. Could it change in the future? Or is there a better 
> way to understand that streaming replica received all data after master 
> shutdown? The check that pg_xlog_location_diff returns 104 bytes seems a bit 
> strange.
> 

+hackers

Could anyone help?

Thanks.

> Thanks.
> 
> --
> May the force be with you...
> http://simply.name
> 
> 
> 
> 


--
May the force be with you...
http://simply.name

72 matches

Mail list logo