Re: [GENERAL] Strange checkpoint behavior - checkpoints take a long time

2017-10-05 Thread Igor Polishchuk
Vladimir,
Just curious: if your goal is to reduce checkpoint overhead, shouldn’t you 
decrease shared_buffers instead of increasing it? 
With bigger shared_buffers, you can accumulate more dirty buffers for the 
checkpoint to take care of. 
I remember that in early versions (around 8.4), when checkpoint_completion_target 
was not available, one suggested way of fighting heavy checkpoints was 
setting shared_buffers very low.
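If you want to see how much dirty data actually accumulates in shared_buffers 
between checkpoints, a rough check is possible with the pg_buffercache extension 
(assuming it is installed; it scans the whole buffer pool, so don’t run it in a 
tight loop on a busy box):

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;
    -- Count dirty buffers and estimate their size (assumes the default 8 kB block size).
    SELECT count(*)                        AS dirty_buffers,
           pg_size_pretty(count(*) * 8192) AS dirty_data
    FROM pg_buffercache
    WHERE isdirty;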
Also, why do you need to reduce your checkpoint write time from 25 minutes to 
20? What will you gain? If you have the same number of dirty buffers to flush, 
your I/O intensity and overhead will only increase during those 20 minutes: 
flushing the same data in 1,200 seconds instead of 1,500 means roughly 25% more 
write bandwidth while the checkpoint is running.

Igor

  
> On Oct 5, 2017, at 12:58, Vladimir Nicolici  wrote:
> 
> Some further updates about the issue.
>  
> I did a bit of benchmarking on the disk system with iozone, and during the 
> test the SSDs seemed able to easily sustain 200 MB/second of writes each; 
> they fluctuated between 200 MB/s and 400 MB/s when doing 96 GB of random 
> writes in a file. That would mean between 400 and 800 MB/s for the entire 
> RAID volume, since it’s 1+0 and has 4 SSDs, 2 in each mirror.
>  
> I wasn’t able to benchmark reads properly because the machine has so much RAM 
> that they work mostly from cache. But reads  shouldn’t be an issue anyway. 
> I’ll try to do more comprehensive tests tomorrow but, from what I’ve seen so 
> far, SSD I/O limits shouldn’t be a concern.
>  
> I changed some configuration parameters during the night to the values I was 
> considering yesterday:
>  
> shared_buffers = 144GB #previously 96 GB
> bgwriter_lru_maxpages = 100  #previously 400
> checkpoint_timeout = 30min  #previously 5min
> checkpoint_completion_target = 0.83 #previously 0.85; 0.83 means about 25 
> minutes of writes out of 30 minutes.
> max_wal_size = 96GB #previously 16GB
> wal_buffers = 16MB  #previously 32 MB
>  
> With the new settings the checkpoints now finish on time, more or less. One 
> recent checkpoint looked like this:
>  
> 2017-10-05 14:16:22.891 EDT [7828] LOG:  checkpoint starting: time
> 2017-10-05 14:42:35.429 EDT [7828] LOG:  checkpoint complete: wrote 4770679 
> buffers (25.3%); 0 transaction log file(s) added, 0 removed, 2088 recycled; 
> write=1501.567 s, sync=1.844 s, total=1572.538 s; sync files=750, 
> longest=0.029 s, average=0.002 s; distance=33579763 kB, estimate=35950738 kB
>  
> So it took 1502 seconds for writes, instead of the configured 1494; that’s 
> close enough, about 8 seconds more. Sync was another 2 seconds, so roughly 
> 1504 for write + sync. However, the total is 1573 seconds, 69 seconds more. 
> Not sure what the checkpoint process does in those 69 seconds.
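For what it’s worth, the cumulative counters in pg_stat_bgwriter give another view 
of where checkpoint time goes; a rough sketch, assuming 9.2 or later for the two 
time columns (values are totals since the last stats reset, in milliseconds):

    SELECT checkpoints_timed,
           checkpoints_req,
           buffers_checkpoint,      -- buffers written by checkpoints
           buffers_clean,           -- buffers written by the background writer
           buffers_backend,         -- buffers written directly by backends
           checkpoint_write_time,   -- total time spent in the write phase (ms)
           checkpoint_sync_time     -- total time spent in the sync phase (ms)
    FROM pg_stat_bgwriter;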
>  
> Looking further back at the logs with the checkpoints from today, the first 
> checkpoint of the day that did some significant work didn’t have such a large 
> difference between write time and checkpoint time, it had write + sync  = 
> 1494, and total 1495, just 1 second difference:
>  
> 2017-10-05 09:16:22.851 EDT [7828] LOG:  checkpoint starting: time
> 2017-10-05 09:41:18.139 EDT [7828] LOG:  checkpoint complete: wrote 4445573 
> buffers (23.6%); 0 transaction log file(s) added, 0 removed, 98 recycled; 
> write=1493.322 s, sync=0.575 s, total=1495.287 s; sync files=734, 
> longest=0.022 s, average=0.000 s; distance=29166105 kB, estimate=29166105 kB
>  
> The difference is that it didn’t need to recycle as many WAL segments as the 
> later one. Not sure exactly what the recycling does (it appears to rename old 
> segment files for reuse rather than deleting them), but it seems to add some 
> overhead to the process. The checkpoint with the 69-second delay recycled 
> 2088 WAL segments; the one with the 1-second delay recycled just 98.
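To get a feel for how many segments sit in the WAL pool (and therefore how much a 
checkpoint can end up recycling), a superuser can count the files directly; a 
sketch assuming the pre-10 directory name (on version 10 the directory is pg_wal 
and pg_ls_waldir() does the listing):

    -- Assumes 16 MB segments and the 9.x 'pg_xlog' directory name.
    SELECT count(*)                                    AS wal_segments,
           pg_size_pretty(count(*) * 16 * 1024 * 1024) AS approx_wal_size
    FROM pg_ls_dir('pg_xlog') AS t(f)
    WHERE f ~ '^[0-9A-F]{24}$';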
>  
> If the overhead is indeed caused by recycling, I’m not sure if it’s 
> proportional to the size of the recycled segments, or with the number of 
> recycled segments. If it’s the number of segments, then maybe a wal segment 
> size larger than 16 MB would help with this issue, but unfortunately for the 
> time being that can only be configured at compile time and forces you to 
> recreate the cluster as well.
>  
> For tomorrow I plan to reduce bgwriter_lru_maxpages from 100 to 0 and rely 
> just on checkpoints for datafile writes, and to reduce 
> checkpoint_completion_target from 0.83 to 0.66 to try to make the checkpoints 
> finish the writes in 20 minutes instead of 25, out of 30 minutes.
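If the server is 9.4 or newer, both of those settings can be changed without a 
restart, since they are reload-level parameters; a minimal sketch:

    ALTER SYSTEM SET bgwriter_lru_maxpages = 0;
    ALTER SYSTEM SET checkpoint_completion_target = 0.66;
    SELECT pg_reload_conf();
    -- Verify the new values took effect:
    SHOW bgwriter_lru_maxpages;
    SHOW checkpoint_completion_target;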
>  
> Thanks,
> Vlad



Re: [GENERAL] Rsync to a recovering streaming replica?

2017-09-27 Thread Igor Polishchuk
Scott,
Thank you for your insight. I do have some extra disk and network throughput to 
spare. However, my question is ‘Can I run rsync while streaming is running?’
A streaming replica is a physical copy of the master, so why not? My concern is a 
possible silent introduction of block corruption that would not be fixed by the 
block copies (full-page images) in the WAL files. I think such corruption should 
not happen, and I have seen a few instances where running rsync seemed to work. 
I’m curious whether somebody is aware of a situation where corruption is likely 
to happen.

Igor

> On Sep 27, 2017, at 12:48, Scott Mead <sco...@openscg.com> wrote:
> 
> 
> 
> On Wed, Sep 27, 2017 at 1:59 PM, Igor Polishchuk <ora4...@gmail.com> wrote:
> Sorry, here are the missing details, if it helps:
> Postgres 9.6.5 on CentOS 7.2.1511
> 
> > On Sep 27, 2017, at 10:56, Igor Polishchuk <ora4...@gmail.com> wrote:
> >
> > Hello,
> > I have a multi-terabyte streaming replica on a busy database. When I set it 
> > up, repetitive rsyncs take at least 6 hours each.
> > So, when I start the replica, it begins streaming, but it is many hours 
> > behind right from the start. It is working for hours, and cannot reach a 
> > consistent state
> > so the database is not getting opened for queries. I have plenty of WAL 
> > files available in the master’s pg_xlog, so the replica never uses archived 
> > logs.
> > A question:
> > Should I be able to run one more rsync from the master to my replica while 
> > it is streaming?
> > The idea is to overcome the throughput limit imposed by a single recovery 
> > process on the replica and allow it to catch up quicker.
> > I remember doing it many years ago on Pg 8.4, and I have also heard from 
> > other people doing it. In all cases, it seemed to work.
> > I'm just not sure there isn't a risk of introducing some hidden data 
> > corruption, which I may not notice for a while on such a huge database.
> > Any educated opinions on the subject here?
> 
> It really comes down to the amount of I/O (network and disk) your system can 
> handle while under load.  I've used 2 methods to do this in the past:
> 
> - http://moo.nac.uci.edu/~hjm/parsync/
> 
>   parsync (parallel rsync) is nice; it does all the hard work of 
> parallelizing rsync for you. It's just a pain to get all the prereqs installed.
> 
> 
> - rsync --itemize-changes
>   Essentially, use this to get a list of files, manually split them out and 
> fire up a number of rsyncs.  parsync does this for you, but, if you can't get 
> it going for any reason, this works.
> 
> 
> The real trick: after you do your parallel rsync, make sure that you run one 
> final rsync to sync up any missed items.
> 
> Remember, it's all about I/O.  The more parallel threads you use, the harder 
> you'll beat up the disks / network on the master, which could impact 
> production.
> 
> Good luck
> 
> --Scott
> 
> >
> > Thank you
> > Igor Polishchuk
> 
> 
> 
> 
> 
> 
> --
> Scott Mead
> Sr. Architect
> OpenSCG
> http://openscg.com


Re: [GENERAL] Rsync to a recovering streaming replica?

2017-09-27 Thread Igor Polishchuk
Sorry, here are the missing details, if it helps:
Postgres 9.6.5 on CentOS 7.2.1511

> On Sep 27, 2017, at 10:56, Igor Polishchuk <ora4...@gmail.com> wrote:
> 
> Hello,
> I have a multi-terabyte streaming replica on a busy database. When I set it 
> up, repetitive rsyncs take at least 6 hours each.
> So, when I start the replica, it begins streaming, but it is many hours 
> behind right from the start. It is working for hours, and cannot reach a 
> consistent state
> so the database is not getting opened for queries. I have plenty of WAL files 
> available in the master’s pg_xlog, so the replica never uses archived logs.
> A question:
> Should I be able to run one more rsync from the master to my replica while it 
> is streaming?
> The idea is to overcome the throughput limit imposed by a single recovery 
> process on the replica and allow it to catch up quicker. 
> I remember doing it many years ago on Pg 8.4, and I have also heard from 
> other people doing it. In all cases, it seemed to work. 
> I'm just not sure there isn't a risk of introducing some hidden data 
> corruption, which I may not notice for a while on such a huge database.
> Any educated opinions on the subject here? 
> 
> Thank you
> Igor Polishchuk  





[GENERAL] Rsync to a recovering streaming replica?

2017-09-27 Thread Igor Polishchuk
Hello,
I have a multi-terabyte streaming replica on a busy database. When I set it up, 
repetitive rsyncs take at least 6 hours each.
So, when I start the replica, it begins streaming, but it is many hours behind 
right from the start. It is working for hours, and cannot reach a consistent 
state
so the database is not getting opened for queries. I have plenty of WAL files 
available in the master’s pg_xlog, so the replica never uses archived logs.
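As a side note, while the replica is catching up you can watch how far behind 
replay is from a psql session on the replica; a sketch using the 9.6 function 
names (these were renamed from "xlog" to "wal" in version 10):

    SELECT pg_last_xlog_receive_location()          AS received_lsn,
           pg_last_xlog_replay_location()           AS replayed_lsn,
           pg_size_pretty(pg_xlog_location_diff(
               pg_last_xlog_receive_location(),
               pg_last_xlog_replay_location()))     AS replay_backlog,
           now() - pg_last_xact_replay_timestamp()  AS approx_replay_lag;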
A question:
Should I be able to run one more rsync from the master to my replica while it 
is streaming?
The idea is to overcome the throughput limit imposed by a single recovery 
process on the replica and allow it to catch up quicker. 
I remember doing it many years ago on Pg 8.4, and I have also heard from other 
people doing it. In all cases, it seemed to work. 
I'm just not sure there isn't a risk of introducing some hidden data 
corruption, which I may not notice for a while on such a huge database.
Any educated opinions on the subject here? 

Thank you
Igor Polishchuk  



Re: [GENERAL] Upgrade to 9.1 causing function problem

2012-02-23 Thread Igor Polishchuk
Actually, what works is:
set search_path = 'crabdata', 'public';
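For reference, a minimal sketch of the per-role form and how to verify it (the 
ALTER only affects sessions started afterwards, so reconnect before checking):

    ALTER ROLE postgres SET search_path = crabdata, public;
    -- In a new session:
    SHOW search_path;
    SELECT current_schemas(true);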


On 2/23/12 1:10 PM, Adrian Klaver adrian.kla...@gmail.com wrote:

 On 02/23/2012 01:08 PM, Willem Buitendyk wrote:
 I have it set in postgresql.conf and I've also used:

 alter user postgres set search_path = crabdata,public;
 
 
 Well, search_path is a string, so have you tried:
 search_path='crabdata,public'
 





[GENERAL] Transaction ID wraparound, Oracle style

2012-01-18 Thread Igor Polishchuk
Here is an article on a recently discovered Oracle flaw, which allows the SCN to
reach its limit:
http://www.computerworld.com/s/article/9223506/Fundamental_Oracle_flaw_revealed?taxonomyId=18&pageNumber=1
 
Please don't beat me for posting a link to an Oracle-related article.
If you despise the very notion of mentioning Oracle, please just don't read
the post.
This article may be interesting to any RDBMS professional, no matter what DB
flavor he or she is working with.
Also, this story may be a lesson for the PostgreSQL community on how not to do
things. I'm not a developer, but it seems that having synchronized
transaction IDs between, let's say, streaming-replicated databases would give
some advantages if done properly.
Regards
Igor Polishchuk





Re: [GENERAL] Re: Large values for duration of COMMITs and slow queries. Due to large WAL config values?

2011-11-12 Thread Igor Polishchuk
I saw such behavior a few years ago on multiple very busy databases
connected to the same EMC SAN. The SAN's cache got overwhelmed by the
databases' I/O, and the storage latency went up significantly. I don't
remember now what the latency was, but it was above 40 ms.
Is everything OK with your storage system? Is it possible your databases
produce more IOPS than your storage can handle?
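One cheap check from inside the database is whether checkpoints are being forced 
by WAL volume rather than by the timer, and whether backends are being forced to 
do their own writes and fsyncs; both patterns tend to show up as exactly this 
kind of latency spike. A sketch using columns that already exist in 9.1:

    SELECT checkpoints_timed,
           checkpoints_req,       -- checkpoints forced by filling checkpoint_segments
           buffers_backend,       -- buffers written directly by backends
           buffers_backend_fsync  -- fsyncs done by backends themselves: a bad sign
    FROM pg_stat_bgwriter;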


On Sat, 2011-11-12 at 13:47 -0800, Cody Caughlan wrote:
 I've run VACUUM ANALYZE on all my tables to make sure the house has
 been cleaned. I still see a lot of slow queries / commits, even on
 primary key lookups and well indexed tables.
 
 /Cody
 
 On Fri, Nov 11, 2011 at 11:04 PM, Cody Caughlan tool...@gmail.com wrote:
  Postgres 9.1.1, master with 2 slaves via streaming replication.
 
  I've enabled slow query logging of 150ms and am seeing a large number
  of slow COMMITs:
 
  2011-11-12 06:55:02 UTC pid:30897 (28/0-0) LOG:  duration: 232.398 ms
  statement: COMMIT
  2011-11-12 06:55:08 UTC pid:30896 (27/0-0) LOG:  duration: 1078.789 ms
   statement: COMMIT
  2011-11-12 06:55:09 UTC pid:30842 (15/0-0) LOG:  duration: 2395.432 ms
   statement: COMMIT
  2011-11-12 06:55:09 UTC pid:30865 (23/0-0) LOG:  duration: 2395.153 ms
   statement: COMMIT
  2011-11-12 06:55:09 UTC pid:30873 (17/0-0) LOG:  duration: 2390.106 ms
   statement: COMMIT
 
  The machine has 16GB of RAM and plenty of disk space. What I think
  might be relevant settings are:
 
  wal_buffers = 16MB
  checkpoint_segments = 32
  max_wal_senders = 10
  checkpoint_completion_target = 0.9
  wal_keep_segments = 1024
  maintenance_work_mem = 256MB
  work_mem = 88MB
  shared_buffers = 3584MB
  effective_cache_size = 10GB
 
  Recently we have bumped up wal_keep_segments and checkpoint_segments
  because we wanted to run long-running queries on the slaves and were
  receiving cancellation errors there. I think the master was
  recycling WAL logs from underneath the slaves and thus canceling the
  queries. Hence, I believed I needed to crank up those values. It seems
  to work; I can run long queries (for statistics / reports) on the
  slaves just fine.
 
  But I now wonder if it's having an adverse effect on the master, namely
  these slow commit times and other slow queries (e.g. primary key
  lookups on tables with not that many records), which seem to have
  increased since the configuration change.
 
  I am watching iostat and, sure enough, when %iowait gets above 15 or so,
  a bunch more slow queries get logged. So I can see it's disk
  related.
 
  I just dont know what the underlying cause is.
 
  Any pointers would be appreciated. Thank you.
 
 





Re: [GENERAL] CLONE DATABASE (with copy on write?)

2011-11-12 Thread Igor Polishchuk
I'm in a similar position, cloning a database multiple times to provide
development and QA databases for multiple groups. The feature Clark describes
would greatly help many people with their development and QA databases.
On the other hand, I imagine a feature like this would not be easy to
develop, and it would be useful for many, but not all, users.
So I doubt we will have it any time soon.


On Sat, 2011-11-12 at 16:40 -0500, Clark C. Evans wrote:
 Hello all!  
 
 Our company has some headaches in our application development
 and deployment process. The chief problem is creating stages,
 which to this audience means cloning a database efficiently,
 making and testing a few changes, perhaps recording the
 differences between databases, and then dropping the database.
 
 I'm eternally grateful to whoever pointed out that we
 should be using CREATE DATABASE ... WITH TEMPLATE.  However,
 this has two big disadvantages.  First, it only works if you
 can kick the users off the clone.  Secondly, it still takes
 time, uses disk space, etc.  We have some big databases.  
 
 I was also thinking about using ZFS with PostgreSQL to do
 some sort of copy-on-write.  However, this would require me
 to spawn a whole *new* PostgreSQL instance.  In both of these
 cases, you lose your cache...
 
 So, I was wondering... could PostgreSQL grow the ability to
 CLONE a database by re-using existing file system blocks,
 sharing them across databases?  This would perhaps be fast,
 keep the shared memory cache relevant for both the old copy
 and the clone, and remove WAL overhead.  Then, if the block
 has to be edited, it'd be cloned in memory, and the clone 
 would be flushed.  
 
 I'm just imagining a world where deploying a new version
 of our software that makes very small catalog changes and
 tweaks a few rows would be... quick.  Quick to try, check,
 test, and even deploy on live servers.
 
 Best,
 
 Clark
 


