Re: [HACKERS] Asynchronous execution on FDW
Ouch! I mistakenly made two CF entries for this patch. Could someone remove this entry for me? https://commitfest.postgresql.org/5/290/ The correct entry is /5/291/

==

Hello. This is the new version of the FDW async execution feature. The status of this feature as of the last commitfest is as follows:

- Async execution is valuable to have.
- But doing the first kick in the ExecInit phase is wrong.

So the design outline of this version is as follows:

- The patch set consists of three parts. The first is the infrastructure on the core side, the second is the code to enable asynchronous execution in postgres_fdw, and the third is an alternative set of three methods to adapt the fetch size, which makes asynchronous execution more effective.

- It was a problem when to give the first kick for async execution. It does not belong in the ExecInit phase, and the ExecProc phase does not fit either; an extra phase such as ExecPreProc would be too invasive. So I tried a pre-exec callback: any node can register callbacks during initialization, and the registered callbacks are called just before the ExecProc phase in the executor. The first patch adds the functions and structs to enable this.

- The second part is unchanged from the previous version. It adds PgFdwConn as an extended PGconn with some members to support asynchronous execution. Asynchronous execution is kicked only for the first ForeignScan node on the same foreign server, and that state lasts until the next scan comes. This behavior is mainly controlled in fetch_more_data(). It limits the number of simultaneous executions for one foreign server to 1; this was decided because no reasonable method to limit the multiplicity of execution on a *single peer* has been found so far.

- The third part is three kinds of trials of an adaptive fetch size feature. The first one is duration-based adaptation: the patch increases the fetch size on every FETCH execution but tries to keep the duration of every FETCH below 500 ms.
But it is not promising, because it looks very unstable and its behavior is nearly unforeseeable.

The second one is based on a byte-based FETCH feature. This patch adds to the FETCH command an argument that limits the number of bytes (octets) to send. But this might be an over-exposure of the internals: the size is counted based on the internal representation of a tuple, so the client needs to send the overhead of its internal tuple representation in bytes. This is effective but quite ugly.

The third is the simplest and most straightforward way: add a foreign table option to specify the fetch_size. The effect of this is also in doubt, since the size of the tuples for one foreign table varies with the returned-column list. But it is foreseeable for users, and is a necessary knob for those who want to tune it. The foreign server could also have the same option as the default for its foreign tables, but this patch has not added that.

The attached patches are the following:

- 0001-Add-infrastructure-of-pre-execution-callbacks.patch
  Infrastructure of pre-execution callback
- 0002-Allow-asynchronous-remote-query-of-postgres_fdw.patch
  FDW asynchronous execution feature
- 0003a-Add-experimental-POC-adaptive-fetch-size-feature.patch
  Adaptive fetch size alternative 1: duration-based control
- 0003b-POC-Experimental-fetch_by_size-feature.patch
  Adaptive fetch size alternative 2: FETCH by size
- 0003c-Add-foreign-table-option-to-set-fetch-size.patch
  Adaptive fetch size alternative 3: foreign table option

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
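The pre-exec callback idea above — nodes register callbacks at init time, and the executor fires them all once, just before the ExecProc phase — can be sketched as follows. This is my own minimal illustration: the names (RegisterPreExecCallback, RunPreExecCallbacks) and the file-scope list are assumptions, not the patch's actual API, which would presumably hang the list off EState and allocate in a memory context rather than with malloc().

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback signature: invoked once, just before ExecProc phase. */
typedef void (*PreExecCallback) (void *arg);

typedef struct PreExecCallbackItem
{
	PreExecCallback func;
	void	   *arg;
	struct PreExecCallbackItem *next;
} PreExecCallbackItem;

/* In the real patch this list would presumably live in EState. */
static PreExecCallbackItem *preExecCallbacks = NULL;

/* Nodes would call this from their ExecInit functions. */
static void
RegisterPreExecCallback(PreExecCallback func, void *arg)
{
	PreExecCallbackItem *item = malloc(sizeof(PreExecCallbackItem));

	item->func = func;
	item->arg = arg;
	item->next = preExecCallbacks;
	preExecCallbacks = item;
}

/* The executor runs this once, between ExecInit and the first ExecProc. */
static void
RunPreExecCallbacks(void)
{
	for (PreExecCallbackItem *item = preExecCallbacks; item != NULL; item = item->next)
		item->func(item->arg);
}
```

This gives a ForeignScan node a hook to start its asynchronous remote query after the whole plan tree is initialized but before any tuples are demanded.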
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 2015-07-02 PM 03:43, Beena Emerson wrote:
> Amit wrote:
>> Does HA software determine a standby to promote based on replication progress or would things be reliable enough for it to infer one from the quorum setting specified in GUC (or wherever)? Is part of the job of this patch to make the latter possible? Just wondering or perhaps I am completely missing the point.
>
> Deciding the failover standby is not exactly part of this patch but we should be able to set up a mechanism to decide which is the best standby to be promoted. We might not be able to conclude this from the sync parameter alone. As specified before, in some cases an async standby could also be the most eligible for promotion.

Thanks for the explanation.

Regards,
Amit
[HACKERS] Odd behaviour of SELECT ... ORDER BY ... FOR UPDATE
Hi,

While working on the foreign-join-pushdown issue, I noticed that in READ COMMITTED isolation level it's possible that the result of SELECT ... ORDER BY ... FOR UPDATE is not sorted correctly, due to concurrent updates that replaced the sort key columns with new values, as shown in the example below. That seems odd to me. So, I'd like to propose raising an error rather than returning a possibly-incorrect result for cases where the sorted tuples to be locked were modified by concurrent updates. Patch attached. Is it OK to add this to the current CF?

Create an environment:

postgres=# create table test (a int);
CREATE TABLE
postgres=# insert into test values (1);
INSERT 0 1
postgres=# insert into test values (2);
INSERT 0 1

Run an example:

[Terminal 1]
postgres=# begin;
BEGIN
postgres=# update test set a = 3 where a = 1;
UPDATE 1

[Terminal 2]
postgres=# select * from test order by a for update;

[Terminal 1]
postgres=# commit;
COMMIT

[Terminal 2]
(The following result will be shown after the commit in Terminal 1. Note that the output ordering is not correct.)
 a
---
 3
 2
(2 rows)

Best regards,
Etsuro Fujita

*** a/src/backend/executor/nodeLockRows.c
--- b/src/backend/executor/nodeLockRows.c
*** 39,44 ****
--- 39,45 ----
  TupleTableSlot *				/* return: a tuple or NULL */
  ExecLockRows(LockRowsState *node)
  {
  	TupleTableSlot *slot;
+ 	Plan	   *plan;
  	EState	   *estate;
  	PlanState  *outerPlan;
  	bool		epq_needed;
*** 47,52 ExecLockRows(LockRowsState *node)
--- 48,54 ----
  	/*
  	 * get information from the node
  	 */
+ 	plan = node->ps.plan;
  	estate = node->ps.state;
  	outerPlan = outerPlanState(node);
*** 214,219 lnext:
--- 216,225 ----
  				ereport(ERROR,
  						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
  						 errmsg("could not serialize access due to concurrent update")));
+ 			if (((LockRows *) plan)->ordering)
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_TRANSACTION_ROLLBACK),
+ 						 errmsg("tuple to be locked was changed due to concurrent update")));
  			if (ItemPointerEquals(&hufd.ctid, &tuple.t_self))
  			{
  				/* Tuple was deleted, so don't return it */
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*** 967,972 _copyLockRows(const LockRows *from)
--- 967,973 ----
  	/*
  	 * copy remainder of node
  	 */
+ 	COPY_SCALAR_FIELD(ordering);
  	COPY_NODE_FIELD(rowMarks);
  	COPY_SCALAR_FIELD(epqParam);
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** 842,847 _outLockRows(StringInfo str, const LockRows *node)
--- 842,848 ----
  	_outPlanInfo(str, (const Plan *) node);

+ 	WRITE_BOOL_FIELD(ordering);
  	WRITE_NODE_FIELD(rowMarks);
  	WRITE_INT_FIELD(epqParam);
  }
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*** 4801,4807 make_setop(SetOpCmd cmd, SetOpStrategy strategy, Plan *lefttree,
   * Build a LockRows plan node
   */
  LockRows *
! make_lockrows(Plan *lefttree, List *rowMarks, int epqParam)
  {
  	LockRows   *node = makeNode(LockRows);
  	Plan	   *plan = &node->plan;
--- 4801,4807 ----
   * Build a LockRows plan node
   */
  LockRows *
! make_lockrows(Plan *lefttree, bool ordering, List *rowMarks, int epqParam)
  {
  	LockRows   *node = makeNode(LockRows);
  	Plan	   *plan = &node->plan;
*** 4816,4821 make_lockrows(Plan *lefttree, List *rowMarks, int epqParam)
--- 4816,4822 ----
  	plan->lefttree = lefttree;
  	plan->righttree = NULL;
+ 	node->ordering = ordering;
  	node->rowMarks = rowMarks;
  	node->epqParam = epqParam;
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*** 2302,2308 grouping_planner(PlannerInfo *root, double tuple_fraction)
--- 2302,2311 ----
  	 */
  	if (parse->rowMarks)
  	{
+ 		bool		ordering = parse->sortClause ? true : false;
+
  		result_plan = (Plan *) make_lockrows(result_plan,
+ 											 ordering,
  											 root->rowMarks,
  											 SS_assign_special_param(root));
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
*** 809,814 typedef struct SetOp
--- 809,815 ----
  typedef struct LockRows
  {
  	Plan		plan;
+ 	bool		ordering;		/* do we need to preserve the ordering? */
  	List	   *rowMarks;		/* a list of PlanRowMark's */
  	int			epqParam;		/* ID of Param for EvalPlanQual re-eval */
  } LockRows;
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
*** 74,80 ****
  extern Group *make_group(PlannerInfo *root, List *tlist, List *qual,
  		   Plan *lefttree);
  extern Plan *materialize_finished_plan(Plan *subplan);
  extern Unique *make_unique(Plan *lefttree, List *distinctList);
! extern LockRows *make_lockrows(Plan *lefttree, List *rowMarks, int epqParam);
--- 74,80 ----
! extern LockRows *make_lockrows(Plan *lefttree, bool ordering, List *rowMarks,
! 				 int epqParam);
  extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount,
Re: [HACKERS] WAL-related tools and .paritial WAL file
On Thu, Jul 2, 2015 at 2:06 PM, Fujii Masao wrote:
> I implemented the patch accordingly. Patch attached.

Cool, thanks. I have done some tests with pg_archivecleanup, and it works as expected, aka if I define a backup file, the backup file as well as the segments equal to or newer than it remain. It works as well when defining a .partial file. I have done as well some testing with pg_resetxlog, and the partial segment gets removed. Here are some comments:

1) pgarchivecleanup.sgml needs to be updated: "In this mode, if you specify a <filename>.backup</> file name, then only the file prefix". Here we should mention that it is also the case for a .partial file.

2)
-* restartWALFileName is a .backup filename, make sure we use the prefix
-* of the filename, otherwise we will remove wrong files since
+* restartWALFileName is a .partial or .backup filename, make sure we use
+* the prefix of the filename, otherwise we will remove wrong files since
 * 00010010.0020.backup is after
 * 00010010.
Shouldn't this be made clearer as well regarding .partial files? For example with a comment like that: "otherwise we will remove wrong files since 00010010.0020.backup or 00010010.0020.partial are after 00010010". Simply not mentioning those file names directly is fine for me.

3) Something not caused by this patch that I just noticed... But pg_resetxlog does not remove .backup files in pg_xlog. Shouldn't they get moved away as well?

Regards,
--
Michael
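The cleanup rule under review boils down to: strip a trailing .backup (or, with the patch, .partial) suffix down to the bare segment-name prefix, then remove exactly the plain segments that sort strictly before that cutoff. A standalone sketch of that logic follows; it is my own simplification (the real pg_archivecleanup also skips the 8-character timeline part by comparing from walfile + 8, and the helper names here are illustrative):

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define XLOG_FNAME_LEN 24		/* WAL segment names are 24 hex digits */

/* Does fname look like a plain WAL segment name (no suffix)? */
static bool
is_xlog_file_name(const char *fname)
{
	if (strlen(fname) != XLOG_FNAME_LEN)
		return false;
	for (int i = 0; i < XLOG_FNAME_LEN; i++)
		if (!isxdigit((unsigned char) fname[i]))
			return false;
	return true;
}

/*
 * Given the name passed on the command line (possibly with a .backup or
 * .partial suffix), keep only the segment-name prefix as the cutoff.
 */
static void
set_cutoff(const char *restart_file, char *cutoff)
{
	memcpy(cutoff, restart_file, XLOG_FNAME_LEN);
	cutoff[XLOG_FNAME_LEN] = '\0';
}

/* A file is removable iff it is a plain segment sorting before the cutoff. */
static bool
is_removable(const char *walfile, const char *cutoff)
{
	return is_xlog_file_name(walfile) && strcmp(walfile, cutoff) < 0;
}
```

Because suffixed names fail the plain-segment test, the .backup or .partial file named on the command line is itself never removed, matching the behavior Michael observed.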
Re: [HACKERS] Asynchronous execution on FDW
On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote:
> Ouch! I mistakenly made two CF entries for this patch. Could someone remove this entry for me? https://commitfest.postgresql.org/5/290/ The correct entry is /5/291/

Done.
--
Michael
Re: [HACKERS] Odd behaviour of SELECT ... ORDER BY ... FOR UPDATE
On 7/2/15 9:15 AM, Etsuro Fujita wrote:
> While working on the foreign-join-pushdown issue, I noticed that in READ COMMITTED isolation level it's possible that the result of SELECT ... ORDER BY ... FOR UPDATE is not sorted correctly due to concurrent updates that replaced the sort key columns with new values as shown in the below example. That seems odd to me. So, I'd like to propose raising an error rather than returning a possibly-incorrect result for cases where the sorted tuples to be locked were modified by concurrent updates.

I don't like the idea of READ COMMITTED suddenly throwing errors due to concurrency problems. Using FOR UPDATE correctly is really tricky, and this is just one example. And a documented one, at that.

.m
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Thu, Jul 2, 2015 at 3:21 AM, Josh Berkus j...@agliodbs.com wrote:
> All: Replying to multiple people below.
>
> On 07/01/2015 07:15 AM, Fujii Masao wrote:
>> On Tue, Jun 30, 2015 at 2:40 AM, Josh Berkus j...@agliodbs.com wrote:
>>> You're confusing two separate things. The primary manageability problem has nothing to do with altering the parameter. The main problem is: if there is more than one synch candidate, how do we determine *after the master dies* which candidate replica was in synch at the time of failure? Currently there is no way to do that. This proposal plans to, effectively, add more synch candidate configurations without addressing that core design failure *at all*. That's why I say that this patch decreases overall reliability of the system instead of increasing it.
>>
>> I agree this is a problem even today, but it's basically independent from the proposed feature *itself*. So I think that it's better to discuss and work on the problem separately. If so, we might be able to provide a good way to find the new master even if the proposed feature finally fails to be adopted.
>
> I agree that they're separate features. My argument is that the quorum synch feature isn't materially useful if we don't create some feature to identify which server(s) were in synch at the time the master died.
>
> The main reason I'm arguing on this thread is that discussion of this feature went straight into GUC syntax, without ever discussing:
> * what use cases are we serving?
> * what features do those use cases need?
> I'm saying that we need to have that discussion first before we go into syntax. We gave up on quorum commit in 9.1 partly because nobody was convinced that it was actually useful; that case still needs to be established, and if we can determine *under what circumstances* it's useful, then we can know if the proposed feature we have is what we want or not.
>
> Myself, I have two use cases for changes to sync rep:
>
> 1. the ability to specify a group of three replicas in the same data center, and have commit succeed if it succeeds on two of them. The purpose of this is to avoid data loss even if we lose the master and one replica.
>
> 2. the ability to specify that synch needs to succeed on two replicas in two different data centers. The idea here is to be able to ensure consistency between all data centers.

Yeah, I'm also thinking of those *simple* use cases. I'm not sure how many people really want to have a very complicated quorum commit setting.

> Speaking of which: how does the proposed patch roll back the commit on one replica if it fails to get quorum?

You meant the case where there are two sync replicas and the master needs to wait until both send the ACK, then only one replica goes down? In this case, the master receives the ACK from only one replica and it must keep waiting until a new sync replica appears and sends back the ACK. So the committed transaction (written WAL record) would not be rolled back.

> Well, one possibility is to have each replica keep a flag which indicates whether it thinks it's in sync or not. This flag would be updated every time the replica sends a sync-ack to the master. There's a couple of issues with that though:

I don't think this is a good approach, because there can be a case where you need to promote even a standby server not having the sync flag. Please imagine the case where you have sync and async standby servers. When the master goes down, the async standby might be ahead of the sync one. This is possible in practice. In this case, it might be better to promote the async standby instead of the sync one, because the remaining sync standby, which is behind, can easily follow the new master.

We can promote the sync standby in this case. But since the remaining async standby is ahead, it's not easy for it to follow the new master. Probably a new base backup needs to be taken onto the async standby from the new master, or pg_rewind needs to be executed. That is, the async standby basically needs to be set up again.

So I'm thinking that we basically need to check the progress on each standby to choose the new master.

Regards,

--
Fujii Masao
Re: [HACKERS] Support for N synchronous standby servers - take 2
Amit wrote:
> Does HA software determine a standby to promote based on replication progress or would things be reliable enough for it to infer one from the quorum setting specified in GUC (or wherever)? Is part of the job of this patch to make the latter possible? Just wondering or perhaps I am completely missing the point.

Deciding the failover standby is not exactly part of this patch but we should be able to set up a mechanism to decide which is the best standby to be promoted. We might not be able to conclude this from the sync parameter alone. As specified before, in some cases an async standby could also be the most eligible for promotion.

--
Beena Emerson
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Thu, Jul 2, 2015 at 3:29 PM, Amit Langote langote_amit...@lab.ntt.co.jp wrote:
> On 2015-07-02 PM 03:12, Fujii Masao wrote:
>> So I'm thinking that we basically need to check the progress on each standby to choose new master.
>
> Does HA software determine a standby to promote based on replication progress or would things be reliable enough for it to infer one from the quorum setting specified in GUC (or wherever)? Is part of the job of this patch to make the latter possible? Just wondering or perhaps I am completely missing the point.

Replication progress is a factor of choice, but not the only one. The sole role of this patch is just to allow us to have a more advanced policy in defining how synchronous replication works, aka how we want to let the master acknowledge a commit synchronously from a set of N standbys. In any case, this is something unrelated to the discussion happening here.
--
Michael
Re: [HACKERS] pg_basebackup and replication slots
On Wed, Jul 1, 2015 at 11:35 PM, Peter Eisentraut pete...@gmx.net wrote:
> On 7/1/15 8:37 AM, Michael Paquier wrote:
>> On Wed, Jul 1, 2015 at 10:33 AM, Peter Eisentraut wrote:
>>> (If you're looking at the patch and wondering why there is no code to actually do anything with the replication slot, that's because the code that does the WAL streaming is already aware of replication slots because of the pg_receivexlog functionality that it also serves. So all that needs to be done is set replication_slot.)
>>
>> This is cool to see this patch taking shape.
>>
>> +my ($stdout, $stderr);
>> +run [ 'psql', '-X', '-A', '-t', '-q', '-d', $dbname, '-f', '-' ], '<', \$sql, '>', \$stdout, '2>', \$stderr or die;
>> +chomp $stdout;
>> +return $stdout;
>> Could it be possible to chomp and return $stderr as well here? It seems useful to me to perform sanity checks after calling psql.
>
> Sure, makes sense.

OK, so here is more input about this set of patches.

Patch 1 looks good. It adds some tests to cover pg_basebackup -R and checks if standby_mode and primary_conninfo are set correctly. Patch 2 also looks fine; it adds tests for pg_basebackup -X. Both patches are independent of the feature.

Regarding patch 3, I have more comments:

1) I think that the documentation should clearly mention that if -R and -S are used together, a primary_slot_name entry is added in the generated recovery.conf with the slot name defined.

2)
 sub psql
 {
 	my ($dbname, $sql) = @_;
-	run [ 'psql', '-X', '-q', '-d', $dbname, '-f', '-' ], '<', \$sql or die;
+	my ($stdout, $stderr);
+	run [ 'psql', '-X', '-A', '-t', '-q', '-d', $dbname, '-f', '-' ], '<', \$sql, '>', \$stdout, '2>', \$stderr or die;
+	chomp $stdout;
+	return $stdout;
 }
RewindTest.pm has a routine called check_query defined. It would be great to extend psql a bit more than what I thought previously, by returning ($result, $stdout, $stderr) from it and making check_query use it directly. This way we have a unique interface to run psql and capture output. I don't mind writing this refactoring patch on top of your set if that's necessary.

3)
+command_ok([ 'pg_basebackup', '-D', "$tempdir/backupxs_sl_R", '-X', 'stream', '-S', 'slot1', '-R' ],
+	'pg_basebackup with replication slot and -R runs');
+$recovery_conf = slurp_file "$tempdir/backupR/recovery.conf";
+like(slurp_file("$tempdir/backupxs_sl_R/recovery.conf"),
slurp_file is slurping an incorrect file and $recovery_conf is used nowhere after, so you could simply remove this line.

Regards,
--
Michael
Re: [HACKERS] Odd behaviour of SELECT ... ORDER BY ... FOR UPDATE
Hi Marko,

On 2015/07/02 16:27, Marko Tiikkaja wrote:
> On 7/2/15 9:15 AM, Etsuro Fujita wrote:
>> While working on the foreign-join-pushdown issue, I noticed that in READ COMMITTED isolation level it's possible that the result of SELECT ... ORDER BY ... FOR UPDATE is not sorted correctly due to concurrent updates that replaced the sort key columns with new values as shown in the below example. That seems odd to me. So, I'd like to propose raising an error rather than returning a possibly-incorrect result for cases where the sorted tuples to be locked were modified by concurrent updates.
>
> I don't like the idea of READ COMMITTED suddenly throwing errors due to concurrency problems. Using FOR UPDATE correctly is really tricky, and this is just one example. And a documented one, at that, too.

Ah, you are right. I'll withdraw this. Sorry for the noise.

Best regards,
Etsuro Fujita
Re: [HACKERS] Asynchronous execution on FDW
Thank you.

At Thu, 2 Jul 2015 16:02:27 +0900, Michael Paquier michael.paqu...@gmail.com wrote in CAB7nPqTs0YCwXedt1P=JjxFJeoj9UzLzkLuiX8=jdtpyutn...@mail.gmail.com:
> On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote:
>> Ouch! I mistakenly made two CF entries for this patch. Could someone remove this entry for me? https://commitfest.postgresql.org/5/290/ The correct entry is /5/291/
>
> Done.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 2015-07-02 PM 03:52, Michael Paquier wrote:
> On Thu, Jul 2, 2015 at 3:29 PM, Amit Langote langote_amit...@lab.ntt.co.jp wrote:
>> On 2015-07-02 PM 03:12, Fujii Masao wrote:
>>> So I'm thinking that we basically need to check the progress on each standby to choose new master.
>>
>> Does HA software determine a standby to promote based on replication progress or would things be reliable enough for it to infer one from the quorum setting specified in GUC (or wherever)? Is part of the job of this patch to make the latter possible? Just wondering or perhaps I am completely missing the point.
>
> Replication progress is a factor of choice, but not the only one. The sole role of this patch is just to allow us to have more advanced policy in defining how synchronous replication works, aka how we want to let the master acknowledge a commit synchronously from a set of N standbys. In any case, this is something unrelated to the discussion happening here.

Got it, thanks!

Regards,
Amit
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 2015-07-02 PM 03:12, Fujii Masao wrote:
> So I'm thinking that we basically need to check the progress on each standby to choose new master.

Does HA software determine a standby to promote based on replication progress, or would things be reliable enough for it to infer one from the quorum setting specified in GUC (or wherever)? Is part of the job of this patch to make the latter possible? Just wondering, or perhaps I am completely missing the point.

Thanks,
Amit
Re: [HACKERS] Support for N synchronous standby servers - take 2
Hello,

There has been a lot of discussion and it has become a bit confusing, so I am summarizing my understanding of it so far. Kindly let me know if I missed anything important.

Backward compatibility: We have to provide support for the current format and behavior of synchronous replication (the first running standby from the list in s_s_names). In case the new format does not use a GUC, then a special value should be specifiable for s_s_names to indicate that.

Priority and quorum: Quorum treats all the standbys with the same priority, while in priority behavior each one has a different priority and an ACK must be received from the specified k lowest-priority servers. I am not sure how combining both will work out. Mostly we would like to have some standbys from each data center in sync; can that not be achieved by quorum only?

GUC parameter: There are some arguments over the text format. However, if we continue using it, specifying the number before the group is the more readable option than specifying it later: s_s_names = 3(A, (P,Q), 2(X,Y,Z)) is better compared to s_s_names = (A, (P,Q), (X,Y,Z)(2))(3).

Catalog method: Is it safe to assume we are not going ahead with the catalog approach? A system catalog and some built-in functions to set the sync parameters are not viable, because they can cause a promoted master to sync-rep itself, and because changes to the catalog may wait indefinitely for an ACK from a down server. The main problem of an unlogged system catalog is data loss during a crash.

JSON: I agree it would make the GUC very complex and unreadable. We can consider using it as metadata. I think the only point in favor of JSON is being able to set it using functions instead of having to edit and reload, right?

Identifying standbys: The main concern with the current use of application_name seems to be that multiple standbys with the same name would form an unintentional group (maybe across data clusters too?). I agree it would be better to have a mechanism to uniquely identify a standby; groups can then be made using whatever method we use to set the sync requirements.

The concern about deciding which standby is to be promoted is a separate issue altogether.

--
Beena Emerson
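The kind of nested k-of-n setting discussed in the summary above (e.g. s_s_names = 3(A, (P,Q), 2(X,Y,Z))) can be modeled as a small tree, with a commit acknowledged once the root quorum is satisfied. The following is a hypothetical sketch of that evaluation only — none of these names exist in PostgreSQL, and the inner group is given an explicit quorum of 1 because the bare (P,Q) syntax leaves its requirement implicit:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A node is either a single standby name or a "k of n children" group. */
typedef struct SyncNode
{
	const char *name;			/* non-NULL for a leaf standby */
	int			required;		/* for a group: how many children must ACK */
	const struct SyncNode **children;
	int			nchildren;
} SyncNode;

/* Is this element satisfied, given the set of standbys that have ACKed? */
static bool
is_satisfied(const SyncNode *node, const char **acked, int nacked)
{
	if (node->name != NULL)		/* leaf: satisfied if that standby ACKed */
	{
		for (int i = 0; i < nacked; i++)
			if (strcmp(acked[i], node->name) == 0)
				return true;
		return false;
	}

	/* group: count satisfied children against the quorum */
	int			ok = 0;

	for (int i = 0; i < node->nchildren; i++)
		if (is_satisfied(node->children[i], acked, nacked))
			ok++;
	return ok >= node->required;
}
```

Under this model the master would release a waiting commit as soon as is_satisfied() becomes true for the root of the configured tree.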
Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
On Thu, Jul 2, 2015 at 4:16 PM, Simon Riggs si...@2ndquadrant.com wrote:
> On 13 May 2015 at 09:35, Heikki Linnakangas hlinn...@iki.fi wrote:
>> In summary, the X^1.5 correction seems to work pretty well. It doesn't completely eliminate the problem, but it makes it a lot better.
>
> Agreed

Do we want to consider whether wal_compression is enabled, as that can reduce the effect of full_page_writes? Also, I am planning to run some tests for this patch, but I am not sure whether tps and/or latency numbers from pgbench are sufficient, or whether you people want to see actual read/write counts via some form of dynamic tracing (stap), as done by the reporter of this issue.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
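As a reminder of what the X^1.5 correction under discussion does: the checkpointer compares the fraction of buffers already written against the fraction of WAL consumed since the checkpoint started, and raising the latter to the power 1.5 compensates for the front-loaded WAL volume that full-page writes cause right after a checkpoint. A toy model of that comparison (my own simplification of the idea, not the actual patch; the power is rewritten as progress² ≥ elapsed³ to stay exact without libm):

```c
#include <stdbool.h>

/*
 * "progress" = fraction of dirty buffers written so far,
 * "elapsed_xlog" = fraction of the checkpoint's WAL budget consumed.
 * On schedule means: progress >= elapsed_xlog^1.5.  Since both values
 * are in [0,1], that is equivalent to progress^2 >= elapsed_xlog^3.
 */
static bool
is_checkpoint_on_schedule(double progress, double elapsed_xlog)
{
	return progress * progress >=
		elapsed_xlog * elapsed_xlog * elapsed_xlog;
}
```

Early in the checkpoint the corrected target (elapsed^1.5) is below the linear target (elapsed), so the checkpointer is allowed to lag its buffer writes while full-page writes inflate WAL consumption, then catch up later.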
Re: [HACKERS] Memory Accounting v11
On 14 June 2015 at 23:51, Tomas Vondra tomas.von...@2ndquadrant.com wrote:
> The current state, where HashAgg just blows up the memory, is just not reasonable, and we need to track the memory to fix that problem.

> Meh. HashAgg could track its memory usage without loading the entire system with a penalty.

> +1 to a solution like that, although I don't think that's doable without digging the info from memory contexts somehow.

Jeff is right, we desperately need a solution, and this is the place to start. Tom's concern remains valid: we must not load the entire system with a penalty. The only questions I have are:

* If the memory allocations adapt to the usage pattern, then we expect to see few memory chunk allocations. Why are we expecting the entire system to experience a penalty?

* If we do not manage our resources, how are we certain this does not induce a penalty? Not tracking memory could be worse than tracking it.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
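For context on why block-level accounting is expected to be cheap: allocations are tracked when a context acquires or releases a *block* (rare), not on every palloc within a block (the hot path), and a node such as HashAgg can then total its own context subtree on demand. The sketch below is my own illustration of that shape — the struct and function names are illustrative, not PostgreSQL's actual memory-context API:

```c
#include <stddef.h>

typedef struct MemoryContextData
{
	size_t		mem_allocated;	/* bytes in blocks owned by this context */
	struct MemoryContextData *firstchild;
	struct MemoryContextData *nextsibling;
} MemoryContextData;

/*
 * Called only on block allocation/free, so individual pallocs served
 * from an existing block pay no accounting cost at all.
 */
static void
account_block(MemoryContextData *cxt, long delta)
{
	cxt->mem_allocated += delta;
}

/*
 * Total for a context and all of its descendants, e.g. a HashAgg
 * checking its hash table's memory use against work_mem.
 */
static size_t
context_total(const MemoryContextData *cxt)
{
	size_t		total = cxt->mem_allocated;

	for (const MemoryContextData *c = cxt->firstchild; c != NULL; c = c->nextsibling)
		total += context_total(c);
	return total;
}
```

The recursive walk only happens when someone asks for a total, which is how the per-allocation penalty Tom worries about is avoided.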
Re: [HACKERS] Patch to improve a few appendStringInfo* calls
On 06/15/2015 03:56 AM, David Rowley wrote:
> On 29 May 2015 at 12:51, Peter Eisentraut pete...@gmx.net wrote:
>> On 5/12/15 4:33 AM, David Rowley wrote:
>>> Shortly after I sent the previous patch I did a few more searches and also found some more things that are not quite right. Most of these are to use the binary append method when the length of the string is already known.
>>
>> For these cases it might be better to invent additional functions such as stringinfo_to_text() and appendStringInfoStringInfo() instead of repeating the pattern of referring to data and length separately.
>
> You're probably right. It would be nicer to see some sort of wrapper functions that cleaned these up a bit. I really think that's something for another patch though; this patch just intends to put things the way they're meant to be in the least invasive way possible. What you're proposing is changing the way it's meant to work, which will cause much more code churn.
>
> I've attached a re-based patch.

Applied the straightforward parts. I left out the changes like

- appendStringInfoString(collist, buf.data);
+ appendBinaryStringInfo(collist, buf.data, buf.len);

because they're not an improvement in readability, IMHO, and they were not in performance-critical paths.

I also noticed that the space after CREATE EVENT TRIGGER name in pg_dump was actually spurious, and just added a space before a newline, so I removed that space altogether.

- Heikki
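The wrapper Peter suggested upthread (appendStringInfoStringInfo) would avoid the data/length-pair pattern by exploiting the fact that a StringInfo already knows its own length, skipping the strlen() that appendStringInfoString must perform. Here is a self-contained sketch using a minimal stand-in for StringInfo (not PostgreSQL's actual stringinfo.c, which grows via enlargeStringInfo and palloc):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal stand-in for PostgreSQL's StringInfoData */
typedef struct
{
	char	   *data;			/* always NUL-terminated */
	int			len;			/* current string length, excluding NUL */
	int			maxlen;			/* allocated size of data */
} StringInfoData;

static void
init_string_info(StringInfoData *str)
{
	str->maxlen = 64;
	str->data = malloc(str->maxlen);
	str->data[0] = '\0';
	str->len = 0;
}

/* Like appendBinaryStringInfo: caller supplies the length, no strlen(). */
static void
append_binary(StringInfoData *str, const char *data, int datalen)
{
	while (str->len + datalen + 1 > str->maxlen)
	{
		str->maxlen *= 2;
		str->data = realloc(str->data, str->maxlen);
	}
	memcpy(str->data + str->len, data, datalen);
	str->len += datalen;
	str->data[str->len] = '\0';
}

/*
 * The suggested wrapper: append one StringInfo to another using the
 * source's stored length instead of re-measuring it.
 */
static void
append_string_info_string_info(StringInfoData *dst, const StringInfoData *src)
{
	append_binary(dst, src->data, src->len);
}
```

This keeps call sites as readable as appendStringInfoString while retaining the performance of the binary append.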
Re: [HACKERS] WAL-related tools and .paritial WAL file
On Thu, Jul 2, 2015 at 3:48 PM, Michael Paquier michael.paqu...@gmail.com wrote:
> On Thu, Jul 2, 2015 at 2:06 PM, Fujii Masao wrote:
>> I implemented the patch accordingly. Patch attached.
>
> Cool, thanks. I have done some tests with pg_archivecleanup, and it works as expected, aka if I define a backup file, the backup file as well as the segments equal to or newer than it remain. It works as well when defining a .partial file. I have done as well some testing with pg_resetxlog and the partial segment gets removed.

Thanks for testing the patch! Attached is the updated version of the patch.

> Here are some comments:
>
> 1) pgarchivecleanup.sgml needs to be updated: "In this mode, if you specify a <filename>.backup</> file name, then only the file prefix". Here we should mention that it is also the case of a .partial file.

Applied.

> 2)
> -* restartWALFileName is a .backup filename, make sure we use the prefix
> -* of the filename, otherwise we will remove wrong files since
> +* restartWALFileName is a .partial or .backup filename, make sure we use
> +* the prefix of the filename, otherwise we will remove wrong files since
>  * 00010010.0020.backup is after
>  * 00010010.
> Shouldn't this be made clearer as well regarding .partial files? For example with a comment like that: "otherwise we will remove wrong files since 00010010.0020.backup or 00010010.0020.partial are after 00010010". Simply not mentioning those file names directly is fine for me.

Applied.

> 3) Something not caused by this patch that I just noticed... But pg_resetxlog does not remove .backup files in pg_xlog. Shouldn't they get moved away as well?

pg_resetxlog also doesn't remove .history files in pg_xlog. Those remaining .backup and .history files are harmless after pg_resetxlog is executed, so I don't think that it's required to remove them. Of course we could do that, but some existing applications might depend on the current behavior... So unless there is a strong reason to do so, I'd like to leave it as it is.
Regards, -- Fujii Masao

*** a/doc/src/sgml/ref/pg_xlogdump.sgml
--- b/doc/src/sgml/ref/pg_xlogdump.sgml
***************
*** 215,220 **** PostgreSQL documentation
--- 215,226 ----
     Only the specified timeline is displayed (or the default, if none is
     specified). Records in other timelines are ignored.
    </para>
+
+   <para>
+    <application>pg_xlogdump</application> cannot read WAL files with suffix
+    <literal>.partial</literal>. If those files need to be read, the
+    <literal>.partial</literal> suffix needs to be removed from the filename.
+   </para>
   </refsect1>

*** a/doc/src/sgml/ref/pgarchivecleanup.sgml
--- b/doc/src/sgml/ref/pgarchivecleanup.sgml
***************
*** 60,67 ****
  archive_cleanup_command = 'pg_archivecleanup <replaceable>archivelocation</replaceable> %r'
    <para>
     When used as a standalone program all WAL files logically preceding the
     <replaceable>oldestkeptwalfile</replaceable> will be removed from
     <replaceable>archivelocation</replaceable>.
!    In this mode, if you specify a <filename>.backup</filename> file name, then only the file prefix
!    will be used as the <replaceable>oldestkeptwalfile</replaceable>. This
     allows you to remove all WAL files archived prior to a specific base backup without error.
     For example, the following example will remove all files older than WAL
     file name <filename>000000010000003700000010</filename>:
--- 60,69 ----
  archive_cleanup_command = 'pg_archivecleanup <replaceable>archivelocation</replaceable> %r'
    <para>
     When used as a standalone program all WAL files logically preceding the
     <replaceable>oldestkeptwalfile</replaceable> will be removed from
     <replaceable>archivelocation</replaceable>.
!    In this mode, if you specify a <filename>.partial</filename> or <filename>.backup</filename>
!    file name, then only the file prefix will be used as the
!    <replaceable>oldestkeptwalfile</replaceable>. This treatment of <filename>.backup</filename>
!    file name allows you to remove all WAL files archived prior to a specific base backup without error.
     For example, the following example will remove all files older than WAL
     file name <filename>000000010000003700000010</filename>:

*** a/src/bin/pg_archivecleanup/pg_archivecleanup.c
--- b/src/bin/pg_archivecleanup/pg_archivecleanup.c
***************
*** 125,131 **** CleanupPriorWALFiles(void)
  		 * file.  Note that this means files are not removed in the order
  		 * they were originally written, in case this worries you.
  		 */
! 		if (IsXLogFileName(walfile) &&
  			strcmp(walfile + 8, exclusiveCleanupFileName + 8) < 0)
  		{
  			/*
--- 125,131 ----
  		 * file.  Note that this means files are not removed in the order
  		 * they were originally written, in case this worries you.
  		 */
! 		if ((IsXLogFileName(walfile) || IsPartialXLogFileName(walfile)) &&
  			strcmp(walfile + 8, exclusiveCleanupFileName + 8) < 0)
  		{
  			/*
*** 181,187 **** CleanupPriorWALFiles(void)
   * SetWALFileNameForCleanup()
   *
   * Set the earliest WAL filename that we want to keep on the archive
!  * and
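The prefix comparison in the hunk above relies on WAL segment names being fixed-width hex strings. As a minimal, self-contained sketch of that check (simplified stand-ins for pg_archivecleanup's helpers; `is_wal_file_name` and `precedes_cutoff` are hypothetical names, and the cutoff is assumed to have already been truncated to its 24-character prefix, which is what the `.partial`/`.backup` handling under discussion does):

```c
#include <ctype.h>
#include <string.h>

/* WAL segment names are 24 hex digits: 8 for the timeline ID followed
 * by 16 for the segment number.  Simplified stand-in for the
 * IsXLogFileName() macro mentioned in the thread. */
static int is_wal_file_name(const char *fname)
{
    int i;

    if (strlen(fname) != 24)
        return 0;
    for (i = 0; i < 24; i++)
        if (!isxdigit((unsigned char) fname[i]))
            return 0;
    return 1;
}

/* Comparing from offset 8 skips the timeline ID, exactly like the
 * strcmp(walfile + 8, ...) in the hunk above.  'cutoff' must already be
 * the bare 24-character prefix, with any .backup/.partial suffix
 * stripped by the caller. */
static int precedes_cutoff(const char *walfile, const char *cutoff)
{
    return is_wal_file_name(walfile) &&
           strcmp(walfile + 8, cutoff + 8) < 0;
}
```

A file sorting strictly before the cutoff is removable; the cutoff file itself, and anything that is not a plain segment name, is kept.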
Re: [HACKERS] raw output from copy
Hi, I'll do it today evening. Pavel 2015-07-02 12:55 GMT+02:00 Simon Riggs si...@2ndquadrant.com: On 1 July 2015 at 07:42, Pavel Golub pa...@microolap.com wrote: I looked through the patch. Sources are OK. However I didn't find any docs or test cases. Would you please provide me with a short description of this feature and why it is important, because I didn't manage to find the old Andrew Dunstan's post either. Feature sounds OK, so let's do it. Pavel S, please submit a polished patch. Coding guidelines, tests, docs etc. Set back to Waiting On Author. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
On 13 May 2015 at 09:35, Heikki Linnakangas hlinn...@iki.fi wrote: In summary, the X^1.5 correction seems to work pretty well. It doesn't completely eliminate the problem, but it makes it a lot better. Agreed. I don't want to over-compensate for the full-page-write effect either, because there are also applications where that effect isn't so big. For example, an application that performs a lot of updates, but all the updates are on a small number of pages, so the full-page-write storm immediately after a checkpoint doesn't last long. A worst case for this patch would be such an application - lots of updates on only a few pages - with a long checkpoint_timeout but relatively small checkpoint_segments, so that checkpoints are always driven by checkpoint_segments. I'd like to see some benchmarking of that worst case before committing anything like this. We could do better, but that is not a reason not to commit this, as is. Commit, please. This has been in place for a while and still remains: TODO: reduce impact of full page writes -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
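For readers following along, the X^1.5 correction under discussion can be sketched as follows. This is a simplified model, not the actual IsCheckpointOnSchedule() code; the function and variable names are hypothetical. The idea: full-page writes inflate WAL consumption right after a checkpoint starts, so the raw WAL fraction overstates how much of the checkpoint interval has really elapsed, and raising it to the power 1.5 (which maps values in (0,1) downward) compensates:

```c
/* Sketch of the X^1.5 correction (hypothetical helper, not the patch).
 * 'written' is the fraction of the checkpoint's buffer writes already
 * done; 'wal' is the fraction of the WAL budget consumed.  The
 * checkpointer counts as on schedule if written >= wal^1.5.  Since all
 * values lie in [0,1], written >= wal^1.5 is equivalent to
 * written^2 >= wal^3, which avoids needing libm's pow(). */
static int is_on_schedule(double written, double wal)
{
    return written * written >= wal * wal * wal;
}
```

With wal = 0.25 the corrected target is 0.25^1.5 = 0.125, so having written only 15% of the buffers still counts as on schedule - exactly the early-cycle slack the correction is meant to provide against the full-page-write burst.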
Re: [HACKERS] psql tabcomplete - minor bugfix - tabcomplete for SET ROLE TO xxx
2015-07-02 11:03 GMT+02:00 Heikki Linnakangas hlinn...@iki.fi: On 05/29/2015 10:41 AM, Pavel Stehule wrote: 2015-05-29 9:28 GMT+02:00 Jeevan Chalke jeevan.cha...@gmail.com: I agree with Peter that we don't tab-complete everything we possibly could, but using tab after SET ROLE TO provides DEFAULT as an option, which seems wrong. This patch adds a list of roles there, which I guess is better to have than giving something unusual. ... But back to this topic. I am thinking it is a little bit different due to the fact that we support two very similar syntaxes for one feature. And it looks a little bit strange that one way is supported by autocomplete and the second is not. Yeah, it's a bit strange. We have a specific autocomplete rule for SET ROLE, but SET ROLE TO is treated as a generic GUC. With your patch, we'd also lose the auto-completion of SET ROLE TO DEFAULT. I think we want to encourage people to use the SQL-standard syntax SET ROLE ... rather than the PostgreSQL-specific SET ROLE TO ... On the whole, this just doesn't seem like much of an improvement. I'll mark this as 'rejected' in the commitfest app. ok Pavel PS. I note that the auto-completion for SET XXX TO ... is pretty lousy in general. We have rules for DateStyle, IntervalStyle, GEQO and search_path, but that's it. That could be expanded a lot. All enum-type GUCs could be handled with a single rule that queries pg_settings.enumvals, for example, and booleans would be easy too. But that's a different story. - Heikki
Re: [HACKERS] PATCH: remove nclients/nthreads constraint from pgbench
On 05/08/2015 03:02 PM, Fabien COELHO wrote: Remove the pgbench constraint that the number of clients must be a multiple of the number of threads, by sharing clients among threads as evenly as possible. Rationale: this allows testing the database load with any number of clients, without an unrelated constraint. The current setting means, for instance, that testing with a prime number of clients always uses one thread, whereas other numbers of clients can use a different number of threads. This constraint does not make much sense. .. Minor v2 update to rename a poorly chosen variable. +1, it's really annoying that you can't just do -j <high number> and always run with that. This doesn't behave correctly if you set -j to greater than -c, and also use the rate limit option:

This works:
$ ./pgbench -R 100 -j3 -c 3 -T 10 postgres
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 5
query mode: simple
number of clients: 3
number of threads: 3
duration: 10 s
number of transactions actually processed: 1031
latency average: 1.397 ms
latency stddev: 0.276 ms
rate limit schedule lag: avg 0.119 (max 1.949) ms
tps = 102.947257 (including connections establishing)
tps = 102.967251 (excluding connections establishing)

This does not; the TPS is much too small:
$ ./pgbench -R 100 -j100 -c 3 -T 10 postgres
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 5
query mode: simple
number of clients: 3
number of threads: 100
duration: 10 s
number of transactions actually processed: 40
latency average: 3.573 ms
latency stddev: 1.724 ms
rate limit schedule lag: avg 0.813 (max 4.722) ms
tps = 3.246618 (including connections establishing)
tps = 3.246639 (excluding connections establishing)

I think that can be fixed by just setting nthreads internally to nclients, if nthreads > nclients. But please double-check that the logic used to calculate throttle_delay makes sense in general, when nthreads is not a multiple of nclients.
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
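The even client split the patch describes, plus the nthreads-clamping fix suggested above, can be sketched like this (a hypothetical helper for illustration, not pgbench's actual code):

```c
/* Thread i (0-based) of nthreads gets either floor(nclients/nthreads)
 * clients or one more, so per-thread counts differ by at most one and
 * always sum to nclients.  nthreads is clamped to nclients first, per
 * the suggestion above, so surplus threads simply get zero clients. */
static int clients_for_thread(int i, int nclients, int nthreads)
{
    if (nthreads > nclients)
        nthreads = nclients;
    if (i >= nthreads)
        return 0;
    return nclients / nthreads + (i < nclients % nthreads ? 1 : 0);
}
```

For example, 7 clients over 3 threads split 3/2/2, and 3 clients over 100 threads effectively run 3 threads of one client each, which is the behavior the clamp is meant to guarantee.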
Re: [HACKERS] Refactoring speculative insertion with unique indexes a little
On 07/01/2015 09:19 PM, Peter Geoghegan wrote: On Wed, Jul 1, 2015 at 10:45 AM, Peter Geoghegan p...@heroku.com wrote: On Tue, Jun 30, 2015 at 11:19 AM, Heikki Linnakangas hlinn...@iki.fi wrote: I agree it would be cleaner to have a separate CHECK_UNIQUE_XXX code for speculative insertions. You've defined CHECK_UNIQUE_SPECULATIVE as like CHECK_UNIQUE_PARTIAL, but you don't have to insert the index tuple if there's a conflict. I think it'd be better to define it as like CHECK_UNIQUE_YES, but return FALSE instead of throwing an error on conflict. The difference is that the aminsert would not be allowed to return FALSE when there is no conflict. Suppose we do it that way. Then what's the difference between CHECK_UNIQUE_SPECULATIVE and CHECK_UNIQUE_PARTIAL? You've just effectively required the CHECK_UNIQUE_YES case to not physically insert a physical tuple before throwing an error, which does not seem essential to the existing definition of CHECK_UNIQUE_YES -- you've redefined CHECK_UNIQUE_YES in a way that nbtree happens to meet at the moment. If we had an amcanunique AM that worked a bit like exclusion constraints, this new obligation for CHECK_UNIQUE_YES might make it impossible for that to work. Another more obvious and important thing: CHECK_UNIQUE_YES waits for conflicts to be resolved before returning to its caller. If you don't get an error, you're done. CHECK_UNIQUE_PARTIAL never waits, and if we add a CHECK_UNIQUE_SPECULATIVE, it ought to not wait either. Sure, if a speculative inserter detects a conflict, it still has to wait. But not in the aminsert call, and not until it cleans up its physical insertion (by super-deleting). Clearly a CHECK_UNIQUE_SPECULATIVE would have to be much closer to CHECK_UNIQUE_PARTIAL than to CHECK_UNIQUE_YES. Why is it not OK for aminsert to do the waiting, with CHECK_UNIQUE_SPECULATIVE? 
- Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
Hi Fujita-san, Sorry for my late reply. On 2015/06/27 21:09, Kouhei Kaigai wrote: BTW, if you try a newer version of the postgres_fdw foreign join patch, please provide me a way to reproduce the problem. OK. Did you forget to attach the patch, or is v17 in use? Sorry, I made a mistake. The problem was produced using v16 [1]. I'd like to suggest a solution that re-constructs the remote tuple according to the fdw_scan_tlist in ExecScanFetch, if the given scanrelid == 0. It enables running the local qualifiers associated with the ForeignScan node, and it will also work for the case when the tuple in es_epqTupleSet[] was a local heap tuple. Maybe I'm missing something, but I don't think your proposal works properly because we don't have any component ForeignScan state node or subsidiary join state node once we've replaced the entire join with the ForeignScan performing the join remotely, IIUC. So, my image was to have another subplan for EvalPlanQual as well as the ForeignScan, to do the entire join locally for the component test tuples if we are inside an EvalPlanQual recheck. Hmm... Probably, we have two standpoints to tackle the problem. The first standpoint tries to handle the base foreign table as a prime relation for locking. Thus, we have to provide a way to fetch a remote tuple identified by the supplied ctid. The advantage of this approach is that the way to fetch tuples from a base relation is quite similar to the existing form; however, its disadvantage is the other side of the same coin, because the ForeignScan node with scanrelid==0 (that represents the remote join query) may have local qualifiers which shall run on the tuple according to fdw_scan_tlist.
IIUC, I think this approach would also need to evaluate join conditions and remote qualifiers, in addition to local qualifiers, locally for component tuples that were re-fetched from the remote (and remaining component tuples that were copied from whole-row vars, if any), in cases where the re-fetched tuples were updated versions of those tuples rather than the same versions previously obtained. The other standpoint tries to handle a bunch of base foreign tables as a unit. That means, if any of the base foreign tables is the target of locking, it prompts the FDW driver to fetch the latest joined tuple identified by ctid, even if this join contains multiple base relations to be locked. The advantage of this approach is that we can use the qualifiers of the ForeignScan node with scanrelid==0, with no need to pay attention to remote qualifiers and/or join conditions individually. Its disadvantage is that we may have to extend the EState structure to keep the joined tuples, in addition to es_epqTupleSet[]. That is an idea. However, ISTM there is another disadvantage; it is not efficient because it would need to perform another remote join query having such additional conditions during an EvalPlanQual check, as you proposed. I'm inclined to think the latter standpoint works well, because it does not need to reproduce an alternative execution path on the local side again, even if a ForeignScan node represents a much more complicated remote query. If we would fetch tuples of individual base relations, we would need to reconstruct an identical join path to be executed on the remote side, wouldn't we? Yeah, that was my image for fixing this issue. IIUC, the purpose of EvalPlanQual() is to ensure the tuples to be locked are still visible, so it is not an essential condition to fetch base tuples individually. I think so too, but taking the similarity and/or efficiency of processing into consideration, I would vote for the idea of having an alternative execution path in the local.
That would also allow FDW authors to write the foreign join pushdown functionality in their FDWs with smaller effort. Even though I'd like to see a committer's opinion, I could not come up with an idea better than what you proposed; a foreign-/custom-scan has an alternative plan if scanrelid==0. Let me introduce a few cases we should pay attention to. Foreign/CustomScan nodes may stack; that means a Foreign/CustomScan node may have a child node that includes another Foreign/CustomScan node with scanrelid==0. (At this moment, ForeignScan cannot have child nodes; however, more aggressive push-down [1] will need the same feature to fetch tuples from a local relation and construct a VALUES() clause.) In this case, the highest Foreign/CustomScan node (that is also nearest to LockRows or ModifyTable) runs the alternative sub-plan that includes the scan/join plans dominated by fdw_relids or custom_relids. For example:

LockRows
 -> HashJoin
   -> CustomScan (AliceJoin)
     -> SeqScan on t1
     -> CustomScan (CarolJoin)
       -> SeqScan on t2
       -> SeqScan on t3
   -> Hash
     -> CustomScan (BobJoin)
       -> SeqScan on t4
       -> ForeignScan (remote join involves ft5, ft6)

In this case, AliceJoin will have an alternative sub-plan to join t1, t2 and t3, then
[HACKERS] 9.6 First Commitfest Begins
Hello everyone! It is time to start reviewing patches: 1. Pick a patch from the list at: https://commitfest.postgresql.org/5/ 2. Review it. Test it. Try to break it. Do we want the feature? Any weird interactions in corner-cases? 3. Reply on the thread on pgsql-hackers with your findings. It is perfectly OK to review a patch that someone else has signed up for as reviewer - different people tend to pay attention to different things. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql tabcomplete - minor bugfix - tabcomplete for SET ROLE TO xxx
On 05/29/2015 10:41 AM, Pavel Stehule wrote: 2015-05-29 9:28 GMT+02:00 Jeevan Chalke jeevan.cha...@gmail.com: I agree with Peter that we don't tab-complete everything we possibly could, but using tab after SET ROLE TO provides DEFAULT as an option, which seems wrong. This patch adds a list of roles there, which I guess is better to have than giving something unusual. ... But back to this topic. I am thinking it is a little bit different due to the fact that we support two very similar syntaxes for one feature. And it looks a little bit strange that one way is supported by autocomplete and the second is not. Yeah, it's a bit strange. We have a specific autocomplete rule for SET ROLE, but SET ROLE TO is treated as a generic GUC. With your patch, we'd also lose the auto-completion of SET ROLE TO DEFAULT. I think we want to encourage people to use the SQL-standard syntax SET ROLE ... rather than the PostgreSQL-specific SET ROLE TO ... On the whole, this just doesn't seem like much of an improvement. I'll mark this as 'rejected' in the commitfest app. PS. I note that the auto-completion for SET XXX TO ... is pretty lousy in general. We have rules for DateStyle, IntervalStyle, GEQO and search_path, but that's it. That could be expanded a lot. All enum-type GUCs could be handled with a single rule that queries pg_settings.enumvals, for example, and booleans would be easy too. But that's a different story. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch to improve a few appendStringInfo* calls
On 2 July 2015 at 21:59, Heikki Linnakangas hlinn...@iki.fi wrote: Applied the straightforward parts. Thanks for committing. I left out the changes like - appendStringInfoString(collist, buf.data); + appendBinaryStringInfo(collist, buf.data, buf.len); because they're not an improvement in readability, IMHO, and they were not in performance-critical paths. Perhaps we can come up with appendStringInfoStringInfo at some later date. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] pg_archivecleanup, and backup filename to specify as an argument
Hi, While implementing the patch around pg_archivecleanup, I found the following warning about pg_archivecleanup in the documentation: "Note that the .backup file name passed to the program should not include the extension." ISTM that pg_archivecleanup works as expected even if a .backup file with the extension is specified, as follows. So, isn't this warning already obsolete? Or am I missing something?

$ pg_archivecleanup -d -x .zip myarchive 000000010000000000000009.00000028.backup.zip
pg_archivecleanup: keep WAL file myarchive/000000010000000000000009 and later
pg_archivecleanup: removing file myarchive/000000010000000000000005.zip
pg_archivecleanup: removing file myarchive/000000010000000000000003.zip
pg_archivecleanup: removing file myarchive/000000010000000000000001.zip
pg_archivecleanup: removing file myarchive/000000010000000000000007.zip
pg_archivecleanup: removing file myarchive/000000010000000000000006.zip
pg_archivecleanup: removing file myarchive/000000010000000000000004.zip
pg_archivecleanup: removing file myarchive/000000010000000000000002.zip
pg_archivecleanup: removing file myarchive/000000010000000000000008.zip

Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: remove nclients/nthreads constraint from pgbench
This doesn't behave correctly if you set -j to greater than -c, and also use the rate limit option: [...] I think that can be fixed by just setting nthreads internally to nclients, if nthreads > nclients. But please double-check that the logic used to calculate throttle_delay makes sense in general, when nthreads is not a multiple of nclients. Indeed, I probably did not consider nthreads > nclients. The fix you suggest seems ok. I'll check it before sending a new v3. -- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WALWriter active during recovery
On 2015-07-02 14:34:48 +0100, Simon Riggs wrote: This was pushed back from last CF and I haven't worked on it at all, nor will I. Pushing back again. Let's return it with feedback, not move it, then. Moving entries along which aren't expected to receive updates anytime soon isn't a good idea; there are more than enough entries in each CF. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 2 July 2015 at 09:44, Beena Emerson memissemer...@gmail.com wrote: I am not sure how combining both will work out. Use cases needed. Catalog Method: Is it safe to assume we are not going ahead with the Catalog approach? Yes -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello, Unless I am missing something, I guess you can still keep the actual code that updates counters outside the core if you adopt the approach that Simon suggests. Yes. The code to extract progress information from VACUUM and store it in shared memory can be outside core even with pg_stat_activity as a user interface. Whatever the view (existing/new), any related counters would have a valid (non-NULL) value when read off the view iff hooks are set, perhaps because you have an extension that sets them. I guess that means any operation that supports progress tracking would have an extension with suitable hooks implemented. Do you mean to say that any operation/application that wants the progress tracking feature will dynamically load the progress-checker module, which will set the hooks for progress reporting? If yes, unless I am missing something, such dynamic loading cannot happen if we use pg_stat_activity, as it gets values from shared memory. The module has to be in shared_preload_libraries to allocate shared memory, so this will mean the module can be loaded only at server restart. Am I missing something? Thank you, Rahila Syed -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] raw output from copy
On 07/02/2015 07:14 AM, Pavel Stehule wrote: Hi I'll do it today evening Pavel, Please don't top-post on the PostgreSQL lists. You've been around here long enough to know that bottom posting is our custom. I posted a patch for this in 2013 at http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net but it can apply to a SELECT, and doesn't need COPY. Nobody seemed very interested, so I dropped it. Apparently people now want something along these lines, which is good. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WALWriter active during recovery
On 2 July 2015 at 14:38, Andres Freund and...@anarazel.de wrote: On 2015-07-02 14:34:48 +0100, Simon Riggs wrote: This was pushed back from last CF and I haven't worked on it at all, nor will I. Pushing back again. Let's return it with feedback, not move it, then. Moving entries along which aren't expected to receive updates anytime soon isn't a good idea; there are more than enough entries in each CF. Although I agree, the interface won't let me do that, so I will leave it as-is. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Patch to improve a few appendStringInfo* calls
David Rowley wrote: On 2 July 2015 at 21:59, Heikki Linnakangas hlinn...@iki.fi wrote: I left out the changes like - appendStringInfoString(collist, buf.data); + appendBinaryStringInfo(collist, buf.data, buf.len); because they're not an improvement in readability, IMHO, and they were not in performance-critical paths. Perhaps we can come up with appendStringInfoStringInfo at some later date. I had this exact thought when I saw your patch (though it was appendStringInfoSI to me because the other name is too long and a bit confusing). It seems straightforward enough ... -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing the size of BufferTag remodeling forks
Andres Freund and...@anarazel.de writes: 1) Introduce a shared pg_relfilenode table. Every table, even shared/nailed ones, gets an entry therein. It's there to make it possible to uniquely allocate relfilenodes across databases and tablespaces. 2) Replace relation forks, with the exception of the init fork which is special anyway, with separate relfilenodes. Stored in separate columns in pg_class. Thoughts? I'm concerned about the traffic and contention involved with #1. I'm also concerned about the assumption that relfilenode should, or even can be, unique across an entire installation. (I suppose widening it to 8 bytes would fix some of the hazards there, but that bloats your buffer tag again.) But here's the big problem: you're talking about a huge amount of work for what seems likely to be a microscopic improvement in some operations. Worse, we'll be taking penalties for other operations. How will you do DropDatabaseBuffers() for instance? CREATE DATABASE is going to be a problem, too. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
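To make the size trade-off concrete, here is a rough sketch with simplified stand-in structs (not the real PostgreSQL definitions, which live in buf_internals.h): today's tag carries a three-part RelFileNode plus fork and block numbers; a 4-byte installation-wide relfilenode would shrink it dramatically, while the 8-byte variant Tom mentions wins back some of that bloat:

```c
#include <stdint.h>

/* Simplified stand-ins for the buffer tag layouts under discussion. */
typedef struct
{
    uint32_t spcNode;      /* tablespace */
    uint32_t dbNode;       /* database */
    uint32_t relNode;      /* relation */
} RelFileNodeSketch;

typedef struct
{
    RelFileNodeSketch rnode;
    int32_t  forkNum;      /* fork number folded into the tag today */
    uint32_t blockNum;
} BufferTagCurrent;         /* 20 bytes */

typedef struct
{
    uint32_t relfilenode;  /* unique across the installation, fork included */
    uint32_t blockNum;
} BufferTagNarrow;          /* 8 bytes */

typedef struct
{
    uint64_t relfilenode;  /* widened to dodge exhaustion/uniqueness hazards */
    uint32_t blockNum;
} BufferTagWide;            /* 16 bytes on typical 64-bit ABIs */
```

So the 8-byte relfilenode still beats today's 20-byte tag, but gives back half the savings of the 4-byte version, plus padding, which is the bloat being objected to.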
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 07/02/2015 04:33 PM, Simon Riggs wrote: On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/27/2015 07:45 AM, Amit Kapila wrote: Sometime back on one of the PostgreSQL blog [1], there was discussion about the performance of drop/truncate table for large values of shared_buffers and it seems that as the value of shared_buffers increase the performance of drop/truncate table becomes worse. I think those are not often used operations, so it never became priority to look into improving them if possible. I have looked into it and found that the main reason for such a behaviour is that for those operations it traverses whole shared_buffers and it seems to me that we don't need that especially for not-so-big tables. We can optimize that path by looking into buff mapping table for the pages that exist in shared_buffers for the case when table size is less than some threshold (say 25%) of shared buffers. Attached patch implements the above idea and I found that performance doesn't dip much with patch even with large value of shared_buffers. I have also attached script and sql file used to take performance data. I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far, ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers)) There are no unresolved issues with the approach, nor is it true it is slower. If you think there are some, you should say what they are, not act high handedly to reject a patch without reason. Oh, I missed the NBuffers / 4 threshold. (The total_blocks calculation is prone to overflowing, btw, if you have a table close to 32 TB in size. But that's trivial to fix) I concur that we should explore using a radix tree or something else that would naturally allow you to find all buffers for relation/database X quickly. 
I doubt that it can be managed efficiently while retaining optimal memory management. If it can, it's going to be very low priority (or should be). This approach works, so let's do it now. If someone comes up with a better way later, great. *shrug*. I don't think this is ready to be committed. I can't stop you from working on this, but as far as this commitfest is concerned, case closed. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
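The NBuffers/4 threshold mentioned above amounts to choosing between two invalidation strategies by expected cost. A sketch of that decision (hypothetical function, illustrative only):

```c
#include <stdint.h>

/* Dropping a relation's buffers can either probe the buffer mapping
 * table once per relation block (cost ~ rel_blocks) or scan every
 * shared buffer (cost ~ nbuffers).  The patch discussed probing only
 * when the relation is small relative to shared_buffers (~25%).
 * Doing the comparison in 64-bit keeps it safe for relations
 * approaching 32 TB, the overflow hazard Heikki notes above. */
static int use_targeted_drop(uint64_t rel_blocks, uint64_t nbuffers)
{
    return rel_blocks < nbuffers / 4;
}
```

A small table against a huge shared_buffers takes the targeted path; anything at or beyond the threshold falls back to the full scan, so the worst case never exceeds today's O(shared_buffers) behavior.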
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 07/02/2015 04:18 PM, Amit Kapila wrote: On Thu, Jul 2, 2015 at 6:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote: On 06/27/2015 07:45 AM, Amit Kapila wrote: Sometime back on one of the PostgreSQL blog [1], there was discussion about the performance of drop/truncate table for large values of shared_buffers and it seems that as the value of shared_buffers increase the performance of drop/truncate table becomes worse. I think those are not often used operations, so it never became priority to look into improving them if possible. I have looked into it and found that the main reason for such a behaviour is that for those operations it traverses whole shared_buffers and it seems to me that we don't need that especially for not-so-big tables. We can optimize that path by looking into buff mapping table for the pages that exist in shared_buffers for the case when table size is less than some threshold (say 25%) of shared buffers. Attached patch implements the above idea and I found that performance doesn't dip much with patch even with large value of shared_buffers. I have also attached script and sql file used to take performance data. I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far, Don't you think the approach discussed (delayed cleanup of buffers during checkpoint scan) is sufficient to fix the issues raised. Dunno, but there is no patch for that. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WALWriter active during recovery
On 2 July 2015 at 14:31, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Mar 5, 2015 at 5:22 PM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Dec 18, 2014 at 6:43 PM, Fujii Masao masao.fu...@gmail.com wrote: On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs si...@2ndquadrant.com wrote: Currently, WALReceiver writes and fsyncs data it receives. Clearly, while we are waiting for an fsync we aren't doing any other useful work. The following patch starts WALWriter during recovery and makes it responsible for fsyncing data, allowing WALReceiver to progress other useful actions. With the patch, replication didn't work fine on my machine. I started the standby server after removing all the WAL files from the standby. ISTM that the patch doesn't handle that case. That is, in that case, the standby tries to start up walreceiver and replication to retrieve the REDO-starting checkpoint record *before* starting up walwriter (IOW, before reaching the consistent point). Then, since walreceiver works without walwriter, none of the received WAL data can be fsync'd in the standby, so replication cannot advance any further. I think that walwriter needs to start before walreceiver starts. I just marked this patch as Waiting on Author. This patch was moved to the current CF with the status Needs review. But there are already some review comments which have not been addressed yet, so I marked the patch as Waiting on Author again. This was pushed back from last CF and I haven't worked on it at all, nor will I. Pushing back again. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] WAL-related tools and .paritial WAL file
On Thu, Jul 2, 2015 at 8:42 PM, Fujii Masao wrote:

3) Something not caused by this patch that I just noticed... But pg_resetxlog does not remove .backup files in pg_xlog. Shouldn't they get moved away as well?

pg_resetxlog also doesn't remove .history files in pg_xlog. Those remaining .backup and .history files are harmless after pg_resetxlog is executed, so I don't think that it's required to remove them. Of course we can do that, but some existing applications might depend on the current behavior... So unless there is a strong reason to do so, I'd like to leave it as it is.

Well, I was just surprised not to see them wiped out. Let's not change behavior, then, that has existed for ages. The rest of the patch looks fine to me (I would add dots at the ends of sentences in comment blocks, but that's a detail). The new macros really make the code easier to read and understand!

-- Michael
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 2015-07-02 16:08:03 +0300, Heikki Linnakangas wrote:

I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far, ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers)).

I think upthread there was talk about only using the O(relsize) approach if relsize < NBuffers. That's actually a pretty common scenario, especially in test suites.

I concur that we should explore using a radix tree or something else that would naturally allow you to find all buffers for relation/database X quickly.

I unsurprisingly think that's the right way to go. But I'm not sure that it isn't worth adding another layer of band-aid till we're there...

Greetings, Andres Freund
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Thu, Jul 2, 2015 at 6:38 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

On 06/27/2015 07:45 AM, Amit Kapila wrote: Sometime back on one of the PostgreSQL blogs [1], there was discussion about the performance of drop/truncate table for large values of shared_buffers, and it seems that as the value of shared_buffers increases, the performance of drop/truncate table becomes worse. I think those are not often-used operations, so it never became a priority to look into improving them if possible. I have looked into it and found that the main reason for such behaviour is that for those operations it traverses the whole of shared_buffers, and it seems to me that we don't need that, especially for not-so-big tables. We can optimize that path by looking into the buffer mapping table for the pages that exist in shared_buffers, for the case when the table size is less than some threshold (say 25%) of shared_buffers. The attached patch implements the above idea, and I found that performance doesn't dip much with the patch even with large values of shared_buffers. I have also attached the script and sql file used to take the performance data.

I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far,

Don't you think the approach discussed (delayed cleanup of buffers during checkpoint scan) is sufficient to fix the issues raised?

ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers)).

Can you please explain how you reached that conclusion? It does that only when the relation size is less than 25% of shared buffers.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
[HACKERS] Reducing the size of BufferTag remodeling forks
Hi,

I've complained a number of times that our BufferTag is ridiculously large:

typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;

typedef struct RelFileNode
{
    Oid         spcNode;        /* tablespace */
    Oid         dbNode;         /* database */
    Oid         relNode;        /* relation */
} RelFileNode;

That amounts to 20 bytes. That's problematic because we frequently have to compare or hash the entire buffer tag. Comparing 20 bytes is rather branch intensive, and shows up noticeably on profiles. It's also a stumbling block on the way to a smarter buffer mapping data structure, because it makes e.g. trees rather deep.

The buffer tag is currently used in two situations: 1) Dealing with the buffer mapping, where we need to identify the underlying file uniquely and we need the block number (8 bytes). 2) When writing out a block, we need, in addition to 1), information about where to store the file. That requires the tablespace and database.

You may know that a filenode (RelFileNode->relNode) is currently *not* unique across databases and tablespaces. Additionally, you might have noticed that the above description also disregards relation forks.

I think we should work towards 1) being sufficient for its purpose. My suggestion to get there is twofold:

1) Introduce a shared pg_relfilenode table. Every table, even shared/nailed ones, gets an entry therein. It's there to make it possible to uniquely allocate relfilenodes across databases and tablespaces.

2) Replace relation forks, with the exception of the init fork which is special anyway, with separate relfilenodes, stored in separate columns in pg_class.

This scheme has a number of advantages: We don't need to look at the filesystem anymore to find out whether a relfilenode exists. The buffer tags are 8 bytes. The number of stats doesn't scale O(#forks * #relations) anymore, allowing us to add additional forks more easily.
I think something akin to init forks is going to survive, because they have to be copied without access to the catalogs - but that's fine, they just aren't allowed to go through shared buffers. Afaics that's not a problem.

Obviously this is a rather high-level description, but right now this sounds doable to me. Thoughts?

- Andres
Re: [HACKERS] raw output from copy
On 07/02/2015 09:43 AM, Simon Riggs wrote:

On 2 July 2015 at 14:02, Andrew Dunstan and...@dunslane.net wrote: Please don't top-post on the PostgreSQL lists. You've been around here long enough to know that bottom posting is our custom. I posted a patch for this in 2013 at http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net but it can apply to a SELECT, and doesn't need COPY. Nobody seemed very interested, so I dropped it. Apparently people now want something along these lines, which is good.

It's a shame that both solutions are restricted to either COPY or psql. Both of those are working on suggestions from Tom, so there is no history of preference there. Can we have both please, gentlemen? If we implemented Andrew's solution, how would we request it in a COPY statement? Seems like we would want the RAW format keyword anyway.

What's the use case? My original motivation was that I had a function that returned a bytea (it was a PDF in fact) that I wanted to be able to write to a file. Of course, this is easy enough to do with a client library like perl's DBD::Pg, but it seems sad to have to resort to that for something so simple. My original suggestion (http://www.postgresql.org/message-id/4ea1b83b.2050...@pgexperts.com) was to invent a \bcopy command. I don't have a problem with building in a RAW mode for copy, but we'll still need to teach psql how to deal with it.

Another case where it could be useful is JSON - so we can avoid having to play tricks like http://adpgtech.blogspot.com/2014/09/importing-json-data.html. Similar considerations probably apply to XML, and the tricks are less guaranteed to work.

cheers

andrew
Re: [HACKERS] Reducing the size of BufferTag remodeling forks
On 2015-07-02 09:51:59 -0400, Tom Lane wrote:

Andres Freund and...@anarazel.de writes: 1) Introduce a shared pg_relfilenode table. Every table, even shared/nailed ones, gets an entry therein. It's there to make it possible to uniquely allocate relfilenodes across databases and tablespaces. 2) Replace relation forks, with the exception of the init fork which is special anyway, with separate relfilenodes, stored in separate columns in pg_class. Thoughts?

I'm concerned about the traffic and contention involved with #1.

I don't think that'll be that significant in comparison to all the other work done when creating a relation. Unless we do something wrong, it'll be highly unlikely to get row-level contention, as the oids of the individual relations will come from the oid counter or something similar.

I'm also concerned about the assumption that relfilenode should, or even can be, unique across an entire installation. (I suppose widening it to 8 bytes would fix some of the hazards there, but that bloats your buffer tag again.)

Why? Because it limits the number of relation forks we can have to 2**32? That seems like an extraordinarily large limit. The catalog sizes (pg_attribute most prominently) are a problem at a much lower number of relations than that. Also rel/catcache management.

But here's the big problem: you're talking about a huge amount of work for what seems likely to be a microscopic improvement in some operations.

I don't think it's microscopic at all. Just hacking database and tablespace out of the hashing and comparisons in the buffer tag (obviously not a correct thing, but it works well enough for pgbench) results in quite measurable performance benefits. But the main point isn't the performance improvements themselves, but that it opens the door to smarter buffer mapping algorithms, which e.g. will allow ordered access. Also, not having the current problem with increasing the number of forks would be good.

Worse, we'll be taking penalties for other operations. How will you do DropDatabaseBuffers() for instance? CREATE DATABASE is going to be a problem, too.

More prominently than that, without access to the database/tablespace we couldn't even write out dirty buffers in a reasonable manner. That's why I think we're going to have to continue storing those two in the buffer descriptors, just not include them in the buffer mapping.

Greetings, Andres Freund
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
Simon Riggs si...@2ndquadrant.com writes:

On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote: I'm marking this as returned with feedback in the commitfest.

There are no unresolved issues with the approach, nor is it true that it is slower. If you think there are some, you should say what they are, not act high-handedly to reject a patch without reason.

Have you read the thread? There were plenty of objections, as well as a design for a better solution. In addition, we should think about more holistic solutions such as Andres' nearby proposal (which I doubt will work as-is, but maybe somebody will think of how to fix it). Committing a band-aid will just make it harder to pursue real fixes.

regards, tom lane
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
Hi KaiGai-san,

On 2015/07/02 18:31, Kouhei Kaigai wrote:

Even though I'd like to see a committer's opinion, I could not come up with an idea better than what you proposed; a foreign-/custom-scan has an alternative plan if scanrelid==0.

I'd also like to hear the opinion!

Let me introduce a few cases we should pay attention to. Foreign/CustomScan nodes may stack; that means a Foreign/CustomScan node may have a child node that includes another Foreign/CustomScan node with scanrelid==0. (At this moment, ForeignScan cannot have a child node; however, more aggressive push-down [1] will need the same feature to fetch tuples from a local relation and construct a VALUES() clause.) In this case, the highest Foreign/CustomScan node (the one nearest to LockRows or ModifyTuples) runs the alternative sub-plan that includes the scan/join plans dominated by fdw_relids or custom_relids. For example:

LockRows
 -> HashJoin
     -> CustomScan (AliceJoin)
         -> SeqScan on t1
         -> CustomScan (CarolJoin)
             -> SeqScan on t2
             -> SeqScan on t3
     -> Hash
         -> CustomScan (BobJoin)
             -> SeqScan on t4
             -> ForeignScan (remote join involving ft5, ft6)

In this case, AliceJoin will have an alternative sub-plan to join t1, t2 and t3, which shall be used on EvalPlanQual(). Also, BobJoin will have an alternative sub-plan to join t4, ft5 and ft6. CarolJoin and the ForeignScan will also have alternative sub-plans; however, these are not used in this case. Probably, it works fine.

Yeah, I think so too.

Do we have a potential scenario where a foreign-/custom-join is located over a LockRows node? (Subquery expansion may give such a case?) Anyway, it doesn't make a problem, does it?

IIUC, I don't think we have such a case.

On the next step, how do we implement this design? I guess that the planner needs to keep a path that contains neither a foreign-join nor a custom-join with scanrelid==0. Probably, a cheapest_builtin_path of RelOptInfo is needed that never contains this remote/custom join logic, as a seed of the alternative sub-plan.
Yeah, I think so too, but I've not figured out how to implement this yet. To be honest, ISTM that it's difficult to do that simply and efficiently for the foreign/custom-join-pushdown API that we have for 9.5. It's a little late, but what I started thinking is to redesign that API so that it is called at standard_join_search, as discussed in [2]: (1) place that API call *after* the set_cheapest call, and (2) place another set_cheapest call after that API call for each joinrel. By the first set_cheapest call, I think we could probably save an alternative path that we need in cheapest_builtin_path. By the second set_cheapest call following that API call, we could consider foreign/custom-join-pushdown paths as well. What do you think about this idea?

create_foreignscan_plan() or create_customscan_plan() will be able to construct these alternative plans, regardless of the extensions. So, individual FDWs/CSPs don't need to care about this alternative sub-plan, do they? After that, once ExecScanFetch() is called under EvalPlanQual(), these Foreign/CustomScan nodes with scanrelid==0 run the alternative sub-plan, to validate the latest tuple.

Hmm... It looks to me like a workable approach.

Yeah, I think so too.

Fujita-san, are you available to make a patch with this approach? If so, I'd like to volunteer to review it.

Yeah, I'm willing to make a patch if we obtain the consensus! And I'd be happy if you helped me with the work!

Best regards, Etsuro Fujita

[2] http://www.postgresql.org/message-id/5451.1426271...@sss.pgh.pa.us
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Thu, Jul 2, 2015 at 5:44 PM, Beena Emerson memissemer...@gmail.com wrote:

Hello, There has been a lot of discussion. It has become a bit confusing. I am summarizing my understanding of the discussion till now. Kindly let me know if I missed anything important.

Backward compatibility: We have to provide support for the current format and behavior for synchronous replication (the first running standby from the list in s_s_names). In case the new format does not include a GUC, then a special value is to be specified for s_s_names to indicate that.

Priority and quorum: Quorum treats all the standbys with the same priority, while in priority behavior each one has a different priority and ACKs must be received from the specified k lowest-priority servers. I am not sure how combining both will work out. Mostly we would like to have some standbys from each data center in sync. Can it not be achieved by quorum only?

So you're wondering whether there is a use case where both quorum and priority are used together? For example, please imagine the case where you have two standby servers (say A and B) in the local site, and one standby server (say C) in a remote disaster recovery site. You want to set up sync replication so that the master waits for an ACK from either A or B, i.e., the setting of 1(A, B). Also, only when either A or B crashes, you want to make the master wait for an ACK from either the remaining local standby or C. On the other hand, you don't want to use a setting like 1(A, B, C), because in this setting C can be the sync standby when the master crashes, and both A and B might be far behind C. In that case, you would need to promote the remote standby server C to be the new master... this is what you'd like to avoid. The setting that you need is 1(1[A, C], 1[B, C]) in Michael's proposed grammar.

Regards, -- Fujii Masao
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 2 July 2015 at 14:08, Heikki Linnakangas hlinn...@iki.fi wrote:

On 06/27/2015 07:45 AM, Amit Kapila wrote: Sometime back on one of the PostgreSQL blogs [1], there was discussion about the performance of drop/truncate table for large values of shared_buffers, and it seems that as the value of shared_buffers increases, the performance of drop/truncate table becomes worse. I think those are not often-used operations, so it never became a priority to look into improving them if possible. I have looked into it and found that the main reason for such behaviour is that for those operations it traverses the whole of shared_buffers, and it seems to me that we don't need that, especially for not-so-big tables. We can optimize that path by looking into the buffer mapping table for the pages that exist in shared_buffers, for the case when the table size is less than some threshold (say 25%) of shared_buffers. The attached patch implements the above idea, and I found that performance doesn't dip much with the patch even with large values of shared_buffers. I have also attached the script and sql file used to take the performance data.

I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far, ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers)).

There are no unresolved issues with the approach, nor is it true that it is slower. If you think there are some, you should say what they are, not act high-handedly to reject a patch without reason.

I concur that we should explore using a radix tree or something else that would naturally allow you to find all buffers for relation/database X quickly.

I doubt that it can be managed efficiently while retaining optimal memory management. If it can, it's going to be very low priority (or should be). This approach works, so let's do it now. If someone comes up with a better way later, great.
-- Simon Riggs  http://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On Thu, Jul 2, 2015 at 7:27 PM, Heikki Linnakangas hlinn...@iki.fi wrote:

On 07/02/2015 04:18 PM, Amit Kapila wrote: Don't you think the approach discussed (delayed cleanup of buffers during checkpoint scan) is sufficient to fix the issues raised?

Dunno, but there is no patch for that.

That's fair enough; however, your reply sounded to me like there is no further need to explore this idea unless we have something like a radix-tree-based approach.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] raw output from copy
2015-07-02 16:02 GMT+02:00 Andrew Dunstan and...@dunslane.net: On 07/02/2015 09:43 AM, Simon Riggs wrote: On 2 July 2015 at 14:02, Andrew Dunstan and...@dunslane.net mailto: and...@dunslane.net wrote: Please don't top-post on the PostgreSQL lists. You've been around here long enough to know that bottom posting is our custom. I posted a patch for this in 2013 at http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net but it can apply to a SELECT, and doesn't need COPY. Nobody seemed very interested, so I dropped it. Apparently people now want something along these lines, which is good. It's a shame that both solutions are restricted to either COPY or psql. Both of those are working on suggestions from Tom, so there is no history of preference there. Can we have both please, gentlemen? If we implemented Andrew's solution, how would we request it in a COPY statement? Seems like we would want the RAW format keyword anyway. What's the use case? My original motivation was that I had a function that returned a bytea (it was a PDF in fact) that I wanted to be able to write to a file. Of course, this is easy enough to do with a client library like perl's DBD::Pg, but it seems sad to have to resort to that for something so simple. My original suggestion ( http://www.postgresql.org/message-id/4ea1b83b.2050...@pgexperts.com) was to invent a \bcopy command. I don't have a problem in building in a RAW mode for copy, but we'll still need to teach psql how to deal with it. It can be used from psql without any problems. Another case where it could be useful is JSON - so we can avoid having to play tricks like http://adpgtech.blogspot.com/2014/09/importing-json-data.html. Similar considerations probably apply to XML, and the tricks are less guaranteed to work. cheers andrew
Re: [HACKERS] drop/truncate table sucks for large values of shared buffers
On 06/27/2015 07:45 AM, Amit Kapila wrote: Sometime back on one of the PostgreSQL blogs [1], there was discussion about the performance of drop/truncate table for large values of shared_buffers, and it seems that as the value of shared_buffers increases, the performance of drop/truncate table becomes worse. I think those are not often-used operations, so it never became a priority to look into improving them if possible. I have looked into it and found that the main reason for such behaviour is that for those operations it traverses the whole of shared_buffers, and it seems to me that we don't need that, especially for not-so-big tables. We can optimize that path by looking into the buffer mapping table for the pages that exist in shared_buffers, for the case when the table size is less than some threshold (say 25%) of shared_buffers. The attached patch implements the above idea, and I found that performance doesn't dip much with the patch even with large values of shared_buffers. I have also attached the script and sql file used to take the performance data.

I'm marking this as returned with feedback in the commitfest. In addition to the issues raised so far, ISTM that the patch makes dropping a very large table with small shared_buffers slower (DropForkSpecificBuffers() is O(n) where n is the size of the relation, while the current method is O(shared_buffers)).

I concur that we should explore using a radix tree or something else that would naturally allow you to find all buffers for relation/database X quickly.

- Heikki
Re: [HACKERS] raw output from copy
On 07/02/2015 09:02 AM, Andrew Dunstan wrote: On 07/02/2015 07:14 AM, Pavel Stehule wrote: Hi I'll do it today evening Pavel, Please don't top-post on the PostgreSQL lists. You've been around here long enough to know that bottom posting is our custom. I posted a patch for this in 2013 at http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net but it can apply to a SELECT, and doesn't need COPY. Nobody seemed very interested, so I dropped it. Apparently people now want something along these lines, which is good. For reference, here's the Wayback Machine's version of the original blog post referred to: http://web.archive.org/web/20110916023912/http://people.planetpostgresql.org/andrew/index.php?/archives/196-Clever-trick-challenge.html cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WALWriter active during recovery
On Thu, Mar 5, 2015 at 5:22 PM, Fujii Masao masao.fu...@gmail.com wrote:

On Thu, Dec 18, 2014 at 6:43 PM, Fujii Masao masao.fu...@gmail.com wrote:

On Tue, Dec 16, 2014 at 3:51 AM, Simon Riggs si...@2ndquadrant.com wrote: Currently, WALReceiver writes and fsyncs the data it receives. Clearly, while we are waiting for an fsync we aren't doing any other useful work. The following patch starts WALWriter during recovery and makes it responsible for fsyncing data, allowing WALReceiver to progress other useful actions.

With the patch, replication didn't work fine on my machine. I started the standby server after removing all the WAL files from the standby. ISTM that the patch doesn't handle that case. That is, in that case, the standby tries to start up walreceiver and replication to retrieve the REDO-starting checkpoint record *before* starting up walwriter (IOW, before reaching the consistent point). Then, since walreceiver works without walwriter, none of the received WAL data can be fsync'd in the standby, so replication cannot advance any further. I think that walwriter needs to start before walreceiver starts. I just marked this patch as Waiting on Author.

This patch was moved to the current CF with the status Needs review. But there are already some review comments which have not been addressed yet, so I marked the patch as Waiting on Author again.

Regards, -- Fujii Masao
Re: [HACKERS] raw output from copy
On 2 July 2015 at 14:02, Andrew Dunstan and...@dunslane.net wrote: Please don't top-post on the PostgreSQL lists. You've been around here long enough to know that bottom posting is our custom. I posted a patch for this in 2013 at http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net but it can apply to a SELECT, and doesn't need COPY. Nobody seemed very interested, so I dropped it. Apparently people now want something along these lines, which is good. It's a shame that both solutions are restricted to either COPY or psql. Both of those are working on suggestions from Tom, so there is no history of preference there. Can we have both please, gentlemen? If we implemented Andrew's solution, how would we request it in a COPY statement? Seems like we would want the RAW format keyword anyway. -- Simon Riggshttp://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] raw output from copy
2015-07-02 15:43 GMT+02:00 Simon Riggs si...@2ndquadrant.com:

On 2 July 2015 at 14:02, Andrew Dunstan and...@dunslane.net wrote: Please don't top-post on the PostgreSQL lists. You've been around here long enough to know that bottom posting is our custom. I posted a patch for this in 2013 at http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net but it can apply to a SELECT, and doesn't need COPY. Nobody seemed very interested, so I dropped it. Apparently people now want something along these lines, which is good.

It's a shame that both solutions are restricted to either COPY or psql. Both of those are working on suggestions from Tom, so there is no history of preference there. Can we have both please, gentlemen? If we implemented Andrew's solution, how would we request it in a COPY statement? Seems like we would want the RAW format keyword anyway.

I prefer a COPY-like solution - it can be used on both sides (server, client), and it can be used a little more simply with the psql -c XXX pattern.

Regards Pavel

-- Simon Riggs  http://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Information of pg_stat_ssl visible to all users
On 6/10/15 2:17 AM, Magnus Hagander wrote: AIUI that one was just about the DN field, and not about the rest. If I understand you correctly, you are referring to the whole thing, not just one field? I think at least the DN field shouldn't be visible to unprivileged users. Actually, I think the whole view shouldn't be accessible to unprivileged users, except maybe your own row. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
Let me introduce a few cases we should pay attention. Foreign/CustomScan node may stack; that means a Foreign/CustomScan node may have child node that includes another Foreign/CustomScan node with scanrelid==0. (At this moment, ForeignScan cannot have child node, however, more aggressive push-down [1] will need same feature to fetch tuples from local relation and construct VALUES() clause.) In this case, the highest Foreign/CustomScan node (that is also nearest to LockRows or ModifyTuples) run the alternative sub-plan that includes scan/join plans dominated by fdw_relids or custom_relids. For example: LockRows - HashJoin - CustomScan (AliceJoin) - SeqScan on t1 - CustomScan (CarolJoin) - SeqScan on t2 - SeqScan on t3 - Hash - CustomScan (BobJoin) - SeqScan on t4 - ForeignScan (remote join involves ft5, ft6) In this case, AliceJoin will have alternative sub-plan to join t1, t2 and t3, then it shall be used on EvalPlanQual(). Also, BobJoin will have alternative sub-plan to join t4, ft5 and ft6. CarolJoin and the ForeignScan will also have alternative sub-plan, however, these are not used in this case. Probably, it works fine. Yeah, I think so too. Sorry, I need to adjust my explanation above a bit: In this case, AliceJoin will have alternative sub-plan to join t1 and CarolJoin, then CarolJoin will have alternative sub-plan to join t2 and t3. Also, BobJoin will have alternative sub-plan to join t4 and the ForeignScan with remote join, and this ForeignScan node will have alternative sub-plan to join ft5 and ft6. Why this recursive design is better? Because it makes planner enhancement much simple than overall approach. Please see my explanation in the section below. On the next step, how do we implement this design? I guess that planner needs to keep a path that contains neither foreign-join nor custom-join with scanrelid==0. Probably, cheapest_builtin_path of RelOptInfo is needed that never contains these remote/custom join logic, as a seed of alternative sub-plan. 
Yeah, I think so too, but I've not fugiured out how to implement this yet. To be honest, ISTM that it's difficult to do that simply and efficiently for the foreign/custom-join-pushdown API that we have for 9.5. It's a little late, but what I started thinking is to redesign that API so that that API is called at standard_join_search, as discussed in [2]; (1) to place that API call *after* the set_cheapest call and (2) to place another set_cheapest call after that API call for each joinrel. By the first set_cheapest call, I think we could probably save an alternative path that we need in cheapest_builtin_path. By the second set_cheapest call following that API call, we could consider foreign/custom-join-pushdown paths also. What do you think about this idea? Disadvantage is larger than advantage, sorry. The reason why we put foreign/custom-join hook on add_paths_to_joinrel() is that the source relations (inner/outer) were not obvious, thus, we cannot reproduce which relations are the source of this join. So, I had to throw a spoon when I tried this approach before. My idea is that we save the cheapest_total_path of RelOptInfo onto the new cheapest_builtin_path just before the GetForeignJoinPaths() hook. Why? It should be a built-in join logic, never be a foreign/custom-join, because of the hook location; only built-in logic shall be added here. Even if either/both of join sub-trees contains foreign/custom-join, these path have own alternative sub-plan at their level, no need to care about at current level. (It is the reason why I adjust my explanation above.) Once this built-in path is kept and foreign/custom-join get chosen by set_cheapest(), it is easy to attach this sub-plan to ForeignScan or CustomScan node. I don't find any significant down-side in this approach. How about your opinion? Regarding to the development timeline, I prefer to put something workaround not to kick Assert() on ExecScanFetch(). 
We may add a warning in the documentation not to replace a built-in join if either or both sub-trees are the target of UPDATE/DELETE or FOR SHARE/UPDATE. Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] raw output from copy
On 07/02/2015 11:02 AM, Tom Lane wrote: Andrew Dunstan and...@dunslane.net writes: Does the COPY line protocol even support binary data? The protocol, per se, just transmits a byte stream. There is a field in the CopyInResponse/CopyOutResponse messages that indicates whether a text or binary copy is being done. One thing we'd have to consider is whether raw mode is sufficiently different from binary to justify an additional value for this field, and if so whether that constitutes a protocol break. IIRC, psql wouldn't really care; it just transfers the byte stream to or from the target file, regardless of text or binary mode. But there might be other client libraries that are smarter and expect binary mode to mean the binary file format specified in the COPY reference page. So there may be value in being explicit about raw mode in these messages. A key point in all this is that people who need raw transfer probably need it in both directions, a point that your SELECT proposal cannot satisfy, but hacking COPY could. So I lean towards the latter really. OK, let's do that. I await the result with interest. cheers andrew
Re: [HACKERS] [BUGS] BUG #13126: table constraint loses its comment
On 2015-05-27 15:10, Michael Paquier wrote: On Wed, Apr 29, 2015 at 1:30 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Sun, Apr 26, 2015 at 6:05 AM, Tom Lane t...@sss.pgh.pa.us wrote: x...@resolvent.net writes: In some circumstances, the comment on a table constraint disappears. Here is an example: Hm, yeah. The problem is that ATExecAlterColumnType() rebuilds all the affected indexes from scratch, and it isn't doing anything about copying their comments to the new objects (either comments on the constraints, or comments directly on the indexes). The least painful way to fix it might be to charter ATPostAlterTypeCleanup to create COMMENT commands and add those to the appropriate work queue, rather than complicating the data structure initially emitted by ATExecAlterColumnType. But it'd still be a fair amount of new code, I'm afraid. Not planning to fix this personally, but maybe someone else would like to take it up. In order to fix this, an idea would be to add a new routine in ruleutils.c that generates the COMMENT query string, and then call it directly from tablecmds.c. On master, I imagine that we could even add some SQL interface if there is some need. Thoughts? After looking at this problem, I noticed that the test case given above does not cover everything: primary key indexes, constraints (CHECK for example) and indexes alone also have their comments removed after ALTER TABLE when those objects are re-created. I finished with the attached patch to fix everything; the patch includes a set of regression tests covering all the code paths added. I was going through the code and have a few comments: - Why do you change the return value of TryReuseIndex? Can't we reuse the same OidIsValid(stmt->oldNode) check that ATExecAddIndex is doing instead?
- I think the changes to ATPostAlterTypeParse should follow the coding of transformTableLikeClause more closely - namely, use the idxcomment field. - Also, the changes to ATPostAlterTypeParse move the code somewhat uncomfortably over the 80-char width; I don't really see a way to fix that except for moving some of the nesting out to another function. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] raw output from copy
On 07/02/2015 10:07 AM, Pavel Stehule wrote: 2015-07-02 16:02 GMT+02:00 Andrew Dunstan and...@dunslane.net: On 07/02/2015 09:43 AM, Simon Riggs wrote: On 2 July 2015 at 14:02, Andrew Dunstan and...@dunslane.net wrote: Please don't top-post on the PostgreSQL lists. You've been around here long enough to know that bottom posting is our custom. I posted a patch for this in 2013 at http://www.postgresql.org/message-id/50f2fa92.9040...@dunslane.net but it can apply to a SELECT, and doesn't need COPY. Nobody seemed very interested, so I dropped it. Apparently people now want something along these lines, which is good. It's a shame that both solutions are restricted to either COPY or psql. Both of those are working on suggestions from Tom, so there is no history of preference there. Can we have both please, gentlemen? If we implemented Andrew's solution, how would we request it in a COPY statement? Seems like we would want the RAW format keyword anyway. What's the use case? My original motivation was that I had a function that returned a bytea (it was a PDF in fact) that I wanted to be able to write to a file. Of course, this is easy enough to do with a client library like perl's DBD::Pg, but it seems sad to have to resort to that for something so simple. My original suggestion (http://www.postgresql.org/message-id/4ea1b83b.2050...@pgexperts.com) was to invent a \bcopy command. I don't have a problem in building in a RAW mode for copy, but we'll still need to teach psql how to deal with it. It can be used from psql without any problems.
In fact your patch will not work with psql's \copy nor to stdout at all, unless I'm misreading it:

-    if (cstate->binary)
+    if (cstate->binary || cstate->raw)
         ereport(ERROR,
                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                  errmsg("COPY BINARY is not supported to stdout or from stdin")));

So it looks like you're only supporting this where the server is writing to a file. That's horribly narrow, and certainly doesn't meet my original need. Does the COPY line protocol even support binary data? If not, we're dead in the water here from the psql POV. Because my patch doesn't use the COPY protocol it doesn't have this problem. Perhaps we should do both, although I'm not sure I understand the use case for the COPY solution. cheers andrew
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Syed, Rahila rahila.s...@nttdata.com writes: Hello, Unless I am missing something, I guess you can still keep the actual code that updates counters outside the core if you adopt an approach that Simon suggests. Yes. The code to extract progress information from VACUUM and store it in shared memory can be outside core even with pg_stat_activity as a user interface. Whatever the view (existing/new), any related counters would have a valid (non-NULL) value when read off the view iff hooks are set, perhaps because you have an extension that sets them. I guess that means any operation that supports progress tracking would have an extension with suitable hooks implemented. Do you mean to say that any operation/application that wants the progress tracking feature will dynamically load the progress checker module which will set the hooks for progress reporting? If yes, unless I am missing something, such dynamic loading cannot happen if we use pg_stat_activity, as it gets values from shared memory. The module has to be a shared_preload_library to allocate shared memory. So this will mean the module can be loaded only at server restart. Am I missing something? TBH, I think that designing this as a hook-based solution is adding a whole lot of complexity for no value. The hard parts of the problem are collecting the raw data and making the results visible to users, and both of those require involvement of the core code. Where is the benefit from pushing some trivial intermediate arithmetic into an external module? If there's any at all, it's certainly not enough to justify problems such as you mention here. So I'd just create a pgstat_report_percent_done() type of interface in pgstat.c and then teach VACUUM to call it directly. regards, tom lane
Re: [HACKERS] Freeze avoidance of very large table.
On Thu, Jul 2, 2015 at 1:06 AM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Jul 2, 2015 at 12:13 AM, Sawada Masahiko sawada.m...@gmail.com wrote: On Thu, May 28, 2015 at 11:34 AM, Sawada Masahiko sawada.m...@gmail.com wrote: On Thu, Apr 30, 2015 at 8:07 PM, Sawada Masahiko sawada.m...@gmail.com wrote: On Fri, Apr 24, 2015 at 11:21 AM, Sawada Masahiko sawada.m...@gmail.com wrote: On Fri, Apr 24, 2015 at 1:31 AM, Jim Nasby jim.na...@bluetreble.com wrote: On 4/23/15 11:06 AM, Petr Jelinek wrote: On 23/04/15 17:45, Bruce Momjian wrote: On Thu, Apr 23, 2015 at 09:45:38AM -0400, Robert Haas wrote: Agreed, no extra file, and the same write volume as currently. It would also match pg_clog, which uses two bits per transaction --- maybe we can reuse some of that code. Yeah, this approach seems promising. We probably can't reuse code from clog because the usage pattern is different (key for clog is xid, while for visibility/freeze map ctid is used). But visibility map storage layer is pretty simple so it should be easy to extend it for this use. Actually, there may be some bit manipulation functions we could reuse; things like efficiently counting how many things in a byte are set. Probably doesn't make sense to fully refactor it, but at least CLOG is a good source for cut/paste/whack. I agree with adding a bit that indicates corresponding page is all-frozen into VM, just like CLOG. I'll change the patch as second version patch. The second patch is attached. In second patch, I added a bit that indicates all tuples in page are completely frozen into visibility map. The visibility map became a bitmap with two bit per heap page: all-visible and all-frozen. The logics around vacuum, insert/update/delete heap are almost same as previous version. This patch lack some point: documentation, comment in source code, etc, so it's WIP patch yet, but I think that it's enough to discuss about this. The previous patch is no longer applied cleanly to HEAD. 
The attached v2 patch is the latest version. Please review it. Attached new rebased version patch. Please give me comments! Now we should review your design and approach rather than code, but since I got an assertion error while trying the patch, I report it. initdb -D test -k caused the following assertion failure.

vacuuming database template1 ... TRAP: FailedAssertion("!(((PageHeader) (heapPage))->pd_flags & 0x0004)", File: "visibilitymap.c", Line: 328)
sh: line 1: 83785 Abort trap: 6 /dav/000_add_frozen_bit_into_visibilitymap_v3/bin/postgres --single -F -O -c search_path=pg_catalog -c exit_on_error=true template1 >/dev/null
child process exited with exit code 134
initdb: removing data directory test

Thank you for the bug report, and comments. A fixed version is attached, and the source code comments are also updated. Please review it. And let me explain again what this patch does in the current design. - An additional bit in the visibility map. I added an additional bit, the all-frozen bit, which indicates that all tuples of the corresponding page are frozen, to the visibility map. This structure is similar to CLOG, so the size of the VM grows to twice what it is today. Also, the header of each heap page may get the PD_ALL_FROZEN flag set, alongside all-visible. - Setting and clearing the all-frozen bit. Update, delete and insert (multi-insert) operations clear the bit for that page, and clear the page header flags at the same time. Only vacuum can set the bit, when all tuples of a page are frozen. - Anti-wraparound vacuum. We have to scan the whole table for XID anti-wraparound today, and that is really quite expensive because of the disk I/O. The main benefit of this proposal is to reduce or avoid such an extremely large quantity of I/O even when an anti-wraparound vacuum is executed. In the lazy_scan_heap() function, I added such logic experimentally. There were several other ideas in the previous discussion, such as a read-only table or a frozen map.
But the advantage of this direction is that we don't need an additional heap file, and we can use the mature VM mechanism. Regards, -- Sawada Masahiko

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 86a2e6b..835d714 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -88,7 +88,8 @@ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
 				Buffer newbuf, HeapTuple oldtup,
 				HeapTuple newtup, HeapTuple old_key_tup,
-				bool all_visible_cleared, bool new_all_visible_cleared);
+				bool all_visible_cleared, bool new_all_visible_cleared,
+				bool all_frozen_cleared, bool new_all_frozen_cleared);
 static void HeapSatisfiesHOTandKeyUpdate(Relation relation,
 				Bitmapset *hot_attrs,
 				Bitmapset *key_attrs, Bitmapset *id_attrs,
@@ -2107,7 +2108,8 @@ heap_insert(Relation
Re: [HACKERS] Raising our compiler requirements for 9.6
On Wed, Jul 1, 2015 at 6:24 PM, Andres Freund and...@anarazel.de wrote: On 2015-07-02 00:15:14 +0200, Andres Freund wrote:

animal     OS          compiler              inline  quiet inline  varargs
brolga     cygwin      gcc-4.3               y       y             n
pademelon  HP-UX 10.2  HP C Compiler 10.32   n       n             n

4.3 obviously supports varargs. Human error. Since, buildfarm quiet-inline test issues aside, pademelon is the only animal not supporting inlines and varargs, I think we should just go ahead and start to use both.

So far this thread is all about the costs of desupporting compilers that don't have these features, and you're making a good argument (that I think we all agree with) that the cost is small. But you haven't really said what the benefits are. In the case of static inline, the main problem with the status quo AFAICS is that you have to do a little dance any time you'd otherwise use a static inline function (for those following along at home, git grep INCLUDE_DEFINITIONS src/include to see how this is done). Now, it is obviously the case that the first time somebody has to do this dance, they will be somewhat inconvenienced, but once you know how to do it, it's not incredibly complicated. So, just to play the devil's advocate here, who really cares? The way we're doing it right now works everywhere and is as efficient on each platform as the compiler on that platform can support. I accept that there is some developer convenience in not having to worry about the INCLUDE_DEFINITIONS thing, and maybe that's a sufficient reason to do it, but is that the only benefit we're hoping to get? I'd consider an argument for adopting one of these features to be much stronger if it were accompanied by arguments like:

- If we require feature X, we can reduce the size of the generated code and improve performance.
- If we require feature X, we can reduce the risk of bugs.
- If we require feature X, static analyzers will be able to understand our code better.
If everybody else here says working around the lack of feature X is too much of a pain in the butt and we want to adopt it purely to make development easier, I'm not going to sit here and try to fight the tide. But I personally don't find such arguments all that exciting. It's not at all clear to me that static inline or variadic macros would make our lives meaningfully better, and like Peter, I think // comments are not something we should adopt, because one style is better than two. Designated initializers would have meaningful documentation value and might help to decrease the risk of bugs, so that almost seems more interesting to me than static inline. We'd need to check what pgindent thinks of them, though, and how much platform coverage we'd be surrendering. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
Amit Kapila wrote: On Sat, Jun 27, 2015 at 12:54 AM, Robert Haas robertmh...@gmail.com wrote: On Mon, Jun 15, 2015 at 2:52 PM, Amit Kapila amit.kapil...@gmail.com wrote: Attached patch provides a fix as per above discussion. I think we should emit some LOG messages here. When we detect the file is there: LOG: ignoring tablespace_map file because no backup_label file exists If the rename fails: LOG: could not rename file %s to %s: %m If it works: LOG: renamed file %s to %s Added the above log messages in attached patch with a small change such that in the messages, file names are displayed with quotes, as most other uses of rename (failure) in that file use quotes to display filenames. Why emit two messages? Can we reduce that to a single one? Maybe the first one could be errdetail or something. -- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] raw output from copy
On 2 July 2015 at 15:07, Pavel Stehule pavel.steh...@gmail.com wrote: It can be used from psql without any problems. It can, but your patch does not yet do that, while Andrew's does. We want a solution that works from psql and other clients. Hopefully the same-ish solution. -- Simon Riggs  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Rework the way multixact truncations work
On Mon, Jun 29, 2015 at 3:48 PM, Andres Freund and...@anarazel.de wrote: New version attached. 0002 looks good, but the commit message should perhaps mention the comment fix. Or commit that separately. Will look at 0003 next. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH] two-arg current_setting() with fallback
Jeevan Chalke jeevan.cha...@enterprisedb.com writes: Attached patch which fixes my review comments. Applied with minor adjustments (mostly cosmetic, but did neither of you notice the compiler warning?) regards, tom lane
Re: [HACKERS] Add checksums without --initdb
On 07/02/2015 11:28 PM, Andres Freund wrote: On 2015-07-02 22:53:40 +0300, Heikki Linnakangas wrote: Add an enabling-checksums mode to the server where it calculates checksums for anything it writes, but doesn't check or complain about incorrect checksums on reads. Put the server into that mode, and then have a background process that reads through all data in the cluster, calculates the checksum for every page, and writes all the data back. Once that's completed, checksums can be fully enabled. You'd need, afaics, a bgworker that connects to every database to read pg_class, to figure out what type of page a relfilenode has. And this probably should call back into the relevant AM or such. Nah, we already assume that every relation data file follows the standard page format, at least enough to have the checksum field at the right location. See FlushBuffer() - it just unconditionally calculates the checksum before writing out the page. (I'm not totally happy about that, but that ship has sailed) - Heikki
Re: [HACKERS] PATCH:do not set Win32 server-side socket buffer size on windows 2012
On 04/10/2015 01:46 PM, chenhj wrote: PostgreSQL has set the Win32 server-side socket buffer size to 32k since 2006, for performance reasons. However, on newer versions of Windows, such as Windows 2012, the default socket buffer size is 64k, and that seems to give better performance (higher throughput). So I propose to apply the attached patch (based on a snapshot of 9.5dev) to set the Win32 server-side socket buffer size to 32k only when the default value is less than 32k. Seems reasonable. I edited the comment somewhat, and added #ifdefs on the new variables to avoid compiler warnings on other platforms.

OS            default socket buffer size (from getsockopt(SO_SNDBUF))
Windows 7:    8k
Windows 2003: 8k
Windows 2008: 8k
Windows 8:    64k
Windows 2012: 64k

The following is my performance test for various SO_SNDBUF settings. Test method: use psql to fetch about 100MB of data from PostgreSQL (Windows) with various SO_SNDBUF settings.

[chenhj@node2 ~]$ time psql -h dbsvr -p 5432 -U postgres -A -t -c "select '1'::char(1000),generate_series(1,10)" >/dev/null
real 0m3.295s
user 0m0.222s
sys 0m0.250s

Environment 1 (default SO_SNDBUF 32k):
Client: PostgreSQL 9.4.1 at RHEL6 (x64)
Server: PostgreSQL 9.4.1 at Windows 2012 (x64)
Network: 1Gbit LAN
Result (execution time):
default (64K), 1.118s
set SO_SNDBUF to 32K, 3.295s (the current implementation)
set SO_SNDBUF to 64K, 2.048s
set SO_SNDBUF to 128K, 1.404s
set SO_SNDBUF to 256K, 1.290s

1) When Windows is used as the client OS the result is similar, but there's no >/dev/null used by my test on Windows. 2) I think the reason the default (64K) is faster than setting SO_SNDBUF to 64K is that dynamic send buffering is disabled once the SO_SNDBUF option is set. https://msdn.microsoft.com/en-us/library/windows/desktop/bb736549(v=vs.85).aspx "Dynamic send buffering for TCP was added on Windows 7 and Windows Server 2008 R2. By default, dynamic send buffering for TCP is enabled unless an application sets the SO_SNDBUF socket option on the stream socket."
Environment 2 (default SO_SNDBUF 32k):
Client: PostgreSQL 9.4.1 at RHEL6 (x64)
Server: PostgreSQL 9.4.1 at Windows 2008 R2 (x64)
Network: 1Gbit LAN
Result (execution time):
default (8K), 7.370s
set SO_SNDBUF to 32K, 4.159s (the current implementation)
set SO_SNDBUF to 64K, 2.875s
set SO_SNDBUF to 128K, 1.593s
set SO_SNDBUF to 256K, 1.324s

I was about to commit the attached, but when I tested this between my Windows 8.1 virtual machine and Linux host, I was not able to see any performance difference. It may be because the case is hobbled by other inefficiencies, in the virtualization or somewhere else, but I wonder if others can reproduce the speedup? - Heikki

From 2a8b9c5fe10d5b8dc4e4bcf00ed0025bc6d24a23 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas heikki.linnakan...@iki.fi
Date: Thu, 2 Jul 2015 22:09:08 +0300
Subject: [PATCH 1/1] Don't set SO_SNDBUF on recent Windows versions that have a bigger default.

It's unnecessary to set it if the default is higher in the first place. Furthermore, setting SO_SNDBUF disables the so-called "dynamic send buffering" feature, which hurts performance further.

Chen Huajun

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a4b37ed..7672287 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -726,6 +726,11 @@ StreamConnection(pgsocket server_fd, Port *port)
 	if (!IS_AF_UNIX(port->laddr.addr.ss_family))
 	{
 		int			on;
+#ifdef WIN32
+		int			oldopt;
+		int			optlen;
+		int			newopt;
+#endif
 
 #ifdef TCP_NODELAY
 		on = 1;
@@ -747,16 +752,43 @@ StreamConnection(pgsocket server_fd, Port *port)
 
 #ifdef WIN32
 		/*
-		 * This is a Win32 socket optimization.  The ideal size is 32k.
-		 * http://support.microsoft.com/kb/823764/EN-US/
+		 * This is a Win32 socket optimization.  The OS send buffer should be
+		 * large enough to send the whole Postgres send buffer in one go, or
+		 * performance suffers.  The Postgres send buffer can be enlarged if a
+		 * very large message needs to be sent, but we won't attempt to
+		 * enlarge the OS buffer if that happens, so somewhat arbitrarily
+		 * ensure that the OS buffer is at least PQ_SEND_BUFFER_SIZE * 4.
+		 * (That's 32kB with the current default.)
+		 *
+		 * The default OS buffer size used to be 8kB in earlier Windows
+		 * versions, but was raised to 64kB in Windows 2012.  So it shouldn't
+		 * be necessary to change it in later versions anymore.  Changing it
+		 * unnecessarily can even reduce performance, because setting
+		 * SO_SNDBUF in the application disables the "dynamic send buffering"
+		 * feature that was introduced in Windows 7.  So before fiddling with
+		 * SO_SNDBUF, check if the current buffer size is already large
+		 * enough and only increase it if necessary.
+		 *
+		 * See
Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file
On Thu, Jul 2, 2015 at 7:44 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Amit Kapila wrote: Added the above log messages in attached patch with small change such that in message, file names will be displayed with quotes as most of other usages of rename (failure) in that file uses quotes to display filenames. Why emit two messages? Can we reduce that to a single one? Maybe the first one could be errdetail or something. Emitting two is not necessary; I think it is better the other way (basically have the second one as errdetail). We already have one similar message in xlog.c done that way: ereport(LOG, (errmsg("online backup mode canceled"), errdetail("\"%s\" was renamed to \"%s\".", BACKUP_LABEL_FILE, BACKUP_LABEL_OLD))); The attached patch consolidates the errmsg into one message. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com rename_mapfile_if_backupfile_not_present_v3.patch Description: Binary data
Re: [HACKERS] Synch failover WAS: Support for N synchronous standby servers - take 2
On Fri, Jul 3, 2015 at 6:54 AM, Josh Berkus j...@agliodbs.com wrote: On 07/02/2015 12:44 PM, Andres Freund wrote: On 2015-07-02 11:50:44 -0700, Josh Berkus wrote: So there's two parts to this: 1. I need to ensure that data is replicated to X places. 2. I need to *know* which places data was synchronously replicated to when the master goes down. My entire point is that (1) alone is useless unless you also have (2). I think there's a good set of usecases where that's really not the case. Please share! My plea for usecases was sincere. I can't think of any. And do note that I'm talking about information on the replica, not on the master, since in any failure situation we don't have the old master around to check. How would you, even theoretically, synchronize that knowledge to all the replicas? Even when they're temporarily disconnected? You can't, which is why what we need to know is when the replica thinks it was last synced from the replica side. That is, a sync timestamp and lsn from the last time the replica ack'd a sync commit back to the master successfully. Based on that information, I can make an informed decision, even if I'm down to one replica. ... because we would know definitively which servers were in sync. So maybe that's the use case we should be supporting? If you want automated failover you need a leader election amongst the surviving nodes. The replay position is all they need to elect the node that's furthest ahead, and that information exists today. I can do that already. If quorum synch commit doesn't help us minimize data loss any better than async replication or the current 1-redundant, why would we want it? If it does help us minimize data loss, how? In your example of 2 : { local_replica, london_server, nyc_server }, if there is not something like quorum commit, only local_replica is synch and the other two are async. In this case, if the local data center gets destroyed, you need to promote either london_server or nyc_server. 
But since they are async, they might not have the data which has already been committed on the master. So: data loss! Of course, as I said yesterday, they might have all the data and no data loss happens at the promotion. But the point is that there is no guarantee that no data loss happens. OTOH, if we use quorum commit, we can guarantee that either london_server or nyc_server has all the data which has been committed on the master. So I think that quorum commit is helpful for minimizing data loss. Regards, -- Fujii Masao
Re: [HACKERS] bugfix: incomplete implementation of errhidecontext
Andres Freund and...@anarazel.de writes: On 2015-06-08 14:44:53 +, Jeevan Chalke wrote: This is a trivial bug fix in the area of hiding error context. I observed that there are two places from which we are calling this function to hide the context in log messages. Those were broken. Broken in which sense? They did prevent stuff from going to the server log? I'm not convinced that hiding stuff from the client is really necessarily the same as hiding it from the server log. We e.g. always send the verbose log to the client, even if we only send the terse version to the server log. I don't mind adjusting things for errhidecontext(), but it's not just a bug. Not only is it not just a bug, I disagree that it's a bug at all. The documentation of the errhidecontext function is crystal clear about what it does: * errhidecontext --- optionally suppress CONTEXT: field of log entry That says log entry, not anything else. Furthermore, this is clearly modeled on errhidestmt(), which also only affects what's written to the log. Generally our position on error reporting is that it's the client's responsibility to decide what parts of a report it will or won't show to the user, so even if we agreed the overall behavior was undesirable, I do not think this is the appropriate fix. I especially object to the part of the patch that suppresses calling the context callback stack functions; that's just introducing inconsistent behavior for no reason. It doesn't prevent collection of context (there are lots of errcontext() calls directly in ereports, which this wouldn't stop), and it will break callers that are using those callbacks for anything more than just calling errcontext(). An example here is that in clauses.c's sql_inline_error_callback, this would not only suppress the CONTEXT line but also reporting of the error cursor location. What is the actual use-case that prompted this complaint?
regards, tom lane
Re: [HACKERS] WAL-related tools and .paritial WAL file
On Thu, Jul 2, 2015 at 8:59 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Jul 2, 2015 at 8:42 PM, Fujii Masao wrote: 3) Something not caused by this patch that I just noticed... But pg_resetxlog does not remove .backup files in pg_xlog. Shouldn't they get moved away as well? pg_resetxlog doesn't remove .history files in pg_xlog either. Those remaining .backup and .history files are harmless after pg_resetxlog is executed, so I don't think that it's required to remove them. Of course we could do that, but some existing applications might depend on the current behavior... So unless there is a strong reason to do so, I'd like to leave it as it is. Well, I was just surprised to not see them wiped out. Let's not change the behavior then that has existed for ages. The rest of the patch looks fine to me (I would add dots at the end of sentences in comment blocks, but that's a detail). The new macros really make the code easier to read and understand! Yep! Applied the patch. Thanks! Regards, -- Fujii Masao
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
On 2015/07/02 23:13, Kouhei Kaigai wrote: To be honest, ISTM that it's difficult to do that simply and efficiently for the foreign/custom-join-pushdown API that we have for 9.5. It's a little late, but what I started thinking is to redesign that API so that it is called at standard_join_search, as discussed in [2]: (1) place that API call *after* the set_cheapest call, and (2) place another set_cheapest call after that API call for each joinrel. By the first set_cheapest call, I think we could probably save an alternative path that we need in cheapest_builtin_path. By the second set_cheapest call following that API call, we could also consider foreign/custom-join-pushdown paths. What do you think about this idea?

The disadvantage is larger than the advantage, sorry. The reason why we put the foreign/custom-join hook on add_paths_to_joinrel() is that the source relations (inner/outer) were not obvious otherwise; we could not reproduce which relations are the source of this join. So I had to give up when I tried this approach before.

Maybe I'm missing something, but my picture of this approach is that if the base relations for a given joinrel are all foreign tables belonging to the same foreign server, then by calling that API there we consider the remote join over all the foreign tables, and if not, we give up on considering the remote join.

My idea is that we save the cheapest_total_path of the RelOptInfo onto the new cheapest_builtin_path just before the GetForeignJoinPaths() hook. Why? It has to be built-in join logic, never a foreign/custom join, because of the hook location; only built-in logic has been added at that point.
My concern about your idea is that since (a) add_paths_to_joinrel is called multiple times per joinrel, and (b) repeated add_path calls through GetForeignJoinPaths in add_paths_to_joinrel might remove old paths that are built-in, it's possible for a path that is not built-in to end up as the cheapest_total_path and thus be saved wrongly onto the cheapest_builtin_path. There might be a good way to cope with that, though.

Regarding the development timeline, I prefer to put in some workaround so as not to trip the Assert() in ExecScanFetch(). We may add a warning to the documentation not to replace a built-in join if either or both of the sub-trees are the target of UPDATE/DELETE or FOR SHARE/UPDATE. I'm not sure that that is a good idea, but anyway, I think we need to hurry fixing this issue.

Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
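The concern above — that by the time cheapest_total_path is snapshotted, a foreign path added by an earlier hook invocation may already be the cheapest — can be sketched with a toy path list. This is purely illustrative (hypothetical names, not planner code):

```python
# Toy model of the ordering hazard: paths are (cost, is_builtin) pairs,
# and "cheapest" is simply the minimum-cost entry, as with
# cheapest_total_path.

paths = []

def add_path(cost, builtin):
    paths.append((cost, builtin))

def cheapest():
    return min(paths)            # lowest cost wins

add_path(100.0, True)            # built-in join path, first call
snapshot1 = cheapest()           # a built-in path: safe to save
add_path(40.0, False)            # cheaper foreign-join path added by hook
snapshot2 = cheapest()           # a later snapshot is no longer built-in
assert snapshot1[1] is True
assert snapshot2[1] is False     # would be saved wrongly as "builtin"
```

The toy shows why the snapshot must be taken before any hook-produced path can enter the list, which is exactly what repeated per-joinrel calls make hard to guarantee.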
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
On 2015-07-02 PM 11:00, Syed, Rahila wrote: Hello, Unless I am missing something, I guess you can still keep the actual code that updates counters outside the core if you adopt an approach that Simon suggests.

Yes. The code to extract progress information from VACUUM and store it in shared memory can be outside core even with pg_stat_activity as a user interface.

Whatever the view (existing/new), any related counters would have a valid (non-NULL) value when read off the view iff hooks are set, perhaps because you have an extension that sets them. I guess that means any operation that supports progress tracking would have an extension with suitable hooks implemented.

Do you mean to say that any operation/application that wants the progress tracking feature will dynamically load the progress checker module, which will set the hooks for progress reporting? If yes, unless I am missing something, such dynamic loading cannot happen if we use pg_stat_activity, as it gets values from shared memory. A module has to be a shared_preload_library to allocate shared memory, so the module can be loaded only at server restart. Am I missing something?

Assuming that a set of hooks per command plus shared memory structure(s) is the way to go, I meant to say that the hook implementations per command would be in their separate modules (of course loaded at server start, for shared memory). Of those, your proposed patch has vacuum_progress, for example. And in the context of my comment above, that means the view would say NULL for commands for which the module has not been set up in advance. IOW, between showing NULL in the view and dynamically loading hook functions, we choose the former, because I don't know what the latter would mean in postgres. Having said that, Tom's suggestion to export pgstat.c function(s) for command(s) may be a more appealing way to go.
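The NULL-unless-hooked behavior being discussed can be sketched as a toy model. The hook and function names below are hypothetical illustrations, not actual PostgreSQL symbols:

```python
# Toy model: a progress column reads NULL (None) unless an extension
# module, loaded at server start, has installed a hook for that command.

vacuum_progress_hook = None      # unset until an extension installs it

def view_progress():
    """What a pg_stat_activity-style column would show for VACUUM."""
    return None if vacuum_progress_hook is None else vacuum_progress_hook()

assert view_progress() is None   # no module set up: the view shows NULL

def my_vacuum_progress():        # hypothetical extension-provided hook
    return 42                    # e.g. pages scanned so far

vacuum_progress_hook = my_vacuum_progress
assert view_progress() == 42     # hook installed: real value appears
```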
Thanks, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
Hello Simon,

We could do better, but that is not a reason not to commit this, as is. Commit, please.

My 0,02€: Please do not commit without further testing... I've submitted a patch to improve checkpoint write scheduling, including X00 hours of performance tests on various cases. This patch significantly changes the load distribution over the whole checkpoint, and AFAICS has been tested only on rather small cases. I'm not sure that the power 1.5 is the right one for all cases. For a big checkpoint over 30 minutes, it may have, or not, very large and possibly unwanted effects. Maybe the 1.5 factor should really be a GUC. Well, what I really think is that it needs performance measurements.

In conclusion, and very egoistically, I would prefer it if this patch could wait for the checkpoint scheduling patch to be considered, as it would basically invalidate the X00 hours of performance tests I ran :-)

-- Fabien. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
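For readers unfamiliar with the patch under debate, the effect of the power 1.5 can be illustrated with a toy pacing function. This is only a sketch of the general idea (a checkpoint is "on schedule" when the fraction of buffers written meets a target function of elapsed progress), not the actual IsCheckpointOnSchedule() logic:

```python
# Toy model of checkpoint write pacing (NOT PostgreSQL's actual code).
# With the plain schedule, target(t) = t; the variant under discussion
# uses target(t) = t**1.5, which demands fewer buffer writes early in
# the checkpoint, when full-page writes make WAL grow fastest.

def on_schedule(written_fraction, progress, exponent=1.5):
    """True if the checkpointer may sleep instead of writing more buffers."""
    return written_fraction >= progress ** exponent

# Early on, the 1.5-power target is more lenient than the linear one...
assert on_schedule(0.30, 0.40)                  # 0.40**1.5 ~= 0.253 <= 0.30
assert not on_schedule(0.30, 0.40, exponent=1.0)  # linear: 0.30 < 0.40
# ...while both schedules agree at the endpoints.
assert on_schedule(1.0, 1.0) and on_schedule(0.0, 0.0)
```

This also makes Fabien's worry concrete: for a long checkpoint, shifting that much write work toward the end may or may not be desirable, which is why he asks for measurements (or a GUC) rather than a hard-coded 1.5.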
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 3, 2015 at 2:20 PM, Martijn van Oosterhout klep...@svana.org wrote: On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote: Hi, On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:

=== Start with an empty database

My guess is you have wal_level = minimal?

Default config, was just initdb'd. So yes, the default wal_level = minimal.

=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in.

How did you continue from there? The container has persistent storage? Or are you reapplying the WAL somewhere else?

The container has persistent storage on the host. What I think is actually unusual is that the script that started postgres was missing an 'exec', so postgres never gets the signal to shut down.

martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 00010001 |grep -wE '16389|16387|16393'
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage len (rec/tot): 16/48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740

Note that the truncate will lead to a new, different, relfilenode.

Really? Comparing the relfilenodes gives the same values before and after the truncate.

Yep, the relfilenodes are not changed in this case because CREATE TABLE and TRUNCATE were executed in the same transaction block.

ctmp=# select * from test;
ERROR: could not read block 0 in file base/16385/16393: read only 0 of 8192 bytes

Hm. I can't reproduce this. Can you include a bit more detail about how to reproduce?

Hmm, for me it is 100% reproducible. Are you familiar with Docker? I can probably construct a Dockerfile that reproduces it pretty reliably.

I could reproduce the problem in the master branch by doing the following steps.

1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements:
   begin;
   create table test(id serial primary key);
   truncate table test;
   commit;
3. shut down the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement:
   select * from test;

The optimization of the TRUNCATE operation that we can use when CREATE TABLE and TRUNCATE are executed in the same transaction block seems to cause the problem. In this case, only the index file truncation is logged, and the index creation in btbuild() is not logged because wal_level is minimal. Then at the subsequent crash recovery, the index file is truncated to 0 bytes... A very simple fix is to log the index creation in that case, but I'm not sure if that's OK to do..

Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
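The failure mechanism Fujii-san describes — the index build is never WAL-logged while the truncate record is — can be modeled with a toy replay simulation. This is pure illustration, not PostgreSQL code; the 8192-byte "index" stands in for whatever btbuild() writes:

```python
# Toy illustration of the bug: with wal_level = minimal, an index built
# in the same transaction that created the table is not WAL-logged, but
# the TRUNCATE still emits a "file truncate to 0 blocks" record.
# Replaying only the logged records therefore leaves the index empty.

BLCKSZ = 8192

def run_transaction(wal):
    """What the live server does; appends only the records it logs."""
    files = {}
    files["16393"] = 0          # create index file -> "file create" logged
    wal.append(("create", "16393"))
    files["16393"] = BLCKSZ     # btbuild() writes the index: NOT logged
    files["16393"] = 0          # truncate -> "file truncate" logged
    wal.append(("truncate", "16393"))
    files["16393"] = BLCKSZ     # index rebuilt after truncate: NOT logged
    return files

def replay(wal):
    """Crash recovery: only the logged records exist."""
    files = {}
    for op, relfilenode in wal:
        files[relfilenode] = 0  # both record types leave a 0-byte file
    return files

wal = []
before_crash = run_transaction(wal)
after_recovery = replay(wal)
assert before_crash["16393"] == BLCKSZ  # live server: index readable
assert after_recovery["16393"] == 0     # post-recovery: "read only 0 of 8192 bytes"
```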
Re: [HACKERS] Exposing PG_VERSION_NUM in pg_config
On Fri, Jul 3, 2015 at 6:27 AM, Tom Lane t...@sss.pgh.pa.us wrote: Michael Paquier michael.paqu...@gmail.com writes: On Wed, Mar 25, 2015 at 8:26 AM, Tom Lane t...@sss.pgh.pa.us wrote: ... So attached is a patch that adds VERSION_NUM in Makefile.global. While there was not exactly universal consensus that we need this, the patch as given is merely two lines, so it seems awfully cheap to Just Do It. Hence, I've gone ahead and committed it. If we start getting complaints about use-cases this doesn't cover, we can re-discuss whether it's worth doing more. This looks fine to me. Thanks. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Interesting study what is C in practice
We spend a lot of time debating which non-standard behaviours are safe to rely on because other software relies on them. Someone actually went and surveyed some major projects about which behaviours are being counted on. The results aren't as conclusive as I would have expected, but even so some of them are still interesting. http://www.cl.cam.ac.uk/~pes20/cerberus/notes50-2015-05-24-survey-discussion.html
Re: [HACKERS] bugfix: incomplete implementation of errhidecontext
2015-07-03 1:07 GMT+02:00 Tom Lane t...@sss.pgh.pa.us: Andres Freund and...@anarazel.de writes: On 2015-06-08 14:44:53 +, Jeevan Chalke wrote: This is a trivial bug fix in the area of hiding error context. I observed that there are two places from which we are calling this function to hide the context in log messages. Those were broken.

Broken in which sense? Did they prevent stuff from going to the server log? I'm not convinced that hiding stuff from the client is really necessarily the same as hiding it from the server log. We e.g. always send the verbose log to the client, even if we only send the terse version to the server log. I don't mind adjusting things for errhidecontext(), but it's not just a bug.

Not only is it not just a bug, I disagree that it's a bug at all. The documentation of the errhidecontext function is crystal clear about what it does:

 * errhidecontext --- optionally suppress CONTEXT: field of log entry

That says "log entry", not anything else. Furthermore, this is clearly modeled on errhidestmt(), which also only affects what's written to the log. Generally our position on error reporting is that it's the client's responsibility to decide what parts of a report it will or won't show to the user, so even if we agreed the overall behavior was undesirable, I do not think this is the appropriate fix. I especially object to the part of the patch that suppresses calling the context callback stack functions; that's just introducing inconsistent behavior for no reason. It doesn't prevent collection of context (there are lots of errcontext() calls directly in ereports, which this wouldn't stop), and it will break callers that are using those callbacks for anything more than just calling errcontext(). An example here is that in clauses.c's sql_inline_error_callback, this would not only suppress the CONTEXT line but also reporting of the error cursor location.

I didn't know that. My idea was that when CONTEXT is not shown, it is useless to collect the data for it.

What is the actual use-case that prompted this complaint?

I would like to use it for controlling (enabling, disabling) CONTEXT in the RAISE statement in plpgsql. I think one option for this purpose is enough, and I would not add another option to specify LOG or CLIENT.

Regards Pavel

regards, tom lane
Re: [HACKERS] PATCH: remove nclients/nthreads constraint from pgbench
This doesn't behave correctly if you set -j to greater than -c, and also use the rate limit option:

Indeed. v3 attached fixes the case when nthreads > nclients.

-- Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index a808546..2517a3a 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -326,8 +326,7 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>dbname</replaceable>
       <para>
        Number of worker threads within <application>pgbench</application>.
        Using more than one thread can be helpful on multi-CPU machines.
-       The number of clients must be a multiple of the number of threads,
-       since each thread is given the same number of client sessions to manage.
+       Clients are distributed as evenly as possible among available threads.
        Default is 1.
       </para>
      </listitem>
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 2c3e365..8dfc2fb 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2792,6 +2792,7 @@ main(int argc, char **argv)
 	int			c;
 	int			nclients = 1;	/* default number of simulated clients */
 	int			nthreads = 1;	/* default number of threads */
+	int			nclients_dealt;	/* clients distributed between threads */
 	int			is_init_mode = 0;	/* initialize mode? */
 	int			is_no_vacuum = 0;	/* no vacuum at all before testing? */
 	int			do_vacuum_accounts = 0;	/* do vacuum accounts before testing? */
@@ -3114,6 +3115,10 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* Adjust threads to clients */
+	if (nthreads > nclients)
+		nthreads = nclients;
+
 	/* compute a per thread delay */
 	throttle_delay *= nthreads;
 
@@ -3153,12 +3158,6 @@ main(int argc, char **argv)
 	if (nxacts <= 0 && duration <= 0)
 		nxacts = DEFAULT_NXACTS;
 
-	if (nclients % nthreads != 0)
-	{
-		fprintf(stderr, "number of clients (%d) must be a multiple of number of threads (%d)\n",
-				nclients, nthreads);
-		exit(1);
-	}
-
 	/* --sampling-rate may be used only with -l */
 	if (sample_rate > 0.0 && !use_log)
 	{
@@ -3359,19 +3358,24 @@ main(int argc, char **argv)
 	/* set up thread data structures */
 	threads = (TState *) pg_malloc(sizeof(TState) * nthreads);
 
+	nclients_dealt = 0;
+
 	for (i = 0; i < nthreads; i++)
 	{
 		TState	   *thread = &threads[i];
 
 		thread->tid = i;
-		thread->state = &state[nclients / nthreads * i];
-		thread->nstate = nclients / nthreads;
+		thread->state = &state[nclients_dealt];
+		thread->nstate =
+			(nclients - nclients_dealt + nthreads - i - 1) / (nthreads - i);
 		thread->random_state[0] = random();
 		thread->random_state[1] = random();
 		thread->random_state[2] = random();
 		thread->throttle_latency_skipped = 0;
 		thread->latency_late = 0;
 
+		nclients_dealt += thread->nstate;
+
 		if (is_latencies)
 		{
 			/* Reserve memory for the thread to store per-command latencies */
@@ -3395,6 +3399,9 @@ main(int argc, char **argv)
 		}
 	}
 
+	/* all clients must be assigned to a thread */
+	Assert(nclients_dealt == nclients);
+
 	/* get start up time */
 	INSTR_TIME_SET_CURRENT(start_time);

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
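The thread->nstate arithmetic in the patch can be checked in isolation. This sketch re-expresses it in Python and verifies that every client is dealt to exactly one thread and that per-thread loads differ by at most one:

```python
# The client-distribution formula from the pgbench patch above,
# re-expressed in Python: deal nclients among nthreads as evenly as
# possible, with leftover clients going to the earlier threads.

def deal_clients(nclients, nthreads):
    nthreads = min(nthreads, nclients)   # the patch clamps -j down to -c
    dealt, sizes = 0, []
    for i in range(nthreads):
        # remaining clients divided by remaining threads, rounded up
        n = (nclients - dealt + nthreads - i - 1) // (nthreads - i)
        sizes.append(n)
        dealt += n
    assert dealt == nclients             # mirrors the patch's Assert
    return sizes

assert deal_clients(10, 4) == [3, 3, 2, 2]
assert deal_clients(7, 3) == [3, 2, 2]
assert deal_clients(2, 5) == [1, 1]      # more threads than clients
```

Each thread's slice also starts at `nclients_dealt`, i.e. the running sum of the earlier sizes, so the slices tile the client array exactly.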
Re: [HACKERS] WAL logging problem in 9.4.3?
On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote: Hi, On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:

=== Start with an empty database

My guess is you have wal_level = minimal?

Default config, was just initdb'd. So yes, the default wal_level = minimal.

=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in.

How did you continue from there? The container has persistent storage? Or are you reapplying the WAL somewhere else?

The container has persistent storage on the host. What I think is actually unusual is that the script that started postgres was missing an 'exec', so postgres never gets the signal to shut down.

martijn@martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 00010001 |grep -wE '16389|16387|16393'
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: , desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage len (rec/tot): 16/48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: , desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: , desc: file create: base/16385/16389
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: , desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: , desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: , desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage len (rec/tot): 16/48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: , desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740

Note that the truncate will lead to a new, different, relfilenode.

Really? Comparing the relfilenodes gives the same values before and after the truncate.

ctmp=# select * from test;
ERROR: could not read block 0 in file base/16385/16393: read only 0 of 8192 bytes

Hm. I can't reproduce this. Can you include a bit more detail about how to reproduce?

Hmm, for me it is 100% reproducible. Are you familiar with Docker? I can probably construct a Dockerfile that reproduces it pretty reliably.

Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] multivariate statistics / patch v7
Hello, I started to work on this patch.

attached is v7 of the multivariate stats patch. The main improvement is major refactoring of the clausesel.c portion - splitting the awfully long spaghetti-style functions into smaller pieces, making it much more understandable etc.

Thank you, it looks clearer. I have some comments from a brief look at this. This patchset is relatively large, so I will comment on a per-notice basis, which means I'll send comments before examining the entire patchset. Sorry in advance for the desultory comments.

=== General comments:

- You included unnecessary stuff such as regression.diffs in these patches.

- OID 3307 is now used by pg_stat_file. I moved pg_mv_stats_dependencies_info/show to 3311/3312.

- Single-variate stats have a mechanism to inject arbitrary values as statistics, that is, get_relation_stats_hook and similar stuff. I want a similar mechanism for multivariate statistics, too.

0001:

- I also don't think it is right for expression_tree_walker to recognize RestrictInfo, since it is not a part of an expression.

0003:

- In clauselist_selectivity, find_stats is uselessly called for a single clause. It should be called only after the clause list is found to consist of more than one clause.

- The search for vars to be compared with mv-stat columns, which find_stats does, should stop at disjunctions. But this patch doesn't behave that way, and that should be an unwanted behavior. The following steps show that:

=# CREATE TABLE t1 (a int, b int, c int);
=# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, ) a);
=# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
 Seq Scan on t1 (cost=0.00..230.00 rows=1 width=12)
=# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
=# ANALYZE t1;
=# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
 Seq Scan on t1 (cost=0.00..230.00 rows=268 width=12)

Rows changed unwantedly. It seems not as simple a thing as your code assumes.
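To see why the new estimate is unwanted, the example query can be evaluated exactly outside the database. The generate_series upper bound is elided in the mail above, so the table size below is an assumption; the exact match count is 1 for any size >= 2, since both disjuncts select only the row with a = 1 (because b = 2a and c = 3a):

```python
# Exact evaluation of: SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3
# for a table populated as (a, 2a, 3a).  N is an assumed table size.

N = 10000  # assumption: the real generate_series bound is not in the mail
rows = [(a, 2 * a, 3 * a) for a in range(N)]
matches = [r for r in rows if (r[0] == 1 and r[1] == 2) or r[2] == 3]
assert len(matches) == 1   # one row qualifies, matching the pre-stats estimate
```

So the original rows=1 estimate was essentially exact, while applying the (a, b, c) histogram across the OR inflated it to 268 — which is the behavior the comment above says should be prevented by stopping var collection at disjunctions.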
I do assume some of those pieces are unnecessary because there already is a helper function with the same purpose (but I'm not aware of that). But IMHO this piece of code begins to look reasonable (especially when compared to the previous state).

Yeah, that kind of work can be done later :p This patch is not so invasive as to make that undoable.

The other major improvement is review of the comments (including FIXMEs and TODOs), and removal of the obsolete / misplaced ones. And there were plenty of those ... These changes made this version ~20k smaller than v6. The patch also rebases to current master, which I assume shall be quite stable - so hopefully no more duplicate OIDs for a while. There are 6 files attached, but only 0002-0006 are actually part of the multivariate statistics patch itself. The first part makes it possible to use pull_varnos() with expression trees containing RestrictInfo nodes, but maybe this is not the right way to fix this (there's another thread where this was discussed).

As mentioned above, checking whether mv stats can be applied would be a more complex matter than you are now assuming. I will consider that as well.

Also, the regression tests testing plan choice with multivariate stats (e.g. that a bitmap index scan is chosen instead of an index scan) fail from time to time. I suppose this happens because the invalidation after ANALYZE is not processed before executing the query, so the optimizer does not see the stats, or something like that.

I have seen that occur, but so far I have no idea how it happens..

regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
Josh Berkus wrote: Say you take this case: 2 : { local_replica, london_server, nyc_server } ... which should ensure that any data which is replicated is replicated to at least two places, so that even if you lose the entire local datacenter, you have the data on at least one remote data center. EXCEPT: say you lose both the local datacenter and communication with the london server at the same time (due to transatlantic cable issues, a huge DDOS, or whatever). You'd like to promote the NYC server to be the new master, but only if it was in sync at the time its communication with the original master was lost ... except that you have no way of knowing that.

Please consider the following: if we have multiple replicas in each DC, we can use the following:

3(local1, 1(london1, london2), 1(nyc1, nyc2))

In this case at least one standby from each DC is a sync rep. When the local and London data centers are lost, NYC promotion can be done by comparing the LSNs. A quorum would also ensure that even if one of the standbys in a data center goes down, another can take over, preventing data loss. In the case of 3(local1, london1, nyc1), if nyc1 is down, transactions would wait indefinitely. This can be avoided.

-- Beena Emerson -- View this message in context: http://postgresql.nabble.com/Support-for-N-synchronous-standby-servers-take-2-tp5849384p5856394.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
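The nested quorum notation above can be given a toy semantics to check these scenarios. The representation below is purely illustrative (this syntax is a proposal under discussion, not an existing PostgreSQL feature): a group (k, members) is satisfied when at least k of its members — standby names or nested groups — are satisfied.

```python
# Toy evaluator for nested quorum specs like
#   3(local1, 1(london1, london2), 1(nyc1, nyc2))
# encoded as (k, [member, ...]) where a member is a standby name or a
# nested (k, [...]) group.

def satisfied(spec, acked):
    """True if the spec is met by the set of standbys that have acked."""
    if isinstance(spec, str):
        return spec in acked
    k, members = spec
    return sum(satisfied(m, acked) for m in members) >= k

spec = (3, ["local1", (1, ["london1", "london2"]), (1, ["nyc1", "nyc2"])])

assert satisfied(spec, {"local1", "london1", "nyc1"})    # one sync per DC
assert satisfied(spec, {"local1", "london2", "nyc2"})    # peers take over
assert not satisfied(spec, {"local1", "london1"})        # no NYC ack yet
```

The third assertion captures the point of the email: with a per-DC inner quorum, losing one standby in a DC does not stall commits (its peer can ack), whereas a flat 3(local1, london1, nyc1) waits forever if nyc1 is down.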
[HACKERS] PostgreSQL 9.5 Alpha 1 build fail with perl 5.22
On 7/2/2015 5:16 PM, Dave Page wrote: The PostgreSQL Global Development Group announces that the alpha release of PostgreSQL 9.5, the latest version of the world's leading open source database, is available today. This release contains previews of all of the features which will be available in the final release of version 9.5, although some details will change before then. Please download, test, and report what you find. Help Test for Bugs

building on cygwin, with

$ perl --version
This is perl 5, version 22, subversion 0 (v5.22.0) built for cygwin-thread-multi-64int

the build fails here; is anyone seeing the same on other platforms?

make -C hstore_plperl all
make[1]: Entering directory '/cygdrive/e/cyg_pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/build/contrib/hstore_plperl'
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -ggdb -O2 -pipe -Wimplicit-function-declaration -I/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/src/pl/plperl -I/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore -I. -I/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore_plperl -I../../src/include -I/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/src/include -I/usr/lib/perl5/5.22/i686-cygwin-threads-64int/CORE -c -o hstore_plperl.o /pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore_plperl/hstore_plperl.c
In file included from /pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore_plperl/hstore_plperl.c:6:0:
/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore/hstore.h:79:0: warning: "HS_KEY" redefined
 #define HS_KEY(arr_,str_,i_) ((str_) + HSE_OFF((arr_)[2*(i_)]))
In file included from /usr/lib/perl5/5.22/i686-cygwin-threads-64int/CORE/perl.h:3730:0,
                 from /pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/src/pl/plperl/plperl.h:48,
                 from /pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore_plperl/hstore_plperl.c:4:
/usr/lib/perl5/5.22/i686-cygwin-threads-64int/CORE/util.h:221:0: note: this is the location of the previous definition
 # define HS_KEY(setxsubfn, popmark, apiver, xsver) \
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -ggdb -O2 -pipe -Wimplicit-function-declaration -shared -o hstore_plperl.dll -Wl,--out-implib=libhstore_plperl.a hstore_plperl.o -L../../src/port -L../../src/common -Wl,-no-undefined -Wl,--allow-multiple-definition -Wl,--enable-auto-import -L/usr/local/lib -Wl,--as-needed -L../../src/backend -lpostgres -lpgcommon -lpgport -lintl -lssl -lcrypto -lz -lreadline -lcrypt -lldap
hstore_plperl.o: In function `hstore_to_plperl':
/pub/devel/postgresql/prova/postgresql-9.5alpha1-1.i686/src/postgresql-9.5alpha1/contrib/hstore_plperl/hstore_plperl.c:16: undefined reference to `hstoreUpgrade'
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rework the way multixact truncations work
On Thu, Jul 2, 2015 at 11:52 AM, Robert Haas robertmh...@gmail.com wrote: Will look at 0003 next.

+appendStringInfo(buf, "offsets [%u, %u), members [%u, %u),

I don't think we typically use this style for notating intervals.

 case XLOG_MULTIXACT_CREATE_ID:
 id = CREATE_ID;
 break;
+case XLOG_MULTIXACT_TRUNCATE_ID:
+id = TRUNCATE;
+break;

If XLOG_MULTIXACT_CREATE_ID -> CREATE_ID, then why not XLOG_MULTIXACT_TRUNCATE_ID -> TRUNCATE_ID?

+ * too old to general truncation records.

s/general/generate/

+MultiXactId oldestMXactDB;

Data type should be OID.

+ * Recompute limits as we are now fully started, we now can correctly
+ * compute how far a members wraparound is away.

s/,/:/ or something. This isn't particularly good English as written.

+ * Computing the actual limits is only possible once the data directory is
+ * in a consistent state. There's no need to compute the limits while
+ * still replaying WAL as no new multis can be created anyway. So we'll
+ * only do further checks after TrimMultiXact() has been called.

Multis can be and are created during replay. What this should really say is that we have no choice about whether to create them or not: we just have to replay whatever's there.

+(errmsg("performing legacy multixact truncation, upgrade master")));

This message needs work. I'm not sure exactly what it should say, but I'm pretty sure that's not clear enough.

I seriously, seriously doubt that it is a good idea to perform the legacy truncation from MultiXactAdvanceOldest() rather than TruncateMultiXact(). The checkpoint hasn't really happened at that point yet; you might truncate away stuff, then crash before the checkpoint is complete, and then when you restart recovery, you've got trouble. I think you should be very, very cautious about rejiggering the order of operations here. The current situation is not good, but casually rejiggering it can make things much worse.
- * If no multixacts exist, then oldestMultiXactId will be the next - * multixact that will be created, rather than an existing multixact. + * Determine the offset of the oldest multixact. Normally, we can read + * the offset from the multixact itself, but there's an important special + * case: if there are no multixacts in existence at all, oldestMXact + * obviously can't point to one. It will instead point to the multixact + * ID that will be assigned the next time one is needed. There's no need to change this; it means the same thing either way. Generally, I think you've weakened the logic in SetOffsetVacuumLimit() appreciably here. The existing code is careful never to set oldestOffsetKnown false when it was previously true. Your rewrite removes that property. Generally, I see no need for this function to be overhauled to the extent that you have, and would suggest reverting the changes that aren't absolutely required. I don't particularly like the fact that find_multixact_start() calls SimpleLruFlush(). I think that's really a POLA violation: you don't expect that a function that looks like a simple inquiry is going to go do a bunch of unrelated I/O in the background. If somebody called find_multixact_start() with any frequency, you'd be sad. You're just doing it this way because of the context *in which you expect find_multixact_start* to be called, which does not seem very future-proof. I prefer Thomas's approach. If TruncateMultiXact() fails to acquire MultiXactTruncationLock right away, does it need to wait, or could it ConditionalAcquire and bail out if the lock isn't obtained? + * Make sure to only attempt truncation if there's values to truncate + * away. In normal processing values shouldn't go backwards, but there's + * some corner cases (due to bugs) where that's possible. I think this comment should be more detailed. 
Is that talking about the same thing as this comment: - * Due to bugs in early releases of PostgreSQL 9.3.X and 9.4.X, - * oldestMXact might point to a multixact that does not exist. - * Autovacuum will eventually advance it to a value that does exist, - * and we want to set a proper offsetStopLimit when that happens, - * so call DetermineSafeOldestOffset here even if we're not actually - * truncating. This comment seems to be saying there's a race condition: + * XXX: It's also possible that the page that oldestMXact is on has + * already been truncated away, and we crashed before updating + * oldestMXact. But why is that an XXX? I think this is just a case of recovery needing to tolerate redo of an action already redone. I'm not convinced that it's a good idea to remove lastCheckpointedOldest and replace it with nothing. It seems like a very good idea to have two separate pieces of state in shared memory: - The oldest offset that we think anyone might need to access to make
Re: [HACKERS] Memory leak fixes for pg_dump, pg_dumpall, initdb and pg_upgrade
On 06/08/2015 09:48 AM, Michael Paquier wrote: Hi all, Please find attached a set of fixes for a couple of things in src/bin: - pg_dump/pg_dumpall: -- getFormattedTypeName, convertTSFunction and myFormatType return strdup'd results that are never free'd. -- convertTSFunction returns const char. I fail to see the point of that... In my opinion we are fine with just returning a char pointer, which is strdup'd so that it can be freed by the caller. - initdb and pg_regress use getaddrinfo, but do not free the returned result with freeaddrinfo(). - Coverity noticed on the way some leaked memory in pg_upgrade's equivalent_locale(). Those issues have been mostly spotted by Coverity; I may have spotted some of them while looking at similar code paths... In any case that's Coverity's win ;) Fixing most of these is not really an improvement, IMO. They're in pg_dump and pg_upgrade, which you only run once and which then exit, so as long as the leaks are not in some tight loops that execute millions of times, it doesn't matter. I committed some of these that seemed like improvements on readability grounds, but please just mark the rest as "ignore" in Coverity. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
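The getaddrinfo() leak mentioned above follows a fixed pattern. A minimal standalone sketch of the correct pairing (generic C, not the actual initdb/pg_regress code; the function name is made up for illustration):

```c
#include <string.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Resolve a passive (bind-able) address and release the result.
 * getaddrinfo() allocates a linked list of addrinfo structs that must
 * be released with freeaddrinfo(), not free().  Returns 0 on success. */
static int
resolve_and_free(const char *service)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int         rc;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_PASSIVE;    /* NULL host => wildcard address */

    rc = getaddrinfo(NULL, service, &hints, &res);
    if (rc != 0)
        return rc;                  /* nothing allocated on failure */

    /* ... use res here ... */

    freeaddrinfo(res);              /* the fix: pair every success with this */
    return 0;
}

int
main(void)
{
    if (resolve_and_free("5432") == 0)
        puts("ok");
    return 0;
}
```

As Heikki notes, in a run-once tool the leak is harmless in practice; the fix mainly keeps Coverity quiet.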
Re: [HACKERS] Reducing ClogControlLock contention
On Wed, Jul 1, 2015 at 6:21 AM, Andres Freund and...@anarazel.de wrote: On 2015-07-01 11:19:40 +0100, Simon Riggs wrote: What tricks are being used?? Please explain why taking 2 locks is bad here, yet works fine elsewhere. I didn't say anything about 'bad'. It's more complicated than one lock. Suddenly you have to care about lock ordering and such. The algorithms for ensuring correctness get more complicated. Taking two locks might also be more expensive than just taking one. I suppose benchmarking will reveal whether there is an issue there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Configurable location for extension .control files
On 03/04/2015 09:41 AM, Oskari Saarenmaa wrote: 18.02.2015, 01:49, Jim Nasby kirjoitti: On 2/17/15 4:39 PM, Oskari Saarenmaa wrote: 10.06.2013, 17:51, Dimitri Fontaine kirjoitti: Andres Freund and...@2ndquadrant.com writes: In any case, no packager is going to ship an insecure-by-default configuration, which is what Dimitri seems to be fantasizing would happen. It would have to be local option to relax the permissions on the directory, no matter where it is. *I* don't want that at all. All I'd like to have is a postgresql.conf option specifying additional locations. Same from me. I think I would even take non-plural location. Here's a patch to allow overriding extension installation directory. The patch allows superusers to change it at runtime, but we could also restrict it to postgresql.conf if needed. I don't really see a point in restricting it (or not implementing the option at all) as the superuser can already load custom C code and sql from anywhere in the filesystem; not having configurable extension directory just makes it harder to test extensions and to create private clusters of PG data in a custom location without using custom binaries. I'm interested in this because it could potentially make it possible to install SQL extensions without OS access. (My understanding is there's some issue with writing the extension files even if LIBDIR is writable by the OS account). I'm not sure this patch makes extensions without OS access any easier to implement; you still need to write the files somewhere, and someone needs to set up the cluster with a nonstandard extension path. Hmm. I think you're on to something with this patch, but I couldn't exactly figure out what. What is the purpose of this patch? I can see a few things that you might want to do: 1. You might want to create a cluster using existing binaries, and set it up so that you can install extra extensions from a custom directory. 
Handy if you don't have write access to /usr, or you only want to make an extension available in one cluster but not others. 2. Like 1, but you want to replace the set of extensions altogether. 3. Writing an automated regression test for a custom extension. With all of those, you have the problem that you actually want to override both the lib-dir and the extensions-dir. So you'll need to set dynamic_library_path too. For 3, fiddling with the configuration file is a bit tedious. And PGXS doesn't currently have support for setting up a test cluster anyway. Oh, and why are we only worried about extensions? There's other stuff in the 'share' directory that you might also want to override in some clusters or as part of regression tests: timezone and tsearch data. Note that you can make Postgres search for all of those things in a different directory by copying the postgres binary elsewhere. It's a bit hacky, but works. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
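Heikki's point that you'd have to override both directories can be made concrete. A hypothetical postgresql.conf fragment: `dynamic_library_path` is an existing GUC, while `extension_directory` stands in for whatever name the proposed patch's new setting ends up with:

```ini
# Hypothetical new setting from the patch under discussion:
# where .control and .sql files are looked up.
extension_directory = '/srv/pg/test-extensions/share'

# Existing GUC: where the extensions' shared libraries are found.
# '$libdir' keeps the built-in directory in the search path.
dynamic_library_path = '/srv/pg/test-extensions/lib:$libdir'
```

Without the second line, CREATE EXTENSION would find the control file but fail to load the module it points at.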
Re: [HACKERS] Refactoring speculative insertion with unique indexes a little
On Thu, Jul 2, 2015 at 1:30 AM, Heikki Linnakangas hlinn...@iki.fi wrote: Sure, if a speculative inserter detects a conflict, it still has to wait. But not in the aminsert call, and not until it cleans up its physical insertion (by super-deleting). Clearly a CHECK_UNIQUE_SPECULATIVE would have to be much closer to CHECK_UNIQUE_PARTIAL than to CHECK_UNIQUE_YES. Why is it not OK for aminsert to do the waiting, with CHECK_UNIQUE_SPECULATIVE? Well, waiting means getting a ShareLock on the other session's XID. You can't do that without first releasing your locks, unless you're okay with unprincipled deadlocks. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 07/01/2015 11:12 PM, Fujii Masao wrote: I don't think this is a good approach because there can be cases where you need to promote even a standby server that doesn't have the sync flag. Please imagine the case where you have sync and async standby servers. When the master goes down, the async standby might be ahead of the sync one. This is possible in practice. In this case, it might be better to promote the async standby instead of the sync one, because the remaining sync standby, which is behind, can easily catch up with the new master. If we're always going to be polling the replicas for furthest ahead, then why bother implementing quorum synch at all? That's the basic question I'm asking. What does it buy us that we don't already have? I'm serious, here. Without any additional information on synch state at failure time, I would never use quorum synch. If there's someone on this thread who *would*, let's speak to their use case and then we can actually get the feature right. Anyone? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Raising our compiler requirements for 9.6
On 2015-07-02 11:46:25 -0400, Robert Haas wrote: In the case of static inline, the main problem with the status quo AFAICS is that you have to do a little dance any time you'd otherwise use a static inline function (for those following along at home, git grep INCLUDE_DEFINITIONS src/include to see how this is done). Now, it is obviously the case that the first time somebody has to do this dance, they will be somewhat inconvenienced, but once you know how to do it, it's not incredibly complicated. I obviously know the schema (having introduced it in pg), but I think the costs are actually rather significant. It makes development more complex, it makes things considerably harder to follow, and it's quite annoying to test (manually redefine PG_USE_INLINE, recompile whole tree). Several times people also argued against using inlines with that trick because it slows down older platforms. So, just to play the devil's advocate here, who really cares? I consider our time one of the most scarce resources around. I'd consider an argument for adopting one of these features to be much stronger if were accompanied by arguments like: - If we require feature X, we can reduce the size of the generated code and improve performance. Converting some of the bigger macros (I tested fastgetattr) to inlines actually does both. See also http://archives.postgresql.org/message-id/4407.1435763473%40sss.pgh.pa.us Now, all that is possible with the INCLUDE_DEFINITIONS stuff, but it makes it considerably more expensive time/complexity wise. If everybody else here says working around the lack of feature X is too much of a pain in the butt and we want to adopt it purely to make development easier, I'm not going to sit here and try to fight the tide. But I personally don't find such arguments all that exciting. I find that an odd attitude. I think // comments are not something we should adopt, because one style is better than two I don't particularly care about // comments either.
They do have the advantage of allowing three more characters of actual content in one line comments ... Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP
On 5/30/15 10:14 PM, Andreas Karlsson wrote: I have written a patch which makes it possible to change SSL certificates (and other SSL parameters, including the CRL) without restarting PostgreSQL. In fact this patch also makes it possible to turn on or off ssl entirely without restart. It does so by initializing a new SSL context when the postmaster receives a SIGHUP, and if the initialization succeeded the old context is replaced by the new. I think this would be a useful feature, and the implementation looks sound. But I don't like how the reload is organized. Reinitializing the context in the sighup handler, aside from questions about how much work one should do in a signal handler, would cause SSL reinitialization for unrelated reloads. We have the GUC assign hook mechanism for handling this sort of thing. The trick would be that when multiple SSL-related settings change, you only want to do one reinitialization. You could either have the different assign hooks communicate with each other somehow, or have them set a "need SSL init" flag that is checked somewhere else. There was some previous discussion[1] on the mailing list about what the proper context should be for the SSL parameters, but as far as I can tell the discussion never reached a conclusion. I have changed the SSL GUCs to PGC_SIGHUP since I felt that was the closest to the truth, but it is not a perfect fit (the backends won't reload the SSL context). Should we add a new context for the SSL GUCs? I think PGC_SIGHUP is fine for this. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Raising our compiler requirements for 9.6
Robert Haas robertmh...@gmail.com writes: So far this thread is all about the costs of desupporting compilers that don't have these features, and you're making a good argument (that I think we all agree with) that the cost is small. But you haven't really said what the benefits are. I made the same remark with respect to varargs macros, and I continue to think that the cost-benefit ratio there is pretty marginal. However, I do think that there's a case to be made for adopting static inline. The INCLUDE_DEFINITIONS dance is very inconvenient, so it discourages people from using static inlines over macros, even in cases where the macro approach is pretty evil (usually, because of multiple-evaluation hazards). If we allowed static inlines then we could work towards having a safer coding standard along the lines of "thou shalt not write macros that evaluate their arguments more than once". So the benefits here are easier development and less risk of bugs. On the other side of the ledger, the costs are pretty minimal; we will not actually break any platforms, at worst we'll make them unpleasant to develop on because of lots of warning messages. We have some platforms like that already, and it's not been a huge problem. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freeze avoidance of very large table.
On 2 July 2015 at 16:30, Sawada Masahiko sawada.m...@gmail.com wrote: Also, the flags of each heap page header might be set PD_ALL_FROZEN, as well as all-visible Is it possible to have VM bits set to frozen but not visible? The description makes those two states sound independent of each other. Are they? Or not? Do we test for an impossible state? -- Simon Riggshttp://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Fix autoconf deprecation warnings
On 05/31/2015 05:22 AM, Andreas Karlsson wrote: I have attached new versions which apply on the current master. Thanks, applied. --- a/configure.in +++ b/configure.in @@ -1473,7 +1473,7 @@ if test x$pgac_cv_func_sigsetjmp = xyes; then AC_DEFINE(HAVE_SIGSETJMP, 1, [Define to 1 if you have sigsetjmp().]) fi -AC_DECL_SYS_SIGLIST +AC_CHECK_DECLS([sys_siglist]) AC_CHECK_FUNC(syslog, [AC_CHECK_HEADER(syslog.h, Hmm, according to the autoconf manual: Macro: AC_DECL_SYS_SIGLIST Same as: AC_CHECK_DECLS([sys_siglist], [], [], [#include <signal.h> /* NetBSD declares sys_siglist in unistd.h. */ #ifdef HAVE_UNISTD_H # include <unistd.h> #endif ]) See AC_CHECK_DECLS. So I replaced AC_DECL_SYS_SIGLIST with that full snippet instead. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );
The following review has been posted through the commitfest application: make installcheck-world: not tested Implements feature: not tested Spec compliant: not tested Documentation: not tested Looks functionally complete. Need a test to show that ALTER TABLE works on views, as discussed on this thread. And confirmation that pg_dump is not broken by this. Message-ID: 20140321205828.gb3969...@tornado.leadboat.com Needs documentation The new status of this patch is: Waiting on Author -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Support for N synchronous standby servers - take 2
On 07/02/2015 11:31 AM, Andres Freund wrote: On 2015-07-02 11:10:27 -0700, Josh Berkus wrote: If we're always going to be polling the replicas for furthest ahead, then why bother implementing quorum synch at all? That's the basic question I'm asking. What does it buy us that we don't already have? What do those topics have to do with each other? A standby fundamentally can be further ahead than what the primary knows about. So you can't do very much with that knowledge on the master anyway? I'm serious, here. Without any additional information on synch state at failure time, I would never use quorum synch. If there's someone on this thread who *would*, let's speak to their use case and then we can actually get the feature right. Anyone? How would you otherwise ensure that your data is both on a second server in the same DC and in another DC? Which is a pretty darn common desire? So there's two parts to this: 1. I need to ensure that data is replicated to X places. 2. I need to *know* which places data was synchronously replicated to when the master goes down. My entire point is that (1) alone is useless unless you also have (2). And do note that I'm talking about information on the replica, not on the master, since in any failure situation we don't have the old master around to check. Say you take this case: 2 : { local_replica, london_server, nyc_server } ... which should ensure that any data which is replicated is replicated to at least two places, so that even if you lose the entire local datacenter, you have the data on at least one remote data center. EXCEPT: say you lose both the local datacenter and communication with the london server at the same time (due to transatlantic cable issues, a huge DDOS, or whatever). You'd like to promote the NYC server to be the new master, but only if it was in sync at the time its communication with the original master was lost ... except that you have no way of knowing that.
Given that, we haven't really reduced our data loss potential or improved availability from the current 1-redundant synch rep. We still need to wait to get the London server back to figure out if we want to promote or not. Now, this configuration would reduce the data loss window: 3 : { local_replica, london_server, nyc_server } As would this one: 2 : { local_replica, nyc_server } ... because we would know definitively which servers were in sync. So maybe that's the use case we should be supporting? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers