Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Sun, Nov 9, 2014 at 10:32 PM, Fujii Masao masao.fu...@gmail.com wrote:

On Sun, Nov 9, 2014 at 6:41 AM, Rahila Syed rahilasye...@gmail.com wrote:

Hello,

The patch was not applied to the master cleanly. Could you update the patch?

Please find attached an updated and rebased patch to compress FPW. The review comments given above have been implemented.

Thanks for updating the patch! Will review it. BTW, I got the following compiler warnings:

xlogreader.c:755: warning: assignment from incompatible pointer type
autovacuum.c:1412: warning: implicit declaration of function 'CompressBackupBlocksPagesAlloc'

I have been looking at this patch; here are some comments:

1) This documentation change is incorrect:

-     <term><varname>full_page_writes</varname> (<type>boolean</type>)
+     <term><varname>full_page_writes</varname> (<type>enum</type>)</term>
      <indexterm>
       <primary><varname>full_page_writes</> configuration parameter</primary>
      </indexterm>
-     </term>

The closing tag of the term block was correctly placed before.

2) This patch defines FullPageWritesStr and full_page_writes_str, but both do more or less the same thing.

3) This patch is touching worker_spi.c and calling CompressBackupBlocksPagesAlloc directly. Why is that necessary? Doesn't a bgworker call InitXLOGAccess once it connects to a database?

4) Be careful as well of whitespace (code lines should also have a maximum of 80 characters):

+ * If compression is set on replace the rdata nodes of backup blocks added in the loop
+ * above by single rdata node that contains compressed backup blocks and their headers
+ * except the header of first block which is used to store the information about compression.
+ */

5) GetFullPageWriteGUC or something similar is necessary, but I think that for consistency with doPageWrites its value should be fetched in XLogInsert and then passed as an extra argument in XLogRecordAssemble.
Thinking more about this, I think that it would be cleaner to simply have a bool flag tracking if compression is active or not, something like doPageCompression, that could be fetched using GetFullPageWriteInfo. We could also directly track forcePageWrites and fullPageWrites, but that would make back-patching more difficult with not that much gain.

6) Not really a complaint, but note that this patch is using two bits that were unused up to now to store the compression status of a backup block. This is actually safe as long as the maximum page size is not higher than 32k, which is the limit authorized by --with-blocksize btw. I think that this deserves a comment at the top of the declaration of BkpBlock.

! unsigned    hole_offset:15, /* number of bytes before hole */
!             flags:2,        /* state of a backup block, see below */
!             hole_length:15; /* number of bytes in hole */

7) Some code in RestoreBackupBlock:

+ char *uncompressedPages;
+
+ uncompressedPages = (char *) palloc(XLR_TOTAL_BLCKSZ);
[...]
+ /* Check if blocks in WAL record are compressed */
+ if (bkpb.flag_compress == BKPBLOCKS_COMPRESSED)
+ {
+     /* Checks to see if decompression is successful is made inside the function */
+     pglz_decompress((PGLZ_Header *) blk, uncompressedPages);
+     blk = uncompressedPages;
+ }

uncompressedPages is palloc'd all the time, but you actually just need to do that when the block is compressed.

8) Arf, I don't like much the logic around CompressBackupBlocksPagesAlloc using a malloc to allocate once the space necessary for compressed and uncompressed pages. You are right not to do that inside a critical section, but PG tries to maximize the allocations to be palloc'd. Now it is true that if a palloc does not succeed, PG always ERROR's out (writer adding entry to TODO list)... Hence I think that using a static variable for those compressed and uncompressed pages makes more sense, and this simplifies greatly the patch as well.
9) Is avw_sighup_handler really necessary? What's wrong with allocating it all the time by default? This avoids some potential caveats in error handling as well as in value updates for full_page_writes.

So, note that I am not only complaining about the patch; I actually rewrote it as attached while reviewing, with additional minor cleanups and enhancements. I did as well a couple of tests like the script attached, compression numbers being more or less the same as with your previous patch, with some noise creating differences. I have done also some regression test runs with a standby replaying behind. I'll go through the patch once again a bit later, but feel free to comment.

Regards,
-- Michael

compress_test_2.sh
Description: Bourne shell script

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 47b1192..6756172 100644
--- a/doc/src/sgml/config.sgml
+++
Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback
On Fri, Oct 31, 2014 at 5:46 PM, furu...@pm.nttdata.co.jp wrote:

We seem to be going in circles. You suggested having two options, --feedback and --fsync, which is almost exactly what Furuya posted originally. I objected to that, because I think that user interface is too complicated. Instead, I suggested having just a single option called --synchronous, or even better, having no option at all: have the server tell the client if it's participating in synchronous replication, and have pg_receivexlog automatically fsync when it is, and not otherwise [1]. That way you don't need to expose any new options to the user. What did you think of that idea?

I think it's pretty weird to have the fsync behavior of the client controlled by the server. I also don't think that fsync() on the client side is useless in asynchronous replication. Yeah, it's true that there are no *guarantees* with asynchronous replication, but the bound on how long the data can take to get out to disk is a heck of a lot shorter if you fsync frequently than if you don't. And on the flip side, that has a performance impact. So I actually think the design you proposed is not as good as what was proposed by Furuya and Simon. But I don't feel incredibly strongly about it.

Thanks for lots of comments!! I fixed the patch. As a default, it behaves like a walreceiver: same as the walreceiver, it fsyncs and sends feedback after the fsync.

On second thought, flipping the default behavior seems not worthwhile here, as it might surprise the existing users and cause some troubles for them. I'd like to go back to Heikki's original suggestion of just adding a --synchronous option. That is, only when --synchronous is specified, WAL is flushed and feedback is sent back as soon as WAL is received. I changed the patch heavily in that way. Please find the attached patch. By default, pg_receivexlog flushes WAL data only when the WAL file is closed.
If --synchronous is specified, then, like the standby's WAL receiver, sync commands are issued as soon as there is WAL data which has not been flushed yet. Also, status packets are sent back to the server just after WAL data is flushed, whatever --status-interval is set to. I added a description of this behavior to the doc. Thoughts?

I think it's as you pointed out. Thank you for the patch. I did a review of the patch. There was no problem. I confirmed the following:

1. applied cleanly and compilation was without warnings and errors
2. all regression tests passed ok
3. when --synchronous is not specified, it does not fsync except when the WAL file has switched
4. it gets a normal backup when pg_basebackup is executed with '-X s'
5. when --synchronous is specified, it fsyncs every time it receives WAL data
6. when --synchronous is specified, in spite of the -s option, feedback is sent back when it fsyncs
7. when --slog is not specified, it wouldn't feed back the fsync location
8. no problem in the document

Regards,
-- Furuya Osamu

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] alter user/role CURRENT_USER
Hello, this is the new version (WIP v2). The first attachment is the patch and the second is a test SQL script.

- Behavior changes

Almost all syntax taking a role now accepts CURRENT_USER and SESSION_USER, and they are distinguished from the quoted identifiers "current_user" and "session_user". The exceptions are as follows:

- CREATE USER/GROUP roleid
- ALTER ROLE/GROUP/USER roleid RENAME TO newname

These forms still do not accept keywords like CURRENT_USER or special names like public at all, but accept the quoted identifier "current_user". The error message is changed as follows:

| postgres=# create user current_user;
| ERROR: role name should not be a keyword nor reserved name.
| LINE 1: create user current_user;
|                     ^

# Some other messages may have changed...

USER and CURRENT_ROLE haven't been extended to other syntax. The former still can be used only in CREATE/ALTER/DROP USER MAPPING, and the latter cannot be used outside of function expressions.

- Storage for the new information

The new struct NameId stores an identifier telling what it logically is, using the new enum NameIdTypes. This is still a bit hampered by the difference between CURRENT_USER and PUBLIC, but it now makes a distinction between CURRENT_USER and "current_user". A user oid does not have room for representing the difference among PUBLIC, NONE and 'not found' as the result of get_nameid_oid(), so the members of NameId are exposed in foreigncmds.c and it gets a bit uglier.

- Changes of related structs and grammar

The type of the role member is changed to NameId in some of the parser structs. AlterTableCmd.name has many other usages, so I added a new member NameId *newowner for exclusive usage. The type of the OWNER clause of CREATE TABLESPACE is changed to RoleId; I suppose there's no particular reason that the non-terminal was name. Usage of public and none had been blocked for CREATE/RENAME USER in user.c, but now it is blocked in gram.y.

- New function to resolve NameId

New function get_nameid_oid() is added.
It is an alternative to get_role_oid() which can handle CURRENT_USER and "current_user" properly. get_role_oid() is still used in many places having no direct relation to syntax.

- Others

No doc provided for now.

regards,

Adam Brightwell adam.brightw...@crunchydatasolutions.com writes:

FWIW, I found the following in the archives: http://www.postgresql.org/message-id/15516.1038718...@sss.pgh.pa.us Now this is from 2002 and it appears it wasn't necessary to change at the time, but I haven't yet found anything else related (it's a big archive). Though, as I understand it, PUBLIC is now non-reserved as of SQL:2011, which might make a compelling argument to leave it as is?

The current spec does list PUBLIC as a non-reserved keyword, but it also says (5.4 Names and identifiers, syntax rules):

20) No authorization identifier shall specify PUBLIC.

which, oddly enough, seems to license us to handle PUBLIC the way we are doing. OTOH, it lists CURRENT_USER as a reserved word, suggesting that they don't think the same type of hack should be used for that. I'd be inclined to leave the grammar as such alone (ie CURRENT_USER is a keyword, PUBLIC isn't). Changing that has more downside than upside, and we do have justification in the spec for treating the two cases differently. However, I agree that we should fix the subsequent processing so that "current_user" is not confused with CURRENT_USER.

Sure, maybe.

- PUBLIC should be left as it is, as a special identifier which cannot be used even if quoted under some syntax.
- CURRENT_USER should be a keyword as it is, but we shouldn't inhibit it from being used as an identifier if quoted. SESSION_USER and USER should be treated in the same way. We don't want to use something other than a string (prefixed by a zero byte) as a matter of course :). And resolving the name to a roleid instantly in gram.y is not allowed for the reason shown upthread. So it is necessary to add a new member to the struct, say int special_role; ...
Let me find a saner name for it :(

- USER and CURRENT_ROLE are not needed for syntax other than that which already uses them.

I will work on it this way. Let me know if something is not acceptable.

-- Kyotaro Horiguchi NTT Open Source Software Center

From f18d078d5e6e4005e803ecc954e59c899dbfd557 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi horiguchi.kyot...@lab.ntt.co.jp
Date: Mon, 10 Nov 2014 19:08:42 +0900
Subject: [PATCH] ALTER USER CURRENT_USER v2
---
 src/backend/catalog/aclchk.c       | 13 ++-
 src/backend/commands/alter.c       |  2 +-
 src/backend/commands/foreigncmds.c | 39 -
 src/backend/commands/schemacmds.c  | 26 +-
 src/backend/commands/tablecmds.c   |  2 +-
 src/backend/commands/tablespace.c  |  2 +-
 src/backend/commands/user.c        | 70 +++
 src/backend/nodes/copyfuncs.c      | 37 +---
 src/backend/nodes/equalfuncs.c     | 35 +---
Re: [HACKERS] [v9.5] Custom Plan API
On Sat, Nov 8, 2014 at 4:16 AM, Robert Haas robertmh...@gmail.com wrote:

On Mon, Oct 27, 2014 at 2:35 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

FYI, patch v12 part 2 no longer applies cleanly.

Thanks. I rebased the patch set according to the latest master branch. The attached v13 can be applied to the master.

I've committed parts 1 and 2 of this, without the documentation, and with some additional cleanup.

Few observations/questions related to this commit:

1.
@@ -5546,6 +5568,29 @@ get_variable(Var *var, int levelsup, bool istoplevel, deparse_context *context)
 		colinfo = deparse_columns_fetch(var->varno, dpns);
 		attnum = var->varattno;
 	}
+	else if (IS_SPECIAL_VARNO(var->varno) &&
+			 IsA(dpns->planstate, CustomScanState) &&
+			 (expr = GetSpecialCustomVar((CustomScanState *) dpns->planstate,
+										 var, &child_ps)) != NULL)
+	{
+		deparse_namespace save_dpns;
+
+		if (child_ps)
+			push_child_plan(dpns, child_ps, &save_dpns);
+		/*
+		 * Force parentheses because our caller probably assumed a Var is a
+		 * simple expression.
+		 */
+		if (!IsA(expr, Var))
+			appendStringInfoChar(buf, '(');
+		get_rule_expr((Node *) expr, context, true);
+		if (!IsA(expr, Var))
+			appendStringInfoChar(buf, ')');
+
+		if (child_ps)
+			pop_child_plan(dpns, &save_dpns);
+		return NULL;
+	}

a. It seems an Assert for netlevelsup is missing in this block.

Indeed, this if-block does not have an assertion, unlike the other special varnos.

b. The below comment in function get_variable can be improved w.r.t. the handling for CustomScanState. The comment indicates that if varno is OUTER_VAR or INNER_VAR or INDEX_VAR, it handles all of them similarly, which seems to be slightly changed for CustomScanState.

/*
 * Try to find the relevant RTE in this rtable. In a plan tree, it's
 * likely that varno is OUTER_VAR or INNER_VAR, in which case we must dig
 * down into the subplans, or INDEX_VAR, which is resolved similarly. Also
 * find the aliases previously assigned for this RTE.
 */

I added a small comment noting that only the extension knows the mapping between these special varnos and the underlying expressions; thus, it queries the provider for the expression tied to this special varnode. Does it make sense?

2.
+void
+register_custom_path_provider(CustomPathMethods *cpp_methods)
+{
+ ..
+}

Shouldn't there be an unregister function corresponding to the above register function?

Even though it is not difficult to implement, in what situation would unregistering make more sense than an enable__scan GUC parameter added by the extension itself? I initially thought a prepared statement with a custom-scan node would be problematic if the provider got unregistered/unloaded; however, internal_unload_library() actually does nothing. So it is at least harmless even if we implemented it.

Thanks,
-- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com

pgsql-v9.5-get_variable-smallfix.patch
Description: pgsql-v9.5-get_variable-smallfix.patch
Re: [HACKERS] postgres_fdw behaves oddly
Since two separate issues are being tackled here (1. using reltargetlist instead of attr_needed, and 2. system columns usage in FDWs), we should separate the patch into two, one for each of the issues.

While I agree that the system columns shouldn't be sent to the remote node, it doesn't look clear to me what they or their values would mean in the context of foreign data. Some columns like tableoid would have a value which is the OID of the foreign table; other system columns like xmin/xmax/ctid will have different meanings (or no meaning) depending upon the foreign data source. In the case of the latter columns, each FDW would have its own way of defining that meaning (I guess). But in any case, I agree that we shouldn't send the system columns to the remote side. Is there a way we can enforce this rule across all the FDWs? Or do we want to tackle it separately per FDW?

On Tue, Oct 14, 2014 at 8:05 AM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote:

(2014/10/14 8:53), Robert Haas wrote:

On Mon, Oct 13, 2014 at 3:30 PM, Bruce Momjian br...@momjian.us wrote: Uh, where are we on this patch? Probably it should be added to the upcoming CF.

Done. https://commitfest.postgresql.org/action/patch_view?id=1599

Thanks, Best regards, Etsuro Fujita

-- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
[HACKERS] Allow signal handlers to optionally use SA_SIGINFO information?
Hi,

During benchmarking/debugging I just had the problem that autovacuum was signalled at an insane rate, leading to more than one hundred autovac workers being started per second. That led to an overall slowdown of more than 90% and the anti-wraparound vacuum not finishing. The problem is that I couldn't easily figure out where all the SIGUSR2's to the autovacuum launcher were coming from. This isn't the first time that I have had that kind of problem.

POSIX provides information about the source of the signal when using SA_SIGINFO style handlers via si_code/si_pid. That information has been available for a *long* while (c.f. http://pubs.opengroup.org/onlinepubs/7908799/xsh/signal.h.html). I've now hacked up my development instance to log something like "autovacuum: invoked by pid 18175". I personally find that quite helpful. I can also imagine it being rather helpful to log information about the sender of SIGINT/TERM interrupts.

The existing abstractions are nearly sufficient to make it easy to optionally use SA_SIGINFO style handlers, just by redefining SIGNAL_ARGS and pqsigfunc. There unfortunately are two things making it harder: SIG_IGN and SIG_DFL. Those unfortunately can't be specified for SA_SIGINFO style handlers (as they have a different signature), so we'd need to use a different function for those two.

Comments, ideas?

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Fri, Nov 7, 2014 at 3:55 AM, Robert Haas robertmh...@gmail.com wrote:

On Fri, Oct 31, 2014 at 9:07 AM, Andres Freund and...@2ndquadrant.com wrote:

Maybe we can, as a first step, make those edges in the lock graph visible to the deadlock detector? It's pretty clear that undetected deadlocks aren't ok, but detectable deadlocks in a couple corner cases might be acceptable.

An interesting point came to mind this morning on this topic, while discussing this issue with Amit. Suppose process A1 holds a lock on some object and process A2, a member of the same locking group, waits for a lock on that same object. It follows that there must exist a process B which either *holds* or *awaits* a lock which conflicts with the one requested by A2; otherwise, A2 would have been granted the lock at once. If B *holds* the conflicting lock, it may eventually release it; but if B *awaits* the conflicting lock, we have a soft deadlock, because A2 waits for B, which waits for A1, which we presume to wait for A2.[1]

The logical consequence of this is that, if a process seeks to acquire a lock which is already held (in any mode) by a fellow group-member, it had better jump over the entire wait queue and acquire the lock immediately if no conflicting locks have already been granted. If it fails to do this, it immediately creates a soft deadlock. What if there *are* conflicting locks already granted? In that case, things are more complicated. We could, for example, have this situation: A1 holds AccessShareLock, B holds ExclusiveLock, C1 awaits ShareLock, and then (following it in the wait queue) A2 awaits RowExclusiveLock. This is not a deadlock, because B can finish, then C1 can get the lock, then C1 can finish, then A2 can get the lock. But if C2 now waits for some member of group A, then we have a soft deadlock between group A and group C, which can be resolved by moving A2's request ahead of C1.
(This example is relatively realistic, too: a lock upgrade from AccessShareLock to RowExclusiveLock is probably the most common type of lock upgrade, for reasons that are hopefully obvious.) In practice, I suspect it's best to take a more aggressive approach and have A2 jump straight to the head of the line right out of the chute. Letting some parallel workers make progress while holding up others out of a notion of locking fairness is probably stupid, because they're probably dependent on each other in some way that makes it useless for one to make progress while the others don't. For the same reason, I'd argue that we ought to wait until all pending lock requests from a given group can be satisfied before granting any of them.

I think this logic can sometimes block processes from acquiring a lock which no one is holding. Assume Group-1 (G-1) is waiting on two locks (lock L1 on table T1 and lock L2 on table T2) which are held by unrelated processes P-2 and P-3 respectively; now P-3 releases the lock on table T2 but cannot grant it to G-1, who is waiting on this lock, because lock L1 still cannot be granted to G-1. At this point the situation is that no one holds the lock on T2 and G-1 waits for it, yet any new process won't be able to acquire the lock on T2, as there is already a waiter for it in the wait queue. OTOH, if we had granted the lock on T2 to G-1, it might have finished its work and released the lock (considering it is a short-time lock on a catalog relation). If it is possible to have such a situation, then even though it is beneficial for certain cases to view several pending requests from the same group as a single request, at the same time it can be harmful for some unrelated processes as well.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Column/type dependency recording inconsistencies
On 09/11/14 23:36, Petr Jelinek wrote:

On 09/11/14 23:04, Tom Lane wrote:

I suspect this is actually a bug in the dependency search logic in DROP. Don't have the energy to chase it right now though.

Yes, that's where I started searching also, and yes it is. The actual situation is that the columns of the table that is about to be dropped are normally not added to the object list that is used for dependency checks (if the table is going to be dropped, who cares what a column depends on anyway). But this logic depends on the fact that the columns that belong to that table are processed after the table was already processed; however, as the new type was added after the table was added, it's processed before the table, and because of the dependency between the type and the columns, the columns are also processed before the table, so they are added to the object list for further dependency checking. See the use and source of object_address_present_add_flags in dependency.c.

So here is what I came up with to fix this. It's quite simple and works well enough; we don't really have any regression test that would hit this, though. I wonder if object_address_present_add_flags should be renamed, since it does a bit more than what the name suggests at this point. And I think this should be backpatched to all the versions with extension support.

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 256486c..e7ec7e2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -2142,6 +2142,31 @@ object_address_present_add_flags(const ObjectAddress *object,
 				 */
 				return true;
 			}
+			if (object->objectSubId == 0)
+			{
+				/*
+				 * If we got here, it means that we are dropping table whose
+				 * columns are already in the object list, since we assume
+				 * later on that there are no subobjects of object present
+				 * in the list, we should remove those.
+				 *
+				 * Note that this is very rare occasion that normally happens
+				 * only when dropping extension which had some of its table
+				 * columns ALTERed after CREATE with custom types or
+				 * collations.
+				 */
+				if (i < addrs->numrefs - 1)
+				{
+					ObjectAddressExtra *thisextra = addrs->extras + i;
+
+					memmove(thisobj, thisobj + 1,
+							(addrs->numrefs - 1 - i) * sizeof(ObjectAddress));
+					memmove(thisextra, thisextra + 1,
+							(addrs->numrefs - 1 - i) * sizeof(ObjectAddressExtra));
+				}
+				addrs->numrefs--;
+				continue;
+			}
 		}
 	}
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Mon, Nov 10, 2014 at 6:33 AM, Amit Kapila amit.kapil...@gmail.com wrote:

I think this logic can sometimes block processes from acquiring a lock which no one is holding. Assume Group-1 (G-1) is waiting on two locks (Lock L1 on Table T1 and Lock L2 on Table T2) which are held by unrelated processes P-2 and P-3 respectively; now P-3 releases the lock on table T-2 and cannot grant it to G-1 who is waiting on this lock because still L-1 lock cannot be granted to G-1.

That's not what I meant. I meant we shouldn't grant a lock *on a particular object* to any process in the group until all lock requests *for that object* can be granted. Waiting until every awaited lock in the system can be granted in all requested modes would be really hard to implement and probably have all kinds of problems (as you point out here).

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] psql tab completion: \c [ dbname [ username ] ]
On Sun, Nov 9, 2014 at 6:13 PM, Ian Barwick i...@2ndquadrant.com wrote:

Attached is a mighty trivial patch to extend psql tab completion for \c / \connect to generate a list of role names, as lack thereof was annoying me recently and I can't see any downside to doing this.

Committed, thanks.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] tracking commit timestamps
On Sun, Nov 9, 2014 at 8:41 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:

Robert Haas wrote: I think the key question here is the time for which the data needs to be retained. 2^32 of anything is a lot, but why keep around that number of records rather than more (after all, we have epochs to distinguish one use of a given txid from another) or fewer?

The problem is not how much data we retain; it's about how much data we can address.

I thought I was responding to a concern about disk space utilization.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] tracking commit timestamps
On 10/11/14 08:01, Anssi Kääriäinen wrote:

On Sun, 2014-11-09 at 11:57 -0500, Steve Singer wrote:

The reason why Jim and myself are asking for the LSN and not just the timestamp is that I want to be able to order the transactions. Jim pointed out earlier in the thread that just ordering on timestamp allows for multiple transactions with the same timestamp. Maybe we don't need the entire LSN to solve that. If you already have the commit timestamp maybe you only need another byte or two of granularity to order transactions that are within the same microsecond.

There is no guarantee that a commit with a later LSN has a later timestamp. There are cases where the clock could move significantly backwards. A robust solution to storing transaction commit information (including commit order) in a way that can be referenced from other tables, can be loaded to another cluster, and survives crashes would be a great feature. But this feature doesn't have those properties.

It has the property of surviving crashes. I am not sure what you mean by referencing from other tables? And about loading to another cluster: the txid does not really have any meaning on another cluster, so the info about it does not have any either? But anyway, this patch is targeting extensions, not DBAs; you could write an extension that provides that feature on top of this patch (although given what I wrote above I don't see how it's useful).

-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] tracking commit timestamps
On Mon, Nov 10, 2014 at 2:01 AM, Anssi Kääriäinen anssi.kaariai...@thl.fi wrote:

There is no guarantee that a commit with later LSN has a later timestamp. There are cases where the clock could move significantly backwards.

Good point.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] tracking commit timestamps
On 09/11/14 17:57, Steve Singer wrote:

On 11/07/2014 07:07 PM, Petr Jelinek wrote:

The list of what is useful might be long, but we can't have everything there as there are space constraints, and LSN is another 8 bytes and I still want to have some bytes for storing the origin or whatever you want to call it there, as that's the one I personally have the biggest use-case for. So this would be ~24 bytes per txid already; hmm, I wonder if we can pull some tricks to lower that a bit.

The reason why Jim and myself are asking for the LSN and not just the timestamp is that I want to be able to order the transactions. Jim pointed out earlier in the thread that just ordering on timestamp allows for multiple transactions with the same timestamp. Maybe we don't need the entire LSN to solve that. If you already have the commit timestamp maybe you only need another byte or two of granularity to order transactions that are within the same microsecond.

Hmm, maybe just one part of the LSN, but I don't really like that either; if we want to store the LSN we should probably store it as-is, as somebody might want to map it to the txid for other reasons. I did the calculation above wrong btw, it's actually 20 bytes not 24 bytes per record; I am inclined to just say we can live with that.

Since we agreed that the (B) case is not really feasible and we are doing (C), I also wonder if extradata should be renamed to nodeid (even if it's not used as a nodeid at this point). And then there is the question about its size, since the nodeid itself can live with 2 bytes probably (64k of nodes ought to be enough for everybody ;) ). Or leave extradata as-is but treat it as reserved space for future use and not expose it at this time on the SQL level at all? I guess Andres could answer what suits him better here.
-- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] tracking commit timestamps
Robert Haas wrote: On Sun, Nov 9, 2014 at 8:41 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Robert Haas wrote: I think the key question here is the time for which the data needs to be retained. 2^32 of anything is a lot, but why keep around that number of records rather than more (after all, we have epochs to distinguish one use of a given txid from another) or fewer? The problem is not how much data we retain; it is about how much data we can address. I thought I was responding to a concern about disk space utilization. Ah, right. So AFAIK we don't need to keep anything older than RecentXmin or something like that -- which is not too old. If I recall correctly Josh Berkus was saying in a thread about pg_multixact that it used about 128kB or so in <= 9.2 for his customers; that one was also limited to RecentXmin AFAIR. I think a similar volume of commit_ts data would be pretty acceptable. More so considering that it's turned off by default. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] psql tab completion: \c [ dbname [ username ] ]
On 10/11/14 22:20, Robert Haas wrote: On Sun, Nov 9, 2014 at 6:13 PM, Ian Barwick i...@2ndquadrant.com wrote: Attached is a mighty trivial patch to extend psql tab completion for \c / \connect to generate a list of role names, as lack thereof was annoying me recently and I can't see any downside to doing this. Committed, thanks. Many thanks! Regards Ian Barwick -- Ian Barwick http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_multixact not getting truncated
On Sat, Nov 8, 2014 at 3:10 PM, Josh Berkus j...@agliodbs.com wrote: I also think our defaults for multixact freezing should be tied to the ones for xid freezing, and should not by default be completely independent numbers; I'm still not convinced that it makes sense to have a separate multixact threshold at all **since the same amount of vacuuming needs to be done regardless of whether we're truncating xids or mxids**. That doesn't make any sense at all. For one thing, it's not as if there is only ONE threshold here. There are three different ones, controlling three different aspects of the behavior: (a) the age at which we begin trying to freeze the pages we are planning to vacuum anyway, (b) the age at which we force a vacuum that we're planning to do anyway to scan the entire table, and (c) the age at which we trigger an autovacuum that we weren't otherwise planning to do. Generally, the reason why I think we need different thresholds for XIDs and MXIDs is that they may be getting consumed at vastly different rates. It may be useful to have a light that comes on in your car when you only have one gallon of fuel left, but you'd want a different threshold for an airplane because it burns fuel at a different rate. If you made that light come on when there's a gallon of fuel left, it would be way too late to do any good. I think a big part of the tuning problem here is that we don't have any way of knowing what the real burn rates will be in a particular customer environment. If you're burning MXIDs very slowly, you probably want to set threshold (a), the age at which we begin freezing pages we are planning to vacuum anyway, quite low, so that the next full-table vacuum triggered by XID consumption freezes all the MXIDs also, and advances relminmxid, thus preventing freezing passes specifically for MXIDs from ever happening.
But if the MXID consumption rate is very high, that may result in unnecessary I/O freezing tuples that would have been updated before MXID age became an issue anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] tracking commit timestamps
On Mon, Nov 10, 2014 at 8:39 AM, Petr Jelinek p...@2ndquadrant.com wrote: I did the calculation above wrong btw, it's actually 20 bytes not 24 bytes per record, I am inclined to just say we can live with that. If you do it as 20 bytes, you'll have to do some work to squeeze out the alignment padding. I'm inclined to think it's fine to have a few extra padding bytes here; someone might want to use those for something in the future, and they probably don't cost much. Since we agreed that the (B) case is not really feasible and we are doing the (C), I also wonder if extradata should be renamed to nodeid (even if it's not used at this point as nodeid). And then there is question about the size of it, since the nodeid itself can live with 2 bytes probably (64k of nodes ought to be enough for everybody ;) ). Or leave the extradata as is but use as reserved space for future use and not expose it at this time on SQL level at all? I vote for calling it node-ID, and for allowing at least 4 bytes for it. Penny wise, pound foolish. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On Mon, Nov 10, 2014 at 8:40 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Ah, right. So AFAIK we don't need to keep anything older than RecentXmin or something like that -- which is not too old. If I recall correctly Josh Berkus was saying in a thread about pg_multixact that it used about 128kB or so in <= 9.2 for his customers; that one was also limited to RecentXmin AFAIR. I think a similar volume of commit_ts data would be pretty acceptable. More so considering that it's turned off by default. I'm not sure whether keeping it just back to RecentXmin will be enough for everybody's needs. But we certainly don't need to keep the last 2^32 records as someone-or-other was suggesting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Allow signal handlers to optionally use SA_SIGINFO information?
Andres Freund and...@2ndquadrant.com writes: Posix provides information about the source of the signal when using SA_SIGINFO style handlers via si_code/si_pid. That information has been available for a *long* while (c.f. http://pubs.opengroup.org/onlinepubs/7908799/xsh/signal.h.html). That does not mean that it exists on all the platforms we support; Windows in particular is likely to be an issue. I concur with Heikki that you're probably solving this at the wrong end anyway, since the next question you'll be asking is *why* did process X send that signal. regards, tom lane
[HACKERS] Fwd: How parser works and to add keyword to keywords.h
Hello, Can anyone explain how the parser works, how to add a new keyword to keywords.h, and how that keyword can then be used? How does the whole process work? -- Pankaj B.
Re: [HACKERS] row_to_json bug with index only scans: empty keys!
I wrote: Attached are patches meant for HEAD and 9.2-9.4 respectively. BTW, has anyone got an opinion about whether to stick the full fix into 9.4? The argument for, of course, is that we'd get the full fix out to the public a year sooner. The argument against is that someone might have already done compatibility testing of their application against 9.4 betas, and then get blindsided if we change this before release. I'm inclined to think that since we expect json/jsonb usage to really take off with 9.4, it might be better if we get row_to_json behaving unsurprisingly now. But I'm not dead set on it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] row_to_json bug with index only scans: empty keys!
On Mon, Nov 10, 2014 at 10:11 AM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: Attached are patches meant for HEAD and 9.2-9.4 respectively. BTW, has anyone got an opinion about whether to stick the full fix into 9.4? The argument for, of course, is that we'd get the full fix out to the public a year sooner. The argument against is that someone might have already done compatibility testing of their application against 9.4 betas, and then get blindsided if we change this before release. I'm inclined to think that since we expect json/jsonb usage to really take off with 9.4, it might be better if we get row_to_json behaving unsurprisingly now. But I'm not dead set on it. I think we should put the full fix in 9.4. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] remove pg_standby?
On Tue, Nov 4, 2014 at 10:36 PM, Peter Eisentraut pete...@gmx.net wrote: While we're talking about removing old things, is there any use left for pg_standby? Was the original reason to keep it around anything other than backwards compatibility? If not, then it was backwards compatibility with a version that's not even supported anymore - so strong +1 for getting rid of it. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fwd: How parser works and to add keyword to keywords.h
Pankaj Bagul bagul@gmail.com writes: Can anyone explain how the parser works, how to add a new keyword to keywords.h, and how that keyword can then be used? How does the whole process work? The minimum requirements are as stated in gram.y: add the keyword to kwlist.h, to the %token keyword list near the head of gram.y, and to the appropriate one of the keyword category lists near the bottom of gram.y. Presumably you want to use this keyword in some production or other, so you'll also need to add or modify parsetree node struct types to represent the output from the production. Best bet really is to identify some existing SQL language feature that is somewhat related to what you want to do, and then grep the PG source code for all mentions of the related parse node type(s). There's usually a pretty fair amount of boilerplate support code that you'll need to add or modify, but copy and paste is your friend for that. regards, tom lane
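To make the recipe above concrete, here is roughly what the first steps look like for a hypothetical unreserved keyword FOO (FOO and the elided `...` context are purely illustrative, not from any real patch; gram.y's own comments are the authoritative guide):

```
/* src/include/parser/kwlist.h -- keep the list alphabetical */
PG_KEYWORD("foo", FOO, UNRESERVED_KEYWORD)

/* gram.y: add FOO to the %token <keyword> list near the head... */
%token <keyword> ... FOO ...

/* ...and to the matching category list near the bottom */
unreserved_keyword:
      ...
    | FOO
    | ...
```

After that, FOO can appear in whatever grammar production you are adding, whose action builds the new parse-node struct Tom mentions.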
Re: [HACKERS] row_to_json bug with index only scans: empty keys!
Tom Lane wrote: I wrote: Attached are patches meant for HEAD and 9.2-9.4 respectively. BTW, has anyone got an opinion about whether to stick the full fix into 9.4? The argument for, of course, is that we'd get the full fix out to the public a year sooner. The argument against is that someone might have already done compatibility testing of their application against 9.4 betas, and then get blindsided if we change this before release. I'm inclined to think that since we expect json/jsonb usage to really take off with 9.4, it might be better if we get row_to_json behaving unsurprisingly now. But I'm not dead set on it. I am not one of those people who would be blindsided, but I'd much rather inconvenience our beta testers in order to get a better product in place for the masses. Beta testers read release notes and should be able to identify and fix any problematic areas of their code while simultaneously being happy for the improvements. David J.
Re: [HACKERS] row_to_json bug with index only scans: empty keys!
Yes, it's late beta, but especially if we're pushing json/jsonb usage as a major feature of this release, I'd say remove surprising cases now. It's late beta, but that's what beta is for, the last chance for bug fixes, before we live w/ it for the support life of the release. The affected class is people with an app that already uses json, so 9.2 or better, who have run acceptance tests against an early beta of 9.4, and have SQL that returns made-up field names instead of the appropriate aliases? Which they were probably annoyed at when they wrote the code that consumes that output in the first place. I think an item in the list of compatibility changes/gotchas should catch anyone who is that on the ball as to be testing the betas anyway. Anyone pushing that envelope is going to do the same acceptance testing against the actual release before rolling to production. Ross On Mon, Nov 10, 2014 at 10:11:25AM -0500, Tom Lane wrote: I wrote: Attached are patches meant for HEAD and 9.2-9.4 respectively. BTW, has anyone got an opinion about whether to stick the full fix into 9.4? The argument for, of course, is that we'd get the full fix out to the public a year sooner. The argument against is that someone might have already done compatibility testing of their application against 9.4 betas, and then get blindsided if we change this before release. I'm inclined to think that since we expect json/jsonb usage to really take off with 9.4, it might be better if we get row_to_json behaving unsurprisingly now. But I'm not dead set on it. regards, tom lane -- Ross Reedstrom, Ph.D. reeds...@rice.edu Systems Engineer Admin, Research Scientist phone: 713-348-6166 Connexions http://cnx.org fax: 713-348-3665 Rice University MS-375, Houston, TX 77005 GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE
Re: [HACKERS] row_to_json bug with index only scans: empty keys!
On 11/10/2014 10:19 AM, Robert Haas wrote: On Mon, Nov 10, 2014 at 10:11 AM, Tom Lane t...@sss.pgh.pa.us wrote: I wrote: Attached are patches meant for HEAD and 9.2-9.4 respectively. BTW, has anyone got an opinion about whether to stick the full fix into 9.4? The argument for, of course, is that we'd get the full fix out to the public a year sooner. The argument against is that someone might have already done compatibility testing of their application against 9.4 betas, and then get blindsided if we change this before release. I'm inclined to think that since we expect json/jsonb usage to really take off with 9.4, it might be better if we get row_to_json behaving unsurprisingly now. But I'm not dead set on it. I think we should put the full fix in 9.4. +1 cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Compiler warning in master branch
David Rowley dgrowle...@gmail.com wrote: On Mon, Nov 10, 2014 at 4:31 PM, Peter Geoghegan p...@heroku.com wrote: I see this when I build at -O2 from master's tip: brin_revmap.c: In function ‘brinRevmapExtend’: brin_revmap.c:113:14: error: variable ‘mapBlk’ set but not used [-Werror=unused-but-set-variable] BlockNumber mapBlk; ^ It would appear just to need the attached. Confirmed and pushed. Thanks to both of you! -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] using custom scan nodes to prototype parallel sequential scan
On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs si...@2ndquadrant.com wrote: Something usable, with severe restrictions, is actually better than we have now. I understand the journey this work represents, so don't be embarrassed by submitting things with heuristics and good-enoughs in it. Our mentor, Mr. Lane, achieved much by spreading work over many releases, leaving others to join in the task. It occurs to me that, now that the custom-scan stuff is committed, it wouldn't be that hard to use that, plus the other infrastructure we already have, to write a prototype of parallel sequential scan. Given where we are with the infrastructure, there would be a number of unhandled problems, such as deadlock detection (needs group locking or similar), assessment of quals as to parallel-safety (needs proisparallel or similar), general waterproofing to make sure that pushing down a qual we shouldn't doesn't do anything really dastardly like crash the server (another written but yet-to-be-published patch adds a bunch of relevant guards), and snapshot sharing (likewise). But if you don't do anything weird, it should basically work. I think this would be useful for a couple of reasons. First, it would be a demonstrable show of progress, illustrating how close we are to actually having something you can really deploy. Second, we could use it to demonstrate how the remaining infrastructure patches close up gaps in the initial prototype. Third, it would let us start doing real performance testing. It seems pretty clear that a parallel sequential scan of data that's in memory (whether the page cache or the OS cache) can be accelerated by having multiple processes scan it in parallel. But it's much less clear what will happen when the data is being read in from disk. Does parallelism help at all? What degree of parallelism helps? Do we break OS readahead so badly that performance actually regresses?
These are things that are likely to need a fair amount of tuning before this is ready for prime time, so being able to start experimenting with them in advance of all of the infrastructure being completely ready seems like it might help. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Teaching pg_dump to use NOT VALID constraints
Magnus and I discussed the need for pg_dump to offer the use of NOT VALID constraints. Here's the patch. pg_dump --no-revalidation will add NOT VALID onto the recreation SQL for any FKs, but only for ones that were already known to be valid. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services pg_dump_revalidation.v1.patch
Re: [HACKERS] SSL information view
On Mon, Jul 21, 2014 at 5:24 PM, Bernd Helmle maili...@oopsware.de wrote: --On 12. Juli 2014 15:08:01 +0200 Magnus Hagander mag...@hagander.net wrote: Before doing that, however, I'd like to ask for opinions :) The hack currently exposes a separate view that you can join to pg_stat_activity (or pg_stat_replication) on the pid -- this is sort of the same way that pg_stat_replication works in the first place. Do we want something similar to that for a builtin SSL view as well, or do we want to include the fields directly in pg_stat_activity and pg_stat_replication? I've heard more than once the wish to get this information without contrib... especially for the SSL version used (client and server likewise). So ++1 for this feature. I'd vote for a special view; that will keep the information in a single place and someone can easily join extra information together. Here's a patch that implements that. Docs are currently not included because I'm waiting for the restructuring of that section to be done (started by me in a separate thread) first, but the code is there for review. Right now it just truncates the dn at NAMEDATALEN - so treating it the same as we do with hostnames. My guess is this is not a big problem because in the case of long DNs, most of the time the important stuff is at the beginning anyway... (And it's not like it's actually used for authentication, in which case it would of course be a problem).
-- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/

*** a/src/backend/catalog/system_views.sql
--- b/src/backend/catalog/system_views.sql
***************
*** 648,653 **** CREATE VIEW pg_stat_replication AS
--- 648,664 ----
          WHERE S.usesysid = U.oid AND S.pid = W.pid;

+ CREATE VIEW pg_stat_ssl AS
+     SELECT
+             I.pid,
+             I.ssl,
+             I.bits,
+             I.compression,
+             I.version,
+             I.cipher,
+             I.clientdn
+     FROM pg_stat_get_sslstatus() AS I;
+
  CREATE VIEW pg_replication_slots AS
      SELECT
          L.slot_name,

*** a/src/backend/libpq/be-secure-openssl.c
--- b/src/backend/libpq/be-secure-openssl.c
***************
*** 88,93 ****
  static void info_cb(const SSL *ssl, int type, int args);
--- 88,95 ----
  static void initialize_ecdh(void);
  static const char *SSLerrmessage(void);

+ static char *X509_NAME_to_cstring(X509_NAME *name);
+
  /* are we in the middle of a renegotiation? */
  static bool in_ssl_renegotiation = false;
***************
*** 1053,1055 **** SSLerrmessage(void)
--- 1055,1159 ----
      snprintf(errbuf, sizeof(errbuf), _("SSL error code %lu"), errcode);
      return errbuf;
  }
+
+ /*
+  * Return information about the SSL connection
+  */
+ int
+ be_tls_get_cipher_bits(Port *port)
+ {
+     int         bits;
+
+     if (port->ssl)
+     {
+         SSL_get_cipher_bits(port->ssl, &bits);
+         return bits;
+     }
+     else
+         return 0;
+ }
+
+ bool
+ be_tls_get_compression(Port *port)
+ {
+     if (port->ssl)
+         return (SSL_get_current_compression(port->ssl) != NULL);
+     else
+         return false;
+ }
+
+ void
+ be_tls_get_version(Port *port, char *ptr, size_t len)
+ {
+     if (port->ssl)
+         strlcpy(ptr, SSL_get_version(port->ssl), len);
+     else
+         ptr[0] = '\0';
+ }
+
+ void
+ be_tls_get_cipher(Port *port, char *ptr, size_t len)
+ {
+     if (port->ssl)
+         strlcpy(ptr, SSL_get_cipher(port->ssl), NAMEDATALEN);
+     else
+         ptr[0] = '\0';
+ }
+
+ void
+ be_tls_get_peerdn_name(Port *port, char *ptr, size_t len)
+ {
+     if (port->peer)
+         strlcpy(ptr, X509_NAME_to_cstring(X509_get_subject_name(port->peer)), NAMEDATALEN);
+     else
+         ptr[0] = '\0';
+ }
+
+ /*
+  * Convert an X509 subject name to a cstring.
+  */
+ static char *
+ X509_NAME_to_cstring(X509_NAME *name)
+ {
+     BIO        *membuf = BIO_new(BIO_s_mem());
+     int         i,
+                 nid,
+                 count = X509_NAME_entry_count(name);
+     X509_NAME_ENTRY *e;
+     ASN1_STRING *v;
+     const char *field_name;
+     size_t      size;
+     char        nullterm;
+     char       *sp;
+     char       *dp;
+     char       *result;
+
+     (void) BIO_set_close(membuf, BIO_CLOSE);
+     for (i = 0; i < count; i++)
+     {
+         e = X509_NAME_get_entry(name, i);
+         nid = OBJ_obj2nid(X509_NAME_ENTRY_get_object(e));
+         v = X509_NAME_ENTRY_get_data(e);
+         field_name = OBJ_nid2sn(nid);
+         if (!field_name)
+             field_name = OBJ_nid2ln(nid);
+         BIO_printf(membuf, "/%s=", field_name);
+         ASN1_STRING_print_ex(membuf, v,
+                              ((ASN1_STRFLGS_RFC2253 & ~ASN1_STRFLGS_ESC_MSB)
+                               | ASN1_STRFLGS_UTF8_CONVERT));
+     }
+
+     /* ensure null termination of the BIO's content */
+     nullterm = '\0';
+     BIO_write(membuf, &nullterm, 1);
+     size = BIO_get_mem_data(membuf, &sp);
+     dp = pg_any_to_server(sp, size - 1, PG_UTF8);
+
+     result = pstrdup(dp);
+     if (dp != sp)
+         pfree(dp);
+     BIO_free(membuf);
+
+     return result;
+ }

*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***************
*** 2386,2391 **** pgstat_fetch_global(void)
--- 2386,2394
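As suggested upthread, the separate view is meant to be joined against pg_stat_activity on pid; something along these lines (column names taken from the patch's system_views.sql hunk; the query itself is an illustration, not part of the patch):

```sql
SELECT a.pid, a.usename, s.version, s.cipher, s.bits, s.clientdn
  FROM pg_stat_activity a
  JOIN pg_stat_ssl s USING (pid)
 WHERE s.ssl;
```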
Re: [HACKERS] tracking commit timestamps
On 11/10/2014 08:39 AM, Petr Jelinek wrote: On 09/11/14 17:57, Steve Singer wrote: On 11/07/2014 07:07 PM, Petr Jelinek wrote: The list of what is useful might be long, but we can't have everything there as there are space constraints, and LSN is another 8 bytes and I still want to have some bytes for storing the origin or whatever you want to call it there, as that's the one I personally have biggest use-case for. So this would be ~24bytes per txid already, hmm I wonder if we can pull some tricks to lower that a bit. The reason why Jim and myself are asking for the LSN and not just the timestamp is that I want to be able to order the transactions. Jim pointed out earlier in the thread that just ordering on timestamp allows for multiple transactions with the same timestamp. Maybe we don't need the entire LSN to solve that. If you already have the commit timestamp maybe you only need another byte or two of granularity to order transactions that are within the same microsecond. Hmm maybe just one part of LSN, but I don't really like that either, if we want to store LSN we should probably store it as is as somebody might want to map it to txid for other reasons. I did the calculation above wrong btw, it's actually 20 bytes not 24 bytes per record, I am inclined to just say we can live with that. Since we agreed that the (B) case is not really feasible and we are doing the (C), I also wonder if extradata should be renamed to nodeid (even if it's not used at this point as nodeid). And then there is question about the size of it, since the nodeid itself can live with 2 bytes probably (64k of nodes ought to be enough for everybody ;) ). Or leave the extradata as is but use as reserved space for future use and not expose it at this time on SQL level at all? I guess Andres could answer what suits him better here. I am happy with renaming extradata to nodeid and not exposing it at this time. 
If we feel that commit-order (ie LSN or something equivalent) is really a different patch/feature than commit-timestamp then I am okay with that also but we should make sure to warn users of the commit-timestamp in the documentation that two transactions might have the same timestamp and that the commit order might not be the same as ordering by the commit timestamp. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 11/7/14, 8:21 PM, Robert Haas wrote: On Thu, Nov 6, 2014 at 8:03 PM, Jim Nasby jim.na...@bluetreble.com wrote: The problem right now is there's no way to actually obtain evidence that this is (or isn't) something to worry about, because we just silently skip pages. If we had any kind of tracking on this we could stop guessing. :( I could see logging it, but I agree with Andres and Alvaro that the odds are strongly against there being any actual problem here. I'm fine with that. Any other objections? Andres? -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
So, summarizing the state of this patch set:

- Patches 1, 2, 3, and 5 are committed.

- Patch 4 has had some review and really, as far as I can tell, the only major issue that has cropped up is the possibility that the GUC settings that are legal in backend A aren't legal in backend B for some reason; in that case, restoring the GUC settings there will probably fail. While this is a legitimate concern, I think what it really boils down to is that there's a possibility that there are libraries loaded in the user backend that are not loaded in the postmaster and therefore won't get loaded into the background worker. That's an issue to which we may need to engineer some solution, but not in this patch. Overall, the patch's architecture is modeled closely on the way we synchronize GUCs to new backends when using EXEC_BACKEND, and I don't think we're going to do any better than starting with that and working to file down the rough edges as we go. So I'd like to go ahead and commit this.

- Patch 6, pg_background itself, has a number of outstanding issues. Some people aren't sure it's useful enough to really be worth bothering with; other people seem quite excited about it, even to the point of wanting to push it all the way into core. Whatever we ultimately decide to do there, the patch as it stands today is clearly not ready for commit. The issues I know of are:
-- I still haven't written documentation for it.
-- There's some code duplication with exec_simple_query() that doesn't feel great, but it's not clear that there's a superior alternative. I think we certainly don't want to complicate exec_simple_query() to make pg_background happy.
-- We should add REVOKE to the SQL script that installs it so that non-superusers can't use it unless the superuser decides to grant them rights.
-- It doesn't handle types without send/receive functions. I thought that was tolerable since the types without such functions seem like mostly edge cases, but Andres doesn't like it.
Apparently, BDR makes provisions to either ship the tuple as one big blob - if all the types are built-in - or else use binary format where supported and text format otherwise. Since this issue is common to BDR, parallelism, and pg_background, we might want to introduce some common infrastructure for it, but it's not too clear to me right now what that should look like.
-- Evil things can happen if the background process changes client_encoding.
-- Even if I fix all that stuff, it's not entirely clear that there's a consensus in favor of the patch at all.
I certainly think that as far as this CommitFest is concerned, patch 6 ought to be considered Returned with Feedback. Many thanks to Andres and all for reviewing. What I am less certain about is whether it makes sense to spend the energy fixing the above issues in the short term. I wrote pg_background mostly as a demonstration that the rest of these patches work and can be used to accomplish useful stuff, whether as part of parallelism or otherwise; and there's a lot of work left to be done on parallelism proper that won't necessarily benefit from polishing pg_background first. So, questions: 1. Any other opinions for or against pg_background as a concept? I thought the ability to kick off vacuums (or other stuff with PreventTransactionChain) asynchronously was pretty cool, as well as the ability to use it as an authentication-free loopback dblink. But opinions obviously vary. 2. Is anyone sufficiently interested in pg_background as a concept that they'd be willing to take over the patch and work on the TODO list mentioned above? Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] remove pg_standby?
On Wed, Nov 5, 2014 at 6:36 AM, Peter Eisentraut pete...@gmx.net wrote: While we're talking about removing old things, is there any use left for pg_standby? -1 for removing it. There is still the case where I'd like to use log-shipping rather than replication. For example, it's the case where I need to compress WAL files before shipping them over a very thin network. We can set up log-shipping using standby_mode and without using pg_standby, but it keeps reporting restore_command failures while there is no WAL activity, which is bothersome. So I still need pg_standby for log-shipping. Regards, -- Fujii Masao
Re: [HACKERS] remove pg_standby?
On Nov 10, 2014 6:16 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Nov 5, 2014 at 6:36 AM, Peter Eisentraut pete...@gmx.net wrote: While we're talking about removing old things, is there any use left for pg_standby? -1 for removing it. There is still the case where I'd like to use log-shipping rather than replication. For example, it's the case where I need to compress WAL files before shipping them over a very thin network. We can set up log-shipping using standby_mode and without using pg_standby, but it keeps emitting the failure of restore_command while there is no WAL activity, which is bothersome. So I still need pg_standby for log-shipping. I didn't realize that part, but maybe we should fix that instead of keeping pg_standby around? (BTW, you can use streaming with compression as well using ssl of course, but it won't get quite the same levels due to smaller block sizes. And there are certainly still reasons for file-based standbys so we should definitely not remove that) /Magnus
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
Jim Nasby wrote: On 11/7/14, 8:21 PM, Robert Haas wrote: On Thu, Nov 6, 2014 at 8:03 PM, Jim Nasby jim.na...@bluetreble.com wrote: The problem right now is there's no way to actually obtain evidence that this is (or isn't) something to worry about, because we just silently skip pages. If we had any kind of tracking on this we could stop guessing. :( I could see logging it, but I agree with Andres and Alvaro that the odds are strongly against there being any actual problem here. I'm fine with that. Any other objections? Andres? If what we want is to quantify the extent of the issue, would it be more convenient to save counters to pgstat? Vacuum already sends pgstat messages, so there's no additional traffic there. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
On Mon, Nov 10, 2014 at 4:15 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: (2014/11/06 23:38), Fujii Masao wrote: On Tue, Nov 4, 2014 at 12:04 PM, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote: IIUC, I think that min = 0 disables fast update, so ISTM that it'd be appropriate to set min to some positive value. And ISTM that the idea of using the min value of work_mem is not so bad. OK. I changed the min value to 64kB. *** 356,361 CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ replaceable class=parametername/ --- 356,372 /listitem /varlistentry /variablelist +variablelist +varlistentry + termliteralPENDING_LIST_CLEANUP_SIZE//term The above is still in uppercase. Fixed. Attached is the updated version of the patch. Thanks for the review! Thanks for updating the patch! The patch looks good to me except for the following point: Thanks for the review again! *** a/src/backend/access/gin/ginfast.c --- b/src/backend/access/gin/ginfast.c *** *** 25,30 --- 25,32 #include utils/memutils.h #include utils/rel.h + /* GUC parameter */ + int pending_list_cleanup_size = 0; I think we need to initialize the GUC to boot_val, 4096 in this case. No, IIUC basically the variable for a GUC doesn't need to be initialized to its default value. OTOH, it's also harmless to initialize it to the default. I like the current code a bit because we don't need to change the initial value again when we decide to change the default value of the GUC. I have no strong opinion about this, though. Regards, -- Fujii Masao
Re: [HACKERS] Teaching pg_dump to use NOT VALID constraints
Simon Riggs wrote: Magnus and I discussed the need for pg_dump to offer the use of NOT VALID constraints. Here's the patch. pg_dump --no-revalidation will add NOT VALID onto the recreation SQL for any FKs, but only for ones that were already known to be valid. Well. Constraints that haven't been validated already have a NOT VALID emitted by ruleutils.c, yes? So what this patch does is add such a clause for all *other* constraints. Right? In other words what it aims to do is speed up loading of data by skipping the validation step on restore. Is that right? ISTM we could have the default pg_dump behavior emit NOT VALID constraints, and add VALIDATE CONSTRAINT commands at the end; that way the database is usable sooner but the constraints end up marked as validated by the time the dump is finished. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] remove pg_standby?
On Tue, Nov 11, 2014 at 2:19 AM, Magnus Hagander mag...@hagander.net wrote: On Nov 10, 2014 6:16 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Nov 5, 2014 at 6:36 AM, Peter Eisentraut pete...@gmx.net wrote: While we're talking about removing old things, is there any use left for pg_standby? -1 for removing it. There is still the case where I'd like to use log-shipping rather than replication. For example, it's the case where I need to compress WAL files before shipping them over a very thin network. We can set up log-shipping using standby_mode and without using pg_standby, but it keeps emitting the failure of restore_command while there is no WAL activity, which is bothersome. So I still need pg_standby for log-shipping. I didn't realize that part, The log-shipping standby using standby_mode tries to execute the restore_command to restore a new WAL file, but it fails and a message noting the failure is logged if there is no new WAL file. Then the standby tries to execute the same restore_command after five seconds. But it fails and the message is logged again. This happens continuously while no new WAL file is appearing in the standby's archival area. but maybe we should fix that instead of keeping pg_standby around? Yes. Even if we do that, we should announce that pg_standby will be removed and wait for several releases before removing it? (BTW, you can use streaming with compression as well using ssl of course, but it won't get quite the same levels due to smaller block sizes. And there are certainly still reasons for file-based standbys so we should definitely not remove that) Yes. Regards, -- Fujii Masao
Re: [HACKERS] remove pg_standby?
On 11/04/2014 01:36 PM, Peter Eisentraut wrote: While we're talking about removing old things, is there any use left for pg_standby? -1. A lot of people, a lot of customers use log shipping for various creative and business-requirement setups. JD -- Command Prompt, Inc. - http://www.commandprompt.com/ 503-667-4564 PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, @cmdpromptinc If we send our children to Caesar for their education, we should not be surprised when they come back as Romans.
Re: [HACKERS] Teaching pg_dump to use NOT VALID constraints
On 10 November 2014 17:33, Alvaro Herrera alvhe...@2ndquadrant.com wrote: pg_dump --no-revalidation will add NOT VALID onto the recreation SQL for any FKs, but only for ones that were already known to be valid. Well. Constraints that haven't been validated already have a NOT VALID emitted by ruleutils.c, yes? So what this patch does is add such a clause for all *other* constraints. Right? In other words what it aims to do is speed up loading of data by skipping the validation step on restore. Is that right? Correct. CHECK constraints are added onto the main table so they validate at load. ISTM we could have the default pg_dump behavior emit NOT VALID constraints, and add VALIDATE CONSTRAINT commands at the end; that way the database is usable sooner but the constraints end up marked as validated by the time the dump is finished. Yes, that may be an even better idea. We'd still want the --no-revalidation option, AFAICS. FKs are already at the end. Perhaps we should add another validation section? I like the idea, just not sure how long it would take. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 11/10/14, 11:28 AM, Alvaro Herrera wrote: Jim Nasby wrote: On 11/7/14, 8:21 PM, Robert Haas wrote: On Thu, Nov 6, 2014 at 8:03 PM, Jim Nasby jim.na...@bluetreble.com wrote: The problem right now is there's no way to actually obtain evidence that this is (or isn't) something to worry about, because we just silently skip pages. If we had any kind of tracking on this we could stop guessing. :( I could see logging it, but I agree with Andres and Alvaro that the odds are strongly against there being any actual problem here. I'm fine with that. Any other objections? Andres? If what we want is to quantify the extent of the issue, would it be more convenient to save counters to pgstat? Vacuum already sends pgstat messages, so there's no additional traffic there. IMHO that would be ideal, but I think Tom was leery of using more space for every table. If we go this route, I'm guessing we should only log pages we skip, and not log pages we had to wait for the lock on (in the case of a freeze). Also, should we still ereport this even if we are putting it in stats? Is there a way to avoid duplicating the entire ereport call? I see I could call errstart and friends manually, but currently that's only done in elog.c. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Allow signal handlers to optionally use SA_SIGINFO information?
On 2014-11-10 09:50:17 -0500, Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: Posix provides information about the source of the signal when using SA_SIGINFO style handlers via si_code/si_pid. That information has been available for a *long* while (c.f. http://pubs.opengroup.org/onlinepubs/7908799/xsh/signal.h.html). That does not mean that it exists on all the platforms we support; Windows in particular is likely to be an issue. Given that we essentially use a named pipe to transport signals, it shouldn't be too hard if we decide we need it. I concur with Heikki that you're probably solving this at the wrong end anyway, since the next question you'll be asking is *why* did process X send that signal. I agree that that's the most useful thing for any specific situation fully under our control. But generally I've more than once found it useful to have the sender of a signal available - even if it's just during debugging. I also previously would have liked to know where INT/TERM were coming from - having the sender's PID can be useful for diagnosis. If we add it, it's surely nothing we are ever going to rely on - as you say, there are portability issues. But printing it optionally in a few places at appropriate debug levels + having it available in signal handlers during debugging sounds useful enough to me anyway. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 2014-11-10 14:28:30 -0300, Alvaro Herrera wrote: Jim Nasby wrote: On 11/7/14, 8:21 PM, Robert Haas wrote: On Thu, Nov 6, 2014 at 8:03 PM, Jim Nasby jim.na...@bluetreble.com wrote: The problem right now is there's no way to actually obtain evidence that this is (or isn't) something to worry about, because we just silently skip pages. If we had any kind of tracking on this we could stop guessing. :( I could see logging it, but I agree with Andres and Alvaro that the odds are strongly against there being any actual problem here. I'm fine with that. Any other objections? Andres? If you feel that strongly about the need for logging, go ahead. If what we want is to quantify the extent of the issue, would it be more convenient to save counters to pgstat? Vacuum already sends pgstat messages, so there's no additional traffic there. I'm pretty strongly against that one in isolation. They'd need to be stored somewhere and they'd need to be queryable somewhere with enough context to make sense. To actually make sense of the numbers we'd also need to report all the other datapoints of vacuum in some form. That's quite a worthwhile project imo - but *much* *much* more work than this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] REINDEX CONCURRENTLY 2.0
On Wed, Nov 5, 2014 at 8:49 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Oct 30, 2014 at 5:19 PM, Michael Paquier michael.paqu...@gmail.com wrote: Updated patch is attached. Please find attached an updated patch with the following things changed: - Addition of tab completion in psql for all new commands - Addition of a call to WaitForLockers in index_concurrent_swap to ensure that there are no running transactions on the parent table running before exclusive locks are taken on the index and its concurrent entry. Previous patch versions created deadlocks because of that, issue spotted by the isolation tests integrated in the patch. - Isolation tests for reindex concurrently are re-enabled by default. Regards, It looks like this needs another rebase, I get failures on index.c, toasting.c, indexcmds.c, and index.h Thanks, Jeff
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On 2014-11-10 12:10:49 -0500, Robert Haas wrote: - Patch 4 has had some review and really, as far as I can tell, the only really major issue that has cropped up is the possibility that the GUC settings that are legal in backend A aren't legal in backend B for some reason; in that case, restoring the GUC settings there will probably fail. While this is a legitimate concern, I think what it really boils down to is that there's a possibility that there are libraries loaded in the user backend that are not loaded in the postmaster and therefore won't get loaded into the background worker. I'm not too concerned about that one. That's an issue to which we may need to engineer some solution, but not in this patch. Overall, the patch's architecture is modeled closely on the way we synchronize GUCs to new backends when using EXEC_BACKEND, and I don't think we're going to do any better than starting with that and working to file down the rough edges as we go. So I'd like to go ahead and commit this. Did you check out whether PGC_BACKEND GUCs work? Other than that I agree that we can just solve further issues as we notice them. This isn't particularly intrusive. -- There's some code duplication with exec_simple_query() that doesn't feel great, but it's not clear that there's a superior alternative. I think we certainly don't want to complicate exec_simple_query() to make pg_background happy. Yea, I think we can live with it if really necessary. -- We should add REVOKE to the SQL script that installs it so that non-superusers can't use it unless the superuser decides to grant them rights. Right. That seems trivial enough. -- It doesn't handle types without send/receive functions. I thought that was tolerable since the types without such functions seem like mostly edge cases, but Andres doesn't like it. Apparently, BDR makes provisions to either ship the tuple as one big blob - if all built-in types - or else use binary format where supported and text format otherwise.
Since this issue is common to BDR, parallelism, and pg_background, we might want to introduce some common infrastructure for it, but it's not too clear to me right now what that should look like. I think we definitely need to solve this - but I'm also not at all sure how. For something like pg_background it's pretty trivial to fall back to in/out when there's no send/recv. It's just a couple of lines, and the portal code has the necessary provisions. So solving it for pg_background itself is pretty trivial. But, as you say, pg_background itself isn't a primary goal (although a nice demonstration, with some real-world applications). There are enough differences between the parallelism and the replication cases that I'm not entirely sure how much common ground there is. Namely replication has the additional concerns of version differences, trust (don't use blobs if you don't fully trust the other side), and differences in the allocated oids (especially relevant for arrays which, very annoyingly, embed the oid in the send/recv format). 1. Any other opinions for or against pg_background as a concept? I thought the ability to kick off vacuums (or other stuff with PreventTransactionChain) asynchronously was pretty cool, as well as the ability to use it as an authentication-free loopback dblink. But opinions obviously vary. I think it's a cool concept, but I'm not sure if it's worth the work to make it fully usable. Or rather, I think it's worthy enough, but I personally would prioritize other stuff. 2. Is anyone sufficiently interested in pg_background as a concept that they'd be willing to take over the patch and work on the TODO list mentioned above? I personally won't. If we can come up with a sketch of how to deal with the data transport encoding issue above, I'd be willing to work on that specific part. But not pg_background in itself.
Greetings, Andres Freund
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
Andres Freund wrote: On 2014-11-10 14:28:30 -0300, Alvaro Herrera wrote: If what we want is to quantify the extent of the issue, would it be more convenient to save counters to pgstat? Vacuum already sends pgstat messages, so there's no additional traffic there. I'm pretty strongly against that one in isolation. They'd need to be stored somewhere and they'd need to be queryable somewhere with enough context to make sense. To actually make sense of the numbers we'd also need to report all the other datapoints of vacuum in some form. That's quite a worthwhile project imo - but *much* *much* more work than this. We already have last_autovacuum columns and such in pg_stat_tables et al, which only record the last value. My thinking regarding such numbers is that you would save histories and put them in a chart, see how they evolve with time. I doubt the individual numbers are worth much, but the trends might show something interesting. As far as I know, this is already true for most other pgstat values, with the exception of things such as live_tuples which are absolute numbers rather than running counters. I agree having more vacuuming data in general is a worthwhile project, much larger than this one. Wasn't Greg Smith working on that? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
* Andres Freund (and...@2ndquadrant.com) wrote: 1. Any other opinions for or against pg_background as a concept? I thought the ability to kick off vacuums (or other stuff with PreventTransactionChain) asynchronously was pretty cool, as well as the ability to use it as an authentication-free loopback dblink. But opinions obviously vary. I think it's a cool concept, but I'm not sure if it's worth the work to make it fully usable. Or rather, I think it's worthy enough, but I personally would prioritize other stuff. I've not read through the whole thread/discussion, but I'd put myself in more-or-less the same boat at this point. I've got a number of other things on my plate already that need to get done (more RLS cleanup / consistency, back-patching the ereport() column-privs issue, reviewing pgAudit, the less-than-superuser privileges work, actually helping out with the in-progress commitfest..) and so I really doubt I'd be able to seriously help with pg_background - at least for 9.5, which is coming up awful fast at this point, if we're going to stick with the 'normal' schedule and freeze in the spring. That said, I love the concept and had really been hoping to see it in 9.5, and maybe some at- or cron-like ability happening later (yes, I absolutely feel we need this, though I know others have different opinions..). 2. Is anyone sufficiently interested in pg_background as a concept that they'd be willing to take over the patch and work on the TODO list mentioned above? I personally won't. If we can come up with a sketch of how to deal with the data transport encoding issue above, I'd be willing to work on that specific part. But not pg_background in itself. If other things get done or additional resources show up, I'd be interested in helping, but I don't see either happening in time for 9.5. Thanks! Stephen
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 11/10/14, 12:15 PM, Andres Freund wrote: If what we want is to quantify the extent of the issue, would it be more convenient to save counters to pgstat? Vacuum already sends pgstat messages, so there's no additional traffic there. I'm pretty strongly against that one in isolation. They'd need to be stored somewhere and they'd need to be queryable somewhere with enough context to make sense. To actually make sense of the numbers we'd also need to report all the other datapoints of vacuum in some form. That's quite a worthwhile project imo - but *much* *much* more work than this. We already report statistics on vacuums (lazy_vacuum_rel()/pgstat_report_vacuum), so this would just be adding 1 or 2 counters to that. Should we add the other counters from vacuum? That would be significantly more data. Semi-related, we should probably be reporting stats from heap truncation. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
Stephen Frost wrote: at least for 9.5, which is coming up awful fast at this point, if we're going to stick with the 'normal' schedule and freeze in the spring. https://wiki.postgresql.org/wiki/PgCon_2014_Developer_Meeting#9.5_Schedule Doesn't look all that normal to me, with that commitfest in Feb. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] remove pg_standby?
On 11/10/2014 07:50 PM, Joshua D. Drake wrote: On 11/04/2014 01:36 PM, Peter Eisentraut wrote: While we're talking about removing old things, is there any use left for pg_standby? -1. A lot of people, a lot of customers use log shipping for various creative and business requirement setups. Yes, but do they use pg_standby to implement it? If they do, why? pg_standby is more configurable than the built-in standby_mode=on. You can set the sleep time, for example, while standby_mode=on uses a hard-coded delay of 5 s. And pg_standby has a configurable maximum wait time. And as Fujii pointed out, the built-in system will print an annoying message to the log every time it attempts to restore a file. Nevertheless, 99% of users would probably be happy with the built-in thing. - Heikki
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 2014-11-10 15:36:55 -0300, Alvaro Herrera wrote: Andres Freund wrote: On 2014-11-10 14:28:30 -0300, Alvaro Herrera wrote: If what we want is to quantify the extent of the issue, would it be more convenient to save counters to pgstat? Vacuum already sends pgstat messages, so there's no additional traffic there. I'm pretty strongly against that one in isolation. They'd need to be stored somewhere and they'd need to be queryable somewhere with enough context to make sense. To actually make sense of the numbers we'd also need to report all the other datapoints of vacuum in some form. That's quite a worthwhile project imo - but *much* *much* more work than this. We already have last_autovacuum columns and such in pg_stat_tables et al, which only record the last value. My thinking regarding such numbers is that you would save histories and put them in a chart, see how they evolve with time. I doubt the individual numbers are worth much, but the trends might show something interesting. I don't think they mean anything without also reporting the number of buffers actually scanned and other related stats. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] remove pg_standby?
On Mon, Nov 10, 2014 at 7:48 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 11/10/2014 07:50 PM, Joshua D. Drake wrote: On 11/04/2014 01:36 PM, Peter Eisentraut wrote: While we're talking about removing old things, is there any use left for pg_standby? -1. A lot of people, a lot of customers use log shipping for various creative and business requirement setups. Yes, but do they use pg_standby to implement it? If they do, why? pg_standby is more configurable than the built-in standby_mode=on. You can set the sleep time, for example, while standby_mode=on uses a hard-coded delay of 5 s. And pg_standby has a configurable maximum wait time. And as Fujii pointed out, the built-in system will print an annoying message to the log every time it attempts to restore a file. Nevertheless, 99% of users would probably be happy with the built-in thing. As long as pg_standby has features that are actually useful and that are not in the built-in system, we shouldn't remove it. We should, however, try to fix those in the main system so we can get rid of it after that :) -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 2014-11-10 12:37:29 -0600, Jim Nasby wrote: On 11/10/14, 12:15 PM, Andres Freund wrote: If what we want is to quantify the extent of the issue, would it be more convenient to save counters to pgstat? Vacuum already sends pgstat messages, so there's no additional traffic there. I'm pretty strongly against that one in isolation. They'd need to be stored somewhere and they'd need to be queryable somewhere with enough context to make sense. To actually make sense of the numbers we'd also need to report all the other datapoints of vacuum in some form. That's quite a worthwhile project imo - but *much* *much* more work than this. We already report statistics on vacuums (lazy_vacuum_rel()/pgstat_report_vacuum), so this would just be adding 1 or 2 counters to that. Should we add the other counters from vacuum? That would be significantly more data. At the very least it'd require: * The number of buffers skipped due to the vm * The number of buffers actually scanned * The number of full-table as opposed to partial vacuums I think it'd require a fair amount of thinking about which values are required to make sense of the number of buffers skipped due to not being able to acquire the cleanup lock. If you want to do this - and I sure don't want to stop you from it - you should look at it from a general perspective, not from the perspective of how skipped cleanup locks are logged. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WAL format and API changes (9.5)
Attached is a new version. It's rebased on current git master, including BRIN. I've also fixed the laundry list of small things you reported, as well as a bunch of bugs I uncovered during my own testing. Alvaro: you still have the BRIN WAL-logging code fresh in your memory, so could you take a look at that part of this patch to check that I didn't break it? And also, how do you like the new way of writing that, over what you had to do in HEAD? More below. On 11/09/2014 11:47 PM, Andres Freund wrote: On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote: Replying to some of your comments below. The rest were trivial issues that I'll just fix. On 10/30/2014 09:19 PM, Andres Freund wrote: * Is it really a good idea to separate XLogRegisterBufData() from XLogRegisterBuffer() the way you've done it? If we ever actually get a record with a large number of blocks touched this essentially is O(touched_buffers*data_entries). Are you worried that the linear search in XLogRegisterBufData(), to find the right registered_buffer struct, might be expensive? If that ever becomes a problem, a simple fix would be to start the linear search from the end, and make sure that when you touch a large number of blocks, you do all the XLogRegisterBufData() calls right after the corresponding XLogRegisterBuffer() call. Yes, that was what I was (mildly) worried about. Since you specified a high limit of buffers I'm sure someone will come up with a use case for it ;) I've also thought about having XLogRegisterBuffer() return the pointer to the struct, and passing it as argument to XLogRegisterBufData. So the pattern in WAL-generating code would be like: registered_buffer *buf0; buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD); XLogRegisterBufData(buf0, data, length); registered_buffer would be opaque to the callers. That would have potential to turn XLogRegisterBufData into a macro or inline function, to eliminate the function call overhead.
I played with that a little bit, but the difference in performance was so small that it didn't seem worth it. But passing the registered_buffer pointer, like above, might not be a bad thing anyway. Yes, that was roughly what I was thinking of as well. It's not all that pretty, but it generally does seem like a good idea to me anyway. I ended up doing something different: the block_id is now used as the index into the registered_buffers array. So looking up the right struct is now just registered_buffers[block_id]. That obviously gets rid of the linear search. It doesn't seem to make any difference in the quick performance testing I've been doing. * There's lots of functions like XLogRecHasBlockRef() that dig through the wal record. A common pattern is something like: if (XLogRecHasBlockRef(record, 1)) XLogRecGetBlockTag(record, 1, NULL, NULL, oldblk); else oldblk = newblk; I think doing that repeatedly is quite a bad idea. We should parse the record once and then use it in a sensible format. Not do it in pieces, over and over again. It's not like we ignore backup blocks - so doing this lazily doesn't make sense to me. Especially as ValidXLogRecord() *already* has parsed the whole damn thing once. Hmm. Adding some kind of a parsed XLogRecord representation would need a fair amount of new infrastructure. True. The vast majority of WAL records contain one, maybe two, block references, so it's not that expensive to find the right one, even if you do it several times. I'm not convinced. It's not an infrequent thing these days to hear of people being bottlenecked by replay. And grovelling repeatedly through larger records isn't *that* cheap. I haven't done anything about this. We might want to do that, but I'd like to get this committed now, with all the changes to the AMs, and then spend some time trying to further optimize the WAL format. I might add the deparsed WAL record infrastructure as part of that optimization work.
I ran a quick performance test on WAL replay performance yesterday. I ran pgbench for 100 transactions with WAL archiving enabled, and measured the time it took to replay the generated WAL. I did that with and without the patch, and I didn't see any big difference in replay times. I also ran perf on the startup process, and the profiles looked identical. I'll do more comprehensive testing later, with different index types, but I'm convinced that there is no performance issue in replay that we'd need to worry about. Interesting. What checkpoint_segments/timeout and what scale did you use? Since that heavily influences the average size of the record that's quite relevant... Scale factor 5, checkpoint_segments=10, checkpoint_timeout=30s. pg_xlogdump --stats says: Type / N (%) / Record size (%) / FPI size (%) / Combined size (%)
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On Mon, Nov 10, 2014 at 1:29 PM, Andres Freund and...@2ndquadrant.com wrote: That's an issue to which we may need to engineer some solution, but not in this patch. Overall, the patch's architecture is modeled closely on the way we synchronize GUCs to new backends when using EXEC_BACKEND, and I don't think we're going to do any better than starting with that and working to file down the rough edges as we go. So I'd like to go ahead and commit this. Did you check out whether PGC_BACKEND GUCs work? Other than that I agree that we can just solve further issues as we notice them. This isn't particularly intrusive. Is the scenario you're concerned about this one: 1. Start postmaster. 2. Start a user session. 3. Change a PGC_BACKEND GUC in postgresql.conf. 4. Reload. 5. Launch a background worker that uses this code. It should end up with the same values there as the active session, not the current one from postgresql.conf. But we want to verify that's the behavior we actually get. Right? -- It doesn't handle types without send/receive functions. I thought that was tolerable since the types without such functions seem like mostly edge cases, but Andres doesn't like it. Apparently, BDR makes provisions to either ship the tuple as one big blob - if all built-in types - or else use binary format where supported and text format otherwise. Since this issue is common to BDR, parallelism, and pg_background, we might want to introduce some common infrastructure for it, but it's not too clear to me right now what that should look like. I think we definitely need to solve this - but I'm also not at all that sure how. For something like pg_background it's pretty trivial to fall back to in/out when there's no send/recv. It's just a couple of lines, and the portal code has the necessary provisions. So solving it for pg_background itself is pretty trivial. That's not really the interesting part of the problem, I think.
Yeah, pg_background can be made to speak text format if we really care. But the real issue is that even with the binary format, converting a tuple in on-disk format into a DataRow message so that the other side can reconstruct the tuple and shove it into an executor slot (or wherever it wants to shove it) seems like it might be expensive. You might have data on that; if it's not expensive, stop here. If it is expensive, what we really want to do is figure out some way to make it safe to copy the tuple into the shared memory queue, or send it out over the socket, and have the process on the other end use it without having to revalidate the whole thing column-by-column. But, as you say, pg_background itself isn't a primary goal (although a nice demonstration, with some real world applications). There's enough differences between the parallelism and the replication cases that I'm not entirely sure how much common ground there is. Namely replication has the additional concerns of version differences, trust (don't use blobs if you don't fully trust the other side), and differences in the allocated oids (especially relevant for arrays which, very annoyingly, embed the oid in the send/recv format). Yeah. It would be nice to use the same code for deciding what to do in a particular case. It seems like it ought to be possible to have one rule for whether a tuple with a given set of data types is safe to ship in the on-disk format. For pg_background/parallelism, it's enough if that function returns true; for BDR, there might be additional criteria applied before deciding to do it that way. But they could use the same decision tree for part of it. What would be even better is to find some way to MAKE IT SAFE to send the undecoded tuple. I'm not sure what that would look like. Locking the types against concurrent changes? Marking dropped types as dropped without actually dropping them, and garbage collecting them later? Providing safer ways to deconstruct tuples? 
It might help to start by enumerating what exactly can go wrong here. As far as I can see, the main risks for pg_background and parallelism are (1) that the type might get concurrently dropped, leaving us unable to interpret the received tuple and (2) that the type might get concurrently modified in some way that leaves us confused about how to interpret the tuple, as by having the type size change. The chances of a problem in practice seem remote, since you can't do either of those things if a table column with that type exists, but I can't convince myself that there's no way for the type to be modified under us in such a way that two different backends have different ideas about whether it exists or how wide it is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Access method extendability
Hi! Thanks everybody for the discussion. Sorry for the delay. I'll try to address most of the questions raised in this discussion. In general, I agree with the concept of marking an index as invalid in certain cases. I see the following possible consequences of a buggy WAL-logged custom AM: 1) Server crash during insert/update. 2) Server crash during select. 3) Invalid query answers. 4) Server crash during vacuum. 5) Server crash in recovery. #5 is totally unacceptable. I've tried to design the generic WAL record so that it should always be possible to purely mechanically apply the record. It's always possible to move a piece of memory inside the page. It's always possible to copy a piece of memory from the WAL record to the page. Buggy WAL could for sure cause index corruption, as could any other bug in an AM. WAL support doesn't bring anything special to this except the fact that WAL is quite hard to debug. However, in the current implementation I missed some hidden asserts about page structure. A page could be so corrupted that we can't, for instance, safely call XLogReadBufferForRedo(). All these cases must be worked out. If we can't update a page during recovery, the index must be marked as invalid. It's some amount of work, but it seems to be very feasible. #4 seems problematic too. If autovacuum crashes on some index, then postgres can enter a crash loop. So, if autovacuum crashes on an extension-provided AM, that index should be marked as invalid. Consequences #1, #2 and #3 could easily be caused by a buggy opclass as well. The reason why we're not knee-deep in extension-provided bugs in GiST/GIN opclasses is not that it is easy to write correct GiST/GIN opclasses. The actual reason is that we have only a few GiST/GIN opclasses which weren't written by the GiST/GIN authors. So, I don't expect PostgreSQL to be flooded with buggy AMs once we add AM extendability. Our motivation behind this proposal is that we want to be able to develop custom extensions with AMs.
We want to be able to provide our AMs to our customers without having to push them into PostgreSQL core or fork PostgreSQL. Bugs in those AMs are our responsibility to our customers. Some commercial companies could implement patented AMs (for instance, fractal trees), and that is their responsibility to their customers. Also, I think it's OK to put the burden of adapting custom AMs to changes of the AM interface on the authors of those custom AMs. AM extensions will have to be adapted to each new major release anyway. AFAIR, the interface of the RelationOpen() function was changed not too long ago. A custom AM would use many functions which we use to access relations. Their interface can change in the next release. The PostGIS GiST opclass has bugs which are reproducible, known and not fixed. This is the responsibility of PostGIS to their customers. We can feel sorry for PostGIS that they are so slow in fixing this. But we shouldn't feel sorry for GiST extendability; that would be ridiculous. Some researchers could write their own extensions. We can hope that they are smart enough not to recommend them for production use. We can back our hope with a warning during installation of an extension-provided AM. That warning could say that all corruption caused by an extension-provided AM is up to the AM developer. This warning could make users wary of using extension-provided AMs in production unless they fully trust the extension developer (and have a support subscription if it's commercial). PostgreSQL doesn't have any kind of safe extensions. Every extension must be trusted. Every extension can break (almost) everything. When one types CREATE EXTENSION he must completely trust the extension author. This applies to every extension. I would be very careful with discouraging commercial AM extensions. We should always keep in mind how many PostgreSQL developers are working for companies which own their commercial PostgreSQL forks and how big their contribution is. Patented extensions look scary for sure.
But that's up to software patents, not up to PostgreSQL extendability... Particular steps I'm going to take on these patches: 1) Make generic_redo never crash on broken pages. 2) Make the autovacuum launcher mark an index as invalid if the vacuum process crashed on a custom-AM index. Since we can't write anything into the postgres cluster when one process has crashed, ITSM autovacuum should have some separate place to put this information. Then, after restart, postgres can read it and mark the index as invalid. Not allowing the CREATE ACCESS METHOD command seems problematic to me. How could it work with pg_upgrade? pg_dump wouldn't dump extra pg_am records. So, pg_upgrade would break at creating operator classes on the new cluster. So, I agree with dropping the CREATE ACCESS METHOD command only if we let pg_dump dump extra pg_am records... -- With best regards, Alexander Korotkov.
Re: [HACKERS] row_to_json bug with index only scans: empty keys!
Robert Haas robertmh...@gmail.com writes: On Mon, Nov 10, 2014 at 10:11 AM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, has anyone got an opinion about whether to stick the full fix into 9.4? I think we should put the full fix in 9.4. Since nobody spoke against that, I've put the full fix into HEAD and 9.4, and the restricted version into 9.3 and 9.2. regards, tom lane
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
Fujii Masao wrote: I got the following PANIC error in the standby server when I set up the replication servers and ran make installcheck. Note that I was repeating the manual CHECKPOINT every second while installcheck was running. Without the checkpoints, I could not reproduce the problem. I'm not sure if CHECKPOINT really triggers this problem, though. Anyway BRIN seems to have a problem around its WAL replay. Hm, I think I see what's happening. The xl_brin_update record references two buffers, one which is target for the updated tuple and another which is the revmap buffer. When the update target buffer is being first used we set the INIT bit which removes the buffer reference from the xlog record; in that case, if the revmap buffer is first being modified after the prior checkpoint, that revmap buffer receives backup block number 0; but the code hardcodes it as 1 on the expectation that the buffer that's target for the update will receive 0. The attached patch should fix this. I cannot reproduce the issue after applying this patch, can you please confirm that it fixes the issue for you as well? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c index ebef984..2937068 100644 --- a/src/backend/access/brin/brin_xlog.c +++ b/src/backend/access/brin/brin_xlog.c @@ -60,9 +60,11 @@ brin_xlog_insert_update(XLogRecPtr lsn, XLogRecord *record, */ if (record->xl_info & XLOG_BRIN_INIT_PAGE) { - XLogReadBufferForRedoExtended(lsn, record, 0, - xlrec->node, MAIN_FORKNUM, blkno, - RBM_ZERO, false, &buffer); + /* + * No full-page image here. Don't try to read it, because there + * might be one for the revmap buffer, below.
+ */ + buffer = XLogReadBuffer(xlrec->node, blkno, true); page = BufferGetPage(buffer); brin_page_init(page, BRIN_PAGETYPE_REGULAR); action = BLK_NEEDS_REDO; @@ -97,7 +99,9 @@ brin_xlog_insert_update(XLogRecPtr lsn, XLogRecord *record, UnlockReleaseBuffer(buffer); /* update the revmap */ - action = XLogReadBufferForRedo(lsn, record, 1, xlrec->node, + action = XLogReadBufferForRedo(lsn, record, + record->xl_info & XLOG_BRIN_INIT_PAGE ? 0 : 1, + xlrec->node, xlrec->revmapBlk, &buffer); if (action == BLK_NEEDS_REDO) {
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: Stephen Frost wrote: at least for 9.5, which is coming up awful fast at this point, if we're going to stick with the 'normal' schedule and freeze in the spring. https://wiki.postgresql.org/wiki/PgCon_2014_Developer_Meeting#9.5_Schedule Doesn't look all that normal to me, with that commitfest in Feb. I'd be more inclined to expect the late release of 9.4 to impact things than that schedule to really change when we freeze.. Just my 2c though. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
Alvaro Herrera wrote: Hm, I think I see what's happening. The xl_brin_update record references two buffers, one which is target for the updated tuple and another which is the revmap buffer. When the update target buffer is being first used we set the INIT bit which removes the buffer reference from the xlog record; in that case, if the revmap buffer is first being modified after the prior checkpoint, that revmap buffer receives backup block number 0; but the code hardcodes it as 1 on the expectation that the buffer that's target for the update will receive 0. The attached patch should fix this. Pushed, thanks for the report. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
Greg Stark wrote: 1) The manual describes the extensibility API including the BrinOpcInfo struct -- but it doesn't define the BrinDesc struct that every API method takes. It's not clear what exactly that argument is for or how to make use of it. Hm, I guess this could use some expansion. 2) The mention about additional opclass operators and to number them from 11 up is fine -- but there's no explanation of how to decide what operators need to be explicitly added like that. Specifically I gather from reading minmax that = is handled internally by Brin and you only need to add any other operators aside from = ? Is that right? I think I already replied to this in the other email. 3) It's not entirely clear in the docs when each method will be invoked. Specifically it's not clear whether opcInfo is invoked once when the index is defined or every time the definition is loaded to be used. I gather it's the latter? Perhaps there needs to be a method that's invoked specifically when the index is defined? I'm wondering where I'm going to hook in the logic to determine the size and number of hash functions to use for the bloom filter, which needs to be decided once when the index is created and then be static for the index in the future. Every time the index is accessed, yeah. I'm not sure about figuring the initial creation details. Do you think we need another support procedure to help with that? We can add it if needed; minmax would just define it to InvalidOid. 4) It doesn't look like BRIN handles cross-type operators at all. The idea here is that there is a separate opclass to handle cross-type operators, which would be together in the same opfamily as the opclass used to create the index. I haven't actually tried this yet, mind you.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
Greg Stark wrote: On Sun, Nov 9, 2014 at 5:57 PM, Greg Stark st...@mit.edu wrote: 2) The mention about additional opclass operators and to number them from 11 up is fine -- but there's no explanation of how to decide what operators need to be explicitly added like that. Specifically I gather from reading minmax that = is handled internally by Brin and you only need to add any other operators aside from = ? Is that right? I see I totally misunderstood the use of the opclass procedure functions. I think I understand now but just to be sure -- if I can only handle BTEqualStrategyNumber keys then is it adequate to just define the opclass containing only the equality operator? Yes. I agree that this deserves some more documentation. In a nutshell, the opclass must provide three separate groups of items: 1. the mandatory support functions: opcInfo, addValue, Union, Consistent. opcInfo is invoked each time the index is accessed (including during index creation). 2. the additional support functions; normally these are called from within addValue, Consistent, Union. For minmax, what we provide is the functions that implement the inequality operators for the type, that is <, <=, >= and >. Since minmax tries to be generic and support a whole lot of types, this is the way that the mandatory support functions know what functions to call to compare two given values. If the opclass is specific to one data type, you might not need anything here; or perhaps you have other ways to figure out a hash function to call, etc. 3. the operators. We only use these so that the optimizer picks up the index for queries. Somehow I got confused between the amprocs that minmax uses to implement the consistency function and the amops that the brin index supports. I think it is somewhat confusing, yeah.
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] pg_background (and more parallelism infrastructure patches)
On 2014-11-10 14:54:15 -0500, Robert Haas wrote: On Mon, Nov 10, 2014 at 1:29 PM, Andres Freund and...@2ndquadrant.com wrote: That's an issue to which we may need to engineer some solution, but not in this patch. Overall, the patch's architecture is modeled closely on the way we synchronize GUCs to new backends when using EXEC_BACKEND, and I don't think we're going to do any better than starting with that and working to file down the rough edges as we go. So I'd like to go ahead and commit this. Did you check out whether PGC_BACKEND GUCs work? Other than that I agree that we can just solve further issues as we notice them. This isn't particularly intrusive. Is the scenario you're concerned about this one: 1. Start postmaster. 2. Start a user session. 3. Change a PGC_BACKEND GUC in postgresql.conf. 4. Reload. 5. Launch a background worker that uses this code. No, that's not what I was thinking of. Consider: export pg_libdir=$(pg_config --libdir) mkdir $pg_libdir/plugins ln -s $pg_libdir/auto_explain.so $pg_libdir/plugins/ PGOPTIONS='-c local_preload_libraries=auto_explain' psql and use pg_background() (or something else using this infrastructure) in that context. Without having reread your relevant code I'd expect a: ERROR: 55P02: parameter local_preload_libraries cannot be set after connection start because the code would try to import the user session's local_preload_libraries value into the background process - after it has actually already done its initialization. It should end up with the same values there as the active session, not the current one from postgresql.conf. But we want to verify that's the behavior we actually get. Right? But that's also something worthwhile to check. -- It doesn't handle types without send/receive functions. I thought that was tolerable since the types without such functions seem like mostly edge cases, but Andres doesn't like it.
Apparently, BDR makes provisions to either ship the tuple as one big blob - if all built-in types - or else use binary format where supported and text format otherwise. Since this issue is common to BDR, parallelism, and pg_background, we might want to introduce some common infrastructure for it, but it's not too clear to me right now what that should look like. I think we definitely need to solve this - but I'm also not at all that sure how. For something like pg_background it's pretty trivial to fall back to in/out when there's no send/recv. It's just a couple of lines, and the portal code has the necessary provisions. So solving it for pg_background itself is pretty trivial. That's not really the interesting part of the problem, I think. Yeah, pg_background can be made to speak text format if we really care. But the real issue is that even with the binary format, converting a tuple in on-disk format into a DataRow message so that the other side can reconstruct the tuple and shove it into an executor slot (or wherever it wants to shove it) seems like it might be expensive. You might have data on that; if it's not expensive, stop here. If it is expensive, what we really want to do is figure out some way to make it safe to copy the tuple into the shared memory queue, or send it out over the socket, and have the process on the other end use it without having to revalidate the whole thing column-by-column. Yes, sending the tuples as-is is a noticeable win. I've not fully analyzed where the difference is - I suspect it's to a large degree because it requires less copying/memory allocations. But also some of the send/recv functions aren't exactly cheap in themselves. For BDR we've avoided problems around changing type definitions and such by saying only built-in types get to do the 'as-is' stuff. That can be expanded later, but that's it for now. For those there's no chance that the type definition changes or anything.
A general problem is that you really can't allow this to happen outside a controlled setting - manipulating data on the 'Datum' level allows you to do bad things. But, as you say, pg_background itself isn't a primary goal (although a nice demonstration, with some real world applications). There's enough differences between the parallelism and the replication cases that I'm not entirely sure how much common ground there is. Namely replication has the additional concerns of version differences, trust (don't use blobs if you don't fully trust the other side), and differences in the allocated oids (especially relevant for arrays which, very annoyingly, embed the oid in the send/recv format). Yeah. It would be nice to use the same code for deciding what to do in a particular case. It seems like it ought to be possible to have one rule for whether a tuple with a given set of data types is safe to ship in the on-disk format. For pg_background/parallelism, it's enough if that function returns true; for BDR,
Re: [HACKERS] Add CREATE support to event triggers
On 8 November 2014 17:49, Robert Haas robertmh...@gmail.com wrote: We could just integrate those parts, and be done with it. But would that actually be a good thing for the community? Then slony needs to do it and potentially others as well? Then auditing can't use it? Then potential schema tracking solutions can't use it? Do you think Slony is really going to use this? I guess we can let the Slony guys speak for themselves, but I've been skeptical since day one that this is the best way to do DDL replication, and I still am. There are lots of ways that a replicated DDL statement can fail on the replicas, and what are you going to do then? It's too late to show the user the error message, so you can throw it in a log someplace and hope that somebody notices, but that's it. It makes a lot more sense to me to use some kind of a tool that applies the DDL in a coordinated fashion on all nodes - or even just do it manually, since it might very well be desirable to take the lock on different nodes at widely different times, separated by a switchover. I certainly think there's a use-case for what you're trying to do here, but I don't think it'll be right for everyone. Certainly, if the Slony guys - or some other team building an out-of-core replication solution - say, hey, we really want this in core, that would considerably strengthen the argument for putting it there. But I haven't heard anyone say that yet - unlike logical decoding, where we did have other people expressing clear interest in using it. There've been people for a long while asking about triggers on catalogs for that purpose. IIRC Jan was one of them. My impression, based on something Christopher Brown said a few years ago, is that Slony's DDL trigger needs are largely satisfied by the existing event trigger stuff. It would be helpful to get confirmation as to whether that's the case. I'm not sure that a replication system that intends to do partial replication (e.g.
- being selective of what objects are to be replicated) will necessarily want to use the CREATE event triggers to capture creates. Several cases pop up with different answers: a) I certainly don't want to replicate temporary tables b) I almost certainly don't want to replicate unlogged tables c) For more ordinary tables, I'm not sure I want to extend Slony to detect them and add them automatically, because there are annoying sub-cases c.1) If I'm working on data conversion, I may create not totally temporary tables that are nonetheless not worthy to replicate. (I'm working on such right now) Long and short: it seems likely that I'd frequently NOT want all new tables added to replication, at least not all of them, all the time. What would seem valuable, to me, would be to have a CREATE event trigger that lets me know the OID and/or fully qualified name of the new object so that perhaps the replication system: a) Has some kind of rule system to detect if it wants to replicate it, b) Logs the change so a human might know later that there's new stuff that probably ought to be replicated c) Perhaps a human might put replication into a new suggestive mode, a bit akin to Slony's EXECUTE SCRIPT, but where the human essentially says, Here, I'm running DDL against this connection for a while, and I'd be grateful if Postgres told Slony to capture all the new tables and sequences and replicated them. There are kind of two approaches: a) Just capture the OIDs, and have replication go back later and grab the table definition once the dust clears on the master b) We need to capture ALL the DDL, whether CREATE or ALTER, and forward it, altered to have fully qualified names on everything so that we don't need to duplicate all the set search_path requests and such. I suppose there's also a third... c) Have a capability to put an event trigger function in place that makes DDL requests fail. 
That's more useful than you'd think; if, by default, we make them fail, with an error message such as DDL request failed as it was not submitted using slonik DDL TOOL, then we have protection against uncontrolled application of DDL. DDL TOOL would switch off the fail trigger, possibly trying to capture the DDL, or perhaps just capturing the statements passed to it so they get passed everywhere. (That heads back to a) and b); what should get captured...) I'm not sure that all of that is totally internally coherent, but I hope there are some ideas worth thinking about. -- When confronted by a difficult problem, solve it by reducing it to the question, How would the Lone Ranger handle this?
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
On Mon, Nov 10, 2014 at 9:31 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Every time the index is accessed, yeah. I'm not sure about figuring the initial creation details. Do you think we need another support procedure to help with that? We can add it if needed; minmax would just define it to InvalidOid. I have a working bloom filter with hard-coded filter size and hard-coded number of hash functions. I need to think about how I'm going to make it more general now. I think the answer is that I should have an index option that specifies the false positive rate and calculates the optimal filter size and number of hash functions. It might possibly need to peek at the table statistics to determine the population size though. Or perhaps I should bite the bullet and size the bloom filters based on the actual number of rows in a chunk, since the BRIN infrastructure does allow each summary to be a different size. There's another API question I have. To implement Consistent I need to call the hash function, which in the case of functions like hashtext could be fairly expensive, and I even need to generate multiple hash values (though currently I'm slicing them all from the integer hash value so that's not too bad) and then test each of those bits. It would be natural to call hashtext once at the start of the scan and possibly build a bitmap and compare all of them in a single operation. But afaict there's no way to hook the beginning of the scan, and opaque is not associated with the specific scan, so I don't think I can cache the hash value of the scan key there safely. Is there a good way to do it with the current API? On a side note I'm curious about something: I was stepping through my code in gdb and discovered that a single-row insert appeared to construct a new summary and then union it into the existing summary instead of just calling AddValue on the existing summary. Is that intentional? What led to that?
-- greg
Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED
On Thu, Nov 6, 2014 at 3:42 AM, Michael Paquier michael.paqu...@gmail.com wrote: On Sat, Sep 13, 2014 at 11:02 PM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: Patch rebased and added to commitfest [1]. It looks like a good thing to remove ATChangeIndexesPersistence; this puts the persistence switch directly into the reindex process. A couple of minor comments about this patch: 1) Reading it, I am wondering if it would not finally be time to switch to a macro to get a relation's persistence, something like RelationGetPersistence in rel.h... Not related directly to this patch. Good idea... I'll provide a patch soon. 2) reindex_index has as a new argument a relpersistence value for the new index. reindex_relation has, differently, a new set of flags to enforce the relpersistence of all the underlying indexes. Wouldn't it be better for API consistency to pass a relpersistence value directly through reindex_relation? In any case, the comment block of reindex_relation does not contain a description of the new flags. I did it because ReindexDatabase builds a list of OIDs to run reindex_relation for each item of the list. I can change the list to store the relpersistence also, but this can lead us to concurrency trouble, because if one session executes REINDEX DATABASE and another session runs ALTER TABLE name SET {LOGGED|UNLOGGED} during the building of the list, the reindex can lead us to an inconsistent state. Added comments to the comment block of reindex_relation. 3) Here you may as well just set the value and be done: +/* + * Check if need to set the new relpersistence + */ +if (iRel->rd_rel->relpersistence != relpersistence) +iRel->rd_rel->relpersistence = relpersistence; Hmmm... I really don't know why I did it... fixed. Thanks!
-- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog: http://fabriziomello.github.io Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello Github: http://github.com/fabriziomello

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 912038a..50cf0ef 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1980,7 +1980,7 @@ index_build(Relation heapRelation,
 	 * created it, or truncated twice in a subsequent transaction, the
 	 * relfilenode won't change, and nothing needs to be done here.
 	 */
-	if (heapRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+	if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
 		!smgrexists(indexRelation->rd_smgr, INIT_FORKNUM))
 	{
 		RegProcedure ambuildempty = indexRelation->rd_am->ambuildempty;
@@ -3130,7 +3130,7 @@ IndexGetRelation(Oid indexId, bool missing_ok)
  * reindex_index - This routine is used to recreate a single index
  */
 void
-reindex_index(Oid indexId, bool skip_constraint_checks)
+reindex_index(Oid indexId, bool skip_constraint_checks, char relpersistence)
 {
 	Relation	iRel,
 				heapRelation;
@@ -3191,6 +3191,9 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Set the relpersistence of the new index */
+	iRel->rd_rel->relpersistence = relpersistence;
+
 	/* We'll build a new physical relation for the index */
 	RelationSetNewRelfilenode(iRel, InvalidTransactionId, InvalidMultiXactId);
@@ -3310,6 +3313,12 @@
  * performance, other callers should include the flag only after transforming
  * the data in a manner that risks a change in constraint validity.
  *
+ * REINDEX_REL_FORCE_UNLOGGED: if true, force the new rebuilded indexes to be
+ * unlogged.
+ *
+ * REINDEX_REL_FORCE_LOGGED: if true, force the new rebuilded indexes to be
+ * permanent.
+ *
  * Returns true if any indexes were rebuilt (including toast table's index
  * when relevant). Note that a CommandCounterIncrement will occur after each
  * index rebuild.
@@ -3389,11 +3398,19 @@ reindex_relation(Oid relid, int flags)
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
+		char		relpersistence = rel->rd_rel->relpersistence;
 
 		if (is_pg_class)
 			RelationSetIndexList(rel, doneIndexes, InvalidOid);
 
-		reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS));
+		/* Check for flags to force UNLOGGED or PERMANENT persistence */
+		if (flags & REINDEX_REL_FORCE_UNLOGGED)
+			relpersistence = RELPERSISTENCE_UNLOGGED;
+		else if (flags & REINDEX_REL_FORCE_PERMANENT)
+			relpersistence = RELPERSISTENCE_PERMANENT;
+
+		reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
+					  relpersistence);
 
 		CommandCounterIncrement();

diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 6a578ec..95b9a4f 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -589,7 +589,8 @@ rebuild_relation(Relation OldHeap, Oid
Re: [HACKERS] Add CREATE support to event triggers
On 10/11/14 23:37, Christopher Browne wrote: On 8 November 2014 17:49, Robert Haas robertmh...@gmail.com My impression, based on something Christopher Brown said a few years ago, is that Slony's DDL trigger needs are largely satisfied by the existing event trigger stuff. It would be helpful to get confirmation as to whether that's the case.

I'm not sure that a replication system that intends to do partial replication (e.g. - being selective of what objects are to be replicated) will necessarily want to use the CREATE event triggers to capture creates. Several cases pop up with different answers: a) I certainly don't want to replicate temporary tables b) I almost certainly don't want to replicate unlogged tables c) For more ordinary tables, I'm not sure I want to extend Slony to detect them and add them automatically, because there are annoying sub-cases c.1) If I'm working on data conversion, I may create not totally temporary tables that are nonetheless not worthy to replicate. (I'm working on such right now) Long and short: it seems likely that I'd frequently NOT want all new tables added to replication, at least not all of them, all the time.

I don't see how this is a problem with using CREATE event triggers; just put logic in your trigger that handles this. You get the object identity of the object being created/altered, so you can look up any info about it you wish and easily filter however you want.

There are kind of two approaches: a) Just capture the OIDs, and have replication go back later and grab the table definition once the dust clears on the master b) We need to capture ALL the DDL, whether CREATE or ALTER, and forward it, altered to have fully qualified names on everything so that we don't need to duplicate all the set search_path requests and such.

This is basically what this patch gives you (actually both the canonized command and the identity)?

I suppose there's also a third...
c) Have a capability to put an event trigger function in place that makes DDL requests fail. That's more useful than you'd think; if, by default, we make them fail, and with an error messages such as DDL request failed as it was not submitted using slonik DDL TOOL You can do that already, it's even the example in the event trigger documentation. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
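For reference, such a blocking trigger is a few lines of plpgsql, closely following the abort_any_command example in the event trigger documentation (the trigger and function names below are illustrative):

```sql
-- Reject ad-hoc DDL so that schema changes must go through the
-- replication tool (e.g. slonik EXECUTE SCRIPT).
CREATE OR REPLACE FUNCTION abort_any_command()
  RETURNS event_trigger
  LANGUAGE plpgsql
AS $$
BEGIN
  RAISE EXCEPTION 'DDL request failed: not submitted via slonik EXECUTE SCRIPT';
END;
$$;

CREATE EVENT TRIGGER abort_ddl ON ddl_command_start
  EXECUTE PROCEDURE abort_any_command();
```

The replication tool would drop (or disable) the event trigger on its own maintenance connection before applying queued DDL, then recreate it.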
Re: [HACKERS] Add CREATE support to event triggers
On 2014-11-10 17:37:40 -0500, Christopher Browne wrote: On 8 November 2014 17:49, Robert Haas robertmh...@gmail.com wrote: We could just integrate those parts, and be done with it. But would that actually be a good thing for the community? Then slony needs to do it and potentially others as well? Then auditing can't use it? Then potential schema tracking solutions can't use it?

Do you think Slony is really going to use this? I guess we can let the Slony guys speak for themselves, but I've been skeptical since day one that this is the best way to do DDL replication, and I still am. There are lots of ways that a replicated DDL statement can fail on the replicas, and what are you going to do then? It's too late to show the user the error message, so you can throw it in a log someplace and hope that somebody notices, but that's it. It makes a lot more sense to me to use some kind of tool that applies the DDL in a coordinated fashion on all nodes - or even just do it manually, since it might very well be desirable to take the lock on different nodes at widely different times, separated by a switchover. I certainly think there's a use-case for what you're trying to do here, but I don't think it'll be right for everyone. Certainly, if the Slony guys - or some other team building an out-of-core replication solution - say, hey, we really want this in core, that would considerably strengthen the argument for putting it there. But I haven't heard anyone say that yet - unlike logical decoding, where we did have other people expressing clear interest in using it.

There've been people asking for a long while about triggers on catalogs for that purpose. IIRC Jan was one of them.

My impression, based on something Christopher Brown said a few years ago, is that Slony's DDL trigger needs are largely satisfied by the existing event trigger stuff. It would be helpful to get confirmation as to whether that's the case.
I'm not sure that a replication system that intends to do partial replication (e.g. - being selective of what objects are to be replicated) will necessarily want to use the CREATE event triggers to capture creates. Several cases pop up with different answers: a) I certainly don't want to replicate temporary tables b) I almost certainly don't want to replicate unlogged tables

Those are quite easy to recognize and skip.

c) For more ordinary tables, I'm not sure I want to extend Slony to detect them and add them automatically, because there are annoying sub-cases c.1) If I'm working on data conversion, I may create not totally temporary tables that are nonetheless not worthy to replicate. (I'm working on such right now)

Sure, you might not want to do it automatically all the time - but I think it's a very useful default mode. Once you can replicate CREATEs per se, it's easy to add logic (in a couple of lines of plpgsql or whatever) to only do so in a certain schema or similar. But the main reason all this is interesting isn't so much CREATE itself, but that it can be done (and Alvaro has mostly done it!) for ALTER as well. And there it imo becomes really interesting. Because you can quite easily check whether the affected relation is being replicated, you can just emit the DDL when that's the case. And that makes DDL in a logically replicated setup *much* easier.

Long and short: it seems likely that I'd frequently NOT want all new tables added to replication, at least not all of them, all the time.

Agreed. That's quite possible with the design here - you get the creation commands and can decide whether you want to do anything with them. You're not forced to insert them into your replication queue or whatever you're using for that.
What would seem valuable, to me, would be to have a CREATE event trigger that lets me know the OID and/or fully qualified name of the new object so that perhaps the replication system: a) Has some kind of rule system to detect if it wants to replicate it, Sure. b) Logs the change so a human might know later that there's new stuff that probably ought to be replicated Sure. c) Perhaps a human might put replication into a new suggestive mode, a bit akin to Slony's EXECUTE SCRIPT, but where the human essentially says, Here, I'm running DDL against this connection for a while, and I'd be grateful if Postgres told Slony to capture all the new tables and sequences and replicated them. Sure. Some of that already is possible with the current event triggers - and all of it would be possible with the suggested functionality here. An old version of bdr, employing the functionality presented here, had the following (simplified) event trigger: CREATE OR REPLACE FUNCTION bdr.queue_commands() RETURNS event_trigger LANGUAGE plpgsql AS $function$ DECLARE r RECORD; BEGIN -- don't recursively log ddl commands IF
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
Hi Robert, All, On 2014-11-10 10:57:16 -0500, Robert Haas wrote: On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs si...@2ndquadrant.com wrote: Something usable, with severe restrictions, is actually better than what we have now. I understand the journey this work represents, so don't be embarrassed by submitting things with heuristics and good-enoughs in it. Our mentor, Mr. Lane, achieved much by spreading work over many releases, leaving others to join in the task.

It occurs to me that, now that the custom-scan stuff is committed, it wouldn't be that hard to use that, plus the other infrastructure we already have, to write a prototype of parallel sequential scan. Given where we are with the infrastructure, there would be a number of unhandled problems, such as deadlock detection (needs group locking or similar), assessment of quals as to parallel-safety (needs proisparallel or similar), general waterproofing to make sure that pushing down a qual we shouldn't doesn't do anything really dastardly like crash the server (another written but yet-to-be-published patch adds a bunch of relevant guards), and snapshot sharing (likewise). But if you don't do anything weird, it should basically work.

I think this would be useful for a couple of reasons. First, it would be a demonstrable show of progress, illustrating how close we are to actually having something you can really deploy. Second, we could use it to demonstrate how the remaining infrastructure patches close up gaps in the initial prototype. Third, it would let us start doing real performance testing.

I think it might be a useful experiment - as long as it's clear that it's that. Which is, I think, what you're thinking about?

It seems pretty clear that a parallel sequential scan of data that's in memory (whether the page cache or the OS cache) can be accelerated by having multiple processes scan it in parallel.

Right.

But it's much less clear what will happen when the data is being read in from disk.
I think that *very* heavily depends on the IO subsystem.

Does parallelism help at all?

I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage, with a raid 10 out of four spinning disks, fully busy. I think some of that benefit also could be reaped by being better at hinting the OS...

What degree of parallelism helps?

That's quite a hard question. Generally, the question of how much parallelism is beneficial for what will be one of the most complicated areas once the plumbing is in.

Do we break OS readahead so badly that performance actually regresses?

I don't think it's likely that we break OS readahead - that works on a per-task basis, at least on linux afaik. But it's nonetheless very easy to have too many streams causing too many random reads.

These are things that are likely to need a fair amount of tuning before this is ready for prime time, so being able to start experimenting with them in advance of all of the infrastructure being completely ready seems like it might help.

I'm not actually entirely sure how much that's going to help. I think you could very well have a WIP patch ready reasonably quickly that doesn't solve the issues you mention above by patching it in. For the kind of testing we're talking about that seems likely sufficient - a git branch somewhere is probably actually easier to compile for people than some contrib module that needs to be loaded... And I *do* think that you'll very quickly hit the limits of the custom scan API. And I'd much rather see you work on improving parallelism than the custom scan stuff, just so you can prototype further ahead.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-11-10 10:57:16 -0500, Robert Haas wrote: Does parallelism help at all? I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy. I think some of that benefit also could be reaped by being better at hinting the OS...

Yes, it definitely helps, and it is not limited to IO-bound operations. It gives a good gain for queries with CPU-intensive WHERE conditions. One more point we may need to consider: is there any overhead in passing data rows from the workers to the backend? I feel this may play a major role when selectivity is high.

Regards, Hari Babu Fujitsu Australia -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SSL information view
On Tue, Nov 11, 2014 at 1:43 AM, Magnus Hagander mag...@hagander.net wrote: Right now it just truncates the dn at NAMEDATALEN - so treating it the same as we do with hostnames. My guess is this is not a big problem because in the case of long DNs, most of the time the important stuff is at the beginning anyway... (And it's not like it's actually used for authentication, in which case it would of course be a problem). You should add it to the next CF for proper tracking, there are already many patches in the queue waiting for reviews :) -- Michael
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
On 11/10/14, 12:56 PM, Andres Freund wrote: On 2014-11-10 12:37:29 -0600, Jim Nasby wrote: On 11/10/14, 12:15 PM, Andres Freund wrote: If what we want is to quantify the extent of the issue, would it be more convenient to save counters to pgstat? Vacuum already sends pgstat messages, so there's no additional traffic there.

I'm pretty strongly against that one in isolation. They'd need to be stored somewhere and they'd need to be queryable somewhere with enough context to make sense. To actually make sense of the numbers we'd also need to report all the other datapoints of vacuum in some form. That's quite a worthwhile project imo - but *much* *much* more work than this.

We already report statistics on vacuums (lazy_vacuum_rel()/pgstat_report_vacuum), so this would just be adding 1 or 2 counters to that.

Should we add the other counters from vacuum? That would be significantly more data. At the very least it'd require: * The number of buffers skipped due to the vm * The number of buffers actually scanned * The number of full-table in contrast to partial vacuums

If we're going to track full-scan vacuums separately, I think we'd need two separate scan counters. I think (for pgstats) it'd make more sense to just count initial failure to acquire the lock in a full scan in the 'skipped page' counter. In terms of answering the question "how common is it not to get the lock", it's really the same event.

I think it'd require a fair amount of thinking about which values are required to make sense of the number of buffers skipped due to not being able to acquire the cleanup lock. If you want to do this - and I sure don't want to stop you from it - you should look at it from a general perspective, not from the perspective of how skipped cleanup locks are logged.

Honestly, my desire at this point is just to see if there's actually a problem. Many people are asserting that this should be a very rare occurrence, but there's no way to know. Towards that simple end, I'm a bit torn.
My preference would be to simply log, and throw a warning if it's over some threshold. I believe that would give the best odds of getting feedback from users if this isn't as uncommon as we think. I agree that ideally this would be tracked as another stat, but from that standpoint I think there are other, much more important metrics to track, and AFAIK the only reason we don't have them is that busy systems already push pgstats to its limits. We should try to fix that, but that's a much bigger issue. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: Log inability to lock pages during vacuum
Jim Nasby jim.na...@bluetreble.com writes: On 11/10/14, 12:56 PM, Andres Freund wrote: If you want to do this - and I sure don't want to stop you from it - you should look at it from a general perspective, not from the perspective of how skipped cleanup locks are logged. Honestly, my desire at this point is just to see if there's actually a problem. Many people are asserting that this should be a very rare occurrence, but there's no way to know. Towards that simple end, I'm a bit torn. My preference would be to simply log, and throw a warning if it's over some threshold. I believe that would give the best odds of getting feedback from users if this isn't as uncommon as we think. I agree that ideally this would be tracked as another stat, but from that standpoint I think there's other, much more important metrics to track, and AFAIK the only reason we don't have them is that busy systems already push pgstats to it's limits. We should try and fix that, but that's a much bigger issue. Yeah. We know that per-table pgstat counters are a pretty expensive thing in databases with many tables. We should absolutely not be adding them on mere speculation that the number might be interesting. Now, that objection would not apply to a per *database* counter, but I'm not sure if tracking the numbers at that granularity would help anyone. On the whole, I'm +1 for just logging the events and seeing what we learn that way. That seems like an appropriate amount of effort for finding out whether there is really an issue. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On partitioning
Hi, From: Robert Haas [mailto:robertmh...@gmail.com] Sent: Saturday, November 08, 2014 5:41 AM I'd be in favor of that.

Thanks! I am not sure whether the code is close enough to what we need to be really useful, but that's for you to decide.

Hmm, I'm not entirely convinced about the patch as it stands either, but I will try to restate below what the patch in its current state does anyway (just to refresh):

The patch provides syntax to: * Specify a partitioning key, with optional partition definitions, within CREATE TABLE, * A few ALTER TABLE commands that let you define a partitioning key (partitioning a table after the fact), attach/detach an existing table as a partition of a partitioned table, * CREATE PARTITION to create a new partition on a partitioned table.

The above commands are merely transformed into ALTER TABLE subcommands that arrange the partitioned table and partitions into an inheritance hierarchy, but with extra information, that is, the allowed values for the partition, stored in a new anyarray column called 'pg_inherits.values'. A special case of ATExecAddInherit(), namely ATExecAttachPartition(), as part of its processing also adds partition constraints in the form of appropriate CHECK constraints. So, a few of the manual steps are automated and additional (IMHO non-opaque) metadata (namely partition boundaries/list values) is added.

Additionally, defining a partitioning key (PARTITION BY) creates a pg_partition entry that specifies the following for a partitioned table - the partition kind (range/list), an opclass for the key value comparison, and a key 'expression' (say, colname % 10).
A few key things I can think of as needing improvement would be (perhaps just reiterating a review of the patch): * partition pruning would still depend on constraint exclusion using the CHECK constraints (same old) * there is no tuple-routing at all (same can be said of partition pruning above) * partition pruning or tuple-routing would require a scan over pg_inherits (perhaps inefficient) * partitioning key is an expression which might not be a good idea in early stages of the implementation (might be better off with just the attnum of the column to partition on?) * there is no DROP PARTITION (in fact, it is suggested not to go CREATE/DROP PARTITION route at all) - ALTER TABLE ... ADD/DROP PARTITION? Some other important ones: * dependency handling related oversights * constraint propagation related oversights And then some of the oddities of behaviour that I am seeing while trying out things that the patch does. Please feel free to suggest those that I am not seeing. I am sure these improvements need more than just tablecmds.c hacking which is what the current patch mostly does. The first two points could use separate follow-on patches as I feel they need extensive changes unless I am missing something. I will try to post possible solutions to these issues provided metadata in current form is OK to proceed. In my view, the main problem we should be trying to solve here is avoid relying on constraint exclusion. In other words, the syntax for adding a partition should put some metadata into the system catalogs that lets us do partitioning pruning very very quickly, without theorem-proving. For example, for list or range partitioning, a list of partition bounds would be just right: you could binary-search it. The same metadata should also be suitable for routing inserts to the proper partition, and handling partition motion when a tuple is updated. 
Now there's other stuff we might very well want to do, but I think making partition pruning and tuple routing fast would be a pretty big win by itself.

Those are definitely goals worth striving for. Thanks for your time. Regards, Amit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add CREATE support to event triggers
On 11/10/14, 4:58 PM, Andres Freund wrote: But the main reason all this is interesting isn't so much CREATE itself. But that it can be (and Alvaro has mostly done it!) for ALTER as well. And there it imo becomes really interesting. Because you can quite easily check whether the affected relation is being replicated you can just emit the DDL when that's the case. And that makes DDL in a logically replicated setup *much* easier.

+1. Adding columns is a PITA; you have to manually ensure you do it on all slaves first. Drop is somewhat worse, because you have to do it on the master first, the opposite of the (more usual) case of adding a column. RENAME is a complete disaster. Handing scripts to your replication system to execute isn't a very good alternative either; it assumes that you actually have a script (a bad assumption with ORMs), and that you have a reasonable way to get that script to wherever you run your replication system.

I will also weigh in that there are a LOT of cases that binary replication doesn't cover. I find it interesting that prior to creating built-in replication, the community stance was "We won't build that because there's too many different use cases", but now some folks are saying that everyone should just use streaming rep and be done with it. :P -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
Greg Stark wrote: There's another API question I have. To implement Consistent I need to call the hash function which in the case of functions like hashtext could be fairly expensive and I even need to generate multiple hash values (though currently I'm slicing them all from the integer hash value so that's not too bad) and then test each of those bits. It would be natural to call hashtext once at the start of the scan and possibly build a bitmap and compare all of them in a single operation. But afaict there's no way to hook the beginning of the scan and opaque is not associated with the specific scan so I don't think I can cache the hash value of the scan key there safely. Is there a good way to do it with the current API?

I'm not sure why you say opaque is not associated with the specific scan. Are you thinking we could reuse opaque for a future scan? I think we could consider that opaque *is* the place to cache things such as the hashed value of the qual constants or whatever.

On a side note I'm curious about something, I was stepping through my code in gdb and discovered that a single row insert appeared to construct a new summary then union it into the existing summary instead of just calling AddValue on the existing summary. Is that intentional? What led to that?

That's to test the Union procedure; if you look at the code, it's just used in assert-enabled builds. Now that I think about it, perhaps this can turn out to be problematic for your bloom filter opclass. I considered the idea of allowing the opclass to disable this testing procedure, but it isn't done (yet). -- Álvaro Herrera  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Convert query plan to sql query
If what you're wishing for is that you could also capture the effects of planner steps that are in the nature of source-to-source transformations, I think that's going to be harder, because no great effort has been made to keep those at arm's length from steps that don't have results describable as pure SQL. However, you could probably get pretty far if you applied ruleutils.c to the modified parse tree after the constant-folding and join tree simplification phases.

I'm not sure if I understand what you mean by source-to-source transformations. But yes, what I'm aiming at is applying the simplification phases and constant-folding before transforming the query tree back into an SQL text query. Thank you for the suggestions. Best, Mariem -- View this message in context: http://postgresql.nabble.com/Convert-query-plan-to-sql-query-tp5825727p5826448.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-11-10 10:57:16 -0500, Robert Haas wrote: Does parallelism help at all? I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy. I think some of that benefit also could be reaped by being better at hinting the OS...

Yes, it definitely helps, and it is not limited to IO-bound operations. It gives a good gain for queries with CPU-intensive WHERE conditions. One more point we may need to consider: is there any overhead in passing data rows from the workers to the backend?

I am not sure that overhead will be very visible if we improve the use of the I/O subsystem by making parallel tasks work on it. However, another idea here could be that instead of passing tuple data, we just pass the tuple id, but in that case we have to retain the pin on the buffer that contains the tuple until the master backend reads from it, which might have its own kind of problems. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-11-10 10:57:16 -0500, Robert Haas wrote: Does parallelism help at all? I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy. I think some of that benefit also could be reaped by being better at hinting the OS...

Yes, it definitely helps, and it is not limited to IO-bound operations. It gives a good gain for queries with CPU-intensive WHERE conditions. One more point we may need to consider: is there any overhead in passing data rows from the workers to the backend?

I am not sure that overhead will be very visible if we improve the use of the I/O subsystem by making parallel tasks work on it.

I feel there may be an overhead because workers need to put the result data in shared memory and the backend has to read it from there for further processing. If the cost of transferring data from a worker to the backend is more than fetching a tuple from the scan, the overhead becomes visible when selectivity is high.

However, another idea here could be that instead of passing tuple data, we just pass the tuple id, but in that case we have to retain the pin on the buffer that contains the tuple until the master backend reads from it, which might have its own kind of problems.

Transferring the tuple id doesn't handle the scenarios where the node needs any projection. Regards, Hari Babu Fujitsu Australia -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Nov 10, 2014 at 1:48 PM, Peter Geoghegan p...@heroku.com wrote: Reminder: I maintain a slight preference for only offering one suggestion per relation RTE, which is what this revision does (so no change there). If a committer who picks this up wants me to alter that, I don't mind doing so; since only Michael spoke up on this, I've kept things my way. Hm. The last version of this patch has not really changed since my first review, and I have no more feedback to provide about it beyond what I already mentioned. I honestly don't think that this patch is ready for committer as-is... If someone wants to review it further, extra opinions are, I am sure, welcome. -- Michael
Re: [HACKERS] Race in tablespace test on Windows
On Sat, Nov 8, 2014 at 10:34 AM, Noah Misch n...@leadboat.com wrote: In my Windows development environment, the tablespace regression test fails approximately half the time. Buildfarm member frogmouth failed in the same manner at least once: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=frogmouthdt=2014-05-21%2014%3A30%3A01 Here is a briefer command sequence exhibiting the same problem:

CREATE TABLESPACE testspace LOCATION '...somewhere...';
CREATE TABLE atable (c int) tablespace testspace;
SELECT COUNT(*) FROM atable;   -- open heap
\c -
ALTER TABLE atable SET TABLESPACE pg_default;
DROP TABLESPACE testspace;     -- bug: fails sometimes
DROP TABLESPACE testspace;     -- second one ~always works
DROP TABLE atable;

For me, it doesn't succeed even the second time; I keep getting the same error until I execute some command in the first session, which means until the first session has processed the invalidation messages.

postgres=# Drop tablespace tbs;
ERROR: tablespace "tbs" is not empty
postgres=# Drop tablespace tbs;
ERROR: tablespace "tbs" is not empty

I have tested this on Windows 7. When we unlink an open file, Windows retains it in the directory structure until all processes close it. ALTER TABLE SET TABLESPACE sends invalidation messages prompting backends to do so. The backend running the ALTER TABLE always processes invalidations before processing another command. The other backend, the one serving commands before \c -, may have neither exited nor processed the invalidation. When it still holds a file descriptor for atable, the DROP TABLESPACE fails. I suspect it's possible, though more difficult, to see similar trouble in the dbcommands.c users of RequestCheckpoint(CHECKPOINT_WAIT). To make this work as well on Windows as it does elsewhere, DROP TABLESPACE would need to wait for other backends to close relevant unlinked files. Perhaps implement wait_unlinked_files(const char *dirname) to poll unlinked, open files until they disappear.
(An attempt to open an unlinked file reports ERROR_ACCESS_DENIED. It might be tricky to reliably distinguish this cause from other causes of that error, but it should be possible.) I think the proposed mechanism can work, but the wait can be very long (until the backend holding the descriptor executes another command). Can we think of some other solution? For example, in DROP TABLESPACE, instead of checking whether the directory is empty, check that no object belonging to the database/cluster remains in it, and then allow the directory to be forcibly deleted in some way. I propose to add this as a TODO, then bandage the test case with s/^\\c -$/RESET ROLE;/. Yeah, this makes sense. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
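The polling loop Noah proposes could be skeletonized as below. This is only a sketch under stated assumptions: wait_unlinked_files() and the probe callback are hypothetical names, and the Windows-specific detection of unlinked-but-open files (the ERROR_ACCESS_DENIED case) is stubbed out as a function pointer so the control flow can be shown portably.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical probe: returns true while the directory still contains
 * files that were unlinked but are held open by some backend. */
typedef bool (*probe_fn)(const char *dirname);

/* Poll until the probe reports the directory clean, or give up. A real
 * implementation would sleep (e.g. pg_usleep) between polls and use a
 * timeout instead of a bare retry count. */
static bool
wait_unlinked_files(const char *dirname, probe_fn has_open_unlinked,
                    int max_tries)
{
    for (int i = 0; i < max_tries; i++)
    {
        if (!has_open_unlinked(dirname))
            return true;        /* clean: safe to remove the directory */
    }
    return false;               /* caller falls back to "is not empty" */
}

/* Stub probes for illustration only. */
static int  g_polls = 0;

static bool
probe_clears_on_third_poll(const char *dirname)
{
    (void) dirname;
    return ++g_polls < 3;       /* "busy" for two polls, then clean */
}

static bool
probe_always_busy(const char *dirname)
{
    (void) dirname;
    return true;
}
```

In the real DROP TABLESPACE path the probe would attempt to open each remaining file and treat ERROR_ACCESS_DENIED as "still open", per Noah's note that distinguishing that cause from other failures is the tricky part.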
Re: [HACKERS] REINDEX CONCURRENTLY 2.0
On Tue, Nov 11, 2014 at 3:24 AM, Jeff Janes jeff.ja...@gmail.com wrote: On Wed, Nov 5, 2014 at 8:49 PM, Michael Paquier michael.paqu...@gmail.com wrote: On Thu, Oct 30, 2014 at 5:19 PM, Michael Paquier michael.paqu...@gmail.com wrote: Updated patch is attached. Please find attached an updated patch with the following things changed:
- Addition of tab completion in psql for all new commands
- Addition of a call to WaitForLockers in index_concurrent_swap to ensure that there are no transactions still running on the parent table before exclusive locks are taken on the index and its concurrent entry. Previous patch versions created deadlocks because of that, an issue spotted by the isolation tests integrated in the patch.
- Isolation tests for REINDEX CONCURRENTLY are re-enabled by default.
Regards, It looks like this needs another rebase; I get failures on index.c, toasting.c, indexcmds.c, and index.h Indeed. There are some conflicts created by the recent modification of index_create. Here is a rebased patch. -- Michael

diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index cd55be8..653b120 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -864,7 +864,8 @@ ERROR: could not serialize access due to read/write dependencies among transact
     <para>
      Acquired by <command>VACUUM</command> (without <option>FULL</option>),
-     <command>ANALYZE</>, <command>CREATE INDEX CONCURRENTLY</>, and
+     <command>ANALYZE</>, <command>CREATE INDEX CONCURRENTLY</>,
+     <command>REINDEX CONCURRENTLY</>, <command>ALTER TABLE VALIDATE</command> and
      other <command>ALTER TABLE</command> variants (for full details see
      <xref linkend="SQL-ALTERTABLE">).
@@ -1143,7 +1144,7 @@ ERROR: could not serialize access due to read/write dependencies among transact
  <sect2 id="locking-pages">
   <title>Page-level Locks</title>
-
+
   <para>
    In addition to table and row locks, page-level share/exclusive locks are
    used to control read/write access to table pages in the shared buffer
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index cabae19..285f3ff 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 <refsynopsisdiv>
<synopsis>
-REINDEX { INDEX | TABLE | DATABASE | SYSTEM } <replaceable class="PARAMETER">name</replaceable> [ FORCE ]
+REINDEX { INDEX | TABLE | DATABASE | SYSTEM } [ CONCURRENTLY ] <replaceable class="PARAMETER">name</replaceable> [ FORCE ]
</synopsis>
</refsynopsisdiv>
@@ -68,9 +68,12 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } <replaceable class="PARAMETER">nam
   An index build with the <literal>CONCURRENTLY</> option failed, leaving
   an <quote>invalid</> index. Such indexes are useless but it can be
   convenient to use <command>REINDEX</> to rebuild them. Note that
-  <command>REINDEX</> will not perform a concurrent build. To build the
-  index without interfering with production you should drop the index and
-  reissue the <command>CREATE INDEX CONCURRENTLY</> command.
+  <command>REINDEX</> will perform a concurrent build if <literal>
+  CONCURRENTLY</> is specified. To build the index without interfering
+  with production you should drop the index and reissue either the
+  <command>CREATE INDEX CONCURRENTLY</> or <command>REINDEX CONCURRENTLY</>
+  command. Indexes of toast relations can be rebuilt with <command>REINDEX
+  CONCURRENTLY</>.
  </para>
 </listitem>
@@ -139,6 +142,21 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } <replaceable class="PARAMETER">nam
 </varlistentry>

 <varlistentry>
+  <term><literal>CONCURRENTLY</literal></term>
+  <listitem>
+   <para>
+    When this option is used, <productname>PostgreSQL</> will rebuild the
+    index without taking any locks that prevent concurrent inserts,
+    updates, or deletes on the table; whereas a standard reindex build
+    locks out writes (but not reads) on the table until it's done.
+    There are several caveats to be aware of when using this option
+    &mdash; see <xref linkend="SQL-REINDEX-CONCURRENTLY"
+    endterm="SQL-REINDEX-CONCURRENTLY-title">.
+   </para>
+  </listitem>
+ </varlistentry>
+
+ <varlistentry>
  <term><literal>FORCE</literal></term>
  <listitem>
   <para>
@@ -218,6 +236,194 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } <replaceable class="PARAMETER">nam
    reindex anything.
   </para>
+
+ <refsect2 id="SQL-REINDEX-CONCURRENTLY">
+  <title id="SQL-REINDEX-CONCURRENTLY-title">Rebuilding Indexes Concurrently</title>
+
+  <indexterm zone="SQL-REINDEX-CONCURRENTLY">
+   <primary>index</primary>
+   <secondary>rebuilding concurrently</secondary>
+  </indexterm>
+
+  <para>
+   Rebuilding an index can interfere with regular operation of a database.
+   Normally <productname>PostgreSQL</> locks the table whose index is rebuilt
+   against writes and performs the entire index build with a single scan of the
+   table. Other transactions can still read the table, but if they try to
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
On Tue, Nov 11, 2014 at 9:42 AM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-11-10 10:57:16 -0500, Robert Haas wrote: Does parallelism help at all? I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy. I think some of that benefit could also be reaped by being better at hinting the OS... Yes, it definitely helps, but it is not limited to I/O-bound operations; it also gives a good gain for queries with CPU-intensive WHERE conditions. One more point we may need to consider: is there any overhead in passing data rows from the workers to the backend? I am not sure that overhead will be very visible if we improve the use of the I/O subsystem by making parallel tasks work on it. I feel there may be an overhead because the workers need to put the result data in shared memory and the backend has to read it from there to process it further. If the cost of transferring data from a worker to the backend is higher than fetching a tuple from the scan, the overhead becomes visible when selectivity is high. However, another idea here could be that instead of passing tuple data, we just pass a tuple id; but in that case we have to retain the pin on the buffer that contains the tuple until the master backend reads from it, which might have its own kind of problems. Transferring a tuple id doesn't cover scenarios where the node needs any projection. Hmm, that's why I said that we need to retain the buffer pin, so that we can get the tuple data. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [v9.5] Custom Plan API
On Mon, Nov 10, 2014 at 6:33 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Nov 10, 2014 at 6:55 AM, Amit Kapila amit.kapil...@gmail.com wrote: I thought that in general, if the user has an API to register the custom path methods, there should be some way to unregister them, and the user might also need to register different custom path methods after unregistering the previous ones. I think we should see what Robert or others have to say about this point before trying to provide such an API. I wouldn't bother. As KaiGai says, if you want to shut the functionality off, the provider itself can provide a GUC. Also, we really have made no effort to ensure that loadable modules can be safely unloaded, or hooked functions safely unhooked. ExecutorRun_hook is a good example. Typical of hook installation is this: prev_ExecutorRun = ExecutorRun_hook; ExecutorRun_hook = pgss_ExecutorRun; In this case, the extension takes care of registering and unregistering the hook: in _PG_init() it registers the hook, and in _PG_fini() it unregisters it. So if core PG provides an API for custom scans to register the methods, shouldn't it provide an API to unregister them as well? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
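The register/unregister convention under discussion can be shown in miniature with the self-contained model below. It is only a sketch: the names mirror the real ExecutorRun_hook pattern, but the types are simplified stand-ins (the real hook takes a QueryDesc and friends, and _PG_init()/_PG_fini() are the real entry points).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the real hook signature. */
typedef const char *(*executor_run_hook_type)(const char *query);

static executor_run_hook_type ExecutorRun_hook = NULL;  /* core's hook slot */
static executor_run_hook_type prev_ExecutorRun = NULL;  /* saved by extension */

static const char *
standard_ExecutorRun(const char *query)
{
    (void) query;
    return "standard";
}

/* Core always goes through the hook slot if it is set. */
static const char *
run_query(const char *query)
{
    if (ExecutorRun_hook)
        return ExecutorRun_hook(query);
    return standard_ExecutorRun(query);
}

static const char *
my_ExecutorRun(const char *query)
{
    /* Extension work happens here; then chain to the previous hook. */
    if (prev_ExecutorRun)
        return prev_ExecutorRun(query);
    return "hooked";            /* marker so the model's effect is visible */
}

static void
extension_init(void)            /* analogue of _PG_init() */
{
    prev_ExecutorRun = ExecutorRun_hook;
    ExecutorRun_hook = my_ExecutorRun;
}

static void
extension_fini(void)            /* analogue of _PG_fini(): unregister */
{
    ExecutorRun_hook = prev_ExecutorRun;
}
```

The point of Amit's question is that this save-and-restore dance is entirely the extension's responsibility today; a core-provided register API for custom scan methods without a matching unregister call would break the symmetry.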
Re: [HACKERS] [GSoC2014] Patch ALTER TABLE ... SET LOGGED
Thanks for the updated patch, Fabrizio. On Tue, Nov 11, 2014 at 7:44 AM, Fabrízio de Royes Mello fabriziome...@gmail.com wrote: A couple of minor comments about this patch: 1) Reading it, I am wondering if it would not finally be time to switch to a macro to get a relation's persistence, something like RelationGetPersistence in rel.h... Not related directly to this patch. Good idea... I'll provide a patch soon. I created a thread dedicated to that: http://www.postgresql.org/message-id/cab7npqseb+hwitxfwkqypa7+9bjolnxio47qsro3hcbsoq0...@mail.gmail.com Now few people cared enough to comment :) 2) reindex_index has as a new argument a relpersistence value for the new index. reindex_relation, differently, has a new set of flags to enforce the relpersistence of all the underlying indexes. Wouldn't it be better for API consistency to pass a relpersistence value directly through reindex_relation? In any case, the comment block of reindex_relation does not contain a description of the new flags. I did it because ReindexDatabase builds a list of OIDs to run reindex_relation for each item of the list. I can change the list to store the relpersistence also, but this can lead us to concurrency trouble: if one session executes REINDEX DATABASE and another session runs ALTER TABLE name SET {LOGGED|UNLOGGED} while the list is being built, the reindex can lead us to an inconsistent state. Fair point. I forgot this code path. Except for a typo, s/rebuilded/rebuilt/, corrected in the patch attached with some extra comments bundled, it seems to me that this patch can be passed to a committer, so marking it so.
Btw, perhaps this diff should be pushed as a different patch, as this is a rather different thing:

- if (heapRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+ if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
      !smgrexists(indexRelation->rd_smgr, INIT_FORKNUM)

As this is a correctness fix, it does not seem necessary to back-patch it: the parent relation always has the same persistence as its indexes. Regards, -- Michael

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 912038a..a0a81e8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1980,7 +1980,7 @@ index_build(Relation heapRelation,
 	 * created it, or truncated twice in a subsequent transaction, the
 	 * relfilenode won't change, and nothing needs to be done here.
 	 */
-	if (heapRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
+	if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
 		!smgrexists(indexRelation->rd_smgr, INIT_FORKNUM))
 	{
 		RegProcedure ambuildempty = indexRelation->rd_am->ambuildempty;
@@ -3130,7 +3130,7 @@ IndexGetRelation(Oid indexId, bool missing_ok)
  * reindex_index - This routine is used to recreate a single index
  */
 void
-reindex_index(Oid indexId, bool skip_constraint_checks)
+reindex_index(Oid indexId, bool skip_constraint_checks, char relpersistence)
 {
 	Relation	iRel,
 				heapRelation;
@@ -3191,6 +3191,9 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
 		indexInfo->ii_ExclusionStrats = NULL;
 	}

+	/* Set the relpersistence of the new index */
+	iRel->rd_rel->relpersistence = relpersistence;
+
 	/* We'll build a new physical relation for the index */
 	RelationSetNewRelfilenode(iRel, InvalidTransactionId, InvalidMultiXactId);
@@ -3310,6 +3313,12 @@ reindex_index(Oid indexId, bool skip_constraint_checks)
  * performance, other callers should include the flag only after transforming
  * the data in a manner that risks a change in constraint validity.
  *
+ * REINDEX_REL_FORCE_UNLOGGED: if true, set the persistence of the rebuilt
+ * indexes to unlogged.
+ *
+ * REINDEX_REL_FORCE_LOGGED: if true, set the persistence of the rebuilt
+ * indexes to permanent.
+ *
  * Returns true if any indexes were rebuilt (including toast table's index
  * when relevant). Note that a CommandCounterIncrement will occur after each
  * index rebuild.
@@ -3389,11 +3398,19 @@ reindex_relation(Oid relid, int flags)
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
+		char		relpersistence = rel->rd_rel->relpersistence;

 		if (is_pg_class)
 			RelationSetIndexList(rel, doneIndexes, InvalidOid);

-		reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS));
+		/* Check for flags to enforce UNLOGGED or PERMANENT persistence */
+		if (flags & REINDEX_REL_FORCE_UNLOGGED)
+			relpersistence = RELPERSISTENCE_UNLOGGED;
+		else if (flags & REINDEX_REL_FORCE_PERMANENT)
+			relpersistence = RELPERSISTENCE_PERMANENT;
+
+		reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
+					  relpersistence);

 		CommandCounterIncrement();
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 6a578ec..e067abc 100644
--- a/src/backend/commands/cluster.c
+++
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Nov 10, 2014 at 8:13 PM, Michael Paquier michael.paqu...@gmail.com wrote: Hm. The last version of this patch has not really changed since my first review, and I have no more feedback to provide about it beyond what I already mentioned. I honestly don't think that this patch is ready for committer as-is... If someone wants to review it further, extra opinions are, I am sure, welcome. Why not? You've already said that you're happy to defer to whatever committer picks this up with regard to whether or not more than a single suggestion can come from an RTE. I agreed with this (i.e. I said I'd defer to their opinion too), and once again drew particular attention to this state of affairs alongside my most recent revision. What does that leave? -- Peter Geoghegan
[HACKERS] Missing line for postgresql.auto.conf?
Hackers, I thought 9.4's postgresql.conf.sample was supposed to have a commented-out line for postgresql.auto.conf? In the includes section? It's not there. If we don't give folks a commented-out line to uncomment, it'll be pretty hard for them to figure out auto.conf ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()
On Mon, Nov 10, 2014 at 10:29 PM, Peter Geoghegan p...@heroku.com wrote: Why not? You've already said that you're happy to defer to whatever committer picks this up with regard to whether or not more than a single suggestion can come from an RTE. I agreed with this (i.e. I said I'd defer to their opinion too), and once again drew particular attention to this state of affairs alongside my most recent revision. What does that leave? I see you've marked this Needs Review, even though you previously marked it Ready for Committer a few months back (Robert marked it Waiting on Author very recently because of the compiler warning, and then I marked it back to Ready for Committer once that was addressed, before you finally marked it back to Needs Review and removed yourself as the reviewer just now). I'm pretty puzzled by this. Other than our agree-to-disagree and defer-to-committer position on the question of whether or not more than one suggestion can come from a single RTE, which you were fine with before [1], I have only restored the core/contrib separation to a state recently suggested by Robert as the best and simplest all around [2]. Did I miss something else? [1] http://www.postgresql.org/message-id/CAB7nPqQObEeQ298F0Rb5+vrgex5_r=j-bvqzgp0qa1y_xdc...@mail.gmail.com [2] http://www.postgresql.org/message-id/ca+tgmoykiiq8mc0uj5i5xfktybg1qqfn4yrckz60ydunumk...@mail.gmail.com -- Peter Geoghegan
Re: [HACKERS] Missing line for postgresql.auto.conf?
On 11/10/2014 10:48 PM, Josh Berkus wrote: Hackers, I thought 9.4's postgresql.conf.sample was supposed to have a commented-out line for postgresql.auto.conf? In the includes section? It's not there. If we don't give folks a commented-out line to uncomment, it'll be pretty hard for them to figure out auto.conf ... Never mind. In all of the changes I lost track of the fact that we dropped this and aren't requiring an include line. Was comparing it with an earlier snapshot ... -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] WAL format and API changes (9.5)
On Tue, Nov 11, 2014 at 4:29 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote: Attached is a new version. It's rebased on current git master, including BRIN. I've also fixed the laundry list of small things you reported, as well as a bunch of bugs I uncovered during my own testing. This patch needs a small rebase; it has been broken by a590f266, which fixed WAL replay for BRIN indexes:

patching file src/backend/access/brin/brin_xlog.c
Hunk #2 FAILED at 42.
Hunk #3 FAILED at 91.

This will facilitate testing as well. Regards -- Michael
Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan
-----Original Message-----
From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Haribabu Kommi
Sent: Tuesday, November 11, 2014 1:13 PM
To: Amit Kapila
Cc: Andres Freund; Robert Haas; Simon Riggs; Tom Lane; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

On Tue, Nov 11, 2014 at 2:35 PM, Amit Kapila amit.kapil...@gmail.com wrote: On Tue, Nov 11, 2014 at 5:30 AM, Haribabu Kommi kommi.harib...@gmail.com wrote: On Tue, Nov 11, 2014 at 10:21 AM, Andres Freund and...@2ndquadrant.com wrote: On 2014-11-10 10:57:16 -0500, Robert Haas wrote: Does parallelism help at all? I'm pretty damn sure. We can't even make mildly powerful storage fully busy right now. Heck, I can't make my workstation's storage with a raid 10 out of four spinning disks fully busy. I think some of that benefit could also be reaped by being better at hinting the OS... Yes, it definitely helps, but it is not limited to I/O-bound operations; it also gives a good gain for queries with CPU-intensive WHERE conditions. One more point we may need to consider: is there any overhead in passing data rows from the workers to the backend? I am not sure that overhead will be very visible if we improve the use of the I/O subsystem by making parallel tasks work on it. I feel there may be an overhead because the workers need to put the result data in shared memory and the backend has to read it from there to process it further. If the cost of transferring data from a worker to the backend is higher than fetching a tuple from the scan, the overhead becomes visible when selectivity is high. In my experience, data copy and transformation to fit TupleTableSlot is the biggest overhead, rather than the scan or join itself... Probably, a straightforward way is to construct an array of values/isnull on a shared memory segment; then the backend process just switches the tts_values/tts_isnull pointers, with no data copy.
It gave us a performance gain. Thanks, -- NEC OSS Promotion Center / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com
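KaiGai's pointer-switching idea can be modeled in miniature as follows. This is a sketch under stated assumptions: SlotModel is a simplified stand-in for TupleTableSlot (the real executor has much more bookkeeping, and shared-memory addressing is ignored here); the point is only that repointing tts_values/tts_isnull is O(1) where a per-datum copy is O(natts).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long Datum;    /* simplified stand-in */

/* Simplified stand-in for the values/isnull part of TupleTableSlot. */
typedef struct SlotModel
{
    int    natts;
    Datum *tts_values;
    bool  *tts_isnull;
} SlotModel;

/* Zero-copy "store": the backend repoints its slot at the arrays the
 * worker built in shared memory, instead of copying each datum. O(1)
 * regardless of the number of attributes. */
static void
slot_point_at_shared(SlotModel *slot, Datum *shared_values,
                     bool *shared_isnull, int natts)
{
    slot->natts = natts;
    slot->tts_values = shared_values;
    slot->tts_isnull = shared_isnull;
}
```

The copy-based alternative would loop over natts attributes and duplicate each datum into backend-local arrays; avoiding that loop is where KaiGai reports the gain, at the cost of the shared arrays having to stay valid for as long as the slot references them.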