Re: [HACKERS] Pipelining executions to postgresql server
On 11/04/2014 07:56 AM, Mikko Tiihonen wrote:
> I also think the async I/O is the way to go. Luckily that has already been
> done in the pgjdbc-ng (https://github.com/impossibl/pgjdbc-ng), built on top
> of the Netty Java NIO library. It has quite good feature parity with the
> original pgjdbc driver.

Huh, it does seem to now. The big omission, unless I'm blind, is support for COPY. (It also lacks support for JDBC3 and JDBC4, requiring JDK 7 and JDBC 4.1, but that's reasonable enough.) It now has LISTEN/NOTIFY by the looks of it, and of course it's built around binary protocol support. I freely admit I haven't looked at it too closely.

I'm a tad frustrated by the parallel development of a different driver when the "official" driver has so much low-hanging fruit to improve. I have to say I like some aspects of how pgjdbc-ng is being done - in particular the use of Maven and being built around non-blocking I/O. OTOH, I find the use of Google Guava pretty inexplicable in a JDBC driver, and since it hard-depends on JDK 7 I don't understand why Netty was used instead of the async / non-blocking features of the core JVM. That may simply be my inexperience with writing non-blocking socket I/O code in Java, though. I'm also concerned about how it'll handle new JDBC versions, as it seems to lack the JDBC interface abstraction of the core driver.

> I do not think the JDBC batch interface even allows doing updates to multiple
> tables when using prepared statements?

Ah, I missed this before. That's correct. You get prepared statements _or_ multiple different statements. That's a more understandable reason to concoct a new API, and explains why you're not just solving the issues with batches. Though I still think you're going to have to fix the buffer management code before you do anything like this.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
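[Editor's note: for readers who haven't used them, the "async / non-blocking features of the core JVM" mentioned above are the NIO.2 asynchronous channels added in JDK 7. Below is a minimal, purely illustrative sketch of their shape; it is not code from either driver, and the host, port and handling logic are invented.]

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;

// Minimal illustration of JDK 7 NIO.2 asynchronous sockets, the "core JVM"
// alternative to Netty referred to above. Not pgjdbc-ng code.
public class Nio2Sketch {
    public static void main(String[] args) throws Exception {
        final AsynchronousSocketChannel channel = AsynchronousSocketChannel.open();
        channel.connect(new InetSocketAddress("localhost", 5432)).get(); // block only for the demo

        final ByteBuffer buf = ByteBuffer.allocate(8192);
        channel.read(buf, null, new CompletionHandler<Integer, Void>() {
            @Override
            public void completed(Integer bytesRead, Void attachment) {
                // A real driver would parse protocol messages from buf here,
                // then issue the next read without ever blocking a thread.
                buf.clear();
                channel.read(buf, null, this);
            }

            @Override
            public void failed(Throwable exc, Void attachment) {
                exc.printStackTrace();
            }
        });

        Thread.sleep(1000); // keep the demo JVM alive briefly
    }
}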
Re: [JDBC] [HACKERS] Pipelining executions to postgresql server
On 11/03/2014 07:05 AM, Scott Harrington wrote:
> This avoids the need for a Future, and avoids the client having to
> loop/sleep until done.

A Future is the logical way to represent an asynchronous operation in Java. Why implement something else that doesn't fit into existing libraries and tools when there's already a standard interface? If you're breaking outside the stock JDBC model and thinking asynchronously, then using the existing Java APIs for asynchrony makes sense to me.

In terms of looping until done ... what we really need is a reliably non-blocking PGConnection.consumeInput() so apps can call it in their main loops, or via a helper thread. Then we'd be able to add async notification callbacks too. To do that we need to make PGstream use a java.nio.Channel instead of a java.net.Socket.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
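[Editor's note: to make the Future point concrete, here is a minimal sketch - an illustration, not an existing PgJDBC API - of exposing a query as a plain java.util.concurrent.Future. It fakes asynchrony with a worker thread rather than true non-blocking I/O, but the caller sees the same standard get()/isDone()/cancel() contract that a genuinely non-blocking driver could return. The query and column are invented.]

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical wrapper: callers get the standard Future contract instead of
// a custom polling API, so it composes with existing executor-based code.
public class AsyncQuery {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();

    public Future<List<String>> fetchNames(final Connection conn) {
        return pool.submit(new Callable<List<String>>() {
            public List<String> call() throws Exception {
                List<String> names = new ArrayList<String>();
                Statement st = conn.createStatement();
                try {
                    ResultSet rs = st.executeQuery("SELECT name FROM users");
                    while (rs.next())
                        names.add(rs.getString(1));
                } finally {
                    st.close();
                }
                return names;
            }
        });
    }
}

A caller can then block with future.get(), poll isDone(), or hand the Future to any tooling that already understands java.util.concurrent.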
Re: [HACKERS] Pipelining executions to postgresql server
On 11/04/2014 07:56 AM, Mikko Tiihonen wrote:
> I do not quite grasp why not sending Sync is so important.

Well, at the moment the PgJDBC driver relies on the following flow to manage its buffers and avoid a logjam where both server and client are waiting for the other to consume input:

* Send some stuff
* Sync
* Consume input until Sync response received

... since at this point it knows the server's send buffer should be empty, or near enough.

There's no fundamental reason why you *can't* send a Sync after each query, AFAIK. You just have to change how the driver handles Syncs so it knows that there might be more work pipelined. Or fix the driver's buffer management, as described in the GitHub issues I linked.

It doesn't make much sense to do so unless you're running in autocommit mode, though, and I'm not convinced pipelining in autocommit mode makes a lot of sense. The biggest problem is exception handling - you can't throw an exception for a statement that executed in the past. So you need a new interface for pipelining... or you can just use the existing batch interface.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
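[Editor's note: purely for illustration, the "send some stuff, Sync, consume until the Sync response" flow described above could be sketched like this. Every type and helper here is a stub invented for the example; this is not the actual PgJDBC code.]

import java.io.IOException;
import java.util.List;

// Pseudo-code sketch of the "send everything, Sync once, then drain" flow.
// Real drivers also have to bound how much they send before draining, or the
// two sides can deadlock on full buffers - which is the issue being discussed.
public class PipelineSketch {
    static class Message {
        final char type;   // e.g. 'T' row description, 'D' data row, 'Z' ReadyForQuery
        Message(char type) { this.type = type; }
    }

    void runPipeline(List<String> queries) throws IOException {
        for (String q : queries)
            sendParseBindExecute(q);    // P/B/D/E for each query, no Sync yet

        sendSync();                     // one Sync for the whole pipeline
        flushSendBuffer();

        // Drain replies until the ReadyForQuery ('Z') that answers our Sync;
        // at that point the server's send buffer should be empty, or nearly so.
        while (true) {
            Message m = readMessage();
            dispatchToResultHandler(m); // rows, command completions, errors...
            if (m.type == 'Z')
                break;
        }
    }

    // Stubs standing in for real protocol I/O.
    private void sendParseBindExecute(String q) throws IOException {}
    private void sendSync() throws IOException {}
    private void flushSendBuffer() throws IOException {}
    private Message readMessage() throws IOException { return new Message('Z'); }
    private void dispatchToResultHandler(Message m) {}
}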
Re: [JDBC] [HACKERS] Pipelining executions to postgresql server
On 11/03/2014 07:05 AM, Scott Harrington wrote: > I looked over your patch. Your list of ResultHandlerHolders seems to be > the right direction, but as Tom Lane mentioned there may need to be some > way to ensure the statements are all in the same transaction. Why go down this track, when we already have batching? Just use the standard JDBC API batching facilities and add the missing support for batching queries that return result sets in PgJDBC. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
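[Editor's note: for reference, the standard JDBC batching facilities referred to here look like this. The sketch assumes an already-open Connection, and the table and column names are invented.]

import java.sql.Connection;
import java.sql.Statement;

// Plain JDBC batching: several statements, even against different tables,
// queued up and sent together by drivers that support it.
public class BatchExample {
    public static void runBatch(Connection conn) throws Exception {
        conn.setAutoCommit(false);
        Statement st = conn.createStatement();
        try {
            st.addBatch("INSERT INTO audit_log(msg) VALUES ('started')");
            st.addBatch("UPDATE accounts SET balance = balance - 10 WHERE id = 1");
            st.addBatch("UPDATE accounts SET balance = balance + 10 WHERE id = 2");

            int[] counts = st.executeBatch();  // one update count (or SUCCESS_NO_INFO) per statement
            conn.commit();

            for (int c : counts)
                System.out.println("rows affected: " + c);
        } catch (Exception e) {
            conn.rollback();
            throw e;
        } finally {
            st.close();
        }
    }
}

What the thread is asking for is the missing piece on top of this: letting batched statements hand back result sets as well as update counts.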
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
Josh, Do you have a list of what needs to be done to keep the MONEY type? What is wrong with it? Thanks, -cktan On Mon, Nov 3, 2014 at 10:30 PM, Feng Tian wrote: > Hi, > > This is Feng from Vitesse. Performance different between Money and Numeric > is *HUGE*. For TPCH Q1, the performance difference is 5x for stock > postgres, and ~20x for vitesse. > > Stock postgres, for my laptop, TPCH 1G, Q1, use money type ~ 9s, use Numeric > (15, 2) is ~53s. > > Kevin, > test=# do $$ begin perform sum('1.01'::numeric) from > generate_series(1,1000); end; $$; > > This may not reflect the difference of the two data type. One aggregate is > not where most of the time is spent. TPCH Q1 has many more computing. > > > > > > > > > > > > > > > > > > On Mon, Nov 3, 2014 at 4:54 AM, Michael Banck > wrote: >> >> Am Sonntag, den 02.11.2014, 12:41 -0500 schrieb Tom Lane: >> > BTW, after reflecting a bit more I'm less than convinced that this >> > datatype is completely useless. Even if you prefer to store currency >> > values in numeric columns, casting to or from money provides a way to >> > accept or emit values in whatever monetary format the LC_MONETARY locale >> > setting specifies. That seems like a useful feature, and it's one you >> > could not easily duplicate using to_char/to_number (not to mention that >> > those functions aren't without major shortcomings of their own). >> >> As an additional datapoint, Vitesse Data changed the DB schema from >> NUMERIC to MONEY for their TPCH benchmark for performance reasons: "The >> modification to data types is easy to understand -- money and double >> types are faster than Numeric (and no one on this planet has a bank >> account that overflows the money type, not any time soon)."[1] And >> "Replaced NUMERIC fields representing currency with MONEY"[2]. >> >> Not sure whether they modified/optimized PostgreSQL with respect to the >> MONEY data type and/or how much performance that gained, so CCing CK Tan >> as well. >> >> >> Michael >> >> [1] >> http://vitesse-timing-on.blogspot.de/2014/10/running-tpch-on-postgresql-part-1.html >> [2] http://vitessedata.com/benchmark/ >> >> -- >> Michael Banck >> Projektleiter / Berater >> Tel.: +49 (2161) 4643-171 >> Fax: +49 (2161) 4643-100 >> Email: michael.ba...@credativ.de >> >> credativ GmbH, HRB Mönchengladbach 12080 >> USt-ID-Nummer: DE204566209 >> Hohenzollernstr. 133, 41061 Mönchengladbach >> Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer >> >> >> >> -- >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-hackers > > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
On 11/03/2014 03:41 AM, Tom Lane wrote: > Nothing that I recall at the moment, but there is certainly plenty of > stuff of dubious quality in there. I'd argue that chkpass, intagg, > intarray, isn, spi, and xml2 are all in worse shape than the money type. What's wrong with intarray? I'm aware of issues with the rest (and I think citext has a few of its own too) but I didn't think intarray had any real issues. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
Hi,

This is Feng from Vitesse. The performance difference between money and numeric is *HUGE*. For TPCH Q1, the performance difference is 5x for stock Postgres, and ~20x for Vitesse. On my laptop with stock Postgres, TPCH 1G, Q1 takes ~9s using the money type and ~53s using numeric(15,2).

Kevin,

test=# do $$ begin perform sum('1.01'::numeric) from generate_series(1,1000); end; $$;

This may not reflect the difference between the two data types. A single aggregate is not where most of the time is spent; TPCH Q1 does much more computation.

On Mon, Nov 3, 2014 at 4:54 AM, Michael Banck wrote:
> On Sunday, 02.11.2014, at 12:41 -0500, Tom Lane wrote:
> > BTW, after reflecting a bit more I'm less than convinced that this
> > datatype is completely useless. Even if you prefer to store currency
> > values in numeric columns, casting to or from money provides a way to
> > accept or emit values in whatever monetary format the LC_MONETARY locale
> > setting specifies. That seems like a useful feature, and it's one you
> > could not easily duplicate using to_char/to_number (not to mention that
> > those functions aren't without major shortcomings of their own).
>
> As an additional datapoint, Vitesse Data changed the DB schema from
> NUMERIC to MONEY for their TPCH benchmark for performance reasons: "The
> modification to data types is easy to understand -- money and double
> types are faster than Numeric (and no one on this planet has a bank
> account that overflows the money type, not any time soon)."[1] And
> "Replaced NUMERIC fields representing currency with MONEY"[2].
>
> Not sure whether they modified/optimized PostgreSQL with respect to the
> MONEY data type and/or how much performance that gained, so CCing CK Tan
> as well.
>
> Michael
>
> [1] http://vitesse-timing-on.blogspot.de/2014/10/running-tpch-on-postgresql-part-1.html
> [2] http://vitessedata.com/benchmark/
>
> --
> Michael Banck
> Projektleiter / Berater
> Tel.: +49 (2161) 4643-171
> Fax: +49 (2161) 4643-100
> Email: michael.ba...@credativ.de
>
> credativ GmbH, HRB Mönchengladbach 12080
> USt-ID-Nummer: DE204566209
> Hohenzollernstr. 133, 41061 Mönchengladbach
> Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
Hello , Please find updated patch with the review comments given above implemented. The compressed data now includes all backup blocks and their headers except the header of first backup block in WAL record. The first backup block header in WAL record is used to store the compression information. This is done in order to avoid adding compression information in WAL record header. Memory allocation on SIGHUP in autovacuum is remaining. Working on it. Thank you, Rahila Syed On Tue, Oct 28, 2014 at 8:51 PM, Rahila Syed wrote: > > >>>Do we release the buffers for compressed data when fpw is changed from > "compress" to "on"? > >> The current code does not do this. > >Don't we need to do that? > Yes this needs to be done in order to avoid memory leak when compression is > turned off at runtime while the backend session is running. > > >You don't need to make the processes except the startup process allocate > >the memory for uncompressedPages when fpw=on. Only the startup process > >uses it for the WAL decompression > I see. fpw != on check can be put at the time of memory allocation of > uncompressedPages in the backend code . And at the time of recovery > uncompressedPages can be allocated separately if not already allocated. > > >BTW, what happens if the memory allocation for uncompressedPages for > >the recovery fails? > The current code does not handle this. This will be rectified. > > >Which would prevent the recovery at all, so PANIC should > >happen in that case? > IIUC, instead of reporting PANIC , palloc can be used to allocate memory > for uncompressedPages at the time of recovery which will throw ERROR and > abort startup process in case of failure. > > > Thank you, > Rahila Syed > > > > -- > View this message in context: > http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5824613.html > Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > compress_fpw_v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TAP test breakage on MacOS X
On Mon, Nov 03, 2014 at 04:40:39PM -0500, Tom Lane wrote: > Peter Eisentraut writes: > > I thank you for this research, but I suggest that we ship 9.4 as is, > > that is with requiring IPC::Run and --enable-* option. All the possible > > alternatives will clearly need more rounds of portability testing. We > > can then evaluate these changes for 9.5 in peace. > > I concur --- it's time to get 9.4 out, and the current state of affairs > is Good Enough imo. Agreed; any release-blocking aspects of this test suite are behind us. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
Michael Paquier writes: > There is as well another way: finally support WAL-logging for hash indexes. Sure, but that's a bit easier said than done. I think Robert looked into that a year or two back and found some stumbling blocks that it wasn't obvious how to surmount. I do hope it will happen eventually, but we should assume that it's a long-term answer not a short-term one. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
Josh Berkus writes: > On 11/02/2014 11:41 AM, Tom Lane wrote: >> Nothing that I recall at the moment, but there is certainly plenty of >> stuff of dubious quality in there. I'd argue that chkpass, intagg, >> intarray, isn, spi, and xml2 are all in worse shape than the money type. > Why are we holding on to xml2 again? IIRC, there's some xpath-related functionality in there that's not yet available in core (and needs some redesign before it'd ever get accepted into core, so there's not a real quick fix to be had). regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #11867: Strange behaviour with composite types after resetting database tablespace
m...@bloodnok.com writes: > I have a script (below) that sets and resets the tablespace for a database > and drops and recreates a composite type. The script fails when trying to > re-create the previously dropped composite type because the type has > apparently been magically resuscitated by changing the tablespace for the > database. I assume this is a bug. Fascinating. I believe what is happening is: * You create the composite type. This results in entries in the database's pg_class and pg_type tables. These entries are in blocks of those tables that are in shared buffers, with hashtag RelFileNodes that mention the database's original tablespace (tbs3). * You move the database to another tablespace. We faithfully flush the dirty catalog blocks to disk and then copy them somewhere else on disk. * You drop the composite type. This entails fetching in relevant blocks of the database's pg_class and pg_type tables from their new tablespace and marking the catalog rows deleted. This happens in shared buffers that have hashtags mentioning the database's new tablespace (pg_default, in this example). * You move the database back to its original tablespace. We faithfully flush the dirty catalog blocks to disk and then copy them back to the original on-disk location. However: at no point in this sequence did we flush the original catalog blocks out of shared buffers. Therefore, a read of the database's pg_class or pg_type at this point is likely to find the previous states of those pages (pre-composite-type-drop) as valid, matching entries, so that we skip reading the up-to-date page contents back from disk. A quick fix for this would be to issue DropDatabaseBuffers during movedb(), though I'm not sure offhand where in that sequence is the safest place for it. We could be more restrictive and only drop buffers that belong to the source tablespace, but I'm not sure it's worth any extra code to do so. > This is not mission-critical for me but I'd be grateful for suggestions for > work-arounds. I don't see any workaround that's much easier than fixing the bug. But what's your use case that involves flipping databases from one tablespace to another and then back again within the life expectancy of unused shared-buffers pages? It doesn't seem terribly surprising that nobody noticed this before ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: PENDING_LIST_CLEANUP_SIZE - maximum size of GIN pending list Re: [HACKERS] HEAD seems to generate larger WAL regarding GIN index
(2014/10/30 21:30), Fujii Masao wrote: On Thu, Oct 30, 2014 at 7:30 PM, Etsuro Fujita wrote: Here are my review comments. * The patch applies cleanly and make and make check run successfully. I think that the patch is mostly good. Thanks! Attached is the updated version of the patch. Thank you for updating the patch! * In src/backend/utils/misc/guc.c: + { + {"pending_list_cleanup_size", PGC_USERSET, CLIENT_CONN_STATEMENT, + gettext_noop("Sets the maximum size of the pending list for GIN index."), +NULL, + GUC_UNIT_KB + }, + &pending_list_cleanup_size, + 4096, 0, MAX_KILOBYTES, + NULL, NULL, NULL + }, ISTM it'd be better to use RESOURCES_MEM, not CLIENT_CONN_STATEMENT. No? Yes if the pending list always exists in the memory. But not, IIUC. Thought? Exactly. But I think we can expect that in many cases, since I think that the users would often set the GUC to a small value to the extent that most of the pending list pages would be cached by shared buffer, to maintain *search* performance. I'd like to hear the opinions of others about the category for the GUC. Also why not set min to 64, not to 0, in accoradance with that of work_mem? I'm OK to use 64. But I just chose 0 because I could not think of any reasonable reason why 64k is suitable as the minimum size of the pending list. IOW, I have no idea about whether it's reasonable to use the min value of work_mem as the min size of the pending list. IIUC, I think that min = 0 disables fast update, so ISTM that it'd be appropriate to set min to some positive value. And ISTM that the idea of using the min value of work_mem is not so bad. * In doc/src/sgml/ref/create_index.sgml: + PENDING_LIST_CLEANUP_SIZE IMHO, it seems to me better for this variable to be in lowercase in accordance with the GUC version. Using lowercase only for pending_list_cleanup_size and uppercase for other options looks strange to me. What about using lowercase for all the storage options? +1 I changed the document in that way. *** 356,361 CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ class="parameter">name --- 356,372 + + + PENDING_LIST_CLEANUP_SIZE The above is still in uppercse. Thanks, Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
On Sun, Nov 2, 2014 at 2:30 AM, Tom Lane wrote: > In the case of hash indexes, because we still have to have the hash > opclasses in core, there's no way that it could be pushed out as an > extension module even if we otherwise had full support for AMs as > extensions. So what I hear you proposing is "let's break this so > thoroughly that it *can't* be fixed". I'm not on board with that. > I think the WARNING will do just fine to discourage novices who are > not familiar with the state of the hash AM. In the meantime, we > could push forward with the idea of making hash indexes automatically > unlogged, so that recovering from a crash wouldn't be quite so messy/ > dangerous. > There is as well another way: finally support WAL-logging for hash indexes. -- Michael
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
On 11/02/2014 11:41 AM, Tom Lane wrote: > Nothing that I recall at the moment, but there is certainly plenty of > stuff of dubious quality in there. I'd argue that chkpass, intagg, > intarray, isn, spi, and xml2 are all in worse shape than the money type. Why are we holding on to xml2 again? FWIW, I'd be fine with moving ISN, intarray and intagg to PGXN. In fact, I think it would be better for them to be there. And they're not core types, so there shouldn't be an issue with that. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PG-XC] how to pass a value(String) from coordinator node to all data nodes
On Tue, Nov 4, 2014 at 10:40 AM, Demai Ni wrote: > hi, hackers, > > I am looking for a simple way to pass a value(ideally a hashtable, and use > a string for easy implementation for now), that is generated by coordinator > node, and pass it to all data nodes. Plans may be one approach, I am just > wondering whether there is a simpler hook? > > Many thanks for your input. > This is a message for the Postgres-XC mailing lists. Feel free to subscribe and send messages here: https://lists.sourceforge.net/lists/listinfo/postgres-xc-developers -- Michael
Re: [HACKERS] Pipelining executions to postgresql server
On 11/04/2014 09:10 AM, Tom Lane wrote: > Mikko Tiihonen writes: >> I do not quite grasp why not sending Sync is so important. My proof of >> concept setup was for queries with autocommit enabled. > > [snip] It'll be very much like > sending a fixed (predetermined) SQL script to the server using a scripting > language that lacks any conditionals. ... which is part of why I think the batch interface for JDBC is the appropriate UI, and the existing support in PgJDBC just needs to be extended to support returning result sets. The JDBC driver supports multiple result sets, and a batch execution looks just like a procedure that returned multiple result sets (or rather, a mix of result sets, affected rowcounts, and no-info success messages). See http://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#addBatch(java.lang.String) http://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#executeBatch() The docs are admittedly mostly oriented toward DML, but nothing stops us returning SUCCESS_NO_INFO and a resultset. Or if we choose, returning a rowcount and resultset. It'd be worth testing what other drivers do before doing that, but nothing in the spec: https://jcp.org/aboutJava/communityprocess/mrel/jsr221/index.html seems to stop us doing it. The batch interface doesn't offer any way to set scrollable/editable resultset options, but we don't want to allow that for batches anyway. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
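[Editor's note: as a concrete illustration of the usage pattern being discussed, here is a sketch of a batch insert that also requests generated keys. The table is invented, and whether getGeneratedKeys() returns anything useful after executeBatch() depends on the driver - that is precisely the gap the linked branch addresses.]

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Batch insert that asks for generated keys. The JDBC API allows this
// combination; how much of it a driver implements for batches varies.
public class BatchReturningKeys {
    public static void insertBatch(Connection conn, String[] names) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO users(name) VALUES (?)",
                Statement.RETURN_GENERATED_KEYS);
        try {
            for (String name : names) {
                ps.setString(1, name);
                ps.addBatch();
            }

            int[] counts = ps.executeBatch();   // entries may be SUCCESS_NO_INFO
            System.out.println(counts.length + " statements executed");

            ResultSet keys = ps.getGeneratedKeys();
            while (keys.next())
                System.out.println("generated id: " + keys.getLong(1));
        } finally {
            ps.close();
        }
    }
}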
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
On Tue, Nov 04, 2014 at 07:51:06AM +0900, Tatsuo Ishii wrote: > > The performance of our numeric vs Oracle's was a common complaint when > > I was at EnterpriseDB (in 2007). > > > > Perhaps numeric's performance could be greatly improved in cases where > > the precision is low enough to map to an int/bigint. That would get us > > closer to eliminating money as well as give other uses a big win. Of > > course, how to do that and not break pg_upgrade is a huge question... > > Just out of curiosity, why is Oracle's NUMBER (I assume you are > talking about this) so fast? I suspect that what happens is that NUMBER is stored as a native type (int2, int4, int8, int16) that depends on its size and then cast to the next upward thing as needed, taking any performance hits at that point. The documentation hints (38 decimal places) at a 128-bit internal representation as the maximum. I don't know what happens when you get past what 128 bits can represent. Cheers, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PG-XC] how to pass a value(String) from coordinator node to all data nodes
hi, hackers, I am looking for a simple way to pass a value(ideally a hashtable, and use a string for easy implementation for now), that is generated by coordinator node, and pass it to all data nodes. Plans may be one approach, I am just wondering whether there is a simpler hook? Many thanks for your input. Demai
Re: [HACKERS] pg_multixact not getting truncated
On 11/03/2014 05:24 PM, Josh Berkus wrote: > BTW, the reason I started poking into this was a report from a user that > they have a pg_multixact directory which is 21GB in size, and is 2X the > size of the database. > > Here's XID data: > > Latest checkpoint's NextXID: 0/1126461940 > Latest checkpoint's NextOID: 371135838 > Latest checkpoint's NextMultiXactId: 162092874 > Latest checkpoint's NextMultiOffset: 778360290 > Latest checkpoint's oldestXID:945761490 > Latest checkpoint's oldestXID's DB: 370038709 > Latest checkpoint's oldestActiveXID: 1126461940 > Latest checkpoint's oldestMultiXid: 123452201 > Latest checkpoint's oldestMulti's DB: 370038709 > > Oldest mxid file is 29B2, newest is 3A13 > > No tables had a relminmxid of 1 (some of the system tables were 0, > though), and the data from pg_class and pg_database is consistent. More tidbits: I just did a quick check on customer systems (5 of them). This only seems to be happening on customer systems where I specifically know there is a high number of FK lock waits (the system above gets around 1000 per minute that we know of). Other systems with higher transaction rates don't exhibit this issue; I checked a 9.3.5 database which normally needs to do XID wraparound once every 10 days, and it's pg_multixact is only 48K (it has one file, ). Note that pg_clog on the bad machine is only 64K in size. How many IDs are there per mxid file? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_multixact not getting truncated
On 11/03/2014 05:06 PM, Alvaro Herrera wrote: > Josh Berkus wrote: >> Hackers, >> >> I'm looking at a couple of high-transaction-rate and high-FK-conflict >> rate servers where pg_multixact has grown to be more than 1GB in size. >> One such server doesn't appear to be having any notable issues with >> vacuuming, and the oldest mxid on the system is about 47m old. VACUUM >> FREEZEing the oldest databases did not cause the pg_multixact dir to get >> smaller --- it may have even caused it to get larger. >> >> Why would pg_multixact not be truncating? Does it never truncate files >> with aborted multixacts in them? Might we have another multixact bug? > > Have a look at the MultiXactId values in pg_controldata, datminmxid in > pg_database, and relminmxid in pg_class. They should advance as you > VACUUM FREEZE. If it's stuck at 1, you might be in pg_upgrade trouble > if you used 9.3.4 or earlier to upgrade. They're advancing. I also know for a fact that this system was not pg_upgraded, because I set it up. It's always been on 9.3.5. So after some vacuum freezing and a checkpoint, the oldestMultiXid advanced by 9 million, about 20% of the gap with current multiXid. However, that did not delete 20% of the files; before the Freeze there were 4414 files, and now there are 4195. So advancing 9m mxids resulted in only deleting ~~130 files. So I think we definitely have an issue here. BTW, the reason I started poking into this was a report from a user that they have a pg_multixact directory which is 21GB in size, and is 2X the size of the database. Here's XID data: Latest checkpoint's NextXID: 0/1126461940 Latest checkpoint's NextOID: 371135838 Latest checkpoint's NextMultiXactId: 162092874 Latest checkpoint's NextMultiOffset: 778360290 Latest checkpoint's oldestXID:945761490 Latest checkpoint's oldestXID's DB: 370038709 Latest checkpoint's oldestActiveXID: 1126461940 Latest checkpoint's oldestMultiXid: 123452201 Latest checkpoint's oldestMulti's DB: 370038709 Oldest mxid file is 29B2, newest is 3A13 No tables had a relminmxid of 1 (some of the system tables were 0, though), and the data from pg_class and pg_database is consistent. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pipelining executions to postgresql server
Mikko Tiihonen writes: > I do not quite grasp why not sending Sync is so important. My proof of > concept setup was for queries with autocommit enabled. The point is that that will be very, very much harder to use than doing it the other way. It's fairly easy to reason about the results of single-transaction pipelined queries: they're all or nothing. If you fire off queries that are going to autocommit independently, then you have to worry about all combinations of success/failure, and you won't have the opportunity to adjust on the fly. It'll be very much like sending a fixed (predetermined) SQL script to the server using a scripting language that lacks any conditionals. People certainly do use fixed scripts sometimes, but they usually find out they want them wrapped into single transactions, because otherwise they're left with a horrible mess if the script goes off the rails at all. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_multixact not getting truncated
Josh Berkus wrote: > Hackers, > > I'm looking at a couple of high-transaction-rate and high-FK-conflict > rate servers where pg_multixact has grown to be more than 1GB in size. > One such server doesn't appear to be having any notable issues with > vacuuming, and the oldest mxid on the system is about 47m old. VACUUM > FREEZEing the oldest databases did not cause the pg_multixact dir to get > smaller --- it may have even caused it to get larger. > > Why would pg_multixact not be truncating? Does it never truncate files > with aborted multixacts in them? Might we have another multixact bug? Have a look at the MultiXactId values in pg_controldata, datminmxid in pg_database, and relminmxid in pg_class. They should advance as you VACUUM FREEZE. If it's stuck at 1, you might be in pg_upgrade trouble if you used 9.3.4 or earlier to upgrade. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
Jeff Janes wrote: > On Mon, Nov 3, 2014 at 2:18 PM, Alvaro Herrera > wrote: > I get a couple compiler warnings with this: > > brin.c: In function 'brininsert': > brin.c:97: warning: 'tupcxt' may be used uninitialized in this function > brin.c:98: warning: 'oldcxt' may be used uninitialized in this function Ah, that's easily fixed. My compiler (gcc 4.9 from Debian Jessie nowadays) doesn't complain, but I can see that it's not entirely trivial. > Also, I think it is missing a cat version bump. It let me start the > patched server against an unpatched initdb run, but once started it didn't > find the index method. Sure, that's expected (by me at least). I'm too lazy to maintain catversion bumps in the patch before pushing, since that generates constant conflicts as I rebase. > What would it take to make CLUSTER work on a brin index? Now I just added > a btree index on the same column, clustered on that, then dropped that > index. Interesting question. What's the most efficient way to pack a table to minimize the intervals covered by each index entry? One thing that makes this project a bit easier, I think, is that CLUSTER has already been generalized so that it supports either an indexscan or a seqscan+sort. If anyone wants to work on this, be my guest; I'm certainly not going to add it to the initial commit. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On Mon, Nov 3, 2014 at 7:25 PM, Alexey Vasiliev wrote: > > > > > Mon, 3 Nov 2014 19:18:51 -0200 от Fabrízio de Royes Mello < fabriziome...@gmail.com>: > > > > > > > > Should I change my patch and send it by email? And also as I understand I should change message ID for https://commitfest.postgresql.org/action/patch_view?id=1636 , isn't it? > > > > > > > You should just send another version of your patch by email and add a new "comment" to commit-fest using the "Patch" comment type. > > > Added new patch. > And you should add the patch to an open commit-fest not to an in-progress. The next opened is 2014-12 [1]. Regards, [1] https://commitfest.postgresql.org/action/commitfest_view?id=25 -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL >> Timbira: http://www.timbira.com.br >> Blog: http://fabriziomello.github.io >> Linkedin: http://br.linkedin.com/in/fabriziomello >> Twitter: http://twitter.com/fabriziomello >> Github: http://github.com/fabriziomello
[HACKERS] pg_multixact not getting truncated
Hackers, I'm looking at a couple of high-transaction-rate and high-FK-conflict rate servers where pg_multixact has grown to be more than 1GB in size. One such server doesn't appear to be having any notable issues with vacuuming, and the oldest mxid on the system is about 47m old. VACUUM FREEZEing the oldest databases did not cause the pg_multixact dir to get smaller --- it may have even caused it to get larger. Why would pg_multixact not be truncating? Does it never truncate files with aborted multixacts in them? Might we have another multixact bug? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] BRIN indexes - TRAP: BadArgument
On Mon, Nov 3, 2014 at 2:18 PM, Alvaro Herrera wrote: > Robert Haas wrote: > > [lots] > > I have fixed all these items in the attached, thanks -- most > user-visible change was the pageinspect 1.3 thingy. pg_upgrade from 1.2 > works fine now. I also fixed some things Heikki noted, mainly avoid > retail pfree where possible, and renumber the support procs to leave > room for future expansion of the framework. XLog replay code is updated > too. > > Also, I made the summarization step callable directly from SQL without > having to invoke VACUUM. > > So here's v21. I also attach a partial diff from v20, just in case > anyone wants to give it a look. > I get a couple compiler warnings with this: brin.c: In function 'brininsert': brin.c:97: warning: 'tupcxt' may be used uninitialized in this function brin.c:98: warning: 'oldcxt' may be used uninitialized in this function Also, I think it is missing a cat version bump. It let me start the patched server against an unpatched initdb run, but once started it didn't find the index method. What would it take to make CLUSTER work on a brin index? Now I just added a btree index on the same column, clustered on that, then dropped that index. Thanks, Jeff
Re: [HACKERS] Pipelining executions to postgresql server
> Craig Ringer wrote:
> On 11/02/2014 09:27 PM, Mikko Tiihonen wrote:
> > Is the following summary correct:
> > - the network protocol supports pipelining
>
> Yes.
>
> All you have to do is *not* send a Sync message and be aware that the
> server will discard all input until the next Sync, so pipelining +
> autocommit doesn't make a ton of sense for error handling reasons.

I do not quite grasp why not sending Sync is so important. My proof of concept setup was for queries with autocommit enabled.

When looking with Wireshark I could observe that the client sent 3-10 P/B/D/E/S messages to the server before the server started sending the corresponding 1/2/T/D/C/Z replies for each request. Every 10 requests the test application waited for all the replies to come in, so as not to overflow the network buffers (which is known to cause deadlocks with the current pg jdbc driver).

If I want separate error handling for each execution, then shouldn't I be sending a separate Sync for each pipelined operation?

> > - the server handles operations in order, starting the processing of next
> > operation only after fully processing the previous one
> > - thus pipelining is invisible to the server
>
> As far as I know, yes. The server just doesn't care.
>
> > - libpq driver does not support pipelining, but that is due to internal
> > limitations
>
> Yep.
>
> > - if proper error handling is done by the client then there is no reason
> > why pipelining could not be supported by any pg client
>
> Indeed, and most should support it. Sending batches of related queries
> would make things a LOT faster.
>
> PgJDBC's batch support is currently write-oriented. There is no
> fundamental reason it can't be expanded for reads. I've already written
> a patch to do just that for the case of returning generated keys.
>
> https://github.com/ringerc/pgjdbc/tree/batch-returning-support
>
> and just need to rebase it so I can send a pull for upstream PgJDBC.
> It's already linked in the issues documenting the limitations in batch
> support.

Your code looked good. Returning inserts are an important thing to optimize.

> If you want to have more general support for batches that return rowsets
> there's no fundamental technical reason why it can't be added. It just
> requires some tedious refactoring of the driver to either:
>
> - Sync and wait before it fills its *send* buffer, rather than trying
> to manage its receive buffer (the server send buffer), so it can
> reliably avoid deadlocks; or
>
> - Do async I/O in a select()-like loop over a protocol state machine,
> so it can simultaneously read and write on the wire.

I also think the async I/O is the way to go. Luckily that has already been done in the pgjdbc-ng (https://github.com/impossibl/pgjdbc-ng), built on top of the Netty Java NIO library. It has quite good feature parity with the original pgjdbc driver. I'll next try whether I can enable pipelining with it, now that I have tried the proof of concept with the original pgjdbc driver.

> I might need to do some of that myself soon, but it's a big (and
> therefore error-prone) job I've so far avoided by making smaller, more
> targeted changes.
>
> For JDBC the JDBC batch interface is the right place to do this, and you
> should not IMO attempt to add pipelining outside that interface.
> (Multiple open resultsets from portals, yes, but not pipelining of queries).

I do not think the JDBC batch interface even allows doing updates to multiple tables when using prepared statements?
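[Editor's note: to illustrate the limitation Mikko is pointing at, a PreparedStatement batch is tied to the single SQL string it was prepared with, so one batch cannot carry parameterized updates against different tables; mixing tables only works with an unparameterized Statement batch. A sketch with invented table names follows.]

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Illustration of the JDBC batch limitation discussed above.
public class BatchLimitation {
    public static void demo(Connection conn) throws Exception {
        // A PreparedStatement batch: many parameter sets, but only ONE
        // statement, and therefore only one target table.
        PreparedStatement ps = conn.prepareStatement("INSERT INTO orders(item) VALUES (?)");
        ps.setString(1, "apple");
        ps.addBatch();
        ps.setString(1, "pear");
        ps.addBatch();
        ps.executeBatch();
        ps.close();

        // A plain Statement batch can mix statements against different
        // tables, but gives up parameterization entirely.
        Statement st = conn.createStatement();
        st.addBatch("INSERT INTO orders(item) VALUES ('banana')");
        st.addBatch("UPDATE stock SET qty = qty - 1 WHERE item = 'banana'");
        st.executeBatch();
        st.close();
    }
}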
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
On 03/11/14 22:19, Tom Lane wrote:
> Álvaro Hernández Tortosa writes:
>> IMHO, from a user perspective the transaction begins when the BEGIN
>> command is issued. If I really want to see a "frozen" view of the
>> database before any real SELECT, I have to issue another query like
>> "SELECT 1". This seems odd to me. I understand tx snapshot may be
>> deferred until real execution for performance reasons, but it is
>> confusing from a user perspective. Is this really expected, or is it a
>> bug? Am I having a bad day and missing some point here? ^_^
>
> It's expected. Without this behavior, you could not take out any locks
> before freezing the transaction snapshot, which would be a bad thing.

Thank you for your comment, Tom. However I think this behavior, as seen from a user perspective, is not the expected one. There may be internal reasons for it, but the user executing the transaction would normally expect the freezing to happen right after the BEGIN, rather than *after* the first query.

If it is still the intended behavior, I think it should be clearly documented as such, and a recommendation along the lines of "issue a 'SELECT 1' right after BEGIN to freeze the data before any of your own queries" should be added. Again, as I said in my email, the documentation clearly says that the transaction "only sees data committed before the transaction began", and this is clearly not the real behavior.

> I think there are some examples in the "concurrency control" chapter of
> the manual.

Sure, there are, that was the link I pointed out, but I found no explicit mention of the point I'm raising here.

Regards,

Álvaro

-- Álvaro Hernández Tortosa --- 8Kdata

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On Mon, Nov 3, 2014 at 3:26 PM, Peter Eisentraut wrote: > On 11/1/14 8:04 AM, Petr Jelinek wrote: >> On second thought, maybe those should be pg_get_transaction_committs, >> pg_get_transaction_committs_data, etc. > > Please don't name anything "committs". That looks like a misspelling of > something. > > There is nothing wrong with > > pg_get_transaction_commit_timestamp() > > If you want to reduce the length, lose the "get". +1: all non void returning functions 'get' something. >> For me the commit time thing feels problematic in the way I perceive it >> - I see commit time as a point in time, where I see commit timestamp (or >> committs for short) as something that can recorded. So I would prefer to >> stick with commit timestamp/committs. > > In PostgreSQL, it is pretty clearly established that time is hours, > minutes, seconds, and timestamp is years, months, days, hours, minutes, > seconds. So unless this feature only records the hour, minute, and > second of a commit, it should be "timestamp". Elsewhere, for example, we have: "pg_last_xact_replay_timestamp()". So, in keeping with that, maybe, pg_xact_commit_timestamp(xid) merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
Alvaro Herrera wrote: > There is a real advantage of money over numeric in the performance > front. I haven't measured it, but suffice to say that money uses > integer operations which map almost directly to CPU instructions, > whereas numeric needs to decode from our varlena base-1 digit > format, operate on digits at a time, then encode back. No matter how > much you optimize numeric, it's never going to outperform stuff that > runs practically on bare electrons. As far as I recall, some TPCH > queries run aggregations on currency columns. > > Now, whether this makes a measurable difference or not in TPCH terms, I > have no idea. Best of 3 tries on each statement... A count(*) as a baseline: test=# do $$ begin perform count(*) from generate_series(1,1000); end; $$; DO Time: 3260.920 ms A sum of money: test=# do $$ begin perform sum('1.01'::money) from generate_series(1,1000); end; $$; DO Time: 3572.709 ms A sum of numeric: test=# do $$ begin perform sum('1.01'::numeric) from generate_series(1,1000); end; $$; DO Time: 4805.559 ms -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
> The performance of our numeric vs Oracle's was a common complaint when > I was at EnterpriseDB (in 2007). > > Perhaps numeric's performance could be greatly improved in cases where > the precision is low enough to map to an int/bigint. That would get us > closer to eliminating money as well as give other uses a big win. Of > course, how to do that and not break pg_upgrade is a huge question... Just out of curiosity, why is Oracle's NUMBER (I assume you are talking about this) so fast? Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese:http://www.sraoss.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 03/11/14 22:26, Peter Eisentraut wrote: On 11/1/14 8:04 AM, Petr Jelinek wrote: On second thought, maybe those should be pg_get_transaction_committs, pg_get_transaction_committs_data, etc. Please don't name anything "committs". That looks like a misspelling of something. There is nothing wrong with pg_get_transaction_commit_timestamp() If you want to reduce the length, lose the "get". I am fine with that, I only wonder if your definition of "anything" only concerns the SQL interfaces or also the internals. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TAP test breakage on MacOS X
On 11/03/2014 04:44 PM, Peter Eisentraut wrote: On 10/29/14 8:42 AM, Robert Haas wrote: I'm sympathetic to that line of reasoning, but I really think that if you want to keep this infrastructure, it needs to be made portable. Let me clarify that this was my intention. I have looked at many test frameworks, many of which are much nicer than what we have, but the portability and dependency implications for this project would have been between shocking and outrageous. I settled for what I felt was the absolute minimum: Perl + IPC::Run. It was only later on that I learned that 1) subtests don't work in Perl 5.10, and 2) subtests are broken in Perl 5.12. So we removed the use of subtests and now we are back at the baseline I started with. The irony in this whole story is that if we had thrown this onto the build farm right away, we might have found and fixed these problems within a week instead of five months later. An email note to me would probably have done the trick. I've missed one or two things lately, since my mind has been to some extent elsewhere, and don't mind a little memory jog. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TAP test breakage on MacOS X
On 10/29/14 8:42 AM, Robert Haas wrote: > I'm sympathetic to that line of reasoning, but I really think that if > you want to keep this infrastructure, it needs to be made portable. Let me clarify that this was my intention. I have looked at many test frameworks, many of which are much nicer than what we have, but the portability and dependency implications for this project would have been between shocking and outrageous. I settled for what I felt was the absolute minimum: Perl + IPC::Run. It was only later on that I learned that 1) subtests don't work in Perl 5.10, and 2) subtests are broken in Perl 5.12. So we removed the use of subtests and now we are back at the baseline I started with. The irony in this whole story is that if we had thrown this onto the build farm right away, we might have found and fixed these problems within a week instead of five months later. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TAP test breakage on MacOS X
Peter Eisentraut writes: > I thank you for this research, but I suggest that we ship 9.4 as is, > that is with requiring IPC::Run and --enable-* option. All the possible > alternatives will clearly need more rounds of portability testing. We > can then evaluate these changes for 9.5 in peace. I concur --- it's time to get 9.4 out, and the current state of affairs is Good Enough imo. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TAP test breakage on MacOS X
On 11/2/14 2:00 PM, Noah Misch wrote: >> Ick; I concur with your judgment on those aspects of the IPC::Cmd design. >> Thanks for investigating. So, surviving options include: >> >> 1. Require IPC::Run. >> 2. Write our own run() that reports the raw exit code. >> 3. Distill the raw exit code from the IPC::Cmd::run() error string. >> 4. Pass IPC::Run::run_forked() a subroutine that execs an argument list. > > FWIW, (3) looks most promising to me. That is to say, implement a reverse of > IPC::Cmd::_pp_child_error(). Ugly to be sure, but the wart can be small and > self-contained. I thank you for this research, but I suggest that we ship 9.4 as is, that is with requiring IPC::Run and --enable-* option. All the possible alternatives will clearly need more rounds of portability testing. We can then evaluate these changes for 9.5 in peace. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] tracking commit timestamps
On 11/1/14 8:04 AM, Petr Jelinek wrote: > On second thought, maybe those should be pg_get_transaction_committs, > pg_get_transaction_committs_data, etc. Please don't name anything "committs". That looks like a misspelling of something. There is nothing wrong with pg_get_transaction_commit_timestamp() If you want to reduce the length, lose the "get". > For me the commit time thing feels problematic in the way I perceive it > - I see commit time as a point in time, where I see commit timestamp (or > committs for short) as something that can recorded. So I would prefer to > stick with commit timestamp/committs. In PostgreSQL, it is pretty clearly established that time is hours, minutes, seconds, and timestamp is years, months, days, hours, minutes, seconds. So unless this feature only records the hour, minute, and second of a commit, it should be "timestamp". -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re[3]: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Mon, 3 Nov 2014 19:18:51 -0200 от Fabrízio de Royes Mello : > > > > > Should I change my patch and send it by email? And also as I understand I > > should change message ID for > > https://commitfest.postgresql.org/action/patch_view?id=1636 , isn't it? > > > > You should just send another version of your patch by email and add a new > "comment" to commit-fest using the "Patch" comment type. Added new patch. -- Alexey Vasiliev From a5d941717df71e4c16941b8a02f0dca40a1107a0 Mon Sep 17 00:00:00 2001 From: Alexey Vasiliev Date: Mon, 3 Nov 2014 00:21:14 +0200 Subject: [PATCH] add recovery_timeout to controll timeout between restore_command nonzero --- doc/src/sgml/recovery-config.sgml | 24 src/backend/access/transam/recovery.conf.sample | 5 + src/backend/access/transam/xlog.c | 20 ++-- 3 files changed, 47 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml index 0f1ff34..98eb451 100644 --- a/doc/src/sgml/recovery-config.sgml +++ b/doc/src/sgml/recovery-config.sgml @@ -145,6 +145,30 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows + + restore_command_retry_interval (integer) + +restore_command_retry_interval recovery parameter + + + + +By default, if restore_command return nonzero status, +server will retry command again after 5 seconds. This parameter +allow to change this time. This parameter is optional. This can +be useful to increase/decrease number of a restore_command calls. + + +This is useful, if I using for restore of wal logs some +external storage (like AWS S3) and no matter what the slave database +will lag behind the master. The problem, what for each request to +AWS S3 need to pay, what is why for N nodes, which try to get next +wal log each 5 seconds will be bigger price, than for example each +30 seconds. + + + + diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample index 7657df3..4eee8f4 100644 --- a/src/backend/access/transam/recovery.conf.sample +++ b/src/backend/access/transam/recovery.conf.sample @@ -58,6 +58,11 @@ # #recovery_end_command = '' # +# specifies an optional timeout after nonzero code of restore_command. +# This can be useful to increase/decrease number of a restore_command calls. +# +#restore_command_retry_interval = 5s +# #--- # RECOVERY TARGET PARAMETERS #--- diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 3c9aeae..c26101e 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -233,6 +233,7 @@ static TimestampTz recoveryTargetTime; static char *recoveryTargetName; static int recovery_min_apply_delay = 0; static TimestampTz recoveryDelayUntilTime; +static int restore_command_retry_interval = 0; /* options taken from recovery.conf for XLOG streaming */ static bool StandbyModeRequested = false; @@ -5245,6 +5246,20 @@ readRecoveryCommandFile(void) (errmsg_internal("trigger_file = '%s'", TriggerFile))); } + else if (strcmp(item->name, "restore_command_retry_interval") == 0) + { + const char *hintmsg; + + if (!parse_int(item->value, &restore_command_retry_interval, GUC_UNIT_MS, + &hintmsg)) +ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("parameter \"%s\" requires a temporal value", +"restore_command_retry_interval"), + hintmsg ? 
errhint("%s", _(hintmsg)) : 0)); + ereport(DEBUG2, + (errmsg_internal("restore_command_retry_interval = '%s'", item->value))); + } else if (strcmp(item->name, "recovery_min_apply_delay") == 0) { const char *hintmsg; @@ -11104,13 +9,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess, } /* - * Wait for more WAL to arrive. Time out after 5 seconds, + * Wait for more WAL to arrive. Time out after + * restore_command_retry_interval (5 seconds by default), * like when polling the archive, to react to a trigger * file promptly. */ WaitLatch(&XLogCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT, - 5000L); + restore_command_retry_interval > 0 ? restore_command_retry_interval : 5000L); ResetLatch(&XLogCtl->recoveryWakeupLatch); break; } -- 1.9.3 (Apple Git-50) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Repeatable read and serializable transactions see data committed after tx start
Álvaro Hernández Tortosa writes: > IMHO, from a user perspective the transaction begins when the BEGIN > command is issued. If I really want to see a "frozen" view of the > database before any real SELECT, I have to issue another query like > "SELECT 1". This seems odd to me. I understand tx snapshot may be > deferred until real execution for performance reasons, but it is > confusing from a user perspective. Is this really expected, or is it a > bug? Am I having a bad day and missing some point here? ^_^ It's expected. Without this behavior, you could not take out any locks before freezing the transaction snapshot, which would be a bad thing. I think there are some examples in the "concurrency control" chapter of the manual. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
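To make the point about locks concrete, here is a minimal SQL sketch (the accounts table and its use are hypothetical, not from this thread): because the snapshot is only taken at the first statement that needs one, a lock can be acquired first, and the eventual snapshot then includes whatever the writers that the lock waited for had committed.

    BEGIN ISOLATION LEVEL REPEATABLE READ;
    LOCK TABLE accounts IN SHARE MODE;  -- no snapshot exists yet; may wait for concurrent writers
    SELECT sum(balance) FROM accounts;  -- the transaction snapshot is taken here, after the lock
                                        -- is granted, so it sees the work those writers committed
    COMMIT;

If the snapshot were frozen at BEGIN instead, the SELECT could miss rows written by transactions the LOCK had to wait out.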
Re: [HACKERS] Re[2]: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
> > Should I change my patch and send it by email? And also as I understand I should change message ID for https://commitfest.postgresql.org/action/patch_view?id=1636, isn't it? > You should just send another version of your patch by email and add a new "comment" to commit-fest using the "Patch" comment type. Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL >> Timbira: http://www.timbira.com.br >> Blog: http://fabriziomello.github.io >> Linkedin: http://br.linkedin.com/in/fabriziomello >> Twitter: http://twitter.com/fabriziomello >> Github: http://github.com/fabriziomello
Re: [HACKERS] how to handle missing "prove"
Peter Eisentraut writes: > On 11/3/14 3:11 PM, Tom Lane wrote: >> If there's a commit that goes with this statement, you neglected to push >> it... > Are you not seeing this in configure.in: > AC_CHECK_PROGS(PROVE, prove) > if test -z "$PROVE"; then > AC_MSG_ERROR([prove not found]) Oh, duh. I read your message to mean that you'd just pushed a change for that, not that it already did it. Nevermind ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] how to handle missing "prove"
On 11/3/14 3:11 PM, Tom Lane wrote: > Peter Eisentraut writes: >> On 11/2/14 11:36 AM, Tom Lane wrote: >>> Committed patch looks good, but should we also add the stanza we discussed >>> in Makefile.global.in concerning defining $(prove) in terms of "missing" >>> if we didn't find it? I think the behavior of HEAD when you ask for >>> --enable-tap-tests but don't have "prove" might be less than ideal. > >> configure will now fail when "prove" is not found. > > If there's a commit that goes with this statement, you neglected to push it... Are you not seeing this in configure.in:

#
# Check for test tools
#
if test "$enable_tap_tests" = yes; then
  AC_CHECK_PROGS(PROVE, prove)
  if test -z "$PROVE"; then
    AC_MSG_ERROR([prove not found])
  fi
  if test -z "$PERL"; then
    AC_MSG_ERROR([Perl not found])
  fi
fi

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
Abhijit Menon-Sen writes: > Earlier, I was using a combination of check and assign hooks to convert > names to OIDs, but (as Andres pointed out) that would have problems with > cache invalidations. I was even playing with caching membership lookups, > but I ripped out all that code. > In the attached patch, role_is_audited does all the hard work to split > up the list of roles, look up the corresponding OIDs, and check if the > user is a member of any of those roles. It works fine, but it doesn't > seem desirable to repeat all that work for every statement. > So does anyone have suggestions about how to make this faster? Have you read the code in acl.c that caches lookup results for role-is-member-of checks? Sounds pretty closely related. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
Hi. I could actually use some comments on the approach. I've attached a prototype I've been working on (which is a cut down version of my earlier code; but it's not terribly interesting and you don't need to read it to comment on my questions below). The attached patch does the following: 1. Adds a pgaudit.roles = 'role1, role2' GUC setting. 2. Adds a role_is_audited() function that returns true if the given role OID is mentioned in (or inherits from a role mentioned in) pgaudit.roles. 3. Adds a call to role_is_audited from log_audit_event with the current user id (GetSessionUserId in the patch, though it may be better to use GetUserId; but that's a minor detail). Earlier, I was using a combination of check and assign hooks to convert names to OIDs, but (as Andres pointed out) that would have problems with cache invalidations. I was even playing with caching membership lookups, but I ripped out all that code. In the attached patch, role_is_audited does all the hard work to split up the list of roles, look up the corresponding OIDs, and check if the user is a member of any of those roles. It works fine, but it doesn't seem desirable to repeat all that work for every statement. So does anyone have suggestions about how to make this faster? -- Abhijit diff --git a/pgaudit.c b/pgaudit.c index 30f729a..5ba9886 100644 --- a/pgaudit.c +++ b/pgaudit.c @@ -58,6 +58,13 @@ PG_FUNCTION_INFO_V1(pgaudit_func_ddl_command_end); PG_FUNCTION_INFO_V1(pgaudit_func_sql_drop); /* + * pgaudit_roles_str is the string value of the pgaudit.roles + * configuration variable, which is a list of role names. + */ + +char *pgaudit_roles_str = NULL; + +/* * pgaudit_log_str is the string value of the pgaudit.log configuration * variable, e.g. "read, write, user". Each token corresponds to a flag * in enum LogClass below. We convert the list of tokens into a bitmap @@ -117,6 +124,40 @@ typedef struct { } AuditEvent; /* + * Takes a role OID and returns true or false depending on whether + * auditing it enabled for that roles according to the pgaudit.roles + * setting. + */ + +static bool +role_is_audited(Oid roleid) +{ + List *roles; + ListCell *lt; + + if (!SplitIdentifierString(pgaudit_roles_str, ',', &roles)) + return false; + + foreach(lt, roles) + { + char *name = (char *)lfirst(lt); + HeapTuple roleTup; + + roleTup = SearchSysCache1(AUTHNAME, PointerGetDatum(name)); + if (HeapTupleIsValid(roleTup)) + { + Oid parentrole = HeapTupleGetOid(roleTup); + + ReleaseSysCache(roleTup); + if (is_member_of_role(roleid, parentrole)) +return true; + } + } + + return false; +} + +/* * Takes an AuditEvent and returns true or false depending on whether * the event should be logged according to the pgaudit.log setting. 
If * it returns true, it also fills in the name of the LogClass which it @@ -289,18 +330,24 @@ should_be_logged(AuditEvent *e, const char **classname) static void log_audit_event(AuditEvent *e) { + Oid userid; const char *timestamp; const char *database; const char *username; const char *eusername; const char *classname; + userid = GetSessionUserId(); + + if (!role_is_audited(userid)) + return; + if (!should_be_logged(e, &classname)) return; timestamp = timestamptz_to_str(GetCurrentTimestamp()); database = get_database_name(MyDatabaseId); - username = GetUserNameFromId(GetSessionUserId()); + username = GetUserNameFromId(userid); eusername = GetUserNameFromId(GetUserId()); /* @@ -328,7 +375,7 @@ log_executor_check_perms(List *rangeTabls, bool abort_on_violation) { ListCell *lr; - foreach (lr, rangeTabls) + foreach(lr, rangeTabls) { Relation rel; AuditEvent e; @@ -1127,6 +1174,32 @@ pgaudit_object_access_hook(ObjectAccessType access, } /* + * Take a pgaudit.roles value such as "role1, role2" and verify that + * the string consists of comma-separated tokens. + */ + +static bool +check_pgaudit_roles(char **newval, void **extra, GucSource source) +{ + List *roles; + char *rawval; + + /* Make sure we have a comma-separated list of tokens. */ + + rawval = pstrdup(*newval); + if (!SplitIdentifierString(rawval, ',', &roles)) + { + GUC_check_errdetail("List syntax is invalid"); + list_free(roles); + pfree(rawval); + return false; + } + pfree(rawval); + + return true; +} + +/* * Take a pgaudit.log value such as "read, write, user", verify that * each of the comma-separated tokens corresponds to a LogClass value, * and convert them into a bitmap that log_audit_event can check. @@ -1232,6 +1305,23 @@ _PG_init(void) errmsg("pgaudit must be loaded via shared_preload_libraries"))); /* + * pgaudit.roles = "role1, role2" + * + * This variable defines a list of roles for which auditing is + * enabled. + */ + + DefineCustomStringVariable("pgaudit.roles", + "Enable auditing for certain roles", + NULL, + &pgaudit_roles_str, + "", + PGC_SUSET, + GUC_LIST_INPUT | GUC_NOT_IN_SAMPL
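For orientation, a hedged sketch of how the proposed pgaudit.roles setting would be used once the extension is preloaded (the role names here are made up, not from the patch):

    -- postgresql.conf (the module must be in shared_preload_libraries):
    --   shared_preload_libraries = 'pgaudit'
    --   pgaudit.roles = 'auditor'
    --   pgaudit.log   = 'read, write'

    GRANT auditor TO app_user;            -- app_user now inherits from an audited role,
                                          -- so role_is_audited() returns true for its statements
    SET pgaudit.roles = 'auditor, dba';   -- PGC_SUSET: a superuser can adjust the list at runtime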
Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory
On Wed, Oct 15, 2014 at 1:11 PM, Jeff Janes wrote: > On Fri, Aug 8, 2014 at 12:08 AM, Guillaume Lelarge > wrote: > >> Hi, >> >> As part of our monitoring work for our customers, we stumbled upon an >> issue with our customers' servers who have a wal_keep_segments setting >> higher than 0. >> >> We have a monitoring script that checks the number of WAL files in the >> pg_xlog directory, according to the setting of three parameters >> (checkpoint_completion_target, checkpoint_segments, and wal_keep_segments). >> We usually add a percentage to the usual formula: >> >> greatest( >> (2 + checkpoint_completion_target) * checkpoint_segments + 1, >> checkpoint_segments + wal_keep_segments + 1 >> ) >> > > I think the first bug is even having this formula in the documentation to > start with, and in trying to use it. > > "and will normally not be more than..." > > This may be "normal" for a toy system. I think that the normal state for > any system worth monitoring is that it has had load spikes at some point in > the past. > > So it is the next part of the doc, which describes how many segments it > climbs back down to upon recovering from a spike, which is the important > one. And that doesn't mention wal_keep_segments at all, which surely > cannot be correct. > > I will try to independently derive the correct formula from the code, as > you did, without looking too much at your derivation first, and see if we > get the same answer. > It looked to me that the formula, when descending from a previously stressed state, would be:

greatest((1 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments)
 + 1 + 2 * checkpoint_segments + 1

This assumes logs are filled evenly over a checkpoint cycle, which is probably not true because there is a spike in full page writes right after a checkpoint starts. But I didn't have a great deal of confidence in my analysis. The first line reflects the number of WAL that will be retained as-is, the second is the number that will be recycled for future use before starting to delete them. My reading of the code is that wal_keep_segments is computed from the current end of WAL (i.e. the checkpoint record), not from the checkpoint redo point. If I distribute the part outside the 'greatest' into both branches of the 'greatest', I don't get the same answer as you do for either branch. Then I started wondering if the number we keep for recycling is a good choice, anyway. 2 * checkpoint_segments + 1 seems pretty large. But then again, given that we've reached the high-water-mark once, how unlikely are we to reach it again? Cheers, Jeff
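Plugging hypothetical settings into that formula gives a quick sanity check that can be run from psql (the settings below are illustrative; 16 MB is the default segment size):

    -- hypothetical: checkpoint_segments = 32, checkpoint_completion_target = 0.9,
    -- wal_keep_segments = 64
    SELECT greatest((1 + 0.9) * 32, 64) + 1 + 2 * 32 + 1 AS segments;
    -- => 130, i.e. roughly 130 * 16 MB ≈ 2 GB of pg_xlog after recovering from a spike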
Re: [HACKERS] tracking commit timestamps
Jim Nasby wrote: > On 11/1/14, 8:41 AM, Petr Jelinek wrote: > >Well this is not BDR specific thing, the idea is that with logical > >replication, commit timestamp is not enough for conflict handling, you also > >need to have additional info in order to identify some types of conflicts > >conflicts (local update vs remote update for example). So the extradata > >field was meant as something that could be used to add the additional info > >to the xid. > > Related to this... is there any way to deal with 2 transactions that commit > in the same microsecond? It seems silly to try and handle that for every > commit since it should be quite rare, but perhaps we could store the LSN as > extradata if we detect a conflict? Well, two things. One, LSN is 8 bytes and extradata (at least in this patch when I last saw it) is only 4 bytes. But secondly and more important is that detecting a conflict is done in node B *after* node A has recorded the transaction's commit time; there is no way to know at commit time that there is going to be a conflict caused by that transaction in the future. (If there was a way to tell, you could just as well not commit the transaction in the first place.) -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re[2]: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Mon, 03 Nov 2014 14:36:33 -0500 от Peter Eisentraut : > On 11/3/14 6:04 AM, Alexey Vasiliev wrote: > > 3. What the patch does in a short paragraph: This patch should add > > option recovery_timeout, which help to control timeout of > > restore_command nonzero status code. Right now default value is 5 > > seconds. This is useful, if I using for restore of wal logs some > > external storage (like AWS S3) and no matter what the slave database > > will lag behind the master. The problem, what for each request to > > AWS S3 need to pay, what is why for N nodes, which try to get next > > wal log each 5 seconds will be bigger price, than for example each > > 30 seconds. > > That seems useful. I would include something about this use case in the > documentation. Ok, I will add this in patch. > > > This is my first patch. I am not sure about name of option. Maybe it > > should called "recovery_nonzero_timeout". > > The option name had me confused. At first I though this is the time > after which a running restore_command invocation gets killed. I think a > more precise description might be restore_command_retry_interval. "restore_command_retry_interval" - I like this name! Should I change my patch and send it by email? And also as I understand I should change message ID for https://commitfest.postgresql.org/action/patch_view?id=1636, isn't it? Thanks -- Alexey Vasiliev -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
On 11/3/14, 1:34 PM, Alvaro Herrera wrote: David Fetter wrote: On Mon, Nov 03, 2014 at 01:54:09PM +0100, Michael Banck wrote: As an additional datapoint, Vitesse Data changed the DB schema from NUMERIC to MONEY for their TPCH benchmark for performance reasons: "The modification to data types is easy to understand -- money and double types are faster than Numeric (and no one on this planet has a bank account that overflows the money type, not any time soon)."[1] And "Replaced NUMERIC fields representing currency with MONEY"[2]. Not sure whether they modified/optimized PostgreSQL with respect to the MONEY data type and/or how much performance that gained, so CCing CK Tan as well. How does our NUMERIC type's performance compare to other systems' precise types? I realize that some of those systems might have restrictions on publishing numbers they don't authorize, but they might have pushed some authorized numbers if those numbers make them look good. There is a real advantage of money over numeric in the performance front. I haven't measured it, but suffice to say that money uses integer operations which map almost directly to CPU instructions, whereas numeric needs to decode from our varlena base-1 digit format, operate on digits at a time, then encode back. No matter how much you optimize numeric, it's never going to outperform stuff that runs practically on bare electrons. As far as I recall, some TPCH queries run aggregations on currency columns. Now, whether this makes a measurable difference or not in TPCH terms, I have no idea. The performance of our numeric vs Oracle's was a common complaint when I was at EnterpriseDB (in 2007). Perhaps numeric's performance could be greatly improved in cases where the precision is low enough to map to an int/bigint. That would get us closer to eliminating money as well as give other uses a big win. Of course, how to do that and not break pg_upgrade is a huge question... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] how to handle missing "prove"
Peter Eisentraut writes: > On 11/2/14 11:36 AM, Tom Lane wrote: >> Committed patch looks good, but should we also add the stanza we discussed >> in Makefile.global.in concerning defining $(prove) in terms of "missing" >> if we didn't find it? I think the behavior of HEAD when you ask for >> --enable-tap-tests but don't have "prove" might be less than ideal. > configure will now fail when "prove" is not found. If there's a commit that goes with this statement, you neglected to push it... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Missing FIN_CRC32 calls in logical replication code
On 10/31/2014 03:46 PM, Andres Freund wrote: On 2014-10-27 09:30:33 -0400, Tom Lane wrote: Andres Freund writes: On 2014-10-27 12:51:44 +0200, Heikki Linnakangas wrote: replication/slot.c and replication/logical/snapbuild.c use a CRC on the physical slot and snapshot files. It uses the same algorithm as used e.g. for the WAL. However, they are not doing the finalization step, FIN_CRC32() on the calculated checksums. Not that it matters much, but it's a bit weird and inconsistent, and was probably an oversight. Hm. Good catch - that's stupid. I wonder what to do about it. I'm tempted to just add a comment about it to 9.4 and fix it on master as changing it is essentially a catversion bump. Any objections to that? Yeah, I think you should get it right the first time. It hardly seems likely that any 9.4 beta testers are depending on those files to be stable yet. Since both state files have the version embedded it'd be trivial to just do the FIN_CRC32() when loading a version 2 file. Does anybody object to the relevant two lines of code + docs? No objection, if you feel the backwards-compatibility with beta3 is worth it. Looking at slot.c again, a comment in ReplicationSlotOnDisk contradicts the code: /* * Replication slot on-disk data structure. */ typedef struct ReplicationSlotOnDisk { /* first part of this struct needs to be version independent */ /* data not covered by checksum */ uint32 magic; pg_crc32checksum; /* data covered by checksum */ uint32 version; uint32 length; ReplicationSlotPersistentData slotdata; } ReplicationSlotOnDisk; /* size of the part of the slot that is version independent */ #define ReplicationSlotOnDiskConstantSize \ offsetof(ReplicationSlotOnDisk, slotdata) /* size of the slots that is not version indepenent */ #define ReplicationSlotOnDiskDynamicSize \ sizeof(ReplicationSlotOnDisk) - ReplicationSlotOnDiskConstantSize The code that calculates the checksum skips over the constant size, i.e. upto ReplicationSlotOnDiskConstantSize, or slotdata. That means that the version and length are in fact not covered by the checksum, contrary to the comment. Looking at the similar code in snapbuild.c, I think the code was supposed to do what the comment says, and include 'version' and 'length' in the checksum. It would be nice to fix that too in slot.c, to be consistent with snapbuild.c. PS. I find the name "ReplicationSlotOnDiskDynamicSize" confusing, as it is in fact a fixed size struct. I gather it's expected that the size of that part might change across versions, but by that definition nothing is constant. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
On Sat, Nov 1, 2014 at 10:26 AM, Andres Freund wrote: > On 2014-11-01 10:18:03 -0700, Josh Berkus wrote: > > On 10/31/2014 03:07 PM, Tom Lane wrote: > > > I don't care one way or the other about the money type, but I will > defend > > > hash indexes, especially seeing that we've already added a pretty > > > in-your-face warning as of 9.5: > > > > > > regression=# create table foo(f1 int); > > > CREATE TABLE > > > regression=# create index on foo using hash (f1); > > > WARNING: hash indexes are not WAL-logged and their use is discouraged > > > CREATE INDEX > > > > Yes, and I'm arguing that is the wrong decision. If hash indexes are > > "discouraged", then they shouldn't be in core in the first place. > > Last time we discussed it there were people (IIRC Andrew was one of > them) commenting that they use hash indexes *precisely* because they're > not WAL logged and that they can live with the dangers that creates. I > don't think that's sufficient justification for introducing the feature > at all. But it's nothing new that removing a feature has to fit quite > different criteria than adding one. > > So, by that argument we could remove hash indexes once we have unlogged > indexes on logged tables. But then there's no need to remove them > anymore... > I would object to removing hash indexes as long as the alternative way to index oversized value is to write all my SQL to look like: select count(*) from foo where substr(x,1,2700)=substr($1,1,2700) and x=$1 Now, if the planner were smart enough to realize that x=$1 implies substr(x,1,2700)=substr($1,1,2700), that might be a different matter. But it is not. Or, if there were a way to create a view on foo which would do this implication automatically, but again as far as I know there is not a way to do that either. Cheers, Jeff
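Spelled out as a sketch (index names invented, and $1 standing for the looked-up value as in Jeff's example), the two alternatives being compared are:

    -- btree workaround for values too wide to index directly: index a prefix
    -- expression and repeat the prefix comparison so the planner can use it
    CREATE INDEX foo_x_prefix_idx ON foo (substr(x, 1, 2700));
    SELECT count(*) FROM foo
     WHERE substr(x, 1, 2700) = substr($1, 1, 2700)
       AND x = $1;

    -- hash index: the index stores hashes, so value width is not a problem
    -- and the query can be written naturally
    CREATE INDEX foo_x_hash_idx ON foo USING hash (x);
    SELECT count(*) FROM foo WHERE x = $1;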
Re: [HACKERS] tracking commit timestamps
On 11/1/14, 8:41 AM, Petr Jelinek wrote: Well this is not BDR specific thing, the idea is that with logical replication, commit timestamp is not enough for conflict handling, you also need to have additional info in order to identify some types of conflicts conflicts (local update vs remote update for example). So the extradata field was meant as something that could be used to add the additional info to the xid. Related to this... is there any way to deal with 2 transactions that commit in the same microsecond? It seems silly to try and handle that for every commit since it should be quite rare, but perhaps we could store the LSN as extradata if we detect a conflict? -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] how to handle missing "prove"
On 11/2/14 11:36 AM, Tom Lane wrote: > Committed patch looks good, but should we also add the stanza we discussed > in Makefile.global.in concerning defining $(prove) in terms of "missing" > if we didn't find it? I think the behavior of HEAD when you ask for > --enable-tap-tests but don't have "prove" might be less than ideal. configure will now fail when "prove" is not found. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
On 11/3/14 6:04 AM, Alexey Vasiliev wrote: > 3. What the patch does in a short paragraph: This patch should add > option recovery_timeout, which help to control timeout of > restore_command nonzero status code. Right now default value is 5 > seconds. This is useful, if I using for restore of wal logs some > external storage (like AWS S3) and no matter what the slave database > will lag behind the master. The problem, what for each request to > AWS S3 need to pay, what is why for N nodes, which try to get next > wal log each 5 seconds will be bigger price, than for example each > 30 seconds. That seems useful. I would include something about this use case in the documentation. > This is my first patch. I am not sure about name of option. Maybe it > should called "recovery_nonzero_timeout". The option name had me confused. At first I though this is the time after which a running restore_command invocation gets killed. I think a more precise description might be restore_command_retry_interval. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
David Fetter wrote: > On Mon, Nov 03, 2014 at 01:54:09PM +0100, Michael Banck wrote: > > As an additional datapoint, Vitesse Data changed the DB schema from > > NUMERIC to MONEY for their TPCH benchmark for performance reasons: "The > > modification to data types is easy to understand -- money and double > > types are faster than Numeric (and no one on this planet has a bank > > account that overflows the money type, not any time soon)."[1] And > > "Replaced NUMERIC fields representing currency with MONEY"[2]. > > > > Not sure whether they modified/optimized PostgreSQL with respect to the > > MONEY data type and/or how much performance that gained, so CCing CK Tan > > as well. > > How does our NUMERIC type's performance compare to other systems' > precise types? I realize that some of those systems might have > restrictions on publishing numbers they don't authorize, but they > might have pushed some authorized numbers if those numbers make them > look good. There is a real advantage of money over numeric in the performance front. I haven't measured it, but suffice to say that money uses integer operations which map almost directly to CPU instructions, whereas numeric needs to decode from our varlena base-1 digit format, operate on digits at a time, then encode back. No matter how much you optimize numeric, it's never going to outperform stuff that runs practically on bare electrons. As far as I recall, some TPCH queries run aggregations on currency columns. Now, whether this makes a measurable difference or not in TPCH terms, I have no idea. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
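A rough way to see the effect being described, sketched from psql (table names invented; absolute numbers will vary a lot by machine):

    CREATE TEMP TABLE t_numeric AS
      SELECT (random() * 1000)::numeric(12,2) AS amt
        FROM generate_series(1, 1000000);
    CREATE TEMP TABLE t_money AS
      SELECT amt::money AS amt FROM t_numeric;

    \timing on
    SELECT sum(amt) FROM t_numeric;   -- digit-at-a-time arithmetic on a varlena format
    SELECT sum(amt) FROM t_money;     -- plain int64 arithmetic, typically much faster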
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
On Mon, Nov 03, 2014 at 01:54:09PM +0100, Michael Banck wrote: > Am Sonntag, den 02.11.2014, 12:41 -0500 schrieb Tom Lane: > > BTW, after reflecting a bit more I'm less than convinced that this > > datatype is completely useless. Even if you prefer to store currency > > values in numeric columns, casting to or from money provides a way to > > accept or emit values in whatever monetary format the LC_MONETARY locale > > setting specifies. That seems like a useful feature, and it's one you > > could not easily duplicate using to_char/to_number (not to mention that > > those functions aren't without major shortcomings of their own). > > As an additional datapoint, Vitesse Data changed the DB schema from > NUMERIC to MONEY for their TPCH benchmark for performance reasons: "The > modification to data types is easy to understand -- money and double > types are faster than Numeric (and no one on this planet has a bank > account that overflows the money type, not any time soon)."[1] And > "Replaced NUMERIC fields representing currency with MONEY"[2]. > > Not sure whether they modified/optimized PostgreSQL with respect to the > MONEY data type and/or how much performance that gained, so CCing CK Tan > as well. How does our NUMERIC type's performance compare to other systems' precise types? I realize that some of those systems might have restrictions on publishing numbers they don't authorize, but they might have pushed some authorized numbers if those numbers make them look good. Also, just a general three-state comparison, "we beat them handily," "we're somewhere near them" or "we're not even close" would be enough to work from. Cheers, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Repeatable read and serializable transactions see data committed after tx start
Hi! Given a transaction started with "BEGIN (REPEATABLE READ | SERIALIZABLE)", if a concurrent session commits some data before *any* query within the first transaction, that committed data is seen by the transaction. This is not what I'd expect. Specifically, the documentation states that: "The Repeatable Read isolation level only sees data committed before the transaction began;" [1] IMHO, from a user perspective the transaction begins when the BEGIN command is issued. If I really want to see a "frozen" view of the database before any real SELECT, I have to issue another query like "SELECT 1". This seems odd to me. I understand tx snapshot may be deferred until real execution for performance reasons, but it is confusing from a user perspective. Is this really expected, or is it a bug? Am I having a bad day and missing some point here? ^_^ Regards, Álvaro [1] http://www.postgresql.org/docs/devel/static/transaction-iso.html

P.S. In case it wasn't clear what I meant, here's an example:

Session 1: CREATE TABLE i (i integer);
Session 1: BEGIN ISOLATION LEVEL REPEATABLE READ;
Session 2: INSERT INTO i VALUES (1);
Session 1: SELECT i FROM i;  -- returns 1 row, value 1 -- should return empty set
Session 2: INSERT INTO i VALUES (2);
Session 1: SELECT i FROM i;  -- returns 1 row, value 1 -- returns, as it should, the same as the previous query

In the first select, I'd have expected to have no rows. If a "SELECT 1" is issued after BEGIN, there are no rows found.

-- Álvaro Hernández Tortosa --- 8Kdata
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
On 11/1/14, 1:45 PM, Andrew Dunstan wrote: On 11/01/2014 02:34 PM, Andres Freund wrote: Yeah, if we were trying to duplicate the behavior of indisvalid, there'd need to be a way to detect the invalid index at plan time and not use it. But I'm not sure that that's actually an improvement from the user's standpoint: what they'd see is queries suddenly, and silently, performing a lot worse than they expect. An explicit complaint about the necessary REINDEX seems more user-friendly from where I sit. A REINDEX is imo unlikely to be acceptable. It takes long (why would you bother on a small table?) and locks the relation/indexes. It's a bit of a pity we don't have REINDEX CONCURRENTLY. Reviews welcome: https://commitfest.postgresql.org/action/patch_view?id=1563 -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
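For completeness, the usual workaround until REINDEX CONCURRENTLY lands, sketched for an invented index on an invented table; it trades extra disk space and two short lock windows for avoiding the long exclusive lock, and only works this simply for plain indexes, not for indexes backing constraints:

    CREATE INDEX CONCURRENTLY foo_col_idx_new ON foo (col);
    DROP INDEX CONCURRENTLY foo_col_idx;                 -- available since 9.2
    ALTER INDEX foo_col_idx_new RENAME TO foo_col_idx;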
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Mon, Nov 3, 2014 at 1:02 PM, Simon Riggs wrote: > OK, I think I'm happy with this as a general point. Cool! > How will we automatically test this code? Good question. I can see two possible approaches: 1. We write a test_group_locking harness along the lines of test_shm_mq and test_decoding and put that in contrib. 2. We wait until higher-level facilities built on top of this are available and get regression test coverage of this code via those higher-level modules. Personally, I can't imagine debugging this code without writing some kind of test harness that only does locking; I don't want my locking bugs to be mixed with my planner and executor bugs. But I could go either way on actually putting that code in contrib. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] group locking: incomplete patch, just for discussion
On 3 November 2014 17:00, Robert Haas wrote: > On Sun, Nov 2, 2014 at 7:31 AM, Simon Riggs wrote: >> The main question is still why any of this is necessary? If we are >> only acquiring fast path locks in workers, what need is there for any >> of this? The leader and workers never conflict with each other, so why >> would they ever wait for each other (except at a point where they are >> actively transferring data), which is the necessary pre-condition for >> a deadlock? >> >> Yes, running a command with a conflicting lock would disrupt the start >> of a parallel operation, but as I said above, the deadlock detector >> will eventually see that and re-arrange us so there is no deadlock. >> >> The only need I can see is if one of the workers/leader requests a >> Strong lock on an object, someone else requests a conflicting lock on >> that object and then we perform a parallel operation ourselves. We >> could easily solve that by saying that you can't go parallel if you >> hold a strong lock AND that workers can't take anything higher than >> RowShareLock, like HS backends. That's good enough for now. > > That would be enough to keep the parallel group from deadlocking with > itself, but it wouldn't be enough to prevent deadlock with other > processes, which is the scenario that actually worries me a lot more. > In particular, I'm worried about deadlocks where some other process > gets sandwiched between two processes from the same parallel group. > That is, let A1 be a parallel group leader and A2 be a process from > the same parallel group, and let B be an unrelated process. I'm > worried about the situation where A2 waits for B which waits for A1 > which (outside the view of the deadlock detector) waits for A2. All > that can happen even if the parallel backends hold no strong locks, > because B can hold strong locks. OK, I think I'm happy with this as a general point. How will we automatically test this code? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
At 2014-11-03 17:31:37 +, si...@2ndquadrant.com wrote: > > Stephen's suggestion allows auditing to be conducted without the > users/apps being aware it is taking place. OK, that makes sense. I'm working on a revised patch that does things this way, and will post it in a few days. -- Abhijit -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL
On 14 October 2014 20:33, Abhijit Menon-Sen wrote: > At 2014-10-14 20:09:50 +0100, si...@2ndquadrant.com wrote: >> >> I think that's a good idea. >> >> We could have pg_audit.roles = 'audit1, audit2' > > Yes, it's a neat idea, and we could certainly do that. But why is it any > better than "ALTER ROLE audit_rw SET pgaudit.log = …" and granting that > role to the users whose actions you want to audit? That would make auditing visible to the user, who may then be able to do something about it. Stephen's suggestion allows auditing to be conducted without the users/apps being aware it is taking place. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] COPY TO returning empty result with parallel ALTER TABLE
Hi all, we experienced what seems to be a bug in the COPY TO implementation. When a table is being rewritten by an ALTER TABLE statement, a parallel COPY TO results in an empty result. Consider the following table data: CREATE TABLE test (id INTEGER NOT NULL, PRIMARY KEY (id)); INSERT INTO test (id) SELECT generate_series(1, 1000); One session does: ALTER TABLE test ADD COLUMN dummy BOOLEAN NOT NULL DEFAULT FALSE; This acquires an exclusive lock to the table. Another session now performs parallel: COPY test TO STDOUT; This blocks on the exclusive lock held by the ALTER statement. When the ALTER staement is finished, the COPY statement returns with an empty result. Same goes for COPY (SELECT ...) TO, whereas the same SELECT executed without COPY blocks and returns the correct result as expected. This is my analysis of this issue: The backend opens the rewritten data files, but it ignores the complete content, which indicates the data is being ignored because of XIDs. For direct SELECT statements the top-level query parsing acquires locks on involved tables and creates a new snapshot, but for COPY statements the parsing and locking is done later in COPY code. After locking the tables in COPY code, the data is read with an old snapshot, effectively ignoring all data from the rewritten table. I've check git master and 9.x and all show the same behaviour. I came up with the patch below, which is against curent git master. The patch modifies the COPY TO code to create a new snapshot, after acquiring the necessary locks on the source tables, so that it sees any modification commited by other backends. Despite thinking this is the correct solution, another solution or optimization would be to have ALTER TABLE, which holds the highest lock level, set the XID of rewritten tuples to the FrozenXID, as no other backend should access the table before the ALTER TABLE is committed. Sven diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 6b83576..fe2d157 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -1344,6 +1344,13 @@ BeginCopy(bool is_from, (errcode(ERRCODE_UNDEFINED_COLUMN), errmsg("table \"%s\" does not have OIDs", RelationGetRelationName(cstate->rel; + + /* +* Use a new snapshot to ensure this query sees +* results of any previously executed queries. +*/ + if (!is_from) + PushActiveSnapshot(GetTransactionSnapshot()); } else { @@ -1394,11 +1401,10 @@ BeginCopy(bool is_from, plan = planner(query, 0, NULL); /* -* Use a snapshot with an updated command ID to ensure this query sees +* Use a new snapshot to ensure this query sees * results of any previously executed queries. */ - PushCopiedSnapshot(GetActiveSnapshot()); - UpdateActiveSnapshotCommandId(); + PushActiveSnapshot(GetTransactionSnapshot()); /* Create dest receiver for COPY OUT */ dest = CreateDestReceiver(DestCopyOut); @@ -1741,9 +1747,11 @@ EndCopyTo(CopyState cstate) ExecutorFinish(cstate->queryDesc); ExecutorEnd(cstate->queryDesc); FreeQueryDesc(cstate->queryDesc); - PopActiveSnapshot(); } + /* Discard snapshot */ + PopActiveSnapshot(); + /* Clean up storage */ EndCopy(cstate); } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
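For reference, the reproduction from the report collected into one script, run in two psql sessions as described above:

    -- Session A
    CREATE TABLE test (id integer NOT NULL, PRIMARY KEY (id));
    INSERT INTO test (id) SELECT generate_series(1, 1000);
    ALTER TABLE test ADD COLUMN dummy boolean NOT NULL DEFAULT false;
        -- rewrites the table while holding an exclusive lock

    -- Session B, issued while the ALTER TABLE is still running
    COPY test TO STDOUT;    -- blocks on the lock, then returns an empty result
    SELECT * FROM test;     -- a plain SELECT, by contrast, blocks and then returns all rows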
Re: [HACKERS] Additional role attributes && superuser review
* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote: > > That said, I don't feel very strongly about that position, so if you and > > Robert (and others on the thread) agree that's the right approach then > > I'll see about getting it done. Thanks for trying to make progress on this. > We haven't reached consensus on this one yet and I didn't want it to fall > too far off the radar. > > Here is what I summarize as the current state of the discussion: For my 2c- I don't think we'd be able to drop the old syntax and therefore I'm not sure that I really see the point in adding new syntax. I don't hold that position strongly, but I don't like the idea that we're just going to be sitting on our thumbs because we can't agree on the color of the bike shed. If there is agreement to move forward with using the syntax below, great, that can be implemented in relatively short order. If there isn't, then let's move forward with the existing syntax and change it later, if there is agreement to do so at some point in the future. I don't like the idea of tying it into GRANT. GRANT is SQL-spec defined, quite overloaded already, and role attributes don't act like GRANTs anyway (there's no ADMIN option and they aren't inheirited). > 2. Catalog Representation: > > Condense all attributes in pg_authid to single int64 column and create > bitmasks accordingly. I'd suggest we do this regardless and the only question is if we feel there is need to provide a view to split them back out again or not. I'm inclined to say 'no' to another view and instead suggest we provide SQL-level "has_whatever_privilege" for these which match the existing C functions, update our clients to use those, and encourage others to do the same. Probably also helpful would be a single function that returns a simple list of the non-default role attributes which a user has and we'd simplify psql's describeRoles by having it use that instead. > Obviously there is some concern for upgrading and whether to do both at > once or to do them incrementally. IMHO, I think if the changes are going > to be made, then we should go ahead and do them at the same time. Though, > would it be beneficial to separate in to two distinct patches? I agree that doing these changes at the same time makes sense, if we can get agreement on what exactly to do. I've shared my thoughts, but we really need input from others, ideally of the form of "I prefer X over Y" rather than "well, we could do X or Y". Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Sun, Nov 2, 2014 at 7:31 AM, Simon Riggs wrote: > The main question is still why any of this is necessary? If we are > only acquiring fast path locks in workers, what need is there for any > of this? The leader and workers never conflict with each other, so why > would they ever wait for each other (except at a point where they are > actively transferring data), which is the necessary pre-condition for > a deadlock? > > Yes, running a command with a conflicting lock would disrupt the start > of a parallel operation, but as I said above, the deadlock detector > will eventually see that and re-arrange us so there is no deadlock. > > The only need I can see is if one of the workers/leader requests a > Strong lock on an object, someone else requests a conflicting lock on > that object and then we perform a parallel operation ourselves. We > could easily solve that by saying that you can't go parallel if you > hold a strong lock AND that workers can't take anything higher than > RowShareLock, like HS backends. That's good enough for now. That would be enough to keep the parallel group from deadlocking with itself, but it wouldn't be enough to prevent deadlock with other processes, which is the scenario that actually worries me a lot more. In particular, I'm worried about deadlocks where some other process gets sandwiched between two processes from the same parallel group. That is, let A1 be a parallel group leader and A2 be a process from the same parallel group, and let B be an unrelated process. I'm worried about the situation where A2 waits for B which waits for A1 which (outside the view of the deadlock detector) waits for A2. All that can happen even if the parallel backends hold no strong locks, because B can hold strong locks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Additional role attributes && superuser review
All, > That said, I don't feel very strongly about that position, so if you and > Robert (and others on the thread) agree that's the right approach then > I'll see about getting it done. We haven't reached consensus on this one yet and I didn't want it to fall too far off the radar. Here is what I summarize as the current state of the discussion: 1. Syntax: ALTER ROLE { ADD | DROP } CAPABILITY * I think this is the most straight forward approach as it still close to the current syntax. * Perhaps keeping the current syntax around as deprecated to be removed in a scheduled future version. (provide a "deprecated" message to the user?) or GRANT EXECUTE PRIVILEGES ON TO * Though, this will be tricky since EXECUTE is not reserved and the currently state of GRANT in the grammar would require either refactoring or making it RESERVED... neither I think are acceptable at this point in time for obvious reasons. 2. Catalog Representation: Condense all attributes in pg_authid to single int64 column and create bitmasks accordingly. Obviously there is some concern for upgrading and whether to do both at once or to do them incrementally. IMHO, I think if the changes are going to be made, then we should go ahead and do them at the same time. Though, would it be beneficial to separate in to two distinct patches? -Adam -- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com
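To make option 1 concrete (neither syntax is implemented; the role and capability names below are only illustrative), the proposal would look like the first pair, next to the attribute syntax it would deprecate:

    -- proposed: ALTER ROLE ... { ADD | DROP } CAPABILITY ...
    ALTER ROLE backup_mgr ADD CAPABILITY replication;
    ALTER ROLE backup_mgr DROP CAPABILITY replication;

    -- today's equivalent attribute syntax, which would be kept but deprecated
    ALTER ROLE backup_mgr REPLICATION;
    ALTER ROLE backup_mgr NOREPLICATION;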
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Mon, Nov 3, 2014 at 10:18 AM, Greg Stark wrote: > On Sat, Nov 1, 2014 at 9:09 PM, Robert Haas wrote: >> 1. Any non-trivial piece of PostgreSQL code is likely to contain >> syscache lookups. >> 2. Syscache lookups had better work in parallel workers, or they'll be >> all but useless. > > I've been using parallel sorts and index builds in my mental model of > how this will be used. I note that sorts go out of their way to look > up all the syscache entries in advance precisely so that tuplesort > doesn't start doing catalog lookups in the middle of the sort. This isn't really true. Routines like varstr_cmp() do syscache lookups and de-TOASTing. We've recently tried to reduce that with the sortsupport stuff, but it hasn't eliminated it completely, and there are other cases. This is really the key point that I think everybody needs to understand: a lot of code that you think doesn't do anything complicated actually reads from the database - generally either in the form of syscache lookups, or in the form of de-TOASTing. Most of that code doesn't perform those operations *frequently*, but that's because there are backend-private caches that minimize the number of times we actually hit the database. But restructuring all of that code to do no database access whatsoever would be a very difficult proposition and would involve painful sacrifices, which is, I imagine, why it is the way it is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] group locking: incomplete patch, just for discussion
Greg Stark writes: > On Sat, Nov 1, 2014 at 9:09 PM, Robert Haas wrote: >> 2. Syscache lookups had better work in parallel workers, or they'll be >> all but useless. > I've been using parallel sorts and index builds in my mental model of > how this will be used. I note that sorts go out of their way to look > up all the syscache entries in advance precisely so that tuplesort > doesn't start doing catalog lookups in the middle of the sort. Well, they try to avoid doing the lookups more than once, but that does not by any means imply that no lookups happen after the sort starts. In particular, if you're sorting a container type (array, record) there will be lookups during the first call of the sort type's comparison function. Enums could cause catalog lookups much later than the first call, too. I'm with Robert: if you cannot do catalog lookups in a worker process, the feature will be so crippled as to be essentially useless. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] group locking: incomplete patch, just for discussion
On Sat, Nov 1, 2014 at 9:09 PM, Robert Haas wrote: > 1. Any non-trivial piece of PostgreSQL code is likely to contain > syscache lookups. > 2. Syscache lookups had better work in parallel workers, or they'll be > all but useless. I've been using parallel sorts and index builds in my mental model of how this will be used. I note that sorts go out of their way to look up all the syscache entries in advance precisely so that tuplesort doesn't start doing catalog lookups in the middle of the sort. In general I think what people are imagining is that the parallel workers will be running low-level code like tuplesort that has all the databasey stuff like catalog lookups done in advance and just operates on C data structures like function pointers. And I think that's a valuable coding discipline to enforce, it avoids having low level infrastructure calling up to higher level abstractions which quickly becomes hard to reason about. However in practice I think you're actually right -- but not for the reasons you've been saying. I think the parallel workers *should* be written as low level infrastructure and not be directly doing syscache lookups or tuple locking etc. However there are a million ways in which Postgres is extensible which causes loops in the call graph that aren't apparent in the direct code structure. For instance, what happens if the index you're building is an expression index or partial index? Worse, what happens if those expressions have a plpython function that does queries using SPI But those are the kinds of user code exploiting extensibility are the situations where we need a deadlock detector and where you might need this infrastructure. We wouldn't and shouldn't need a deadlock detector for our own core server code. In an ideal world some sort of compromise that enforces careful locking rules where all locks are acquired in advance and parallel workers are prohibited from obtaining locks in the core code while still allowing users to a free-for-all and detecting deadlocks at runtime for them would be ideal. But I'm not sure there's any real middle ground here. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pipelining executions to postgresql server
On 11/02/2014 09:27 PM, Mikko Tiihonen wrote: > Is the following summary correct: > - the network protocol supports pipelinings Yes. All you have to do is *not* send a Sync message and be aware that the server will discard all input until the next Sync, so pipelining + autocommit doesn't make a ton of sense for error handling reasons. > - the server handles operations in order, starting the processing of next > operation only after fully processing the previous one - thus pipelining is > invisible to the server As far as I know, yes. The server just doesn't care. > - libpq driver does not support pipelining, but that is due to internal > limitations Yep. > - if proper error handling is done by the client then there is no reason why > pipelining could be supported by any pg client Indeed, and most should support it. Sending batches of related queries would make things a LOT faster. PgJDBC's batch support is currently write-oriented. There is no fundamental reason it can't be expanded for reads. I've already written a patch to do just that for the case of returning generated keys. https://github.com/ringerc/pgjdbc/tree/batch-returning-support and just need to rebase it so I can send a pull for upstream PgJDBC. It's already linked in the issues documenting the limitatations in batch support. If you want to have more general support for batches that return rowsets there's no fundamental technical reason why it can't be added. It just requires some tedious refactoring of the driver to either: - Sync and wait before it fills its *send* buffer, rather than trying to manage its receive buffer (the server send buffer), so it can reliably avoid deadlocks; or - Do async I/O in a select()-like loop over a protocol state machine, so it can simultaneously read and write on the wire. I might need to do some of that myself soon, but it's a big (and therefore error-prone) job I've so far avoided by making smaller, more targeted changes. Doing async I/O using Java nio channels is by far the better approach, but also the more invasive one. The driver currently sends data on the wire where it generates it and blocks to receive expected data. Switching to send-side buffer management doesn't have the full performance gains that doing bidirectional I/O via channels does, though, and may be a significant performance _loss_ if you're sending big queries but getting small replies. For JDBC the JDBC batch interface is the right place to do this, and you should not IMO attempt to add pipelining outside that interface. (Multiple open resultsets from portals, yes, but not pipelining of queries). -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pipelining executions to postgresql server
On 11/01/2014 10:04 PM, Mikko Tiihonen wrote: > Hi, > > I created a proof of concecpt patch for postgresql JDBC driver that allows > the caller to do pipelining of requests within a transaction. The pipelining > here means same as for HTTP: the client can send the next execution already > before waiting for the response of the previous request to be fully processed. ... but ... it already does that. Use batches, and that's exactly what it'll do. Statement.addBatch(String) or PreparedStatement.addBatch() > What kind of problems could pipelining cause (assuming we limit it to rather > simple use cases only)? Client/server pipleline deadlocks due to how PgJDBC manages it. See the details: https://github.com/pgjdbc/pgjdbc/issues/195 and https://github.com/pgjdbc/pgjdbc/issues/194 -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Let's drop two obsolete features which are bear-traps for novices
Am Sonntag, den 02.11.2014, 12:41 -0500 schrieb Tom Lane: > BTW, after reflecting a bit more I'm less than convinced that this > datatype is completely useless. Even if you prefer to store currency > values in numeric columns, casting to or from money provides a way to > accept or emit values in whatever monetary format the LC_MONETARY locale > setting specifies. That seems like a useful feature, and it's one you > could not easily duplicate using to_char/to_number (not to mention that > those functions aren't without major shortcomings of their own). As an additional datapoint, Vitesse Data changed the DB schema from NUMERIC to MONEY for their TPCH benchmark for performance reasons: "The modification to data types is easy to understand -- money and double types are faster than Numeric (and no one on this planet has a bank account that overflows the money type, not any time soon)."[1] And "Replaced NUMERIC fields representing currency with MONEY"[2]. Not sure whether they modified/optimized PostgreSQL with respect to the MONEY data type and/or how much performance that gained, so CCing CK Tan as well. Michael [1] http://vitesse-timing-on.blogspot.de/2014/10/running-tpch-on-postgresql-part-1.html [2] http://vitessedata.com/benchmark/ -- Michael Banck Projektleiter / Berater Tel.: +49 (2161) 4643-171 Fax: +49 (2161) 4643-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Hohenzollernstr. 133, 41061 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CINE in CREATE TABLE AS ... and CREATE MATERIALIZED VIEW ...
On Mon, Nov 3, 2014 at 3:12 AM, Rushabh Lathia wrote:
>
> Patch looks good, assigning to committer.
>

Thanks for your review!

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
[HACKERS] Patch: add recovery_timeout option to control timeout of restore_command nonzero status code
Hello everyone.

* Project name: Add recovery_timeout option to control timeout of
  restore_command nonzero status code

* Uniquely identifiable file name, so we can tell the difference between your
  v1 and v24: 0001-add-recovery_timeout-to-controll-timeout-between-res.patch

* What the patch does in a short paragraph: This patch adds a recovery_timeout
  option that controls the retry interval after restore_command returns a
  nonzero status code. Right now the interval is hard-coded at 5 seconds. This
  is useful when WAL segments are restored from external storage (like AWS S3)
  and it does not matter that the slave database lags a little behind the
  master. The problem is that each request to AWS S3 costs money, so N nodes
  that try to fetch the next WAL segment every 5 seconds cost more than nodes
  that retry, for example, every 30 seconds. Until now I worked around this
  with: if ! (/usr/local/bin/envdir /etc/wal-e.d/env /usr/local/bin/wal-e
  wal-fetch "%f" "%p"); then sleep 60; fi -- but that makes restarting and
  stopping the database slower.

* Whether the patch is for discussion or for application: No such thing.

* Which branch the patch is against: master branch

* Whether it compiles and tests successfully, so we know nothing obvious is
  broken: it compiles and passes the tests on my local machine.

* Whether it contains any platform-specific items and if so, has it been
  tested on other platforms: I hope not.

* Confirm that the patch includes regression tests to check the new feature
  actually works as described: No, it doesn't include tests; I have not found
  how to test new config variables.

* Include documentation: added.

* Describe the effect your patch has on performance, if any: it shouldn't
  affect database performance.

This is my first patch. I am not sure about the name of the option; maybe it
should be called "recovery_nonzero_timeout".

--
Alexey Vasiliev

From 35abe56b2497f238a6888fe98c54aa9cb5300866 Mon Sep 17 00:00:00 2001
From: Alexey Vasiliev
Date: Mon, 3 Nov 2014 00:21:14 +0200
Subject: [PATCH] add recovery_timeout to controll timeout between
 restore_command nonzero

---
 doc/src/sgml/recovery-config.sgml               | 16 ++++++++++++++++
 src/backend/access/transam/recovery.conf.sample |  5 +++++
 src/backend/access/transam/xlog.c               | 17 ++++++++++++++++-
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 0f1ff34..bc410b3 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -145,6 +145,22 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
+
+
+     restore_timeout (integer)
+      restore_timeout recovery parameter
+
+
+
+       By default, if restore_command return nonzero status,
+       server will retry command again after 5 seconds. This parameter
+       allow to change this time. This parameter is optional. This can
+       be useful to increase/decrease number of a restore_command calls.
+
+
+
+
+

diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index 7657df3..282e898 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -58,6 +58,11 @@
 #
 #recovery_end_command = ''
 #
+# specifies an optional timeout after nonzero code of restore_command.
+# This can be useful to increase/decrease number of a restore_command calls.
+#
+#restore_timeout = ''
+#
 #---
 # RECOVERY TARGET PARAMETERS
 #---

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3c9aeae..98f0fca 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -233,6 +233,7 @@ static TimestampTz recoveryTargetTime;
 static char *recoveryTargetName;
 static int recovery_min_apply_delay = 0;
 static TimestampTz recoveryDelayUntilTime;
+static int restore_timeout = 0;

 /* options taken from recovery.conf for XLOG streaming */
 static bool StandbyModeRequested = false;
@@ -5245,6 +5246,20 @@ readRecoveryCommandFile(void)
 					(errmsg_internal("trigger_file = '%s'",
 									 TriggerFile)));
 		}
+		else if (strcmp(item->name, "restore_timeout") == 0)
+		{
+			const char *hintmsg;
+
+			if (!parse_int(item->value, &restore_timeout, GUC_UNIT_MS,
+						   &hintmsg))
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("parameter \"%s\" requires a temporal value",
+								"restore_timeout"),
+						 hintmsg ? errhint("%s", _(hintmsg)) : 0));
+			ereport(DEBUG2,
+					(errmsg_internal("restore_timeout = '%s'", item->value)));
+		}
 		else if (strcmp(item->name, "recovery_min_apply_delay") == 0)
 		{
 			const char *hintmsg;
@@ -0,7 +11125,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr Re
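For illustration, assuming the patch above is applied as posted, a recovery.conf
using the new parameter could look like the following. The restore_command is
the wal-e command from the description, and the 30-second value is only an
example; since the value is parsed with parse_int(..., GUC_UNIT_MS, ...), a
plain number of milliseconds should also be accepted.

standby_mode = 'on'
restore_command = '/usr/local/bin/envdir /etc/wal-e.d/env /usr/local/bin/wal-e wal-fetch "%f" "%p"'

# Retry a failed restore_command every 30 seconds instead of the default 5:
restore_timeout = '30s'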