Re: [HACKERS] tuplesort_gettuple_common() and *should_free argument
On 10/22/2016 02:45 AM, Peter Geoghegan wrote:
> I noticed that the routine tuplesort_gettuple_common() fails to set
> *should_free in all paths in the master branch (no bug in 9.6). Consider
> the TSS_SORTEDONTAPE case, where we can return false without also
> setting "*should_free = false" to instruct the caller not to pfree()
> tuple memory that tuplesort still owns. I suppose that this is okay
> because the caller should never pfree() a tuple when a NULL pointer was
> returned at a higher level (as "pointer-to-tuple"). Even still, it seems
> like a bad idea to rely on the caller testing things such that it
> doesn't go on to pfree() a NULL pointer when seemingly instructed to do
> so by a tuplesort "get tuple" interface routine (that is, some higher
> level routine that calls tuplesort_gettuple_common()).
>
> More importantly, there are no remaining cases where
> tuplesort_gettuple_common() sets "*should_free = true", because there is
> no remaining need for the caller to *ever* pfree() a tuple. Moreover, I
> don't anticipate any future need for this -- the scheme recently
> established around per-tape "slab slots" makes it seem unlikely to me
> that the need for "*should_free = true" will ever arise again. So, for
> Postgres 10, why not rip out the "*should_free" arguments that appear in
> a bunch of places? This lets us slightly simplify most tuplesort.c
> callers.

Yeah, I agree we should just remove the *should_free arguments. Should we be worried about breaking the API of the tuplesort_get* functions? They might be used by extensions. I think that's OK, but wanted to bring it up. This would be only for master, of course, and any breakage would be straightforward to fix.

> Now, it is still true that at least some callers of
> tuplesort_gettupleslot() (but not any other "get tuple" interface
> routine) are going to *require* ownership of memory for returned tuples,
> which means a call to heap_copy_minimal_tuple() remains necessary there
> (see recent commit d8589946ddd5c43e1ebd01c5e92d0e177cbfc198 for
> details).
> But, that's beside the point. In the master branch only, the
> tuplesort_gettupleslot() test "if (!should_free)" seems tautological,
> just as similar tests are for every other tuplesort_gettuple_common()
> caller. So, tuplesort_gettupleslot() can safely assume it should
> *always* heap_copy_minimal_tuple() in the master branch. (BTW, I'm going
> to teach tuplesort_gettuple_common() to avoid this
> heap_copy_minimal_tuple() call for callers that don't actually need it,
> but that's a discussion for another thread.)

Yep.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Thu, Dec 8, 2016 at 9:07 AM, Robert Haas wrote:
> You could do that, but first I would code up the simplest, cleanest
> algorithm you can think of and see if it even shows up in a 'perf'
> profile. Microbenchmarking is probably overkill here unless a problem
> is visible on macrobenchmarks.

This is what I would go for! The current code is doing a simple thing: select the Nth element using qsort() after scanning each WAL sender's values. And I think that Sawada-san got it right. Even for a pgbench run on my laptop with 10 sync standbys, using a data set that fits into memory, SyncRepGetOldestSyncRecPtr gets at most 0.04% of overhead according to perf top on a non-assert, non-debug build. Hash tables and allocations get a far larger share. With the patch, SyncRepGetSyncRecPtr is at the same level with a quorum set of 10 nodes. Let's keep it simple for now; an extra patch could make things better later on if that's worth it.
--
Michael
Re: [HACKERS] varlena beyond 1GB and matrix
On 8 December 2016 at 12:01, Kohei KaiGai wrote:
>> At a higher level, I don't understand exactly where such giant
>> ExpandedObjects would come from. (As you point out, there's certainly
>> no easy way for a client to ship over the data for one.) So this feels
>> like a very small part of a useful solution, if indeed it's part of a
>> useful solution at all, which is not obvious.
>>
> I expect an aggregate function that consumes millions of rows as the source
> of a large matrix larger than 1GB. Once it is formed into a variable, it is
> easy to deliver as an argument of PL functions.

You might be interested in how Java has historically dealt with similar issues.

For a long time the JVM had quite low limits on the maximum amount of RAM it could manage, in the single gigabytes, even for the 64-bit JVM. Once those limitations were lifted, the garbage collector algorithm placed a low practical limit on how much RAM it could cope with effectively. If you were doing scientific computing with Java, lots of big image/video work, using GPGPUs, doing large scale caching, etc, this rapidly became a major pain point. So people introduced external memory mappings to Java, where objects could reference and manage memory outside the main JVM heap. The most well known is probably BigMemory (https://www.terracotta.org/products/bigmemory), but there are many others. They exposed this via small opaque handle objects that you used to interact with the external memory store via library functions.

It might make a lot of sense to apply the same principle to PostgreSQL, since it's much less intrusive than true 64-bit VARLENA. Rather than extending all of PostgreSQL to handle special-case split-up VARLENA extended objects, have your interim representation be a simple opaque value that points to externally mapped memory. Your operators for the type, etc, know how to work with it.
You probably don't need a full suite of normal operators; you'll be interacting with the data in a limited set of ways. The main issue would presumably be one of resource management, since we currently assume we can just copy a Datum around without telling anybody about it or doing any special management. You'd need to know when to clobber your external segment, when to copy(!) it if necessary, etc. This probably makes sense for working with GPGPUs anyway, since they like dealing with big contiguous chunks of memory (or used to, may have improved?).

It sounds like only code specifically intended to work with the oversized type should be doing much with it except passing it around as an opaque handle, right?

Do you need to serialize this type to/from disk at all? Or just exchange it in chunks with a client? If you do need to, can you possibly do TOAST-like or pg_largeobject-like storage where you split it up for on-disk storage then reassemble for use?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] varlena beyond 1GB and matrix
On 8 December 2016 at 07:36, Tom Lane wrote:
> Likewise, the need for clients to be able to transfer data in chunks
> gets pressing well before you get to 1GB. So there's a lot here that
> really should be worked on before we try to surmount that barrier.

Yeah. I tend to agree with Tom here. Allowing >1GB varlena-like objects, when we can barely cope with our existing ones in dump/restore, in clients, etc, doesn't strike me as quite the right direction to go in.

I understand it solves a specific, niche case you're dealing with when exchanging big blobs of data with a GPGPU. But since the client doesn't actually see that large blob (it's split up into objects that will work on the current protocol and interfaces), why is it necessary to have instances of a single data type with >1GB values, rather than take a TOAST-like / pg_largeobject-like approach and split it up for storage?

I'm concerned that this adds a special-case format that will create maintenance burden and pain down the track, and it won't help with pain points users face like errors dumping/restoring rows with big varlena objects, problems efficiently exchanging them on the wire protocol, etc.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Mmm. I tried the same thing in select_common_type and it came out a bit messier.

At Wed, 07 Dec 2016 23:44:19 -0500, Tom Lane wrote in <15128.1481172...@sss.pgh.pa.us>
> Attached is a patch that fixes this by storing typmod info in the RTE.
> This turned out to be straightforward, and I think it's definitely
> what we should do in HEAD. I have mixed emotions about whether it's
> worth doing anything about it in the back branches.

With it, VALUES works the same as UNION, as the documentation says.

https://www.postgresql.org/docs/8.2/static/queries-values.html

Past versions have the same documentation, so back-patching the *behavior* seems to me worth doing. Instead of modifying the RTE, re-coercing the first row's values would be enough for back-patching (though I'm not sure offhand how to do that).

> I chose to redefine the existing coltypes/coltypmods/colcollations
> lists for CTE RTEs as also applying to VALUES RTEs. That saves a
> little space in the RangeTblEntry nodes and allows sharing code
> in a couple of places. It's tempting to consider making that apply
> to all RTE types, which would permit collapsing expandRTE() and
> get_rte_attribute_type() into a single case. But AFAICS there would
> be no benefit elsewhere, so I'm not sure the extra code churn is
> justified.

Agreed.

> BTW, I noticed that the CTE case of expandRTE() fails to assign the
> specified location to the generated Vars, which is clearly a bug
> though a very minor one; it would result in failing to display a
> parse error location in some cases where we would do so for Vars from
> other RTE types. That part might be worth back-patching, not sure.

If we do back-patch the VALUES patch, including that would be worthwhile, I think.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
2016-12-07 22:17 GMT+01:00 Alvaro Herrera :
> Tom Lane wrote:
>
> > In HEAD, we could change the RTE data structure so that
> > transformValuesClause could save the typmod information in the RTE,
> > keeping the lookups cheap.
>
> Hmm, I think this would be useful for the XMLTABLE patch too. I talked
> a bit about it at
> https://www.postgresql.org/message-id/20161122204730.dgipy6gxi25j4e6a@alvherre.pgsql
>
> The patch has evolved quite a bit since then, but the tupdesc continues
> to be a problem. Latest patch is at
> https://www.postgresql.org/message-id/CAFj8pRBsrhwR636-_3TPbqu%3DFo3_DDer6_yp_afzR7qzhW1T6Q%40mail.gmail.com

What do you dislike about the tupdesc usage there?

regards

Pavel

> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Attached is a patch that fixes this by storing typmod info in the RTE. This turned out to be straightforward, and I think it's definitely what we should do in HEAD. I have mixed emotions about whether it's worth doing anything about it in the back branches.

I chose to redefine the existing coltypes/coltypmods/colcollations lists for CTE RTEs as also applying to VALUES RTEs. That saves a little space in the RangeTblEntry nodes and allows sharing code in a couple of places. It's tempting to consider making that apply to all RTE types, which would permit collapsing expandRTE() and get_rte_attribute_type() into a single case. But AFAICS there would be no benefit elsewhere, so I'm not sure the extra code churn is justified.

BTW, I noticed that the CTE case of expandRTE() fails to assign the specified location to the generated Vars, which is clearly a bug though a very minor one; it would result in failing to display a parse error location in some cases where we would do so for Vars from other RTE types. That part might be worth back-patching, not sure.

regards, tom lane

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index e30c57e..d973225 100644
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*** _copyRangeTblEntry(const RangeTblEntry *
*** 2149,2161 ****
  	COPY_NODE_FIELD(functions);
  	COPY_SCALAR_FIELD(funcordinality);
  	COPY_NODE_FIELD(values_lists);
- 	COPY_NODE_FIELD(values_collations);
  	COPY_STRING_FIELD(ctename);
  	COPY_SCALAR_FIELD(ctelevelsup);
  	COPY_SCALAR_FIELD(self_reference);
! 	COPY_NODE_FIELD(ctecoltypes);
! 	COPY_NODE_FIELD(ctecoltypmods);
! 	COPY_NODE_FIELD(ctecolcollations);
  	COPY_NODE_FIELD(alias);
  	COPY_NODE_FIELD(eref);
  	COPY_SCALAR_FIELD(lateral);
--- 2149,2160 ----
  	COPY_NODE_FIELD(functions);
  	COPY_SCALAR_FIELD(funcordinality);
  	COPY_NODE_FIELD(values_lists);
  	COPY_STRING_FIELD(ctename);
  	COPY_SCALAR_FIELD(ctelevelsup);
  	COPY_SCALAR_FIELD(self_reference);
! 	COPY_NODE_FIELD(coltypes);
! 	COPY_NODE_FIELD(coltypmods);
! 	COPY_NODE_FIELD(colcollations);
  	COPY_NODE_FIELD(alias);
  	COPY_NODE_FIELD(eref);
  	COPY_SCALAR_FIELD(lateral);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index b7a109c..edc1797 100644
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
*** _equalRangeTblEntry(const RangeTblEntry
*** 2460,2472 ****
  	COMPARE_NODE_FIELD(functions);
  	COMPARE_SCALAR_FIELD(funcordinality);
  	COMPARE_NODE_FIELD(values_lists);
- 	COMPARE_NODE_FIELD(values_collations);
  	COMPARE_STRING_FIELD(ctename);
  	COMPARE_SCALAR_FIELD(ctelevelsup);
  	COMPARE_SCALAR_FIELD(self_reference);
! 	COMPARE_NODE_FIELD(ctecoltypes);
! 	COMPARE_NODE_FIELD(ctecoltypmods);
! 	COMPARE_NODE_FIELD(ctecolcollations);
  	COMPARE_NODE_FIELD(alias);
  	COMPARE_NODE_FIELD(eref);
  	COMPARE_SCALAR_FIELD(lateral);
--- 2460,2471 ----
  	COMPARE_NODE_FIELD(functions);
  	COMPARE_SCALAR_FIELD(funcordinality);
  	COMPARE_NODE_FIELD(values_lists);
  	COMPARE_STRING_FIELD(ctename);
  	COMPARE_SCALAR_FIELD(ctelevelsup);
  	COMPARE_SCALAR_FIELD(self_reference);
! 	COMPARE_NODE_FIELD(coltypes);
! 	COMPARE_NODE_FIELD(coltypmods);
! 	COMPARE_NODE_FIELD(colcollations);
  	COMPARE_NODE_FIELD(alias);
  	COMPARE_NODE_FIELD(eref);
  	COMPARE_SCALAR_FIELD(lateral);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 0d858f5..7258c03 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** _outRangeTblEntry(StringInfo str, const
*** 2841,2855 ****
  			break;
  		case RTE_VALUES:
  			WRITE_NODE_FIELD(values_lists);
! 			WRITE_NODE_FIELD(values_collations);
  			break;
  		case RTE_CTE:
  			WRITE_STRING_FIELD(ctename);
  			WRITE_UINT_FIELD(ctelevelsup);
  			WRITE_BOOL_FIELD(self_reference);
! 			WRITE_NODE_FIELD(ctecoltypes);
! 			WRITE_NODE_FIELD(ctecoltypmods);
! 			WRITE_NODE_FIELD(ctecolcollations);
  			break;
  		default:
  			elog(ERROR, "unrecognized RTE kind: %d", (int) node->rtekind);
--- 2841,2857 ----
  			break;
  		case RTE_VALUES:
  			WRITE_NODE_FIELD(values_lists);
! 			WRITE_NODE_FIELD(coltypes);
! 			WRITE_NODE_FIELD(coltypmods);
! 			WRITE_NODE_FIELD(colcollations);
  			break;
  		case RTE_CTE:
  			WRITE_STRING_FIELD(ctename);
  			WRITE_UINT_FIELD(ctelevelsup);
  			WRITE_BOOL_FIELD(self_reference);
! 			WRITE_NODE_FIELD(coltypes);
! 			WRITE_NODE_FIELD(coltypmods);
! 			WRITE_NODE_FIELD(colcollations);
  			break;
  		default:
  			elog(ERROR, "unrecognized RTE kind: %d", (int) node->rtekind);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index c587d4e..d608530 100644
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*** _readRangeTblEntry(void)
*** 1314,1328 ****
  			break;
  		case RTE_VALUES:
  			READ_NODE_FIELD(values_lists);
! 			READ_NODE_FIELD(values_collations);
  			break;
  		case RTE_
Re: [HACKERS] Declarative partitioning - another take
On Thu, Dec 8, 2016 at 1:39 PM, Andres Freund wrote:
> On 2016-12-07 13:20:04 -0500, Robert Haas wrote:
>> On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers wrote:
>> >> My bad. The fix I sent last night for one of the cache flush issues
>> >> wasn't quite right. The attached seems to fix it.
>> > Yes, fixed here too. Thanks.
>>
>> Thanks for the report - that was a good catch.
>
> Congrats to everyone working on this! This is a large step forward.

Congratulations to all! It was a long way to this result.
--
Michael
Re: [HACKERS] Declarative partitioning - another take
On 2016-12-07 13:20:04 -0500, Robert Haas wrote:
> On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers wrote:
> >> My bad. The fix I sent last night for one of the cache flush issues
> >> wasn't quite right. The attached seems to fix it.
> > Yes, fixed here too. Thanks.
>
> Thanks for the report - that was a good catch.

Congrats to everyone working on this! This is a large step forward.

- Andres
Re: [HACKERS] Unlogged tables cleanup
On Thu, Dec 8, 2016 at 6:03 AM, Robert Haas wrote:
> Michael, your greetings were passed on to me with a request that I
> look at this thread.

Thanks for showing up!

> On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier wrote:
>>>> More seriously, if there could be more details regarding that, I would
>>>> think that we could say something like "logging the init fork is
>>>> mandatory in any case to ensure its on-disk presence at recovery
>>>> replay, even on non-default tablespaces whose base location is deleted
>>>> and re-created from scratch if the WAL record in charge of creating
>>>> this tablespace gets replayed". The problem shows up because of
>>>> tablespaces being deleted at replay at the end... So perhaps this
>>>> makes sense. What do you think?
>>>
>>> Yes, that's about what I think we need to explain.
>>
>> OK, what do you think about the attached then?
>>
>> I came up with something like that for those code paths:
>> - /* Write the page. If archiving/streaming, XLOG it. */
>> + /*
>> +  * Write the page and log it unconditionally. This is important
>> +  * particularly for indexes created on tablespaces and databases
>> +  * whose creation happened after the last redo pointer as recovery
>> +  * removes any of their existing content when the corresponding
>> +  * create records are replayed.
>> +  */
>> I have not been able to use the word "create" less than that. The
>> patch is really repetitive, but I think that we had better mention the
>> need of logging the content in each code path and not touch
>> log_newpage() to keep it a maximum flexible.
>
> In blinsert.c, nbtree.c, etc. how about:
>
> Write the page and log it. It might seem that an immediate sync would
> be sufficient to guarantee that the file exists on disk, but recovery
> itself might remove it while replaying, for example, an
> XLOG_DBASE_CREATE record. Therefore, we need this even when
> wal_level=minimal.

OK, I rewrote the patch a bit, as attached. What do you think?
>>> Actually, I'm wondering if we ought to just switch this over to using
>>> the delayChkpt mechanism instead of going through this complicated
>>> song-and-dance. Incurring an immediate sync to avoid having to WAL-log
>>> this is a bit tenuous, but having to WAL-log it AND do the immediate
>>> sync just seems silly, and I'm actually a bit worried that even with
>>> your fix there might still be a latent bug somewhere. We couldn't
>>> switch mechanisms cleanly in the 9.2 branch (cf.
>>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
>>> back-patch it in the form you have it in already, but it's probably
>>> worth switching over at least in master.
>>
>> Thinking through it... Could we not just rip out the sync phase,
>> because there is delayChkpt? It seems to me that what counts is that
>> the commit of the transaction that creates the relation does not get
>> past the redo point. It would matter for read-uncommitted transactions,
>> but that concept does not exist in PG, and never will. On back branches
>> it is definitely safer to keep the sync; I am just thinking about a
>> HEAD-only optimization here, as you do.
>
> Right (I think). If we set and clear delayChkpt around this work, we
> don't need the immediate sync.

My point is a bit different from what you mean, I think: the transaction creating an unlogged relfilenode would not need to even set delayChkpt in the empty() routines, because other transactions would not refer to it until this transaction has committed. So I am arguing for just removing the sync phase.
--
Michael

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 0946aa29ec..a31149f044 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -164,13 +164,18 @@ blbuildempty(Relation index)
 	metapage = (Page) palloc(BLCKSZ);
 	BloomFillMetapage(index, metapage);
 
-	/* Write the page. If archiving/streaming, XLOG it. */
+	/*
+	 * Write the page and log it.  It might seem that an immediate sync
+	 * would be sufficient to guarantee that the file exists on disk, but
+	 * recovery itself might remove it while replaying, for example, an
+	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we
+	 * need this even when wal_level=minimal.
+	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
 			  (char *) metapage, true);
-	if (XLogIsNeeded())
-		log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-					BLOOM_METAPAGE_BLKNO, metapage, false);
+	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+				BLOOM_METAPAGE_BLKNO, metapage, false);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/nbtree/nbtree.c b/src
Re: [HACKERS] Patch to implement pg_current_logfile() function
Hello Robert,

On Wed, 7 Dec 2016 18:52:24 -0500 Robert Haas wrote:
> On Tue, Dec 6, 2016 at 6:11 PM, Gilles Darold wrote:
> > It seems that all fixes have been included in this patch.

I read this and knew that I hadn't finished my review, but didn't immediately respond because I thought the patch had to be marked "ready for committer" on commitfest.postgresql.org before the committers would spend a lot of time on it. I don't have the time to fully respond, or I'd be able to attach new code, but I want to comment before anybody else spends a lot of time on this patch.

> Yikes. This patch has certainly had a lot of review, but it has lots
> of basic inconsistencies with PostgreSQL coding style, which one would
> think somebody would have noticed by now,

Yes and no. I don't really know what the coding style is, and rather than make multiple corrections to code that might get weeded out and discarded, I've been focusing on getting the logic right. Really, I've not put much thought into it except for trying to write in the style of what's already there, and sticking to < 80 chars per line. There has been lots of review.

The only truly effective way I've found to communicate regarding the pg_current_logfiles patch has been to write code and provide detailed test cases. I could be wrong, but unless I submit code I don't feel like I've been understood. I just haven't been interested in re-formatting somebody else's code before I think the code really works.

> I don't have any objection
> to the basic idea of this patch, but it's got to look and feel like
> the surrounding PostgreSQL code. And not be unnecessarily
> complicated.

I've been introducing some complication, but I hope it's necessary. To keep the review process simple my plan has been to submit multiple patches: one with the basics, and more on top of that which introduce complication that can be considered and accepted or rejected.
So I send emails with multiple patches, some that I think need to go into the core patch and others to be kept separate. But, although I try, this does not seem to have been communicated, because I keep getting emails back that contain only a single patch. Maybe something's wrong with my review process, but I don't know how to fix it. If you think a single patch is the way to go I can open up separate tickets at commitfest.postgresql.org. But this seems like overkill because all the patches touch the same code.

The extreme case is the attached "cleanup_rotate" patch. (Which applies to v14 of this patch.) It's nothing but a little bit of tidying up of the master branch, but it does touch code that's already being modified, so it seems like the committers would want to look at it at the same time as when reviewing the pg_current_logfile patch. Once you've looked at the pg_current_logfile patch you've already looked at and modified code in the function the "cleanup_rotate" patch touches.

But the "cleanup_rotate" patch got included in the v15 version of the patch and is also in v16. I'm expecting to submit it as a separate patch along with the pg_current_logfile patch and tried to be very, very clear about this. It really has nothing to do with the pg_current_logfile functionality. (And it is the only "separate" patch which really has nothing to do with the pg_current_logfile functionality.) More examples of separate patches are below. Any guidance would be appreciated.

> Detailed comments below:
> rm_log_metainfo() could be merged into logfile_writename(), since
> they're always called in conjunction and in the same pattern.

This would be true, if it weren't for the attached "retry_current_logfiles" patches that were applied in v15 of the patch and removed from v16 due to some mis-communication I don't understand. (But the attached patches apply on top of the v14 patch. I don't have time to refactor them now.) FYI.
The point of the "retry_current_logfiles" patch series is to handle the case where logfile_writename gets an ENFILE or EMFILE. When this happens, the current_logfiles file can "skip" and not contain the name(s) of the current log file for an entire log rotation. To miss, say, a month of logs because the logs only rotate monthly and your log processing is based on reading the current_logfiles file sounds like a problem.

What I was attempting to communicate in my email response to the v15 patch is that the (attached, but applying to the v14 patch again) "backoff" patch should be submitted as a separate patch. It seems over-complicated, but exists in case a committer is worried about re-trying writes on a system that's busy enough to generate ENFILE or EMFILE errors.

> +if (errno == ENFILE || errno == EMFILE)
> +ereport(LOG,
> +(errmsg("system is too busy to write logfile meta
> info, %s will be updated on next rotation (or use SIGHUP to retry)",
> LOG_METAINFO_DATAFILE)));
>
> This seems like a totally unprincipled reaction to a purely arbitrary
> subset of
Re: [HACKERS] varlena beyond 1GB and matrix
2016-12-08 8:36 GMT+09:00 Tom Lane :
> Robert Haas writes:
>> On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai wrote:
>>> I like to propose a new optional type handler 'typserialize' to
>>> serialize an in-memory varlena structure (that can have indirect
>>> references) to on-disk format.
>
>> I think it's probably a mistake to conflate objects with substructure
>> with objects > 1GB. Those are two somewhat orthogonal needs.
>
> Maybe. I think where KaiGai-san is trying to go with this is being
> able to turn an ExpandedObject (which could contain very large amounts
> of data) directly into a toast pointer or vice versa. There's nothing
> really preventing a TOAST OID from having more than 1GB of data
> attached, and if you had a side channel like this you could transfer
> the data without ever having to form a larger-than-1GB tuple.
>
> The hole in that approach, to my mind, is that there are too many places
> that assume that they can serialize an ExpandedObject into part of an
> in-memory tuple, which might well never be written to disk, or at least
> not written to disk in a table. (It might be intended to go into a sort
> or hash join, for instance.) This design can't really work for that case,
> and unfortunately I think it will be somewhere between hard and impossible
> to remove all the places where that assumption is made.
>
Regardless of the ExpandedObject, does the flattened format need to contain fully flattened data chunks? If a data type internally contains multiple toast pointers, as an array might, its flattened image is likely small enough to store using the existing varlena mechanism. One problem is that VARSIZE() will never tell us the exact total length of the data even if it references multiple GB-scale chunks.

> At a higher level, I don't understand exactly where such giant
> ExpandedObjects would come from. (As you point out, there's certainly
> no easy way for a client to ship over the data for one.)
> So this feels
> like a very small part of a useful solution, if indeed it's part of a
> useful solution at all, which is not obvious.
>
I expect an aggregate function that consumes millions of rows as the source of a large matrix larger than 1GB. Once it is formed into a variable, it is easy to deliver as an argument of PL functions.

> FWIW, ExpandedObjects themselves are far from a fully fleshed out
> concept, one of the main problems being that they don't have very long
> lifespans except in the case that they're the value of a plpgsql
> variable. I think we would need to move things along quite a bit in
> that area before it would get to be useful to think in terms of
> ExpandedObjects containing multiple GB of data. Otherwise, the
> inevitable flattenings and re-expansions are just going to kill you.
>
> Likewise, the need for clients to be able to transfer data in chunks
> gets pressing well before you get to 1GB. So there's a lot here that
> really should be worked on before we try to surmount that barrier.
>
You are pointing out the problem around the client<->server protocol, aren't you? Likely, we will eventually need this enhancement. I agree.

Thanks,
--
KaiGai Kohei
Re: [HACKERS] varlena beyond 1GB and matrix
2016-12-08 8:04 GMT+09:00 Robert Haas :
> On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai wrote:
>> I like to propose a new optional type handler 'typserialize' to
>> serialize an in-memory varlena structure (that can have indirect
>> references) to on-disk format.
>> If any, it shall be invoked at the head of toast_insert_or_update(),
>> so that indirect references are transformed to something else which
>> is safe to save. (My expectation is, the 'typserialize' handler
>> preliminarily saves the indirect chunks to the toast relation, then
>> puts toast pointers instead.)
>
> This might not work. The reason is that we have important bits of
> code that expect that they can figure out how to do some operation on
> a datum (find its length, copy it, serialize it) based only on typlen
> and typbyval. See src/backend/utils/adt/datum.c for assorted
> examples. Note also the lengthy comment near the top of the file,
> which explains that typlen > 0 indicates a fixed-length type, typlen
> == -1 indicates a varlena, and typlen == -2 indicates a cstring. I
> think there's probably room for other typlen values; for example, we
> could have typlen == -3 indicate some new kind of thing -- a
> super-varlena that has a higher length limit, or some other kind of
> thing altogether.
>
> Now, you can imagine trying to treat what you're talking about as a
> new type of TOAST pointer, but I think that's not going to help,
> because at some point the TOAST pointer gets de-toasted into a varlena
> ... which is still limited to 1GB. So that's not really going to
> work. And it brings out another point, which is that if you define a
> new typlen code, like -3, for super-big things, they won't be
> varlenas, which means they won't work with the existing TOAST
> interfaces. Still, you might be able to fix that. You would probably
> have to do some significant surgery on the wire protocol, per the
> commit message for fa2fa995528023b2e6ba1108f2f47558c6b66dcd.
>
Hmm...
The reason I didn't introduce the idea of a 64-bit varlena format is that the approach seems too invasive for the existing PostgreSQL core and extensions, because I assumed this "long" variable-length datum would utilize and enhance the existing varlena infrastructure. However, once we have infrastructure completely independent of the existing varlena, we would not run the risk of mixing the two code paths. That seems like an advantage to me. My concern about ExpandedObject is that its flat format still needs a 32-bit varlena header, which restricts the total length to 1GB, so it cannot represent a large data chunk. That is why I didn't think the current ExpandedObject is a solution for us. Regarding the protocol, I want to consider support for large records after the internal data format, because my expected primary usage is internal, for in-database analytics, where the user gets results from complicated logic described in a PL function. > I think it's probably a mistake to conflate objects with substructure > with objects > 1GB. Those are two somewhat orthogonal needs. As Jim > also points out, expanded objects serve the first need. Of course, > those assume that we're dealing with a varlena, so if we made a > super-varlena, we'd still need to create an equivalent. But perhaps > it would be a fairly simple adaptation of what we've already got. > Handling objects >1GB at all seems like the harder part of the > problem. > I think I mostly get your point. Does the last line above refer to the sheer amount of data in an object >1GB, even if the "super-varlena" format allows a 64-bit length? Thanks, -- KaiGai Kohei -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Wed, Dec 7, 2016 at 3:42 PM, Tom Lane wrote: > > > Hmm ... after further experimentation, I still can't get this version of > systemd (231) to do anything evil. It turns out that Fedora ships it with > KillUserProcesses turned off by default, and maybe having that on is a > prerequisite for the other behavior? But that doesn't make a lot of sense > because we'd never be seeing the reports of databases moaning about lost > semaphores if the processes got killed first. Anyway, I see nothing bad > happening if KillUserProcesses is off, while if it's on then the database > gets shut down reasonably politely via SIGTERM. > > Color me confused ... maybe systemd's behavior has changed? > Hrm, the following incantation seems to break for me on a fresh Fedora 25 system: 1) As root su to $USER and start postgres. 2) ssh in as $USER and then logout 3) # psql localhost FATAL: semctl(4980742, 3, SETVAL, 0) failed: Invalid argument LOG: server process (PID 14569) exited with exit code 1 ...
Re: [HACKERS] [sqlsmith] Short reads in hash indexes
On Thu, Dec 8, 2016 at 2:38 AM, Andreas Seltenreich wrote: > Andreas Seltenreich writes: > >> Amit Kapila writes: >> >>> On Sat, Dec 3, 2016 at 3:44 PM, Andreas Seltenreich >>> wrote: Amit Kapila writes: > [2. text/x-diff; fix_hash_bucketsplit_sqlsmith_v1.patch] Ok, I'll do testing with the patch applied. >> >> Good news: the assertion hasn't fired since the patch is in. > > Meh, it fired again today after being silent for 100e6 queries :-/ > I guess I need to add some confidence qualification on such statements. > Maybe sigmas as they do at CERN… > >> smith=# select * from state_report where sqlstate = 'XX001'; >> -[ RECORD 1 >> ]-- >> count| 10 >> sqlstate | XX001 >> sample | ERROR: could not read block 1173 in file "base/16384/17256": >> read only 0 of 8192 bytes >> hosts| {airbisquit,frell,gorgo,marbit,pillcrow,quakken} >> >>> Hmm, I am not sure if this is related to previous problem, but it >>> could be. Is it possible to get the operation and or callstack for >>> above failure? >> >> Ok, will turn the elog into an assertion to get at the backtraces. > > Doing so on top of 4212cb7, I caught the backtrace below. Query was: > Thanks for the report, I will look into it. I think this one is quite similar to what Jeff has reported [1]. [1] - https://www.postgresql.org/message-id/CAMkU%3D1ydfriLCOriJ%3DAxtF%3DhhBOUUcWtf172vquDrj%3D3T7yXmg%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Declarative partitioning - another take
Hi Robert, On 2016/12/08 3:20, Robert Haas wrote: > On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers wrote: >>> My bad. The fix I sent last night for one of the cache flush issues >>> wasn't quite right. The attached seems to fix it. >> Yes, fixed here too. Thanks. > > Thanks for the report - that was a good catch. > > I've committed 0001 - 0006 with that correction and a few other > adjustments. There's plenty of room for improvement here, and almost > certainly some straight-up bugs too, but I think we're at a point > where it will be easier and less error-prone to commit follow on > changes incrementally rather than by continuously re-reviewing a very > large patch set for increasingly smaller changes. +1 and thanks a lot for your and everyone else's very patient support in reviewing the patches. > Some notes: > > * We should try to teach the executor never to scan the parent. > That's never necessary with this system, and it might add significant > overhead. We should also try to get rid of the idea of the parent > having storage (i.e. a relfilenode). Agreed, I will start investigating. > * The fact that, in some cases, locking requirements for partitioning > are stronger than those for inheritance is not good. We made those > decisions for good reasons -- namely, data integrity and not crashing > the server -- but it would certainly be good to revisit those things > and see whether there's any room for improvement. +1 > * I didn't commit 0007, which updates the documentation for this new > feature. That patch removes more lines than it adds, and I suspect > what is needed here > is an expansion of the documentation rather than a diminishing of it. Hmm, I had mixed feelings about what to do there as well. Right now we have the description of various new features buried in the "VI. Reference" section of the documentation, which is simply meant as a command reference. I agree that the new partitioning warrants more expansion in the DDL partitioning chapter.
Will see how that could be done. > * The fact that there's no implementation of row movement should be > documented as a limitation. We should also look at removing that > limitation. Yes, something to improve. By the way, since we currently mention INSERT tuple-routing directly in the description of partitioned tables in the CREATE TABLE command reference, is that also the place to list this particular limitation? Or is the UPDATE command reference the correct place? Thanks, Amit
Re: [HACKERS] Parallel execution and prepared statements
On Wed, Dec 7, 2016 at 12:30 AM, Robert Haas wrote: > On Tue, Dec 6, 2016 at 1:07 PM, Tom Lane wrote: >> Robert Haas writes: >>> Done. >> >> The comment seems quite confused now: >> >> * If a tuple count was supplied or data is being written to relation, we >> * must force the plan to run without parallelism, because we might exit >> * early. >> >> Exit early is exactly what we *won't* do if writing to an INTO rel, so >> I think this will confuse future readers. I think it should be more like >> >> * If a tuple count was supplied, we must force the plan to run without >> * parallelism, because we might exit early. Also disable parallelism >> * when writing into a relation, because [ uh, why exactly? ] >> >> Considering that the writing would happen at top level of the executor, >> and hence in the parent process, it's not actually clear to me why the >> second restriction is there at all: can't we write tuples to a rel even >> though they came from a parallel worker? In any case, the current wording >> of the comment is a complete fail at explaining this. > > Oops. You're right. [ uh, why exactly? ] -> no database changes > whatsoever are allowed while in parallel mode. (This restriction > might be lifted someday, but right now we're stuck with it.) > Attached patch changes the comment based on above suggestions. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com [Attachment: fix_comment_parallel_mode_v1.patch]
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Alvaro Herrera writes: > Tom Lane wrote: >> In HEAD, we could change the RTE data structure so that >> transformValuesClause could save the typmod information in the RTE, >> keeping the lookups cheap. > Hmm, I think this would be useful for the XMLTABLE patch too. I talked > a bit about it at > https://www.postgresql.org/message-id/20161122204730.dgipy6gxi25j4e6a@alvherre.pgsql I dunno. If your example there is correct that XMLTABLE can be called as a plain function in a SELECT list, then I doubt that we want to tie anything about it to the RTE data structure. If anything, the case where it appears in FROM seems to need to be treated as a generic RTE_FUNCTION case. I've been trying to avoid getting involved in the XMLTABLE patch, mainly because I know zip about XML, but maybe I need to take a look. regards, tom lane
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
On 2016/12/08 3:33, Robert Haas wrote: > On Wed, Dec 7, 2016 at 1:30 PM, Robert Haas wrote: >> -- partitioned table cannot partiticipate in regular inheritance >> CREATE TABLE partitioned2 ( >> a int >> --- 392,411 >> c text, >> d text >> ) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d >> collate "en_US"); >> + ERROR: collation "en_US" for encoding "SQL_ASCII" does not exist > ... >> No idea why yet, but I'll try to figure it out. > > And of course that'd be because relying on en_US isn't portable. Sigh. Should've thought about the non-portability of locales. Thanks for catching and fixing anyway! Thanks, Amit
Re: [HACKERS] Declarative partitioning - another take
On 2016/12/08 1:53, Erik Rijkers wrote: > On 2016-12-07 17:38, Robert Haas wrote: >> On Wed, Dec 7, 2016 at 11:34 AM, Amit Langote >> wrote: begin; create schema if not exists s; create table s.t (c text, d text, id serial) partition by list ((ascii(substring(coalesce(c, d, ''), 1, 1)))); create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); it logs as follows: 2016-12-07 17:03:45.787 CET 6125 LOG: server process (PID 11503) was terminated by signal 11: Segmentation fault 2016-12-07 17:03:45.787 CET 6125 DETAIL: Failed process was running: create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); >>> >>> Hm, will look into this in a few hours. >> >> My bad. The fix I sent last night for one of the cache flush issues >> wasn't quite right. The attached seems to fix it. > > Yes, fixed here too. Thanks. Thanks for reporting and the fix, Erik and Robert! Thanks, Amit
Re: [HACKERS] Separate connection handling from backends
On 7 December 2016 at 22:27, Kevin Grittner wrote: > I don't know how that execution model would compare to what we use > now in terms of performance, but its popularity makes it hard to > ignore as something to consider. Those engines also tend to be threaded. They can stash state in memory and hand it around between executors in ways we cannot really do. I'd love to see a full separation of executor from session in postgres, but I can't see how it could be at all practical. The use of globals for state and the assumption that session == backend is baked in way too deep. At least, I think it'd be a slow and difficult thing to change, and would need many steps. Something like what was proposed upthread would possibly make sense as a first step. But again, I don't see anyone who's likely to actually do it. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
On Thu, Dec 8, 2016 at 12:06 AM, Tom Lane wrote: > Stephen Frost writes: >> It would be really nice if we would detect that some other postmaster is >> already using a given tablespace directory and to throw an error and >> complain rather than starting up thinking everything is fine. > > In principle, we could have the postmaster run through $PGDATA/pg_tblspc > and drop a lockfile into each referenced directory. But the devil is in > the details --- in particular, not sure how to get the right thing to > happen during a CREATE TABLESPACE. Also, I kinda doubt that this is going > to fix anything for the replica-on-same-machine problem. That's where having a node-based ID would become helpful, which is different from the global system ID. Ages ago when working on Postgres-XC, we took care of this problem by appending a node-name suffix to the tablespace folder name (the one prefixed with PGXX). When applying this concept to PG, we could have standbys set up this node ID each time recovery is done using a backup_label. This won't solve the problem of tablespaces already created, which should be handled by users when taking the base backup by remapping them. But it would address the problem for newly created ones. -- Michael
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Robert Haas writes: > On Wed, Dec 7, 2016 at 6:49 PM, Tom Lane wrote: >> This still doesn't address the real question, which is whether RemoveIPC >> does anything if KillUserProcesses is off, and whether that behavior >> has changed. I don't see anything about RemoveIPC in that thread. > http://www.dsm.fordham.edu/cgi-bin/man-cgi.pl?topic=logind.conf&sect=5 > suggests that KillUserProcesses and RemoveIPC are separate cleanup > behaviors triggered by the same underlying cause (termination of last > session). Yeah, I read that man page too, but ... The test case I was using was to ssh into the box, launch a postmaster using the old-school "nohup postmaster &" technique, and log out. What I saw was that the "/usr/lib/systemd/systemd --user" process Alex referred to would be launched when the ssh connection started, and would stick around as long as the postmaster was there, if KillUserProcesses was off. (If it was on, something SIGTERM'd the postmaster as soon as I disconnected.) So if they really are independent behaviors, I'd have expected the same something to have killed the semaphores as soon as I disconnected; but that did NOT happen. [ Yes, RemoveIPC is definitely on: I turned it on explicitly in logind.conf, just in case the comment claiming it's on by default is a lie. ] BTW, I also tried this from the console, but the results were confused by the fact that GNOME seems to launch approximately a metric buttload of "helper" processes, which don't disappear when I log out. If that's the behavior Lennart is trying to get rid of, I can see his point; but I tend to agree with the other comments in that thread that this should be fixed in GNOME not by breaking longstanding working assumptions. When I get a chance, I think I'll try F24 and see if it behaved differently. F23 might be interesting too if it's still downloadable.
regards, tom lane
Re: [HACKERS] Quorum commit for multiple synchronous replication.
Hello, my context switch was incomplete that time, sorry. There are multiple "target LSN"s, so we need the kth-largest LSN. At Wed, 7 Dec 2016 19:04:23 +0900, Michael Paquier wrote in > On Wed, Dec 7, 2016 at 5:17 PM, Masahiko Sawada wrote: > > Sorry, I could not understand this algorithm. Could you elaborate > > this? It takes only O(n) times? > > Nah, please forget that, that was a random useless thought. There is > no way to be able to select the k-th element without knowing the > hierarchy induced by the others, which is what the partial sort would > help with here. So, let's consider some cases:

- needing a 3-quorum among 5 standbys: no problem whichever kth-largest method we choose; of course qsort is fine.
- needing 10 quorums among 100 standbys: I'm not sure there's any difference from 3/5.
- needing 10 quorums among 1000 standbys: obviously qsort does too much work, and some kind of kth-largest algorithm should be used. For instance, quickselect, with its O(n) average and O(n^2) worst case, seems better than the O(n log n) average and O(n^2) worst case of qsort.
- needing 700 quorums among 1000 standbys: I don't think this case is worth considering, but kth-largest is better even here.

Even if we decide 700/1000 is out of the current scope, I still recommend using kth-largest selection. If we don't, the quorum calculation is triggered by every standby reply message, and in the worst case the frequency of the calculation seems close to the frequency of WAL insertion. (Right?) Even kth-largest takes too long with 1000 standbys. Maintaining the kth-largest value in shared memory needs less CPU but leads to worse contention on the shared memory. Conversely, we already have the waiting LSNs of backends in the procarray. If we kept another array ordered by waiting LSN, with a condition variable on the number of confirming walsenders, each walsender could individually look it up and count it. That might perform better, but I'm not sure.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] varlena beyond 1GB and matrix
I wrote: > Maybe. I think where KaiGai-san is trying to go with this is being > able to turn an ExpandedObject (which could contain very large amounts > of data) directly into a toast pointer or vice versa. There's nothing > really preventing a TOAST OID from having more than 1GB of data > attached, and if you had a side channel like this you could transfer > the data without ever having to form a larger-than-1GB tuple. BTW, you could certainly imagine attaching such infrastructure for direct-to-TOAST-table I/O to ExpandedObjects today, independently of any ambitions about larger-than-1GB values. I'm not entirely sure how often it would get exercised, which is the key subtext of what I wrote before, but it's clearly a possible optimization of what we do now. regards, tom lane
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Tue, Dec 6, 2016 at 11:26 PM, Michael Paquier wrote: > On Wed, Dec 7, 2016 at 12:32 PM, Fujii Masao wrote: >> So, isn't it better to compare the performance of some algorithms and >> confirm which is the best for quorum commit? Since this code is hot, i.e., >> can be very frequently executed, I'd like to avoid waste of cycle as much >> as possible. > > It seems to me that it would be simple enough to write a script to do > that to avoid any other noise: allocate an array with N random > elements, and fetch the M-th element from it after applying a sort > method. I highly doubt that you'd see much difference with a low > number of elements, now if you scale at a thousand standbys in a > quorum set you may surely see something :*) > Anybody willing to try out? You could do that, but first I would code up the simplest, cleanest algorithm you can think of and see if it even shows up in a 'perf' profile. Microbenchmarking is probably overkill here unless a problem is visible on macrobenchmarks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Wed, Dec 7, 2016 at 6:49 PM, Tom Lane wrote: > Alvaro Herrera writes: >> Tom Lane wrote: >>> Color me confused ... maybe systemd's behavior has changed? > >> https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/ZNQW72UP36UAFMX53HPFFQTWTQDZVJ3M/ > > I see Lennart hasn't gotten any less convinced that he's always right > since I left Red Hat :-( This thread does seem to give that impression. It's nice to know there are engineers in the world even more arrogant than we are. :-) > But anyway, it's a demonstrable fact that Fedora 25 has KillUserProcesses > off by default, even though it contains systemd-231. I assume FESCO > brought down the hammer at some point. https://pagure.io/fesco/issue/1600 seems to suggest that it's merely in abeyance. (See the first two updates and the last one for the executive summary.) > This still doesn't address the real question, which is whether RemoveIPC > does anything if KillUserProcesses is off, and whether that behavior > has changed. I don't see anything about RemoveIPC in that thread. http://www.dsm.fordham.edu/cgi-bin/man-cgi.pl?topic=logind.conf&sect=5 suggests that KillUserProcesses and RemoveIPC are separate cleanup behaviors triggered by the same underlying cause (termination of last session). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Patch to implement pg_current_logfile() function
Robert Haas writes: > Yikes. This patch has certainly had a lot of review, but it has lots > of basic inconsistencies with PostgreSQL coding style, which one would > think somebody would have noticed by now, and there are multiple > places where the logic seems to do in a complicated way something that > could be done significantly more simply. I don't have any objection > to the basic idea of this patch, but it's got to look and feel like > the surrounding PostgreSQL code. And not be unnecessarily > complicated. A lot of the code-formatting issues could probably be fixed by running it through pgindent. Also, when you are starting from code that was written with little regard for Postgres layout conventions, it's a really good idea to see what pgindent will do to it, because it might not look very nice afterwards. regards, tom lane
Re: [HACKERS] Patch to implement pg_current_logfile() function
On Tue, Dec 6, 2016 at 6:11 PM, Gilles Darold wrote: > It seems that all fixes have been included in this patch. Yikes. This patch has certainly had a lot of review, but it has lots of basic inconsistencies with PostgreSQL coding style, which one would think somebody would have noticed by now, and there are multiple places where the logic seems to do in a complicated way something that could be done significantly more simply. I don't have any objection to the basic idea of this patch, but it's got to look and feel like the surrounding PostgreSQL code. And not be unnecessarily complicated. Detailed comments below:

The SGML doesn't match the surrounding style - for example, typically appears on a line by itself.

+if (!Logging_collector) {

Formatting.

rm_log_metainfo() could be merged into logfile_writename(), since they're always called in conjunction and in the same pattern. The function is poorly named; it should be something like update_metainfo_datafile().

+if (errno == ENFILE || errno == EMFILE)
+ereport(LOG,
+(errmsg("system is too busy to write logfile meta info, %s will be updated on next rotation (or use SIGHUP to retry)", LOG_METAINFO_DATAFILE)));

This seems like a totally unprincipled reaction to a purely arbitrary subset of errors. EMFILE or ENFILE doesn't represent a general "too busy" condition, and logfile_open() has already logged the real error.

+snprintf(tempfn, sizeof(tempfn), "%s.tmp", LOG_METAINFO_DATAFILE);

You don't really need to use snprintf() here. You could #define LOG_METAINFO_DATAFILE_TMP LOG_METAINFO_DATAFILE ".tmp" and compute this at compile time instead of runtime.

+if (PG_NARGS() == 1) {
+fmt = PG_ARGISNULL(0) ? NULL : PG_GETARG_TEXT_PP(0);
+if (fmt != NULL) {
+logfmt = text_to_cstring(fmt);
+if ( (strcmp(logfmt, "stderr") != 0) &&
+(strcmp(logfmt, "csvlog") != 0) ) {
+ereport(ERROR,
+(errmsg("log format %s not supported, possible values are stderr or csvlog", logfmt)));
+PG_RETURN_NULL();
+}
+}
+} else {
+logfmt = NULL;
+}

Formatting. This is unlike PostgreSQL style in multiple ways, including cuddled braces, extra parentheses and spaces, and use of braces around a single-line block. Also, this could be written in a much less contorted way, like:

if (PG_NARGS() == 0 || PG_ARGISNULL(0))
    logfmt = NULL;
else
{
    logfmt = text_to_cstring(PG_GETARG_TEXT_PP(0));
    if (strcmp(logfmt, "stderr") != 0 && strcmp(logfmt, "csvlog") != 0)
        ereport(...);
}

Also, the ereport() needs an errcode(), not just an errmsg(). Otherwise it defaults to ERRCODE_INTERNAL_ERROR, which is not correct.

+/* Check if current log file is present */
+if (stat(LOG_METAINFO_DATAFILE, &stat_buf) != 0)
+PG_RETURN_NULL();

Useless test. The code that follows catches this case anyway and handles it the same way. Which is good, because this has an inherent race condition. The previous if (!Logging_collector) test seems fairly useless too; unless there's a bug in your syslogger.c code, the file won't exist anyway, so we'll return NULL for that reason with no extra code needed here.

+/*
+* Read the file to gather current log filename(s) registered
+* by the syslogger.
+*/

Whitespace isn't right.

+while (fgets(lbuffer, sizeof(lbuffer), fd) != NULL) {
+char log_format[10];
+int i = 0, space_pos = 0;
+
+/* Check for a read error. */
+if (ferror(fd)) {

More coding style issues.

+ereport(ERROR,
+(errcode_for_file_access(),
+errmsg("could not read file \"%s\": %m", LOG_METAINFO_DATAFILE)));
+FreeFile(fd);
+break;

ereport(ERROR) doesn't return, so the code that follows is pointless.

+if (strchr(lbuffer, '\n') != NULL)
+*strchr(lbuffer, '\n') = '\0';

Probably worth adding a local variable instead of doing this twice. Local variables are cheap, and the code would be more readable.

+if ((space_pos == 0) && (isspace(lbuffer[i]) != 0))

Too many parentheses. Also, I think the whole loop in which this is contained could be eliminated entirely. Just search for the first ' ' character with strchr(); you don't need to care about isspace() because you know the code that writes this file uses ' ' specifically. Overwrite it with '\0'. And then use a pointer to lbuffer for the log format and a pointer to lbuffer + strchr_result + 1 for the pathname.

+if ((space_pos != (int)strlen("stderr")) &&
+(space_pos != (int)strlen("csvlog")))
+{
+ereport(ERROR,
+(errmsg("unexpected line format in file %s", LOG_METAINFO_DATAFILE)));
+break;
+}

I think this is pointless. Validating the length of the log format but not the content seems kind of silly, and why do either? T
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Alvaro Herrera writes: > Tom Lane wrote: >> Color me confused ... maybe systemd's behavior has changed? > https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/ZNQW72UP36UAFMX53HPFFQTWTQDZVJ3M/ I see Lennart hasn't gotten any less convinced that he's always right since I left Red Hat :-( But anyway, it's a demonstrable fact that Fedora 25 has KillUserProcesses off by default, even though it contains systemd-231. I assume FESCO brought down the hammer at some point. This still doesn't address the real question, which is whether RemoveIPC does anything if KillUserProcesses is off, and whether that behavior has changed. I don't see anything about RemoveIPC in that thread. regards, tom lane
Re: [HACKERS] varlena beyond 1GB and matrix
Robert Haas writes: > On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai wrote: >> I like to propose a new optional type handler 'typserialize' to >> serialize an in-memory varlena structure (that can have indirect >> references) to on-disk format. > I think it's probably a mistake to conflate objects with substructure > with objects > 1GB. Those are two somewhat orthogonal needs. Maybe. I think where KaiGai-san is trying to go with this is being able to turn an ExpandedObject (which could contain very large amounts of data) directly into a toast pointer or vice versa. There's nothing really preventing a TOAST OID from having more than 1GB of data attached, and if you had a side channel like this you could transfer the data without ever having to form a larger-than-1GB tuple. The hole in that approach, to my mind, is that there are too many places that assume that they can serialize an ExpandedObject into part of an in-memory tuple, which might well never be written to disk, or at least not written to disk in a table. (It might be intended to go into a sort or hash join, for instance.) This design can't really work for that case, and unfortunately I think it will be somewhere between hard and impossible to remove all the places where that assumption is made. At a higher level, I don't understand exactly where such giant ExpandedObjects would come from. (As you point out, there's certainly no easy way for a client to ship over the data for one.) So this feels like a very small part of a useful solution, if indeed it's part of a useful solution at all, which is not obvious. FWIW, ExpandedObjects themselves are far from a fully fleshed out concept, one of the main problems being that they don't have very long lifespans except in the case that they're the value of a plpgsql variable. I think we would need to move things along quite a bit in that area before it would get to be useful to think in terms of ExpandedObjects containing multiple GB of data. 
Otherwise, the inevitable flattenings and re-expansions are just going to kill you. Likewise, the need for clients to be able to transfer data in chunks gets pressing well before you get to 1GB. So there's a lot here that really should be worked on before we try to surmount that barrier. regards, tom lane
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Tom Lane wrote: > Hmm ... after further experimentation, I still can't get this version of > systemd (231) to do anything evil. It turns out that Fedora ships it with > KillUserProcesses turned off by default, and maybe having that on is a > prerequisite for the other behavior? But that doesn't make a lot of sense > because we'd never be seeing the reports of databases moaning about lost > semaphores if the processes got killed first. Anyway, I see nothing bad > happening if KillUserProcesses is off, while if it's on then the database > gets shut down reasonably politely via SIGTERM. > > Color me confused ... maybe systemd's behavior has changed? https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/ZNQW72UP36UAFMX53HPFFQTWTQDZVJ3M/ -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Partitionning: support for Truncate Table WHERE
Here is an example:

CREATE OR REPLACE FUNCTION truncate_table_where(v_table VARCHAR, v_where_condition VARCHAR)
RETURNS void AS $$
DECLARE
    v_stmt varchar;
    v_tableoid oid;
    v_part varchar;
    v_found_other integer;
BEGIN
    LOOP
        -- find one partition still containing rows that match the condition
        v_stmt := 'SELECT tableoid FROM '||v_table||' WHERE '||v_where_condition||' limit 1 ';
        EXECUTE v_stmt INTO v_tableoid;
        IF (v_tableoid is null) THEN
            EXIT;
        END IF;
        SELECT pg_namespace.nspname||'.'||pg_class.relname INTO v_part
          FROM pg_catalog.pg_class
          INNER JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
          WHERE pg_class.oid = v_tableoid;
        RAISE NOTICE 'Partition found: %', v_part;
        -- check if other data in part
        v_stmt := 'SELECT 1 FROM '||v_part||' WHERE NOT ('||v_where_condition||') limit 1 ';
        EXECUTE v_stmt INTO v_found_other;
        IF (v_found_other = 1) THEN
            v_stmt := 'DELETE FROM '||v_part||' WHERE '||v_where_condition;
            RAISE NOTICE 'Executing: %', v_stmt;
            EXECUTE v_stmt;
        ELSE
            v_stmt := 'TRUNCATE '||v_part;
            RAISE NOTICE 'Executing: %', v_stmt;
            EXECUTE v_stmt;
        END IF;
    END LOOP;
END
$$ LANGUAGE plpgsql;

From: Amit Langote
Sent: Wednesday, December 7, 2016 06:58:03
To: Craig Ringer; legrand legrand
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Partitionning: support for Truncate Table WHERE

On 2016/12/07 15:26, Craig Ringer wrote: > On 7 December 2016 at 07:29, legrand legrand > wrote: > >> Working in a DSS environment, we often need to truncate table partitions >> regarding a WHERE condition and have to >> [...] >> Would be pleased to hear your feedback regarding this. > > It sounds like something that'd be useful to do on top of declarative > partitioning, once that is merged. Perhaps you could start by reading > and testing the declarative partitioning patch. That'll give you a > better idea of the practicalities of doing what you propose on top of > it, and give you an opportunity to suggest changes to the declarative > partitioning scheme that might make conditional truncate easier later. Agreed. 
If I understand the request correctly, TRUNCATE on the parent table (a partitioned table), which currently recurses to *all* child tables (partitions), should have a restricting WHERE clause, right? It would become possible to implement something like that with the new declarative partitioned tables. As Craig mentioned, you can take a look at the discussion about declarative partitioning in the emails linked to at the following page: https://commitfest.postgresql.org/12/611/ Thanks, Amit
Re: [HACKERS] varlena beyond 1GB and matrix
On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai wrote: > I like to propose a new optional type handler 'typserialize' to > serialize an in-memory varlena structure (that can have indirect > references) to on-disk format. > If any, it shall be invoked at the head of toast_insert_or_update(), > then indirect references are transformed to something other which > is safe to save. (My expectation is, the 'typserialize' handler > preliminarily saves the indirect chunks to the toast relation, then > puts toast pointers instead.) This might not work. The reason is that we have important bits of code that expect that they can figure out how to do some operation on a datum (find its length, copy it, serialize it) based only on typlen and typbyval. See src/backend/utils/adt/datum.c for assorted examples. Note also the lengthy comment near the top of the file, which explains that typlen > 0 indicates a fixed-length type, typlen == -1 indicates a varlena, and typlen == -2 indicates a cstring. I think there's probably room for other typlen values; for example, we could have typlen == -3 indicate some new kind of thing -- a super-varlena that has a higher length limit, or some other kind of thing altogether. Now, you can imagine trying to treat what you're talking about as a new type of TOAST pointer, but I think that's not going to help, because at some point the TOAST pointer gets de-toasted into a varlena ... which is still limited to 1GB. So that's not really going to work. And it brings out another point, which is that if you define a new typlen code, like -3, for super-big things, they won't be varlenas, which means they won't work with the existing TOAST interfaces. Still, you might be able to fix that. You would probably have to do some significant surgery on the wire protocol, per the commit message for fa2fa995528023b2e6ba1108f2f47558c6b66dcd. I think it's probably a mistake to conflate objects with substructure with objects > 1GB. Those are two somewhat orthogonal needs. 
As Jim also points out, expanded objects serve the first need. Of course, those assume that we're dealing with a varlena, so if we made a super-varlena, we'd still need to create an equivalent. But perhaps it would be a fairly simple adaptation of what we've already got. Handling objects >1GB at all seems like the harder part of the problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_dump vs. TRANSFORMs
Stephen Frost writes: > As pointed out by Peter E, this also impacts CASTs. Attached is a patch > which addresses both by simply also pulling any functions which are > referenced from pg_cast or pg_transform when they have OIDs at or after > FirstNormalObjectId. I also modified dumpCast() and dumpTransform() to > complain loudly if they're unable to dump out the cast or transform due > to not finding the function definition(s) necessary. Please do not hack the already-overcomplicated query in getFuncs without bothering to adjust the long comment that describes what it's doing. I have a vague feeling that the code for dumping casts and/or transforms may have some assumptions that the underlying function is also being dumped. Although maybe the assumption was really only what's fixed here, ie that there be a DumpableObject for the function. Anyway, take a close look for that. Looks reasonable otherwise. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On 2016-12-06 23:54:43 -0500, Tom Lane wrote: > You're attacking a straw man. I didn't propose changing our behavior > anywhere but Linux. AFAIK, on that platform unnamed POSIX semaphores > are futexes, which have been a stable feature since 2003 according to > https://en.wikipedia.org/wiki/Futex#History. Anybody who did need > to compile PG for use with a pre-2.6 kernel could override the default, > anyway. Back then futexes weren't "robust", though (crash handling and such was unusable). They only started to be reliable in the ~2007-2008 timeframe, IIRC. That still should be OK, though. Regards, Andres
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Hi, On 2016-12-06 21:53:06 -0500, Tom Lane wrote: > Just saw another report of what's probably systemd killing off Postgres' > SysV semaphores, as we've discussed previously at, eg, > https://www.postgresql.org/message-id/flat/57828C31.5060409%40gmail.com > Since the systemd people are generally impervious to suggestions that > they might be mistaken, I do not expect this problem to go away. Would doing so actually solve the systemd issue? Doesn't systemd also remove SysV shared memory, which we still use a tiny bit of? > I think we should give serious consideration to back-patching commit > ecb0d20a9, which changed the default semaphore type to unnamed-POSIX > on Linux. We've seen no problems in the buildfarm in the two months > that that's been in HEAD. If we don't do this, we can expect to > continue seeing complaints of this sort until pre-v10 PG releases > fall out of use ... and I don't want to wait that long. I'm a bit uncomfortable back-patching this change before it has seen production usage. Both the POSIX and SysV semaphore implementations have evolved over the years, with changing performance characteristics. I don't think it's fair to users to swap a proven solution out for something that hasn't seen a lot of load. Greetings, Andres Freund
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Alex Hunsaker writes: > On Wed, Dec 7, 2016 at 1:12 PM, Tom Lane wrote: >> But this is all kind of moot if Peter is right that systemd will zap >> POSIX shmem along with SysV semaphores. I've been trying to reproduce >> the issue on a Fedora 25 installation, and so far I can't get it to >> zap anything, so I'm a bit at a loss how to prove things one way or >> the other. > After logon, you should see "/usr/lib/systemd/systemd --user" running for > that user. After logout out, said proc should exit. Hmm ... after further experimentation, I still can't get this version of systemd (231) to do anything evil. It turns out that Fedora ships it with KillUserProcesses turned off by default, and maybe having that on is a prerequisite for the other behavior? But that doesn't make a lot of sense because we'd never be seeing the reports of databases moaning about lost semaphores if the processes got killed first. Anyway, I see nothing bad happening if KillUserProcesses is off, while if it's on then the database gets shut down reasonably politely via SIGTERM. Color me confused ... maybe systemd's behavior has changed? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Indirect indexes
On Thu, Oct 20, 2016 at 11:30 AM, Pavan Deolasee wrote: > We have a design to convert WARM chains back to HOT and that will increase > the percentage of WARM updates much beyond 50%. I was waiting for feedback > on the basic patch before putting in more efforts, but it went unnoticed > last CF. While you did sign up to review one patch in the last CF, the amount of review you did for that patch is surely an order of magnitude less than what WARM will require. Maybe more than that. I don't mean to point the finger at you specifically -- there are lots of people slinging patches into the CommitFest who aren't doing as much review as their own patches will require. I'm putting a lot of time into reviewing patches this year, and basically none into writing my own, but I still can't review every major patch that somebody submits. I can't even do committer review of all of those patches, let alone first-round review. Perhaps I ought to rank the things I review by descending order of importance, in which case this arguably ought to be pretty high on the list. But I'd feel somewhat bad working on this instead of, say, multivariate statistics or unique joins, which have been pending for a lot longer. Anyway, the point, not just to you but to everybody, is that the review can't always be left to other people. Some people will review and not contribute any code, and that's great. Some people will contribute code but not review, and to the extent that we can support that, it's also great. But the giant backlog of unreviewed patches which has accumulated shows that we have too many people needing more review than they produce, and that is not great. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] varlena beyond 1GB and matrix
On 12/7/16 5:50 AM, Kohei KaiGai wrote: > If and when this structure is fetched from the tuple, its @ptr_block is > initialized to NULL. Once it is supplied to a function which references a > part of blocks, type specific code can load sub-matrix from the toast > relation, then update the @ptr_block not to load the sub-matrix from the > toast multiple times. I'm not certain whether it is acceptable > behavior/manner. I'm glad you're looking into this. The 1G limit is becoming a larger problem every day. Have you considered using ExpandedObjects to accomplish this? I don't think the API would work as-is, but I suspect there's other places where we'd like to be able to have this capability (arrays and JSONB come to mind). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532)
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Wed, Dec 7, 2016 at 1:12 PM, Tom Lane wrote: > But this is all kind of moot if Peter is right that systemd will zap > POSIX shmem along with SysV semaphores. I've been trying to reproduce > the issue on a Fedora 25 installation, and so far I can't get it to > zap anything, so I'm a bit at a loss how to prove things one way or > the other. > Don't know precisely about Fedora 25, but I've had success in the past with:

1. ssh in as the user
2. start postgres under tmux/screen
3. logout
4. do another ssh login/logout cycle

After logon, you should see "/usr/lib/systemd/systemd --user" running for that user. After logging out, said proc should exit. If either of those is not true, either systemd is not set up to track sessions (probably via pam) or it thinks you still have an active logon. Another way to check if systemd thinks the user is logged in is if /run/user/$USER exists.
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Tom Lane wrote: > In HEAD, we could change the RTE data structure so that > transformValuesClause could save the typmod information in the RTE, > keeping the lookups cheap. Hmm, I think this would be useful for the XMLTABLE patch too. I talked a bit about it at https://www.postgresql.org/message-id/20161122204730.dgipy6gxi25j4e6a@alvherre.pgsql The patch has evolved quite a bit since then, but the tupdesc continues to be a problem. Latest patch is at https://www.postgresql.org/message-id/CAFj8pRBsrhwR636-_3TPbqu%3DFo3_DDer6_yp_afzR7qzhW1T6Q%40mail.gmail.com -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Indirect indexes
On Fri, Oct 21, 2016 at 7:04 PM, Alvaro Herrera wrote: > Robert Haas wrote: >> So, I think that this is a really promising direction, but also that >> you should try very hard to try to get out from under this 6-byte PK >> limitation. That seems really ugly, and in practice it probably means >> your PK is probably going to be limited to int4, which is kind of sad >> since it leaves people using int8 or text PKs out in the cold. > > I think we could just add a new type, unsigned 6 byte int, specifically > for this purpose. Little in the way of operators, as it's pointless to > try to do arithmetic with object identifiers. (It'd be similar to UUID > in spirit, but I wouldn't try to do anything too smart to generate them.) Sure, we could do that, but that's just band-aiding over the fact that the index page format isn't really what we want for a feature of this type. > Yes, recheck is always needed. > > As for vacuum, I was thinking this morning that perhaps the answer to > that is just to not vacuum the index at all and instead rely on the > killtuple interface (which removes tuples during scan). So we don't > need to spend precious maint_work_mem space on a large list of PK > values. I don't think that's going to fly. Even if it's the case that indirect indexes typically need less cleanup than regular indexes, the idea that there's no command to force a full cleanup short of REINDEX doesn't sit well with me. It's not difficult to construct realistic scenarios in which kill_prior_tuple is almost useless (e.g. values are all marching forward). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Short reads in hash indexes
Andreas Seltenreich writes: > Amit Kapila writes: > >> On Sat, Dec 3, 2016 at 3:44 PM, Andreas Seltenreich >> wrote: >>> Amit Kapila writes: >>> [2. text/x-diff; fix_hash_bucketsplit_sqlsmith_v1.patch] >>> Ok, I'll do testing with the patch applied. > > Good news: the assertion hasn't fired since the patch is in. Meh, it fired again today after being silent for 100e6 queries :-/ I guess I need to add some confidence qualification on such statements. Maybe sigmas as they do at CERN… > smith=# select * from state_report where sqlstate = 'XX001'; > -[ RECORD 1 > ]-- > count| 10 > sqlstate | XX001 > sample | ERROR: could not read block 1173 in file "base/16384/17256": read > only 0 of 8192 bytes > hosts| {airbisquit,frell,gorgo,marbit,pillcrow,quakken} > >> Hmm, I am not sure if this is related to previous problem, but it >> could be. Is it possible to get the operation and or callstack for >> above failure? > > Ok, will turn the elog into an assertion to get at the backtraces. Doing so on top of 4212cb7, I caught the backtrace below. 
Query was: --8<---cut here---start->8--- set max_parallel_workers_per_gather = 0; select count(1) from public.hash_name_heap as ref_2 join public.rtest_emplog as sample_1 on (ref_2.random = sample_1.who); --8<---cut here---end--->8--- I've put the data directory where it can be reproduced here: http://ansel.ydns.eu/~andreas/hash_index_short_read.tar.xz (12MB) regards, Andreas TRAP: FailedAssertion("!(!"short read of block")", File: "md.c", Line: 782) #2 0x007f7f11 in ExceptionalCondition (conditionName=conditionName@entry=0x9a1ae9 "!(!\"short read of block\")", errorType=errorType@entry=0x83db3d "FailedAssertion", fileName=fileName@entry=0x946a9a "md.c", lineNumber=lineNumber@entry=782) at assert.c:54 #3 0x006fb305 in mdread (reln=, forknum=, blocknum=4702, buffer=0x7fe97e7e1280 "\"") at md.c:782 #4 0x006d0ffa in ReadBuffer_common (smgr=0x2af7408, relpersistence=, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=4702, mode=RBM_NORMAL, strategy=, hit=0x7ffde9df11cf "") at bufmgr.c:890 #5 0x006d1a20 in ReadBufferExtended (reln=0x2fd10d8, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=4702, mode=mode@entry=RBM_NORMAL, strategy=strategy@entry=0x0) at bufmgr.c:664 #6 0x006d1b74 in ReadBuffer (blockNum=, reln=) at bufmgr.c:596 #7 ReleaseAndReadBuffer (buffer=buffer@entry=87109984, relation=, blockNum=) at bufmgr.c:1540 #8 0x004c047b in index_fetch_heap (scan=scan@entry=0x5313160) at indexam.c:469 #9 0x004c05ee in index_getnext (scan=scan@entry=0x5313160, direction=direction@entry=ForwardScanDirection) at indexam.c:565 #10 0x005f9b71 in IndexNext (node=node@entry=0x5311c48) at nodeIndexscan.c:105 #11 0x005ec492 in ExecScanFetch (recheckMtd=0x5f9af0 , accessMtd=0x5f9b30 , node=0x5311c48) at execScan.c:95 #12 ExecScan (node=0x5311c48, accessMtd=0x5f9b30 , recheckMtd=0x5f9af0 ) at execScan.c:145 #13 0x005e4da8 in ExecProcNode (node=node@entry=0x5311c48) at execProcnode.c:427 #14 0x006014f9 in ExecNestLoop (node=node@entry=0x53110a8) at nodeNestloop.c:174 #15 
0x005e4cf8 in ExecProcNode (node=node@entry=0x53110a8) at execProcnode.c:476 #16 0x00601436 in ExecNestLoop (node=node@entry=0x5310e00) at nodeNestloop.c:123 #17 0x005e4cf8 in ExecProcNode (node=node@entry=0x5310e00) at execProcnode.c:476 #18 0x00601436 in ExecNestLoop (node=node@entry=0x530f698) at nodeNestloop.c:123 #19 0x005e4cf8 in ExecProcNode (node=node@entry=0x530f698) at execProcnode.c:476 #20 0x005e0e9e in ExecutePlan (dest=0x603a4a8, direction=, numberTuples=0, sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0x530f698, estate=0x46bc008) at execMain.c:1568 #21 standard_ExecutorRun (queryDesc=0x3475168, direction=, count=0) at execMain.c:338 #22 0x007029f8 in PortalRunSelect (portal=portal@entry=0x2561e18, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x603a4a8) at pquery.c:946 #23 0x00703f3e in PortalRun (portal=portal@entry=0x2561e18, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x603a4a8, altdest=altdest@entry=0x603a4a8, completionTag=completionTag@entry=0x7ffde9df18b0 "") at pquery.c:787 #24 0x00700d5b in exec_simple_query (query_string=0x4685258 ) at postgres.c:1094 #25 PostgresMain (argc=, argv=argv@entry=0x256f5a8, dbname=0x256f580 "regression", username=) at postgres.c:4069 #26 0x0046daf2 in BackendRun (port=0x25645a0) at postmaster.c:4274 #27 BackendStartup (port=0x25645a0) at postmaster.c:3946 #28 ServerLoop () at postmaster.c:1704 #29 0x00699d28 in PostmasterMain (argc=argc@entry=4,
Re: [HACKERS] Unlogged tables cleanup
Michael, your greetings were passed on to me with a request that I look at this thread. On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier wrote: >>> More seriously, if there could be more details regarding that, I would >>> think that we could say something like "logging the init fork is >>> mandatory in any case to ensure its on-disk presence when at recovery >>> replay, even on non-default tablespaces whose base location are >>> deleted and re-created from scratch if the WAL record in charge of >>> creating this tablespace gets replayed". The problem shows up because >>> of tablespaces being deleted at replay at the end... So perhaps this >>> makes sense. What do you think? >> >> Yes, that's about what I think we need to explain. > > OK, what do you think about the attached then? > > I came up with something like that for those code paths: > - /* Write the page. If archiving/streaming, XLOG it. */ > + /* > +* Write the page and log it unconditionally. This is important > +* particularly for indexes created on tablespaces and databases > +* whose creation happened after the last redo pointer as recovery > +* removes any of their existing content when the corresponding > +* create records are replayed. > +*/ > I have not been able to use the word "create" less than that. The > patch is really repetitive, but I think that we had better mention the > need of logging the content in each code path and not touch > log_newpage() to keep it a maximum flexible. In blinsert.c, nbtree.c, etc. how about: Write the page and log it. It might seem that an immediate sync would be sufficient to guarantee that the file exists on disk, but recovery itself might remove it while replaying, for example, an XLOG_DBASE_CREATE record. Therefore, we need this even when wal_level=minimal. >> Actually, I'm >> wondering if we ought to just switch this over to using the delayChkpt >> mechanism instead of going through this complicated song-and-dance. 
>> Incurring an immediate sync to avoid having to WAL-log this is a bit >> tenuous, but having to WAL-log it AND do the immediate sync just seems >> silly, and I'm actually a bit worried that even with your fix there >> might still be a latent bug somewhere. We couldn't switch mechanisms >> cleanly in the 9.2 branch (cf. >> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should >> back-patch it in the form you have it in already, but it's probably >> worth switching over at least in master. > > Thinking through it... Could we not just rip off the sync phase > because there is delayChkpt? It seems to me that what counts is that > the commit of the transaction that creates the relation does not get > past the redo point. It would matter for read uncommitted transactions > but that concept does not exist in PG, and never will. On > back-branches it is definitely safer to keep the sync, I am just > thinking about a HEAD-only optimization here as you do. Right (I think). If we set and clear delayChkpt around this work, we don't need the immediate sync. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol
On 12/07/2016 08:39 AM, Michael Paquier wrote: On Tue, Nov 29, 2016 at 1:36 PM, Michael Paquier wrote: Nothing more will likely happen in this CF, so I have moved it to 2017-01 with the same status of "Needs Review". Attached is a new set of patches using the new routines pg_backend_random() and pg_strong_random() to handle the randomness in SCRAM: - 0001 refactors the SHA2 routines. pgcrypto uses raw files from src/common when compiling with this patch. That works on any platform, and this is the simplified version of upthread. - 0002 adds base64 routines to src/common. - 0003 does some refactoring regarding the password encryption in ALTER/CREATE USER queries. - 0004 adds the clause PASSWORD (val USING method) in CREATE/ALTER USER. - 0005 is the code patch for SCRAM. Note that this switches pgcrypto to link to libpgcommon as SHA2 routines are used by the backend. - 0006 adds some regression tests for passwords. - 0007 adds some TAP tests for authentication. This is added to the upcoming CF. I spent a little time reading through this once again. Steady progress, did some small fixes: * Rewrote the nonce generation. In the server-side, it first generated a string of ascii-printable characters, then base64-encoded them, which is superfluous. Also, avoid calling pg_strong_random() one byte at a time, for performance reasons. * Added a more sophisticated fallback implementation in libpq, for the --disable-strong-random cases, similar to pg_backend_random(). * No need to disallow SCRAM with db_user_namespace. It doesn't include the username in the salt like MD5 does. Attached those here, as add-on patches to your latest patch set. I'll continue reviewing, but a couple of things caught my eye that you may want to jump on, in the meanwhile: On error messages, the spec says: o e: This attribute specifies an error that occurred during authentication exchange. 
It is sent by the server in its final message and can help diagnose the reason for the authentication exchange failure. On failed authentication, the entire server-final-message is OPTIONAL; specifically, a server implementation MAY conclude the SASL exchange with a failure without sending the server-final-message. This results in an application-level error response without an extra round-trip. If the server-final-message is sent on authentication failure, then the "e" attribute MUST be included. Note that it says that the server can send the error message with the e= attribute, in the *final message*. It's not a valid response in the earlier state, before sending server-first-message. I think we need to change the INIT state handling in pg_be_scram_exchange() to not send e= messages to the client. On an error at that state, it needs to just bail out without a message. The spec allows that. We can always log the detailed reason in the server log, anyway. As Peter E pointed out earlier, the documentation is lacking on how to configure MD5 and/or SCRAM. If you put "scram" as the authentication method in pg_hba.conf, what does it mean? If you have a line for both "scram" and "md5" in pg_hba.conf, with the same database/user/hostname combo, what does that mean? Answer: The first one takes effect, the second one has no effect. Yet the example in the docs now has that, which is nonsense :-). Hopefully we'll have some kind of a "both" option before the release, but in the meanwhile, we need to describe how this works now in the docs. - Heikki >From 4d3a59ae1cb5742499c71b0c1e048d30dcef6836 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 7 Dec 2016 15:24:55 +0200 Subject: [PATCH 08/11] Rewrite nonce generation. In the server, the nonce was generated using only ASCII-printable characters, and the result was base64-encoded. The base64 encoding is pointless, if we use only ASCII-printable chars to begin with. 
Calling pg_strong_random() can be somewhat expensive, as with the /dev/urandom implementation, it has to open the device, read the bytes, and close, on every call. So avoid calling it in a loop, generating only one byte in each call. I went back to using base64-encoding method of turning the raw bytes into the final nonce. That was more convenient than writing something that encodes to the whole ASCII-printable range. That means that we're not using the whole range of chars allowed in the nonce, but I believe that doesn't make any difference. (Both the frontend and backend will still accept the full range from the other side of the connection). --- src/backend/libpq/auth-scram.c | 52 --- src/include/common/scram-common.h| 6 +++- src/include/libpq/libpq-be.h | 2 -- src/interfaces/libpq/fe-auth-scram.c | 60 ++-- 4 files changed, 34 insertions(+), 86 deletions(-) diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c index 55c5efa..cda663b
Re: [HACKERS] patch: function xmltable
2016-12-07 20:50 GMT+01:00 Alvaro Herrera : > Pavel Stehule wrote: > > 2016-12-07 18:34 GMT+01:00 Alvaro Herrera : > > > > Hmm. Now that I see how this works, by having the GetValue "guess" > what > > > is going on and have a special case for it, I actually don't like it > > > very much. It seems way too magical. I think we should do away with > > > the "if column is NULL" case in GetValue, and instead inject a column > > > during transformTableExpr if columns is NIL. This has implications on > > > ExecInitExpr too, which currently checks for an empty column list -- it > > > would no longer have to do so. > > > > I prefer this way against second described. The implementation should be > in > > table builder routines, not in executor. > > Well, given the way you have implemented it, I would prefer the original > too. But your v23 is not what I meant. Essentially what you do in v23 > is to communicate the lack of COLUMNS clause in a different way -- > previously it was "ncolumns = 0", now it's "is_auto_col=true". It's > still "magic". It's not an improvement. > is_auto_col is used primary for asserting. The table builder has information for decision in parameter path, when path is NULL. Hard to say, if this info should be assigned to column or to table. In both locations has sense. But somewhere should be some flag. > > What I want to happen is that there is no magic at all; it's up to > transformExpr to make sure that when COLUMNS is empty, one column > appears and it must not be a magic column that makes the xml.c code act > differently, but rather to xml.c it should appear that this is just a > normal column that happens to return the entire row. If I say "COLUMNS > foo PATH '/'" I should be able to obtain a similar behavior (except that > in the current code, if I ask for "COLUMNS foo XML PATH '/'" I don't get > XML at all but rather weird text where all tags have been stripped out, > which is very strange. 
I would expect the tags to be preserved if the > output type is XML. Maybe the tag-stripping behavior should occur if > the output type is some type of text.) I am doing that; I am just using NULL for PATH. > I still have to figure out how to fix the tupledesc thing. What we have > now is not good. Could it be moved to nodefunc? Pavel > -- > Álvaro Herrera https://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Wed, Dec 7, 2016 at 3:12 PM, Tom Lane wrote: > Robert Haas writes: >> On Tue, Dec 6, 2016 at 11:54 PM, Tom Lane wrote: >>> Robert Haas writes: On Tue, Dec 6, 2016 at 9:53 PM, Tom Lane wrote: > I think we should give serious consideration to back-patching commit > ecb0d20a9, which changed the default semaphore type to unnamed-POSIX > on Linux. > Urk. That sounds like a scary thing to back-patch. > >>> I don't deny that it's scary, but the alternative seems to be to be >>> rather badly broken on systemd-using distros for years to come. >>> That's pretty scary too. > >> Why can't this be configurable? > > It already is. Note that I said "default". > > As things stand, it's only a configure-time choice, but I've been > thinking that we might be well advised to make it run-time configurable. Sure. A configure-time choice only benefits people who are compiling from source, which as far as production is concerned is almost nobody. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
On Wed, Dec 7, 2016 at 3:13 PM, Tom Lane wrote: > Robert Haas writes: >> And of course that'd be because relying on en_US isn't portable. Sigh. > > You can't rely on *any* collations other than C and POSIX. I get it; I just missed that during review, and then sent that message before I even looked at it carefully, so that you would know I was working on it. I think that it's fixed now; at any rate, the buildfarm seems happy enough. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
On Wed, Dec 7, 2016 at 1:03 PM, Tom Lane wrote: > Still, things have been like this since 8.2 when we implemented multi-row > VALUES, and nobody's noticed up to now. Maybe the right answer is to > change the data structure in HEAD and decree that we won't fix it in back > branches. I don't really like that answer though ... > The merit of "won't back-patch" is that at least you don't punish those who are being lazy (in a good sense) but generating values in subsequent lines that conform to the type specification of the first record. We already implicitly accept such behavior elsewhere - though probably with better validation - so keeping it here is defensible. We dislike changing query plans in back branches and this really isn't that different. The concern, especially since this can propagate to a CREATE TABLE AS, is whether there is some kind of fundamental storage risk being introduced that we do not want to have happen no matter how rare. /me feels deja-vu... David J.
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
Robert Haas writes: > And of course that'd be because relying on en_US isn't portable. Sigh. You can't rely on *any* collations other than C and POSIX. regards, tom lane
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Robert Haas writes: > On Tue, Dec 6, 2016 at 11:54 PM, Tom Lane wrote: >> Robert Haas writes: >>> On Tue, Dec 6, 2016 at 9:53 PM, Tom Lane wrote: I think we should give serious consideration to back-patching commit ecb0d20a9, which changed the default semaphore type to unnamed-POSIX on Linux. >>> Urk. That sounds like a scary thing to back-patch. >> I don't deny that it's scary, but the alternative seems to be >> rather badly broken on systemd-using distros for years to come. >> That's pretty scary too. > Why can't this be configurable? It already is. Note that I said "default". As things stand, it's only a configure-time choice, but I've been thinking that we might be well advised to make it run-time configurable. I do not believe that anyone's still using a Linux version wherein POSIX semas wouldn't work, but I am not convinced that the same is true for FreeBSD. And a configure-run-time test is not a pleasant option because it doesn't work for cross-compiles. So really, on platforms where we think POSIX semas might work, it'd be best to try a sem_init() during postmaster start and then fall back to SysV if it doesn't work. But this is all kind of moot if Peter is right that systemd will zap POSIX shmem along with SysV semaphores. I've been trying to reproduce the issue on a Fedora 25 installation, and so far I can't get it to zap anything, so I'm a bit at a loss how to prove things one way or the other. regards, tom lane
Re: [HACKERS] pg_dump vs. TRANSFORMs
All, * Stephen Frost (sfr...@snowman.net) wrote: > While testing pg_dump, I discovered that there seems to be an issue when > it comes to TRANSFORMs. [...] As pointed out by Peter E, this also impacts CASTs. Attached is a patch which addresses both by simply also pulling any functions which are referenced from pg_cast or pg_transform when they have OIDs at or after FirstNormalObjectId. I also modified dumpCast() and dumpTransform() to complain loudly if they're unable to dump out the cast or transform due to not finding the function definition(s) necessary. This'll need to be back-patched to 9.5 for the pg_transform look up and all the way for pg_cast, though I don't expect that to be too difficult. We don't do anything else with FirstNormalObjectId in SQL code in pg_dump, though we obviously use it all over the place in the actual code based on the OIDs returned from the database. Still, does anyone see an issue with using it in a query? Without it, we end grabbing the info for 100+ or so functions in a default install that we don't need, which isn't horrible, but there were concerns raised before about pg_dump performance for very small databases. This also adds in regression tests to pg_dump for casts and transforms and the pg_upgrade testing with the regression database will now actually test the dump/restore of transforms (which it didn't previously because the transforms weren't dumped). Thoughts? Thanks! Stephen From 3eaa8d32170b69e58b5780c0aa92b05d10869389 Mon Sep 17 00:00:00 2001 From: Stephen Frost Date: Wed, 7 Dec 2016 14:59:44 -0500 Subject: [PATCH] Fix dumping of casts and transforms using built-in functions In pg_dump.c dumpCast() and dumpTransform(), we would happily ignore the cast or transform if it happened to use a built-in function because we weren't including the information about built-in functions when querying pg_proc from getFuncs(). 
Modify the query in getFuncs() to also gather information about functions which are used by user-defined casts and transforms (where "user-defined" means "has an OID >= FirstNormalObjectId"). This also adds to the TAP regression tests for 9.6 and master to cover these types of objects.

Back-patch all the way for casts, back to 9.5 for transforms.

Discussion: https://www.postgresql.org/message-id/flat/20160504183952.GE10850%40tamriel.snowman.net
---
 src/bin/pg_dump/pg_dump.c        | 33 +---
 src/bin/pg_dump/t/002_pg_dump.pl | 85 +---
 2 files changed, 80 insertions(+), 38 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 42873bb..530500c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4721,12 +4721,21 @@ getFuncs(Archive *fout, int *numFuncs)
 					  "\n  AND ("
 					  "\n  pronamespace != "
 					  "(SELECT oid FROM pg_namespace "
-					  "WHERE nspname = 'pg_catalog')",
+					  "WHERE nspname = 'pg_catalog')"
+					  "\n  OR EXISTS (SELECT 1 FROM pg_cast"
+					  "\n  WHERE pg_cast.oid >= %u "
+					  "\n  AND p.oid = pg_cast.castfunc)"
+					  "\n  OR EXISTS (SELECT 1 FROM pg_transform"
+					  "\n  WHERE pg_transform.oid >= %u AND "
+					  "\n  (p.oid = pg_transform.trffromsql"
+					  "\n  OR p.oid = pg_transform.trftosql))",
 					  acl_subquery->data,
 					  racl_subquery->data,
 					  initacl_subquery->data,
 					  initracl_subquery->data,
-					  username_subquery);
+					  username_subquery,
+					  FirstNormalObjectId,
+					  FirstNormalObjectId);
 
 	if (dopt->binary_upgrade)
 		appendPQExpBufferStr(query,
 							 "\n  OR EXISTS(SELECT 1 FROM pg_depend WHERE "
@@ -4764,7 +4773,16 @@ getFuncs(Archive *fout, int *numFuncs)
 					  "\n  AND ("
 					  "\n  pronamespace != "
 					  "(SELECT oid FROM pg_namespace "
-					  "WHERE nspname = 'pg_catalog')");
+					  "WHERE nspname = 'pg_catalog')"
+					  "\n  OR EXISTS (SELECT 1 FROM pg_cast"
+					  "\n  WHERE p.oid = pg_cast.castfunc)");
+
+	if (fout->remoteVersion >= 90500)
+		appendPQExpBufferStr(query,
+							 "\n  OR EXISTS (SELECT 1 FROM pg_transform"
+							 "\n  WHERE p.oid = pg_transform.trffromsql"
+							 "\n  OR p.oid = pg_transform.trftosql)");
+
 	if (dopt->binary_upgrade && fout->remoteVersion >= 90100)
 		appendPQExpBufferStr(query,
 							 "\n  OR EXISTS(SELECT 1 FROM pg_depend WHERE "
@@ -10828,7 +10846,8 @@ dumpCast(Archive *fout, CastInfo *cast)
 	{
 		funcInfo = findFuncByOid(cast->castfunc);
 		if (funcInfo == NULL)
-			return;
+			exit_horribly(NULL, "unable to find function definition for OID %u",
+						  cast->castfunc);
 	}
 
 	/*
@@ -10937,13 +10956,15 @@ dumpTransform(Archive *fout, TransformInfo *transform)
 	{
 		fromsqlFuncInfo = findFuncByOid(transform->trffromsql);
 		if (fromsqlFuncInfo == NULL)
-			return;
+			exit_horribly(NULL, "unable to find function definition for OID %u",
+						  transform->trffromsql);
 	}
 
 	if (OidIsValid(transform->trftosql))
 	{
 		tosqlFuncInfo = findFu
Re: [HACKERS] [GENERAL] Select works only when connected from login postgres
Joseph Brenner writes: > I thought I'd reproduced the behavior in an xterm, but I was just > trying again and I don't see it. It does seem that the dumbness of my > dumb terminal is a factor. Evidently. > If I understand the way this works, it could be an even more baffling > behavior if I were using an xterm: with a blank PAGER your output > would disappear only if the select exceeded a certain number of > lines... Yeah, that was exactly the behavior I was seeing before fixing it (the fix is pushed btw). regards, tom lane
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Kyotaro HORIGUCHI writes: > At Mon, 5 Dec 2016 21:54:08 -0700, "David G. Johnston" > wrote in > >> If every row passes you can retain the typemod. That is arguably the >> perfect solution. The concern is that "scan every row" could be very >> expensive - though in writing this I'm thinking that you'd quickly find a > I found that it already scans the whole list. select_common_type > does that. And it returns the type that is found to be applicable > on all values. Handling typmod there has a small > impact. (Though, I don't assert that we should do this.) The problem is that there is not where the impact is. It would be really easy for transformValuesClause to check for a common typmod while it's transforming the construct, but it has noplace to put the information. The place where the info is needed is in expandRTE and get_rte_attribute_type (which is called by make_var). So basically, parsing of each Var reference to a VALUES subquery would potentially have to scan all rows of the VALUES list to identify the right typmod to give to the Var. The repetitiveness of that is what's bothering me. In HEAD, we could change the RTE data structure so that transformValuesClause could save the typmod information in the RTE, keeping the lookups cheap. But that option is not available to us in released branches. On the other hand, we might be worrying over nothing. I think that in typical use, expandRTE() will be done just once per query, and it could amortize the cost by considering all the rows in one scan. get_rte_attribute_type would be a problem except that I believe it's rarely used for VALUES RTEs --- our code coverage report shows that that switch branch isn't reached at all in the standard regression tests. (The reason for the above statements is that we convert a bare VALUES to SELECT * FROM VALUES, and the * is expanded by expandRTE, not by retail applications of make_var.) > Ok, I think all of the #1 to #5 are the change of behavior in > this criteria. 
What we should resolve here is the inconsistent > table generated by create table as select from (values... Well, that's one symptom, but not the only one; consider

# select x::varchar(4) from (values ('z'::varchar(4)), ('0123456789')) v(x);
     x
------------
 z
 0123456789
(2 rows)

The Var representing v.x is (mis)labeled as already having the right typmod for varchar(4), so transformTypeCast concludes that no run-time work is needed, and you get clearly-wrong answers out. Still, things have been like this since 8.2 when we implemented multi-row VALUES, and nobody's noticed up to now. Maybe the right answer is to change the data structure in HEAD and decree that we won't fix it in back branches. I don't really like that answer though ... regards, tom lane
Re: [HACKERS] patch: function xmltable
Pavel Stehule wrote: > 2016-12-07 18:34 GMT+01:00 Alvaro Herrera : > > Hmm. Now that I see how this works, by having the GetValue "guess" what > > is going on and have a special case for it, I actually don't like it > > very much. It seems way too magical. I think we should do away with > > the "if column is NULL" case in GetValue, and instead inject a column > > during transformTableExpr if columns is NIL. This has implications on > > ExecInitExpr too, which currently checks for an empty column list -- it > > would no longer have to do so. > > I prefer this way against second described. The implementation should be in > table builder routines, not in executor. Well, given the way you have implemented it, I would prefer the original too. But your v23 is not what I meant. Essentially what you do in v23 is to communicate the lack of COLUMNS clause in a different way -- previously it was "ncolumns = 0", now it's "is_auto_col=true". It's still "magic". It's not an improvement. What I want to happen is that there is no magic at all; it's up to transformExpr to make sure that when COLUMNS is empty, one column appears and it must not be a magic column that makes the xml.c code act differently, but rather to xml.c it should appear that this is just a normal column that happens to return the entire row. If I say "COLUMNS foo PATH '/'" I should be able to obtain a similar behavior (except that in the current code, if I ask for "COLUMNS foo XML PATH '/'" I don't get XML at all but rather weird text where all tags have been stripped out, which is very strange. I would expect the tags to be preserved if the output type is XML. Maybe the tag-stripping behavior should occur if the output type is some type of text.) I still have to figure out how to fix the tupledesc thing. What we have now is not good. 
-- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] [GENERAL] Select works only when connected from login postgres
Yes, I have a tendency to use emacs sub-shells (and occasionally M-x sql-postgres)-- I thought I'd reproduced the behavior in an xterm, but I was just trying again and I don't see it. It does seem that the dumbness of my dumb terminal is a factor. If I understand the way this works, it could be an even more baffling behavior if I were using an xterm: with a blank PAGER your output would disappear only if the select exceeded a certain number of lines... On Wed, Dec 7, 2016 at 2:31 AM, Daniel Verite wrote: > Tom Lane wrote: > >> BTW, I realized while testing this that there's still one gap in our >> understanding of what went wrong for you: cases like "SELECT 'hello'" >> should not have tried to use the pager, because that would've produced >> less than a screenful of data > > At some point emacs was mentioned as the terminal: > >>> And I guess I did that intentionally, my .bashrc has >>> >>> # I use emacs shells, I got a "pager" already: >>> export PAGER='' > > The M-x shell mode of emacs has a so-called "dumb" terminal > emulation (that's the value of $TERM) where the notion of a "page" > doesn't quite apply. > > For instance, when using emacs 24.3 with my default pager on an > Ubuntu desktop, this is what I get: > > test=> select 1; > WARNING: terminal is not fully functional > - (press RETURN) > ?column? > -- > 1 > (1 row) > > I suspect that psql is unable to determine the screen size > of the "dumb" terminal, and that it's the fault of the terminal > rather than psql. > The warning is displayed by "less" AFAICS. > > There are other psql features like tab-completion that don't work > in this mode because emacs interpret keystrokes first for > itself, in effect mixing emacs functionalities with these of the > application run in the terminal. 
It's awesome sometimes > and irritating at other times depending on what you expect :) > > OTOH it has also a M-x term command/mode that provides a > more sophisticated screen emulation into which paging seems > to work exactly like in a normal terminal and the emacs key bindings > are turned off. > > > Best regards, > -- > Daniel Vérité > PostgreSQL-powered mailer: http://www.manitou-mail.org > Twitter: @DanielVerite
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Tue, Dec 6, 2016 at 11:54 PM, Tom Lane wrote: > Robert Haas writes: >> On Tue, Dec 6, 2016 at 9:53 PM, Tom Lane wrote: >>> I think we should give serious consideration to back-patching commit >>> ecb0d20a9, which changed the default semaphore type to unnamed-POSIX >>> on Linux. > >> Urk. That sounds like a scary thing to back-patch. > > I don't deny that it's scary, but the alternative seems to be > rather badly broken on systemd-using distros for years to come. > That's pretty scary too. Why can't this be configurable? >> ... Granted, that might not >> happen, because maybe unnamed POSIX semas are one of those really >> awesome operating system primitives that never have problems on any >> system anywhere ever. But I think it's pretty hard to be certain of >> that. > > You're attacking a straw man. I didn't propose changing our behavior > anywhere but Linux. AFAIK, on that platform unnamed POSIX semaphores > are futexes, which have been a stable feature since 2003 according to > https://en.wikipedia.org/wiki/Futex#History. Anybody who did need > to compile PG for use with a pre-2.6 kernel could override the default, > anyway. Changing the behavior even just on Linux leaves plenty of room for failure, even if the feature itself has been stable. For example, there are Linux machines where POSIX shared memory doesn't work, even though POSIX shared memory is in general a supported feature on Linux and has been for a long time. So, if we were to change from System V shared memory to POSIX shared memory in a minor release, anyone in that situation would break. It's hard to be sure the same thing wouldn't happen in this case. The fact that the feature's stable doesn't prove that it works on every system in every configuration.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
On Wed, Dec 7, 2016 at 1:30 PM, Robert Haas wrote: > -- partitioned table cannot partiticipate in regular inheritance > CREATE TABLE partitioned2 ( > a int > --- 392,411 > c text, > d text > ) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d > collate "en_US"); > + ERROR: collation "en_US" for encoding "SQL_ASCII" does not exist ... > No idea why yet, but I'll try to figure it out. And of course that'd be because relying on en_US isn't portable. Sigh. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
On Wed, Dec 7, 2016 at 1:20 PM, Robert Haas wrote: > Implement table partitioning. Well, that didn't take long to cause problems. The very first buildfarm machine to report after this commit is longfin, which is unhappy: *** *** 392,419 c text, d text ) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d collate "en_US"); -- check relkind SELECT relkind FROM pg_class WHERE relname = 'partitioned'; relkind - ! P ! (1 row) -- check that range partition key columns are marked NOT NULL SELECT attname, attnotnull FROM pg_attribute WHERE attrelid = 'partitioned'::regclass AND attnum > 0; ! attname | attnotnull ! -+ ! a | t ! b | f ! c | t ! d | t ! (4 rows) ! -- prevent a function referenced in partition key from being dropped DROP FUNCTION plusone(int); - ERROR: cannot drop function plusone(integer) because other objects depend on it - DETAIL: table partitioned depends on function plusone(integer) - HINT: Use DROP ... CASCADE to drop the dependent objects too. -- partitioned table cannot partiticipate in regular inheritance CREATE TABLE partitioned2 ( a int --- 392,411 c text, d text ) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d collate "en_US"); + ERROR: collation "en_US" for encoding "SQL_ASCII" does not exist -- check relkind SELECT relkind FROM pg_class WHERE relname = 'partitioned'; relkind - ! (0 rows) -- check that range partition key columns are marked NOT NULL SELECT attname, attnotnull FROM pg_attribute WHERE attrelid = 'partitioned'::regclass AND attnum > 0; ! ERROR: relation "partitioned" does not exist ! LINE 1: ...me, attnotnull FROM pg_attribute WHERE attrelid = 'partition... ! ^ -- prevent a function referenced in partition key from being dropped DROP FUNCTION plusone(int); -- partitioned table cannot partiticipate in regular inheritance CREATE TABLE partitioned2 ( a int No idea why yet, but I'll try to figure it out. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers wrote: >> My bad. The fix I sent last night for one of the cache flush issues >> wasn't quite right. The attached seems to fix it. > Yes, fixed here too. Thanks. Thanks for the report - that was a good catch. I've committed 0001 - 0006 with that correction and a few other adjustments. There's plenty of room for improvement here, and almost certainly some straight-up bugs too, but I think we're at a point where it will be easier and less error-prone to commit follow-on changes incrementally rather than by continuously re-reviewing a very large patch set for increasingly smaller changes. Some notes:

* We should try to teach the executor never to scan the parent. That's never necessary with this system, and it might add significant overhead. We should also try to get rid of the idea of the parent having storage (i.e. a relfilenode).

* The fact that, in some cases, locking requirements for partitioning are stronger than those for inheritance is not good. We made those decisions for good reasons -- namely, data integrity and not crashing the server -- but it would certainly be good to revisit those things and see whether there's any room for improvement.

* I didn't commit 0007, which updates the documentation for this new feature. That patch removes more lines than it adds, and I suspect what is needed here is an expansion of the documentation rather than a diminishing of it.

* The fact that there's no implementation of row movement should be documented as a limitation. We should also look at removing that limitation.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] patch: function xmltable
Pavel Stehule wrote: > I fixed two issues. > > 2. there was a reversed setting of the NOT NULL flag Ah-hah, that was silly, thanks. > 1. there was no column data when there was no explicit column - fixed Hmm. Now that I see how this works, by having the GetValue "guess" what is going on and have a special case for it, I actually don't like it very much. It seems way too magical. I think we should do away with the "if column is NULL" case in GetValue, and instead inject a column during transformTableExpr if columns is NIL. This has implications on ExecInitExpr too, which currently checks for an empty column list -- it would no longer have to do so. Maybe this means we need an additional method, which would request "the expr that returns the whole row", so that transformExpr can work for XmlTable (which I think would be something like "./") and the future JsonTable stuff (I don't know how that one would work, but I assume it's not necessarily the same thing). -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Declarative partitioning - another take
On 2016-12-07 17:38, Robert Haas wrote: On Wed, Dec 7, 2016 at 11:34 AM, Amit Langote wrote: begin; create schema if not exists s; create table s.t (c text, d text, id serial) partition by list ((ascii(substring(coalesce(c, d, ''), 1, 1)))); create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); it logs as follows: 2016-12-07 17:03:45.787 CET 6125 LOG: server process (PID 11503) was terminated by signal 11: Segmentation fault 2016-12-07 17:03:45.787 CET 6125 DETAIL: Failed process was running: create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); Hm, will look into this in a few hours. My bad. The fix I sent last night for one of the cache flush issues wasn't quite right. The attached seems to fix it. Yes, fixed here too. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On Wed, Dec 7, 2016 at 11:34 AM, Amit Langote wrote: >> begin; >> create schema if not exists s; >> create table s.t (c text, d text, id serial) partition by list >> ((ascii(substring(coalesce(c, d, ''), 1, 1)))); >> create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); >> >> it logs as follows: >> >> 2016-12-07 17:03:45.787 CET 6125 LOG: server process (PID 11503) was >> terminated by signal 11: Segmentation fault >> 2016-12-07 17:03:45.787 CET 6125 DETAIL: Failed process was running: create >> table s.t_part_ascii_065 partition of s.t for values in ( 65 ); > > Hm, will look into this in a few hours. My bad. The fix I sent last night for one of the cache flush issues wasn't quite right. The attached seems to fix it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ae77d15..a230b20 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2516,6 +2516,7 @@ RelationClearRelation(Relation relation, bool rebuild)
 		bool		keep_tupdesc;
 		bool		keep_rules;
 		bool		keep_policies;
+		bool		keep_partkey;
 		bool		keep_partdesc;
 
 		/* Build temporary entry, but don't link it into hashtable */
@@ -2547,6 +2548,7 @@ RelationClearRelation(Relation relation, bool rebuild)
 		keep_tupdesc = equalTupleDescs(relation->rd_att, newrel->rd_att);
 		keep_rules = equalRuleLocks(relation->rd_rules, newrel->rd_rules);
 		keep_policies = equalRSDesc(relation->rd_rsdesc, newrel->rd_rsdesc);
+		keep_partkey = (relation->rd_partkey != NULL);
 		keep_partdesc = equalPartitionDescs(relation->rd_partkey,
 											relation->rd_partdesc,
 											newrel->rd_partdesc);
@@ -2604,9 +2606,12 @@ RelationClearRelation(Relation relation, bool rebuild)
 		SWAPFIELD(Oid, rd_toastoid);
 		/* pgstat_info must be preserved */
 		SWAPFIELD(struct PgStat_TableStatus *, pgstat_info);
-		/* partition key must be preserved */
-		SWAPFIELD(PartitionKey, rd_partkey);
-		SWAPFIELD(MemoryContext, rd_partkeycxt);
+		/* partition key must be preserved, if we have one */
+		if (keep_partkey)
+		{
+			SWAPFIELD(PartitionKey, rd_partkey);
+			SWAPFIELD(MemoryContext, rd_partkeycxt);
+		}
 
 		/* preserve old partdesc if no logical change */
 		if (keep_partdesc)
 		{
Re: [HACKERS] Declarative partitioning - another take
Hi Erik, On Thu, Dec 8, 2016 at 1:19 AM, Erik Rijkers wrote: > On 2016-12-07 12:42, Amit Langote wrote: > >> 0001-Catalog-and-DDL-for-partitioned-tables-20.patch >> 0002-psql-and-pg_dump-support-for-partitioned-tables-20.patch >> 0003-Catalog-and-DDL-for-partitions-20.patch >> 0004-psql-and-pg_dump-support-for-partitions-20.patch >> 0005-Teach-a-few-places-to-use-partition-check-quals-20.patch >> 0006-Tuple-routing-for-partitioned-tables-20.patch >> 0007-Update-DDL-Partitioning-chapter-to-reflect-new-devel-20.patch > > Patches apply, compile, check OK. Thanks! > But this yields a segfault: > > begin; > create schema if not exists s; > create table s.t (c text, d text, id serial) partition by list > ((ascii(substring(coalesce(c, d, ''), 1, 1)))); > create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); > > it logs as follows: > > 2016-12-07 17:03:45.787 CET 6125 LOG: server process (PID 11503) was > terminated by signal 11: Segmentation fault > 2016-12-07 17:03:45.787 CET 6125 DETAIL: Failed process was running: create > table s.t_part_ascii_065 partition of s.t for values in ( 65 ); Hm, will look into this in a few hours. Thanks, Amit
Re: [HACKERS] pg_background contrib module proposal
This is simply wonderful! Finally, I can implement my favorite sleepsort in postgres:

create table input as select round(random()*20) x from generate_series(1,5,1);
create table output(place int, value int);
create sequence s start 1;
create table handles as select pg_background_launch('select pg_sleep('||x||'); insert into output values (nextval(''s''),'||x||');') h from input;
select (select * from pg_background_result(h) as (x text) limit 1) from handles;
select * from output;

Works like a charm. It has perfect running time O(1) in n, the number of items to sort, but it cannot sort more than max_worker_processes-1 items. The next step in user concurrency mechanics is future indexes (indexes under construction are taken into plans, and the query is executed when the index is actually constructed) and promised tables (a query waits until there is actually data in the table). I like the idea and implementation and I'm planning to post a review by the end of December. Right now I have just one useful observation: I do not see a comfortable way to check the state of a task (finished/failed) without the possibility of hanging for a long time or catching fire. Best regards, Andrey Borodin.
Re: [HACKERS] Declarative partitioning - another take
On 2016-12-07 12:42, Amit Langote wrote: 0001-Catalog-and-DDL-for-partitioned-tables-20.patch 0002-psql-and-pg_dump-support-for-partitioned-tables-20.patch 0003-Catalog-and-DDL-for-partitions-20.patch 0004-psql-and-pg_dump-support-for-partitions-20.patch 0005-Teach-a-few-places-to-use-partition-check-quals-20.patch 0006-Tuple-routing-for-partitioned-tables-20.patch 0007-Update-DDL-Partitioning-chapter-to-reflect-new-devel-20.patch Patches apply, compile, check OK. But this yields a segfault: begin; create schema if not exists s; create table s.t (c text, d text, id serial) partition by list ((ascii(substring(coalesce(c, d, ''), 1, 1)))); create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); it logs as follows: 2016-12-07 17:03:45.787 CET 6125 LOG: server process (PID 11503) was terminated by signal 11: Segmentation fault 2016-12-07 17:03:45.787 CET 6125 DETAIL: Failed process was running: create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); 2016-12-07 17:03:45.787 CET 6125 LOG: terminating any other active server processes 2016-12-07 17:03:45.791 CET 6125 LOG: all server processes terminated; reinitializing 2016-12-07 17:03:45.999 CET 11655 LOG: database system was interrupted; last known up at 2016-12-07 17:00:38 CET 2016-12-07 17:03:48.040 CET 11655 LOG: database system was not properly shut down; automatic recovery in progress 2016-12-07 17:03:48.156 CET 11655 LOG: redo starts at 0/2897988 2016-12-07 17:03:48.172 CET 11655 LOG: invalid magic number in log segment 00010002, offset 9207808 2016-12-07 17:03:48.172 CET 11655 LOG: redo done at 0/28C72C0 2016-12-07 17:03:48.172 CET 11655 LOG: last completed transaction was at log time 2016-12-07 17:01:29.580562+01 2016-12-07 17:03:49.534 CET 11655 LOG: MultiXact member wraparound protections are now enabled 2016-12-07 17:03:49.622 CET 6125 LOG: database system is ready to accept connections Thanks, Erik Rijkers
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
Stephen Frost writes: > It would be really nice if we would detect that some other postmaster is > already using a given tablespace directory and to throw an error and > complain rather than starting up thinking everything is fine. In principle, we could have the postmaster run through $PGDATA/pg_tblspc and drop a lockfile into each referenced directory. But the devil is in the details --- in particular, not sure how to get the right thing to happen during a CREATE TABLESPACE. Also, I kinda doubt that this is going to fix anything for the replica-on-same-machine problem. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
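The lockfile idea Tom sketches is the classic pidfile dance: atomically create a marker inside each referenced tablespace directory and refuse to start if another live postmaster already claimed it. A rough Python model of that check (the lockfile name and layout are invented for illustration; this is not what the postmaster actually does, and a real version would also handle stale locks from dead processes):

```python
import os
import tempfile

LOCKNAME = "pg_tblspc.lock"  # hypothetical marker file name

def claim_tablespace(tblspc_dir, pid):
    """Try to claim a tablespace directory for exactly one postmaster.

    O_CREAT|O_EXCL makes creation atomic: a second postmaster pointing
    at the same directory sees the existing lockfile and bails out,
    instead of silently sharing the tablespace with the first one.
    """
    path = os.path.join(tblspc_dir, LOCKNAME)
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        owner = open(path).read().strip()
        raise RuntimeError(
            "tablespace %r already in use by postmaster %s" % (tblspc_dir, owner))
    with os.fdopen(fd, "w") as f:
        f.write(str(pid))

d = tempfile.mkdtemp()
claim_tablespace(d, 1234)      # first postmaster wins
try:
    claim_tablespace(d, 5678)  # second one must fail
except RuntimeError as e:
    print(e)
```

As Tom notes, the hard part in the real system is wiring this into CREATE TABLESPACE, and it would not help the replica-on-same-machine case by itself.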
Re: [HACKERS] [PROPOSAL] Temporal query processing with range types
Am 05.12.2016 um 06:11 schrieb Haribabu Kommi: On Tue, Oct 25, 2016 at 8:44 PM, Peter Moser mailto:pitiz...@gmail.com>> wrote: We decided to follow your recommendation and add the patch to the commitfest. Path is not applying properly to HEAD. Moved to next CF with "waiting on author" status. We updated our patch. We tested it with the latest commit dfe530a09226a9de80f2b4c3d5f667bf51481c49. Regards, Hari Babu Fujitsu Australia Best regards, Anton, Johann, Michael, Peter diff --git src/backend/commands/explain.c src/backend/commands/explain.c index 0a669d9..09406d4 100644 --- src/backend/commands/explain.c +++ src/backend/commands/explain.c @@ -875,6 +875,12 @@ ExplainNode(PlanState *planstate, List *ancestors, case T_SeqScan: pname = sname = "Seq Scan"; break; + case T_TemporalAdjustment: + if(((TemporalAdjustment *) plan)->temporalCl->temporalType == TEMPORAL_TYPE_ALIGNER) +pname = sname = "Adjustment(for ALIGN)"; + else +pname = sname = "Adjustment(for NORMALIZE)"; + break; case T_SampleScan: pname = sname = "Sample Scan"; break; diff --git src/backend/executor/Makefile src/backend/executor/Makefile index 51edd4c..42801d3 100644 --- src/backend/executor/Makefile +++ src/backend/executor/Makefile @@ -25,6 +25,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \ nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \ nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \ nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \ - nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o + nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \ + nodeTemporalAdjustment.o + include $(top_srcdir)/src/backend/common.mk diff --git src/backend/executor/execProcnode.c src/backend/executor/execProcnode.c index 554244f..610d753 100644 --- src/backend/executor/execProcnode.c +++ src/backend/executor/execProcnode.c @@ -114,6 +114,7 @@ #include "executor/nodeValuesscan.h" #include "executor/nodeWindowAgg.h" #include 
"executor/nodeWorktablescan.h" +#include "executor/nodeTemporalAdjustment.h" #include "nodes/nodeFuncs.h" #include "miscadmin.h" @@ -334,6 +335,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags) estate, eflags); break; + case T_TemporalAdjustment: + result = (PlanState *) ExecInitTemporalAdjustment((TemporalAdjustment *) node, + estate, eflags); + break; + default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node)); result = NULL; /* keep compiler quiet */ @@ -531,6 +537,10 @@ ExecProcNode(PlanState *node) result = ExecLimit((LimitState *) node); break; + case T_TemporalAdjustmentState: + result = ExecTemporalAdjustment((TemporalAdjustmentState *) node); + break; + default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node)); result = NULL; @@ -779,6 +789,10 @@ ExecEndNode(PlanState *node) ExecEndLimit((LimitState *) node); break; + case T_TemporalAdjustmentState: + ExecEndTemporalAdjustment((TemporalAdjustmentState *) node); + break; + default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node)); break; @@ -812,3 +826,4 @@ ExecShutdownNode(PlanState *node) return planstate_tree_walker(node, ExecShutdownNode, NULL); } + diff --git src/backend/executor/nodeTemporalAdjustment.c src/backend/executor/nodeTemporalAdjustment.c new file mode 100644 index 000..e45ec03 --- /dev/null +++ src/backend/executor/nodeTemporalAdjustment.c @@ -0,0 +1,528 @@ +#include "postgres.h" +#include "executor/executor.h" +#include "executor/nodeTemporalAdjustment.h" +#include "utils/memutils.h" +#include "access/htup_details.h"/* for heap_getattr */ +#include "utils/lsyscache.h" +#include "nodes/print.h" /* for print_slot */ +#include "utils/datum.h" /* for datumCopy */ + +/* + * #define TEMPORAL_DEBUG + * XXX PEMOSER Maybe we should use execdebug.h stuff here? 
+ */ +#ifdef TEMPORAL_DEBUG +static char* +datumToString(Oid typeinfo, Datum attr) +{ + Oid typoutput; + bool typisvarlena; + getTypeOutputInfo(typeinfo, &typoutput, &typisvarlena); + return OidOutputFunctionCall(typoutput, attr); +} + +#define TPGdebug(...) { printf(__VA_ARGS__); printf("\n"); fflush(stdout); } +#define TPGdebugDatum(attr, typeinfo) TPGdebug("%s = %s %ld\n", #attr, datumToString(typeinfo, attr), attr) +#define TPGdebugSlot(slot) { printf("Printing Slot '%s'\n", #slot); print_slot(slot); fflush(stdout); } + +#else +#define datumToString(typeinfo, attr) +#define TPGdebug(...) +#define TPGdebugDatum(attr, typeinfo) +#define TPGdebugSlot(slot) +#endif + +/* + * isLessThan + * We must check if the sweepline is before a timepoint, or if a timepoint + * is smaller than another. We initialize the function call info during + * ExecInit phase. + */ +static bool +isLessThan(Datum a, Datum b, TemporalAdjustmentState* node
Re: [HACKERS] Declarative partitioning - another take
On Wed, Dec 7, 2016 at 6:42 AM, Amit Langote wrote: > On 2016/12/07 13:38, Robert Haas wrote: >> On Wed, Nov 30, 2016 at 10:56 AM, Amit Langote >> wrote: >>> The latest patch I posted earlier today has this implementation. >> >> I decided to try out these patches today with #define >> CLOBBER_CACHE_ALWAYS 1 in pg_config_manual.h, which found a couple of >> problems: >> >> 1. RelationClearRelation() wasn't preserving the rd_partkey, even >> though there's plenty of code that relies on it not changing while we >> hold a lock on the relation - in particular, transformPartitionBound. > > Oh, I thought an AccessExclusiveLock on the relation would prevent having > to worry about that, but guess I'm wrong. Perhaps, having a lock on a > table does not preclude RelationClearRelation() resetting the table's > relcache. No, it sure doesn't. The lock prevents the table from actually being changed, so a reload will find data equivalent to what it had before, but it doesn't prevent the backend's cache from being flushed. >> 2. partition_bounds_equal() was using the comparator and collation for >> partitioning column 0 to compare the datums for all partitioning >> columns. It's amazing this passed the regression tests. > > Oops, it seems that the regression tests where the above code might be > exercised consisted only of range partition key with columns all of the > same type: create table test(a int, b int) partition by range (a, (a+b)); It doesn't seem like it; you had this: create table part1 partition of range_parted for values from ('a', 1) to ('a', 10); >> I recommend that once you fix this, you run 'make check' with #define >> CLOBBER_CACHE_ALWAYS 1 and look for other hazards. Such mistakes are >> easy to make with this kind of patch. > > With the attached latest version of the patches, I couldn't see any > failures with a CLOBBER_CACHE_ALWAYS build. Cool. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
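The partition_bounds_equal() bug Robert describes is easy to reproduce in miniature: if every column is compared with column 0's comparator and collation, two bounds can come out "equal" as soon as the columns need different comparison semantics. A hedged Python sketch of the wrong and right shapes (the comparators are stand-ins, not PostgreSQL's btree support functions):

```python
# Stand-in per-column equality functions: column 0 uses a
# case-insensitive "collation", column 1 a case-sensitive one.
comparators = [
    lambda a, b: a.lower() == b.lower(),  # column 0
    lambda a, b: a == b,                  # column 1
]

def bounds_equal_buggy(b1, b2):
    # The bug: reuse column 0's comparator for every column.
    return all(comparators[0](x, y) for x, y in zip(b1, b2))

def bounds_equal_fixed(b1, b2):
    # The fix: compare each datum with its own column's comparator.
    return all(cmp(x, y) for cmp, (x, y) in zip(comparators, zip(b1, b2)))

b1 = ("a", "B")
b2 = ("A", "b")
print(bounds_equal_buggy(b1, b2))  # True  -- wrongly declared "equal"
print(bounds_equal_fixed(b1, b2))  # False -- column 1 actually differs
```

This also shows why the regression tests missed it: with all partition-key columns of the same type, the buggy and fixed versions behave identically.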
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
Michael, all, * Michael Paquier (michael.paqu...@gmail.com) wrote: > On Wed, Dec 07, 2016 at 03:42:53PM +0300, Aleksander Alekseev wrote: > > > In the same host, primary and standby will try to use the tablespace > > > in the same path. That's the origin of this breakage. > > > > Sorry, I don't follow. Don't master and replica use different > > directories to store _all_ data? Particularly in my case: > > > > ``` > > $ find path/to/postgresql-install/ -type d -name pg_tblspc > > /home/eax/work/postgrespro/postgresql-install/data-slave/pg_tblspc > > /home/eax/work/postgrespro/postgresql-install/data-master/pg_tblspc > > ``` > > > > Where exactly a collision happens? > > At the location of the tablespaces, pg_tblspc just stores symlinks to > the place data is stored, and both point to the same path, the same path > being streamed to the standby when replaying the create tablespace record. It would be really nice if we would detect that some other postmaster is already using a given tablespace directory and throw an error and complain rather than starting up thinking everything is fine. We do that already for $PGDATA, of course, but not tablespaces. Thanks! Stephen
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Peter Eisentraut writes: > On 12/6/16 9:53 PM, Tom Lane wrote: >> I think we should give serious consideration to back-patching commit >> ecb0d20a9, which changed the default semaphore type to unnamed-POSIX >> on Linux. > Even with that change, dynamic shared memory is still vulnerable to be > removed. Really? I thought we concluded that it is safe because it is detectably attached to running processes. The trouble with SysV semaphores is that they lack any such attachment, so systemd is left to guess whether they are still in use. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
All, * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote: > On 12/6/16 9:53 PM, Tom Lane wrote: > > I think we should give serious consideration to back-patching commit > > ecb0d20a9, which changed the default semaphore type to unnamed-POSIX > > on Linux. > > Even with that change, dynamic shared memory is still vulnerable to be > removed. So backpatching the semaphore change wouldn't achieve any new > level of safety for users so that we could tell them, "you're good now". I tend to agree with Peter, Magnus, and Craig on this. If you aren't running PG as a system user on a system where systemd feels happy to kill processes and remove shared memory segments and semaphores when you log out, no amount of back-patching of anything is going to make you 'safe'. As noted in the thread referenced, users who are trying to (mistakenly) do this are already having to modify their logind.conf file to not have PG outright killed when they log out, it's on them to make sure systemd doesn't do other stupid things when they log out too if they really want PG to be run as their user account. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Separate connection handling from backends
On Wed, Dec 7, 2016 at 12:36 AM, Jim Nasby wrote: > The way I'm picturing it backends would no longer be directly > tied to connections. The code that directly handles connections > would grab an available backend when a statement actually came in > (and certainly it'd need to worry about transactions and session > GUCs). If we're going to consider that, I think we should consider going all the way to the technique used by many (most?) database products, which is to have a configurable number of "engines" that pull work requests from queues. We might have one queue for disk writes, one for disk reads, one for network writes, etc. Traditionally, each engine spins over attempts to read from the queues until it finds a request to process; blocking only if several passes over all queues come up empty. It is often possible to bind each engine to a particular core. Current process-local state would be passed around, attached to queued requests, in a structure associated with the connection. I don't know how that execution model would compare to what we use now in terms of performance, but its popularity makes it hard to ignore as something to consider. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
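Kevin's engine model — a fixed pool of workers pulling requests off queues, with per-connection state carried in the request rather than owned by the process — can be sketched like this (a toy, single-queue Python version; a real system would have several typed queues, the spin-then-block polling he describes, and core pinning):

```python
import queue
import threading

class Request:
    """A work item: connection-local state travels WITH the request,
    so any engine can process work for any connection."""
    def __init__(self, conn_state, payload):
        self.conn_state = conn_state
        self.payload = payload

def engine(q, results):
    # Each engine blocks on the queue; a None request shuts it down.
    while True:
        req = q.get()
        if req is None:
            break
        # "Process" the request using state attached to it, not state
        # owned by this worker -- the key decoupling in the model.
        results.append((req.conn_state["conn_id"], req.payload * 2))

q = queue.Queue()
results = []
engines = [threading.Thread(target=engine, args=(q, results)) for _ in range(4)]
for e in engines:
    e.start()
for i in range(10):
    q.put(Request({"conn_id": i % 3}, i))   # 10 requests from 3 "connections"
for _ in engines:
    q.put(None)
for e in engines:
    e.join()
print(sorted(p for _, p in results))
```

Four engines here service ten requests from three connections; no engine is dedicated to a connection, which is exactly what decoupling backends from connections would buy.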
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On 12/6/16 9:53 PM, Tom Lane wrote: > I think we should give serious consideration to back-patching commit > ecb0d20a9, which changed the default semaphore type to unnamed-POSIX > on Linux. Even with that change, dynamic shared memory is still vulnerable to be removed. So backpatching the semaphore change wouldn't achieve any new level of safety for users so that we could tell them, "you're good now". -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] varlena beyond 1GB and matrix
This is a design proposal for a matrix data type which can be larger than 1GB. Beyond support for a new data type, it also needs a platform enhancement, because the existing varlena format has a hard limit (1GB). We had a discussion about this topic at the developer unconference at Tokyo/Akihabara, the day before PGconf.ASIA. The design proposal below stands on the overall consensus of that discussion. BACKGROUND The varlena format has either a short (1-byte) or long (4-byte) header. We use the long header for the in-memory structure which is referenced by VARSIZE()/VARDATA(), or the on-disk structure which is larger than 126b but less than the TOAST threshold. Elsewhere, the short format is used if the varlena is less than 126b or externally stored in the toast relation. No kind of varlena representation supports a data size larger than 1GB. On the other hand, some use cases which handle (relatively) big data in the database are interested in variable-length datums larger than 1GB. One example is what I've presented at PGconf.SV. A PL/CUDA function takes two arguments (2D-arrays of int4 instead of matrices), then returns the top-N combinations of the chemical compounds according to their similarity, like so: SELECT knn_gpu_similarity(5, Q.matrix, D.matrix) FROM (SELECT array_matrix(bitmap) matrix FROM finger_print_query WHERE tag LIKE '%abc%') Q, (SELECT array_matrix(bitmap) matrix FROM finger_print_10m) D; array_matrix() is an aggregate function to generate a 2D-array which contains the entire input relation stream. It works fine as long as the data size is less than 1GB. Once it exceeds that boundary, the user has to split the 2D-array by hand, although it is not uncommon for a recent GPU model to have more than 10GB of RAM. It is really problematic if we cannot split the mathematical problem into several portions appropriately. And it is not comfortable for users who cannot use the full capability of the GPU device.
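To make the 1GB ceiling concrete: a square int4 matrix hits the varlena limit somewhere above 16k x 16k elements. A quick back-of-the-envelope check (it ignores array headers and per-dimension bookkeeping, so real limits are slightly lower):

```python
VARLENA_MAX = 1 << 30  # 1GB payload ceiling for a single varlena datum

def fits_in_varlena(rows, cols, elem_size=4):
    """Rough test: does a rows x cols array of fixed-size elements fit
    in one varlena datum? Header overhead is deliberately ignored."""
    return rows * cols * elem_size < VARLENA_MAX

print(fits_in_varlena(10_000, 10_000))  # True:  ~400MB of int4
print(fits_in_varlena(20_000, 20_000))  # False: ~1.6GB, over the limit
```

So a 10M-row fingerprint matrix of modest width already flirts with the boundary, which is exactly the splitting-by-hand pain described above.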
= PROBLEM = Our problem is that the varlena format does not permit a variable-length datum larger than 1GB, even though our "matrix" type wants to move a bunch of data larger than 1GB. So, we need to solve the varlena format restriction prior to the matrix type implementation. At the developer unconference, people discussed three ideas for the problem. The overall consensus was the idea of a special data type which can contain multiple indirect references to other data chunks. Both the main part and the referenced data chunks are less than 1GB each, but the total amount of data we can represent is more than 1GB. For example, even if we have a large matrix around 8GB, its sub-matrices, separated into 9 portions (3x3), are individually less than 1GB. It is problematic when we try to save a matrix which contains indirect references to the sub-matrices, because toast_save_datum() writes out the flat portion just after the varlena header onto a tuple or a toast relation as is. If the main portion of the matrix contains pointers (indirect references), that is obviously problematic. We need an infrastructure to serialize the indirect references prior to saving. BTW, the other, less acknowledged ideas were a 64-bit varlena header and utilization of large objects. The former breaks the current data format, and thus will have unexpected side-effects on the existing code of PostgreSQL core and extensions. The latter requires users to construct a large object beforehand. That makes it impossible to use the interim result of a sub-query, and leads to unnecessary I/O for preparation. == SOLUTION == I'd like to propose a new optional type handler 'typserialize' to serialize an in-memory varlena structure (one that can have indirect references) to the on-disk format. If present, it shall be invoked at the head of toast_insert_or_update(), so that indirect references are transformed into something else which is safe to save.
(My expectation is that the 'typserialize' handler first saves the indirect chunks to the toast relation, then puts toast pointers in their place.) On the other hand, it is uncertain whether we need a 'typdeserialize' handler symmetrically. Because all the functions/operators which support the special data type should know its internal format, it is possible to load the indirectly referenced data chunks on demand. That will be beneficial from the performance perspective if functions/operators touch only a part of the large structure, because the remaining portions need not be loaded into memory. One thing I'm not certain about is whether we can update the datum supplied as an argument in functions/operators. Let me assume the data structure below: struct { int32 varlena_head; Oid element_id; int matrix_type; int blocksz_x; /* horizontal size of each block */ int blocksz_y; /* vertical size of each block */ int gridsz_x; /* horizon
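The overall shape being proposed — an in-memory header holding indirect references to sub-chunks, a serialize hook that flattens those references before storage, and on-demand loading on the read side — can be modeled in a few lines of Python. All names here are invented for illustration; the actual handler would live in C next to toast_insert_or_update():

```python
CHUNK_LIMIT = 8  # stand-in for the ~1GB per-chunk ceiling

def serialize(data, store):
    """'typserialize' stand-in: split oversized data into chunks, save
    each chunk separately (the toast relation, in the proposal), and
    keep only small chunk ids in the flat header that gets stored."""
    chunk_ids = []
    for i in range(0, len(data), CHUNK_LIMIT):
        cid = len(store)
        store[cid] = data[i:i + CHUNK_LIMIT]
        chunk_ids.append(cid)
    return {"nbytes": len(data), "chunks": chunk_ids}

def load_chunk(header, n, store):
    """Read side: fetch one referenced chunk on demand, so touching a
    single sub-matrix does not pull the whole >1GB datum into memory --
    which is why a symmetric 'typdeserialize' may be unnecessary."""
    return store[header["chunks"][n]]

store = {}
header = serialize(b"0123456789abcdefghij", store)
print(header["nbytes"], len(header["chunks"]))  # 20 3
print(load_chunk(header, 1, store))             # b'89abcdef'
```

The header stays small and pointer-free on disk, while the total represented size is bounded only by the number of chunks.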
[HACKERS] Re: [PATCH] PostgresNode.pm enhancements, pg_lsn helper, and some more recovery tests
Hello, Craig. I noticed that these patches have a "Needs review" status and decided to take a look. Surprisingly, I didn't manage to find many defects. * I like this idea in general; * Code is covered by tests and documented well; * All tests pass; * No warnings during compilation and/or test runs; * I didn't find any typos; * Code style seems to be OK; Though are you sure you want to name a class like_this instead of LikeThis? It's quite uncommon in Perl code and usually used only for packages like "strict" and "warnings". However, I prefer not to insist on changes like this since it's a matter of taste. The bottom line --- this code looks good to me. If there is nothing you would like to change at the last moment, I suggest changing the status to "Ready for committer". -- Best regards, Aleksander Alekseev
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
On Wed, Dec 07, 2016 at 03:42:53PM +0300, Aleksander Alekseev wrote: > > In the same host, primary and standby will try to use the tablespace > > in the same path. That's the origin of this breakage. > > Sorry, I don't follow. Don't master and replica use different > directories to store _all_ data? Particularly in my case: > > ``` > $ find path/to/postgresql-install/ -type d -name pg_tblspc > /home/eax/work/postgrespro/postgresql-install/data-slave/pg_tblspc > /home/eax/work/postgrespro/postgresql-install/data-master/pg_tblspc > ``` > > Where exactly a collision happens? At the location of the tablespaces, pg_tblspc just stores symlinks to the place data is stored, and both point to the same path, the same path being streamed to the standby when replaying the create tablespace record. -- Michael
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
> In the same host, primary and standby will try to use the tablespace > in the same path. That's the origin of this breakage. Sorry, I don't follow. Don't master and replica use different directories to store _all_ data? Particularly in my case: ``` $ find path/to/postgresql-install/ -type d -name pg_tblspc /home/eax/work/postgrespro/postgresql-install/data-slave/pg_tblspc /home/eax/work/postgrespro/postgresql-install/data-master/pg_tblspc ``` Where exactly a collision happens? On Wed, Dec 07, 2016 at 09:39:20PM +0900, Michael Paquier wrote: > On Wed, Dec 07, 2016 at 03:18:59PM +0300, Aleksander Alekseev wrote: > > I noticed, that `make installcheck` fails on my laptop with following > > errors: > > > > http://afiskon.ru/s/98/6f94ce2cfa_regression.out.txt > > http://afiskon.ru/s/b3/d0da05597e_regression.diffs.txt > > The interesting bit for the archives: > > *** > /home/eax/work/postgrespro/postgresql-src/src/test/regress/expected/tablespace.out > 2016-12-07 13:53:44.000728436 +0300 > --- > /home/eax/work/postgrespro/postgresql-src/src/test/regress/results/tablespace.out > 2016-12-07 13:53:46.150728558 +0300 > *** > *** 66,71 > --- 66,72 > INSERT INTO testschema.test_default_tab VALUES (1); > CREATE INDEX test_index1 on testschema.test_default_tab (id); > CREATE INDEX test_index2 on testschema.test_default_tab (id) TABLESPACE > regress_tblspace; > + ERROR: could not open file "pg_tblspc/16395/PG_10_201612061/16393/16407": > No such file or directory > \d testschema.test_index1 > > > Any ideas what can cause this issue? > > In the same host, primary and standby will try to use the tablespace > in the same path. That's the origin of this breakage. > -- > Michael -- Best regards, Aleksander Alekseev signature.asc Description: PGP signature
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
On Wed, Dec 07, 2016 at 03:18:59PM +0300, Aleksander Alekseev wrote: > I noticed, that `make installcheck` fails on my laptop with following > errors: > > http://afiskon.ru/s/98/6f94ce2cfa_regression.out.txt > http://afiskon.ru/s/b3/d0da05597e_regression.diffs.txt The interesting bit for the archives: *** /home/eax/work/postgrespro/postgresql-src/src/test/regress/expected/tablespace.out 2016-12-07 13:53:44.000728436 +0300 --- /home/eax/work/postgrespro/postgresql-src/src/test/regress/results/tablespace.out 2016-12-07 13:53:46.150728558 +0300 *** *** 66,71 --- 66,72 INSERT INTO testschema.test_default_tab VALUES (1); CREATE INDEX test_index1 on testschema.test_default_tab (id); CREATE INDEX test_index2 on testschema.test_default_tab (id) TABLESPACE regress_tblspace; + ERROR: could not open file "pg_tblspc/16395/PG_10_201612061/16393/16407": No such file or directory \d testschema.test_index1 > Any ideas what can cause this issue? In the same host, primary and standby will try to use the tablespace in the same path. That's the origin of this breakage. -- Michael signature.asc Description: PGP signature
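The collision Michael describes is easy to see in miniature: each cluster's pg_tblspc entry is just a symlink, and on one host the primary's link and the standby's replayed link resolve to the same real directory. A small Python sketch of the check one could run before reusing a tablespace path (directory names are invented for illustration):

```python
import os
import tempfile

root = tempfile.mkdtemp()
tblspc = os.path.join(root, "the_tablespace")        # actual storage path
master = os.path.join(root, "data-master", "pg_tblspc")
slave = os.path.join(root, "data-slave", "pg_tblspc")
for d in (tblspc, master, slave):
    os.makedirs(d)

# CREATE TABLESPACE on the primary makes a symlink; replaying the WAL
# record on a same-host standby makes a second symlink to the SAME path.
os.symlink(tblspc, os.path.join(master, "16395"))
os.symlink(tblspc, os.path.join(slave, "16395"))

a = os.path.realpath(os.path.join(master, "16395"))
b = os.path.realpath(os.path.join(slave, "16395"))
print(a == b)  # True: both clusters now share one tablespace directory
```

Even though data-master and data-slave are separate, realpath() exposes that both clusters write into one directory — the origin of the "could not open file" failures in the tablespace test.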
[HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
Hello. I noticed that `make installcheck` fails on my laptop with the following errors: http://afiskon.ru/s/98/6f94ce2cfa_regression.out.txt http://afiskon.ru/s/b3/d0da05597e_regression.diffs.txt My first idea was to use `git bisect`. It turned out that this issue reproduces on commits from as far back as 2015 as well (older versions don't compile on my laptop). However, it reproduces rarely and with different errors: http://afiskon.ru/s/8e/1ad2c8ed2b_regression.diffs-8c48375.txt Here are the scripts I use to compile and test PostgreSQL: https://github.com/afiskon/pgscripts Exact steps to reproduce: ``` ./quick-build.sh && ./install.sh && make installcheck ``` Completely removing all `configure` flags doesn't make any difference. The issue reproduces only on a master-replica setup, i.e. if instead of install.sh you run ./single-install.sh, all tests pass. I'm using Arch Linux and GCC 6.2.1. Any ideas what can cause this issue? -- Best regards, Aleksander Alekseev
Re: [HACKERS] Push down more full joins in postgres_fdw
On Wed, Dec 7, 2016 at 4:51 PM, Etsuro Fujita wrote: > On 2016/12/05 20:20, Ashutosh Bapat wrote: >> >> On Mon, Nov 28, 2016 at 3:52 PM, Etsuro Fujita >> wrote: >>> >>> On 2016/11/24 21:45, Etsuro Fujita wrote: * I removed make_outerrel_subquery/make_innerrel_subquery from fpinfo and added a new member "is_subquery_rel" to fpinfo, as a flag on whether to deparse the relation as a subquery. > > >> Replacing make_outerrel_subquery/make_innerrel_subquery with >> is_subquery_rel >> seems to be a bad idea. Whether a relation needs to be deparsed as a >> subquery >> or not depends upon the relation which joins it with another relation. >> Take for >> example a case when a join ABC can pull up the conditions on AB, but ABD >> can >> not. In this case we will mark that AB in ABD needs to be deparsed as a >> subquery but will not mark so in ABC. So, if we choose to join ABCD as >> (ABC)D >> we don't need AB to be deparsed as a subquery. If we choose to join ABCD >> as >> (ABD)C, we will need AB to deparsed as a subquery. > > > This is not true; since (1) currently, a relation needs to be deparsed as a > subquery only when the relation is full-joined with any other relation, and > (2) the join ordering for full joins is preserved, we wouldn't have such an > example. (You might think we would have that when extending the patch to > handle PHVs, but the answer would be no, because the patch for PHVs would > determine whether a relation emitting PHVs needs to be deparsed as a > subquery, depending on the relation itself. See the patch for PHVs.) I like > is_subquery_rel more than make_outerrel_subquery/make_innerrel_subquery, > because it reduces the number of members in fpinfo. But the choice would be > arbitrary, so I wouldn't object to going back to > make_outerrel_subquery/make_innerrel_subquery. I think you are right. And even if we were to deparse a relation as subquery, when it shouldn't be, we will only loose a syntactic optimization. So, it shouldn't result in a bug. 
>> 3. Why is foreignrel variable changed to rel? >> -extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root, >> -RelOptInfo *foreignrel, List *tlist, >> -List *remote_conds, List *pathkeys, >> -List **retrieved_attrs, List **params_list); >> +extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo >> *root, RelOptInfo *rel, >> +List *tlist, List *remote_conds, List *pathkeys, >> +bool is_subquery, List **retrieved_attrs, >> +List **params_list); > > > I changed the name that way to match with the function definition in > deparse.c. > That's something good to do but it's unrelated to this patch. I have repeated this line several times in this thread :) -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Push down more full joins in postgres_fdw
On 2016/12/07 15:39, Ashutosh Bapat wrote: On 2016/11/22 18:28, Ashutosh Bapat wrote: I guess, the reason why you are doing it this way, is SELECT clause for the outermost query gets deparsed before FROM clause. For later we call deparseRangeTblRef(), which builds the tlist. So, while deparsing SELECT clause, we do not have tlist to build from. In that case, I guess, we have to build the tlist in get_subselect_alias_id() if it's not available and stick it in fpinfo. Subsequent calls to get_subselect_alias_id() should find tlist there. Then in deparseRangeTblRef() assert that there's a tlist in fpinfo and use it to build the SELECT clause of subquery. That way, we don't build tlist unless it's needed and also use the same tlist for all searches. Please use tlist_member() to search into the tlist. I wrote: This would probably work, but seems to me a bit complicated. Instead, I'd like to propose that we build the tlist for each relation being deparsed as a subquery in a given join tree, right before deparsing the SELECT clause in deparseSelectStmtForRel, if is_subquery is false and lower_subquery_rels isn't NULL, and store the tlist into the relation's fpinfo. That would allow us to build the tlist only when we need it, and to use tlist_member for the exact comparison. I think it would be much easier to implement that. IIRC, this is inline with my original proposal of creating tlists before deparsing anything. Along-with that I also proposed to bubble this array of tlist up the join hierarchy to avoid recursion [1] point #15, further elaborated in [2] [1] https://www.postgresql.org/message-id/ad449b25-66ee-4c06-568c-0eb2e1bef9f9%40lab.ntt.co.jp [2] https://www.postgresql.org/message-id/CAFjFpRcn7%3DDNOXm-PJ_jVDqAmghKVf6tApQC%2B_RgMZ8tOPExcA%40mail.gmail.com We seem to be moving towards that solution, but as discussed we have left it to committer's judgement. 
My proposal here would be a bit different from what you proposed; right before deparseSelectSql in deparseSelectStmtForRel, build the tlists for relations present in the given jointree that will be deparsed as subqueries, by (1) traversing the given jointree and (2) applying build_tlist_to_deparse to each relation to be deparsed as a subquery and saving the result in that relation's fpinfo. I think that would be implemented easily, and the overhead would be small. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Push down more full joins in postgres_fdw
On 2016/12/05 20:20, Ashutosh Bapat wrote: On Mon, Nov 28, 2016 at 3:52 PM, Etsuro Fujita wrote: On 2016/11/24 21:45, Etsuro Fujita wrote: * I removed make_outerrel_subquery/make_innerrel_subquery from fpinfo and added a new member "is_subquery_rel" to fpinfo, as a flag on whether to deparse the relation as a subquery. Replacing make_outerrel_subquery/make_innerrel_subquery with is_subquery_rel seems to be a bad idea. Whether a relation needs to be deparsed as a subquery or not depends upon the relation which joins it with another relation. Take for example a case when a join ABC can pull up the conditions on AB, but ABD can not. In this case we will mark that AB in ABD needs to be deparsed as a subquery but will not mark so in ABC. So, if we choose to join ABCD as (ABC)D we don't need AB to be deparsed as a subquery. If we choose to join ABCD as (ABD)C, we will need AB to deparsed as a subquery. This is not true; since (1) currently, a relation needs to be deparsed as a subquery only when the relation is full-joined with any other relation, and (2) the join ordering for full joins is preserved, we wouldn't have such an example. (You might think we would have that when extending the patch to handle PHVs, but the answer would be no, because the patch for PHVs would determine whether a relation emitting PHVs needs to be deparsed as a subquery, depending on the relation itself. See the patch for PHVs.) I like is_subquery_rel more than make_outerrel_subquery/make_innerrel_subquery, because it reduces the number of members in fpinfo. But the choice would be arbitrary, so I wouldn't object to going back to make_outerrel_subquery/make_innerrel_subquery. Some more comments 1. The output of the following query +SELECT 1 FROM (SELECT c1 FROM ft4 WHERE c1 between 50 and 60) t1 FULL JOIN (SELECT c1 FROM ft5 WHERE c1 between 50 and 60) t2 ON (TRUE); produces 21 rows all containing a "1". 
That's not quite a good use of expected-output space, given that all the test checks is that empty targetlists are deparsed correctly. You might want to either just test EXPLAIN output or make the "between 50 and 60" condition more stringent to reduce the number of rows output. Will do.

2. Got lost here. I guess all you need to say is "test deparsing FOR UPDATE clause with subqueries.". Anyway, the sentence is very hard to read and needs simplification. +-- d. test deparsing the remote queries for simple foreign table scans in +-- an EPQ subplan of a foreign join in which the foreign tables are full +-- joined and in the remote join query the foreign tables are deparsed as +-- subqueries Will do.

3. Why is the foreignrel variable changed to rel? -extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root, -RelOptInfo *foreignrel, List *tlist, -List *remote_conds, List *pathkeys, -List **retrieved_attrs, List **params_list); +extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root, RelOptInfo *rel, +List *tlist, List *remote_conds, List *pathkeys, +bool is_subquery, List **retrieved_attrs, +List **params_list); I changed the name that way to match the function definition in deparse.c. Thanks for the comments! Best regards, Etsuro Fujita
Re: [HACKERS] [GENERAL] Select works only when connected from login postgres
Tom Lane wrote: > BTW, I realized while testing this that there's still one gap in our > understanding of what went wrong for you: cases like "SELECT 'hello'" > should not have tried to use the pager, because that would've produced > less than a screenful of data At some point emacs was mentioned as the terminal: >> And I guess I did that intentionally, my .bashrc has >> >> # I use emacs shells, I got a "pager" already: >> export PAGER='' The M-x shell mode of emacs has a so-called "dumb" terminal emulation (that's the value of $TERM) where the notion of a "page" doesn't quite apply. For instance, when using emacs 24.3 with my default pager on an Ubuntu desktop, this is what I get:

test=> select 1;
WARNING: terminal is not fully functional
-  (press RETURN)
 ?column?
----------
        1
(1 row)

I suspect that psql is unable to determine the screen size of the "dumb" terminal, and that it's the fault of the terminal rather than psql. The warning is displayed by "less" AFAICS. There are other psql features like tab-completion that don't work in this mode because emacs interprets keystrokes first for itself, in effect mixing emacs functionality with that of the application run in the terminal. It's awesome sometimes and irritating at other times depending on what you expect :) OTOH it also has a M-x term command/mode that provides a more sophisticated screen emulation in which paging seems to work exactly like in a normal terminal and the emacs key bindings are turned off. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
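The pager decision being discussed can be modeled roughly as follows. This is a hedged sketch of the usual heuristic pager-using clients apply, not psql's actual C logic: page only when stdout is an interactive, capable terminal and the result would not fit on one screen. Under emacs M-x shell, TERM is "dumb" and the screen size is unknown, so every branch argues against paging.

```python
# Illustrative model of a pager heuristic (not psql's real implementation).

def should_use_pager(n_lines, isatty, term, screen_rows):
    if not isatty:                  # output redirected: never page
        return False
    if term in ("", "dumb"):        # terminal can't do cursor addressing
        return False
    if screen_rows <= 0:            # screen size unknown (e.g. dumb term)
        return False
    return n_lines >= screen_rows   # page only past one screenful

print(should_use_pager(3, True, "xterm", 24))    # short result: no pager
print(should_use_pager(100, True, "xterm", 24))  # long result: pager
print(should_use_pager(100, True, "dumb", 0))    # emacs M-x shell: no pager
```

In the reported case the warning comes from "less" itself being handed a dumb terminal, i.e. the pager was invoked even though a model like the above would have skipped it.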
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Wed, Dec 7, 2016 at 5:17 PM, Masahiko Sawada wrote: > On Wed, Dec 7, 2016 at 4:05 PM, Michael Paquier > wrote: >> Indeed, I haven't thought about that, and that's a no-brainer. That >> would remove the need to allocate and sort each array; what is simply >> needed is to track the number of times a newest value has been found. >> So what this processing would do is updating the write/flush/apply >> values for the first k loops if the new value is *older* than the >> current one, where k is the quorum number, and between k+1 and N the >> value gets updated only if the value compared is newer. No need to >> take the mutex lock for a long time as well. > > Sorry, I could not understand this algorithm. Could you elaborate > on this? Does it take only O(n) time? Nah, please forget that; that was a random useless thought. There is no way to select the k-th element without knowing the hierarchy induced by the others, which is what the partial sort would help with here. -- Michael
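For reference, the partial-sort idea mentioned above can be sketched in a few lines. This is an illustrative Python model (LSNs as plain integers), not the syncrep.c code: with N standbys and quorum k, the LSN that is safe to acknowledge is the k-th newest reported position, since at least k standbys have reached it; a partial sort (heapq.nlargest, O(N log k)) avoids fully sorting the array.

```python
# Model of the quorum computation: find the k-th newest LSN among the
# standbys' reported positions without a full sort.
import heapq

def quorum_synced_lsn(standby_lsns, k):
    """Return the newest LSN already reached by at least k standbys."""
    assert 1 <= k <= len(standby_lsns)
    # nlargest keeps a k-element heap while scanning: O(N log k)
    return heapq.nlargest(k, standby_lsns)[-1]

lsns = [400, 150, 380, 320, 90]      # reported flush positions
print(quorum_synced_lsn(lsns, 2))    # 2 standbys are at >= 380
print(quorum_synced_lsn(lsns, 4))    # 4 standbys are at >= 150
```

This also illustrates why a single O(N) pass that only tracks "the newest value seen so far" cannot work: the answer depends on the ordering among the top k values, which is exactly what the partial sort supplies.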
[HACKERS] Hang in pldebugger after git commit : 98a64d0
Hi All, I have noticed that any SQL query executed immediately after attaching to a particular debugging target (pldebugger) either takes a longer time for execution or it hangs on Windows. I have checked it on the PostgreSQL-9.5 and PostgreSQL-9.6 versions and have found that the issue only exists on the 9.6 version. After doing 'git bisect' I found that the following git commit in the PostgreSQL-9.6 branch is the culprit.

commit 98a64d0bd713cb89e61bef6432befc4b7b5da59e Author: Andres Freund Date: Mon Mar 21 09:56:39 2016 +0100 Introduce WaitEventSet API. Commit ac1d794 ("Make idle backends exit if the postmaster dies.") introduced a regression on, at least, large linux systems. Constantly adding the same postmaster_alive_fds to the OS's internal datastructures for implementing poll/select can cause significant contention, leading to a performance regression of nearly 3x in one example. This can be avoided by using e.g. linux's epoll, which avoids having to add/remove file descriptors to the wait datastructures at a high rate. Unfortunately the current latch interface makes it hard to allocate any persistent per-backend resources.

Following are the steps to reproduce the issue:
1) Download pldebugger from the below url and copy it into the contrib directory: git clone git://git.postgresql.org/git/pldebugger.git
2) Start a new backend session (psql -d postgres)
3) Create a plpgsql function, say func1()
4) Get the oid of func1 and enable debugging of it using the pldbgapi function as shown below: select plpgsql_oid_debug(16487);
5) Execute function func1: select func1(); After executing the above query we will get the message below and the terminal will not respond as it will go into listen mode. NOTICE: PLDBGBREAK:2
6) Start another backend session.
7) Execute the below query: SELECT * FROM pldbg_attach_to_port(2) NOTE: We need to extract the port number from the step 5 NOTICE message after the 'PLDBGBREAK:' string and use it as input here.
8) Execute any SQL query now and the problem starts.
I have tried with the below queries: SELECT 1; OR SELECT pg_backend_pid(); OR SELECT * FROM pldbg_wait_for_breakpoint(1::INTEGER);

Problem Analysis: Although I am very new to Windows, I tried debugging the issue and could find that the backend is not receiving the query executed after "SELECT pldbg_attach_to_port(2)" and is infinitely waiting in "WaitEventSetWaitBlock()" at WaitForMultipleObjects() to read the input command. Below is the backtrace for the same.

postgres.exe!WaitEventSetWaitBlock(WaitEventSet * set, int cur_timeout, WaitEvent * occurred_events, int nevents) Line 1384 + 0x2b bytes C
postgres.exe!WaitEventSetWait(WaitEventSet * set, long timeout, WaitEvent * occurred_events, int nevents) Line 936 + 0x18 bytes C
postgres.exe!secure_read(Port * port, void * ptr, unsigned __int64 len) Line 168 C
postgres.exe!pq_recvbuf() Line 921 + 0x33 bytes C
postgres.exe!pq_getbyte() Line 963 + 0x5 bytes C
postgres.exe!SocketBackend(StringInfoData * inBuf) Line 334 + 0x5 bytes C
postgres.exe!ReadCommand(StringInfoData * inBuf) Line 507 + 0xa bytes C
postgres.exe!PostgresMain(int argc, char * * argv, const char * dbname, const char * username) Line 4004 + 0xd bytes C
postgres.exe!BackendRun(Port * port) Line 4259 C
postgres.exe!SubPostmasterMain(int argc, char * * argv) Line 4750 C
postgres.exe!main(int argc, char * * argv) Line 216 C
postgres.exe!__tmainCRTStartup() Line 555 + 0x19 bytes C
postgres.exe!mainCRTStartup() Line 371 C

With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
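For readers unfamiliar with the commit being bisected to: the WaitEventSet API exists because poll()/select() re-register the same file descriptors in the kernel on every wait, whereas epoll keeps a persistent registration. This Linux-only Python sketch (not the PostgreSQL latch code) models that idea with a pipe: the fd is added to the epoll set once, then waited on repeatedly.

```python
# Model of a persistent "wait event set": register the fd exactly once,
# then wait cheaply as many times as needed (Linux epoll).
import os
import select

r, w = os.pipe()
waitset = select.epoll()             # persistent wait event set
waitset.register(r, select.EPOLLIN)  # one-time registration

os.write(w, b"x")                    # simulate data arriving on the socket
events = waitset.poll(timeout=1)     # wait without re-registering anything
ready = r in [fd for fd, mask in events]
print(ready)                         # the read end is reported ready

os.read(r, 1)                        # drain; a further poll would time out
waitset.close()
os.close(r)
os.close(w)
```

The Windows implementation of the same API sits on WaitForMultipleObjects(), which is exactly where the backtrace above shows the backend blocked, so the hang is plausibly a Windows-specific interaction with this new wait path rather than a problem in pldebugger itself.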
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Wed, Dec 7, 2016 at 7:18 AM, Craig Ringer wrote: > On 7 December 2016 at 10:53, Tom Lane wrote: > > Just saw another report of what's probably systemd killing off Postgres' > > SysV semaphores, as we've discussed previously at, eg, > > https://www.postgresql.org/message-id/flat/57828C31.5060409%40gmail.com > > Since the systemd people are generally impervious to suggestions that > > they might be mistaken, I do not expect this problem to go away. > > > > I think we should give serious consideration to back-patching commit > > ecb0d20a9, which changed the default semaphore type to unnamed-POSIX > > on Linux. We've seen no problems in the buildfarm in the two months > > that that's been in HEAD. If we don't do this, we can expect to > > continue seeing complaints of this sort until pre-v10 PG releases > > fall out of use ... and I don't want to wait that long. > > > > Commit ecb0d20a9 also changed the default for FreeBSD. I'm not convinced > > we should back-patch that part, because (a) unnamed-POSIX semas have > > only been there since FreeBSD 9.0, which isn't that long ago, and (b) > > the argument for switching is "it'll perform better" not "your server > > will fail randomly without this change". > > That's a huge change to make for something that doesn't risk data > corruption, and that won't happen when using postgres with distro > packages. > > As I understand it, it's only a problem if you're running postgres as > a normal user, not a "system user" which systemd defines at > compile-time as a user < 500 or < 1000 depending on the distro's > default login.conf . So it'll only affect people who're not using > their distro's packages and service mechanism, and are instead running > Pg under some other user, likely started manually with pg_ctl. > > I see quite a few people who compile their own Pg rather than using > packages, and some who even fail to use the init system and instead > use custom scripts to launch Pg. 
But pretty much everything I've seen > uses a 'postgres' system-user. Clearly there are exceptions out there > in the wild, but I don't think it makes sense to backpatch this to > satisfy people who are, IMO, doing it wrong in the first place. > > Especially since those people can reconfigure systemd not to do this > with the RemoveIPC and KillUserProcesses directives if they insist on > using a non-system user. > > If they defined a systemd service to start postgres they'd be fine... > and isn't it pretty much basic sysadmin 101 to use your init system to > start services? > > Don't get me wrong, I think systemd's behaviour is pretty stupid. > Mostly in terms of its magic definition of a "system user", which > shouldn't be something determined by a uid threshold at compile time. > But I don't think we should double down on it by backpatching a big > change that hasn't even seen in-the-wild loads from real world use > yet, just to make it easier on people who're doing things backwards in > the first place. > +1 (or several). I don't think we should backpatch something that carries risk for people who do things "the right way" to help those that don't. Even if the behavior is stupid. > > If it were possible to detect that systemd was about to clobber us and > log something informative, _that_ would be very nice to backpatch. I > don't see how that's possible though. Surely there must be some way to ask systemd about its configuration? And if we have that, then we could possibly teach pg_ctl about that and have it throw a big warning? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
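For the record, the workaround Craig refers to is a two-line logind setting. A sketch of the relevant fragment (paths and defaults may differ per distro; this is illustrative, not an endorsement of running Postgres as a non-system user):

```ini
# /etc/systemd/logind.conf -- keep systemd-logind from reaping SysV
# semaphores (and other IPC) owned by a non-system user when that user's
# last session ends. Relevant when running Postgres under a regular user
# without a systemd unit file.
[Login]
RemoveIPC=no
KillUserProcesses=no
```

After editing, systemd-logind has to be restarted (e.g. `systemctl restart systemd-logind`) for the change to take effect.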
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Hello, At Mon, 5 Dec 2016 21:54:08 -0700, "David G. Johnston" wrote in > feel free to s/typemod/typmod/ ... my fingers don't want to drop the "e" > > On Mon, Dec 5, 2016 at 9:17 PM, Kyotaro HORIGUCHI < > horiguchi.kyot...@lab.ntt.co.jp> wrote: > > > Hello, > > > > At Mon, 5 Dec 2016 18:59:42 -0700, "David G. Johnston" < > > david.g.johns...@gmail.com> wrote in > zrvf9qia0wwt_qzmauy...@mail.gmail.com> > > > On Mon, Dec 5, 2016 at 6:36 PM, Kyotaro HORIGUCHI < > > > horiguchi.kyot...@lab.ntt.co.jp> wrote: > > > > > > > (It's not a typo but my poor wording... Sorry.) > > Mmm. I think the typemod of a row should not be applied onto > > other rows, and the same can be said for types. But the type of the > > first row is applied to all of the rest of the rows, though... Does it > > make sense? > > > > Yes. All rows in a given relation must end up with the exact same type > and it isn't friendly to fail when an implicit conversion from unknown is > possible. I'm not sure how clearly users are conscious of the type unknown; my feeling is opposed to implicitly applying the type of the first value in VALUES to all other values of the type "unknown". On the other hand, such behavior doesn't harm anything when we don't explicitly type the values in VALUES. (But it would be a new feature as you wrote below.) Anyway, the real behavior of the current parser for VALUES is scanning the whole list and extracting a common coercible type, then applying that type to the whole list. (transformValuesClause) > > But what I wanted to say was not that but something like the > > following. > > > > select pg_typeof('12345'::varchar(1)); > > pg_typeof > > --- > > character varying > > > > A value seemingly doesn't have a typmod. So it seems reasonable > > that VALUES cannot handle typmod. It seems reasonable regardless > > of how it is actually implemented. > > > > This is an artifact of functions - the typemod associated with the value > '12345' is lost when that value is passed into the function pg_typeof.
It seems to me that is what is occurring in VALUES. > Thus it is impossible to write a SQL query that reports the typemod of a > literal or column reference. Nonetheless it is there in reality. Just see > the original CREATE TABLE AS example for proof. The created table's column > is shown (using direct catalog queries) to contain typemod value of 20 - > which it could only have gotten from the first values rows which contained > a casted literal. OK, I found that I had read the original problem wrongly.

=# create table t2 (a) as select * from (values ('abcdee'::varchar(3)), ('defghij'::varchar(5))) x;
postgres=# \d t2;
          Table "public.t2"
 Column |         Type         | Modifiers
--------+----------------------+-----------
 a      | character varying(3) |
postgres=# select * from t2;
   a
-------
 abc
 defgh
(2 rows)

The problem to be resolved here seems to be that CREATE TABLE AS creates a broken, in a sense, table. Not a coercion of VALUES. #6 - raise an error if a subquery's result doesn't fit the newly created table. Or, create the new table so that all the values given from the subquery fit into it. (I haven't understood how the source typmod affects the new table and I don't see how to do that, though) > > But I understand what we want to solve here is how to change > > VALUES's behavior to perfectly coerce the row values to the types > > explicitly applied in the first row. Right? > > > > Actually, no, since it is not possible to coerce "perfectly". Since any > attempt at doing so could fail it is only possible to scan every row and > compare its typemod to the first row's typemod. Ideally (but maybe not in > reality) as soon as a discrepancy is found stop and discard the typemod. > If every row passes you can retain the typemod. That is arguably the > perfect solution. The concern is that "scan every row" could be very > expensive - though in writing this I'm thinking that you'd quickly find a I found that it already scans the whole list. select_common_type does that. And it returns the type that is found to be applicable to all values.
Handling typmod there would have a small impact. (Though I don't assert that we should do this.) Even though the value itself doesn't convey a typmod, VALUES can take expressions. (It is obvious because the first value was already a coercing expression.) They are scanned already. > non-match even in a large dataset - and so a less perfect but still valid > solution is to simply discard the typemod if there is more than one row. > My thought was that if you are going to discard the typemod in the n > 1 case > for consistency you should discard the typemod in the n = 1 case as well. > > That is, in a nutshell, options 1, 2, and 3 in order. > > The "fault" in #1 that #4 attempted to fix was that VALUES are often hand > entered and so the inexperienced would like to only type in a cast on the > first row and have it apply to all subsequent rows
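The "perfect" option being weighed above can be modeled in a few lines. This is a hedged Python sketch of the idea, not the actual transformValuesClause code: while scanning all VALUES rows to pick the common type, also keep each column's typmod only if every row agrees on it, and on the first mismatch fall back to -1, Postgres' "no typmod" marker.

```python
# Model of per-column typmod agreement across VALUES rows.

def common_typmod(row_typmods):
    """row_typmods: one typmod per row for a single VALUES column.
    Returns the shared typmod, or -1 on the first discrepancy."""
    result = row_typmods[0]
    for tm in row_typmods[1:]:
        if tm != result:
            return -1        # discrepancy found: discard the typmod
    return result

# varchar(3) in every row: the typmod survives
print(common_typmod([3, 3, 3]))      # -> 3
# varchar(3) then varchar(5): mixed, so no typmod
print(common_typmod([3, 5]))         # -> -1
# a single row trivially agrees with itself
print(common_typmod([20]))           # -> 20
```

As noted above, since select_common_type already visits every row, folding a check like this into the same pass would add little cost, and the early return on the first mismatch bounds the extra work in the common mixed-typmod case.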
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Wed, Dec 7, 2016 at 4:05 PM, Michael Paquier wrote: > On Wed, Dec 7, 2016 at 2:49 PM, Kyotaro HORIGUCHI > wrote: >> Aside from measurement of the two sorting methods, I'd like to >> point out that quorum commit basically doesn't need >> sorting. Counting conforming standbys while scanning the >> walsender(receiver) LSN list, comparing with the target LSN, is >> O(n). A small refactoring of SyncRepGetOldestSyncRecPtr would be >> enough to do that. What does the target LSN mean here? > Indeed, I haven't thought about that, and that's a no-brainer. That > would remove the need to allocate and sort each array; what is simply > needed is to track the number of times a newest value has been found. > So what this processing would do is updating the write/flush/apply > values for the first k loops if the new value is *older* than the > current one, where k is the quorum number, and between k+1 and N the > value gets updated only if the value compared is newer. No need to > take the mutex lock for a long time as well. Sorry, I could not understand this algorithm. Could you elaborate on this? Does it take only O(n) time? > By the way, the patch now > conflicts on HEAD, it needs a refresh. Thanks, I'll post the latest patch. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center