Re: [HACKERS] Freeze avoidance of very large table.
On 4/20/15 4:13 PM, Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote: I also think there's better ways we could handle *all* our cleanup work. Tuples have a definite lifespan, and there's potentially a lot of efficiency to be gained if we could track tuples through their stages of life... but I don't see any easy ways to do that.

See the TODO list: https://wiki.postgresql.org/wiki/Todo

  o Avoid the requirement of freezing pages that are infrequently modified

Right, but do you have a proposal for how that would actually happen? Perhaps I'm mis-understanding you, but it sounded like you were opposed to this patch because it doesn't do anything to avoid the need to freeze. My point is that no one has any good ideas on how to avoid freezing, and I think it's a safe bet that any ideas people do come up with there will be a lot more invasive than a FrozenMap is.

Didn't you think any of the TODO threads had workable solutions?

I didn't realize there were threads there. The first three are discussion around the idea of eliminating the need to freeze based on a page already being all visible. No patches.

http://www.postgresql.org/message-id/ca+tgmoaemnolzmvbb8gvy69na8zw9bwpiz9+tlz-lnabozi...@mail.gmail.com has a WIP patch that goes the route of using a tuple flag to indicate frozen, but also raises a lot of concerns about visibility, because it means we'd stop using FrozenXID. That impacts a large amount of code. There were some followup patches as well as a bunch of discussion of how to make it visible that a tuple was frozen or not. That thread died in January 2014.

The fifth thread is XID to LSN mapping. AFAICT this has a significant drawback in that it breaks page compatibility, meaning no pg_upgrade. It ends 5/14/2014 with this comment:

Well, Heikki was saying on another thread that he had kind of gotten cold feet about this, so I gather he's not planning to pursue it. Not sure if I understood that correctly. If so, I guess it depends on whether someone else can pick it up, but we might first want to establish why he got cold feet and how worrying those problems seem to other people. - http://www.postgresql.org/message-id/CA+TgmoYoN8LzSuaffUaEkyV8Mhv1wi=zlbxq3vofezno1db...@mail.gmail.com

So work was done on two alternative approaches, and then abandoned. Both of those approaches might still be valid, but seem to need more work. They're also higher risk because they're changing MVCC at a very fundamental level.

As I mentioned, I think there's a lot better stuff we could be doing about tuple lifetime, but there's no easy fixes to be had. This patch solves a problem today, using a concept that's now well proven (visibility map). If we had something more sophisticated being developed then I'd be inclined not to pursue this patch, but that's not the case. Perhaps others can elaborate on where those two patches are at...

don't expect adding an additional file per relation will be zero cost --- added over the lifetime of 200M transactions, I question if this approach would be a win.

Can you elaborate on this? I don't see how the number of transactions would come into play, but the overhead here is not large; the FrozenMap would be the same size as the VM map, which is 1/64,000th as large as the heap. So a 64G table means a 1M FM. That doesn't seem very expensive.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
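To make the size claim above concrete, the arithmetic works out as follows in a small standalone C sketch (an illustration, not code from the patch; it assumes the default 8 kB block size and one map bit per heap page):

#include <stdio.h>

int main(void)
{
	/* One map bit tracks one 8 kB heap page, so one map byte tracks
	 * 8 heap pages = 64 kB of heap; the map is heap_size / 65536. */
	const long long heap_bytes = 64LL * 1024 * 1024 * 1024;	/* 64 GB table */
	long long heap_pages = heap_bytes / 8192;			/* default BLCKSZ */
	long long map_bytes = heap_pages / 8;				/* one bit per page */

	printf("heap pages: %lld, frozen map: %lld bytes (%lld MB)\n",
		   heap_pages, map_bytes, map_bytes / (1024 * 1024));
	return 0;
}

This prints 8388608 heap pages and a 1 MB map, matching the 64G-table figure above (1/65,536 is the exact ratio behind the rounded "1/64,000th").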
Re: [HACKERS] Freeze avoidance of very large table.
Re: [HACKERS] Freeze avoidance of very large table.
On 4/20/15 2:45 AM, Sawada Masahiko wrote:

Current patch adds new source file src/backend/access/heap/frozenmap.c which is quite similar to visibilitymap.c. They have similar code but are separated for now. I'll do refactoring of this source code, like adding a bitmap.c, if needed.

My feeling is we'd definitely want this refactored; it looks to be a whole lot of duplicated code. But before working on that we should get consensus that a FrozenMap is a good idea. Are there any meaningful differences between the two, besides the obvious name changes? I think there's also a bunch of XLOG stuff that could be refactored too...

Also, when skipping vacuum by visibility map, we can skip at least SKIP_PAGES_THRESHOLD consecutive pages, but such mechanism is not in frozen map.

That's probably something else that can be factored out, since it's basically the same logic. I suspect we just need to adjust some of the checks so we're looking at both FM and VM at the same time.

Other comments...

It would be nice if we didn't need another page bit for FM; do you see any reasonable way that could happen?

+	 * If we didn't pin the visibility (and frozen) map page and the page has
+	 * become all visible (and frozen) while we were busy locking the buffer,
+	 * or during some subsequent window during which we had it unlocked,
+	 * we'll have to unlock and re-lock, to avoid holding the buffer lock
+	 * across an I/O. That's a bit unfortunate, especially since we'll now
+	 * have to recheck whether the tuple has been locked or updated under us,
+	 * but hopefully it won't happen very often.
	 */

s/(and frozen)/ or frozen/

+ * Reply XLOG_HEAP3_FROZENMAP record.

s/Reply/Replay/

+	/*
+	 * XLogReplayBufferExtended locked the buffer. But frozenmap_set
+	 * will handle locking itself.
+	 */
+	LockBuffer(fmbuffer, BUFFER_LOCK_UNLOCK);

Doesn't this create a race condition?

Are you sure the bit in finish_heap_swap() is safe? If so, we should add the same for the visibility map too (it certainly better be all visible if it's frozen...)

+	/*
+	 * Current block is all-visible.
+	 * If frozen map represents that it's all frozen and this
+	 * function is called for freezing tuples, we can skip to
+	 * vacuum block.
+	 */

I would state this as "Even if scan_all is true, we can skip blocks that are marked as frozen."

+	if (frozenmap_test(onerel, blkno, fmbuffer) && scan_all)

I suspect it's faster to reverse those tests (scan_all && frozenmap_test())... but why do we even need to look at scan_all? AFAICT if a block is frozen we can skip it unconditionally.

+	/*
+	 * If the un-frozen tuple is remaining in current page and
+	 * current page is marked as ALL_FROZEN, we should clear it.
+	 */

That needs to NEVER happen. If it does then we're going to consider tuples as visible/frozen that shouldn't be. We should probably throw an error here, because it means the heap is now corrupted. At the minimum it needs to be an Assert().

Note that I haven't reviewed all the logic in detail at this point. If this ends up being refactored it'll be a lot easier to spot logic problems, so I'll hold off on that for now.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
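For the "should never happen" case above, the hardening suggested could look something like this fragment (a sketch only, reusing variable names from the patch under review; all_frozen_in_fm and tuple_needs_freeze are hypothetical placeholders for the patch's actual state, and the real wording and placement would be up to the author):

	/*
	 * Sketch: treat an un-frozen tuple on a page the frozen map claims
	 * is all-frozen as corruption, instead of silently clearing the bit.
	 */
	if (all_frozen_in_fm && tuple_needs_freeze)
		ereport(ERROR,
				(errcode(ERRCODE_DATA_CORRUPTED),
				 errmsg("found unfrozen tuple on page %u of \"%s\", which is marked all-frozen",
						blkno, RelationGetRelationName(onerel))));

An Assert() would catch this in development builds, but an ereport(ERROR) also protects production systems, where asserts are compiled out.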
Re: [HACKERS] Turning off HOT/Cleanup sometimes
Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote: This seems simple to implement: keep two counters, where the second one is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE, reset the first counter so that further 5 pages will get HOT pruned. 5% seems a bit high though. (In Simon's design, SOME_MAX_VALUE is essentially +infinity.)

This would tend to dirty non-sequential heap pages --- it seems best to just clean as many as we are supposed to, then skip the rest, so we can write sequential dirty pages to storage.

Keep in mind there's a disconnect between dirtying a page and writing it to storage. A page could remain dirty for a long time in the buffer cache. This writing of sequential pages would occur at checkpoint time only, which seems the wrong thing to optimize. If some other process needs to evict pages to make room to read some other page in, surely it's going to try one page at a time, not write many sequential dirty pages.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?
On Mon, Apr 20, 2015 at 4:11 PM, Bruce Momjian br...@momjian.us wrote: Slightly improved patch applied.

Is it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Freeze avoidance of very large table.
On Mon, Apr 20, 2015 at 03:58:19PM -0500, Jim Nasby wrote:

I also think there's better ways we could handle *all* our cleanup work. Tuples have a definite lifespan, and there's potentially a lot of efficiency to be gained if we could track tuples through their stages of life... but I don't see any easy ways to do that.

See the TODO list: https://wiki.postgresql.org/wiki/Todo

  o Avoid the requirement of freezing pages that are infrequently modified

Right, but do you have a proposal for how that would actually happen? Perhaps I'm mis-understanding you, but it sounded like you were opposed to this patch because it doesn't do anything to avoid the need to freeze. My point is that no one has any good ideas on how to avoid freezing, and I think it's a safe bet that any ideas people do come up with there will be a lot more invasive than a FrozenMap is.

Didn't you think any of the TODO threads had workable solutions? And don't expect adding an additional file per relation will be zero cost --- added over the lifetime of 200M transactions, I question if this approach would be a win.

--
Bruce Momjian  br...@momjian.us        http://momjian.us
EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:

Bruce Momjian wrote: I think the limit has to be in terms of a percentage of the table size. For example, if we do one SELECT on a table with all non-dirty pages, it would be good to know that 5% of the pages were pruned --- that tells me that another 19 SELECTs will totally prune the table, assuming no future writes.

This seems simple to implement: keep two counters, where the second one is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE, reset the first counter so that further 5 pages will get HOT pruned. 5% seems a bit high though. (In Simon's design, SOME_MAX_VALUE is essentially +infinity.)

This would tend to dirty non-sequential heap pages --- it seems best to just clean as many as we are supposed to, then skip the rest, so we can write sequential dirty pages to storage.

--
Bruce Momjian  br...@momjian.us        http://momjian.us
EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +
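As a sketch of the two-counter scheme being discussed (an illustration in plain C, not from any patch; PRUNE_LIMIT and the SOME_MAX_VALUE figure are placeholder values):

typedef struct PruneBudget
{
	int		pruned;		/* pages HOT-pruned in the current cycle */
	int		skipped;	/* pages we skipped cleanup in */
} PruneBudget;

#define PRUNE_LIMIT		5		/* "further 5 pages will get HOT pruned" */
#define SOME_MAX_VALUE	1000	/* skipped pages before resetting */

/* Decide per page: prune now, or defer cleanup to a later visitor/vacuum? */
static bool
should_prune(PruneBudget *budget)
{
	if (budget->pruned < PRUNE_LIMIT)
	{
		budget->pruned++;
		return true;
	}
	if (++budget->skipped >= SOME_MAX_VALUE)
	{
		/* window elapsed: let the next PRUNE_LIMIT pages be pruned again */
		budget->pruned = 0;
		budget->skipped = 0;
	}
	return false;
}

Setting SOME_MAX_VALUE to effectively infinity reproduces Simon's design (prune a fixed number of pages per scan, then stop), while a finite value spreads the pruning work across the table over time, which is the behavior Bruce is objecting to above.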
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 3:28 PM, Jeff Janes jeff.ja...@gmail.com wrote: But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while the user waits, which is fundamentally VACUUM's duty to do in the background? If there are a handful of very hot pages, then it makes sense not to wait for vacuum to get to them. And that is what a block-count limit does.

I think that's a fundamental mischaracterization of the problem. As soon as you define this as vacuum's problem, then of course it makes no sense to prune in the foreground, ever. But if you define the problem as "get the best overall system performance", then it clearly DOES sometimes make sense to prune in the foreground, as benchmark results upthread demonstrate.

The fact is that on a workload like pgbench - and it doesn't have to be exactly pgbench, just any kind of workload where there are lots of changes to the table - vacuum can at any given time be pruning at most one page of the table. That is because only one vacuum process can be running in a given table at one time, and it can't be doing two things at once. But there can be many processes doing inserts, updates, or deletes on that table, as many as whatever you have max_connections set to. There can easily be dozens even on a well-configured system; on a poorly configured system, there could be hundreds. It seems obvious that if you can have dozens or hundreds of processes creating garbage and at most one process cleaning it up, there will be cases where you get further and further behind.

Now, it might well be that the right solution to that problem is to allow multiple vacuum processes in the same database, or add background workers to help with opportunistic HOT-pruning of pages so it doesn't get done in the foreground. Fine. But as of today, on a heavily-modified table, the ONLY way that we can possibly remove junk from the table as fast as we're creating junk is if the backends touching the table do some of the work.

Now, Simon is making the argument that it should be good enough to have people *modifying* the table help with the cleanup rather than imposing that load on the people who are only *reading* it, and that's not a dumb argument, but there are still cases where that strategy loses - specifically, where the table churn has stopped or paused, but autovacuum hasn't run yet. If you're going to do 1 sequential scan of the table and then go home for the day, HOT-pruning is dumb even in that case. If you're going to do 1000 sequential scans of that table in a row, HOT-pruning may very well be smart. There's no guarantee that the table has met the autovacuum threshold, but HOT-pruning it could well be a win anyway. Or it might be a loss. You can make any policy here look smart or dumb by picking a particular workload, and you don't even have to invent crazy things that will never happen in real life to do it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Tue, Apr 14, 2015 at 7:02 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Apr 14, 2015 at 6:22 PM, Peter Geoghegan p...@heroku.com wrote: Why is that good?

We did discuss this before. I've recapped some of what I believe to be the most salient points below. I think that people were all too quick to dismiss the idea of a wall time interval playing some role here (at least as a defense against correlated references, as a correlated reference period). I suppose that that's because it doesn't fit with an intuition that says that that kind of interval ought to be derived algebraically - magic delay settings are considered suspect.

Yep, Tom gave that reason here: http://www.postgresql.org/message-id/11258.1397673...@sss.pgh.pa.us

But there was also this point from Andres - gettimeofday is not free: http://www.postgresql.org/message-id/20140416075307.gc3...@awork2.anarazel.de

And this point from me - this can degrade to random eviction under high pressure: http://www.postgresql.org/message-id/ca+tgmoayuxr55zueapp6d2xbyicjwacc9myyn5at4tindsj...@mail.gmail.com

You'll notice that my proposal avoids all three of those objections.

All I'm saying is that you shouldn't dismiss the idea without trying it out properly. The LRU-K paper, from the early 1990s, recommends this, and gettimeofday() calls were a lot more expensive back then. I'm sure that there is a way to overcome these issues if it turns out to be worth it (by amortizing gettimeofday() calls by driving it from an auxiliary process like the bgwriter, for example). In fact, I'm almost certain there is, because at least one other major database system uses just such a reference period.

Back when ARC (or was it 2Q?) was committed before being reverted shortly thereafter, there was a similar idea actually implemented, but with XIDs preventing correlated references (which the LRU-K paper also hints at). I think that an actual delay is more robust than that, though. Even though this correlated reference delay is all I implemented with my prototype, it's just one piece of the puzzle.

--
Peter Geoghegan
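To illustrate the amortization idea mentioned above (driving the clock from an auxiliary process like the bgwriter), here is a standalone C sketch, not from any patch: one thread refreshes a coarse timestamp that hot paths can read without making a gettimeofday() call of their own.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static _Atomic long coarse_now_usec;	/* published coarse clock */

static void *
clock_updater(void *arg)
{
	(void) arg;
	for (;;)
	{
		struct timeval tv;

		gettimeofday(&tv, NULL);
		atomic_store(&coarse_now_usec, tv.tv_sec * 1000000L + tv.tv_usec);
		usleep(10 * 1000);		/* refresh every 10 ms */
	}
	return NULL;
}

/* Hot path: one atomic load instead of a syscall per buffer touch. */
static long
cheap_now(void)
{
	return atomic_load(&coarse_now_usec);
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, clock_updater, NULL);
	usleep(20 * 1000);			/* let the updater publish a value */
	printf("coarse time: %ld usec\n", cheap_now());
	return 0;
}

A 10 ms granularity is far finer than any plausible correlated reference period, so the loss of precision costs nothing for this use case, while the syscall overhead drops out of the eviction path entirely.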
Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0
On Sun, Apr 19, 2015 at 9:37 PM, Peter Geoghegan p...@heroku.com wrote: Attached patch, V3.4, implements what I believe you and Heikki have in mind here.

Any plans to look at this, Svenne? You are signed up as a reviewer on the commitfest app. A review of the documentation, and interactions with other features like inheritance, updatable views and postgres_fdw would be useful at this point. Obviously I've gone to a lot of effort to document how things fit together at a high level on the UPSERT wiki page, but these aspects have not been commented on all that much.

Thanks
--
Peter Geoghegan
Re: [HACKERS] Freeze avoidance of very large table.
On 4/20/15 2:09 PM, Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote: On 4/20/15 1:48 PM, Bruce Momjian wrote: On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: Attached WIP patch adds Frozen Map which enables us to avoid whole table vacuuming even when full scan is required: preventing XID wraparound failures. Frozen Map is a bitmap with one bit per heap page, and quite similar to Visibility Map. A set bit means that all tuples on heap page are completely frozen, therefore we don't need to do vacuum freeze that page. A bit is set when vacuum (or autovacuum) figures out that all tuples on corresponding heap page are completely frozen, and a bit is cleared when INSERT and UPDATE (only new heap page) are executed.

So, this patch avoids reading the all-frozen pages if it has not been modified since the last VACUUM FREEZE? Since it is already frozen, the running VACUUM FREEZE will not modify the page or generate WAL, so is it really worth maintaining a new per-page bitmap just to avoid the sequential scan of tables every 200M transactions? I would like to see us reduce the need for VACUUM FREEZE, rather than go in this direction.

How would you propose we do that? I also think there's better ways we could handle *all* our cleanup work. Tuples have a definite lifespan, and there's potentially a lot of efficiency to be gained if we could track tuples through their stages of life... but I don't see any easy ways to do that.

See the TODO list: https://wiki.postgresql.org/wiki/Todo

  o Avoid the requirement of freezing pages that are infrequently modified

Right, but do you have a proposal for how that would actually happen? Perhaps I'm mis-understanding you, but it sounded like you were opposed to this patch because it doesn't do anything to avoid the need to freeze. My point is that no one has any good ideas on how to avoid freezing, and I think it's a safe bet that any ideas people do come up with there will be a lot more invasive than a FrozenMap is.

While not perfect, a FrozenMap is something we can do today, without a lot of effort, and it will provide definite value for any tables that have a good amount of frozen pages. Without performance testing, we don't know what good actually looks like, but we can't test without a patch (which we now have). Assuming performance numbers look good I think it would be folly to reject this patch in the hopes that eventually we'll have some way to avoid the need to freeze.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 09:56:20PM +0100, Simon Riggs wrote:

On 20 April 2015 at 20:28, Jeff Janes jeff.ja...@gmail.com wrote: But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while the user waits, which is fundamentally VACUUM's duty to do in the background?

Agreed. I don't see a % as giving us anything at all. The idea is that we want to turn an O(N) problem for one query into an O(1) task.

The use case I see for this is when there is a mixed workload. There is one select which reads the entire table, and hundreds of thousands of selects/updates/inserts that don't, and of course vacuum comes along every now and then and does its thing. Why should the one massive SELECT have horrible performance just because it was run right before autovacuum would have kicked in instead of right after it finished?

+1

You can +1 all you want, but if you ignore the specific workloads I mentioned, you are not going to get much traction.

--
Bruce Momjian  br...@momjian.us        http://momjian.us
EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 20 April 2015 at 20:28, Jeff Janes jeff.ja...@gmail.com wrote:

But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while the user waits, which is fundamentally VACUUM's duty to do in the background?

Agreed. I don't see a % as giving us anything at all. The idea is that we want to turn an O(N) problem for one query into an O(1) task.

The use case I see for this is when there is a mixed workload. There is one select which reads the entire table, and hundreds of thousands of selects/updates/inserts that don't, and of course vacuum comes along every now and then and does its thing. Why should the one massive SELECT have horrible performance just because it was run right before autovacuum would have kicked in instead of right after it finished?

+1

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?
On Mon, Apr 20, 2015 at 05:04:14PM -0400, Robert Haas wrote: On Mon, Apr 20, 2015 at 4:11 PM, Bruce Momjian br...@momjian.us wrote: Slightly improved patch applied. Is it?

The patch has a slightly modified 'if' statement to check a constant before calling a function, and uses 'else if':

< +	if (!interpretOidsOption(stmt->options, true) && cxt.hasoids)
---
> +	if (cxt.hasoids && !interpretOidsOption(stmt->options, true))
47c57
< +	if (interpretOidsOption(stmt->options, true) && !cxt.hasoids)
---
> +	else if (!cxt.hasoids && interpretOidsOption(stmt->options, true))

I realize the change is subtle.

--
Bruce Momjian  br...@momjian.us        http://momjian.us
EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +
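The point of putting the flag test first relies on C's short-circuit evaluation of &&: when the cheap left-hand operand already decides the outcome, the function on the right is never called. A trivial standalone illustration (not related to the patch itself):

#include <stdbool.h>
#include <stdio.h>

static int calls;

static bool
expensive_check(void)
{
	calls++;			/* stands in for interpretOidsOption() */
	return true;
}

int main(void)
{
	bool hasoids = false;

	/* Flag first: expensive_check() is skipped when hasoids is false. */
	if (hasoids && !expensive_check())
		puts("unreachable for this input");

	printf("expensive_check() ran %d times\n", calls);	/* prints 0 */
	return 0;
}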
Re: [HACKERS] parallel mode and parallel contexts
On Thu, Mar 19, 2015 at 11:13 AM, Robert Haas robertmh...@gmail.com wrote:

Here is yet another version of this patch. In addition to the fixes mentioned above, this version includes some minor rebasing around recent commits, and also better handling of the case where we discover that we cannot launch workers after all. This can happen because (1) dynamic_shared_memory_type=none, (2) the maximum number of DSM segments supported by the system configuration are already in use, or (3) the user creates a parallel context with nworkers=0. In any of those cases, the system will now create a backend-private memory segment instead of a dynamic shared memory segment, and will skip steps that don't need to be done in that case. This is obviously an undesirable scenario. If we choose a parallel sequential scan, we want it to launch workers and really run in parallel. Hopefully, in case (1) or case (3), we will avoid choosing a parallel plan in the first place, but case (2) is pretty hard to avoid completely, as we have no idea what other processes may or may not be doing with dynamic shared memory segments ... and, in any case, degrading to non-parallel execution beats failing outright.

I see that you're using git format-patch to generate this. But the patch is only patch 1/4. Is that intentional? Where are the other pieces? I think that the parallel seqscan patch, and the assessing parallel safety patch are intended to fit together with this patch, but I can't find a place where there is a high level overview explaining just how they fit together (I notice Amit's patch has an #include "access/parallel.h", which is here, but that wasn't trivial to figure out). I haven't been paying too much attention to this patch series.

--
Peter Geoghegan
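The degrade-gracefully behavior Robert describes boils down to a pattern like the following fragment (a sketch, not the patch's actual code; the function name and layout are invented, though dsm_create(), its DSM_CREATE_NULL_IF_MAXSEGMENTS flag, dsm_segment_address(), and palloc0() are existing backend APIs):

/*
 * Try to set up shared memory for parallelism; fall back to
 * backend-private memory when no DSM segment can be obtained.
 */
static void *
create_parallel_state(Size size, int nworkers, bool *is_shared)
{
	if (nworkers > 0 && dynamic_shared_memory_type != DSM_IMPL_NONE)
	{
		dsm_segment *seg = dsm_create(size, DSM_CREATE_NULL_IF_MAXSEGMENTS);

		if (seg != NULL)
		{
			*is_shared = true;
			return dsm_segment_address(seg);
		}
	}
	/* No workers possible: plain backend-local allocation instead. */
	*is_shared = false;
	return palloc0(size);
}

The caller then runs the same executor code against either address; the only difference is that in the fallback case no workers are launched, so the query silently runs serially rather than erroring out.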
Re: [HACKERS] Freeze avoidance of very large table.
On 04/20/2015 02:13 PM, Bruce Momjian wrote: Didn't you think any of the TODO threads had workable solutions? And don't expect adding an additional file per relation will be zero cost --- added over the lifetime of 200M transactions, I question if this approach would be a win.

Well, the only real way to test that is a prototype, no?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Re: [HACKERS] Parallel Seq Scan
On 2015-04-21 AM 03:29, Robert Haas wrote: On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote wrote: On 08-04-2015 PM 12:46, Amit Kapila wrote: Going forward, I think we can improve the same if we decide not to shutdown parallel workers till postmaster shutdown once they are started and then just allocate them during executor-start phase.

I wonder if it makes sense to invent the notion of a global pool of workers with configurable number of workers that are created at postmaster start and destroyed at shutdown and requested for use when a query uses parallelizable nodes.

Short answer: Yes, but not for the first version of this feature.

Longer answer: We can't actually very reasonably have a global pool of workers so long as we retain the restriction that a backend connected to one database cannot subsequently disconnect from it and connect to some other database instead. However, it's certainly a good idea to reuse the same workers for subsequent operations on the same database, especially if they are also by the same user. At the very minimum, it would be good to reuse the same workers for subsequent operations within the same query, instead of destroying the old ones and creating new ones. Notwithstanding the obvious value of all of these ideas, I don't think we should do any of them for the first version of this feature. This is too big a thing to get perfect on the first try.

Agreed. Perhaps, Amit has worked (is working) on reusing the same workers for subsequent operations within the same query.

Thanks,
Amit
Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW
Hi,

At Mon, 20 Apr 2015 16:40:52 +0900, Etsuro Fujita fujita.ets...@lab.ntt.co.jp wrote in 5534ad84.3020...@lab.ntt.co.jp:

On 2015/04/17 13:16, Amit Langote wrote: On 17-04-2015 PM 12:35, Etsuro Fujita wrote: (2) that might cause the problem of associating subplans' update information with subplans' scan information, pointed out by Tom [1].

Having realized how it really works now, my +1 to Foreign Modifying Scan for cases of pushed down update as suggested by Albe Laurenz. I guess it would be signaled by the proposed ForeignScan.CmdType being CMD_UPDATE / CMD_DELETE (/CMD_INSERT).

Thanks for the opinion! I think that that is an idea. However, I'd like to propose to rename Foreign Update (Foreign Delete) of ModifyTable simply to Update (Delete) not only because (1) that solves the duplication problem but also because (2) ISTM that is consistent with the non-inherited updates in both of the non-pushed-down-update case and the pushed-down-update case. Here are examples for (2).

An Update node without Foreign that runs Remote SQL looks to me somewhat unusual. It seems to me that the problem is Foreign Updates for ModifyTable that do nothing eventually. Even though I don't understand this fully, especially what Foreign Update for ModifyTable does when Foreign Update in place of Foreign Scan finished substantial work, I think the ForeignUpdate nodes should be removed during planning if they are really ineffective, or such Foreign Updates should be renamed or provided with some explanation in explain output if they do anything or are unremovable for some reason.

If removed it looks like,

| =# explain verbose update p set b = b + 1;
|                                   QUERY PLAN
| ------------------------------------------------------------------------------
|  Update on public.p  (cost=0.00..360.08 rows=4311 width=14)
|    Update on public.p
|    ->  Seq Scan on public.p  (cost=0.00..0.00 rows=1 width=14)
|          Output: p.a, (p.b + 1), p.ctid
|    ->  Foreign Update on public.ft1  (cost=100.00..180.04 rows=2155 width=14)
|          Remote SQL: UPDATE public.t1 SET b = (b + 1)
|    ->  Foreign Update on public.ft2  (cost=100.00..180.04 rows=2155 width=14)
|          Remote SQL: UPDATE public.t2 SET b = (b + 1)

regards,

* Inherited and non-inherited updates for the non-pushed-down case:

postgres=# explain verbose update parent set c1 = trunc(random() * 9 + 1)::int;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Update on public.parent  (cost=0.00..452.06 rows=5461 width=6)
   Update on public.parent
   Update on public.ft1
     Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
   Update on public.ft2
     Remote SQL: UPDATE public.t2 SET c1 = $2 WHERE ctid = $1
   ->  Seq Scan on public.parent  (cost=0.00..0.01 rows=1 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, parent.ctid
   ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ft1.ctid
         Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
   ->  Foreign Scan on public.ft2  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ft2.ctid
         Remote SQL: SELECT ctid FROM public.t2 FOR UPDATE
(14 rows)

postgres=# explain verbose update ft1 set c1 = trunc(random() * 9 + 1)::int;
                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Update on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
   Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
   ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ctid
         Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
(5 rows)

* Inherited and non-inherited updates for the pushed-down case:

postgres=# explain verbose update parent set c1 = c1 + 1;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Update on public.parent  (cost=0.00..376.59 rows=4819 width=10)
   Update on public.parent
   Update on public.ft1
   Update on public.ft2
   ->  Seq Scan on public.parent  (cost=0.00..0.00 rows=1 width=10)
         Output: (parent.c1 + 1), parent.ctid
   ->  Foreign Update on public.ft1  (cost=100.00..188.29 rows=2409 width=10)
         Remote SQL: UPDATE public.t1 SET c1 = (c1 + 1)
   ->  Foreign Update on public.ft2  (cost=100.00..188.29 rows=2409 width=10)
         Remote SQL: UPDATE public.t2 SET c1 = (c1 + 1)
(10 rows)
Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
Kaigai-san,

I reviewed the Custom/Foreign join API patch again after writing a patch of join push-down support for postgres_fdw.

On 2015/03/26 10:51, Kouhei Kaigai kai...@ak.jp.nec.com wrote:

Or bottom of make_join_rel(). IMO build_join_rel() is responsible for just building (or searching from a list) a RelOptInfo for given relids. After that make_join_rel() calls add_paths_to_joinrel() with appropriate arguments per join type to generate actual Paths that implement the join. make_join_rel() is called only once for particular relid combination, and there are SpecialJoinInfo and restrictlist (conditions specified in JOIN-ON and WHERE), so it seems promising for FDW cases.

I like that idea, but I think we will have a complex hook signature; it won't remain as simple as hook(root, joinrel). Signature of the hook (or the FDW API handler) would be like this:

typedef void (*GetForeignJoinPaths_function) (PlannerInfo *root,
											  RelOptInfo *joinrel,
											  RelOptInfo *outerrel,
											  RelOptInfo *innerrel,
											  JoinType jointype,
											  SpecialJoinInfo *sjinfo,
											  List *restrictlist);

This is very similar to add_paths_to_joinrel(), but lacks semifactors and extra_lateral_rels. semifactors can be obtained with compute_semi_anti_join_factors(), and extra_lateral_rels can be constructed from root->placeholder_list as add_paths_to_joinrel() does. From the viewpoint of postgres_fdw, jointype and restrictlist are necessary to generate a SELECT statement, so it would require most work done in make_join_rel again if the signature was hook(root, joinrel). sjinfo will be necessary for supporting SEMI/ANTI joins, but currently it is not in the scope of postgres_fdw. I guess that other FDWs require at least jointype and restrictlist.

The attached patch adds GetForeignJoinPaths call on make_join_rel() only when 'joinrel' is actually built and both of child relations are managed by same FDW driver, prior to any other built-in join paths. I adjusted the hook definition a little bit, because jointype can be reproduced using SpecialJoinInfo. Right?

Yes, it can be derived from the expression below:

	jointype = sjinfo ? sjinfo->jointype : JOIN_INNER;

Probably, it will solve the original concern towards multiple calls of FDW handler in case when it tries to replace an entire join subtree with a foreign-scan on the result of remote join query. How about your opinion?

AFAIS it’s well-balanced about calling count and available information. New FDW API GetForeignJoinPaths is called only once for a particular combination of join, such as (A join B join C). Before considering all joins in a join level (number of relations contained in the join tree), possible join combinations of lower join level are considered recursively. As Tom pointed out before, say imagine a case like ((huge JOIN large) LEFT JOIN small), expensive path in lower join level might be

Here, please let me summarize the changes in the patch as the result of my review.

* Add set_join_pathlist_hook_type in add_paths_to_joinrel

This hook is intended to provide a chance to add one or more CustomPaths for an actual join combination. If the join is reversible, the hook is called for both A * B and B * A. This is different from the FDW API but it seems fine because FDWs should have chances to process the join at a more abstract level than CSPs. Parameters are same as hash_inner_and_outer, so they would be enough for hash-like or nestloop-like methods. I’m not sure whether mergeclause_list is necessary as a parameter or not. It’s information for merge join which is generated when enable_mergejoin is on and the join is not FULL OUTER. Does some CSP need it for processing a join in its own way? Then it must be in the parameter list because select_mergejoin_clauses is static so it’s not accessible from external modules.

The timing of the hooking, after considering all built-in path types, seems fine because some of CSPs might want to use built-in paths as a template or a source. One concern is in the document of the hook function. “Implementing Custom Paths” says:

A custom scan provider will be also able to add paths by setting the following hook, to replace built-in join paths by custom-scan that performs as if a scan on preliminary joined relations, which is called after the core code has generated what it believes to be the complete and correct set of access paths for the join.

I think “replace” would mislead readers that CSP can remove or edit existing built-in paths listed in RelOptInfo#pathlist or linked from cheapest_foo. IIUC CSP can just add paths for the join relation, and the planner chooses it if it’s the cheapest.

* Add new FDW API
[HACKERS] installcheck missing in src/bin/pg_rewind/Makefile
Hi all,

As mentioned in $subject, the TAP tests of pg_rewind are currently not run by buildfarm machines as the buildfarm client uses installcheck to run the tests in src/bin. A patch is attached to correct the problem.

Regards,
--
Michael

diff --git a/src/bin/pg_rewind/Makefile b/src/bin/pg_rewind/Makefile
index e3400f5..98213c4 100644
--- a/src/bin/pg_rewind/Makefile
+++ b/src/bin/pg_rewind/Makefile
@@ -49,3 +49,6 @@ clean distclean maintainer-clean:
 
 check:
 	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
Re: [HACKERS] Logical Decoding follows timelines
On Fri, Feb 13, 2015 at 4:57 PM, Michael Paquier wrote: Moved patch to CF 2015-02 to not lose track of it, also because it does not seem it received a proper review.

This patch does not apply anymore, so attached is a rebased version. The comments mentioned here have not been addressed: http://www.postgresql.org/message-id/54a7bf61.9080...@vmware.com

Also, what kind of tests have been done? Logical decoding cannot be used while a node is in recovery.

Regards,
--
Michael

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 4a20569..3036ce6 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -216,7 +216,8 @@ static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, Transac
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
 static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
-
+static XLogRecPtr GetLatestRequestPtr(void);
+static TimeLineID ReadSendTimeLine(TimeLineID tli);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -535,8 +536,6 @@ StartReplication(StartReplicationCmd *cmd)
 
 	if (cmd->timeline != 0)
 	{
-		XLogRecPtr	switchpoint;
-
 		sendTimeLine = cmd->timeline;
 		if (sendTimeLine == ThisTimeLineID)
 		{
@@ -545,18 +544,13 @@ StartReplication(StartReplicationCmd *cmd)
 		}
 		else
 		{
-			List	   *timeLineHistory;
-
 			sendTimeLineIsHistoric = true;
 
 			/*
 			 * Check that the timeline the client requested for exists, and
 			 * the requested start location is on that timeline.
 			 */
-			timeLineHistory = readTimeLineHistory(ThisTimeLineID);
-			switchpoint = tliSwitchPoint(cmd->timeline, timeLineHistory,
-										 &sendTimeLineNextTLI);
-			list_free_deep(timeLineHistory);
+			(void) ReadSendTimeLine(cmd->timeline);
 
 			/*
 			 * Found the requested timeline in the history. Check that
@@ -576,8 +570,8 @@ StartReplication(StartReplicationCmd *cmd)
 			 * that's older than the switchpoint, if it's still in the same
 			 * WAL segment.
 			 */
-			if (!XLogRecPtrIsInvalid(switchpoint) &&
-				switchpoint < cmd->startpoint)
+			if (!XLogRecPtrIsInvalid(sendTimeLineValidUpto) &&
+				sendTimeLineValidUpto < cmd->startpoint)
 			{
 				ereport(ERROR,
 						(errmsg("requested starting point %X/%X on timeline %u is not in this server's history",
 								cmd->timeline),
 						 errdetail("This server's history forked from timeline %u at %X/%X.",
 								   cmd->timeline,
-								   (uint32) (switchpoint >> 32),
-								   (uint32) (switchpoint))));
+								   (uint32) (sendTimeLineValidUpto >> 32),
+								   (uint32) (sendTimeLineValidUpto))));
 			}
-			sendTimeLineValidUpto = switchpoint;
 		}
 	}
 	else
@@ -928,6 +921,8 @@ static void
 StartLogicalReplication(StartReplicationCmd *cmd)
 {
 	StringInfoData buf;
+	XLogRecPtr	FlushPtr;
+	List	   *timeLineHistory;
 
 	/* make sure that our requirements are still fulfilled */
 	CheckLogicalDecodingRequirements();
@@ -940,6 +935,8 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	 * Force a disconnect, so that the decoding code doesn't need to care
 	 * about an eventual switch from running in recovery, to running in a
 	 * normal environment. Client code is expected to handle reconnects.
+	 * This covers the race condition where we are promoted half way
+	 * through starting up.
 	 */
 	if (am_cascading_walsender && !RecoveryInProgress())
 	{
@@ -948,6 +945,14 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 		walsender_ready_to_stop = true;
 	}
 
+	if (am_cascading_walsender)
+	{
+		/* this also updates ThisTimeLineID */
+		FlushPtr = GetStandbyFlushRecPtr();
+	}
+	else
+		FlushPtr = GetFlushRecPtr();
+
 	WalSndSetState(WALSNDSTATE_CATCHUP);
 
 	/* Send a CopyBothResponse message, and start streaming */
@@ -974,6 +979,24 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 	logical_startptr = MyReplicationSlot->data.restart_lsn;
 
 	/*
+	 * Find the timeline for the start location, or throw an error.
+	 *
+	 * Logical replication relies upon replication slots. Each slot has a
+	 * single timeline history baked into it, so this should be easy.
+	 */
+	timeLineHistory = readTimeLineHistory(ThisTimeLineID);
+	sendTimeLine = tliOfPointInHistory(logical_startptr, timeLineHistory);
+	if (sendTimeLine != ThisTimeLineID)
+	{
+		sendTimeLineIsHistoric = true;
+		sendTimeLineValidUpto = tliSwitchPoint(sendTimeLine, timeLineHistory,
+											   &sendTimeLineNextTLI);
+	}
+	list_free_deep(timeLineHistory);
+
+	streamingDoneSending = streamingDoneReceiving = false;
+
+	/*
 	 * Report the location after which we'll send out further commits as the
 	 * current sentPtr.
 	 */
@@ -2179,93 +2202,10 @@ XLogSendPhysical(void)
 		return;
 	}
 
-	/* Figure out how far we can safely send the WAL. */
-	if (sendTimeLineIsHistoric)
-	{
-		/*
-		 * Streaming an old timeline that's in this server's history, but is
-		 * not the one we're currently inserting or
Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?
On Thu, Apr 9, 2015 at 05:33:20PM -0400, Bruce Momjian wrote:

On Thu, Apr 9, 2015 at 12:32:23PM -0300, Alvaro Herrera wrote: Bruce Momjian wrote: Should this be listed in the release notes as a backward-incompatibility?

Isn't this a backpatchable bug fix?

Uh, I don't think so. I think users are used to the existing behavior and changing it on them will cause more harm than good. Also, we have had zero field reports about this problem.

The updated attached patch handles cases where the default_with_oids = true.

Slightly improved patch applied.

--
Bruce Momjian  br...@momjian.us        http://momjian.us
EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
On 4/20/15 4:40 AM, Sawada Masahiko wrote:

Thank you for updating the patch. One question regarding the v7 (or later) patch is: what is the difference between OBJECT logging and SESSION logging?

In brief, session logging can log anything that happens in a user session while object logging only targets DML and SELECT on selected relations. The pg_audit.log_relation setting is intended to mimic the prior behavior of pg_audit before object logging was added. In general, I would not expect pg_audit.log = 'read, write' to be used with pg_audit.role. In other words, session logging of DML and SELECT should probably not be turned on at the same time as object logging is in use. Object logging is intended to be a fine-grained version of pg_audit.log = 'read, write' (one or both).

I used v9 patch with pg_audit.log_relation = on, and got quite similar but different log as follows.

=# select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.hoge,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.hoge,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.bar,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.bar,select * from hoge, bar where hoge.col = bar.col;

The behaviour of SESSION log is similar to OBJECT log now, and SESSION log per session (i.e., pg_audit.log_relation = off) is also similar to 'log_statement = all'. (enhancing log_statement might be enough) So I couldn't understand the need for SESSION logging.

Session logging is quite different from 'log_statement = all'. See: http://www.postgresql.org/message-id/552323b2.8060...@pgmasters.net and/or the "Why pg_audit?" section of the pg_audit documentation. I agree that it may make sense in the future to merge session logging into log_statements, but for now it does provide important additional functionality for creating audit logs.

Regards,
--
- David Steele
da...@pgmasters.net
Re: [HACKERS] alternative compression algorithms?
On 04/20/15 05:07, Andres Freund wrote:

Hi,

On 2015-04-19 22:51:53 +0200, Tomas Vondra wrote: The reason why I'm asking about this is the multivariate statistics patch - while optimizing the planning overhead, I realized that considerable amount of time is spent decompressing the statistics (serialized as bytea), and using an algorithm with better decompression performance (lz4 comes to mind) would help a lot. The statistics may be a few tens/hundreds kB, and in the planner every millisecond counts.

I've a hard time believing that a different compression algorithm is going to be the solution for that. Decompressing, quite possibly repeatedly, several hundreds of kb during planning isn't going to be fun. Even with a good, fast, compression algorithm.

Sure, it's not an ultimate solution, but it might help a bit. I do have other ideas how to optimize this, but in the planner every millisecond counts. Looking at 'perf top' and seeing pglz_decompress() in top 3.

regards
Tomas
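For anyone wanting to test the lz4 idea, a standalone round-trip sketch against the stock liblz4 API looks like this (an illustration, not a proposed patch; compile with -llz4, and in a real experiment the source buffer would be the serialized statistics bytea rather than a string):

#include <lz4.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	const char src[] = "multivariate statistics payload ... imagine ~100 kB here";
	int src_len = (int) sizeof(src);
	int bound = LZ4_compressBound(src_len);
	char *comp = malloc(bound);
	char *back = malloc(src_len);

	/* compress, then decompress, as the planner would on each use */
	int comp_len = LZ4_compress_default(src, comp, src_len, bound);
	int n = LZ4_decompress_safe(comp, back, comp_len, src_len);

	printf("%d -> %d bytes, roundtrip %s\n", src_len, comp_len,
		   (n == src_len && memcmp(src, back, src_len) == 0) ? "ok" : "FAILED");
	free(comp);
	free(back);
	return 0;
}

Timing LZ4_decompress_safe() against pglz_decompress() on a representative statistics blob would directly answer whether the decompression shows up in 'perf top' because of the algorithm or because of how often it is called.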
Re: [HACKERS] Freeze avoidance of very large table.
On Tue, Apr 7, 2015 at 11:22 AM, Sawada Masahiko sawada.m...@gmail.com wrote:

On Tue, Apr 7, 2015 at 7:53 AM, Jim Nasby jim.na...@bluetreble.com wrote: On 4/6/15 5:18 PM, Greg Stark wrote: Only I would suggest thinking of it in terms of two orthogonal boolean flags rather than three states. It's easier to reason about whether a table has a specific property than trying to control a state machine in a predefined pathway. So I would say the two flags are:

  READONLY: guarantees nothing can be dirtied
  ALLFROZEN: guarantees no unfrozen tuples are present

In practice you can't have the latter without the former since vacuum can't know everything is frozen unless it knows nobody is inserting. But perhaps there will be cases in the future where that's not true.

I'm not so sure about that. There's a logical state progression here (see below). ISTM it's easier to just enforce that in one place instead of a bunch of places having to check multiple conditions. But, I'm not wed to a single field.

Incidentally there are a number of other optimisations I've had in mind that are only possible on frozen read-only tables:

1) Compression: compress the pages and pack them one after the other. Build a new fork with offsets for each page.

2) Automatic partition elimination where the statistics track the minimum and maximum value per partition (and number of tuples) and treat them as implicit constraints. In particular it would magically make read only empty parent partitions be excluded regardless of the where clause.

AFAICT neither of those actually requires ALLFROZEN, no? You'll need to uncompact and re-compact for #1 when you actually freeze (which maybe isn't worth it), but freezing isn't absolutely required. #2 would only require that everything in the relation is visible; not frozen. I think there's value here to having an ALLVISIBLE state as well as ALLFROZEN.

Based on many suggestions, I'm going to deal with FM at first as one patch. It would be a simple mechanism similar to the VM, in the first patch.

- Each bit of FM represents a single page
- The bit is set only by vacuum
- The bit is un-set by inserting and updating and deleting

Second, I'll deal with a simple read-only table and 2 states, Read/Write (default) and ReadOnly, as one patch. ISTM that having the Frozen state needs more discussion. A read-only table just allows us to disable any updating of the table, and it's controlled by a read-only flag pg_class has. And the DDL command which changes these states is like ALTER TABLE SET READ ONLY, or READ WRITE. Also as Alvaro suggested, the read-only table affects not only freezing the table but also performance optimization. I'll consider including them when I deal with the read-only table.

Attached WIP patch adds Frozen Map which enables us to avoid whole table vacuuming even when full scan is required: preventing XID wraparound failures.

Frozen Map is a bitmap with one bit per heap page, and quite similar to Visibility Map. A set bit means that all tuples on heap page are completely frozen, therefore we don't need to do vacuum freeze that page. A bit is set when vacuum (or autovacuum) figures out that all tuples on corresponding heap page are completely frozen, and a bit is cleared when INSERT and UPDATE (only new heap page) are executed.

Current patch adds new source file src/backend/access/heap/frozenmap.c which is quite similar to visibilitymap.c. They have similar code but are separated for now. I'll do refactoring of this source code, like adding a bitmap.c, if needed.

Also, when skipping vacuum by visibility map, we can skip at least SKIP_PAGES_THRESHOLD consecutive pages, but such mechanism is not in frozen map.

Please give me feedbacks.

Regards,

---
Sawada Masahiko

diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index b83d496..53f07fd 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/heap
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
+OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o frozenmap.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/frozenmap.c b/src/backend/access/heap/frozenmap.c
new file mode 100644
index 0000000..6e64cb8
--- /dev/null
+++ b/src/backend/access/heap/frozenmap.c
@@ -0,0 +1,567 @@
+/*-------------------------------------------------------------------------
+ *
+ * frozenmap.c
+ *	  bitmap for tracking frozen heap tuples
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/heap/frozenmap.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/frozenmap.h"
+#include "access/heapam_xlog.h"
+#include "access/xlog.h"
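To make the map mechanics concrete, here is a standalone sketch of the addressing math such a one-bit-per-page map uses (an illustration only; the real frozenmap.c in the patch stores the bits in buffered 8 kB pages and handles pinning, locking, and WAL, all of which this ignores):

#include <stdbool.h>
#include <stdint.h>

#define MAP_BYTE(blk)	((blk) / 8)		/* which map byte holds this page's bit */
#define MAP_BIT(blk)	((blk) % 8)		/* which bit within that byte */

static bool
fm_test(const uint8_t *map, uint32_t heapblk)
{
	return (map[MAP_BYTE(heapblk)] >> MAP_BIT(heapblk)) & 1;
}

/* set by vacuum once every tuple on the page is known frozen */
static void
fm_set(uint8_t *map, uint32_t heapblk)
{
	map[MAP_BYTE(heapblk)] |= (uint8_t) (1 << MAP_BIT(heapblk));
}

/* cleared whenever an INSERT/UPDATE puts a new, unfrozen tuple on the page */
static void
fm_clear(uint8_t *map, uint32_t heapblk)
{
	map[MAP_BYTE(heapblk)] &= (uint8_t) ~(1 << MAP_BIT(heapblk));
}

The correctness-critical invariant is the one Jim calls out upthread: fm_set() may only run when the all-frozen claim is actually true, and every writer that can invalidate the claim must call fm_clear() first, since vacuum will skip any page whose bit is set.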
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-20 11:26:29 +0300, Heikki Linnakangas wrote:

I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of replication identifier -> replication origin to the patch. Or even transaction origin.

Sounds good to me.

Greetings,

Andres Freund
Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)
On Thu, Apr 16, 2015 at 2:34 AM, David Steele da...@pgmasters.net wrote:

On 4/15/15 11:30 AM, Sawada Masahiko wrote: On Wed, Apr 15, 2015 at 10:52 AM, Sawada Masahiko sawada.m...@gmail.com wrote: I tested v8 patch with CURSOR case I mentioned before, and got segmentation fault again. Here are log messages in my environment,

=# select test();
LOG: server process (PID 29730) was terminated by signal 11: Segmentation fault
DETAIL: Failed process was running: select test();
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
FATAL: the database system is in recovery mode

I investigated this problem and inform you my result here. When we execute the test() function I mentioned, three stack items in total are stored into auditEventStack. Two of them are freed by stack_pop() -> stack_free() (i.e., stack_free() is called by stack_pop()). One of them is freed by PortalDrop() -> ... -> MemoryContextDeleteChildren() -> ... -> stack_free(). And it is freed at the same time as deleting the pg_audit memory context, and the stack will be completely empty. But after freeing all items, the finish_xact_command() function could call PortalDrop(), so the ExecutorEnd() function could be called again. pg_audit has ExecutorEnd_hook, so postgres tries to free that item... SEGV.

In my environment, the following change fixes it.

--- pg_audit.c.org	2015-04-15 14:21:07.541866525 +0900
+++ pg_audit.c	2015-04-15 11:36:53.758877339 +0900
@@ -1291,7 +1291,7 @@
 	standard_ExecutorEnd(queryDesc);
 
 	/* Pop the audit event off the stack */
-	if (!internalStatement)
+	if (!internalStatement && auditEventStack != NULL)
 	{
 		stack_pop(auditEventStack->stackId);
 	}

It might be a good idea to add an Assert() before calling stack_pop(). I'm not sure this change is exactly correct, but I hope this information helps you.

I appreciate you taking the time to look - this is the same conclusion I came to. I also hardened another area that I thought might be vulnerable. I've seen this problem with explicit cursors before (and fixed it in another place a while ago). The memory context that is current in ExecutorStart is freed before ExecutorEnd is called only in this case as far as I can tell. I'm not sure this is very consistent behavior.

I have attached patch v9 which fixes this issue as you suggested, but I'm not completely satisfied with it. It seems like there could be an unintentional pop from the stack in a case of deeper nesting. This might not be possible but it's hard to disprove. I'll think about it some more, but meanwhile this patch addresses the present issue.

Thank you for updating the patch. One question regarding the v7 (or later) patch is: what is the difference between OBJECT logging and SESSION logging? I used v9 patch with pg_audit.log_relation = on, and got quite similar but different log as follows.

=# select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.hoge,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.hoge,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: OBJECT,8,1,READ,SELECT,TABLE,public.bar,select * from hoge, bar where hoge.col = bar.col;
NOTICE: AUDIT: SESSION,8,1,READ,SELECT,TABLE,public.bar,select * from hoge, bar where hoge.col = bar.col;

The behaviour of SESSION log is similar to OBJECT log now, and SESSION log per session (i.e., pg_audit.log_relation = off) is also similar to 'log_statement = all'. (enhancing log_statement might be enough) So I couldn't understand the need for SESSION logging.

Regards,

---
Sawada Masahiko
Re: [HACKERS] Supporting TAP tests with MSVC and Windows
On Sun, Apr 19, 2015 at 10:01 PM, Michael Paquier michael.paqu...@gmail.com wrote: Note as well that this patch uses the following patches fixing independent issues: ...

Attached is v4. I added a switch in config.pl to be consistent with ./configure and --enable-tap-tests.
--
Michael

From f532bbd2522b3a1784038d9863325d055fc445bf Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Mon, 20 Apr 2015 04:57:37 -0700
Subject: [PATCH 1/2] Add support for TAP tests on Windows

Nodes initialized by the TAP tests use SSPI to securely perform the tests, and test scripts are patched in a couple of places to support Windows grammar. In the case of MSVC, tests can be run with this command: vcregress tapcheck
---
 .gitignore                                     |   3 +
 doc/src/sgml/install-windows.sgml              |   1 +
 src/Makefile.global.in                         |   2 +-
 src/bin/initdb/t/001_initdb.pl                 |  18 ++--
 src/bin/pg_basebackup/t/010_pg_basebackup.pl   |  79 +---
 src/bin/pg_controldata/t/001_pg_controldata.pl |   5 +-
 src/bin/pg_ctl/t/001_start_stop.pl             |  14 ++-
 src/bin/pg_ctl/t/002_status.pl                 |  12 ++-
 src/bin/pg_rewind/RewindTest.pm                | 119 -
 src/bin/scripts/t/020_createdb.pl              |   3 +
 src/test/perl/TestLib.pm                       |  25 --
 src/tools/msvc/Solution.pm                     |   1 +
 src/tools/msvc/config_default.pl               |   1 +
 src/tools/msvc/vcregress.pl                    |  60 -
 14 files changed, 271 insertions(+), 72 deletions(-)

diff --git a/.gitignore b/.gitignore
index 8d3af50..3cd37fe 100644
--- a/.gitignore
+++ b/.gitignore
@@ -36,3 +36,6 @@ lib*.pc
 /pgsql.sln.cache
 /Debug/
 /Release/
+
+# Generated by tests
+/tmp_check/
diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml
index d154b44..2047790 100644
--- a/doc/src/sgml/install-windows.sgml
+++ b/doc/src/sgml/install-windows.sgml
@@ -439,6 +439,7 @@ $ENV{CONFIG}="Debug";
 <userinput>vcregress modulescheck</userinput>
 <userinput>vcregress ecpgcheck</userinput>
 <userinput>vcregress isolationcheck</userinput>
+<userinput>vcregress tapcheck</userinput>
 <userinput>vcregress upgradecheck</userinput>
 </screen>
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 4b06fc2..b27ed78 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -324,7 +324,7 @@ endef
 define prove_check
 	$(MKDIR_P) tmp_check/log
 	$(MAKE) -C $(top_builddir) DESTDIR='$(CURDIR)'/tmp_check/install install >'$(CURDIR)'/tmp_check/log/install.log 2>&1
-	cd $(srcdir) && TESTDIR='$(CURDIR)' PATH="$(CURDIR)/tmp_check/install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(CURDIR)/tmp_check/install$(libdir)) top_builddir='$(CURDIR)/$(top_builddir)' PGPORT='6$(DEF_PGPORT)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
+	cd $(srcdir) && TESTDIR='$(CURDIR)' TESTREGRESS='$(top_builddir)/src/test/regress/pg_regress' PATH="$(CURDIR)/tmp_check/install$(bindir):$$PATH" $(call add_to_path,$(ld_library_path_var),$(CURDIR)/tmp_check/install$(libdir)) top_builddir='$(CURDIR)/$(top_builddir)' PGPORT='6$(DEF_PGPORT)' $(PROVE) $(PG_PROVE_FLAGS) $(PROVE_FLAGS) t/*.pl
 endef
 
 else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index d12be84..0865107 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -1,6 +1,8 @@
 use strict;
 use warnings;
+use Config;
 use TestLib;
+use File::Path qw(rmtree);
 use Test::More tests => 19;
 
 my $tempdir = TestLib::tempdir;
@@ -18,27 +20,31 @@ command_fails([ 'initdb', '-S', "$tempdir/data3" ],
 mkdir "$tempdir/data4" or BAIL_OUT($!);
 command_ok([ 'initdb', "$tempdir/data4" ], 'existing empty data directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
-
+rmtree($tempdir);
+mkdir $tempdir;
 command_ok([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'separate xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 command_fails(
 	[ 'initdb', "$tempdir/data", '-X', 'pgxlog' ],
 	'relative xlog directory not allowed');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 mkdir "$tempdir/pgxlog";
 command_ok([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'existing empty xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 mkdir "$tempdir/pgxlog";
 mkdir "$tempdir/pgxlog/lost+found";
 command_fails([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'existing nonempty xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 command_ok([ 'initdb', '-T', 'german', "$tempdir/data" ],
 	'select default dictionary');
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 7e9a776..4cb1b5a 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -1,8 +1,9 @@
 use
Re: [HACKERS] Replication identifiers, take 4
On 04/17/2015 11:45 PM, Petr Jelinek wrote: The argument to move to 4 bytes is a poor one. If it was reasonable in terms of code or cosmetic value then all values used in the backend would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do that. The change does nothing useful, since I doubt anyone will ever need 32768 nodes in their cluster. And if they did there would be other much bigger problems than the replication identifier being 16-bit (the limit is actually 65534, as it's unsigned, btw). Can you name some of the bigger problems you'd have? Obviously, if you have 10 high-volume OLTP nodes connected to a single server, feeding transactions as a continuous stream, you're going to choke the system. But you might have 10 tiny satellite databases that sync up with the master every few hours, and each of them do only a few updates per day. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 2015-04-20 11:09:20 +0300, Heikki Linnakangas wrote: Can you name some of the bigger problems you'd have? Several parts of the system are O(#max_replication_slots). Having 100k outgoing logical replication slots is going to be expensive. Nothing unsolvable, but the 65k 16 bit limit surely isn't going to be the biggest problem. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimization for updating foreign tables in Postgres FDW
On 2015/04/17 13:16, Amit Langote wrote: On 17-04-2015 PM 12:35, Etsuro Fujita wrote: (2) that might cause the problem of associating subplans' update information with subplans' scan information, pointed out by Tom [1]. Having realized how it really works now, my +1 to Foreign Modifying Scan for cases of pushed-down update as suggested by Albe Laurenz. I guess it would be signaled by the proposed ForeignScan.CmdType being CMD_UPDATE / CMD_DELETE (/ CMD_INSERT). Thanks for the opinion! I think that is one idea. However, I'd like to propose renaming Foreign Update (Foreign Delete) of ModifyTable simply to Update (Delete), not only because (1) that solves the duplication problem, but also because (2) ISTM that is consistent with the non-inherited updates in both the non-pushed-down-update case and the pushed-down-update case. Here are examples for (2).

* Inherited and non-inherited updates for the non-pushed-down case:

postgres=# explain verbose update parent set c1 = trunc(random() * 9 + 1)::int;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Update on public.parent  (cost=0.00..452.06 rows=5461 width=6)
   Update on public.parent
   Update on public.ft1
     Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
   Update on public.ft2
     Remote SQL: UPDATE public.t2 SET c1 = $2 WHERE ctid = $1
   ->  Seq Scan on public.parent  (cost=0.00..0.01 rows=1 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, parent.ctid
   ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ft1.ctid
         Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
   ->  Foreign Scan on public.ft2  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ft2.ctid
         Remote SQL: SELECT ctid FROM public.t2 FOR UPDATE
(14 rows)

postgres=# explain verbose update ft1 set c1 = trunc(random() * 9 + 1)::int;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Update on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
   Remote SQL: UPDATE public.t1 SET c1 = $2 WHERE ctid = $1
   ->  Foreign Scan on public.ft1  (cost=100.00..226.03 rows=2730 width=6)
         Output: (trunc(((random() * '9'::double precision) + '1'::double precision)))::integer, ctid
         Remote SQL: SELECT ctid FROM public.t1 FOR UPDATE
(5 rows)

* Inherited and non-inherited updates for the pushed-down case:

postgres=# explain verbose update parent set c1 = c1 + 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Update on public.parent  (cost=0.00..376.59 rows=4819 width=10)
   Update on public.parent
   Update on public.ft1
   Update on public.ft2
   ->  Seq Scan on public.parent  (cost=0.00..0.00 rows=1 width=10)
         Output: (parent.c1 + 1), parent.ctid
   ->  Foreign Update on public.ft1  (cost=100.00..188.29 rows=2409 width=10)
         Remote SQL: UPDATE public.t1 SET c1 = (c1 + 1)
   ->  Foreign Update on public.ft2  (cost=100.00..188.29 rows=2409 width=10)
         Remote SQL: UPDATE public.t2 SET c1 = (c1 + 1)
(10 rows)

postgres=# explain verbose update ft1 set c1 = c1 + 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Update on public.ft1  (cost=100.00..188.29 rows=2409 width=10)
   ->  Foreign Update on public.ft1  (cost=100.00..188.29 rows=2409 width=10)
         Remote SQL: UPDATE public.t1 SET c1 = (c1 + 1)
(3 rows)

Comments are welcome. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Sep 29, 2014 at 2:13 AM, Andres Freund and...@anarazel.de wrote: On 2014-09-28 19:51:36 +0100, Simon Riggs wrote: On 27 September 2014 09:29, Andres Freund and...@anarazel.de wrote: On 2014-09-27 10:23:33 +0300, Heikki Linnakangas wrote: This patch has gotten a fair amount of review, and has been rewritten once during the commitfest. I think it's pretty close to being committable, the only remaining question seems to be what to do with system catalogs. I'm marking this as Returned with feedback, I take it that Simon can proceed from here, outside the commitfest. FWIW, I don't think it is, even with that. As is it seems very likely that it's going to regress a fair share of workloads. At the very least it needs a fair amount of benchmarking beforehand. There is some doubt there. We've not seen a workload that does actually exhibit a negative behaviour. Neither is there much data about the magnitude of positive effect the patch has... I'm not saying one doesn't exist, but it does matter how common/likely it is. If anyone can present a performance test case that demonstrates a regression, I think it will make it easier to discuss how wide that case is and what we should do about it. Discussing whether to do various kinds of limited pruning is moot until that is clear. I doubt it'll be hard to construct a case where it'll show. My first try of using a pgbench scale 100, -M prepared, -cj8 with a custom file with 1 write and 5 read transactions yielded the following on my laptop:

Baseline:
 relname                | pgbench_tellers
 pg_total_relation_size | 458752
 relname                | pgbench_accounts
 pg_total_relation_size | 1590337536
 relname                | pgbench_branches
 pg_total_relation_size | 286720
 relname                | pgbench_history
 pg_total_relation_size | 49979392

Patched:
 relname                | pgbench_tellers
 pg_total_relation_size | 516096
 relname                | pgbench_accounts
 pg_total_relation_size | 1590337536
 relname                | pgbench_branches
 pg_total_relation_size | 360448
 relname                | pgbench_history
 pg_total_relation_size | 49528832

So, there's a noticeable increase in size, mostly on the smaller tables, so probably HOT cleanup was sometimes skipped during UPDATEs due to locks.

Baseline was: tps = 9655.486532 (excluding connections establishing)
Patched was: tps = 9466.158701 (including connections establishing)

Was this reproducible? I've run your custom sql file with 4 clients (that is how many CPUs I have) on a machine with a BBU. I had wal_level = hot_standby, but the archive_command just returned true without archiving anything. And using the latest patch. The size of the pgbench_tellers and pgbench_branches relations were surprisingly variable in both patched and unpatched, but there was no reliable difference between them, just within them. On the TPS front, there was a hint that the patched one was slightly slower, but the within-sample variation was also high, and the p-val for difference was only 0.214 on n of 66. Test case attached. That's not an unrealistic testcase. I'm pretty sure this could be made quite a bit more pronounced by not using a uniform distribution in the pgbench runs. And selecting a test that's more vulnerable to the change (e.g. using a wider distribution for the read-only statements than the modifying ones) would make the CPU overhead of the additional heap_hot_search_buffer() calls heavier. Sorry, I don't understand this description. Why would queries selecting data that is not changing have any extra overhead?
Is the idea that the hot part of the table for updates would move around over time, but the hot part for selects would be even throughout? I'm not sure how to put that to the test. Cheers, Jeff
Re: [HACKERS] Sequence Access Method WIP
On 03/15/2015 09:07 PM, Petr Jelinek wrote: Slightly updated version of the patch. Mainly rebased against current master (there were several conflicts) and fixed some typos; no real functional change. I also attached an initial version of the API sgml doc. Thanks! With the patch, the pg_class.relam column references the pg_seqam table for sequences, but pg_am for indexes. I believe it's the first instance where we reuse a foreign key column like that. It's not a real foreign key, of course - that wouldn't work with a real foreign key at all - but it's a bit strange. That makes me a bit uncomfortable. How do others feel about that? - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch for GROUPING SETS phase 1
Oh, and I built it on top of f92fc4c95ddcc25978354a8248d3df22269201bc. On 20-04-2015 10:36, Svenne Krap wrote: The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: tested, passed

Hi, I have (finally) found time to review this. The syntax is as per spec as far as I can see, and the queries I have tested have all produced the correct output. The documentation looks good and is clear. I think it is spec compliant, but I am not used enough to the spec to be sure. Also I have not understood the function of the set quantifier (DISTINCT, ALL) part in the group by clause (and hence not tested it). Hence I haven't marked the spec compliant part. The installcheck-world fails, but in src/pl/tcl/results/pltcl_queries.out (a sorting problem when looking at the diff), which should be unrelated to GSP. I don't know enough of the check to know if it has already run the GSP tests. I have also been running a few tests on some real data. This is run on my laptop with 32 GB of memory and a fast SSD. The first dataset is a join between a data table of 472 MB (4.3 Mrows) and a tiny multi-column lookup table. I am returning a count(*). Here the data is hierarchical, so CUBE does not make sense. GROUPING SETS and ROLLUP both work fine, and if work_mem is large enough it slightly beats the handwritten union-all equivalent (runtimes of 7.6 vs. 7.7 seconds). If work_mem is the default 4MB, the union-all equivalent (UAE) beats the GS query almost 2:1 due to disk spill (14.3 (GS) vs. 8.2 (UAE) seconds). The other query is on the same data table as before, but with three columns (two calculated and one natural) for a cube. I am returning a count(*). The first column is extract year from a date column; the second column is a value divided by something and truncated (i.e. making buckets); the third column is a literal integer column. Here the GS version is slightly slower than the UAE version (17.5 vs. 14.2). Nothing obvious about why in the explain (analyze, buffers, costs, timing). I have the explains, but as the dataset is semi-private and I don't have any easy way to edit out names in it, I will send it on request (non-disclosure from the recipient is of course a must) and not post it on the list. I think the feature is ready to be committed, but am unsure whether I am qualified to gauge that :) /Svenne The new status of this patch is: Ready for Committer -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 04/17/2015 11:36 PM, Simon Riggs wrote: On 17 April 2015 at 19:18, Heikki Linnakangas hlinn...@iki.fi wrote: To be honest, I'm not entirely sure what we're arguing over. When arguing over something you consider small, it is customary to allow the author precedence. We can't do things our own way all the time. Sure, I'm not going to throw a tantrum if Andres commits this as it is. I said that IMO the difference in WAL size is so small that we should just use 4-byte OIDs for the replication identifiers, instead of trying to make do with 2 bytes. Not because I find it too likely that you'll run out of IDs (although it could happen), but more to make replication IDs more like all other system objects we have. Andres did some pgbench benchmarking to show that the difference in WAL size is about 10%. The WAL records generated by pgbench happen to have just the right sizes so that the 2-3 extra bytes bump them over to the next alignment boundary. That's why there is such a big difference - on average it'll be less. I think that's acceptable; Andres seems to think otherwise. But if the WAL size really is so precious, we could remove the two padding bytes from XLogRecord, instead of dedicating them for the replication ids. That would be an even better use for them. The argument to move to 4 bytes is a poor one. If it was reasonable in terms of code or cosmetic value then all values used in the backend would be 4 bytes. We wouldn't have any 2 byte values anywhere. But we don't do that. That's a straw man argument. I'm not saying we should never use 2 byte values anywhere. OID is usually used as the primary key in system tables. There are exceptions, but that is nevertheless the norm. I'm saying that saving in WAL size is not worth making an exception here, and we should go with the simplest option of using OIDs. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Replication identifiers, take 4
On 04/17/2015 11:54 AM, Andres Freund wrote: I've attached a rebased patch that adds a decision about origin logging to the relevant XLogInsert() call sites for external 2-byte identifiers and removes the pad-reusing version, in the interest of moving forward. Putting aside the 2 vs. 4 byte identifier issue, let's discuss naming: I just realized that it talks about replication identifier as the new fundamental concept. The system table is called pg_replication_identifier. But that's like talking about index identifiers, instead of just indexes, and calling the system table pg_index_oid. The important concept this patch actually adds is the *origin* of each transaction. That term is already used in some parts of the patch. I think we should roughly do a search-replace of "replication identifier" -> "replication origin" on the patch. Or even "transaction origin". - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP Patch for GROUPING SETS phase 1
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: tested, passed

Hi, I have (finally) found time to review this. The syntax is as per spec as far as I can see, and the queries I have tested have all produced the correct output. The documentation looks good and is clear. I think it is spec compliant, but I am not used enough to the spec to be sure. Also I have not understood the function of the set quantifier (DISTINCT, ALL) part in the group by clause (and hence not tested it). Hence I haven't marked the spec compliant part. The installcheck-world fails, but in src/pl/tcl/results/pltcl_queries.out (a sorting problem when looking at the diff), which should be unrelated to GSP. I don't know enough of the check to know if it has already run the GSP tests. I have also been running a few tests on some real data. This is run on my laptop with 32 GB of memory and a fast SSD. The first dataset is a join between a data table of 472 MB (4.3 Mrows) and a tiny multi-column lookup table. I am returning a count(*). Here the data is hierarchical, so CUBE does not make sense. GROUPING SETS and ROLLUP both work fine, and if work_mem is large enough it slightly beats the handwritten union-all equivalent (runtimes of 7.6 vs. 7.7 seconds). If work_mem is the default 4MB, the union-all equivalent (UAE) beats the GS query almost 2:1 due to disk spill (14.3 (GS) vs. 8.2 (UAE) seconds). The other query is on the same data table as before, but with three columns (two calculated and one natural) for a cube. I am returning a count(*). The first column is extract year from a date column; the second column is a value divided by something and truncated (i.e. making buckets); the third column is a literal integer column. Here the GS version is slightly slower than the UAE version (17.5 vs. 14.2). Nothing obvious about why in the explain (analyze, buffers, costs, timing). I have the explains, but as the dataset is semi-private and I don't have any easy way to edit out names in it, I will send it on request (non-disclosure from the recipient is of course a must) and not post it on the list. I think the feature is ready to be committed, but am unsure whether I am qualified to gauge that :) /Svenne The new status of this patch is: Ready for Committer -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 2015-04-20 01:04:18 -0700, Jeff Janes wrote: Was this reproducible? Yes, at least with an old version of the patch. I don't think you could see a difference using exactly that with the newer versions which have the 5 page limit. After all it'll pretty much never reach it. That's not a unrealistic testcase. I'm pretty sure this could be made quite a bit more pronounced by not using a uniform distribution in the pgbench runs. And selecting a test that's more vulnerable to the change (e.g. using a wider distribution for the read only statements than the modifying ones) would make the the CPU overhead of the additional heap_hot_search_buffer() overhead heavier. Sorry I don't understand this description. Why would queries selecting data that is not changing have any extra overhead? The idea, I think, was that by having a uniform (or just wider) distribution of the reads they'd be more likely to land on values that have been updated at some point, but not been pruned since (because at that point the patch IIRC didn't prune during reads at all). I.e. ones wer Is the idea that the hot part of the table for updates would move around over time, but the hot part for selects would be even throughout? Pretty much. I'm not sure how to put that to the test. That pretty much was what I'd tried to model, yea. I guess it'd be possible to model this by inserting NOW()/updating values NOW() - 5 and selecting values up to NOW() - 60. That'd roughly model some realistic insert/update/select patterns I've seen. To possibly see any difference with the new patch this would have to be done in a way that regularly a couple of pages would be touched, with not that many selected tuples on each. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] inherit support for foreign tables
On 2015/04/16 16:05, Etsuro Fujita wrote: On 2015/03/23 2:57, Tom Lane wrote: Etsuro Fujita fujita.ets...@lab.ntt.co.jp writes: [ fdw-inh-8.patch ] I've committed this with some substantial rearrangements, notably: * As I mentioned earlier, I got rid of a few unnecessary restrictions on foreign tables so as to avoid introducing warts into inheritance behavior. In particular, we now allow NOT VALID CHECK constraints (and hence ALTER ... VALIDATE CONSTRAINT), ALTER SET STORAGE, and ALTER SET WITH/WITHOUT OIDS. These are probably no-ops anyway for foreign tables, though conceivably an FDW might choose to implement some behavior for STORAGE or OIDs. I agree with you on this point. However, ISTM there is a bug in handling OIDs on foreign tables; while we now allow for ALTER SET WITH/WITHOUT OIDS, we still don't allow the default_with_oids parameter for foreign tables. I think that since CREATE FOREIGN TABLE should be consistent with ALTER FOREIGN TABLE, we should also allow the parameter for foreign tables. Attached is a patch for that. I also updated docs. Attached is an updated version of the patch. Best regards, Etsuro Fujita

*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 6745,6752 **** dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </term>
       <listitem>
        <para>
!        This controls whether <command>CREATE TABLE</command> and
!        <command>CREATE TABLE AS</command> include an OID column in
         newly-created tables, if neither <literal>WITH OIDS</literal>
         nor <literal>WITHOUT OIDS</literal> is specified. It also
         determines whether OIDs will be included in tables created by
--- 6745,6753 ----
       </term>
       <listitem>
        <para>
!        This controls whether <command>CREATE TABLE</command>,
!        <command>CREATE TABLE AS</command> and
!        <command>CREATE FOREIGN TABLE</command> include an OID column in
         newly-created tables, if neither <literal>WITH OIDS</literal>
         nor <literal>WITHOUT OIDS</literal> is specified. It also
         determines whether OIDs will be included in tables created by

*** a/doc/src/sgml/ref/create_foreign_table.sgml
--- b/doc/src/sgml/ref/create_foreign_table.sgml
***************
*** 293,298 **** CHECK ( <replaceable class="PARAMETER">expression</replaceable> )
--- 293,304 ----
     responsibility to ensure that the constraint definition matches
     reality.
    </para>
+ 
+   <para>
+    To add OIDs to the table created by <command>CREATE FOREIGN TABLE</command>,
+    enable the <xref linkend="guc-default-with-oids"> configuration
+    variable.
+   </para>
  </refsect1>
 
  <refsect1 id="SQL-CREATEFOREIGNTABLE-examples">

*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 580,586 **** DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 
 	descriptor = BuildDescForRelation(schema);
 	localHasOids = interpretOidsOption(stmt->options,
! 									   (relkind == RELKIND_RELATION));
 	descriptor->tdhasoid = (localHasOids || parentOidCount > 0);
 
 	/*
--- 580,587 ----
 
 	descriptor = BuildDescForRelation(schema);
 	localHasOids = interpretOidsOption(stmt->options,
! 									   (relkind == RELKIND_RELATION ||
! 										relkind == RELKIND_FOREIGN_TABLE));
 	descriptor->tdhasoid = (localHasOids || parentOidCount > 0);
 
 	/*

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sequence Access Method WIP
On 2015-04-20 12:49:39 +0300, Heikki Linnakangas wrote: With the patch, the pg_class.relam column references the pg_seqam table for sequences, but pg_am for indexes. I believe it's the first instance where we reuse a foreign key column like that. It's not a real foreign key, of course - that wouldn't work with a real foreign key at all - but it's a bit strange. That makes me a bit uncomfortable. How do others feel about that? Hm. I'd modeled it more as an extension of the 'relkind' column mentally. I.e. it further specifies how exactly the relation is behaving. That the field was added to pg_class and not pg_index, combined with it not having "index" in its name, makes me think that it actually was intended to be used the way it's done in the patch. It's not the first column that behaves that way btw; at least pg_depend comes to mind. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sequence Access Method WIP
On 20/04/15 12:05, Andres Freund wrote: On 2015-04-20 12:49:39 +0300, Heikki Linnakangas wrote: With the patch, the pg_class.relam column references the pg_seqam table for sequences, but pg_am for indexes. I believe it's the first instance where we reuse a foreign key column like that. It's not a real foreign key, of course - that wouldn't work with a real foreign key at all - but it's a bit strange. That makes me a bit uncomfortable. How do others feel about that? Hm. I'd modeled it more as an extension of the 'relkind' column mentally. I.e. it further specifies how exactly the relation is behaving. That the field was added to pg_class and not pg_index, combined with it not having "index" in its name, makes me think that it actually was intended to be used the way it's done in the patch. That's how I think about it too. It's also short for "access method"; nothing really suggests to me that it should be index-specific by design. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Supporting src/test/modules in MSVC builds
Michael Paquier wrote: And currawong is satisfied with this patch and the new buildfarm code, test modules being run as testmodules-install-check-C: http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=currawong&dt=2015-04-18%2014%3A51%3A14 So this closes the loop for this thread. Yay! Thanks, Michael and Andrew. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 20, 2015 at 9:56 AM, Robert Haas robertmh...@gmail.com wrote: On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout klep...@svana.org wrote: I've been following this thread from the side with interest and got twigged by the point about loss of information. If you'd like better information about relative ages, you can achieve this by raising the cap on the usage count and dividing (or right-shifting) each sweep. Yeah, I thought about that, too. It might be worth experimenting with. This would allow you to remember much more about the relative worth of often-used pages. With a cap of 32 you'd have the same effect as now where after 5 sweeps the buffer is evicted. Mathematically the count would converge to the number of times the block is used per sweep. Hmm, interesting point. It's possible that we'd still have problems with everything maxing out at 32 on some workloads, but at least it'd be a little harder to max out at 32 than at 5. Do we have any reproducible test cases to evaluate these assumptions? I haven't looked at this stuff for a while, but my main issue with the clock sweep was finding sweep-heavy cases that did not also have trouble with the buffer mapping locks (although the facts on the ground may have changed since then). merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 5:00 PM, Martijn van Oosterhout klep...@svana.org wrote: I've been following this thread from the side with interest and got twigged by the point about loss of information. If you'd like better information about relative ages, you can achieve this by raising the cap on the usage count and dividing (or right-shifting) each sweep. Yeah, I thought about that, too. It might be worth experimenting with. This would allow you to remember much more about the relative worth of often-used pages. With a cap of 32 you'd have the same effect as now where after 5 sweeps the buffer is evicted. Mathematically the count would converge to the number of times the block is used per sweep. Hmm, interesting point. It's possible that we'd still have problems with everything maxing out at 32 on some workloads, but at least it'd be a little harder to max out at 32 than at 5. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
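For illustration, a toy, self-contained C sketch of the raise-the-cap-and-right-shift idea (names invented; this is not PostgreSQL's bufmgr code, and it ignores pinning, locking, and the freelist):

/*
 * Sketch: higher usage-count cap, halved on each sweep pass instead of
 * decremented by one.
 */
#include <stdint.h>
#include <stdio.h>

#define NBUFFERS  8
#define USAGE_CAP 32                    /* instead of the current cap of 5 */

static uint8_t usage_count[NBUFFERS];
static int     clock_hand;

/* Called on every buffer access ("pin"). */
static void
touch_buffer(int buf)
{
    if (usage_count[buf] < USAGE_CAP)
        usage_count[buf]++;
}

/* Sweep until a buffer with usage count 0 is found, halving as we go. */
static int
evict_one(void)
{
    for (;;)
    {
        int buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (usage_count[buf] == 0)
            return buf;                 /* victim found */
        usage_count[buf] >>= 1;         /* decay toward uses-per-sweep */
    }
}

int
main(void)
{
    /* buffer 0 is hot, the rest are cold */
    for (int i = 0; i < 100; i++)
        touch_buffer(0);
    touch_buffer(3);

    printf("victim: %d\n", evict_one());        /* a cold buffer */
    printf("hot buffer count now: %d\n", usage_count[0]);
    return 0;
}

Because counts are halved on each pass, a buffer's count settles near its number of accesses per sweep, as Martijn notes, rather than sticking at the maximum; a once-hot page that goes cold decays away in a handful of sweeps.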
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On 20 April 2015 at 18:33, Bruce Momjian br...@momjian.us wrote: Also, I am not sure we should be designing features at this stage in our release process. I see this more as a process of gaining approval. I don't think patches at the back of the queue should get the "it's too late" treatment just because they are at the back of the queue. They are logically all at the same stage. There should be a way to allow people that show patience and respect for the process to get the same timeshare as those that push their patches daily. Anyway, in this case, the patch conflicts with other things going in now, so changing things isn't really sensible for this release, in terms of my time. We clearly need to do a better job of piggybacking actions on a dirty block. The discussion has been about whether to do early cleanup, but we should also consider post-access cleanup if we set hint bits or dirtied the block in other ways. Since we have many votes in favour of change in this area, I'll post a new version and look for an early review/commit for next release. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Thu, Apr 16, 2015 at 03:41:54PM +0100, Simon Riggs wrote: That is how we arrive at the idea of a cleanup limit, further enhanced by a limit that applies only to dirtying clean blocks, which we have 4? recent votes in favour of. I would personally be in favour of a parameter to control the limit, since whatever we chose is right/wrong depending upon circumstances. I am however comfortable with not having a parameter if people think it is hard to tune that, which I agree it would be, hence no parameter in the patch. I think the limit has to be in terms of a percentage of the table size. For example, if we do one SELECT on a table with all non-dirty pages, it would be good to know that 5% of the pages were pruned --- that tells me that another 19 SELECTs will totally prune the table, assuming no future writes. If there are future writes, they would dirty the pages and cause even more pruning, but the 5% gives me the maximum pruning number of SELECTs. If there aren't another 19 SELECTs, do I care if the table is pruned or not? Probably not. Measuring in page count doesn't do that, and a large table could receive millions of selects before being fully cleaned. Also, I am not sure we should be designing features at this stage in our release process. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] optimizing vacuum truncation scans
On 4/20/15 1:50 AM, Jeff Janes wrote: Shouldn't completely empty pages be set as all-visible in the VM? If so, can't we just find the largest not-all-visible page and move forward from there, instead of moving backwards like we currently do? If the entire table is all-visible, we would be starting from the beginning, even though the beginning of the table still has read-only tuples present. Except we'd use it in conjunction with nonempty_pages. IIRC, that's set to the last page that vacuum saw data on. If any page after that got written to after vacuum visited it, the VM bit would be cleared. So after we acquire the exclusive lock, AFAICT it's safe to just scan the VM starting with nonempty_pages. For that matter, why do we scan backwards anyway? The comments don't explain it, and we have nonempty_pages as a starting point, so why don't we just scan forward? I suspect that eons ago we didn't have that and just blindly reverse-scanned until we finally hit a non-empty buffer... nonempty_pages is not concurrency-safe, as the pages could become used after vacuum passed them over but before the access exclusive lock was taken for the truncation scan. But maybe the combination of the two? If a page is above nonempty_pages, then anyone who wrote into it after vacuum passed it must have cleared the VM bit. And currently I think no one but vacuum ever sets the VM bit back on, so once cleared it would stay cleared. Right. In any event nonempty_pages could be used to set the guess as to how many pages (if any) might be worth prefetching, as that is not needed for correctness. Yeah, but I think we'd do a LOT better with the VM idea, because we could immediately truncate without scanning anything. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
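A sketch, in C, of what the forward VM scan described above could look like. vm_all_visible() and page_is_empty() are hypothetical stand-ins for the real visibility-map and page-inspection routines; the point is only the scan direction and the use of nonempty_pages:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

extern bool vm_all_visible(BlockNumber blkno);  /* assumed helper */
extern bool page_is_empty(BlockNumber blkno);   /* assumed helper */

/*
 * Returns the new relation length in blocks: everything at or beyond the
 * returned block is empty and can be truncated away.  Assumes the caller
 * already holds AccessExclusiveLock, that pages at or after nonempty_pages
 * were empty when vacuum passed them, and that only vacuum ever sets VM
 * bits while any write clears them.
 */
static BlockNumber
find_truncation_point(BlockNumber nonempty_pages, BlockNumber nblocks)
{
    BlockNumber keep_through = nonempty_pages;  /* blocks we must keep */

    for (BlockNumber blkno = nonempty_pages; blkno < nblocks; blkno++)
    {
        if (vm_all_visible(blkno))
            continue;           /* untouched since vacuum saw it empty */

        /* VM bit cleared: someone may have written here, so check it */
        if (!page_is_empty(blkno))
            keep_through = blkno + 1;   /* data found: keep through here */
    }
    return keep_through;
}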
Re: [HACKERS] Parallel Seq Scan
On Tue, Apr 7, 2015 at 11:58 PM, Amit Kapila amit.kapil...@gmail.com wrote: One disadvantage of retaining parallel paths could be that it can increase the number of combinations the planner might need to evaluate during planning (in particular during join path evaluation) unless we do some special handling to avoid evaluation of such combinations. Yes, that's true. But the overhead might not be very much. In the common case, many baserels and joinrels will have no parallel paths because the non-parallel path is known to be better anyway. Also, if parallelism does seem to be winning, we're probably planning a query that involves accessing a fair amount of data, so a little extra planner overhead may not be so bad. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Shouldn't CREATE TABLE LIKE copy the relhasoids property?
On Thu, Apr 9, 2015 at 5:33 PM, Bruce Momjian br...@momjian.us wrote: On Thu, Apr 9, 2015 at 12:32:23PM -0300, Alvaro Herrera wrote: Bruce Momjian wrote: Should this be listed in the release notes as a backward-incompatibility? Isn't this a backpatchable bug fix? Uh, I don't think so. I think users are used to the existing behavior and changing it on them will cause more harm than good. Also, we have had zero field reports about this problem. I agree. This should not be back-patched, but fixing it in master seems fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Spinlock Documentation
On Sat, Apr 11, 2015 at 12:03 PM, Artem Luzyanin lisyono...@yahoo.ca wrote: Thank you again for your feedback. I have improved the patch with your suggestions. Please let me know what you think and if I can do anything else. Current CommitFest link for this patch is: https://commitfest.postgresql.org/5/208/ Some review comments: - The first hunk in s_lock.h touches only whitespace. Changing the space to a tab on the "Usually" line would make sense for consistency, but adding a trailing space to the "override them" line does not. - As Tom basically said before, I think the "File layout" block comment will just get out of date and be a maintenance annoyance to future updaters of this file. It's not really that hard to see the structure of the file just by going through it, so I don't think this is worthwhile. - Similarly, adding all of the "Currently implemented" lines looks useless to me. Why can't somebody see that from just reading the code itself? Overall, I'm not seeing much point to this patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark st...@mit.edu wrote: This is my point though (you're right that "flushed" isn't always the same as eviction but that's not the important point here). Right now we only demote when we consider buffers for eviction. But we promote when we pin buffers. Those two things aren't necessarily happening at the same rate and in fact are often orders of magnitude different. I am absolutely, positively, violently in 100% agreement with this. I have made the same point before, but it sure is nice to hear someone else thinking about it the same way. So it makes sense that we end up with a lot of buffers promoted all the way to the most recently used ntile and then have to do n passes around the clock and have no good information about which buffer to evict. Right. What I'm saying is that we should demote a buffer every time we promote a buffer. So every time we pin a buffer we should advance the clock a corresponding amount. (I know I'm being intentionally vague about what the corresponding amount is.) The important thing is that the two should be tied together. Yes, absolutely. If you tilt your head the right way, my proposal of limiting the number of promotions per clock sweep has the effect of tying buffer demotion and buffer promotion together much more tightly than is the case right now. You are limited to 2 promotions per demotion; and practically speaking not all buffers eligible to be promoted will actually get accessed, so the number of promotions per demotion will in reality be somewhere between 0 and 2. Ideally it would be exactly 1, but 1 +/- 1 is still a tighter limit than we have at present. Which is not to say there isn't some other idea that is better still. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On Mon, Apr 20, 2015 at 11:00 AM, Merlin Moncure mmonc...@gmail.com wrote: Hmm, interesting point. It's possible that we'd still have problems with everything maxing out at 32 on some workloads, but at least it'd be a little harder to max out at 32 than at 5. Do we have any reproducible test cases to evaluate these assumptions? The particular point that you are responding to here is easy to reproduce. Just create any workload that fits in shared_buffers - a small pgbench database, for example - and access stuff. No usage counts will go down, but every access will drive usage counts up. Eventually everything will hit any maximum you care to install, and actually I don't think it takes very long. You can use pg_buffercache to see the results. I haven't looked at this stuff for a while, but my main issue with the clock sweep was finding sweep heavy cases that did not also have trouble with the buffer mapping locks (although the facts on the ground my have changed since then). We increased the number of buffer mapping locks to 128, so that problem should be considerably ameliorated now. But it's not totally gone, as demonstrated by Andres's experiments with my chash patch. There was also a patch that eliminated BufFreelistLock in favor of a spinlock held for much shorter time periods, and then Andres took that one step further and made it use atomics. That used to be a *terrible* bottleneck on eviction-heavy workloads and is now much improved. More work may remain to be done, of course. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sequence Access Method WIP
Heikki Linnakangas wrote: * The transformations of the arrays in the get_state() and set_state() functions are a bit complicated. The seqam_get_state() function returns two C arrays, and pg_sequence_get_state() turns them into a text[] array. Why not construct the text[] array directly in the AM? I guess it's a bit more convenient to the AM if pg_sequence_get_state() does that, but if that's an issue, you could create a generic helper function to turn two C string arrays into text[], and call that from the AM. Um, see strlist_to_textarray() in objectaddress.c if you do that. Maybe we need some generic place to store that kind of helper function. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Seq Scan
On Wed, Apr 8, 2015 at 3:34 AM, David Rowley dgrowle...@gmail.com wrote: On 8 April 2015 at 14:24, Robert Haas robertmh...@gmail.com wrote: I think one of the philosophical questions that has to be answered here is "what does it mean to talk about the cost of a parallel plan?". For a non-parallel plan, the cost of the plan means both the amount of effort we will spend executing the plan and also the amount of time we think the plan will take to complete, but those two things are different for parallel plans. I'm inclined to think it's right to view the cost of a parallel plan as a proxy for execution time, because the fundamental principle of the planner is that we pick the lowest-cost plan. But there also clearly needs to be some way to prevent the selection of a plan which runs slightly faster at the cost of using vastly more resources. I'd agree with that as far as CPU costs go, or maybe I'd just disagree with the alternative: if we costed in each individual worker's work * number of workers, then we'd never choose a parallel plan, as by the time we costed in tuple communication costs between the processes a parallel plan would always cost more than the serial equivalent. I/O costs are different; I'd imagine these shouldn't be divided by the estimated number of workers. It's hard to say. If the I/O is from the OS buffer cache, then there's no reason why several workers can't run in parallel. And even if it's from the actual storage, we don't know what degree of I/O parallelism will be possible. Maybe effective_io_concurrency should play into the costing formula somehow, but it's not very clear to me that it captures the information we care about. In general, I'm not sure how common it is for the execution speed of a sequential scan to be limited by I/O. For example, on a pgbench database, scale factor 300, on a POWERPC machine provided by IBM for performance testing (thanks, IBM!) a cached read of the pgbench_accounts files took 1.122 seconds. After dropping the caches, it took 10.427 seconds. select * from pgbench_accounts where abalance > 3 took 10.244 seconds with a cold cache and 5.029 seconds with a warm cache. So on this particular hardware, on this particular test, parallelism is useless if the cache is cold, but it could be right to use ~4-5 processes for the scan if the cache is warm. However, we have no way of knowing whether the cache will be cold or warm at execution time. This isn't a new problem. As it is, the user has to set seq_page_cost and random_page_cost based on either a cold-cache assumption or a warm-cache assumption, and if they guess wrong, their costing estimates will be off (on this platform, on this test case) by 4-5x. That's pretty bad, and it's totally unclear to me what to do about it. I'm guessing it's unclear to other people, too, or we would likely have done something about it by now. Some ideas for GUCs: max_parallel_degree = The largest number of processes we'll consider using for a single query. min_parallel_speedup = The minimum percentage by which a parallel path must be cheaper (in terms of execution time) than a non-parallel path in order to survive. I'm imagining the default here might be something like 15%. min_parallel_speedup_per_worker = Like the previous one, but per worker. e.g. if this is 5%, which might be a sensible default, then a plan with 4 workers must be at least 20% better to survive, but a plan using only 2 workers only needs to be 10% better.
max_parallel_degree feels awfully like it would have to be set conservatively, similar to how work_mem is today. Like with work_mem, during quiet periods it sure would be nice if it could magically increase. Absolutely. But, similar to work_mem, that's a really hard problem. We can't know at plan time how much work memory, or how many CPUs, will be available at execution time. And even if we did, it need not be constant throughout the whole of query execution. It could be that when execution starts, there's lots of memory available, so we do a quicksort rather than a tape-sort. But midway through the machine comes under intense memory pressure and there's no way for the system to switch strategies. Now, having said that, I absolutely believe that it's correct for the planner to make the initial decisions in this area. Parallelism changes the cost of execution nodes, and it's completely wrong to assume that this couldn't alter planner decisions at higher levels of the plan tree. At the same time, it's pretty clear that it would be a great thing for the executor to be able to adjust the strategy if the planner's assumptions don't pan out, or if conditions have changed. For example, if we choose a seq-scan-sort-and-filter over an index-scan-and-filter thinking that we'll be able to do a quicksort, and then it turns out that we're short on memory, it's too late to switch gears and adopt the index-scan-and-filter plan after all. That's long since
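For illustration only, the kind of cost arithmetic discussed above, as a C sketch (this is not the planner's actual cost model; all names are invented): CPU cost divided among workers, I/O cost left whole, plus a per-tuple charge for shipping rows from workers to the leader.

/* Illustrative parallel-scan costing sketch, not PostgreSQL planner code. */
typedef double Cost;

static Cost
parallel_scan_cost(Cost cpu_cost, Cost io_cost,
                   double ntuples, int nworkers, Cost transfer_per_tuple)
{
    Cost run_cost;

    run_cost = cpu_cost / nworkers;             /* CPU work is shared */
    run_cost += io_cost;                        /* disk bandwidth may not be */
    run_cost += ntuples * transfer_per_tuple;   /* worker-to-leader traffic */
    return run_cost;
}

Under such a model a parallel path only wins when the shared CPU savings exceed the added tuple-transfer cost, which is exactly the trade-off David describes.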
Re: [HACKERS] Parallel Seq Scan
On Wed, Apr 8, 2015 at 3:38 AM, Amit Langote langote_amit...@lab.ntt.co.jp wrote: On 08-04-2015 PM 12:46, Amit Kapila wrote: Going forward, I think we can improve the same if we decide not to shutdown parallel workers till postmaster shutdown once they are started and then just allocate them during executor-start phase. I wonder if it makes sense to invent the notion of a global pool of workers with a configurable number of workers that are created at postmaster start and destroyed at shutdown and requested for use when a query uses parallelizable nodes. Short answer: Yes, but not for the first version of this feature. Longer answer: We can't actually very reasonably have a global pool of workers so long as we retain the restriction that a backend connected to one database cannot subsequently disconnect from it and connect to some other database instead. However, it's certainly a good idea to reuse the same workers for subsequent operations on the same database, especially if they are also by the same user. At the very minimum, it would be good to reuse the same workers for subsequent operations within the same query, instead of destroying the old ones and creating new ones. Notwithstanding the obvious value of all of these ideas, I don't think we should do any of them for the first version of this feature. This is too big a thing to get perfect on the first try. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?
On 4/20/15 11:11 AM, Robert Haas wrote: On Wed, Apr 15, 2015 at 5:06 PM, Greg Stark st...@mit.edu wrote: This is my point though (you're right that flushed isn't always the same as eviction but that's not the important point here). Right now we only demote when we consider buffers for eviction. But we promote when we pin buffers. Those two things aren't necessarily happening at the same rate and in fact are often orders of magnitude different. I am absolutely, positively, violently in 100% agreement with this. I have made the same point before, but it sure is nice to hear someone else thinking about it the same way. +1 What I'm saying is that we should demote a buffer every time we promote a buffer. So every time we pin a buffer we should advance the clock a corresponding amount. I know I'm being intentionally vague about what the corresponding amount is.) The important thing is that the two should be tied together. Yes, absolutely. If you tilt your head the right way, my proposal of limiting the number of promotions per clock sweep has the effect of tying buffer demotion and buffer promotion together much more tightly than is the case right now. You are limited to 2 promotions per demotion; and practically speaking not all buffers eligible to be promoted will actually get accessed, so the number of promotions per demotion will in reality be somewhere between 0 and 2. Ideally it would be exactly 1, but 1 +/- 1 is still a tighter limit than we have at present. Which is not to say there isn't some other idea that is better still. I think that would help, but it still leaves user backends trying to advance the clock, which is quite painful. Has anyone tested running the clock in the background? We need a wiki page with all the ideas that have been tested around buffer management... -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
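On "running the clock in the background", one way that could look, as a single-threaded toy in C: a background task keeps a small ring of zero-count victims stocked so a backend evicts in O(1). All names are invented, and a real implementation would need locking and would have to re-check pins and usage counts when a victim is popped:

#include <stdint.h>

#define NBUFFERS  1024
#define RING_SIZE 64

static uint8_t usage_count[NBUFFERS];
static int     victim_ring[RING_SIZE];  /* FIFO of ready victims */
static int     ring_head, ring_tail;    /* produce at head, consume at tail */
static int     clock_hand;

/* Background task: advance the clock and keep the victim ring stocked. */
static void
background_sweep(void)
{
    while ((ring_head + 1) % RING_SIZE != ring_tail)    /* until ring full */
    {
        int buf = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (usage_count[buf] > 0)
            usage_count[buf]--;                 /* demote and keep sweeping */
        else
        {
            victim_ring[ring_head] = buf;       /* stash a ready victim */
            ring_head = (ring_head + 1) % RING_SIZE;
        }
    }
}

/* Foreground path: a backend pops a pre-found victim in O(1). */
static int
get_victim(void)
{
    int buf;

    if (ring_tail == ring_head)
        background_sweep();     /* ring ran dry: fall back to sweeping */

    buf = victim_ring[ring_tail];
    ring_tail = (ring_tail + 1) % RING_SIZE;
    return buf;
}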
Re: [HACKERS] Freeze avoidance of very large table.
On 4/20/15 1:48 PM, Bruce Momjian wrote: On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: Attached WIP patch adds Frozen Map, which enables us to avoid whole-table vacuuming even when a full scan is required: preventing XID wraparound failures. Frozen Map is a bitmap with one bit per heap page, quite similar to the Visibility Map. A set bit means that all tuples on the heap page are completely frozen, therefore we don't need to vacuum freeze that page. A bit is set when vacuum (or autovacuum) figures out that all tuples on the corresponding heap page are completely frozen, and a bit is cleared when INSERT and UPDATE (only a new heap page) are executed. So, this patch avoids reading all-frozen pages if they have not been modified since the last VACUUM FREEZE? Since it is already frozen, the running VACUUM FREEZE will not modify the page or generate WAL, so is it really worth maintaining a new per-page bitmap just to avoid the sequential scan of tables every 200M transactions? I would like to see us reduce the need for VACUUM FREEZE, rather than go in this direction. How would you propose we do that? I also think there's better ways we could handle *all* our cleanup work. Tuples have a definite lifespan, and there's potentially a lot of efficiency to be gained if we could track tuples through their stages of life... but I don't see any easy ways to do that. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
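To make the map mechanics concrete, a toy C sketch of a one-bit-per-heap-page map as described above (function names and layout are invented for illustration, not taken from the posted patch):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* one bit per heap page, so the map is tiny relative to the heap */
#define FM_BYTES(nblocks)   (((nblocks) + 7) / 8)

static void
fm_set(uint8_t *map, BlockNumber blkno)     /* vacuum: page fully frozen */
{
    map[blkno / 8] |= (uint8_t) (1 << (blkno % 8));
}

static void
fm_clear(uint8_t *map, BlockNumber blkno)   /* INSERT/UPDATE touched page */
{
    map[blkno / 8] &= (uint8_t) ~(1 << (blkno % 8));
}

static bool
fm_test(const uint8_t *map, BlockNumber blkno)
{
    return (map[blkno / 8] >> (blkno % 8)) & 1;
}

/*
 * An anti-wraparound vacuum could then skip any page whose bit is set,
 * instead of reading the whole heap:
 *
 *     for (blkno = 0; blkno < nblocks; blkno++)
 *         if (!fm_test(fm, blkno))
 *             freeze_page(blkno);       // hypothetical
 */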
Re: [HACKERS] Freeze avoidance of very large table.
On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: Attached WIP patch adds Frozen Map, which enables us to avoid whole-table vacuuming even when a full scan is required: preventing XID wraparound failures. Frozen Map is a bitmap with one bit per heap page, quite similar to the Visibility Map. A set bit means that all tuples on the heap page are completely frozen, therefore we don't need to vacuum freeze that page. A bit is set when vacuum (or autovacuum) figures out that all tuples on the corresponding heap page are completely frozen, and a bit is cleared when INSERT and UPDATE (only a new heap page) are executed. So, this patch avoids reading all-frozen pages if they have not been modified since the last VACUUM FREEZE? Since it is already frozen, the running VACUUM FREEZE will not modify the page or generate WAL, so is it really worth maintaining a new per-page bitmap just to avoid the sequential scan of tables every 200M transactions? I would like to see us reduce the need for VACUUM FREEZE, rather than go in this direction. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Allow SQL/plpgsql functions to accept record
On Sun, Apr 19, 2015 at 3:02 PM, Jim Nasby jim.na...@bluetreble.com wrote: Is there a fundamental reason SQL/plpgsql functions won't accept record as an input type? If not, can someone point me at a patch that might show how much work would be involved in adding support? My particular use case is a generic function that will count how many fields in a record are NULL. I can do it in pure SQL (below), but was hoping to wrap the entire thing in a function. Right now, I have to add a call to row_to_json() to the function call. SELECT count(*) FROM json_each_text( row_to_json($1) ) a WHERE value IS NULL See also: SELECT (src.v).* FROM ( VALUES (ROW(1,2,3)) ) src (v); ERROR: record type has not been registered While it may not be necessary to solve both problems, I suspect they have the same underlying root cause - specifically the separation of concerns between the planner and the executor. ISTM that the planner needs to be able to create arbitrarily named composite types and leave them registered in the session somewhere for the executor to find. Session because: PREPARE prep_rec AS SELECT record_input_func(v) FROM ( VALUES (ROW($1::integer,$2::boolean,$3::text)) ) src (v); EXECUTE prep_rec USING (1, true, 'hi!'); If it requires additional smarts in the executor to make this all work, I suspect the cost-benefit equations end up supporting the somewhat more verbose but workable status quo. I'm not sure how { row_to_json(record) } works, but SQL (including pl/pgsql) needs to have some source of definition for what the record type should be in reality - and that source currently is the catalogs, whose rows are locked by the planner and injected, I think, into a session cache. The source query in pl/pgsql defines the type for fully embedded use of the record placeholder, while the caller's function alias provides that information for RETURNS record. The calling query needs to provide the same information for CREATE FUNCTION func( arg1 record ), since the body of the pl/pgsql function needs to instantiate arg1 with a known type as soon as the function is entered. It is theoretically possible to impute the needed anonymous type from the query definition - the problem is how and where to register that information for execution. At least for pl/pgsql I could see possibly doing something like func( arg1 packed_record_bytes ) and having pl/pgsql understand how to unpack those bytes into an anonymous but structured record (like it would with SELECT ... INTO record_var). I would not expect pl/SQL to allow anything of the sort, as it doesn't seem compatible with the idea of inline-ability. Maybe the C code for row_to_json (or libpq in general) can provide inspiration (particularly for the pack/unpack bytes) but as I do not know C I'm going to have to leave that to others. David J.
Re: [HACKERS] Freeze avoidance of very large table.
On Mon, Apr 20, 2015 at 01:59:17PM -0500, Jim Nasby wrote: On 4/20/15 1:48 PM, Bruce Momjian wrote: On Mon, Apr 20, 2015 at 04:45:34PM +0900, Sawada Masahiko wrote: Attached WIP patch adds a Frozen Map, which enables us to avoid whole-table vacuuming even when a full scan would otherwise be required to prevent XID wraparound failures. The Frozen Map is a bitmap with one bit per heap page, quite similar to the Visibility Map. A set bit means that all tuples on the heap page are completely frozen, so we don't need to vacuum-freeze that page. A bit is set when vacuum (or autovacuum) determines that all tuples on the corresponding heap page are completely frozen, and a bit is cleared when an INSERT or UPDATE (on the new heap page only) modifies it.

So, this patch avoids reading all-frozen pages if they have not been modified since the last VACUUM FREEZE? Since such a page is already frozen, a running VACUUM FREEZE will not modify it or generate WAL, so is it really worth maintaining a new per-page bitmap just to avoid a sequential scan of the tables every 200M transactions? I would like to see us reduce the need for VACUUM FREEZE rather than go in this direction.

How would you propose we do that? I also think there are better ways we could handle *all* our cleanup work. Tuples have a definite lifespan, and there's potentially a lot of efficiency to be gained if we could track tuples through their stages of life... but I don't see any easy ways to do that.

See the TODO list: https://wiki.postgresql.org/wiki/Todo

o Avoid the requirement of freezing pages that are infrequently modified

-- Bruce Momjian  br...@momjian.us  http://momjian.us  EnterpriseDB  http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] Turning off HOT/Cleanup sometimes
Bruce Momjian wrote: I think the limit has to be in terms of a percentage of the table size. For example, if we do one SELECT on a table with all non-dirty pages, it would be good to know that 5% of the pages were pruned --- that tells me that another 19 SELECTs will totally prune the table, assuming no future writes.

This seems simple to implement: keep two counters, where the second one counts the pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE, reset the first counter so that a further 5 pages will get HOT pruned. 5% seems a bit high, though. (In Simon's design, SOME_MAX_VALUE is essentially +infinity.)

-- Álvaro Herrera  http://www.2ndQuadrant.com/  PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
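P.S. A minimal sketch of that two-counter scheme, in case it helps; PRUNE_BATCH, the counter names, and may_prune_page() are placeholders invented here, not from any posted patch:

    #define PRUNE_BATCH     5       /* pages we may prune per batch */
    #define SOME_MAX_VALUE  1000    /* skipped pages before re-arming */

    static int  pages_pruned;       /* first counter: pages HOT pruned */
    static int  pages_skipped;      /* second counter: cleanup skipped */

    /* Called once per page a scan visits; true means "prune this one". */
    static bool
    may_prune_page(void)
    {
        if (pages_pruned < PRUNE_BATCH)
        {
            pages_pruned++;
            return true;
        }
        if (++pages_skipped >= SOME_MAX_VALUE)
        {
            /* Reset the first counter so a further batch gets pruned. */
            pages_pruned = 0;
            pages_skipped = 0;
        }
        return false;
    }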
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote: Bruce Momjian wrote: I think the limit has to be in terms of a percentage of the table size. For example, if we do one SELECT on a table with all non-dirty pages, it would be good to know that 5% of the pages were pruned --- that tells me that another 19 SELECTs will totally prune the table, assuming no future writes.

This seems simple to implement: keep two counters, where the second one counts the pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE, reset the first counter so that a further 5 pages will get HOT pruned. 5% seems a bit high, though. (In Simon's design, SOME_MAX_VALUE is essentially +infinity.)

Oh, I pulled 5% out of the air. Thinking of a SELECT-only workload, which would be our worst case, I was considering how many SELECTs running through HOT update chains it would take to be slower than generating the WAL to prune the page. I see the percentage as something we could reasonably balance, while a fixed page count couldn't be analyzed in that way.

-- Bruce Momjian  br...@momjian.us  http://momjian.us  EnterpriseDB  http://enterprisedb.com + Everyone has their own god. +
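P.S. In the percentage-based version, the fixed batch in the earlier sketch would instead be derived from the table size; a sketch, where rel_npages is assumed to be the table's page count:

    /* Allow this scan to prune at most 5% of the table's pages. */
    int     prune_limit = (int) (rel_npages * 0.05);
    bool    allow_prune = (pages_pruned < prune_limit);

Each full pass can then prune at most 5% of the pages, which is where the "another 19 SELECTs" arithmetic comes from: twenty passes at 5% cover the whole table.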
Re: [HACKERS] [pgsql-packagers] Palle Girgensohn's ICU patch
On 4/19/15 7:46 AM, Palle Girgensohn wrote: Just poking this old thread again. What happened here? Is anyone putting work into this area at the moment?

I plan to look at it for 9.6.
Re: [HACKERS] optimizing vacuum truncation scans
On Sun, Apr 19, 2015 at 7:09 PM, Jeff Janes jeff.ja...@gmail.com wrote: I did literally the simplest thing I could think of as a proof-of-concept patch, to see if it would actually fix things. I just jumped back a certain number of blocks occasionally and prefetched them forward, then resumed the regular backward scan.

IIRC, this patch introduces optional prefetching for the backward scan, where file-system readahead may not work well. The result is promising!

Also, what would be the best way to drill a hole through bufmgr.c into md.c so that the prefetch could specify an entire range, rather than looping over each individual block?

Unlike mdread(), mdprefetch() does not take a buffer argument, so extending it looks natural and safe to me. This would also lay a foundation for other scan-style optimizations.

What would have to be done to detect people running on SSDs and disable the feature, if anything?

There are other call sites of prefetch, so that question applies to more than just this optimization.

Regards, Qingqing
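P.S. To make the mdprefetch() idea concrete, a hedged sketch modeled on the 9.4-era mdprefetch() in src/backend/storage/smgr/md.c; the name mdprefetch_range and the single-call approach are assumptions, and a real version would clamp the range at 1 GB segment boundaries:

    void
    mdprefetch_range(SMgrRelation reln, ForkNumber forknum,
                     BlockNumber blocknum, BlockNumber nblocks)
    {
    #ifdef USE_PREFETCH
        off_t       seekpos;
        MdfdVec    *v;

        v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_FAIL);
        seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) RELSEG_SIZE));

        /* One posix_fadvise(WILLNEED) for the whole range, instead of
         * one FilePrefetch() call per block. Assumes the range does not
         * cross the end of this 1 GB segment. */
        (void) FilePrefetch(v->mdfd_vfd, seekpos, (int) (nblocks * BLCKSZ));
    #endif
    }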
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 10:33 AM, Bruce Momjian br...@momjian.us wrote: On Thu, Apr 16, 2015 at 03:41:54PM +0100, Simon Riggs wrote: That is how we arrive at the idea of a cleanup limit, further enhanced by a limit that applies only to dirtying clean blocks, which we have 4(?) recent votes in favour of. I would personally be in favour of a parameter to control the limit, since whatever we choose is right/wrong depending upon circumstances. I am, however, comfortable with not having a parameter if people think it is hard to tune, which I agree it would be; hence no parameter in the patch.

I think the limit has to be in terms of a percentage of the table size. For example, if we do one SELECT on a table with all non-dirty pages, it would be good to know that 5% of the pages were pruned --- that tells me that another 19 SELECTs will totally prune the table, assuming no future writes.

But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while the user waits, that is fundamentally VACUUM's duty to do in the background? If there are a handful of very hot pages, then it makes sense not to wait for vacuum to get to them, and that is what a block-count limit does. But if the entire table is very hot, I think that is just another way of saying that autovacuum is horribly misconfigured. I think the purpose of this patch is to fix something that can't be fixed through configuration alone.

If there are future writes, they would dirty the pages and cause even more pruning, but the 5% gives me the maximum pruning number of SELECTs. If there aren't another 19 SELECTs, do I care if the table is pruned or not?

The use case I see for this is a mixed workload: there is one SELECT which reads the entire table, hundreds of thousands of selects/updates/inserts that don't, and of course vacuum comes along every now and then and does its thing. Why should the one massive SELECT have horrible performance just because it was run right before autovacuum would have kicked in instead of right after it finished?

Cheers, Jeff
Re: [HACKERS] Turning off HOT/Cleanup sometimes
On Mon, Apr 20, 2015 at 12:28:11PM -0700, Jeff Janes wrote: But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while the user waits, that is fundamentally VACUUM's duty to do in the background? If there are a handful of very hot pages, then it makes sense not to wait for vacuum to get to them, and that is what a block-count limit does. But if the entire table is very hot, I think that is just another way of saying that autovacuum is horribly misconfigured. I think the purpose of this patch is to fix something that can't be fixed through configuration alone.

Well, we have to assume there are many misconfigured installations --- autovacuum isn't super-easy to configure, so we can't just blame the user if this makes things worse. In fact, page pruning was designed specifically for cases where autovacuum wasn't running or couldn't keep up.

If there are future writes, they would dirty the pages and cause even more pruning, but the 5% gives me the maximum pruning number of SELECTs. If there aren't another 19 SELECTs, do I care if the table is pruned or not? The use case I see for this is a mixed workload: there is one SELECT which reads the entire table, hundreds of thousands of selects/updates/inserts that don't, and of course vacuum comes along every now and then and does its thing. Why should the one massive SELECT have horrible performance just because it was run right before autovacuum would have kicked in instead of right after it finished?

I see your point, but what about a read-only workload after a big update? Do we leave large tables non-pruned for a long time? Also, consider cases where you did a big update but the autovacuum threshold was not met, so autovacuum doesn't run on that table --- again, do we keep those non-pruned rows around for millions of scans?

-- Bruce Momjian  br...@momjian.us  http://momjian.us  EnterpriseDB  http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] alternative compression algorithms?
On 04/19/2015 11:51 PM, Tomas Vondra wrote: Hi there, in the past we've repeatedly discussed the option of using a different compression algorithm (e.g. lz4), but every time the discussion died off because of fear of possible patent issues [1] [2] and many other threads. Have we decided it's not worth the risks, making patches in this area futile? ... I'm a bit confused though, because I've noticed various other FOSS projects adopting lz4 over the past few years and I'm yet to find a project voicing the same concerns about patents. So either they're reckless or we're excessively paranoid.

IMHO we should switch to a different algorithm. Wrt patents, if we choose an algorithm that's already used widely in several other open source projects, that's safe enough. It's as safe as we're going to get. There is always the chance that you infringe on a patent when you write any code. Compression is a particularly heavily-patented field, but it's nevertheless a bit strange to be super-paranoid there and not worry about patents elsewhere.

(The safest option would be to pick an algorithm that was patented but whose patent has expired. I don't think any of the compression algorithms discussed recently are old enough for that, though.)

- Heikki
Re: [HACKERS] optimizing vacuum truncation scans
On Sun, Apr 19, 2015 at 10:38 PM, Jim Nasby jim.na...@bluetreble.com wrote: On 4/19/15 9:09 PM, Jeff Janes wrote: I did literally the simplest thing I could think of as a proof-of-concept patch, to see if it would actually fix things. I just jumped back a certain number of blocks occasionally and prefetched them forward, then resumed the regular backward scan. The patch and driving script are attached.

Shouldn't completely empty pages be set as all-visible in the VM? If so, can't we just find the largest not-all-visible page and move forward from there, instead of moving backwards like we currently do?

If the entire table is all-visible, we would be starting from the beginning, even though the beginning of the table still has read-only tuples present.

For that matter, why do we scan backwards anyway? The comments don't explain it, and we have nonempty_pages as a starting point, so why don't we just scan forward? I suspect that eons ago we didn't have that and just blindly reverse-scanned until we finally hit a non-empty buffer...

nonempty_pages is not concurrency-safe: pages could be written after vacuum passed over them but before the access-exclusive lock was taken for the truncation scan. But maybe the combination of the two works? If a page is above nonempty_pages, then anyone who wrote into it after vacuum passed must have cleared its VM bit, and currently nothing but vacuum ever sets a VM bit back on, so once cleared it stays cleared.

In any event, nonempty_pages could be used to set the guess as to how many pages (if any) might be worth prefetching, since that use is not needed for correctness.

Cheers, Jeff
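P.S. A sketch of that combined test; nonempty_pages and visibilitymap_test() are the real 9.4-era names, but this helper and its use in the truncation scan are assumptions, not a patch:

    /* A block at or above nonempty_pages that is still all-visible
     * cannot have been written since vacuum's pass: any writer would
     * have cleared the VM bit, and only vacuum ever sets it again. */
    static bool
    page_safely_truncatable(Relation onerel, LVRelStats *vacrelstats,
                            BlockNumber blkno, Buffer *vmbuffer)
    {
        if (blkno < vacrelstats->nonempty_pages)
            return false;       /* vacuum saw data this far in */

        return visibilitymap_test(onerel, blkno, vmbuffer);
    }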