Re: [HACKERS] Proposal: Local indexes for partitioned table
On 17 December 2017 at 16:22, Robert Haas wrote:
> On Fri, Dec 15, 2017 at 5:18 PM, Alvaro Herrera wrote:
>> We have two options for marking valid:
>>
>> 1. after each ALTER INDEX ATTACH, verify whether the set of partitions
>> that contain the index is complete; if so, mark it valid, otherwise do
>> nothing. This sucks because we have to check that over and over for
>> every index that we attach
>>
>> 2. We invent yet another command, say
>> ALTER INDEX VALIDATE

It's not perfect that we need to validate each time, but validation might not be that expensive: we only really need to count the pg_index rows with indisvalid = true whose parent index is the one we're ATTACHing to, then ensure that matches the number of leaf partitions.

> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I
> think it might as well try to validate while it's at it. But if not
> then we might want to go with #2.

I'm now not that clear on what the behaviour is if the ONLY keyword is not specified on the CREATE INDEX for the partitioned index. Does that go and create each leaf partition index regardless of whether there is a suitable candidate to ATTACH?

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
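The count check David describes could be sketched roughly as follows, assuming (as in the patch under discussion) that attached child indexes are linked to the parent index through pg_inherits; the index and table names here are hypothetical:

```sql
-- Sketch only: does every leaf partition have a valid attached index?
SELECT (SELECT count(*)
        FROM pg_inherits h
        JOIN pg_index i ON i.indexrelid = h.inhrelid
        WHERE h.inhparent = 'parent_idx'::regclass
          AND i.indisvalid)
       =
       (SELECT count(*)
        FROM pg_inherits
        WHERE inhparent = 'parent_tab'::regclass) AS all_partitions_covered;
```

Note that this compares direct children only; a multi-level partition tree would need a recursive walk, which is part of why repeating the check on every ATTACH is a concern later in the thread.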
Re: GSoC 2018
On 15/12/17 13:57, Aleksander Alekseev wrote: Hi Stephen, HA/fail-over is a very broad topic, with a lot of pieces that need to be done such that I'm not sure it's really viable, but perhaps a precursor project (synchronous logical replication seems like a prereq, no?) would make more sense. Or, perhaps, a different piece of the HA question, but solving the whole thing in a summer strikes me as unlikely to be reasonable. [...]

I agree with most of the comments down-thread that this project seems to undertake much more work than what a GSoC project could be. I'd like to chime in with a few other comments regarding other aspects of the proposal:

- I believe we're mixing up what HA and replication are. HA is about promoting a secondary node when the primary fails, preventing two primaries at the same time, ensuring one master is always up and so forth. Replication is just a data replication mechanism, and that *helps* with HA, but is not HA nor a necessary prerequisite (actually you could build an HA solution based on shared disk). Indeed, HA can be built on top of SR, logical replication (with many caveats, as has already been commented), disk replication techniques outside of PG and probably others. The topic is way broader.

- If the project's goal is to improve on some of logical replication's shortcomings that would be required to make it a useful replication technique on top of which HA solutions could be built, my +1 for that.

- If the project's goal would be to augment an existing HA solution to abstract away data replication techniques (so that SR is not the only option, but also logical or disk-based replication or whatever) from the core algorithm, I believe this is also a very sensible GSoC project.

- To add a proxy layer to any existing HA project is IMO a project on its own. I don't even see this as a GSoC-feasible project.

Cheers, Álvaro -- Alvaro Hernandez --- OnGres
Small typo in comment in json_agg_transfn
The attached fixes a small typo in json_agg_transfn. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services json_agg_transfn_comment_fix.patch Description: Binary data
Parallel Aggregates for string_agg and array_agg
Hi,

While working on partial aggregation a few years ago, I didn't really think it was worthwhile allowing partial aggregation of string_agg and array_agg. I soon realised that I was wrong about that, and allowing parallelisation of these aggregates could still be very useful when many rows are filtered out during the scan.

Some benchmarks that I've done locally show that parallel string_agg and array_agg do actually perform better, despite the fact that the aggregate state grows linearly with each aggregated item. Obviously, the performance will get even better when workers are filtering out rows before aggregation takes place, so it seems worthwhile doing this. However, the main reason that I'm motivated to do this is that there are more uses for partial aggregation than just parallel aggregation, and it seems a shame to disable all these features if a single aggregate does not support partial mode.

I've attached a patch which implements all this. I've had most of it stashed away for a while now, but I managed to get some time this weekend to get it into a more completed state. Things are now looking pretty good for the number of aggregates that support partial mode. Just a handful of aggregates now don't support partial aggregation;

postgres=# select aggfnoid from pg_aggregate where aggcombinefn=0 and aggkind='n';
     aggfnoid
------------------
 xmlagg
 json_agg
 json_object_agg
 jsonb_agg
 jsonb_object_agg
(5 rows)

... and a good number do support it;

postgres=# select count(*) from pg_aggregate where aggcombinefn<>0 and aggkind='n';
 count
-------
   122
(1 row)

There's probably no reason why the last 5 of those couldn't be done either; it might just require shifting a bit more work into the final functions, although I'm not planning on that for this patch.

As for the patch, there's a bit of a quirk in the implementation of string_agg. We previously always threw away the delimiter that belongs to the first aggregated value, but we now need to keep that around so we can put it in between two states in the combine function. I decided the path of least resistance was just to borrow StringInfo's cursor variable to use as a pointer to the start of the first value, and put the first delimiter before that. Both string_agg(text) and string_agg(bytea) already have a final function, so we just need to skip over the bytes up until the cursor position to get rid of the first delimiter. I could go and invent some new state type to do the same, but I don't really see the trouble with what I've done with StringInfo; I'll certainly listen if someone else thinks this is wrong.

Another thing that I might review later about this is seeing about getting rid of some of the code duplication between array_agg_array_combine and accumArrayResultArr.

I'm going to add this to PG11's final commitfest rather than the January 'fest, as it seems more like a final-commitfest type of patch.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services combinefn_for_string_and_array_aggs_v1.patch Description: Binary data
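To make the delimiter quirk concrete, here is the invariant the new combine function has to preserve; this is a hand-written illustration, not output from the patch:

```sql
-- Serial aggregation (in input order) gives 'a,b,c,d':
SELECT string_agg(x, ',') FROM (VALUES ('a'), ('b'), ('c'), ('d')) t(x);

-- Under partial aggregation, two workers might build partial states
-- for ('a','b') and ('c','d').  To combine them into 'a,b,c,d', the
-- combine function needs a delimiter to place between the two states,
-- which is why the first value's delimiter can no longer simply be
-- thrown away; instead the final function strips the leading delimiter.
```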
Re: pg_(total_)relation_size and partitioned tables
On 17 December 2017 at 16:24, Robert Haas wrote:
> On Thu, Dec 14, 2017 at 12:23 AM, Amit Langote wrote:
>> You may have guessed from $subject that the two don't work together.
>
> It works exactly as documented:
>
> pg_total_relation_size(regclass) - Total disk space used by the
> specified table, including all indexes and TOAST data
>
> It says nothing about including partitions. If we change this, then
> we certainly need to update the documentation (that might be a good
> idea if we decide not to update this).
>
> Personally, I'm -1 on including partitions, because then you can no
> longer expect that the sum of pg_total_relation_size(regclass) across
> all relations in the database will equal the size of the database
> itself. Partitions will be counted a number of times equal to their
> depth in the partitioning hierarchy. However, I understand that I
> might get outvoted.

I'd also vote to leave the relation_size functions alone.

Perhaps it's worth thinking of changing pg_table_size() instead. We have taken measures to try and hide the fact that a table is made up of a bunch of partitions from the user in some cases, e.g. DROP TABLE works without CASCADE for a partitioned table. I'm sure there are arguments in both directions here too though.

I generally think of the difference between a relation and a table similarly to the difference between a tuple and a row. A relation is just what we use to implement tables, and there can be multiple relations per table (in the partitioning case), similar to how there can be multiple tuple versions for a single row. So that might back up the idea that pg_table_size should include all relations that make up that table.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Why does array_position_common bitwise NOT an Oid type?
On 17 December 2017 at 14:53, Tom Lane wrote:
> David Rowley writes:
>> I was puzzled to see the following code:
>
>> my_extra->element_type = ~element_type;
>
> If memory serves, the idea was to force the subsequent datatype-lookup
> path to be taken, even if for some reason element_type is InvalidOid.
> If we take the lookup path then the bogus element_type will be detected
> and reported; if we don't, it won't be.

That makes sense. I'd just have expected more documentation on that, although perhaps I just didn't look hard enough. I also failed to notice, when I originally sent this, that the same pattern occurs in several other places.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposal: Local indexes for partitioned table
Robert Haas wrote:
> On Fri, Dec 15, 2017 at 5:18 PM, Alvaro Herrera wrote:
> > We have two options for marking valid:
> >
> > 1. after each ALTER INDEX ATTACH, verify whether the set of partitions
> > that contain the index is complete; if so, mark it valid, otherwise do
> > nothing. This sucks because we have to check that over and over for
> > every index that we attach
> >
> > 2. We invent yet another command, say
> > ALTER INDEX VALIDATE
>
> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I
> think it might as well try to validate while it's at it. But if not
> then we might want to go with #2.

The problem I have with it is that restoring a dump containing indexes on partitions becomes an O(N^2) deal, as it has to do the full check once for every index we attach.

-- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Top-N sorts verses parallelism
On Fri, Dec 15, 2017 at 4:13 PM, Thomas Munro wrote:
> Looks right to me. Commit 3452dc52 just forgot to tell the planner.
> I'm pleased about that because it makes this a slam-dunk bug-fix and
> not some confusing hard-to-justify costing problem.

Jeff Janes inquired off-list about other places where cost_sort() gets called.

In costsize.c, it is called from initial_cost_mergejoin, which seems like it does not present an opportunity for pushing down limits, since we don't know how many rows we'll have to join to get a certain number of outputs.

In createplan.c, it is called only from label_sort_with_costsize(). That in turn is called from create_merge_append_plan(), passing the tuple limit from the path being converted to a plan, which seems unimpeachable. It's also called from make_unique_plan() and create_mergejoin_plan() with -1, which seems OK since in neither case do we know how many input rows we need to read.

In planner.c, it's called from plan_cluster_use_sort() with -1; CLUSTER has to read the whole input, so that's fine.

In prepunion.c, it's called from choose_hashed_setop() with -1. I think set operations also need to read the whole input.

In pathnode.c, it's called from create_merge_append_path, create_unique_path, create_gather_merge_path, create_groupingset_path, and create_sort_path. create_merge_append_path passes any limit applicable to the subpath. create_unique_path passes -1. create_gather_merge_path also passes -1, which as Jeff also pointed out seems possibly wrong. create_sort_path also passes -1, and so does create_groupingsets_path. I went through the callers of create_sort_path, and the only one that looks like it can pass a limit is the one you and Jeff already identified.

So I think the question is just whether create_gather_merge_path needs a similar fix.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Backfill bgworker Extension?
On Sat, Dec 16, 2017 at 8:31 AM Peter Eisentraut <peter.eisentr...@2ndquadrant.com> wrote:
> On 12/15/17 23:50, Jeremy Finzel wrote:
> > The common ground is some column in some table needs to be bulk updated.
> > I may not be explaining well, but in our environment we have done
> > hundreds of these using a generic framework to build a backfill. So I'm
> > not sure what you are questioning about the need? We have had to build a
> > worker to accomplish this because it can't be done as a sql script alone.
>
> I'm trying to identify the independently useful pieces in your use case.
> A background worker to backfill large tables is a very specific use
> case. If instead we had a job/scheduler mechanism and a way to have
> server-side scripts that can control transactions, then that might
> satisfy your requirements as well (I'm not sure), but it would also
> potentially address many other uses.

I follow you now. Yes, I think it probably would. I think it would at least provide a framework on which to build the tool I want. It would be great to have a "worker-capable" tool inside postgres rather than always having to write external logic to do things like this.

> > I'm not sure what you mean by a stored procedure in the background.
> > Since it would not be a single transaction, it doesn't fit as a stored
> > procedure, at least in Postgres where a function is 1 transaction.
>
> In progress: https://commitfest.postgresql.org/16/1360/

Looking forward to this. I think this will help. A stored procedure with subtransactions could have the logic for the backfill in it, but would still need an external worker that could retry it in case of failure, especially when things like a server restart happen.

Thanks, Jeremy
Re: [HACKERS] Something for the TODO list: deprecating abstime and friends
On Wed, Dec 13, 2017 at 12:05:06AM -0800, Andres Freund wrote:
> On 2017-07-15 18:00:34 -0400, Tom Lane wrote:
> > * contrib/spi/timetravel depends on abstime columns to represent what
> > would nowadays be better done as a tstzrange. I'd have thought we
> > could maybe toss that example module overboard, but we just today got
> > a question about its usage, so I'm afraid it's not quite dead yet.
> > What shall we do with that?
>
> Looking at the code I'd be pretty strongly inclined to scrap it.
>
> Before I'd discovered this thread, I'd started to write up a
> patch. Attached. It's clearly not fully done. Questions I'd while
> hacking things up:
> - what to do with contrib/spi/timetravel - I'd just removed it from the
>   relevant Makefile for now.

Porting it to tstzrange seems friendlier, however:
- Does that look like significant work to do this port?
- How about the work involved with upgrading existing installations?

> - nabstime.c currently implements timeofday() which imo is a pretty
>   weird function. I'd be quite inclined to remove it at the same time as
>   this.

+1 for binning it.

Best, David.

-- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
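For a sense of what such a port might involve, a timetravel-style table could migrate from two abstime columns to one tstzrange roughly like this; the column names (start_abs, stop_abs) and table name are hypothetical, since the real contrib/spi/timetravel discovers its columns through trigger arguments:

```sql
-- Sketch only: replace a pair of abstime columns with a range column.
ALTER TABLE t ADD COLUMN valid_during tstzrange;
UPDATE t SET valid_during =
    tstzrange(start_abs::timestamptz, stop_abs::timestamptz, '[)');
ALTER TABLE t DROP COLUMN start_abs, DROP COLUMN stop_abs;

-- A range column also allows an exclusion constraint to enforce
-- non-overlapping history per key (requires the btree_gist extension):
ALTER TABLE t ADD CONSTRAINT t_no_overlap
    EXCLUDE USING gist (id WITH =, valid_during WITH &&);
```

The trigger logic itself would then close a row's range on update instead of rewriting the stop_abs column, which is where most of the porting work would presumably be.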
Re: [sqlsmith] Parallel worker executor crash on master
On Sun, Dec 17, 2017 at 12:26 PM, Andreas Seltenreich wrote: > Thomas Munro writes: > >> On Sat, Dec 16, 2017 at 10:13 PM, Andreas Seltenreich >> wrote: >>> Core was generated by `postgres: smith regression [local] SELECT >>> '. >>> Program terminated with signal SIGSEGV, Segmentation fault. >>> #0 gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283 >>> 283 estate->es_query_dsa = >>> gatherstate->pei->area; >>> #1 ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216 >> >> Hmm, thanks. That's not good. Do we know if gatherstate->pei is >> NULL, or if it's somehow pointing to garbage? > > It was NULL on all the coredumps I looked into. Below[1] is a full > gatherstate. > >> Not sure how either of those things could happen, since we only set it >> to NULL in ExecShutdownGather() after which point we shouldn't call >> ExecGather() again, and any MemoryContext problems with pei should >> have caused problems already without this patch (for example in >> ExecParallelCleanup). Clearly I'm missing something. > > FWIW, all backtraces collected so far are identical for the first nine > frames. After ExecProjectSet, they are pretty random executor innards. Oh, right. We only create pei if (gather->num_workers > 0 && estate->es_use_parallel_mode), so I need to teach that patch that pei may be NULL. I'll go and post a new version like that over on the other thread. -- Thomas Munro http://www.enterprisedb.com
Re: Protect syscache from bloating with negative cache entries
On Sat, Dec 16, 2017 at 11:42 PM, Andres Freund wrote: >> I'm not sure we should regard very quick bloating as a problem in need >> of solving. Doesn't that just mean we need the cache to be bigger, at >> least temporarily? > > Leaving that aside, is that actually not at least to a good degree, > solved by that problem? By bumping the generation on hash resize, we > have recency information we can take into account. I agree that we can do it. I'm just not totally sure it's a good idea. I'm also not totally sure it's a bad idea, either. That's why I asked the question. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_(total_)relation_size and partitioned tables
On Sun, Dec 17, 2017 at 9:54 AM, David Rowley wrote: > I'd also vote to leave the relation_size functions alone. > > Perhaps it's worth thinking of changing pg_table_size() instead. We > have taken measures to try and hide the fact that a table is made up > of a bunch of partitions from the user in some cases, e.g DROP TABLE > works without CASCADE for a partitioned table. I'm sure there are > arguments in both directions here too though. Yeah, I don't really understand why changing pg_table_size() is any more desirable than changing pg_relation_size(). I mean, we could have a table-size function that takes an array of things you want to include (indexes, toast, partitions, etc), but changing the semantics of existing functions seems like it's going to be more painful than helpful (aside from the arguments I brought up before). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_(total_)relation_size and partitioned tables
On Sun, Dec 17, 2017 at 11:54 PM, David Rowley wrote:
> I'd also vote to leave the relation_size functions alone.

Count me in that bucket as well.

> Perhaps it's worth thinking of changing pg_table_size() instead. We
> have taken measures to try and hide the fact that a table is made up
> of a bunch of partitions from the user in some cases, e.g. DROP TABLE
> works without CASCADE for a partitioned table. I'm sure there are
> arguments in both directions here too though.
>
> I generally think of the difference between a relation and a table
> similarly to the difference between a tuple and a row. A relation is
> just what we use to implement tables, and there can be multiple
> relations per table (in the partitioning case), similar to how there
> can be multiple tuple versions for a single row. So that might back up
> that pg_table_size should include all relations that make up that
> table.

The barrier here is thin. What's proposed here is already doable with a WITH RECURSIVE query. So why not just document this query and be done with it, instead of complicating the code? It seems to me that the performance of calling pg_relation_size() in a cascading fashion would not matter much. Or one could invent an additional cascading option which scans inheritance and/or partition chains, or simply have a new function.

-- Michael
Re: worker_spi example BGW code GUC tweak
On Thu, Dec 14, 2017 at 5:41 PM, Chapman Flack wrote: > Would this sample code make an even better teaching example if it > used the existing GUC way to declare that worker_spi.naptime is > in units of seconds? > > Or does it not do that for some reason I've overlooked? Making it use GUC_UNIT_S seems like a good idea to me, but removing the mention of seconds from the description doesn't seem like a good idea to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: es_query_dsa is broken
On Thu, Dec 7, 2017 at 12:51 PM, Thomas Munro wrote:
> On Wed, Dec 6, 2017 at 10:16 PM, Amit Kapila wrote:
>> On Wed, Dec 6, 2017 at 1:14 AM, Robert Haas wrote:
>>> On Tue, Dec 5, 2017 at 7:24 AM, Amit Kapila wrote:
>>>>
>>>> +   EState *estate = gatherstate->ps.state;
>>>> +
>>>> +   /* Install our DSA area while executing the plan. */
>>>> +   estate->es_query_dsa = gatherstate->pei->area;
>>>>     outerTupleSlot = ExecProcNode(outerPlan);
>>>> +   estate->es_query_dsa = NULL;
>>>>
>>>> Won't the above coding pattern create a problem, if ExecProcNode
>>>> throws an error and outer block catches it and continues execution
>>>> (consider the case of execution inside PL blocks)?
>>>
>>> I don't see what the problem is. The query that got aborted by the
>>> error wouldn't be sharing an EState with one that didn't.
>>
>> That's right. Ignore my comment, I got confused. Other than that, I
>> don't see any problem with the code as such apart from that it looks
>> slightly hacky. I think Thomas or someone needs to develop a patch on
>> the lines you have mentioned or what Thomas was trying to describe in
>> his email and see how it comes out.

Andreas Seltenreich found a query[1] that suffers from this bug (thanks!), and Amit confirmed that this patch fixes it, but further sqlsmith fuzz testing with the patch applied also revealed that it failed to handle the case where pei is NULL because parallelism was inhibited. Here's a new version to fix that. Someone might prefer a coding with "if" rather than the ternary operator, but I didn't have a strong preference.

[1] https://www.postgresql.org/message-id/flat/CAEepm%3D2tfz1XNDcyU_a5ZiEaopz6%2BbBo9vQY%2BiJVLTtUVNztcQ%40mail.gmail.com

-- Thomas Munro http://www.enterprisedb.com fix-es_query_dsa-pg10-v3.patch Description: Binary data fix-es_query_dsa-pg11-v3.patch Description: Binary data
Re: worker_spi example BGW code GUC tweak
On 12/17/2017 07:32 PM, Robert Haas wrote:
> Making it use GUC_UNIT_S seems like a good idea to me, but removing
> the mention of seconds from the description doesn't seem like a good
> idea to me.

I took for my model a quick survey of existing GUCs that use GUC_UNIT_(MS|S|MIN) - most of which do not restate the time unit in the text description. (It's not a shut-out; some do, but only a handful.)

I think that makes sense, because once the GUC_UNIT_foo is specified, you get output like:

select current_setting('worker_spi.naptime');
 current_setting
-----------------
 10s

and, if you set it to, say, 120s or 180min, it will be displayed as 2min or 3h, etc., making 'seconds' in the text description a little redundant in the best case—when the current value is most naturally shown with s—and a little goofy in the other cases, where the value would be displayed with min, h, or d, and reading the value combined with the text description makes the snarky little voice in your head go "nap for 3 hours seconds??".

-Chap
Re: genomic locus
On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov wrote: > I need a data type to represent genomic positions, which will consist of a > string and a pair of integers with interval logic and access methods. Sort > of like my seg type, but more straightforward. > > I noticed somebody took a good care of seg while I was away for the last 20 > years, and I am extremely grateful for that. I have been using it. In the > meantime, things have changed and now I am almost clueless about how you > deal with contributed modules and what steps I should take once I get it to > work. Also, is it a good idea to clone and fix seg for this purpose, or is > there a more appropriate template? Or maybe just building it from scratch > will be a better idea? Have you thought about just using a composite type? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
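A composite type along the lines Robert suggests might look like this; the type and field names are just an illustration, not anything from the seg module:

```sql
CREATE TYPE locus AS (contig text, lo integer, hi integer);

CREATE TABLE features (id serial PRIMARY KEY, pos locus);
INSERT INTO features (pos) VALUES (ROW('chr1', 100, 200)::locus);

-- Interval logic has to be spelled out over the fields, e.g. an
-- overlap test against [150,250] on chr1:
SELECT * FROM features f
WHERE (f.pos).contig = 'chr1'
  AND (f.pos).lo <= 250
  AND (f.pos).hi >= 150;
```

The trade-off is that a plain composite type gets no custom operators or GiST/SP-GiST index support out of the box, which is presumably what a seg-style extension would still be wanted for.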
Re: worker_spi example BGW code GUC tweak
On Sun, Dec 17, 2017 at 8:30 PM, Chapman Flack wrote: > On 12/17/2017 07:32 PM, Robert Haas wrote: >> Making it use GUC_UNIT_S seems like a good idea to me, but removing >> the mention of seconds from the description doesn't seem like a good >> idea to me. > > I took for my model a quick survey of existing GUCs that use > GUC_UNIT_(MS|S|MIN) - most of which do not restate the time unit > in the text description. (It's not a shut-out; some do, but only > a handful.) > > I think that makes sense, because once the GUC_UNIT_foo is > specified, you get output like: > > select current_setting('worker_spi.naptime'); > current_setting > - > 10s > > and, if you set it for, say, 12ms or 180min, it will be > displayed as 2min or 3h, etc., making 'seconds' in the text > description a little redundant in the best case—when the > current value is most naturally shown with s—and a little > goofy in the other cases, where the value would be displayed > with min, h, or d, and reading the value combined with the text > description makes the snarky little voice in your head go > "nap for 3 hours seconds??". Well, you have a point, at that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Sun, Dec 17, 2017 at 10:50 AM, Alvaro Herrera wrote: >> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I >> think it might as well try to validate while it's at it. But if not >> then we might want to go with #2. > > The problem I have with it is that restoring a dump containing indexes > on partitions becomes a O(N^2) deal as it has to do the full check once > for every index we attach. Sure. If the constant factor is high enough to matter, then VALIDATE makes sense. IMHO, anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Sun, Dec 17, 2017 at 5:29 AM, David Rowley wrote: >> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I >> think it might as well try to validate while it's at it. But if not >> then we might want to go with #2. > > I'm now not that clear on what the behaviour is if the ONLY keyword is > not specified on the CREATE INDEX for the partitioned index. Does that > go and create each leaf partition index regardless of if there is a > suitable candidate to ATTACH? No, the other way around. ONLY is being proposed as a way to create an initially-not-valid parent to which we can then ATTACH subsequently-created child indexes. But because we will have REPLACE rather than DETACH, once you get the index valid it never goes back to not-valid. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Using ProcSignal to get memory context stats from a running backend
On Thu, Dec 14, 2017 at 9:34 PM, Craig Ringer wrote: > On 15 December 2017 at 09:24, Greg Stark wrote: >> Another simpler option would be to open up a new file in the log >> directory > > ... if we have one. > > We might be logging to syslog, or whatever else. > > I'd rather keep it simple(ish). +1. I still think just printing it to the log is fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 18 December 2017 at 15:04, Robert Haas wrote: > On Sun, Dec 17, 2017 at 5:29 AM, David Rowley > wrote: >> I'm now not that clear on what the behaviour is if the ONLY keyword is >> not specified on the CREATE INDEX for the partitioned index. Does that >> go and create each leaf partition index regardless of if there is a >> suitable candidate to ATTACH? > > No, the other way around. ONLY is being proposed as a way to create > an initially-not-valid parent to which we can then ATTACH > subsequently-created child indexes. But because we will have REPLACE > rather than DETACH, once you get the index valid it never goes back to > not-valid. I understand what the ONLY is proposed to do. My question was in regards to the behaviour without ONLY. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: pg_(total_)relation_size and partitioned tables
On Mon, Dec 18, 2017 at 9:29 AM, Michael Paquier wrote:
> The barrier here is thin. What's proposed here is already doable with
> a WITH RECURSIVE query. So why not just document this query and be
> done with it instead of complicating the code? It seems to me that the
> performance of calling pg_relation_size() in a cascading fashion
> would not matter much. Or one could invent an additional cascading
> option which scans inheritance and/or partition chains, or simply have
> a new function.

I just blogged on the matter, and here is one possibility compatible with v10:

WITH RECURSIVE partition_info
  (relid, relname, relsize, relispartition, relkind) AS (
    SELECT oid AS relid,
           relname,
           pg_relation_size(oid) AS relsize,
           relispartition,
           relkind
    FROM pg_catalog.pg_class
    WHERE relname = 'parent_name' AND relkind = 'p'
  UNION ALL
    SELECT c.oid AS relid,
           c.relname AS relname,
           pg_relation_size(c.oid) AS relsize,
           c.relispartition AS relispartition,
           c.relkind AS relkind
    FROM partition_info AS p,
         pg_catalog.pg_inherits AS i,
         pg_catalog.pg_class AS c
    WHERE p.relid = i.inhparent
      AND c.oid = i.inhrelid
      AND c.relispartition
  )
SELECT * FROM partition_info;

This is not really straightforward. You could as well have the pg_relation_size call in the outer query.

-- Michael
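For a single pg_total_relation_size-style figure over the whole partition tree, the recursive walk can be wrapped in an aggregate; this is a sketch in the same spirit as Michael's query, using the same hypothetical parent name:

```sql
WITH RECURSIVE partition_tree (relid) AS (
    SELECT oid
    FROM pg_catalog.pg_class
    WHERE relname = 'parent_name' AND relkind = 'p'
  UNION ALL
    SELECT i.inhrelid
    FROM partition_tree AS p
    JOIN pg_catalog.pg_inherits AS i ON i.inhparent = p.relid
)
SELECT pg_size_pretty(sum(pg_total_relation_size(relid)))
FROM partition_tree;
```

Since the partitioned parent itself has no storage, including it in the sum should not double-count anything, which sidesteps the multiple-counting concern Robert raised upthread.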
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Sun, Dec 17, 2017 at 9:38 PM, David Rowley wrote: > On 18 December 2017 at 15:04, Robert Haas wrote: >> On Sun, Dec 17, 2017 at 5:29 AM, David Rowley >> wrote: >>> I'm now not that clear on what the behaviour is if the ONLY keyword is >>> not specified on the CREATE INDEX for the partitioned index. Does that >>> go and create each leaf partition index regardless of if there is a >>> suitable candidate to ATTACH? >> >> No, the other way around. ONLY is being proposed as a way to create >> an initially-not-valid parent to which we can then ATTACH >> subsequently-created child indexes. But because we will have REPLACE >> rather than DETACH, once you get the index valid it never goes back to >> not-valid. > > I understand what the ONLY is proposed to do. My question was in > regards to the behaviour without ONLY. Oh, sorry -- I was confused. I'm not sure whether that should try to attach to something if it exists, or just create unconditionally... what do you think? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 18 December 2017 at 15:59, Robert Haas wrote: > On Sun, Dec 17, 2017 at 9:38 PM, David Rowley > wrote: >> On 18 December 2017 at 15:04, Robert Haas wrote: >>> On Sun, Dec 17, 2017 at 5:29 AM, David Rowley >>> wrote: I'm now not that clear on what the behaviour is if the ONLY keyword is not specified on the CREATE INDEX for the partitioned index. Does that go and create each leaf partition index regardless of if there is a suitable candidate to ATTACH? >>> >>> No, the other way around. ONLY is being proposed as a way to create >>> an initially-not-valid parent to which we can then ATTACH >>> subsequently-created child indexes. But because we will have REPLACE >>> rather than DETACH, once you get the index valid it never goes back to >>> not-valid. >> >> I understand what the ONLY is proposed to do. My question was in >> regards to the behaviour without ONLY. > > Oh, sorry -- I was confused. I'm not sure whether that should try to > attach to something if it exists, or just create unconditionally... > what do you think? I think you feel quite strongly about not having the code select a random matching index, so if we want to stick to that rule, then we'll need to create a set of new leaf indexes rather than select a random one. If we don't do that, then we might as well go with my idea and ditch the ONLY syntax and have the CREATE INDEX try to find a suitable leaf index, or create a new leaf index if one cannot be found. Would it be surprising to users if CREATE INDEX ON partitioned_table created a bunch of duplicate new indexes on the leaf tables? -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Sun, Dec 17, 2017 at 10:05 PM, David Rowley wrote: > I think you feel quite strongly about not having the code select a > random matching index, so if we want to stick to that rule, then we'll > need to create a set of new leaf indexes rather than select a random > one. I feel quite strongly about it *in the context of pg_dump*. Outside of that scenario, I'm happy to go with whatever behavior you and others think will satisfy the POLA. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.
On Thu, Dec 14, 2017 at 7:51 PM, Masahiko Sawada wrote: > +1 from me. Works for me, too, although I still don't really follow how it's happening in the present coding. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.
On Mon, Dec 18, 2017 at 12:14 PM, Robert Haas wrote: > On Thu, Dec 14, 2017 at 7:51 PM, Masahiko Sawada > wrote: >> +1 from me. > > Works for me, too, although I still don't really follow how it's > happening in the present coding. Craig has mentioned at least one way upthread: https://www.postgresql.org/message-id/CAMsr+YHj4asAu8dJ1Q5hiW+wxL2Jd1=uxo52dvxhmxive-x...@mail.gmail.com And that's possible when building with LOCK_DEBUG with trace_lwlocks enabled, at least. So I would rather not rely on the assumption that CHECK_FOR_INTERRUPTS() is a code path that is never taken for the cases discussed here. -- Michael
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Sun, Dec 17, 2017 at 12:27 PM, Robert Haas wrote: > On Thu, Dec 14, 2017 at 5:45 AM, Masahiko Sawada > wrote: >> Here is the result. >> I've measured the throughput in several cases on my virtual machine. >> Each client loads a 48kB file into its own relation, located on >> either an xfs or an ext4 filesystem, for 30 sec. >> >> Case 1: COPYs to relations on different filesystems (xfs and ext4) and >> N_RELEXTLOCK_ENTS is 1024 >> >> clients = 2, avg = 296.2068 >> clients = 5, avg = 372.0707 >> clients = 10, avg = 389.8850 >> clients = 50, avg = 428.8050 >> >> Case 2: COPYs to relations on different filesystems (xfs and ext4) and >> N_RELEXTLOCK_ENTS is 1 >> >> clients = 2, avg = 294.3633 >> clients = 5, avg = 358.9364 >> clients = 10, avg = 383.6945 >> clients = 50, avg = 424.3687 >> >> And the result of current HEAD is the following. >> >> clients = 2, avg = 284.9976 >> clients = 5, avg = 356.1726 >> clients = 10, avg = 375.9856 >> clients = 50, avg = 429.5745 >> >> In case 2, the throughput decreased compared to case 1, but it seems >> to be almost the same as current HEAD. Because acquiring and >> releasing the extension lock got 10x faster than current HEAD, as I >> mentioned before, the performance degradation may be smaller than I >> expected even in case 2. >> Since my machine doesn't have enough resources, the result for clients = >> 50 might not be valid. > > I have to admit that result is surprising to me. > I think the environment I used for performance measurement did not have enough resources. I will do the same benchmark on another environment to see whether it was a valid result, and will share it. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: pg_(total_)relation_size and partitioned tables
Thanks all for your thoughts. I agree with Robert's point, which both David and Michael seem to share, that we shouldn't really be changing what pg_relation_size() is doing under the covers. And I guess the same for pg_table_size(), too. Both of those functions and their siblings work with relations that possess on-disk structures and have associated relations (TOAST, indexes) that in turn possess on-disk structures. It seems quite clearly documented as such. Partitioned tables are different in that they neither possess on-disk structures nor have any relations (TOAST, indexes) associated directly with them. Instead, they have partitions that are the relations that the aforementioned dbsize.c functions are familiar with. So, I withdraw the patch I originally posted in favor of some other approach. Reply continues below... On 2017/12/18 11:51, Michael Paquier wrote: > On Mon, Dec 18, 2017 at 9:29 AM, Michael Paquier > wrote: >> The barrier here is thin. What's proposed here is already doable with >> a WITH RECURSIVE query. So why not just documenting this query and be >> done with it instead of complicating the code? It seems to me that the >> performance in calling pg_relation_size() in a cascading times fashion >> would not matter much. Or one could invent an additional cascading >> option which scans inheritance and/or partition chains, or simply have >> a new function.
>
> I just blogged on the matter, and here is one possibility here
> compatible with v10:
>
> WITH RECURSIVE partition_info
>   (relid, relname, relsize, relispartition, relkind) AS (
>     SELECT oid AS relid,
>            relname,
>            pg_relation_size(oid) AS relsize,
>            relispartition,
>            relkind
>     FROM pg_catalog.pg_class
>     WHERE relname = 'parent_name' AND
>           relkind = 'p'
>     UNION ALL
>     SELECT c.oid AS relid,
>            c.relname AS relname,
>            pg_relation_size(c.oid) AS relsize,
>            c.relispartition AS relispartition,
>            c.relkind AS relkind
>     FROM partition_info AS p,
>          pg_catalog.pg_inherits AS i,
>          pg_catalog.pg_class AS c
>     WHERE p.relid = i.inhparent AND
>           c.oid = i.inhrelid AND
>           c.relispartition
>   )
> SELECT * FROM partition_info;
>
> This is not really straight-forward. You could as well have the
> pg_relation_size call in the outer query.

Thanks Michael for coming up with that. Do you (and/or others) think that's something that we can wrap inside a built-in function(s), that is, one defined in system_views.sql? Or if we decide to have new functions, say, pg_get_partitions() and/or pg_get_partition_sizes(), we might as well implement them as C functions inside dbsize.c. If so, do we want to have "partition" variants of all the *_size() functions, viz. pg_relation_size(), pg_total_relation_size(), pg_indexes_size(), and pg_table_size()? Thanks, Amit
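If we went the system_views.sql route, one way to wrap the query above might look like the following. This is only a sketch, under the assumption that pg_inherits links partitions to their parents as in v10; the function name is hypothetical, not an existing API:

```sql
-- Hypothetical wrapper (name illustrative only): total on-disk size of a
-- relation plus all of its partitions, recursively.
CREATE FUNCTION pg_partition_tree_size(regclass) RETURNS bigint
LANGUAGE sql STABLE AS $$
  WITH RECURSIVE parts AS (
    SELECT $1::oid AS relid
    UNION ALL
    SELECT i.inhrelid
    FROM parts p
    JOIN pg_catalog.pg_inherits i ON i.inhparent = p.relid
  )
  SELECT sum(pg_relation_size(relid))::bigint FROM parts;
$$;
```

Note that a variant summing pg_total_relation_size() over every node, intermediate levels included, would count partitions multiple times, which echoes the double-counting concern raised elsewhere in this thread.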
Re: pg_(total_)relation_size and partitioned tables
On Mon, Dec 18, 2017 at 2:17 PM, Amit Langote wrote: > Do you (and/or others) think that's something that we can wrap inside a > built-in function(s), that is, one defined in system_views.sql? Or if we > decide to have new functions, say, pg_get_partitions() and/or > pg_get_partition_sizes(), we might as well implement them as C functions > inside dbsize.c. If so, do we want to have "partition" variants of > all *_size() functions viz. pg_relation_size(), pg_total_relation_size(), > pg_indexes_size(), and pg_table_size()? I can see value in something like what Robert is proposing upthread, by having a function one is able to customize to decide what to include in the calculation. There could be options for at least: - partitions, or relation cascading. - indexes. - toast tables. - visibility maps. - FSM. The first three are the ones Robert mentioned; still, I would see the last two as interesting things to have as options. Or we could have just an SRF that returns one row per object scanned. -- Michael
Re: [HACKERS] replace GrantObjectType with ObjectType
On Sat, Dec 16, 2017 at 12:40 AM, Robert Haas wrote: > On Fri, Dec 15, 2017 at 12:40 PM, Peter Eisentraut > wrote: > > On 12/14/17 22:59, Rushabh Lathia wrote: > >> I noted that no_priv_msg and not_owner_msg array been removed > >> and code fitted the code into aclcheck_error(). Actually that > >> makes the code more complex then what it used to be. I would > >> prefer the array rather then code been fitted into the function. > > > > There is an argument for having a big array versus the switch/case > > approach. But most existing code around object addresses uses the > > switch/case approach, so it's better to align it that way, I think. > > It's weird to have to maintain two different styles. My only motivation is that the earlier approach looks cleaner. Also, the patch is getting bigger, so continuing with the old approach would make review easier. If the switch/case approach is the way to go, it can be done as part of a separate cleanup patch. Thanks, Rushabh Lathia
Small typo in comment in json_agg_transfn
The attached fixes a small typo in json_agg_transfn. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services json_agg_transfn_comment_fix.patch Description: Binary data
Parallel Aggregates for string_agg and array_agg
Hi, While working on partial aggregation a few years ago, I didn't really think it was worthwhile allowing partial aggregation of string_agg and array_agg. I soon realised that I was wrong about that: allowing parallelisation of these aggregates could still be very useful when many rows are filtered out during the scan. Some benchmarks that I've done locally show that parallel string_agg and array_agg do actually perform better, despite the fact that the aggregate state grows linearly with each aggregated item. Obviously, the performance will get even better when workers are filtering out rows before aggregation takes place, so it seems worthwhile doing this. However, the main reason that I'm motivated to do this is that there are more uses for partial aggregation other than just parallel aggregation, and it seems a shame to disable all these features if a single aggregate does not support partial mode. I've attached a patch which implements all this. I've had most of it stashed away for a while now, but I managed to get some time this weekend to get it into a more completed state. Things are now looking pretty good for the number of aggregates that support partial mode. Just a handful of aggregates now don't support partial aggregation;

postgres=# select aggfnoid from pg_aggregate where aggcombinefn=0 and aggkind='n';
     aggfnoid
------------------
 xmlagg
 json_agg
 json_object_agg
 jsonb_agg
 jsonb_object_agg
(5 rows)

... and a good number do support it;

postgres=# select count(*) from pg_aggregate where aggcombinefn<>0 and aggkind='n';
 count
-------
   122
(1 row)

There's probably no reason why the last 5 of those couldn't be done either; it might just require shifting a bit more work into the final functions, although I'm not planning on that for this patch. As for the patch: there's a bit of a quirk in the implementation of string_agg.
We previously always threw away the delimiter that belongs to the first aggregated value, but we do now need to keep that around so we can put it in between two states in the combine function. I decided the path of least resistance to do this was just to borrow StringInfo's cursor variable to use as a pointer to the state of the first value and put the first delimiter before that. Both the string_agg(text) and string_agg(bytea) already have a final function, so we just need to skip over the bytes up until the cursor position to get rid of the first delimiter. I could go and invent some new state type to do the same, but I don't really see the trouble with what I've done with StringInfo, but I'll certainly listen if someone else thinks this is wrong. Another thing that I might review later about this is seeing about getting rid of some of the code duplication between array_agg_array_combine and accumArrayResultArr. I'm going to add this to PG11's final commitfest rather than the January 'fest as it seems more like a final commitfest type of patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services combinefn_for_string_and_array_aggs_v1.patch Description: Binary data
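A quick way to see why that first delimiter has to be kept around: when two partial states are combined, the separator between them is exactly the first-value delimiter of the second state, so the combined bytes must reproduce the serial result. An illustrative serial check only (the partial/combine path isn't directly invocable from SQL):

```sql
-- Combining partial states over {a,b} and {c,d} must equal the serial
-- aggregate: the second state's retained leading delimiter goes in between.
SELECT string_agg(x, ',' ORDER BY x) = 'a,b' || ',' || 'c,d' AS delimiters_match
FROM (VALUES ('a'), ('b'), ('c'), ('d')) t(x);
```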
Re: pg_(total_)relation_size and partitioned tables
On 17 December 2017 at 16:24, Robert Haas wrote: > On Thu, Dec 14, 2017 at 12:23 AM, Amit Langote > wrote: >> You may have guessed from $subject that the two don't work together. > > It works exactly as documented: > > pg_total_relation_size(regclass) - Total disk space used by the > specified table, including all indexes and TOAST data > > It says nothing about including partitions. If we change this, then > we certainly need to update the documentation (that might be a good > idea if we decide not to update this). > > Personally, I'm -1 on including partitions, because then you can no > longer expect that the sum of pg_total_relation_size(regclass) across > all relations in the database will equal the size of the database > itself. Partitions will be counted a number of times equal to their > depth in the partitioning hierarchy. However, I understand that I > might get outvoted. I'd also vote to leave the relation_size functions alone. Perhaps it's worth thinking of changing pg_table_size() instead. We have taken measures to try and hide the fact that a table is made up of a bunch of partitions from the user in some cases, e.g DROP TABLE works without CASCADE for a partitioned table. I'm sure there are arguments in both directions here too though. I generally think of the difference between a relation and a table similarly to the difference between a tuple and a row. A relation is just what we use to implement tables, and there can be multiple relations per table (in the partitioning case), similar to how there can be multiple tuple versions for a single row. So that might back up that pg_table_size should include all relations that make up that table. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Why does array_position_common bitwise NOT an Oid type?
On 17 December 2017 at 14:53, Tom Lane wrote: > David Rowley writes: >> I was puzzled to see the following code: > >> my_extra->element_type = ~element_type; > > If memory serves, the idea was to force the subsequent datatype-lookup > path to be taken, even if for some reason element_type is InvalidOid. > If we take the lookup path then the bogus element_type will be detected > and reported; if we don't, it won't be. That makes sense. I'd just have expected more documentation on that. Although, perhaps I just didn't look hard enough. I did fail to notice the fact that the same thing does occur over and over when I sent this originally. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposal: Local indexes for partitioned table
Robert Haas wrote: > On Fri, Dec 15, 2017 at 5:18 PM, Alvaro Herrera > wrote: > > We have two options for marking valid: > > > > 1. after each ALTER INDEX ATTACH, verify whether the set of partitions > > that contain the index is complete; if so, mark it valid, otherwise do > > nothing. This sucks because we have to check that over and over for > > every index that we attach > > > > 2. We invent yet another command, say > > ALTER INDEX VALIDATE > > If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I > think it might as well try to validate while it's at it. But if not > then we might want to go with #2. The problem I have with it is that restoring a dump containing indexes on partitions becomes a O(N^2) deal as it has to do the full check once for every index we attach. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
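For reference, the per-ATTACH check being costed here is essentially a catalog count. A rough sketch, under the assumption that attached child indexes are recorded in pg_inherits as in the proposed patch (index name illustrative):

```sql
-- Hypothetical validity check: the parent index can be marked valid once the
-- number of valid attached child indexes equals the number of leaf partitions.
SELECT count(*)
FROM pg_catalog.pg_inherits i
JOIN pg_catalog.pg_index x ON x.indexrelid = i.inhrelid
WHERE i.inhparent = 'parent_index'::regclass
  AND x.indisvalid;
```

Running such a check once per attached index during a restore is what produces the O(N^2) behavior described above.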
Re: Top-N sorts verses parallelism
On Fri, Dec 15, 2017 at 4:13 PM, Thomas Munro wrote: > Looks right to me. Commit 3452dc52 just forgot to tell the planner. > I'm pleased about that because it makes this a slam-dunk bug-fix and > not some confusing hard to justify costing problem. Jeff Janes inquired off-list about other places where cost_sort() gets called. In costsize.c, it is called from initial_cost_mergejoin, which seems like it does not present an opportunity for pushing down limits, since we don't know how many rows we'll have to join to get a certain number of outputs. In createplan.c, it is called only from label_sort_with_costsize(). That in turn is called from create_merge_append_plan(), passing the tuple limit from the path being converted to a plan, which seems unimpeachable. It's also called from make_unique_plan() and create_mergejoin_plan() with -1, which seems OK since in neither case do we know how many input rows we need to read. In planner.c, it's called from plan_cluster_use_sort() with -1; CLUSTER has to read the whole input, so that's fine. In prepunion.c, it's called from choose_hashed_setop() with -1. I think set operations also need to read the whole input. In pathnode.c, it's called from create_merge_append_path, create_unique_path, create_gather_merge_path, create_groupingset_path, and create_sort_path. create_merge_append_path passes any limit applicable to the subpath. create_unique_path passes -1. create_gather_merge_path also passes -1, which as Jeff also pointed out seems possibly wrong. create_sort_path also passes -1, and so does create_groupingsets_path. I went through the callers to create_sort_path and the only one that looks like it can pass a limit is the one you and Jeff already identified. So I think the question is just whether create_gather_merge_path needs a similar fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Backfill bgworker Extension?
On Sat, Dec 16, 2017 at 8:31 AM Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > On 12/15/17 23:50, Jeremy Finzel wrote: > > The common ground is some column in some table needs to be bulk updated. > > I may not be explaining well, but in our environment we have done > > hundreds of these using a generic framework to build a backfill. So I’m > > not sure what you are questioning about the need? We have had to build a > > worker to accomplish this because it can’t be done as a sql script alone. > > I'm trying to identify the independently useful pieces in your use case. > A background worker to backfill large tables is a very specific use > case. If instead we had a job/scheduler mechanism and a way to have > server-side scripts that can control transactions, then that might > satisfy your requirements as well (I'm not sure), but it would also > potentially address many other uses. I follow you now. Yes, I think it probably would. I think it would at least provide a framework on which to build the tool I want. It would be great to have a "worker-capable" tool inside postgres rather than always having to write external logic to do things like this. > I’m not sure what you mean by a stored procedure in the background. > > Since it would not be a single transaction, it doesn’t fit as a stored > > procedure at least in Postgres when a function is 1 transaction. > > In progress: https://commitfest.postgresql.org/16/1360/ Looking forward to this. I think this will help. A stored procedure with subtransactions could have the logic for the backfill in it, but would still need an external worker that could retry it in case of failure, especially when things like a server restart happen. Thanks, Jeremy
Re: [HACKERS] Something for the TODO list: deprecating abstime and friends
On Wed, Dec 13, 2017 at 12:05:06AM -0800, Andres Freund wrote: > On 2017-07-15 18:00:34 -0400, Tom Lane wrote: > > * contrib/spi/timetravel depends on abstime columns to represent what > > would nowadays be better done as a tstzrange. I'd have thought we > > could maybe toss that example module overboard, but we just today got > > a question about its usage, so I'm afraid it's not quite dead yet. > > What shall we do with that? > > Looking at the code I'd be pretty strongly inclined to scrap it. > > > Before I'd discovered this thread, I'd started to write up a > patch. Attached. It's clearly not fully done. Questions I'd while > hacking things up: > - what to do with contrib/spi/timetravel - I'd just removed it from the > relevant Makefile for now. Porting it to tstzrange seems friendlier, however: - Does that look like significant work to do this port? - How about the work involved with upgrading existing installations? > - nabstime.c currently implements timeofday() which imo is a pretty > weird function. I'd be quite inclined to remove it at the same time as > this. +1 for binning it. Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
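As for what the tstzrange port could look like, here is a minimal sketch (table and column names hypothetical; contrib/spi/timetravel itself works with paired abstime start/stop columns):

```sql
-- Hypothetical replacement for timetravel's paired abstime columns:
-- a single tstzrange, open-ended while the row version is current.
CREATE TABLE versioned (
  id    integer,
  data  text,
  valid tstzrange NOT NULL DEFAULT tstzrange(now(), NULL)
);

-- Rows valid at a given instant:
SELECT * FROM versioned
WHERE valid @> '2017-12-13 00:00+00'::timestamptz;
```

The trigger logic that closes the old range and opens a new one on UPDATE is the part that would actually need porting; the above only shows the schema side.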
Re: [sqlsmith] Parallel worker executor crash on master
On Sun, Dec 17, 2017 at 12:26 PM, Andreas Seltenreich wrote: > Thomas Munro writes: > >> On Sat, Dec 16, 2017 at 10:13 PM, Andreas Seltenreich >> wrote: >>> Core was generated by `postgres: smith regression [local] SELECT >>> '. >>> Program terminated with signal SIGSEGV, Segmentation fault. >>> #0 gather_getnext (gatherstate=0x555a5fff1350) at nodeGather.c:283 >>> 283 estate->es_query_dsa = >>> gatherstate->pei->area; >>> #1 ExecGather (pstate=0x555a5fff1350) at nodeGather.c:216 >> >> Hmm, thanks. That's not good. Do we know if gatherstate->pei is >> NULL, or if it's somehow pointing to garbage? > > It was NULL on all the coredumps I looked into. Below[1] is a full > gatherstate. > >> Not sure how either of those things could happen, since we only set it >> to NULL in ExecShutdownGather() after which point we shouldn't call >> ExecGather() again, and any MemoryContext problems with pei should >> have caused problems already without this patch (for example in >> ExecParallelCleanup). Clearly I'm missing something. > > FWIW, all backtraces collected so far are identical for the first nine > frames. After ExecProjectSet, they are pretty random executor innards. Oh, right. We only create pei if (gather->num_workers > 0 && estate->es_use_parallel_mode), so I need to teach that patch that pei may be NULL. I'll go and post a new version like that over on the other thread. -- Thomas Munro http://www.enterprisedb.com
Re: Protect syscache from bloating with negative cache entries
On Sat, Dec 16, 2017 at 11:42 PM, Andres Freund wrote: >> I'm not sure we should regard very quick bloating as a problem in need >> of solving. Doesn't that just mean we need the cache to be bigger, at >> least temporarily? > > Leaving that aside, is that actually not at least to a good degree, > solved by that problem? By bumping the generation on hash resize, we > have recency information we can take into account. I agree that we can do it. I'm just not totally sure it's a good idea. I'm also not totally sure it's a bad idea, either. That's why I asked the question. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_(total_)relation_size and partitioned tables
On Sun, Dec 17, 2017 at 9:54 AM, David Rowley wrote: > I'd also vote to leave the relation_size functions alone. > > Perhaps it's worth thinking of changing pg_table_size() instead. We > have taken measures to try and hide the fact that a table is made up > of a bunch of partitions from the user in some cases, e.g DROP TABLE > works without CASCADE for a partitioned table. I'm sure there are > arguments in both directions here too though. Yeah, I don't really understand why changing pg_table_size() is any more desirable than changing pg_relation_size(). I mean, we could have a table-size function that takes an array of things you want to include (indexes, toast, partitions, etc), but changing the semantics of existing functions seems like it's going to be more painful than helpful (aside from the arguments I brought up before). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pg_(total_)relation_size and partitioned tables
On Sun, Dec 17, 2017 at 11:54 PM, David Rowley wrote: > I'd also vote to leave the relation_size functions alone. Count me in that bucket as well. > Perhaps it's worth thinking of changing pg_table_size() instead. We > have taken measures to try and hide the fact that a table is made up > of a bunch of partitions from the user in some cases, e.g DROP TABLE > works without CASCADE for a partitioned table. I'm sure there are > arguments in both directions here too though. > > I generally think of the difference between a relation and a table > similarly to the difference between a tuple and a row. A relation is > just what we use to implement tables, and there can be multiple > relations per table (in the partitioning case), similar to how there > can be multiple tuple versions for a single row. So that might back up > that pg_table_size should include all relations that make up that > table. The barrier here is thin. What's proposed here is already doable with a WITH RECURSIVE query. So why not just document this query and be done with it, instead of complicating the code? It seems to me that the performance cost of calling pg_relation_size() multiple times in a cascading fashion would not matter much. Or one could invent an additional cascading option which scans inheritance and/or partition chains, or simply have a new function. -- Michael
Re: worker_spi example BGW code GUC tweak
On Thu, Dec 14, 2017 at 5:41 PM, Chapman Flack wrote: > Would this sample code make an even better teaching example if it > used the existing GUC way to declare that worker_spi.naptime is > in units of seconds? > > Or does it not do that for some reason I've overlooked? Making it use GUC_UNIT_S seems like a good idea to me, but removing the mention of seconds from the description doesn't seem like a good idea to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: es_query_dsa is broken
On Thu, Dec 7, 2017 at 12:51 PM, Thomas Munro wrote: > On Wed, Dec 6, 2017 at 10:16 PM, Amit Kapila wrote: >> On Wed, Dec 6, 2017 at 1:14 AM, Robert Haas wrote: >>> On Tue, Dec 5, 2017 at 7:24 AM, Amit Kapila wrote: + EState *estate = gatherstate->ps.state; + + /* Install our DSA area while executing the plan. */ + estate->es_query_dsa = gatherstate->pei->area; outerTupleSlot = ExecProcNode(outerPlan); + estate->es_query_dsa = NULL; Won't the above coding pattern create a problem, if ExecProcNode throws an error and outer block catches it and continues execution (consider the case of execution inside PL blocks)? >>> >>> I don't see what the problem is. The query that got aborted by the >>> error wouldn't be sharing an EState with one that didn't. >> >> That's right. Ignore my comment, I got confused. Other than that, I >> don't see any problem with the code as such apart from that it looks >> slightly hacky. I think Thomas or someone needs to develop a patch on >> the lines you have mentioned or what Thomas was trying to describe in >> his email and see how it comes out. Andreas Seltenreich found a query[1] that suffers from this bug (thanks!), and Amit confirmed that this patch fixes it, but further sqlsmith fuzz testing with the patch applied also revealed that it failed to handle the case where pei is NULL because parallelism was inhibited. Here's a new version to fix that. Someone might prefer a coding with "if" rather than the ternary operator, but I didn't have a strong preference. [1] https://www.postgresql.org/message-id/flat/CAEepm%3D2tfz1XNDcyU_a5ZiEaopz6%2BbBo9vQY%2BiJVLTtUVNztcQ%40mail.gmail.com -- Thomas Munro http://www.enterprisedb.com fix-es_query_dsa-pg10-v3.patch Description: Binary data fix-es_query_dsa-pg11-v3.patch Description: Binary data
Re: worker_spi example BGW code GUC tweak
On 12/17/2017 07:32 PM, Robert Haas wrote: > Making it use GUC_UNIT_S seems like a good idea to me, but removing > the mention of seconds from the description doesn't seem like a good > idea to me. I took for my model a quick survey of existing GUCs that use GUC_UNIT_(MS|S|MIN) - most of which do not restate the time unit in the text description. (It's not a shut-out; some do, but only a handful.) I think that makes sense, because once the GUC_UNIT_foo is specified, you get output like: select current_setting('worker_spi.naptime'); current_setting - 10s and, if you set it for, say, 12ms or 180min, it will be displayed as 2min or 3h, etc., making 'seconds' in the text description a little redundant in the best case—when the current value is most naturally shown with s—and a little goofy in the other cases, where the value would be displayed with min, h, or d, and reading the value combined with the text description makes the snarky little voice in your head go "nap for 3 hours seconds??". -Chap
Re: genomic locus
On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov wrote: > I need a data type to represent genomic positions, which will consist of a > string and a pair of integers with interval logic and access methods. Sort > of like my seg type, but more straightforward. > > I noticed somebody took a good care of seg while I was away for the last 20 > years, and I am extremely grateful for that. I have been using it. In the > meantime, things have changed and now I am almost clueless about how you > deal with contributed modules and what steps I should take once I get it to > work. Also, is it a good idea to clone and fix seg for this purpose, or is > there a more appropriate template? Or maybe just building it from scratch > will be a better idea? Have you thought about just using a composite type? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
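For completeness, the composite-type route Robert suggests would look roughly like this (type and field names are hypothetical, purely for illustration):

```sql
-- A minimal sketch of a genomic position as a composite type:
-- a contig/chromosome name plus a pair of integer coordinates.
CREATE TYPE locus AS (contig text, lo integer, hi integer);

-- Construction and field access:
SELECT (ROW('chr1', 100, 200)::locus).contig;
```

A bare composite type would not provide the interval logic or index access methods that seg offers, which is presumably why a dedicated type is being considered at all.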
Re: worker_spi example BGW code GUC tweak
On Sun, Dec 17, 2017 at 8:30 PM, Chapman Flack wrote:
> On 12/17/2017 07:32 PM, Robert Haas wrote:
>> Making it use GUC_UNIT_S seems like a good idea to me, but removing
>> the mention of seconds from the description doesn't seem like a good
>> idea to me.
>
> I took for my model a quick survey of existing GUCs that use
> GUC_UNIT_(MS|S|MIN) - most of which do not restate the time unit
> in the text description. (It's not a shut-out; some do, but only
> a handful.)
>
> I think that makes sense, because once the GUC_UNIT_foo is
> specified, you get output like:
>
> select current_setting('worker_spi.naptime');
>  current_setting
> -----------------
>  10s
>
> and, if you set it to, say, 120s or 180min, it will be
> displayed as 2min or 3h, etc., making 'seconds' in the text
> description a little redundant in the best case—when the
> current value is most naturally shown with s—and a little
> goofy in the other cases, where the value would be displayed
> with min, h, or d, and reading the value combined with the text
> description makes the snarky little voice in your head go
> "nap for 3 hours seconds??".

Well, you have a point, at that.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
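The unit conversion being discussed is easy to see interactively. A small sketch, assuming the worker_spi extension is loaded and its naptime GUC is declared with GUC_UNIT_S as proposed:

```sql
-- Values given in any time unit are normalized to seconds internally,
-- then displayed using the largest unit that divides the value exactly:
SET worker_spi.naptime = '180min';
SHOW worker_spi.naptime;    -- displayed as 3h

SET worker_spi.naptime = '120s';
SHOW worker_spi.naptime;    -- displayed as 2min
```

This is why restating the unit in the GUC's text description can read oddly: the displayed unit varies with the current value.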
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Sun, Dec 17, 2017 at 10:50 AM, Alvaro Herrera wrote: >> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I >> think it might as well try to validate while it's at it. But if not >> then we might want to go with #2. > > The problem I have with it is that restoring a dump containing indexes > on partitions becomes a O(N^2) deal as it has to do the full check once > for every index we attach. Sure. If the constant factor is high enough to matter, then VALIDATE makes sense. IMHO, anyway. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Sun, Dec 17, 2017 at 5:29 AM, David Rowley wrote: >> If ALTER INDEX .. ATTACH is already taking AEL on the parent, then I >> think it might as well try to validate while it's at it. But if not >> then we might want to go with #2. > > I'm now not that clear on what the behaviour is if the ONLY keyword is > not specified on the CREATE INDEX for the partitioned index. Does that > go and create each leaf partition index regardless of if there is a > suitable candidate to ATTACH? No, the other way around. ONLY is being proposed as a way to create an initially-not-valid parent to which we can then ATTACH subsequently-created child indexes. But because we will have REPLACE rather than DETACH, once you get the index valid it never goes back to not-valid. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
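To make that concrete, the workflow under discussion would look roughly like the following; the exact syntax was still being debated in this thread, so treat it as a sketch of the proposal rather than committed behaviour:

```sql
-- Create the parent index only; it starts out not valid:
CREATE INDEX parent_idx ON ONLY measurement (logdate);

-- Create a matching index on a leaf partition and attach it:
CREATE INDEX meas_y2017_idx ON measurement_y2017 (logdate);
ALTER INDEX parent_idx ATTACH PARTITION meas_y2017_idx;

-- Once every leaf partition has an attached index, the parent index
-- can be marked valid (automatically, or via an explicit VALIDATE
-- step, depending on which option in this thread wins out).
```

Because the proposal has REPLACE rather than DETACH, a parent index that has become valid stays valid.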
Re: Using ProcSignal to get memory context stats from a running backend
On Thu, Dec 14, 2017 at 9:34 PM, Craig Ringer wrote: > On 15 December 2017 at 09:24, Greg Stark wrote: >> Another simpler option would be to open up a new file in the log >> directory > > ... if we have one. > > We might be logging to syslog, or whatever else. > > I'd rather keep it simple(ish). +1. I still think just printing it to the log is fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 18 December 2017 at 15:04, Robert Haas wrote: > On Sun, Dec 17, 2017 at 5:29 AM, David Rowley > wrote: >> I'm now not that clear on what the behaviour is if the ONLY keyword is >> not specified on the CREATE INDEX for the partitioned index. Does that >> go and create each leaf partition index regardless of if there is a >> suitable candidate to ATTACH? > > No, the other way around. ONLY is being proposed as a way to create > an initially-not-valid parent to which we can then ATTACH > subsequently-created child indexes. But because we will have REPLACE > rather than DETACH, once you get the index valid it never goes back to > not-valid. I understand what the ONLY is proposed to do. My question was in regards to the behaviour without ONLY. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: pg_(total_)relation_size and partitioned tables
On Mon, Dec 18, 2017 at 9:29 AM, Michael Paquier wrote:
> The barrier here is thin. What's proposed here is already doable with
> a WITH RECURSIVE query. So why not just documenting this query and be
> done with it instead of complicating the code? It seems to me that the
> performance in calling pg_relation_size() in a cascading times fashion
> would not matter much. Or one could invent an additional cascading
> option which scans inheritance and/or partition chains, or simply have
> a new function.

I just blogged on the matter, and here is one possibility compatible with v10:

WITH RECURSIVE partition_info
  (relid, relname, relsize, relispartition, relkind) AS (
    SELECT oid AS relid,
           relname,
           pg_relation_size(oid) AS relsize,
           relispartition,
           relkind
    FROM pg_catalog.pg_class
    WHERE relname = 'parent_name' AND
          relkind = 'p'
  UNION ALL
    SELECT c.oid AS relid,
           c.relname AS relname,
           pg_relation_size(c.oid) AS relsize,
           c.relispartition AS relispartition,
           c.relkind AS relkind
    FROM partition_info AS p,
         pg_catalog.pg_inherits AS i,
         pg_catalog.pg_class AS c
    WHERE p.relid = i.inhparent AND
          c.oid = i.inhrelid AND
          c.relispartition
  )
SELECT * FROM partition_info;

This is not really straightforward. You could as well have the pg_relation_size call in the outer query.
-- Michael
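A variant with the size call moved to the outer query, as that last remark suggests, could look like this ('parent_name' is again a placeholder for the parent table's name):

```sql
-- Walk the partition tree once, then compute sizes in the outer query:
WITH RECURSIVE parts AS (
    SELECT oid AS relid
    FROM pg_catalog.pg_class
    WHERE relname = 'parent_name' AND relkind = 'p'
  UNION ALL
    SELECT i.inhrelid
    FROM parts p
    JOIN pg_catalog.pg_inherits i ON i.inhparent = p.relid
)
SELECT relid::regclass AS relname,
       pg_relation_size(relid) AS relsize
FROM parts;
```

This keeps the recursive term to pure catalog lookups, which makes the traversal logic easier to reuse for other per-partition calculations.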
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Sun, Dec 17, 2017 at 9:38 PM, David Rowley wrote: > On 18 December 2017 at 15:04, Robert Haas wrote: >> On Sun, Dec 17, 2017 at 5:29 AM, David Rowley >> wrote: >>> I'm now not that clear on what the behaviour is if the ONLY keyword is >>> not specified on the CREATE INDEX for the partitioned index. Does that >>> go and create each leaf partition index regardless of if there is a >>> suitable candidate to ATTACH? >> >> No, the other way around. ONLY is being proposed as a way to create >> an initially-not-valid parent to which we can then ATTACH >> subsequently-created child indexes. But because we will have REPLACE >> rather than DETACH, once you get the index valid it never goes back to >> not-valid. > > I understand what the ONLY is proposed to do. My question was in > regards to the behaviour without ONLY. Oh, sorry -- I was confused. I'm not sure whether that should try to attach to something if it exists, or just create unconditionally... what do you think? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Proposal: Local indexes for partitioned table
On 18 December 2017 at 15:59, Robert Haas wrote:
> On Sun, Dec 17, 2017 at 9:38 PM, David Rowley wrote:
>> On 18 December 2017 at 15:04, Robert Haas wrote:
>>> On Sun, Dec 17, 2017 at 5:29 AM, David Rowley wrote:
>>>> I'm now not that clear on what the behaviour is if the ONLY keyword is
>>>> not specified on the CREATE INDEX for the partitioned index. Does that
>>>> go and create each leaf partition index regardless of if there is a
>>>> suitable candidate to ATTACH?
>>>
>>> No, the other way around. ONLY is being proposed as a way to create
>>> an initially-not-valid parent to which we can then ATTACH
>>> subsequently-created child indexes. But because we will have REPLACE
>>> rather than DETACH, once you get the index valid it never goes back to
>>> not-valid.
>>
>> I understand what the ONLY is proposed to do. My question was in
>> regards to the behaviour without ONLY.
>
> Oh, sorry -- I was confused. I'm not sure whether that should try to
> attach to something if it exists, or just create unconditionally...
> what do you think?

I think you feel quite strongly about not having the code select a random matching index, so if we want to stick to that rule, then we'll need to create a set of new leaf indexes rather than select a random one.

If we don't do that, then we might as well go with my idea and ditch the ONLY syntax and have the CREATE INDEX try to find a suitable leaf index, or create a new leaf index if one cannot be found.

Would it be surprising to users if CREATE INDEX ON partitioned_table created a bunch of duplicate new indexes on the leaf tables?

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Proposal: Local indexes for partitioned table
On Sun, Dec 17, 2017 at 10:05 PM, David Rowley wrote: > I think you feel quite strongly about not having the code select a > random matching index, so if we want to stick to that rule, then we'll > need to create a set of new leaf indexes rather than select a random > one. I feel quite strongly about it *in the context of pg_dump*. Outside of that scenario, I'm happy to go with whatever behavior you and others think will satisfy the POLA. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.
On Thu, Dec 14, 2017 at 7:51 PM, Masahiko Sawada wrote: > +1 from me. Works for me, too, although I still don't really follow how it's happening in the present coding. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.
On Mon, Dec 18, 2017 at 12:14 PM, Robert Haas wrote:
> On Thu, Dec 14, 2017 at 7:51 PM, Masahiko Sawada wrote:
>> +1 from me.
>
> Works for me, too, although I still don't really follow how it's
> happening in the present coding.

Craig has mentioned at least one way upthread: https://www.postgresql.org/message-id/CAMsr+YHj4asAu8dJ1Q5hiW+wxL2Jd1=uxo52dvxhmxive-x...@mail.gmail.com

And that's possible when building with LOCK_DEBUG with trace_lwlocks enabled, at least. So I would rather not rely on the assumption that CHECK_FOR_INTERRUPTS() is a code path never taken for the cases discussed here.
-- Michael
Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager
On Sun, Dec 17, 2017 at 12:27 PM, Robert Haas wrote:
> On Thu, Dec 14, 2017 at 5:45 AM, Masahiko Sawada wrote:
>> Here is the result.
>> I've measured the through-put with some cases on my virtual machine.
>> Each client loads a 48k file into a different relation located on
>> either an xfs filesystem or an ext4 filesystem, for 30 sec.
>>
>> Case 1: COPYs to relations on different filesystems (xfs and ext4) and
>> N_RELEXTLOCK_ENTS is 1024
>>
>> clients = 2, avg = 296.2068
>> clients = 5, avg = 372.0707
>> clients = 10, avg = 389.8850
>> clients = 50, avg = 428.8050
>>
>> Case 2: COPYs to relations on different filesystems (xfs and ext4) and
>> N_RELEXTLOCK_ENTS is 1
>>
>> clients = 2, avg = 294.3633
>> clients = 5, avg = 358.9364
>> clients = 10, avg = 383.6945
>> clients = 50, avg = 424.3687
>>
>> And the result of current HEAD is the following.
>>
>> clients = 2, avg = 284.9976
>> clients = 5, avg = 356.1726
>> clients = 10, avg = 375.9856
>> clients = 50, avg = 429.5745
>>
>> In case 2, the through-put decreased compared to case 1, but it seems
>> to be almost the same as current HEAD. Because the speed of acquiring
>> and releasing the extension lock got 10x faster than current HEAD as I
>> mentioned before, the performance degradation may not be as noticeable
>> as I expected, even in case 2.
>> Since my machine doesn't have enough resources the result of clients =
>> 50 might not be a valid result.
>
> I have to admit that result is surprising to me.

I think the environment I used for performance measurement did not have enough resources. I will do the same benchmark on another environment to see if it was a valid result, and will share it.

Regards,

-- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: pg_(total_)relation_size and partitioned tables
Thanks all for your thoughts.

I agree with Robert's point, which both David and Michael seem to agree with, that we shouldn't really be changing what pg_relation_size() is doing under the covers. And I guess the same goes for pg_table_size(), too. Both of those functions and their siblings work with relations that possess on-disk structures and have associated relations (TOAST, indexes) that in turn possess on-disk structures. It seems quite clearly documented as such. Partitioned tables are different in that they neither possess on-disk structures nor have any relations (TOAST, indexes) associated directly with them. Instead, they have partitions, which are the relations that the aforementioned dbsize.c functions are familiar with.

So, I withdraw the patch I originally posted in favor of some other approach. Reply continues below...

On 2017/12/18 11:51, Michael Paquier wrote:
> On Mon, Dec 18, 2017 at 9:29 AM, Michael Paquier wrote:
>> The barrier here is thin. What's proposed here is already doable with
>> a WITH RECURSIVE query. So why not just documenting this query and be
>> done with it instead of complicating the code? It seems to me that the
>> performance in calling pg_relation_size() in a cascading times fashion
>> would not matter much. Or one could invent an additional cascading
>> option which scans inheritance and/or partition chains, or simply have
>> a new function.
>
> I just blogged on the matter, and here is one possibility compatible with v10:
>
> WITH RECURSIVE partition_info
>   (relid, relname, relsize, relispartition, relkind) AS (
>     SELECT oid AS relid,
>            relname,
>            pg_relation_size(oid) AS relsize,
>            relispartition,
>            relkind
>     FROM pg_catalog.pg_class
>     WHERE relname = 'parent_name' AND
>           relkind = 'p'
>   UNION ALL
>     SELECT c.oid AS relid,
>            c.relname AS relname,
>            pg_relation_size(c.oid) AS relsize,
>            c.relispartition AS relispartition,
>            c.relkind AS relkind
>     FROM partition_info AS p,
>          pg_catalog.pg_inherits AS i,
>          pg_catalog.pg_class AS c
>     WHERE p.relid = i.inhparent AND
>           c.oid = i.inhrelid AND
>           c.relispartition
>   )
> SELECT * FROM partition_info;
>
> This is not really straightforward. You could as well have the
> pg_relation_size call in the outer query.

Thanks Michael for coming up with that.

Do you (and/or others) think that's something that we can wrap inside a built-in function(s), that is, one defined in system_views.sql? Or if we decide to have new functions, say, pg_get_partitions() and/or pg_get_partition_sizes(), we might as well implement them as C functions inside dbsize.c. If so, do we want to have "partition" variants of all *_size() functions viz. pg_relation_size(), pg_total_relation_size(), pg_indexes_size(), and pg_table_size()?

Thanks,
Amit
Re: pg_(total_)relation_size and partitioned tables
On Mon, Dec 18, 2017 at 2:17 PM, Amit Langote wrote:
> Do you (and/or others) think that's something that we can wrap inside a
> built-in function(s), that is, one defined in system_views.sql? Or if we
> decide to have new functions, say, pg_get_partitions() and/or
> pg_get_partition_sizes(), we might as well implement them as C functions
> inside dbsize.c. If so, do we want to have "partition" variants of
> all *_size() functions viz. pg_relation_size(), pg_total_relation_size(),
> pg_indexes_size(), and pg_table_size()?

I can see value in something like Robert is proposing upthread: having a function one is able to customize to decide what to include in the calculation. There could be options for at least:
- partitions, or relation cascading.
- indexes.
- TOAST tables.
- visibility maps.
- FSM.

The first three are what Robert mentioned; still, I would see the last two as interesting things to have as options. Or we could have just an SRF that returns one row per object scanned.
-- Michael
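As a point of comparison for the "wrap it in a function" idea, the recursive query can already be packaged as a plain SQL function today. This is a hypothetical sketch, not an existing or proposed catalog function, and the name is made up:

```sql
-- Total on-disk size of a partition tree: the parent plus every
-- partition reachable through pg_inherits, summed with
-- pg_total_relation_size() (which includes indexes and TOAST).
CREATE OR REPLACE FUNCTION pg_partition_tree_total_size(parent regclass)
RETURNS bigint
LANGUAGE sql STABLE AS $$
    WITH RECURSIVE tree AS (
        SELECT parent AS relid
      UNION ALL
        SELECT i.inhrelid::regclass
        FROM tree t
        JOIN pg_catalog.pg_inherits i ON i.inhparent = t.relid
    )
    SELECT sum(pg_total_relation_size(relid))::bigint FROM tree;
$$;

-- Usage ('measurement' is a placeholder table name):
-- SELECT pg_partition_tree_total_size('measurement');
```

A customizable built-in version would presumably expose the options listed above as parameters instead of hard-wiring pg_total_relation_size().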
Re: [HACKERS] replace GrantObjectType with ObjectType
On Sat, Dec 16, 2017 at 12:40 AM, Robert Haas wrote:
> On Fri, Dec 15, 2017 at 12:40 PM, Peter Eisentraut wrote:
>> On 12/14/17 22:59, Rushabh Lathia wrote:
>>> I noted that the no_priv_msg and not_owner_msg arrays have been removed
>>> and the code fitted into aclcheck_error(). Actually that
>>> makes the code more complex than what it used to be. I would
>>> prefer the array rather than code fitted into the function.
>>
>> There is an argument for having a big array versus the switch/case
>> approach. But most existing code around object addresses uses the
>> switch/case approach, so it's better to align it that way, I think.
>> It's weird to have to maintain two different styles.

My only motivation is that the earlier approach looks cleaner. Also, the patch is getting bigger, so continuing with the old approach would make review easier. If the switch/case approach is the way to go, it could be done as part of a separate clean-up patch.

Thanks,
Rushabh Lathia