Re: [HACKERS] advanced partition matching algorithm for partition-wise join
Hi Fujita-san, Please find my comments inline below:

On Wed, Aug 28, 2019 at 3:52 PM Etsuro Fujita wrote:
> On Fri, Aug 16, 2019 at 10:25 PM Etsuro Fujita wrote:
> > [... skipped ..]
>
> About the attached:
>
> * The attached patch modified try_partitionwise_join() so that we call
> partition_bounds_equal() then partition_bounds_merge() (if necessary)
> to create the partition bounds for the join rel. We don't support for
> merging the partition bounds for the hash-partitioning case, so this
> makes code to handle the hash-partitioning case in
> partition_bounds_merge() completely unnecessary, so I removed that
> entirely.

Yes, that makes sense. Thinking about it further, it occurred to me to wonder why we can't join two hash-partitioned tables that have the same modulus and partition key specification but where some partitions are missing from either partitioned table. For example, here is a small case where foo has two partitions and bar has only one:

CREATE TABLE foo(a int) PARTITION BY HASH(a);
CREATE TABLE foo_p0 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 0);
CREATE TABLE foo_p1 PARTITION OF foo FOR VALUES WITH(modulus 2, remainder 1);
CREATE TABLE bar(a int) PARTITION BY HASH(a);   <-- missing partition for remainder 1
CREATE TABLE bar_p0 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 0);

Explain:

postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
                              QUERY PLAN
-
 Merge Join  (cost=590.35..1578.47 rows=65025 width=8)
   Merge Cond: (p2.a = p1.a)
   ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
         Sort Key: p2.a
         ->  Seq Scan on bar_p0 p2  (cost=0.00..35.50 rows=2550 width=4)
   ->  Sort  (cost=410.57..423.32 rows=5100 width=4)
         Sort Key: p1.a
         ->  Append  (cost=0.00..96.50 rows=5100 width=4)
               ->  Seq Scan on foo_p0 p1  (cost=0.00..35.50 rows=2550 width=4)
               ->  Seq Scan on foo_p1 p1_1  (cost=0.00..35.50 rows=2550 width=4)
(10 rows)

The partition-wise join is performed only if we fill the partition hole in the joining table, i.e. add the missing partition to bar:

postgres=# CREATE TABLE bar_p1 PARTITION OF bar FOR VALUES WITH(modulus 2, remainder 1);
CREATE TABLE
postgres=# explain select * from foo p1, bar p2 where p1.a = p2.a;
                              QUERY PLAN
-
 Append  (cost=359.57..2045.11 rows=65024 width=8)
   ->  Merge Join  (cost=359.57..860.00 rows=32512 width=8)
         Merge Cond: (p1.a = p2.a)
         ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
               Sort Key: p1.a
               ->  Seq Scan on foo_p0 p1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
               Sort Key: p2.a
               ->  Seq Scan on bar_p0 p2  (cost=0.00..35.50 rows=2550 width=4)
   ->  Merge Join  (cost=359.57..860.00 rows=32512 width=8)
         Merge Cond: (p1_1.a = p2_1.a)
         ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
               Sort Key: p1_1.a
               ->  Seq Scan on foo_p1 p1_1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
               Sort Key: p2_1.a
               ->  Seq Scan on bar_p1 p2_1  (cost=0.00..35.50 rows=2550 width=4)
(17 rows)

It would be nice if we could support this case, since we do allow partition holes in the other partitioning schemes; but there wouldn't be much objection to leaving it unsupported for now, since a hash-partitioned table is less likely to have holes, IMO.

> * I removed this assertion in partition_bounds_merge(), because I
> think this is covered by two assertions above this.
> > + Assert((*outer_parts == NIL || *inner_parts != NIL) &&
> > +        (*inner_parts == NIL || *outer_parts != NIL));
>
> * (I forgot to mention this in a previous email, but) I removed this
> bit of generate_matching_part_pairs(), because we already have the
> same check in try_partitionwise_join(), so this bit would be redundant
> IIUC.

Looks good.

> * I added more comments.

Thanks.

> If there are no objections, I'll merge the attached with the base patch in [1].

The proposed enhancement is good, and the patch is reasonable enough to merge into the main patch. Here are a few cosmetic suggestions for it. Feel free to ignore any that do not make sense or are too obvious. Note: the left-hand numbers are the line numbers within the patch.

118 +     }
119 +
120 +     /*
121 +      * Try to create the partition bounds along with join pairs.
122 +      */
123 +     if (boundinfo == NULL)
124 +     {

Can we make this block the else branch of the previous if condition that checks for equal partition bounds?

133 +
Re: Bug in GiST pairing heap comparator
> On 2 Sep 2019, at 9:54, Alexander Korotkov wrote:
>
> It appears to be related to the implementation of the comparison function
> in the pairing heap used as a priority queue for KNN. It used plain float
> comparison, which doesn't handle Inf and NaN values well. The attached
> patch replaces it with float8_cmp_internal().

Thanks! This patch fixes the tests of my new GiST build :)

While the patch looks good to me, I want to add that there are a lot of <= and > comparisons in gistproc.c, in the function:

static float8 computeDistance(bool isLeaf, BOX *box, Point *point)

Should we fix those too, or add a comment explaining why the current code is safe?

Thanks!

Best regards, Andrey Borodin.
Re: moonjelly vs tsearch
> News from Fabien's bleeding edge compiler zoo: apparently GCC 10
> 20190831 thinks the tsearch test should produce different output than
> 20190824 did. It looks like the parsing of question marks changed...

Indeed, that looks pretty strange.

> I can't wait for next week's installment.

I'll relaunch the gcc trunk compilation before the end of the week.

-- Fabien.
Bug in GiST pairing heap comparator
Hi!

Andrey Borodin pointed out to me that the results of some KNN-GiST tests are obviously wrong and don't match the results of a sort node.

SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
 f1
---
 (10,10)
 (NaN,NaN)
 (0,0)
 (1e-300,-1e-300)
 (-3,4)
 (-10,0)
 (-5,-12)
 (5.1,34.5)
 (1e+300,Infinity)
(10 rows)

It appears to be related to the implementation of the comparison function in the pairing heap used as a priority queue for KNN. It used plain float comparison, which doesn't handle Inf and NaN values well. The attached patch replaces it with float8_cmp_internal().

Also, note that with the patch the KNN results still don't fully match the results of the sort node.

SELECT * FROM point_tbl ORDER BY f1 <-> '0,1';
 f1
---
 (0,0)
 (1e-300,-1e-300)
 (-3,4)
 (-10,0)
 (10,10)
 (-5,-12)
 (5.1,34.5)
 (1e+300,Infinity)
 (NaN,NaN)
(10 rows)

NULL and '(NaN,NaN)' are swapped. That happens because we assume the distance to NULL to be Inf, while float8_cmp_internal() considers NaN to be greater than Inf. Even if we assumed the distance to NULL to be NaN, that wouldn't guarantee that NULLs come last. It looks like we can handle this only by introducing an array of "distance is null" flags in GISTSearchItem. But is it worth it?

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

gist-pairing-heap-cmp-fix-1.patch Description: Binary data
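To illustrate the failure mode under discussion, here is a minimal, self-contained sketch (not taken from the attached patch) contrasting a naive comparator, for which every ordering test involving NaN is false, with a total-order comparator in the spirit of float8_cmp_internal(), where NaN equals NaN and sorts above everything else, including Inf:

#include <math.h>

/* Buggy: if a or b is NaN, both tests are false and the function reports
 * "equal", so a pairing heap using it returns items in arbitrary order. */
static int
cmp_plain(double a, double b)
{
	if (a < b)
		return -1;
	if (a > b)
		return 1;
	return 0;
}

/* Total order mirroring float8_cmp_internal()'s semantics. */
static int
cmp_total(double a, double b)
{
	if (isnan(a))
		return isnan(b) ? 0 : 1;
	if (isnan(b))
		return -1;
	if (a < b)
		return -1;
	if (a > b)
		return 1;
	return 0;
}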
Re: REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity
On Mon, Sep 2, 2019 at 6:31 AM Tom Lane wrote:
> I wrote:
> > As far as 4) goes, I think the code in ExtractReplicaIdentity is pretty
> > duff anyway, because it doesn't bother to check for the defined failure
> > return for RelationIdGetRelation. But if we're concerned about the
> > cost of recalculating this stuff per-row, couldn't we cache it a little
> > better? It should be safe to assume the set of index columns isn't
> > changing intra-query.
> > ... in fact, isn't all the infrastructure for that present already?
> > Why is this code looking directly at the index at all, rather than
> > using the relcache's rd_idattr bitmap?
>
> Here's a proposed patch along those lines. It fixes Hadi's original
> crash case and passes check-world.

Agreed that this patch would be a better solution for Hadi's report, although I also agree that the situation with index locking for DELETE isn't perfect.

> I'm a bit suspicious of the exclusion for idattrs being empty, but
> if I remove that, some of the contrib/test_decoding test results
> change. Anybody want to comment on that? If that's actually an
> expected situation, why is there an elog(DEBUG) in that path?

ISTM that the exclusion case may occur when the table's replica identity is REPLICA_IDENTITY_DEFAULT and there is no primary key defined, in which case nothing needs to get logged. The elog(DEBUG) may just be a remnant from the days when this was being developed; I couldn't find any notes on it in the archives [1], though.

Thanks, Amit

[1] https://www.postgresql.org/message-id/flat/20131204155510.GO24801%40awork2.anarazel.de
pg_basebackup -F t fails when fsync spends more time than tcp_user_timeout
Hi,

pg_basebackup -F t fails when fsync spends more time than tcp_user_timeout in the following environment.

[Environment]
Postgres 13dev (master branch)
Red Hat Enterprise Linux 7.4

[Error]
$ pg_basebackup -F t --progress --verbose -h -D
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/5A60 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_15647"
pg_basebackup: error: could not read COPY data: server closed the connection unexpectedly
	This probably means the server terminated abnormally before or while processing the request.

[Analysis]
- pg_basebackup -F t creates a tar file and does fsync() for each tablespace. (By contrast, -F p does fsync() only once, at the end.)
- While fsync() is running for one tablespace's tar file, the wal sender sends the content of the next tablespace. When fsync() takes a long time, the TCP socket of pg_basebackup returns "zero window" packets to the wal sender, meaning the TCP socket buffer of pg_basebackup is exhausted, since pg_basebackup cannot receive data during fsync().
- The wal sender's socket retries sending the packet but resets the connection after tcp_user_timeout. After the wal sender resets the connection, pg_basebackup cannot receive data and fails with the above error.

[Solution]
I think fsync() for each tablespace is not necessary. As with pg_basebackup -F p, I think fsync() is necessary only once, at the end.

Could you give me any comment?

Regards, Ryohei Takahashi
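For anyone who wants to observe the interaction, a hypothetical setup (all values, host names, and paths are illustrative, not from this report): make tcp_user_timeout short relative to the expected fsync() time, then take a tar-format backup onto storage slow enough that fsync() of one tablespace's tar file exceeds the timeout.

-- on the server; tcp_user_timeout is in milliseconds (PostgreSQL 12+)
ALTER SYSTEM SET tcp_user_timeout = '5s';
SELECT pg_reload_conf();

$ pg_basebackup -F t --progress --verbose -h server -D /slow/disk/backup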
Re: safe to overload objectSubId for a type?
Chapman Flack writes: > I don't mean "overload objectSubId" in any ObjectAddress that PG code > would ever see. I am only thinking of a data structure of my own that > is ObjectAddress-like and has all three components available all the > time, and for an object that's a type, I would find it handy to stash > a typmod there, and not have to carry those around separately. If this is totally independent of ObjectAddress, why are you even asking? I assume that what you mean is you'd like these values to bleed into ObjectAddresses or vice versa. If we ever do make ObjectAddress.objectSubId mean something for types, I'd expect --- based on the precedent of relation columns --- that it'd specify a column number for a column of a composite type. There are fairly obvious use-cases for that, such as allowing a DROP of a column type to not propagate to the whole composite type. So I'd be pretty down on allowing it to mean typmod instead. regards, tom lane
Re: POC: Cleaning up orphaned files using undo logs
On Fri, Aug 30, 2019 at 8:27 PM Kuntal Ghosh wrote:
> I'm getting the following assert failure while performing recovery
> with the same.
> "TRAP: FailedAssertion("slot->meta.status == UNDO_LOG_STATUS_FULL",
> File: "undolog.c", Line: 997)"
>
> I found that we don't emit a WAL record when we update
> slot->meta.status to UNDO_LOG_STATUS_FULL. If we don't do that, after
> crash recovery, some new transaction may use that undo log, which is
> wrong, IMHO. Am I missing something?

Thanks, right, that status logging is wrong; will fix in the next version.

-- Thomas Munro https://enterprisedb.com
moonjelly vs tsearch
Hi,

News from Fabien's bleeding edge compiler zoo: apparently GCC 10 20190831 thinks the tsearch test should produce different output than 20190824 did. It looks like the parsing of question marks changed...

I can't wait for next week's installment.

-- Thomas Munro https://enterprisedb.com
Re: XPRS
On Sat, Aug 24, 2019 at 3:19 AM Tomas Vondra wrote: > > Although “The Wei Hong Optimizer” was designed in the context of > > Postgres, it became the standard approach for many of the parallel > > query optimizers in industry." > > I assume this quote is from 30 years ago. I wonder if the claim is still > true, on current hardware (including e.g. distributed databases). The quote is from 2018, and appears in the article I linked (it's a chapter from the book Making Databases Work: The Wisdom of Michael Stonebraker), but I'm not sure which systems it's referring to. Speculation: Many other systems have what we call parallel-oblivious operators only, and then insert various exchange operators to make a parallel plan. That is, they don't necessarily have a direct equivalent of our "Parallel Sequential Scan", "Parallel Index Scan", "Parallel Foreign Scan": they just use their regular scans, possibly with the addition of some kind of "Parallel Scatter" node (that's my made up name, it's the opposite of Gather, called various things like "page supplier" or "block iterator") or "Parallel Repartition" inserted in the right places. Perhaps they create a serial plan first, and then try to parallelise it by inserting various parallel operators and then recomputing the costs? Rather than considering the separate paths in the first phase of the optimiser, as we do. The cases where Hong's best-parallel-plan hypothesis isn't true for us now might go away if we had Parallel Repartition, so that each 'stream' would be the complete set of tuples for some known partition. To be clear, I'm not suggesting we do that necessarily, just pointing out some interesting claims about ancient POSTGRES wisdom, in a highly speculative water cooler thread. Actually, this isn't the first time it's occurred to me that elements of our design were falling out of the order that we chose to implement things in. Another example is the "Shared Hash" that I had in an early version of my work on Parallel Hash Join, where just one process would run a parallel-safe but non-parallel-oblivious plan to build a shared hash table while other workers twiddled their thumbs; I dropped it because our cost model has no penalty for running N copies of the same plan rather than just one so there was no way to pick that plan, and that's because we don't have a cost model like Hong's that considers resource usage too. Another more speculative observation: maybe no-partition/shared Parallel Hash Join is only obvious if you already have the general concept of parallel-aware executor nodes. AFAIK Robert and Amit invented those to be able to coordinate parallel scans between processes, where thread-based systems might be able to share a single scan state somehow under a Scatter-like operator. If you had threads, you might not need that concept that allows arbitrary executor nodes to communicate with each other between workers, and then it might be more obvious and natural to use repartitioning for parallelising hash joins. > FWIW I think we'll have to do something about resource acquisition, sooner > or later. It was always quite annoying that we don't really consider > memory consumption of the query as a whole during planning, and parallel > query made it a bit more painful. Agreed. Here's an approach I have been wondering about to cap total executor memory usage, which is a much more down-to-Earth idea than any of the above space cadet planner problems. Let's start at the other end of the problem, by introducing admission control and memory quotas. 
That is, keep using work_mem with its current per-node-copy meaning at planning time, for now, and then:

1. Compute the peak amount of memory each plan thinks it will need. Initially that could be done by summing estimates from all nodes and considering workers. A later refinement could deal with nodes that give back memory early, if we get around to doing that. The estimate could be shown by EXPLAIN. (Some details to work out: worst case vs expected etc.)

2. Introduce a new GUC global_work_mem, which limits the total set of plans that are allowed to run concurrently, according to their memory estimates. Introduce a shared memory counter of currently allocated quota (a sketch of this bookkeeping follows after this list).

3. Introduce a new GUC session_work_mem, which is the amount of quota that every session tries to acquire when it connects or perhaps first runs a query, and that it won't give back until the end of the session. Or perhaps sessions acquire less than that if they need less, but that's the amount they never give back once they've got that much. The idea is to allow queries with estimates under that limit, for example high frequency OLTP queries, to avoid any extra communication overhead from this scheme.

4. To run queries that have estimates higher than the session's current allocated quota, the session must acquire more quota for the duration of the query. If it can't be acquired right now without exceeding global_work_mem, it has to
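A minimal sketch of the shared quota counter described in steps 2-4, using a pthread mutex as a stand-in for a PostgreSQL spinlock or LWLock; all names here are hypothetical illustrations of the proposal, not an existing API:

#include <stdbool.h>
#include <stddef.h>
#include <pthread.h>

/* Hypothetical shared state; in PostgreSQL this would live in shared
 * memory, with the mutex replaced by a spinlock or LWLock. */
typedef struct QuotaState
{
	pthread_mutex_t lock;
	size_t		global_work_mem;	/* configured ceiling */
	size_t		allocated;			/* quota currently handed out */
} QuotaState;

/* Try to reserve "amount" bytes of quota for a query; fail rather than
 * wait if the ceiling would be exceeded (step 4 discusses what a real
 * implementation might do instead of failing). */
static bool
quota_acquire(QuotaState *qs, size_t amount)
{
	bool		ok = false;

	pthread_mutex_lock(&qs->lock);
	if (qs->allocated + amount <= qs->global_work_mem)
	{
		qs->allocated += amount;
		ok = true;
	}
	pthread_mutex_unlock(&qs->lock);
	return ok;
}

/* Return quota at the end of the query; the session_work_mem floor from
 * step 3 would simply never be released until session exit. */
static void
quota_release(QuotaState *qs, size_t amount)
{
	pthread_mutex_lock(&qs->lock);
	qs->allocated -= amount;
	pthread_mutex_unlock(&qs->lock);
}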
safe to overload objectSubId for a type?
Hi, I don't mean "overload objectSubId" in any ObjectAddress that PG code would ever see. I am only thinking of a data structure of my own that is ObjectAddress-like and has all three components available all the time, and for an object that's a type, I would find it handy to stash a typmod there, and not have to carry those around separately. As far as I can tell, most objects (maybe all, except attributes and attribute defaults?) have the objectSubId unused. But I would not want to paint myself into a corner if anyone anticipates making objectSubId mean something for type objects somewhere down the road. Thanks, -Chap
Re: [Patch] Add a reset_computed_values function in pg_stat_statements
On Sun, Sep 1, 2019 at 6:04 AM Pierre Ducroquet wrote:
>
> Hello
>
> pg_stat_statements is a great tool to track performance issues in live
> databases, especially when adding interfaces like PoWA on top of it.
> But so far, tools like PoWA can not track the min_time, max_time, mean_time
> and sum_var_time of queries: these statistics are cumulated over time;
> fetching points in time would be of little to no use, especially when looking
> at the impact of a DDL change.
> This patch thus introduces a simple pg_stat_statements_reset_computed_values
> function that resets the computed statistics, leaving all other information
> alive, thus allowing the aforementioned scenario.

Pierre, I don't understand why you want to reset part of the statement statistics. If you want the minimal query time every week, reset all statistics of that statement (v12 or later). Postgres provides histogram metrics (min, max, mean, stddev). AFAICS you want a metric type called timer (a combination of histogram and meter); for example, calls, sum, min, max, mean, and standard deviation would be calculated each hour. If I were to add a new metric to pg_stat_statements, it would be last_time. Comparing histogram metrics with the last statement time is useful for checking whether a certain optimization was effective.

Talking about your patch, the math is wrong. If you don't reset total_time, all computed values will be incorrect. You are changing the actual meaning of those metrics, and I think that is unacceptable because it will break applications. Instead, you should (i) provide new counters or (ii) add a GUC to control this behavior. It is a trade-off between incompatibility and too many metrics, though I wouldn't like to break compatibility (as I said, you can always reset all statement statistics). Why don't you provide a pg_stat_statements_reset_computed_values(userid Oid, dbid Oid, queryid bigint)?

You forgot to provide documentation for the new function. You should revoke pg_stat_statements_reset_computed_values from PUBLIC.

Regards,

-- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
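For reference, the targeted reset mentioned above (resetting all statistics of one statement) is already possible in v12 with the three-argument form of pg_stat_statements_reset(); the role, database, and queryid values here are illustrative:

-- Reset statistics for a single statement (pg_stat_statements, PG 12+).
SELECT pg_stat_statements_reset(
    (SELECT oid FROM pg_roles WHERE rolname = 'app_user'),
    (SELECT oid FROM pg_database WHERE datname = 'appdb'),
    1234567890);  -- queryid taken from the pg_stat_statements view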
Re: refactoring - share str2*int64 functions
On Sun, Sep 01, 2019 at 08:07:06PM +0200, Fabien COELHO wrote:
> They allow performance tests to be done quickly; for me it is useful to keep
> it around, but you are the committer, you do as you feel.

If there are more voices for having that in core, we could consider it. For now I have added that into my own plugin repository with all the functions discussed on this thread: https://github.com/michaelpq/pg_plugins/

> The signed overflows are trickier even; I have not paid attention to the
> fallback code. I agree that it is better left untouched for now.

Thanks.

> Hmmm. I did manual tests really. Attached is a psql script replicating them.
>
> # with builtin overflow detection
> sh> psql < oc.sql
> NOTICE: int 16 mul: 00:00:02.747269 # slow
> NOTICE: int 16 add: 00:00:01.83281
> NOTICE: int 16 sub: 00:00:01.8501
> NOTICE: uint 16 mul: 00:00:03.68362 # slower
> NOTICE: uint 16 add: 00:00:01.835294
> NOTICE: uint 16 sub: 00:00:02.136895 # slow
> NOTICE: int 32 mul: 00:00:01.828065
> NOTICE: int 32 add: 00:00:01.840269
> NOTICE: int 32 sub: 00:00:01.843557
> NOTICE: uint 32 mul: 00:00:02.447052 # slow
> NOTICE: uint 32 add: 00:00:01.849899
> NOTICE: uint 32 sub: 00:00:01.840773
> NOTICE: int 64 mul: 00:00:01.839051
> NOTICE: int 64 add: 00:00:01.839065
> NOTICE: int 64 sub: 00:00:01.838599
> NOTICE: uint 64 mul: 00:00:02.161346 # slow
> NOTICE: uint 64 add: 00:00:01.839404
> NOTICE: uint 64 sub: 00:00:01.838549

Actually that's much faster than a single core manages on my Debian SID with gcc 9.2.1. Here are more results from me:

            Built-in         undef Built-in
int16 mul   00:00:05.425207  00:00:05.634417
int16 add   00:00:05.389738  00:00:06.38885
int16 sub   00:00:05.446529  00:00:06.39569
uint16 mul  00:00:05.499066  00:00:05.541617
uint16 add  00:00:05.281622  00:00:06.252511
uint16 sub  00:00:05.366424  00:00:05.457148
int32 mul   00:00:05.121209  00:00:06.154989
int32 add   00:00:05.228722  00:00:06.344721
int32 sub   00:00:05.237594  00:00:06.323543
uint32 mul  00:00:05.126339  00:00:05.921738
uint32 add  00:00:05.212085  00:00:06.183031
uint32 sub  00:00:05.201884  00:00:05.363667
int64 mul   00:00:05.136129  00:00:06.148101
int64 add   00:00:05.200201  00:00:06.329091
int64 sub   00:00:05.218028  00:00:06.313114
uint64 mul  00:00:05.444733  00:00:08.089742
uint64 add  00:00:05.603978  00:00:06.377753
uint64 sub  00:00:05.544838  00:00:05.490873

This part has been committed; now let's move to the next parts of your patch.

-- Michael
Re: row filtering for logical replication
On Sun, Sep 1, 2019 at 6:09 AM Erik Rijkers wrote:
>
> The first 4 of these apply without error, but I can't get 0005 to apply.
> This is what I use:

Erik, I generated a new patch set with the patience diff algorithm. It seems to apply cleanly.

-- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

From c07af2f00b7a72ba9660e389bb1392fc9e5d2688 Mon Sep 17 00:00:00 2001
From: Euler Taveira
Date: Wed, 24 Jan 2018 17:01:31 -0200
Subject: [PATCH 4/8] Rename a WHERE node

A WHERE clause will be used for row filtering in logical replication. We
already have a similar node: 'WHERE (condition here)'. Let's rename the
node to a generic name and use it for row filtering too.
---
 src/backend/parser/gram.y | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index c97bb367f8..1de8f56794 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -476,7 +476,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type def_arg columnElem where_clause where_or_current_clause
 a_expr b_expr c_expr AexprConst indirection_el opt_slice_bound
 columnref in_expr having_clause func_table xmltable array_expr
-ExclusionWhereClause operator_def_arg
+OptWhereClause operator_def_arg
 %type rowsfrom_item rowsfrom_list opt_col_def_list
 %type opt_ordinality
 %type ExclusionConstraintList ExclusionConstraintElem
@@ -3710,7 +3710,7 @@ ConstraintElem:
 $$ = (Node *)n;
 }
 | EXCLUDE access_method_clause '(' ExclusionConstraintList ')'
-opt_c_include opt_definition OptConsTableSpace ExclusionWhereClause
+opt_c_include opt_definition OptConsTableSpace OptWhereClause
 ConstraintAttributeSpec
 {
 Constraint *n = makeNode(Constraint);
@@ -3812,7 +3812,7 @@ ExclusionConstraintElem: index_elem WITH any_operator
 }
 ;
-ExclusionWhereClause:
+OptWhereClause:
 WHERE '(' a_expr ')' { $$ = $3; }
 | /*EMPTY*/{ $$ = NULL; }
 ;
-- 2.11.0

From 367631ac4ba1e41170d59d39693e2eaf7c406621 Mon Sep 17 00:00:00 2001
From: Euler Taveira
Date: Tue, 27 Feb 2018 02:21:03 +
Subject: [PATCH 3/8] Refactor function create_estate_for_relation

Relation localrel is the only LogicalRepRelMapEntry structure member that is useful for create_estate_for_relation.
---
 src/backend/replication/logical/worker.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 11e6331f49..d9952c8b7e 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -173,7 +173,7 @@ ensure_transaction(void)
 * This is based on similar code in copy.c
 */
 static EState *
-create_estate_for_relation(LogicalRepRelMapEntry *rel)
+create_estate_for_relation(Relation rel)
 {
 EState *estate;
 ResultRelInfo *resultRelInfo;
@@ -183,13 +183,13 @@ create_estate_for_relation(LogicalRepRelMapEntry *rel)
 rte = makeNode(RangeTblEntry);
 rte->rtekind = RTE_RELATION;
- rte->relid = RelationGetRelid(rel->localrel);
- rte->relkind = rel->localrel->rd_rel->relkind;
+ rte->relid = RelationGetRelid(rel);
+ rte->relkind = rel->rd_rel->relkind;
 rte->rellockmode = AccessShareLock;
 ExecInitRangeTable(estate, list_make1(rte));
 resultRelInfo = makeNode(ResultRelInfo);
- InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
+ InitResultRelInfo(resultRelInfo, rel, 1, NULL, 0);
 estate->es_result_relations = resultRelInfo;
 estate->es_num_result_relations = 1;
@@ -589,7 +589,7 @@ apply_handle_insert(StringInfo s)
 }
 /* Initialize the executor state. */
- estate = create_estate_for_relation(rel);
+ estate = create_estate_for_relation(rel->localrel);
 remoteslot = ExecInitExtraTupleSlot(estate,
 RelationGetDescr(rel->localrel),
 );
@@ -696,7 +696,7 @@ apply_handle_update(StringInfo s)
 check_relation_updatable(rel);
 /* Initialize the executor state. */
- estate = create_estate_for_relation(rel);
+ estate = create_estate_for_relation(rel->localrel);
 remoteslot = ExecInitExtraTupleSlot(estate,
 RelationGetDescr(rel->localrel),
 );
@@ -815,7 +815,7 @@ apply_handle_delete(StringInfo s)
 check_relation_updatable(rel);
 /* Initialize the executor state. */
- estate = create_estate_for_relation(rel);
+ estate = create_estate_for_relation(rel->localrel);
 remoteslot = ExecInitExtraTupleSlot(estate,
 RelationGetDescr(rel->localrel),
 );
-- 2.11.0

From e83de1cab559f4ca8f9a75356e220356814cd243 Mon Sep 17 00:00:00 2001
From: Euler Taveira
Date: Fri, 9 Mar 2018 18:39:22 +
Subject: [PATCH 1/8] Remove unused atttypmod column from initial table synchronization

Since commit 7c4f52409a8c7d85ed169bbbc1f6092274d03920, atttypmod was added but not
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> After investigation, the mechanism that's causing that is that the
> src/test/recovery/t/010_logical_decoding_timelines.pl test shuts
> down its replica server with a mode-immediate stop, which causes
> that postmaster to shut down all its children with SIGQUIT, and
> in particular that signal propagates to a "cp" command that the
> archiver process is executing. The "cp" is unsurprisingly running
> with default SIGQUIT handling, which per the signal man page
> includes dumping core.

We experienced this (a core dump in the data directory caused by an archive command) years ago. Related to this, the example of using cp in the PostgreSQL manual is misleading, because cp doesn't reliably persist the WAL archive file.

> This makes me wonder whether we shouldn't be using some other signal
> to shut down archiver subprocesses. It's not real cool if we're
> spewing cores all over the place. Admittedly, production servers
> are likely running with "ulimit -c 0" on most modern platforms,
> so this might not be a huge problem in the field; but accumulation
> of core files could be a problem anywhere that's configured to allow
> server core dumps.

We enable core dumps in production to help investigation, just in case.

> Ideally, perhaps, we'd be using SIGINT not SIGQUIT to shut down
> non-Postgres child processes. But redesigning the system's signal
> handling to make that possible seems like a bit of a mess.
>
> Thoughts?

We're using a shell script and a command that's called from the shell script. That is:

archive_command = 'call some_shell_script.sh ...'

[some_shell_script.sh]
- ulimit -c 0
- trap SIGQUIT so as to just exit on receipt of the signal
- call some_command to copy the file

some_command also catches SIGQUIT and just exits. It copies and syncs the file. (A sketch of this kind of wrapper follows below.)

I proposed something along this line as below, but I couldn't respond to Peter's review comments due to other tasks. Does anyone think it's worth resuming this?

https://www.postgresql.org/message-id/7E37040CF3804EA5B018D7A022822984@maumau

Regards Takayuki Tsunakawa
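To make the shape of that wrapper concrete, here is a minimal sketch; every name, path, and flag in it is illustrative, not the actual script referred to above:

#!/bin/sh
# Illustrative archive_command wrapper; in postgresql.conf:
#   archive_command = 'some_shell_script.sh %p %f'
ulimit -c 0          # no core files from this script or its children
trap 'exit 1' QUIT   # exit quietly on SIGQUIT instead of dumping core

src="$1"
dst="/mnt/archive/$2"

cp "$src" "$dst" || exit 1
# cp alone does not persist the data; sync the file explicitly
# ("sync FILE" needs coreutils >= 8.24)
sync "$dst" || exit 1
exit 0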
Re: REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity
I wrote:
> As far as 4) goes, I think the code in ExtractReplicaIdentity is pretty
> duff anyway, because it doesn't bother to check for the defined failure
> return for RelationIdGetRelation. But if we're concerned about the
> cost of recalculating this stuff per-row, couldn't we cache it a little
> better? It should be safe to assume the set of index columns isn't
> changing intra-query.
> ... in fact, isn't all the infrastructure for that present already?
> Why is this code looking directly at the index at all, rather than
> using the relcache's rd_idattr bitmap?

Here's a proposed patch along those lines. It fixes Hadi's original crash case and passes check-world.

I'm a bit suspicious of the exclusion for idattrs being empty, but if I remove that, some of the contrib/test_decoding test results change. Anybody want to comment on that? If that's actually an expected situation, why is there an elog(DEBUG) in that path?

regards, tom lane

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cb811d3..8a3c7dc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -7594,18 +7594,20 @@ log_heap_new_cid(Relation relation, HeapTuple tup)
 *
 * Returns NULL if there's no need to log an identity or if there's no suitable
 * key in the Relation relation.
+ *
+ * *copy is set to true if the returned tuple is a modified copy rather than
+ * the same tuple that was passed in.
 */
 static HeapTuple
-ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *copy)
+ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed,
+ bool *copy)
 {
 TupleDesc desc = RelationGetDescr(relation);
- Oid replidindex;
- Relation idx_rel;
 char replident = relation->rd_rel->relreplident;
- HeapTuple key_tuple = NULL;
+ Bitmapset *idattrs;
+ HeapTuple key_tuple;
 bool nulls[MaxHeapAttributeNumber];
 Datum values[MaxHeapAttributeNumber];
- int natt;

 *copy = false;

@@ -7624,7 +7626,7 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *
 if (HeapTupleHasExternal(tp))
 {
 *copy = true;
- tp = toast_flatten_tuple(tp, RelationGetDescr(relation));
+ tp = toast_flatten_tuple(tp, desc);
 }
 return tp;
 }

@@ -7633,41 +7635,37 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *
 if (!key_changed)
 return NULL;

- /* find the replica identity index */
- replidindex = RelationGetReplicaIndex(relation);
- if (!OidIsValid(replidindex))
+ /* find out the replica identity columns */
+ idattrs = RelationGetIndexAttrBitmap(relation,
+ INDEX_ATTR_BITMAP_IDENTITY_KEY);
+
+ if (bms_is_empty(idattrs))
 {
 elog(DEBUG4, "could not find configured replica identity for table \"%s\"",
 RelationGetRelationName(relation));
 return NULL;
 }

- idx_rel = RelationIdGetRelation(replidindex);
-
- Assert(CheckRelationLockedByMe(idx_rel, AccessShareLock, true));
-
- /* deform tuple, so we have fast access to columns */
- heap_deform_tuple(tp, desc, values, nulls);
-
- /* set all columns to NULL, regardless of whether they actually are */
- memset(nulls, 1, sizeof(nulls));
-
 /*
- * Now set all columns contained in the index to NOT NULL, they cannot
- * currently be NULL.
+ * Construct a new tuple containing only the replica identity columns,
+ * with nulls elsewhere. While we're at it, assert that the replica
+ * identity columns aren't null.
 */
- for (natt = 0; natt < IndexRelationGetNumberOfKeyAttributes(idx_rel); natt++)
- {
- int attno = idx_rel->rd_index->indkey.values[natt];
+ heap_deform_tuple(tp, desc, values, nulls);

- if (attno < 0)
- elog(ERROR, "system column in index");
- nulls[attno - 1] = false;
+ for (int i = 0; i < desc->natts; i++)
+ {
+ if (bms_is_member(i + 1 - FirstLowInvalidHeapAttributeNumber,
+ idattrs))
+ Assert(!nulls[i]);
+ else
+ nulls[i] = true;
 }

 key_tuple = heap_form_tuple(desc, values, nulls);
 *copy = true;
- RelationClose(idx_rel);
+
+ bms_free(idattrs);

 /*
 * If the tuple, which by here only contains indexed columns, still has
@@ -7680,7 +7678,7 @@ ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_changed, bool *
 {
 HeapTuple oldtup = key_tuple;

- key_tuple = toast_flatten_tuple(oldtup, RelationGetDescr(relation));
+ key_tuple = toast_flatten_tuple(oldtup, desc);
 heap_freetuple(oldtup);
 }
Re: REL_12_STABLE crashing with assertion failure in ExtractReplicaIdentity
Andres Freund writes:
> On 2019-08-17 01:43:45 -0400, Tom Lane wrote:
> > Yeah ... see the discussion leading up to 9c703c169,
> > https://www.postgresql.org/message-id/flat/19465.1541636036%40sss.pgh.pa.us
> > We didn't pull the trigger on removing the CMD_DELETE exception here,
> > but I think the handwriting has been on the wall for some time.
>
> ISTM there's a few different options here:
>
> 1a) We build all index infos, unconditionally. As argued in the thread
> you reference, future tableams may eventually require that anyway,
> by doing more proactive index maintenance somehow. Currently there's
> however no support for such AMs via tableam (mostly because I wasn't
> sure how exactly that'd look, and none of the already in-development
> AMs needed it).
>
> 2a) We separate acquisition of index locks from ExecOpenIndices(), and
> acquire index locks even for CMD_DELETE. Do so either during
> executor startup, or as part of AcquireExecutorLocks() (the latter
> on the basis that parsing/planning would have required the locks
> already).
>
> There's also corresponding *b) options, where we only do additional work
> for CMD_DELETE if wal_level = logical, and the table has a replica
> identity requiring use of the index during deletions. But I think
> that's clearly enough a bad idea that we can just dismiss it out of
> hand.
>
> 3) Remove the CheckRelationLockedByMe() assert from
> ExtractReplicaIdentity(), at least for 12. I don't think this is an
> all that convincing option, but it'd reduce churn relatively late in
> beta.
>
> 4) Add an index_open(RowExclusiveLock) to ExtractReplicaIdentity(). That
> seems very unconvincing to me, because we'd do so for every row.

As far as 4) goes, I think the code in ExtractReplicaIdentity is pretty duff anyway, because it doesn't bother to check for the defined failure return for RelationIdGetRelation. But if we're concerned about the cost of recalculating this stuff per-row, couldn't we cache it a little better? It should be safe to assume the set of index columns isn't changing intra-query.

... in fact, isn't all the infrastructure for that present already? Why is this code looking directly at the index at all, rather than using the relcache's rd_idattr bitmap?

I suspect we'll have to do 1a) eventually anyway, but this particular problem seems like it has a better solution. Will try to produce a patch in a bit.

regards, tom lane
Re: Proposal: roll pg_stat_statements into core
On Sun, Sep 1, 2019 at 8:48 PM David Fetter wrote:
> On Sun, Sep 01, 2019 at 08:12:15PM +0200, Pavel Stehule wrote:
> > Hi
> >
> > On Sun, Sep 1, 2019 at 8:00 PM David Fetter wrote:
> >
> > > Folks,
> > >
> > > I'd like to $Subject, on by default, with a switch to turn it off for
> > > those really at the outer edges of performance. Some reasons include:
> > >
> > > - It's broadly useful.
> > > - Right now, the barrier for turning it on is quite high. In addition
> > > to being non-core, which is already a pretty high barrier at a lot
> > > of organizations, it requires a shared_preload_libraries setting,
> > > which is pretty close to untenable in a lot of use cases.
> > > - The overhead for most use cases is low compared to the benefit.
> >
> > I don't have a strong opinion about it. pg_stat_statements is a really
> > useful extension; on the other hand:
> >
> > 1. the API is not stabilized yet - there are some patches related to this
> > extension, if I remember correctly
>
> You do.
>
> > 2. there are no numbers on what the overhead is
>
> What numbers would you suggest collecting? We could get some by
> running them with the extension loaded and not, although this goes to
> your next proposal.

I would like to see some benchmarks (pgbench read-only, higher numbers of connections).

> > Maybe a better solution would be some new API for shared memory that
> > doesn't need shared_preload_libraries.
>
> What would such an API look like?

The possibility to allocate large blocks of shared memory without the necessity of doing it at startup time.

> > It can be useful for lots of other monitoring extensions, profilers,
> > debuggers, ...
>
> It would indeed.
>
> Do you see this new API as a separate project, and if so, of
> approximately what size? Are we talking about something about the
> size of DSM? Of JIT?

Probably about DSM, and about an API for how other processes can connect to some blocks of DSM. I remember a similar request related to shared full-text dictionaries. It could be part of a project "enhancing pg_stat_statements - make it possible to load this extension without a restart".

> Best,
> David.
> -- David Fetter http://fetter.org/
> Phone: +1 415 235 3778
>
> Remember to vote!
> Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Proposal: roll pg_stat_statements into core
David Fetter writes: > - It's broadly useful. Maybe. Whether it can be argued that it's so broadly useful as to justify being on-by-default is not clear. > - Right now, the barrier for turning it on is quite high. In addition > to being non-core, which is already a pretty high barrier at a lot > of organizations, it requires a shared_preload_libraries setting, > which is pretty close to untenable in a lot of use cases. That argument seems pretty weak. It's part of contrib and therefore maintained by the same people as the "core" code. Also, I don't buy for a minute that people who would need it don't also need a bunch of other changes in default GUC settings (shared_buffers etc). > - The overhead for most use cases is low compared to the benefit. Please do not expect that we're going to accept such assertions unsupported by evidence. (As a very quick-n-dirty test, I tried "pgbench -S" and got somewhere around 4% TPS degradation with pg_stat_statements loaded. That doesn't seem negligible.) I think also that we would need to consider the migration penalty for people who already have the contrib version installed. To judge by previous cases (I'm thinking tsearch2) that could be pretty painful. Admittedly, tsearch2 might not be a good comparison, but points as simple as "the views/functions won't be in the same schema as before" are enough to cause trouble. regards, tom lane
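For what it's worth, the quick-n-dirty comparison described above can be reproduced along these lines (scale, client counts, and duration are arbitrary choices):

# baseline, extension not loaded
pgbench -i -s 10 bench
pgbench -S -c 8 -j 8 -T 60 bench

# then set shared_preload_libraries = 'pg_stat_statements' in
# postgresql.conf, restart the server, and repeat the same run
pgbench -S -c 8 -j 8 -T 60 bench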
Re: Proposal: roll pg_stat_statements into core
On Sun, Sep 01, 2019 at 08:12:15PM +0200, Pavel Stehule wrote:
> Hi
>
> On Sun, Sep 1, 2019 at 8:00 PM David Fetter wrote:
>
> > Folks,
> >
> > I'd like to $Subject, on by default, with a switch to turn it off for
> > those really at the outer edges of performance. Some reasons include:
> >
> > - It's broadly useful.
> > - Right now, the barrier for turning it on is quite high. In addition
> > to being non-core, which is already a pretty high barrier at a lot
> > of organizations, it requires a shared_preload_libraries setting,
> > which is pretty close to untenable in a lot of use cases.
> > - The overhead for most use cases is low compared to the benefit.
>
> I don't have a strong opinion about it. pg_stat_statements is a really
> useful extension; on the other hand:
>
> 1. the API is not stabilized yet - there are some patches related to this
> extension, if I remember correctly

You do.

> 2. there are no numbers on what the overhead is

What numbers would you suggest collecting? We could get some by running them with the extension loaded and not, although this goes to your next proposal.

> Maybe a better solution would be some new API for shared memory that
> doesn't need shared_preload_libraries.

What would such an API look like?

> It can be useful for lots of other monitoring extensions, profilers,
> debuggers, ...

It would indeed.

Do you see this new API as a separate project, and if so, of approximately what size? Are we talking about something about the size of DSM? Of JIT?

Best, David.

-- David Fetter http://fetter.org/ Phone: +1 415 235 3778

Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Proposal: roll pg_stat_statements into core
Hi

On Sun, Sep 1, 2019 at 8:00 PM David Fetter wrote:

> Folks,
>
> I'd like to $Subject, on by default, with a switch to turn it off for
> those really at the outer edges of performance. Some reasons include:
>
> - It's broadly useful.
> - Right now, the barrier for turning it on is quite high. In addition
> to being non-core, which is already a pretty high barrier at a lot
> of organizations, it requires a shared_preload_libraries setting,
> which is pretty close to untenable in a lot of use cases.
> - The overhead for most use cases is low compared to the benefit.

I don't have a strong opinion about it. pg_stat_statements is a really useful extension; on the other hand:

1. the API is not stabilized yet - there are some patches related to this extension, if I remember correctly

2. there are no numbers on what the overhead is

Maybe a better solution would be some new API for shared memory that doesn't need shared_preload_libraries. It can be useful for lots of other monitoring extensions, profilers, debuggers, ...

Regards Pavel

> Before I go do the patch, I'd like to see whether there's anything
> like a consensus as to the what and how of doing this.
>
> What say?
>
> Best,
> David.
> -- David Fetter http://fetter.org/
> Phone: +1 415 235 3778
>
> Remember to vote!
> Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: refactoring - share str2*int64 functions
Michaël,

>> I would put back unlikely() on overflow tests, as they are indeed unlikely
>> to occur and it may help some compilers, and cannot be harmful. It also
>> helps the code reader to know that these paths are not expected to be
>> taken often.
>
> Hm. I don't agree about that part, and the original signed portions don't
> do that. I think that we should let the callers of the routines decide if
> a problem is unlikely to happen or not, as we do now.

Hmmm. Maybe inlining propagates them, but otherwise they make sense from a compiler perspective.

> I am still not convinced that this module is worth the extra cycles to
> justify its existence though.

They allow performance tests to be done quickly; for me it is useful to keep it around, but you are the committer, you do as you feel.

[...]

>> There is no doubt that dividing 64-bit integers is a very bad idea, at
>> least on my architecture!
>
> That's surprising. I cannot reproduce that.

It seems to me that somehow you can: you have a 5 to 18 seconds drop below. There are actual reasons why some processors are more expensive than others, it is not just marketing :-)

> 2-1) mul:
> - built-in: 5.1s
> - fallback (with uint128): 8.0s
> - fallback (without uint128): 18.1s
>
> So, the built-in option is always faster, and keeping the int128 path if
> available for the multiplication makes sense, but not for the subtraction
> and the addition.

Yep.

> I am wondering if we should review the signed part further for add and
> sub, but I'd rather not touch it in this patch.

The signed overflows are trickier even; I have not paid attention to the fallback code. I agree that it is better left untouched for now.

> If you have done any changes on my previous patch, or if you have a script
> to share that I could use to attempt to reproduce your results, I would be
> happy to do so.

Hmmm. I did manual tests really. Attached is a psql script replicating them.

# with builtin overflow detection
sh> psql < oc.sql
NOTICE: int 16 mul: 00:00:02.747269 # slow
NOTICE: int 16 add: 00:00:01.83281
NOTICE: int 16 sub: 00:00:01.8501
NOTICE: uint 16 mul: 00:00:03.68362 # slower
NOTICE: uint 16 add: 00:00:01.835294
NOTICE: uint 16 sub: 00:00:02.136895 # slow
NOTICE: int 32 mul: 00:00:01.828065
NOTICE: int 32 add: 00:00:01.840269
NOTICE: int 32 sub: 00:00:01.843557
NOTICE: uint 32 mul: 00:00:02.447052 # slow
NOTICE: uint 32 add: 00:00:01.849899
NOTICE: uint 32 sub: 00:00:01.840773
NOTICE: int 64 mul: 00:00:01.839051
NOTICE: int 64 add: 00:00:01.839065
NOTICE: int 64 sub: 00:00:01.838599
NOTICE: uint 64 mul: 00:00:02.161346 # slow
NOTICE: uint 64 add: 00:00:01.839404
NOTICE: uint 64 sub: 00:00:01.838549
DO

> So, do you have more comments?

No other comments.

-- Fabien.

oc.sql Description: application/sql
Proposal: roll pg_stat_statements into core
Folks, I'd like to $Subject, on by default, with a switch to turn it off for those really at the outer edges of performance. Some reasons include: - It's broadly useful. - Right now, the barrier for turning it on is quite high. In addition to being non-core, which is already a pretty high barrier at a lot of organizations, it requires a shared_preload_libraries setting, which is pretty close to untenable in a lot of use cases. - The overhead for most use cases is low compared to the benefit. Before I go do the patch, I'd like to see whether there's anything like a consensus as to the what and how of doing this. What say? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: Yet another fast GiST build
On Fri, Aug 30, 2019 at 6:28 AM Peter Geoghegan wrote:
> On Thu, Aug 29, 2019 at 8:22 PM Alexander Korotkov wrote:
> > Alternatively you can encode size in Z-value. But this increases
> > dimensionality of space and decreases efficiency of join. Also,
> > spatial join can be made using two indexes, even just current GiST
> > without Z-values. We've prototyped that, see [1].
>
> I'm pretty sure that spatial joins generally need two spatial indexes
> (usually R-Trees). There seems to have been quite a lot of research on
> it in the 1990s.

Sure, our prototype was an implementation of one such paper. My point is that the advantages of a Z-value-ordered GiST for spatial joins are not yet clear to me, apart from faster build and smaller index, which are general advantages.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Yet another fast GiST build
On Fri, Aug 30, 2019 at 2:44 PM Andrey Borodin wrote:
> > On 30 Aug 2019, at 3:47, Alexander Korotkov wrote:
> > 1) Binary search in non-leaf pages instead of probing each key is much faster.
>
> That's a neat idea, but key union breaks ordering, even for z-order:
> for two sets of tuples X and Y,
> Xi < Yo for all i, o in N
> does not guarantee union(X) < union(Y).
>
> For example consider this z-ordered keyspace (picture attached)
>
> union(5, 9) is z-order-smaller than union(4,4)
>
> I'm not even sure we can use sorted search for choosing a subtree for insertion.

Sorry, I didn't explain my proposal in enough detail. I didn't mean that the B-tree separator keys would be the same as the union key (MBR). I mean a B-tree on Z-values, which maintains a union key in addition to the separator keys. So, you select the downlink to insert into using the separator Z-values, and then also extend the union key (if needed). It's probably still not enough detail; I'll try to spend more time on a more detailed description later.

> How do you think, should I supply the GiST-build patch with docs and tests
> and add it to the CF? Or do we need more design discussion before?

+1 for adding it to the CF.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Write visibility map during CLUSTER/VACUUM FULL
On Sun, Sep 1, 2019 at 11:07 AM Alexander Korotkov wrote:
> This patch has to implement its own check of whether a tuple is all-visible.
> But it appears possible to simplify this check by assuming that all tuples
> have already passed HeapTupleSatisfiesVacuum(), which sets hint bits.

Forgot to check the tuple xmin against the oldest xmin. Fixed.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

0001-write-vm-during-cluster-2.patch Description: Binary data
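For readers of the patch, a simplified, hypothetical sketch of the kind of all-visibility test being described, assuming hint bits have already been set by HeapTupleSatisfiesVacuum(); this is an illustration, not the patch's actual code:

#include "postgres.h"
#include "access/htup_details.h"
#include "access/transam.h"

/* A tuple can be treated as all-visible when its inserter is known
 * committed, its xmin precedes the oldest-xmin horizon, and there is
 * no (potential) deleter or locker. */
static bool
tuple_all_visible(HeapTuple tup, TransactionId oldest_xmin)
{
	HeapTupleHeader header = tup->t_data;
	TransactionId xmin = HeapTupleHeaderGetXmin(header);

	if (!HeapTupleHeaderXminCommitted(header))
		return false;		/* inserter not known committed */
	if (!TransactionIdPrecedes(xmin, oldest_xmin))
		return false;		/* some snapshot might not see it yet */
	if (!(header->t_infomask & HEAP_XMAX_INVALID))
		return false;		/* xmax is set: deleted or locked */
	return true;
}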
Re: range_agg
> > Btw I have working multirange_{send,recv,in,out} now. . . .

Just about all the other operators are done too, but I wonder what symbols people like for union and minus? Range uses + for union. I have working code and tests that add this:

r + mr = mr
mr + r = mr
mr + mr = mr

But I would like to use a different symbol instead, like ++, so I can have all four:

r ++ r = mr
r ++ mr = mr
mr ++ r = mr
mr ++ mr = mr

(The existing r + r operator throws an error if the inputs have a gap.) The trouble is that ++ isn't allowed. (Neither is --.) From https://www.postgresql.org/docs/11/sql-createoperator.html :

> A multicharacter operator name cannot end in + or -, unless the name also
> contains at least one of these characters:
> ~ ! @ # % ^ & | ` ?

So are there any other suggestions? I'm open to arguments that I should just use +, but I think having a way to add two simple ranges and get a multirange would be nice too, so my preference is to find a new operator. It should work with minus and intersection (* for ranges) too. Some proposals (see the sketch after this message for how one of these would be declared):

+* and -* and ** (* as in regex "zero or many" reminds me of how a multirange holds zero or many ranges. ** is confusing though because it's like exponentiation.)

@+ and @- and @* (I dunno why but I kind of like it. We already have @> and <@.)

<+> and <-> and <*> (I was hoping for (+) etc. for the math connection to circled operators, but this is close. Maybe this would be stronger if multirange_{in,out} used <> delims instead of {}, although I also like how {} is consistent with arrays.)

Anyone else have any thoughts?

Thanks, Paul
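For concreteness, here is how one of the proposed spellings could be declared; it satisfies the operator-name rule quoted above because the name contains '@'. The type and function names are placeholders, not part of the patch:

-- Hypothetical declaration; int4multirange and multirange_union()
-- stand in for whatever the patch ends up providing.
CREATE OPERATOR @+ (
    LEFTARG = int4multirange,
    RIGHTARG = int4multirange,
    FUNCTION = multirange_union,
    COMMUTATOR = @+
);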
Re: refactoring - share str2*int64 functions
On Sun, Sep 01, 2019 at 01:57:06PM +0200, Fabien COELHO wrote:
> I would put back unlikely() on overflow tests, as they are indeed unlikely
> to occur and it may help some compilers, and cannot be harmful. It also
> helps the code reader to know that these paths are not expected to be
> taken often.

Hm. I don't agree about that part, and the original signed portions don't do that. I think that we should let the callers of the routines decide if a problem is unlikely to happen or not, as we do now.

> On reflection, I'm not sure that add_u64 and sub_u64 overflow with uint128
> are useful. The res < a or b > a tricks should suffice, just like for the
> u16 and u32 cases, and it may cost a little less anyway.

Actually, I agree, and this is something I can see as well with some extra measurements. mul_u64 without int128 is twice as slow, while add_u64 and sub_u64 are 15~20% faster.

> I would suggest keeping the overflow extension as "contrib/overflow_test".
> For mul tests, I'd suggest not trying only min/max values like add/sub, but
> also "standard" multiplications that overflow or not. It would be good if
> "make check" could be made to work; for some reason it requires
> "installcheck".

Any extension out of core can only work with "installcheck", and check is not supported (see pgxs.mk). I am still not convinced that this module is worth the extra cycles to justify its existence though.

> There is no doubt that dividing 64-bit integers is a very bad idea, at
> least on my architecture!

That's surprising. I cannot reproduce that. Are you sure that you didn't just undefine HAVE_INT128? This would cause HAVE__BUILTIN_OP_OVERFLOW to still be active in all the code paths. Here are a couple of results from my side with this query, FWIW, and some numbers for all the compile flags (-O2 used):

select pg_overflow_check(1, 1, 20, 'XXX', 'XXX');

1) uint16:
1-1) mul:
- built-in: 5.5s
- fallback: 5.5s
1-2) sub:
- built-in: 5.3s
- fallback: 5.4s
1-3) add:
- built-in: 5.3s
- fallback: 6.2s

2) uint32:
2-1) mul:
- built-in: 5.1s
- fallback: 5.9s
2-2) sub:
- built-in: 5.2s
- fallback: 5.4s
2-3) add:
- built-in: 5.2s
- fallback: 6.2s

3) uint64:
3-1) mul:
- built-in: 5.1s
- fallback (with uint128): 8.0s
- fallback (without uint128): 18.1s
3-2) sub:
- built-in: 5.2s
- fallback (with uint128): 7.1s
- fallback (without uint128): 5.5s
3-3) add:
- built-in: 5.2s
- fallback (with uint128): 7.1s
- fallback (without uint128): 6.3s

So, the built-in option is always faster, and keeping the int128 path if available for the multiplication makes sense, but not for the subtraction and the addition. I am wondering if we should review the signed part further for add and sub, but I'd rather not touch it in this patch.

> Note that the checks depend on the values, so actual performance may vary
> depending on the actual val1 and val2 passed. I used 1 1 like your example.

Sure. Still that offers helpful hints, as we do the same operations for all code paths the same number of times. If you have done any changes on my previous patch, or if you have a script to share that I could use to attempt to reproduce your results, I would be happy to do so.

-- Michael
Re: refactoring - share str2*int64 functions
Bonjour Michaël,

> You are right as well that having symmetry with the signed methods is much
> better. In order to see the difference, you can just do that with the
> extension attached, after of course hijacking int.h with some undefs and
> recompiling the backend and the module:
>
> select pg_overflow_check(1, 1, 20, 'uint32', 'mul');

Ok.

> > still it is possible to trick things with signed integer arguments.
>
> Is it? If you pass -1, then you fall back to the maximum of each 16, 32 or
> 64 bits for the unsigned (see the regression tests I added with the module).
> Attached is also an updated version of the module I used to validate this
> stuff. Fabien, any thoughts?

Patch applies cleanly, compiles, "make check" ok (although the changes are untested).

I would put back unlikely() on overflow tests, as they are indeed unlikely to occur and it may help some compilers, and cannot be harmful. It also helps the code reader to know that these paths are not expected to be taken often.

On reflection, I'm not sure that add_u64 and sub_u64 overflow with uint128 are useful. The res < a or b > a tricks should suffice, just like for the u16 and u32 cases, and it may cost a little less anyway.

I would suggest keeping the overflow extension as "contrib/overflow_test". For mul tests, I'd suggest not trying only min/max values like add/sub, but also "standard" multiplications that overflow or not. It would be good if "make check" could be made to work; for some reason it requires "installcheck".

I could not test performance directly, as the loops are optimized out by the compiler, so I added "volatile" on the input value declarations to work around that. On 2B iterations I got on my laptop:

int16:  mul = 2770 ms, add = 1830 ms, sub = 1826 ms
int32:  mul = 1838 ms, add = 1835 ms, sub = 1840 ms
int64:  mul = 1836 ms, add = 1834 ms, sub = 1833 ms
uint16: mul = 3670 ms, add = 1830 ms, sub = 2148 ms
uint32: mul = 2438 ms, add = 1834 ms, sub = 1831 ms
uint64: mul = 2139 ms, add = 1841 ms, sub = 1882 ms

Why int16 mul, uint* mul and uint16 sub are bad is unclear.

With fallback code triggered with:

#undef HAVE__BUILTIN_OP_OVERFLOW

int16:  mul = 1433 ms, add = 1424 ms, sub = 1254 ms
int32:  mul = 1433 ms, add = 1425 ms, sub = 1443 ms
int64:  mul = 1430 ms, add = 1429 ms, sub = 1441 ms
uint16: mul = 1445 ms, add = 1291 ms, sub = 1265 ms
uint32: mul = 1419 ms, add = 1434 ms, sub = 1493 ms
uint64: mul = 1266 ms, add = 1430 ms, sub = 1440 ms

For some unclear reason, 4 tests are significantly faster. Forcing further down the fallback code with:

#undef HAVE_INT128

int64:  mul = 1424 ms, add = 1429 ms, sub = 1440 ms
uint64: mul = 24145 ms, add = 1434 ms, sub = 1435 ms

There is no doubt that dividing 64-bit integers is a very bad idea, at least on my architecture! Note that the checks depend on the values, so actual performance may vary depending on the actual val1 and val2 passed. I used 1 1 like your example.

These results are definitely depressing, because the fallback code is nearly twice as fast as the builtin overflow detection version. For the record: gcc 7.4.0 on ubuntu 18.04 LTS. Not sure what to advise; relying on the builtin should be the better idea…

-- Fabien.
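For readers following along, a minimal sketch of the two approaches being benchmarked, for the unsigned 64-bit case (illustrative, not the patch's code): the compiler built-ins behind HAVE__BUILTIN_OP_OVERFLOW, the "res < a" trick for addition, and the division-based multiplication check that makes the no-int128 fallback path so slow above:

#include <stdbool.h>
#include <stdint.h>

/* Addition: use the built-in when available, else the wrap-around test. */
static bool
add_u64_overflow(uint64_t a, uint64_t b, uint64_t *res)
{
#ifdef HAVE__BUILTIN_OP_OVERFLOW
	return __builtin_add_overflow(a, b, res);
#else
	*res = a + b;
	return *res < a;			/* wrapped around iff overflow */
#endif
}

/* Multiplication: the division in the fallback is what shows up as the
 * 18-24 second timings when neither built-ins nor int128 are available. */
static bool
mul_u64_overflow(uint64_t a, uint64_t b, uint64_t *res)
{
#ifdef HAVE__BUILTIN_OP_OVERFLOW
	return __builtin_mul_overflow(a, b, res);
#else
	*res = a * b;
	return a != 0 && *res / a != b;
#endif
}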
Re: Yet another fast GiST build
> On 30 Aug 2019, at 16:44, Andrey Borodin wrote:
>
> What do you think — should I supply the GiST-build patch with docs and tests and add it to the CF? Or do we need more design discussion first?

PFA v2: sort support is now part of the opclass.

There's a problem with Z-ordering NaN which causes regression tests to fail, so I decided not to add the patch to the CF (despite having a few minutes to do so). How do we correctly Z-order NaN, so that it is consistent with the semantics of the union() and consistent() functions? If one of the values is NaN, do we consider all its bits to be 1?

BTW, the patch uses union { float f; uint32 i; }. I hope that's OK, because AFAIK we have no non-IEEE-754 platforms now.

Thanks!

Best regards, Andrey Borodin.

v2-0001-Add-sort-support-for-point-gist_point_sortsupport.patch
v2-0002-Implement-GiST-build-using-sort-support.patch
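For readers following along, here is a hedged standalone sketch of the two ingredients discussed: an order-preserving float-to-uint32 mapping (memcpy instead of the union, which behaves the same here while sidestepping aliasing rules) and bit interleaving into a Z-order key. The names are illustrative, not the patch; forcing a NaN's mapped value to all-one bits — one reading of the suggestion above — would sort it after every ordinary value:

#include <stdint.h>
#include <string.h>

/* Map an IEEE-754 float to a uint32 whose unsigned order matches float order. */
static uint32_t
float_to_ordered_uint32(float f)
{
	uint32_t u;

	memcpy(&u, &f, sizeof(u));
	/* negatives: flip all bits; non-negatives: flip only the sign bit */
	return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

/* Interleave the bits of x and y into a 64-bit Z-order (Morton) key. */
static uint64_t
zorder_key(uint32_t x, uint32_t y)
{
	uint64_t key = 0;

	for (int i = 0; i < 32; i++)
	{
		key |= ((uint64_t) ((x >> i) & 1)) << (2 * i);
		key |= ((uint64_t) ((y >> i) & 1)) << (2 * i + 1);
	}
	return key;
}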
Re: row filtering for logical replication
On 2019-09-01 02:28, Euler Taveira wrote:
> On Sun, Feb 3, 2019 at 07:14 Andres Freund wrote:
>> As far as I can tell, the patch has not been refreshed since. So I'm marking this as returned with feedback for now. Please resubmit once ready.
>
> I fixed all of the bugs pointed out in this thread. I decided to disallow
>
> 0001-Remove-unused-atttypmod-column-from-initial-table-sy.patch
> 0002-Store-number-of-tuples-in-WalRcvExecResult.patch
> 0003-Refactor-function-create_estate_for_relation.patch
> 0004-Rename-a-WHERE-node.patch
> 0005-Row-filtering-for-logical-replication.patch
> 0006-Print-publication-WHERE-condition-in-psql.patch
> 0007-Publication-where-condition-support-for-pg_dump.patch
> 0008-Debug-for-row-filtering.patch

Hi,

The first 4 of these apply without error, but I can't get 0005 to apply. This is what I use:

patch --dry-run -b -l -F 5 -p 1 < /home/aardvark/download/pgpatches/0130/logrep_rowfilter/20190901/0005-Row-filtering-for-logical-replication.patch
checking file doc/src/sgml/catalogs.sgml
Hunk #1 succeeded at 5595 (offset 8 lines).
checking file doc/src/sgml/ref/alter_publication.sgml
checking file doc/src/sgml/ref/create_publication.sgml
checking file src/backend/catalog/pg_publication.c
checking file src/backend/commands/publicationcmds.c
Hunk #1 succeeded at 352 (offset 8 lines).
Hunk #2 succeeded at 381 (offset 8 lines).
Hunk #3 succeeded at 539 (offset 8 lines).
Hunk #4 succeeded at 570 (offset 8 lines).
Hunk #5 succeeded at 601 (offset 8 lines).
Hunk #6 succeeded at 626 (offset 8 lines).
Hunk #7 succeeded at 647 (offset 8 lines).
Hunk #8 succeeded at 679 (offset 8 lines).
Hunk #9 succeeded at 693 (offset 8 lines).
checking file src/backend/parser/gram.y
checking file src/backend/parser/parse_agg.c
checking file src/backend/parser/parse_expr.c
Hunk #4 succeeded at 3571 (offset -2 lines).
checking file src/backend/parser/parse_func.c
Hunk #1 succeeded at 2516 (offset -13 lines).
checking file src/backend/replication/logical/tablesync.c
checking file src/backend/replication/logical/worker.c
checking file src/backend/replication/pgoutput/pgoutput.c
Hunk #1 FAILED at 12.
Hunk #2 succeeded at 60 (offset 2 lines).
Hunk #3 succeeded at 336 (offset 2 lines).
Hunk #4 succeeded at 630 (offset 2 lines).
Hunk #5 succeeded at 647 (offset 2 lines).
Hunk #6 succeeded at 738 (offset 2 lines).
1 out of 6 hunks FAILED
checking file src/include/catalog/pg_publication.h
checking file src/include/catalog/pg_publication_rel.h
checking file src/include/catalog/toasting.h
checking file src/include/nodes/nodes.h
checking file src/include/nodes/parsenodes.h
Hunk #1 succeeded at 3461 (offset -1 lines).
Hunk #2 succeeded at 3486 (offset -1 lines).
checking file src/include/parser/parse_node.h
checking file src/include/replication/logicalrelation.h
checking file src/test/regress/expected/publication.out
Hunk #1 succeeded at 116 (offset 9 lines).
checking file src/test/regress/sql/publication.sql
Hunk #1 succeeded at 69 with fuzz 1 (offset 9 lines).
checking file src/test/subscription/t/013_row_filter.pl

Perhaps that can be fixed?

Thanks,

Erik Rijkers
[Patch] Add a reset_computed_values function in pg_stat_statements
Hello,

pg_stat_statements is a great tool to track performance issues in live databases, especially when adding interfaces like PoWA on top of it. But so far, tools like PoWA cannot track the min_time, max_time, mean_time and sum_var_time of queries: these statistics are cumulated over time, so fetching them at points in time would be of little to no use, especially when looking at the impact of a DDL change. This patch thus introduces a simple pg_stat_statements_reset_computed_values function that resets the computed statistics while leaving all other information intact, enabling the aforementioned scenario.

Regards

Pierre Ducroquet

From 5e6d16d738e5279968321c5f3695e72ded2432db Mon Sep 17 00:00:00 2001
From: Pierre
Date: Thu, 14 Feb 2019 14:37:48 +0100
Subject: [PATCH] Add a function to reset the cumulative statistics

pg_stat_statements has two parts: raw statistics, which are simple 'stable' counters, and computed statistics (averages, min time, max time...). When using pg_stat_statements to find and fix performance issues, being able to reset the computed statistics can help track the issue and the impact of the fixes. This would also make it possible for tools like powa to collect these statistics too and track them over time by resetting them after each collection.
---
 .../pg_stat_statements--1.7--1.8.sql        | 11 +++++
 .../pg_stat_statements/pg_stat_statements.c | 52 ++++++++++++++++++---
 .../pg_stat_statements.control              |  2 +-
 3 files changed, 61 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql

diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql b/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
new file mode 100644
index 00..7690a9ceba
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
@@ -0,0 +1,11 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.8'" to load this file. \quit
+
+CREATE FUNCTION pg_stat_statements_reset_computed_values()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C PARALLEL SAFE;
+
+
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 221b47298c..3a6c227a80 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -150,6 +150,7 @@ typedef struct Counters
 	double		max_time;		/* maximum execution time in msec */
 	double		mean_time;		/* mean execution time in msec */
 	double		sum_var_time;	/* sum of variances in execution time in msec */
+	int64		computed_calls;	/* # of times executed considered for the previous computed values */
 	int64		rows;			/* total # of retrieved or affected rows */
 	int64		shared_blks_hit;	/* # of shared buffer hits */
 	int64		shared_blks_read;	/* # of shared disk blocks read */
@@ -289,6 +290,7 @@ static bool pgss_save;			/* whether to save stats across shutdown */
 void		_PG_init(void);
 void		_PG_fini(void);
 
+PG_FUNCTION_INFO_V1(pg_stat_statements_reset_computed_values);
 PG_FUNCTION_INFO_V1(pg_stat_statements_reset);
 PG_FUNCTION_INFO_V1(pg_stat_statements_reset_1_7);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
@@ -1252,8 +1254,9 @@ pgss_store(const char *query, uint64 queryId,
 			e->counters.usage = USAGE_INIT;
 
 		e->counters.calls += 1;
+		e->counters.computed_calls += 1;
 		e->counters.total_time += total_time;
-		if (e->counters.calls == 1)
+		if (e->counters.computed_calls == 1)
 		{
 			e->counters.min_time = total_time;
 			e->counters.max_time = total_time;
@@ -1268,7 +1271,7 @@ pgss_store(const char *query, uint64 queryId,
 			double		old_mean = e->counters.mean_time;
 
 			e->counters.mean_time +=
-				(total_time - old_mean) / e->counters.calls;
+				(total_time - old_mean) / e->counters.computed_calls;
 			e->counters.sum_var_time +=
 				(total_time - old_mean) * (total_time - e->counters.mean_time);
@@ -1324,7 +1327,7 @@ pg_stat_statements_reset_1_7(PG_FUNCTION_ARGS)
 }
 
 /*
- * Reset statement statistics.
+ * Reset all statement statistics.
  */
 Datum
 pg_stat_statements_reset(PG_FUNCTION_ARGS)
@@ -1334,6 +1337,49 @@ pg_stat_statements_reset(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
 
+/*
+ * Reset computed statistics from all statements.
+ */
+Datum
+pg_stat_statements_reset_computed_values(PG_FUNCTION_ARGS)
+{
+	if (!pgss || !pgss_hash)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("pg_stat_statements must be loaded via shared_preload_libraries")));
+	entry_reset_computed();
+	PG_RETURN_VOID();
+}
+
+
+/*
+ * Reset statement computed statistics.
+ * This takes a shared lock only on the hash table, and a lock per entry
+ */
+static void
+entry_reset_computed(void)
+{
+	HASH_SEQ_STATUS hash_seq;
+	pgssEntry  *entry;
+
+	/* Lookup the hash table entry with
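As context for the mean/variance hunks above, here is a standalone sketch of the online (Welford-style) update that pg_stat_statements performs, with illustrative names. The point of the patch is that zeroing n, mean and sum_var restarts this computation while the raw counters keep accumulating:

#include <stdint.h>

typedef struct
{
	int64_t	n;			/* number of samples (computed_calls above) */
	double	mean;		/* running mean (mean_time) */
	double	sum_var;	/* sum of squared deviations (sum_var_time) */
} OnlineStats;

/* Welford's online update: fold in one new sample. */
static void
online_stats_add(OnlineStats *s, double x)
{
	double old_mean = s->mean;

	s->n += 1;
	s->mean += (x - old_mean) / s->n;
	s->sum_var += (x - old_mean) * (x - s->mean);
}

/* The variance is s->sum_var / s->n once n > 0; resetting the struct to
 * zero restarts the computation, which is what entry_reset_computed()
 * aims to do for each hash-table entry. */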
Write visibility map during CLUSTER/VACUUM FULL
Hi!

I found it weird that CLUSTER/VACUUM FULL don't write the visibility map. The attached patch implements writing the visibility map in heapam_relation_copy_for_cluster().

I've studied the previous attempt to implement this [1]. The main problem with that attempt was its use of the existing heap_page_is_all_visible() and visibilitymap_set() functions. These functions work through the buffer manager, while heap rewriting bypasses the buffer manager. In my patch, visibility map pages are handled the same way heap pages are: RewriteStateData holds the contents of a single VM page; once a heap page is finished, the corresponding bits are set in the VM page, and so on. VM pages are flushed one by one to smgr.

This patch has to implement its own check of whether a tuple is all-visible, but it turns out this check can be simplified by assuming that all tuples have already been through HeapTupleSatisfiesVacuum(), which sets hint bits.

Links
1. https://www.postgresql.org/message-id/flat/20131203162556.GC27105%40momjian.us#90e4a6e77d92076650dcf1d96b32ba38

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

0001-write-vm-during-cluster-1.patch
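As a rough illustration of that bookkeeping (a sketch only, not the patch: the constants are simplified guesses, a real visibility-map page is a full PostgreSQL page with its own header, and the flush would go through smgrextend()/smgrwrite() rather than an empty stub):

#include <stdint.h>
#include <string.h>

#define BLCKSZ				8192
#define PAGE_HEADER_SIZE	24		/* rough stand-in for a PG page header */
#define BITS_PER_HEAPBLOCK	2		/* all-visible + all-frozen */
#define HEAPBLOCKS_PER_PAGE	(((BLCKSZ - PAGE_HEADER_SIZE) * 8) / BITS_PER_HEAPBLOCK)

typedef struct
{
	uint8_t		vm_page[BLCKSZ];	/* contents of the current VM page */
	uint32_t	vm_blkno;			/* VM block number cached in vm_page */
} RewriteVMState;

/* stand-in for writing the finished page straight to storage */
static void
vm_flush(RewriteVMState *st)
{
	(void) st;		/* real code: smgrextend()/smgrwrite() at st->vm_blkno */
}

/* Heap blocks are produced in increasing order during a rewrite, so when a
 * heap block maps past the cached VM page, the old page is complete and can
 * be flushed before starting the next one. */
static void
vm_set_all_visible(RewriteVMState *st, uint32_t heap_blkno)
{
	uint32_t	vm_blkno = heap_blkno / HEAPBLOCKS_PER_PAGE;
	uint32_t	bitnum = (heap_blkno % HEAPBLOCKS_PER_PAGE) * BITS_PER_HEAPBLOCK;

	if (vm_blkno != st->vm_blkno)
	{
		vm_flush(st);
		memset(st->vm_page, 0, sizeof(st->vm_page));
		st->vm_blkno = vm_blkno;
	}
	/* set the all-visible bit of this heap block's two-bit slot */
	st->vm_page[PAGE_HEADER_SIZE + bitnum / 8] |= (uint8_t) (1u << (bitnum % 8));
}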
Re: Should we add xid_current() or a int8->xid cast?
On Fri, Aug 2, 2019 at 10:42 PM Thomas Munro wrote:
> On Thu, Jul 25, 2019 at 1:11 PM Thomas Munro wrote:
> > On Thu, Jul 25, 2019 at 12:42 PM Tom Lane wrote:
> > > Andres Freund writes:
> > > > On 2019-07-24 20:34:39 -0400, Tom Lane wrote:
> > > >> Yeah, I would absolutely NOT recommend that you open that can of worms
> > > >> right now. We have looked at adding unsigned integer types in the past
> > > >> and it looked like a mess.
> > > >
> > > > I assume Thomas was thinking more of another bespoke type like xid, just
> > > > wider. There's some notational advantage in not being able to
> > > > immediately do math etc. on xids.
> > >
> > > Well, we could invent an xid8 type if we want, just don't try to make
> > > it part of the numeric hierarchy (as indeed xid isn't).
> >
> > Yeah, I meant an xid64/xid8/fxid/pg_something/... type that isn't a
> > kind of number.

I thought about how to deal with the transition to xid8 for the txid_XXX() family of functions. The best idea I've come up with so far is to create a parallel xid8_XXX() family of functions, declare the bigint-based functions deprecated, and threaten to drop them from a future release. The C code for the two families can be the same (it's a bit of a dirty trick, but only until the txid_XXX() variants go away).

Here's a PoC patch demonstrating that. It's not tested much yet and probably needs more work, but I wanted to see first whether others think the idea is terrible. I wonder if there is a better way to share hash functions than the hack in check_hash_func_signature(), which I had to extend to cover xid8.

Adding to CF.

--
Thomas Munro
https://enterprisedb.com

0001-Add-SQL-type-xid8-to-expose-FullTransactionId-to-use.patch
0002-Introduce-xid8-variants-of-the-txid_XXX-fmgr-functio.patch
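To illustrate the code-sharing trick described above — one C body backing two SQL-visible functions — here is a hedged sketch assuming PG12-era APIs (FullTransactionId, GetTopFullTransactionId(), U64FromFullTransactionId()); the function name is illustrative and this is not the actual PoC patch:

#include "postgres.h"
#include "fmgr.h"
#include "access/transam.h"
#include "access/xact.h"

/* Both pg_proc entries (the deprecated txid_ variant returning bigint and
 * the new xid8 variant) can point at this one symbol, since the Datum
 * representation is the same 64-bit value either way. */
PG_FUNCTION_INFO_V1(xid8_current_sketch);

Datum
xid8_current_sketch(PG_FUNCTION_ARGS)
{
	FullTransactionId fxid = GetTopFullTransactionId();

	PG_RETURN_INT64((int64) U64FromFullTransactionId(fxid));
}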