Re: Online enabling of checksums
On 2/21/18 15:53, Magnus Hagander wrote:
> Two new functions are added, pg_enable_data_checksums() and
> pg_disable_data_checksums(). The disable one is easy -- it just changes
> to disable. The enable one will change the state to inprogress, and then
> start a background worker (the “checksumhelper launcher”). This worker
> in turn will start one sub-worker (“checksumhelper worker”) in each
> database (currently all done sequentially).

This is at least the fourth instance of the launcher-plus-workers
background worker pattern.  I wonder whether we can do something to make
this easier and less repetitive.  Not in this patch, of course.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Drop --disable-floatN-byval configure options?
On 2/21/18 12:17, Tom Lane wrote:
> I don't actually envision changing the C code much at all; we might want
> to resurrect the old code at some point.  I just want to reduce the number
> of supported configurations.

Yeah, it can exist like EXEC_BACKEND, as a way to manually simulate a
different architecture for testing.  But we don't need it as an option
exposed to users.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] SERIALIZABLE with parallel query
On Fri, Jan 26, 2018 at 4:24 AM, Robert Haas wrote:
> I took a look at this today and thought it might be OK to commit,

Thank you for looking at this!

> modulo a few minor issues: (1) you didn't document the new tranche and

Fixed.

> (2) I prefer to avoid if (blah) { Assert(thing) } in favor of
> Assert(!blah || thing).

Done.

> But then I got a little bit concerned about ReleasePredicateLocks().
> Suppose that in the middle of a read-only transaction,
> SXACT_FLAG_RO_SAFE becomes true.  The next call to
> SerializationNeededForRead in each process will call
> ReleasePredicateLocks.  In the workers, this won't do anything, so
> we'll just keep coming back there.  But in the leader, we'll go ahead
> and do all that stuff, including MySerializableXact =
> InvalidSerializableXact.  But in the workers, it's still set.  Maybe
> that's OK, but I'm not sure that it's OK...

Ouch.  Yeah.  It's not OK.  If the leader gives up its SERIALIZABLEXACT
object early due to that safe-read-only optimisation, the workers are
left with a dangling pointer to a SERIALIZABLEXACT object that has been
pushed onto the FinishedSerializableTransactions list.  From there it
will move to PredXact->availableTransactions and might be handed out to
some other transaction, so it is not safe to retain a pointer to that
object.

The best solution I have come up with so far is to add a reference count
to SERIALIZABLEXACT.  I toyed with putting the refcount into the DSM
instead, but then I ran into problems making that work when you have a
query with multiple Gather nodes.  Since the refcount is in
SERIALIZABLEXACT I also had to add a generation counter so that I could
detect the case where you try to attach too late (the leader has already
errored out, the refcount has reached 0 and the SERIALIZABLEXACT object
has been recycled).

The attached is a draft patch only, needing some testing and polish.
Brickbats, better ideas?
FWIW I also considered a couple of other ideas:

(1) Keeping the object alive on the FinishedSerializableTransactions
list until the leader's transaction is finished seems broken, because we
need to be able to spill that list to the SLRU at any time, and if we
somehow made the entries sticky we could run out of memory.

(2) Anything involving the leader having sole control of the object's
lifetime seems problematic... well, it might work if you disabled the
SXACT_FLAG_RO_SAFE optimisation so that ReleasePredicateLocks() always
happens after all workers have finished, but that seems like cheating.

PS: I noticed that for BecomeLockGroupMember() we say "If we can't join
the lock group, the leader has gone away, so just exit quietly", but for
various other similar things we spew errors (the most commonly seen one
being "ERROR: could not map dynamic shared memory segment").
Intentional?

-- 
Thomas Munro
http://www.enterprisedb.com

ssi-parallel-v11.patch
Description: Binary data
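For illustration, here is a minimal single-process sketch of the
refcount-plus-generation idea described above.  All names are
hypothetical and locking/shared-memory details are omitted (the real
changes to SERIALIZABLEXACT live in the attached patch); the point is
just that a worker attaching with a stale generation number can detect
that the object was already released and recycled:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared SERIALIZABLEXACT object. */
typedef struct SharedXact
{
    uint64_t generation;    /* bumped each time the slot is recycled */
    int      refcount;      /* leader + attached workers */
} SharedXact;

/* Leader: take the initial reference and note the generation. */
uint64_t
shared_xact_init(SharedXact *sx)
{
    sx->refcount = 1;
    return sx->generation;
}

/*
 * Worker: attach only if the slot still belongs to the same transaction.
 * A generation mismatch (or refcount of zero) means we arrived too late
 * and the slot was recycled for some other transaction.
 */
bool
shared_xact_attach(SharedXact *sx, uint64_t expected_generation)
{
    if (sx->generation != expected_generation || sx->refcount == 0)
        return false;           /* too late: already released */
    sx->refcount++;
    return true;
}

/* Drop one reference; recycle the slot when the last one goes away. */
bool
shared_xact_detach(SharedXact *sx)
{
    if (--sx->refcount == 0)
    {
        sx->generation++;       /* invalidate any stale pointers */
        return true;            /* caller may now truly release it */
    }
    return false;
}
```

In this sketch, only the process that sees shared_xact_detach() return
true performs the actual release, so the object can never be handed out
to another transaction while a worker still holds a reference.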
Re: [HACKERS] Runtime Partition Pruning
On 22 February 2018 at 22:31, Amit Langote wrote:
> Some comments:

Hi Amit,

Thanks for looking at this.  I'll work through your comments and produce
a patch sometime in the near future.

One problem that I'm facing now is down to the way I'm gathering the
ParamIds that match the partkeys.  As you'll see from the patch, I've
added a 'paramids' field to PartitionPruneContext and I'm populating
this when the clauses are being pre-processed in
extract_partition_clauses().  The problem is that the or_clauses are not
pre-processed at all, so the current patch will not properly perform
run-time pruning when the Params are "hidden" in OR branches.

One way I thought to fix this was to change the clause processing to
create an array of PartitionClauseInfos, one for each OR branch.  This
would also improve the performance of run-time pruning, since all of the
OR clauses would be matched to the partition keys just once, rather than
having to redo that each time a Param changes its value.

If I go and write a patch to do that, would you want it in your patch,
or would you rather I kept it over here?  Or perhaps you have a better
idea on how to solve this...?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Incorrect grammar
Here is a tiny patch to fix $SUBJECT in a comment in execPartition.c.

Best regards,
Etsuro Fujita

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 4048c3e..882b538 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -140,10 +140,10 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 			partrel = leaf_part_rri->ri_RelationDesc;
 
 			/*
-			 * This is required in order to we convert the partition's
-			 * tuple to be compatible with the root partitioned table's
-			 * tuple descriptor. When generating the per-subplan result
-			 * rels, this was not set.
+			 * This is required in order to convert the partition's tuple
+			 * to be compatible with the root partitioned table's tuple
+			 * descriptor. When generating the per-subplan result rels,
+			 * this was not set.
 			 */
 			leaf_part_rri->ri_PartitionRoot = rel;
Re: [HACKERS] Add support for tuple routing to foreign partitions
(2018/02/22 11:52), Amit Langote wrote:
> On 2018/02/21 20:54, Etsuro Fujita wrote:
>> void
>> BeginForeignRouting(ModifyTableState *mtstate,
>>                     ResultRelInfo *resultRelInfo,
>>                     int partition_index);
>>
>> Prepare for a tuple-routing operation on a foreign table.  This is
>> called from ExecSetupPartitionTupleRouting and ExecInitPartitionInfo.
>
> I wonder why partition_index needs to be made part of this API?

The reason for that is that I think the FDW might want to look at the
partition info stored in mtstate->mt_partition_tuple_routing for some
reason or other, such as parent_child_tupconv_maps and
child_parent_tupconv_maps, which are accessed with the given partition
index.

>> Patch foreign-routing-fdwapi-WIP.patch: main patch to add new FDW
>> APIs, which is created on top of patch postgres-fdw-refactoring-WIP.patch
>> and the lazy-initialization-of-partition-info patch [1].
>
> Noticed a typo in the patch (s/parition/partition/g):
>
> +     * Also let the FDW init itself if this parition is foreign.
> +     * Also let the FDW init itself if this parition is foreign.

Good catch!  Will fix.

> Perhaps an independent concern, but one thing I noticed is that it does
> not seem to play well with the direct modification (update push-down)
> feature.  Now because updates (at least local, let's say) support
> re-routing, I thought we'd be able to move rows across servers via the
> local server, but with direct modification we'd never get the chance.
> However, since update tuple routing is triggered by partition constraint
> failure, which we don't enforce for foreign tables/partitions anyway,
> I'm not sure if we need to do anything about that, and even if we did,
> whether it concerns this proposal.

Good point!  Actually, the update tuple routing we have in HEAD doesn't
allow re-routing tuples from foreign partitions even without direct
modification.
Here is an example using postgres_fdw:

postgres=# create table pt (a int, b text) partition by list (a);
CREATE TABLE
postgres=# create table loct (a int check (a in (1)), b text);
CREATE TABLE
postgres=# create foreign table remp partition of pt for values in (1) server loopback options (table_name 'loct');
CREATE FOREIGN TABLE
postgres=# create table locp partition of pt for values in (2);
CREATE TABLE
postgres=# insert into remp values (1, 'foo');
INSERT 0 1
postgres=# insert into locp values (2, 'bar');
INSERT 0 1
postgres=# select tableoid::regclass, * from pt;
 tableoid | a |  b
----------+---+-----
 remp     | 1 | foo
 locp     | 2 | bar
(2 rows)

postgres=# create function postgres_fdw_abs(int) returns int as $$begin return abs($1); end$$ language plpgsql immutable;
CREATE FUNCTION
postgres=# explain verbose update pt set a = postgres_fdw_abs(a) + 1 where b = 'foo';
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Update on public.pt  (cost=100.00..154.54 rows=12 width=42)
   Foreign Update on public.remp
     Remote SQL: UPDATE public.loct SET a = $2 WHERE ctid = $1
   Update on public.locp
   ->  Foreign Scan on public.remp  (cost=100.00..127.15 rows=6 width=42)
         Output: (postgres_fdw_abs(remp.a) + 1), remp.b, remp.ctid
         Remote SQL: SELECT a, b, ctid FROM public.loct WHERE ((b = 'foo'::text)) FOR UPDATE
   ->  Seq Scan on public.locp  (cost=0.00..27.39 rows=6 width=42)
         Output: (postgres_fdw_abs(locp.a) + 1), locp.b, locp.ctid
         Filter: (locp.b = 'foo'::text)
(10 rows)

postgres=# update pt set a = postgres_fdw_abs(a) + 1 where b = 'foo';
ERROR:  new row for relation "loct" violates check constraint "loct_a_check"
DETAIL:  Failing row contains (2, foo).
CONTEXT:  Remote SQL command: UPDATE public.loct SET a = $2 WHERE ctid = $1

To be able to move tuples across foreign servers, I think we would first
have to do something to allow re-routing tuples from foreign partitions.
The patches I proposed haven't done anything about that, so the patched
version would behave the same way as HEAD with/without direct
modification.
> That said, I saw in the changes to ExecSetupPartitionTupleRouting()
> that BeginForeignRouting() is called for a foreign partition even if
> direct modification might already have been set up.  If direct
> modification is set up, then ExecForeignRouting() will never get
> called, because we'd never call ExecUpdate() or ExecInsert().

Yeah, but I am thinking of leaving the support for re-routing tuples
across foreign servers for another patch.  One difference between HEAD
and the patched version would be: we can re-route tuples from a plain
partition to a foreign partition if the foreign partition supports this
tuple-routing API.  Here is an example using the above data:

postgres=# select tableoid::regclass, * from pt;
 tableoid | a |  b
----------+---+-----
 remp     | 1 | foo
 locp     | 2 | bar
(2 rows)

postgres=# update pt set a = 1 where b = 'bar';
UPDATE 1
Re: Hash Joins vs. Bloom Filters / take 2
On Wed, Feb 21, 2018 at 11:21 PM, Tomas Vondra wrote:
> On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
>> On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra wrote:
>>>> I suspect that it could make sense to use a Bloom filter to
>>>> summarize the entire inner side of the join all at once, even when
>>>> there are multiple batches.  I also suspect that this is
>>>> particularly beneficial with parallel hash joins, where
>>>> IPC/synchronization overhead can be a big problem.
>>>
>>> But that's what the patch does, currently - the filter is built
>>> during the initial pass through the data, and then used for all
>>> batches.
>>
>> I misunderstood.  I would probably do something like double or triple
>> the original rows estimate instead, though.  The estimate must be at
>> least slightly inaccurate when we get to this point, but I don't
>> think that that's a good enough reason to give up on the estimate
>> completely.
>
> That's a problem only for the multi-batch case, though.
>
> With a single batch we can walk the hash table and count non-empty
> buckets, to get a good ndistinct estimate cheaply.  And then size the
> filter considering both memory requirements (fits into CPU cache) and
> false positive rate.  There are other things we may need to consider
> (memory usage vs. work_mem) but that's a separate issue.
>
> With multiple batches I think we could use the "size the bloom filter
> for a fraction of work_mem" approach which the current patch uses when
> switching to multiple batches halfway through.  That pretty much
> entirely ignores the estimate and essentially replaces it with a
> "fictional" estimate.
>
> I think that's a better approach than using some arbitrary multiple of
> the estimate.  When we have to start batching halfway through, the
> estimate is proven to be rather bogus anyway, but we may treat it as a
> lower boundary for the bloom filter size.

[...]

>> As I said, X should not be a portion of work_mem, because that has
>> only a weak relationship to what really matters.
> I agree a fixed fraction of work_mem may not be the right thing, but
> the goal was to make the bloom filter part of the Hash memory budget,
> i.e.
>
>     bloom filter + hash table <= work_mem
>
> (which I think we agree should be the case), without increasing the
> number of batches too much.  For example, if you size the filter
> ignoring this, and it ends up being 90% of work_mem, you may need to
> do the hash join in 128 batches instead of just 16.  Or something like
> that.
>
> Maybe that would still be a win, though.  Firstly, the higher number
> of batches may not have a huge impact - in one case we need to
> serialize 15/16 of the data and in the other 127/128.  That's 93% vs.
> 99%.  And if the more accurate filter allows us to discard much more
> data from the outer relation ...

Let me reiterate: you can avoid both issues with scalable bloom
filters [1].  An HLL can be used to estimate the set size; the paper
makes no mention of it, probably assuming only distinct items are added
to the set.

[1] http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
Re: [HACKERS] path toward faster partition pruning
On 22 February 2018 at 22:48, Amit Langote wrote:
>> I'm having to add some NULL handling there for the run-time pruning
>> patch but wondered if it was also required for your patch.
>
> Hmm, not sure why.  Can you explain a bit more?

hmm, yeah, but perhaps we should be discussing this on the other
thread...

With a prepared statement the Param will be unavailable until execution,
in which case we don't do the const folding.  A simple case is:

create table listp (a int) partition by list (a);
create table listp1 partition of listp for values in(1);

prepare q1 (int) as select * from listp where a = $1;

explain analyze execute q1(1); -- repeat 5 times
explain analyze execute q1(null); -- partkey_datum_from_expr() gets a NULL param via the call from nodeAppend.c

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Re: Online enabling of checksums
Re-sending this one with proper formatting.  Apologies for the horrible
gmail-screws-up-the-text-part of the last one!  No change to patch or
text, just the formatting.

//Magnus

Once more, here is an attempt to solve the problem of online enabling of
checksums that Daniel and I have been hacking on for a bit.  See for
example
https://www.postgresql.org/message-id/CABUevEx8KWhZE_XkZQpzEkZypZmBp3GbM9W90JLp%3D-7OJWBbcg%40mail.gmail.com
and
https://www.postgresql.org/message-id/flat/FF393672-5608-46D6-9224-6620EC532693%40endpoint.com#ff393672-5608-46d6-9224-6620ec532...@endpoint.com
for some previous discussions.

Base design:

The checksum flag is changed from a boolean to an enum:
off/inprogress/on.  When checksums are off or on, they work like today.
When checksums are “in progress”, checksums are *written* but not
verified.  The state can go from “off” to “inprogress”, from
“inprogress” to either “on” or “off”, or from “on” to “off”.

Two new functions are added, pg_enable_data_checksums() and
pg_disable_data_checksums().  The disable one is easy -- it just changes
the state to “off”.  The enable one will change the state to
“inprogress”, and then start a background worker (the “checksumhelper
launcher”).  This worker in turn will start one sub-worker
(“checksumhelper worker”) in each database (currently all done
sequentially).  This worker will enumerate all tables/indexes/etc in the
database and validate their checksums.  If there is no checksum, or the
checksum is incorrect, it will compute a new checksum and write it out.
When all databases have been processed, the checksum state changes to
“on” and the launcher shuts down.  At this point, the cluster has
checksums enabled as if it had been initdb’d with checksums turned on.

If the cluster shuts down while “inprogress”, the DBA will have to
manually either restart the worker (by calling
pg_enable_data_checksums()) or turn checksums off again.  Checksums “in
progress” only carry a cost and no benefit.
The change of the checksum state is WAL-logged with a new xlog record.
All the buffers written by the background worker have full page writes
forcibly enabled, to make sure the checksum is fully updated on the
standby even if no actual contents of the buffer changed.

We’ve also included a small command-line tool, bin/pg_verify_checksums,
that can be run against an offline cluster to validate all checksums.

Future improvements include being able to use the background
worker/launcher to perform an online check as well.  Being able to run
more parallel workers in the checksumhelper might also be of interest.

The patch includes two sets of tests: an isolation test turning on
checksums while one session is writing to the cluster and another is
continuously reading, to simulate turning on checksums in a production
database; and a TAP test which enables checksums with streaming
replication turned on, to test the new xlog record.  The isolation test
ran into the 1024-character limit of the isolation test lexer, with a
separate patch and discussion at
https://www.postgresql.org/message-id/8d628be4-6606-4ff6-a3ff-8b2b0e9b4...@yesql.se

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4c998fe51f..dc05ac3e55 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8537,7 +8537,8 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
         or hide corruption, or other serious problems.  However, it may allow
         you to get past the error and retrieve undamaged tuples that might
         still be present in the table if the block header is still sane. If the header is
-        corrupt an error will be reported even if this option is enabled. The
+        corrupt an error will be reported even if this option is enabled. This
+        option can only be enabled when data checksums are enabled. The
         default setting is off, and it can only be changed by a superuser.
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1e535cf215..8000ce89df 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19412,6 +19412,64 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());

Data Checksum Functions

The functions shown in the table below can be used to enable or disable
data checksums in a running cluster.  See the data checksums section for
details.

Checksum SQL Functions

  Function: pg_enable_data_checksums()
  Return Type: bool
  Description:
    Initiates data checksums for the cluster.  This will switch the data
    checksums mode to "in progress" and start a background worker that
    will process all data in the database and enable checksums for it.
    When all data pages have had checksums enabled, the cluster will
    automatically switch to
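The legal state transitions from the base design (off -> inprogress,
inprogress -> on or off, on -> off) can be summarized as a tiny
validator.  This is illustrative only, not code from the patch, and the
identifier names are hypothetical:

```c
#include <stdbool.h>

typedef enum ChecksumState
{
    CHECKSUMS_OFF,
    CHECKSUMS_INPROGRESS,
    CHECKSUMS_ON
} ChecksumState;

/*
 * Allowed transitions per the base design:
 *   off -> inprogress     (pg_enable_data_checksums)
 *   inprogress -> on      (checksumhelper finished all databases)
 *   inprogress -> off     (pg_disable_data_checksums)
 *   on -> off             (pg_disable_data_checksums)
 * Notably, off -> on is never direct: every page must first be
 * rewritten with a checksum while in the "inprogress" state.
 */
bool
checksum_transition_allowed(ChecksumState from, ChecksumState to)
{
    switch (from)
    {
        case CHECKSUMS_OFF:
            return to == CHECKSUMS_INPROGRESS;
        case CHECKSUMS_INPROGRESS:
            return to == CHECKSUMS_ON || to == CHECKSUMS_OFF;
        case CHECKSUMS_ON:
            return to == CHECKSUMS_OFF;
    }
    return false;
}
```

The asymmetry is the whole point of the design: disabling is a pure flag
flip, while enabling must pass through "inprogress" so that checksums
are written everywhere before they are ever verified.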
Re: [HACKERS] Runtime Partition Pruning
Hi David.

On 2018/02/21 18:06, David Rowley wrote:
> Other things I don't particularly like about the current patch are how
> I had to construct the partition key expressions in
> set_valid_runtime_subplans_recurse().  This pretty much uses code
> borrowed from set_baserel_partition_key_exprs().  One way around that
> would be to change the partprune.c code to deal with the
> partkey->partattrs and consume an expression from the list on attnum =
> 0.  I didn't do that as I want to minimise how much I touch Amit's
> patch before it gets committed as doing so would likely just cause
> more headaches for me keeping this merged with his work.  Another
> option to resolve this would be to put the partition key expressions
> into the PartitionPruneInfo.

Another way could be to refactor the code you've borrowed from
set_baserel_partition_key_exprs() into its own function that is exported
by some optimizer header.  I removed PartitionKey/Relation from the
signatures of various functions of my patch to avoid having to
re-heap_open() the relation, per [1].

> I've attached v11 of the patch.

Some comments:

* I noticed that the patch adds a function to bitmapset.c which you
might want to extract into its own patch, like your patch to add
bms_add_range() that got committed as 84940644d [2].

* Maybe it's better to add `default: break;` in the two switches you've
added in partkey_datum_from_expr():

partprune.c: In function ‘partkey_datum_from_expr’:
partprune.c:1555:2: warning: enumeration value ‘T_Invalid’ not handled in switch [-Wswitch]
  switch (nodeTag(expr))
partprune.c:1562:4: warning: enumeration value ‘PARAM_SUBLINK’ not handled in switch [-Wswitch]
   switch (((Param *) expr)->paramkind)

* I see that you moved PartitionClauseInfo's definition from partprune.c
to partition.h.  Isn't it better to move it to partprune.h?
Thanks,
Amit

[1] https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=84940644d
Re: [HACKERS] path toward faster partition pruning
On 21 February 2018 at 23:44, Amit Langote wrote:
> Please find attached updated patches.

Thanks for updating the code.

The question I have now is around NULL handling in
partkey_datum_from_expr().  I've not managed to find a way to get a NULL
Const in there, as it seems all the clauses I try get removed somewhere
earlier in planning.  Do you know for a fact that a NULL Const is
impossible to get there?

I'm having to add some NULL handling there for the run-time pruning
patch but wondered if it was also required for your patch.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services