Re: Online enabling of checksums

2018-02-22 Thread Peter Eisentraut
On 2/21/18 15:53, Magnus Hagander wrote:
> *Two new functions are added, pg_enable_data_checksums() and
> pg_disable_data_checksums(). The disable one is easy -- it just changes
> to disable. The enable one will change the state to inprogress, and then
> start a background worker (the “checksumhelper launcher”). This worker
> in turn will start one sub-worker (“checksumhelper worker”) in each
> database (currently all done sequentially).*

This is at least the fourth incarnation of the launcher-plus-worker
background worker pattern.  I wonder whether we can do something to make
this easier and less repetitive.  Not in this patch, of course.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Drop --disable-floatN-byval configure options?

2018-02-22 Thread Peter Eisentraut
On 2/21/18 12:17, Tom Lane wrote:
> I don't actually envision changing the C code much at all; we might want
> to resurrect the old code at some point.  I just want to reduce the number
> of supported configurations.

Yeah, it can exist like EXEC_BACKEND, as a way to manually simulate a
different architecture for testing.  But we don't need it as an option
exposed to users.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] SERIALIZABLE with parallel query

2018-02-22 Thread Thomas Munro
On Fri, Jan 26, 2018 at 4:24 AM, Robert Haas  wrote:
> I took a look at this today and thought it might be OK to commit,

Thank you for looking at this!

> modulo a few minor issues: (1) you didn't document the new tranche and

Fixed.

> (2) I prefer to avoid if (blah) { Assert(thing) } in favor of
> Assert(!blah || thing).

Done.

> But then I got a little bit concerned about ReleasePredicateLocks().
> Suppose that in the middle of a read-only transaction,
> SXACT_FLAG_RO_SAFE becomes true. The next call to
> SerializationNeededForRead in each process will call
> ReleasePredicateLocks.  In the workers, this won't do anything, so
> we'll just keep coming back there.  But in the leader, we'll go ahead
> do all that stuff, including MySerializableXact =
> InvalidSerializableXact.  But in the workers, it's still set.  Maybe
> that's OK, but I'm not sure that it's OK...

Ouch.  Yeah.  It's not OK.  If the leader gives up its
SERIALIZABLEXACT object early due to that safe-read-only optimisation,
the workers are left with a dangling pointer to a SERIALIZABLEXACT
object that has been pushed onto FinishedSerializableTransactions.
From there it will move to PredXact->availableTransactions and might
be handed out to some other transaction, so it is not safe to retain a
pointer to that object.

The best solution I have come up with so far is to add a reference
count to SERIALIZABLEXACT.  I toyed with putting the refcount into the
DSM instead, but then I ran into problems making that work when you
have a query with multiple Gather nodes.  Since the refcount is in
SERIALIZABLEXACT I also had to add a generation counter so that I
could detect the case where you try to attach too late (the leader has
already errored out, the refcount has reached 0 and the
SERIALIZABLEXACT object has been recycled).
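
To make that concrete, here is a minimal, self-contained sketch of the
refcount-plus-generation idea (names are illustrative only -- this is not
the code in ssi-parallel-v11.patch, and the real thing of course needs the
predicate lock manager's locking around these operations):

/*
 * Sketch only: a slot is recycled when its refcount drops to zero, and
 * recycling bumps the generation, so a too-late attach can be detected.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct SxactSlot
{
	int			refcount;		/* how many backends reference this slot */
	unsigned	generation;		/* bumped every time the slot is recycled */
} SxactSlot;

typedef struct SxactHandle
{
	SxactSlot  *slot;
	unsigned	generation;		/* generation observed when handed out */
} SxactHandle;

/* A worker may attach only if the slot has not been recycled since the
 * handle was created, i.e. the generation still matches. */
static bool
attach_slot(SxactHandle *h)
{
	if (h->slot->generation != h->generation)
		return false;			/* leader already gone; slot was recycled */
	h->slot->refcount++;
	return true;
}

/* The slot is recycled only once the last reference goes away. */
static void
detach_slot(SxactHandle *h)
{
	if (--h->slot->refcount == 0)
		h->slot->generation++;	/* slot may now be handed to someone else */
}

int
main(void)
{
	SxactSlot	slot = {1, 0};			/* the leader already holds one reference */
	SxactHandle	handle = {&slot, 0};	/* what the leader publishes in the DSM */

	printf("worker attach: %d\n", attach_slot(&handle));	/* succeeds */
	detach_slot(&handle);				/* worker releases its reference */
	detach_slot(&handle);				/* leader releases; slot gets recycled */
	printf("late attach: %d\n", attach_slot(&handle));		/* fails */
	return 0;
}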

The attached is a draft patch only, needing some testing and polish.
Brickbats, better ideas?

FWIW I also considered a couple of other ideas:  (1) Keeping the
object alive on the FinishedSerializableTransactions list until the
leader's transaction is finished seems broken because we need to be
able to spill that list to the SLRU at any time, and if we somehow
made them sticky we could run out of memory.  (2) Anything involving
the leader having sole control of the object lifetime seems
problematic... well, it might work if you disabled the
SXACT_FLAG_RO_SAFE optimisation so that ReleasePredicateLocks() always
happens after all workers have finished, but that seems like cheating.

PS  I noticed that for BecomeLockGroupMember() we say "If we can't
join the lock group, the leader has gone away, so just exit quietly"
but for various other similar things we spew errors (most commonly
seen one being "ERROR:  could not map dynamic shared memory segment").
Intentional?

-- 
Thomas Munro
http://www.enterprisedb.com


ssi-parallel-v11.patch
Description: Binary data


Re: [HACKERS] Runtime Partition Pruning

2018-02-22 Thread David Rowley
On 22 February 2018 at 22:31, Amit Langote wrote:
> Some comments:

Hi Amit,

Thanks for looking at this. I'll work through your comments and
produce a patch sometime in the near future.

One problem that I'm facing now is down to the way I'm gathering the
ParamIds that match the partkeys. As you'll see from the patch I've
added a 'paramids' field to PartitionPruneContext and I'm populating
this when the clauses are being pre-processed in
extract_partition_clauses(). The problem is that the or_clauses are
not pre-processed at all, so the current patch will not properly
perform run-time pruning when the Params are "hidden" in OR branches.
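
To illustrate the kind of case that currently escapes run-time pruning (a
made-up example, not taken from the patch or its tests):

create table orp (a int) partition by list (a);
create table orp1 partition of orp for values in (1);
create table orp2 partition of orp for values in (2);
prepare qor (int, int) as select * from orp where a = $1 or a = $2;
explain analyze execute qor(1, 1); -- repeat until a generic plan is used;
                                   -- the Params sit inside an OR clause, so
                                   -- they are never matched to the partition
                                   -- key and no partitions get pruned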

One way I thought to fix this was to change the clause processing to
create an array of PartitionClauseInfos, one for each OR branch. This
would also improve the performance of run-time pruning, since all of the
OR clauses would then be matched to the partition keys just once, rather
than having to redo that work each time a Param changes its value.

If I go and write a patch to do that, would you want it in your patch,
or would you rather I kept it over here?  Or perhaps you have a better
idea on how to solve...?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Incorrect grammar

2018-02-22 Thread Etsuro Fujita
Here is a tiny patch to fix $SUBJECT in a comment in execPartition.c.

Best regards,
Etsuro Fujita
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 4048c3e..882b538 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -140,10 +140,10 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 partrel = leaf_part_rri->ri_RelationDesc;
 
 /*
- * This is required in order to we convert the partition's
- * tuple to be compatible with the root partitioned table's
- * tuple descriptor.  When generating the per-subplan result
- * rels, this was not set.
+ * This is required in order to convert the partition's tuple
+ * to be compatible with the root partitioned table's tuple
+ * descriptor.  When generating the per-subplan result rels,
+ * this was not set.
  */
 leaf_part_rri->ri_PartitionRoot = rel;
 


Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-02-22 Thread Etsuro Fujita

(2018/02/22 11:52), Amit Langote wrote:

> On 2018/02/21 20:54, Etsuro Fujita wrote:
>> void
>> BeginForeignRouting(ModifyTableState *mtstate,
>>                     ResultRelInfo *resultRelInfo,
>>                     int partition_index);
>>
>> Prepare for a tuple-routing operation on a foreign table.  This is called
>> from ExecSetupPartitionTupleRouting and ExecInitPartitionInfo.
>
> I wonder why partition_index needs to be made part of this API?


The reason is that I think the FDW might want to look at the partition
info stored in mtstate->mt_partition_tuple_routing for some reason or
other, such as parent_child_tupconv_maps and child_parent_tupconv_maps,
which are accessed with the given partition index.



>> Patch foreign-routing-fdwapi-WIP.patch: main patch to add new FDW APIs,
>> which is created on top of patch postgres-fdw-refactoring-WIP.patch and
>> the lazy-initialization-of-partition-info patch [1].
>
> Noticed a typo in the patch (s/parition/partition/g):
>
> +		* Also let the FDW init itself if this parition is foreign.
>
> +		* Also let the FDW init itself if this parition is foreign.


Good catch!  Will fix.


> Perhaps an independent concern, but one thing I noticed is that it does
> not seem to play well with the direct modification (update push-down)
> feature.  Now because updates (at least local, let's say) support
> re-routing, I thought we'd be able to move rows across servers via the
> local server, but with direct modification we'd never get the chance.
> However, since update tuple routing is triggered by partition constraint
> failure, which we don't enforce for foreign tables/partitions anyway, I'm
> not sure if we need to do anything about that, and even if we did,
> whether it concerns this proposal.


Good point!  Actually, the update tuple routing we have in HEAD doesn't
allow re-routing tuples from foreign partitions even without direct
modification.  Here is an example using postgres_fdw:


postgres=# create table pt (a int, b text) partition by list (a);
CREATE TABLE
postgres=# create table loct (a int check (a in (1)), b text);
CREATE TABLE
postgres=# create foreign table remp partition of pt for values in (1) 
server loopback options (table_name 'loct');

CREATE FOREIGN TABLE
postgres=# create table locp partition of pt for values in (2);
CREATE TABLE
postgres=# insert into remp values (1, 'foo');
INSERT 0 1
postgres=# insert into locp values (2, 'bar');
INSERT 0 1
postgres=# select tableoid::regclass, * from pt;
 tableoid | a |  b
----------+---+-----
 remp     | 1 | foo
 locp     | 2 | bar
(2 rows)

postgres=# create function postgres_fdw_abs(int) returns int as $$begin 
return abs($1); end$$ language plpgsql immutable;

CREATE FUNCTION
postgres=# explain verbose update pt set a = postgres_fdw_abs(a) + 1 
where b = 'foo';

                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Update on public.pt  (cost=100.00..154.54 rows=12 width=42)
   Foreign Update on public.remp
     Remote SQL: UPDATE public.loct SET a = $2 WHERE ctid = $1
   Update on public.locp
   ->  Foreign Scan on public.remp  (cost=100.00..127.15 rows=6 width=42)
         Output: (postgres_fdw_abs(remp.a) + 1), remp.b, remp.ctid
         Remote SQL: SELECT a, b, ctid FROM public.loct WHERE ((b = 'foo'::text)) FOR UPDATE
   ->  Seq Scan on public.locp  (cost=0.00..27.39 rows=6 width=42)
         Output: (postgres_fdw_abs(locp.a) + 1), locp.b, locp.ctid
         Filter: (locp.b = 'foo'::text)
(10 rows)

postgres=# update pt set a = postgres_fdw_abs(a) + 1 where b = 'foo';
ERROR:  new row for relation "loct" violates check constraint "loct_a_check"
DETAIL:  Failing row contains (2, foo).
CONTEXT:  Remote SQL command: UPDATE public.loct SET a = $2 WHERE ctid = $1

To be able to move tuples across foreign servers, I think we would first
have to do something to allow re-routing tuples from foreign partitions.
The patches I proposed haven't done anything about that, so the patched
version would behave the same way as HEAD with/without direct modification.



> That said, I saw in the changes to ExecSetupPartitionTupleRouting() that
> BeginForeignRouting() is called for a foreign partition even if direct
> modification might already have been set up.  If direct modification is
> set up, then ExecForeignRouting() will never get called, because we'd
> never call ExecUpdate() or ExecInsert().


Yeah, but I am thinking of leaving support for re-routing tuples across
foreign servers to another patch.


One difference between HEAD and the patched version would be: we can 
re-route tuples from a plain partition to a foreign partition if the 
foreign partition supports this tuple-routing API.  Here is an example 
using the above data:


postgres=# select tableoid::regclass, * from pt;
 tableoid | a |  b
--+---+-
 remp | 1 | foo
 locp | 2 | bar
(2 rows)

postgres=# update pt set a = 1 where b = 'bar';
UPDATE 1

Re: Hash Joins vs. Bloom Filters / take 2

2018-02-22 Thread Claudio Freire
On Wed, Feb 21, 2018 at 11:21 PM, Tomas Vondra wrote:
> On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
>> On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra wrote:
>>>> I suspect that it could make sense to use a Bloom filter to
>>>> summarize the entire inner side of the join all at once, even
>>>> when there are multiple batches. I also suspect that this is
>>>> particularly beneficial with parallel hash joins, where
>>>> IPC/synchronization overhead can be a big problem.

>>>
>>> But that's what the patch does, currently - the filter is built
>>> during the initial pass through the data, and then used for all
>>> batches.
>>
>> I misunderstood. I would probably do something like double or triple
>> the original rows estimate instead, though. The estimate must be at
>> least slightly inaccurate when we get to this point, but I don't
>> think that that's a good enough reason to give up on the estimate
>> completely.
>>
>
> That's a problem only for the multi-batch case, though.
>
> With a single batch we can walk the hash table and count non-empty
> buckets, to get a good ndistinct estimate cheaply. And then size the
> filter considering both memory requirements (fits into CPU cache) and
> false positive rate. There are other things we may need to consider
> (memory usage vs. work_mem) but that's a separate issue.
>
> With multiple batches I think we could use the "size the bloom filter
> for a fraction of work_mem" which the current patch uses when switching
> to multiple batches halfway-through. That pretty much entirely ignores
> the estimate and essentially replaces it with a "fictional" estimate.
>
> I think that's a better approach than using some arbitrary multiple of
> the estimate. When we have to start batching halfway through, the
> estimate is proven to be rather bogus anyway, but we may treat it as a
> lower boundary for the bloom filter size.

...

>> As I said, X should not be a portion of work_mem, because that has
>> only a weak relationship to what really matters.
>>
>
> I agree a fixed fraction of work_mem may not be the right thing, but the
> goal was to make the bloom filter part of the Hash memory budget, i.e.
>
> bloom filter + hash table <= work_mem
>
> (which I think we agree should be the case), without increasing the
> number of batches too much. For example, if you size the filter ignoring
> this, and it end up being 90% of work_mem, you may need to do the hash
> join in 128 batches instead of just 16. Or something like that.
>
> Maybe that would still be a win, though. Firstly, the higher number of
> batches may not have a huge impact - in one case we need to serialize
> 15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
> accurate filter allows us to discard much more data from the outer
> relation ...

Let me reiterate: you can avoid both issues with scalable bloom filters [1].

An HLL can be used to estimate the set size; the paper makes no mention of
it, probably because it assumes only distinct items are added to the set.

[1] http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
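
For concreteness, the memory-vs-false-positive-rate trade-off being
discussed comes down to the textbook Bloom filter sizing formula; here is a
generic sketch of that arithmetic (not code from the patch):

/*
 * For n expected distinct keys and a target false-positive rate p:
 *   bits    m = -n * ln(p) / (ln 2)^2
 *   hashes  k = (m / n) * ln 2
 */
#include <math.h>
#include <stdio.h>

static void
bloom_size(double n, double p, double *bits, int *hashes)
{
	const double ln2 = log(2.0);

	*bits = ceil(-n * log(p) / (ln2 * ln2));
	*hashes = (int) round((*bits / n) * ln2);
}

int
main(void)
{
	double		bits;
	int			k;

	bloom_size(1000000.0, 0.01, &bits, &k);
	printf("1M keys at 1%% fpr: %.0f bits (~%.0f kB), %d hash functions\n",
		   bits, bits / 8.0 / 1024.0, k);
	return 0;
}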



Re: [HACKERS] path toward faster partition pruning

2018-02-22 Thread David Rowley
On 22 February 2018 at 22:48, Amit Langote wrote:
>> I'm having to add some NULL handling there for the run-time pruning
>> patch but wondered if it was also required for your patch.
>
> Hmm, not sure why.  Can you explain a bit more?

Hmm, yeah, but perhaps we should be discussing this on the other thread...

With a prepared statement the Param will be unavailable until
execution, in which case we don't do the const folding.

A simple case is:

create table listp (a int) partition by list (a);
create table listp1 partition of listp for values in(1);
prepare q1 (int) as  select * from listp where a = $1;
explain analyze execute q1(1); -- repeat 5 times.
explain analyze execute q1(null); -- partkey_datum_from_expr() gets a
NULL param via the call from nodeAppend.c


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Online enabling of checksums

2018-02-22 Thread Magnus Hagander
Re-sending this one with proper formatting. Apologies for the horrible
gmail-screws-up-the-text-part of the last one!


No change to patch or text, just the formatting.

//Magnus



Once more, here is an attempt to solve the problem of online enabling of
checksums that Daniel and I have been hacking on for a bit. See for
example
https://www.postgresql.org/message-id/CABUevEx8KWhZE_XkZQpzEkZypZmBp3GbM9W90JLp%3D-7OJWBbcg%40mail.gmail.com
and
https://www.postgresql.org/message-id/flat/FF393672-5608-46D6-9224-6620EC532693%40endpoint.com#ff393672-5608-46d6-9224-6620ec532...@endpoint.com
for some previous discussions.


Base design:

Change the checksum flag from a plain on/off boolean to an enum:
off/inprogress/on. When checksums are off or on, they work like today.
When checksums are in progress, checksums are *written* but not verified.
The state can go from “off” to “inprogress”, from “inprogress” to either
“on” or “off”, or from “on” to “off”.


Two new functions are added, pg_enable_data_checksums() and
pg_disable_data_checksums(). The disable one is easy -- it just changes to
disable. The enable one will change the state to inprogress, and then start
a background worker (the “checksumhelper launcher”). This worker in turn
will start one sub-worker (“checksumhelper worker”) in each database
(currently all done sequentially). This worker will enumerate all
tables/indexes/etc in the database and validate their checksums. If there
is no checksum, or the checksum is incorrect, it will compute a new
checksum and write it out. When all databases have been processed, the
checksum state changes to “on” and the launcher shuts down. At this point,
the cluster has checksums enabled as if it was initdb’d with checksums
turned on.
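
For illustration, a session using the new functions might look something
like this (a sketch based on the description above -- the function names
are from the patch, but exactly how the intermediate state is reported,
e.g. via SHOW data_checksums, is an assumption):

SELECT pg_enable_data_checksums();   -- state goes to "inprogress" and the
                                      -- checksumhelper launcher starts
SHOW data_checksums;                  -- reports the current state

-- ... checksumhelper workers process every database ...

SHOW data_checksums;                  -- "on" once all databases are done
SELECT pg_disable_data_checksums();   -- state goes straight back to "off"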


If the cluster shuts down while “inprogress”, the DBA will have to manually
either restart the worker (by calling pg_enable_data_checksums()) or turn
checksums off again. Checksums “in progress” carry only a cost and no
benefit.


The change of the checksum state is WAL logged with a new xlog record. All
the buffers written by the background worker are written with full page
writes forcibly enabled, to make sure the checksum is fully updated on the
standby even if no actual contents of the buffer changed.


We’ve also included a small commandline tool, bin/pg_verify_checksums, that
can be run against an offline cluster to validate all checksums. Future
improvements include being able to use the background worker/launcher to
perform an online check as well. Being able to run more parallel workers in
the checksumhelper might also be of interest.


The patch includes two sets of tests: an isolation test that turns on
checksums while one session is writing to the cluster and another is
continuously reading, to simulate turning on checksums in a production
database, and a TAP test which enables checksums with streaming
replication turned on, to test the new xlog record. The isolation test ran
into the 1024 character limit of the isolation test lexer, with a separate
patch and discussion at
https://www.postgresql.org/message-id/8d628be4-6606-4ff6-a3ff-8b2b0e9b4...@yesql.se
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4c998fe51f..dc05ac3e55 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8537,7 +8537,8 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
 or hide corruption, or other serious problems.  However, it may allow
 you to get past the error and retrieve undamaged tuples that might still be
 present in the table if the block header is still sane. If the header is
-corrupt an error will be reported even if this option is enabled. The
+corrupt an error will be reported even if this option is enabled. This
+option can only enabled when data checksums are enabled. The
 default setting is off, and it can only be changed by a superuser.

   
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1e535cf215..8000ce89df 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19412,6 +19412,64 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 
   
 
+  
+   Data Checksum Functions
+
+   
+The functions shown in  can
+be used to enable or disable data checksums in a running cluster.
+See  for details.
+   
+
+   
+Checksum SQL Functions
+
+ 
+  
+   Function
+   Return Type
+   Description
+  
+ 
+ 
+  
+   
+
+ pg_enable_data_checksums
+
+pg_enable_data_checksums()
+   
+   
+bool
+   
+   
+Initiates data checksums for the cluster. This will switch the data checksums mode
+to in progress and start a background worker that will process
+all data in the database and enable checksums for it. When all data pages have had
+checksums enabled, the cluster will automatically switch to 

Re: [HACKERS] Runtime Partition Pruning

2018-02-22 Thread Amit Langote
Hi David.

On 2018/02/21 18:06, David Rowley wrote:
> Other things I don't particularly like about the current patch are how
> I had to construct the partition key expressions in
> set_valid_runtime_subplans_recurse(). This pretty much uses code
> borrowed from set_baserel_partition_key_exprs(). One way around that
> would be to change the partprune.c code to deal with the
> partkey->partattrs and consume an expression from the list on attnum =
> 0. I didn't do that as I want to minimise how much I touch Amit's
> patch before it gets committed as doing so would likely just cause
> more headaches for me keeping this merged with his work. Another
> option to resolve this would be to put the partition key expressions
> into the PartitionPruneInfo.

Another way could be to refactor the code you've borrowed from
set_baserel_partition_key_exprs() into its own function that is exported
by some optimizer header.

I removed PartitionKey/Relation from signatures of various functions of my
patch to avoid having to re-heap_open() the relation per [1].

> I've attached v11 of the patch.

Some comments:

* I noticed that the patch adds a function to bitmapset.c which you might
want to extract into its own patch, like your patch to add bms_add_range()
that got committed as 84940644d [2].

* Maybe it's better to add `default: break;` in the two switches
you've added in partkey_datum_from_expr() (a minimal standalone
illustration follows after these comments)

partprune.c: In function ‘partkey_datum_from_expr’:
partprune.c:1555:2: warning: enumeration value ‘T_Invalid’ not handled in
switch [-Wswitch]
  switch (nodeTag(expr))

partprune.c:1562:4: warning: enumeration value ‘PARAM_SUBLINK’ not handled
in switch [-Wswitch]
switch (((Param *) expr)->paramkind)

* I see that you moved PartitionClauseInfo's definition from partprune.c
to partition.h.  Isn't it better to move it to partprune.h?
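
For what it's worth, a minimal standalone illustration of the
`default: break;` suggestion above (the enum and function are made up, not
the partprune.c code):

#include <stdio.h>

typedef enum NodeKind
{
	NK_CONST,
	NK_PARAM,
	NK_EXPR,
	NK_INVALID					/* not expected here, like T_Invalid */
} NodeKind;

static const char *
describe(NodeKind kind)
{
	switch (kind)
	{
		case NK_CONST:
			return "constant";
		case NK_PARAM:
			return "parameter";
		case NK_EXPR:
			return "expression";
		default:
			break;				/* silences -Wswitch for values we ignore */
	}
	return "unexpected node";
}

int
main(void)
{
	printf("%s\n", describe(NK_PARAM));
	printf("%s\n", describe(NK_INVALID));
	return 0;
}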

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com

[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=84940644d




Re: [HACKERS] path toward faster partition pruning

2018-02-22 Thread David Rowley
On 21 February 2018 at 23:44, Amit Langote wrote:
> Please find attached updated patches.

Thanks for updating the code.

The question I have now is around NULL handling in
partkey_datum_from_expr(). I've not managed to find a way to get a
NULL Const in there as it seems all the clauses I try get removed
somewhere earlier in planning. Do you know for a fact that a NULL
Const is impossible to get there?

I'm having to add some NULL handling there for the run-time pruning
patch but wondered if it was also required for your patch.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


