Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)

2017-08-29 Thread Andres Freund
On 2017-08-30 12:52:26 +0900, Michael Paquier wrote:
> On Wed, Aug 30, 2017 at 11:20 AM, Peter Eisentraut
>  wrote:
> > On 8/29/17 20:36, Andres Freund wrote:
> >> So the question is whether we want {max,min}_wal_size be sized in
> >> multiples of segment sizes or as a proper byte size.  I'm leaning
> >> towards the latter.
> >
> > I'm not sure what the question is or what its impact would be.
> 
> FWIW, I get the question as: do we want the in-memory values of
> min/max_wal_size to be calculated in MB (which is now the case) or
> just bytes. Andres tends for using bytes.

Not quite.  There's essentially two things:

1) Currently the default for {min,max}_wal_size depends on the segment
   size. Given that the segment size is about to be configurable, that
   seems confusing.
2) Currently wal_segment_size is measured in GUC_UNIT_XBLOCKS, which
   requires us to keep two copies of the underlying variable, one in
   XBLOCKS one in bytes. I'd rather just have the byte variant.

Regards,

Andres




Re: [HACKERS] A design for amcheck heapam verification

2017-08-29 Thread Peter Geoghegan
On Tue, Aug 29, 2017 at 7:22 PM, Thomas Munro
 wrote:
> Indeed.  Thank you for working on this!  To summarise a couple of
> ideas that Peter and I discussed off-list a while back:  (1) While
> building the hash table for a hash join we could build a Bloom filter
> per future batch and keep it in memory, and then while reading from
> the outer plan we could skip writing tuples out to future batches if
> there is no chance they'll find a match when read back in later (works
> only for inner joins and only pays off in inverse proportion to the
> join's selectivity);

Right. Hash joins do tend to be very selective, though, so I think
that this could help rather a lot. With just 8 or 10 bits per element,
you can eliminate almost all the batch write-outs on the outer side.
No per-worker synchronization for BufFiles is needed when this
happens, either. It seems like that could be very important.
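
Just to put rough numbers on that (my own back-of-the-envelope math, not
anything from the patch): with the textbook false positive formula, 8 - 10
bits per element and the usual k of about 7 hash functions give a false
positive rate in the 1 - 2% range, so nearly every unmatched outer tuple
destined for a future batch can be skipped:

/*
 * Back-of-the-envelope only, not code from the patch: the standard Bloom
 * filter false positive rate is p ~= (1 - e^(-k * n / m))^k, where m/n is
 * bits per element and k is the number of hash functions.
 */
#include <math.h>
#include <stdio.h>

static double
bloom_false_positive_rate(double bits_per_element, int k)
{
    return pow(1.0 - exp(-k / bits_per_element), k);
}

int
main(void)
{
    /* ~0.008 for 10 bits/element and ~0.022 for 8 bits/element, with k = 7 */
    printf("%f\n", bloom_false_positive_rate(10.0, 7));
    printf("%f\n", bloom_false_positive_rate(8.0, 7));
    return 0;
}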

> To use this for anything involving parallelism where a Bloom filter
> must be shared we'd probably finish up having to create a shared
> version of bloom_init() that either uses caller-provided memory and
> avoids the internal pointer, or allocates DSA memory.  I suppose you
> could consider splitting your bloom_init() function up into
> bloom_estimate() and bloom_init(user_supplied_space, ...) now, and
> getting rid of the separate pointer to bitset (ie stick char
> bitset[FLEXIBLE_ARRAY_MEMBER] at the end of the struct)?

Makes sense. Not hard to add that.
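
Roughly what I have in mind is the usual estimate-then-initialize-in-place
pattern (just a sketch -- the names, the 10-bits-per-element sizing, and
k = 7 below are placeholders, not what the final patch will necessarily use):

/*
 * Sketch of the split Thomas describes: size the filter first, then
 * initialize it in caller-provided (e.g. DSA) memory, with the bitset as a
 * flexible array member instead of a separately palloc'd pointer.
 */
#include "postgres.h"

typedef struct bloom_filter
{
    int         k_hash_funcs;
    uint64      seed;
    int64       bitset_bits;
    unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];
} bloom_filter;

static int64
bloom_bitset_bytes(int64 total_elems)
{
    /* placeholder sizing: ~10 bits per element, rounded up to whole bytes */
    return (total_elems * 10 + BITS_PER_BYTE - 1) / BITS_PER_BYTE;
}

Size
bloom_estimate(int64 total_elems)
{
    return offsetof(bloom_filter, bitset) + bloom_bitset_bytes(total_elems);
}

bloom_filter *
bloom_init_in_place(void *space, int64 total_elems, uint64 seed)
{
    bloom_filter *filter = (bloom_filter *) space;
    int64       bitset_bytes = bloom_bitset_bytes(total_elems);

    filter->k_hash_funcs = 7;           /* placeholder */
    filter->seed = seed;
    filter->bitset_bits = bitset_bytes * BITS_PER_BYTE;
    memset(filter->bitset, 0, bitset_bytes);

    return filter;
}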

> I was just observing that there is an opportunity for code reuse here.
> This function's definition would ideally be something like:
>
> double
> bloom_prop_bits_set(bloom_filter *filter)
> {
> return popcount(filter->bitset, NBYTES(filter)) / (double) NBITS(filter);
> }
>
> This isn't an objection to the way you have it, since we don't have a
> popcount function yet!  We have several routines in the tree for
> counting bits, though not yet the complete set from Hacker's Delight.

Right. I'm also reminded of the lookup tables for the visibility/freeze map.

> Your patch brings us one step closer to that goal.  (The book says
> that this approach is good for sparse bitsets, but your comment says
> that we expect something near 50%.  That's irrelevant anyway since a
> future centralised popcount() implementation would do this in
> word-sized chunks with a hardware instruction or branch-free-per-word
> lookups in a table and not care at all about sparseness.)

I own a copy of Hacker's Delight (well, uh, Daniel Farina lent me his
copy about 2 years ago!). pop()/popcount() does seem like a clever
algorithm that we should probably think about adopting in some cases,
but I should point out that the current caller of my
bloom_prop_bits_set() function is an elog() DEBUG1 call. This is not
at all performance critical.

> + * Test if bloom filter definitely lacks element.
>
> I think where "Bloom filter" appears in prose it should have a capital
> letter (person's name).

Agreed.

-- 
Peter Geoghegan




Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)

2017-08-29 Thread Michael Paquier
On Wed, Aug 30, 2017 at 11:20 AM, Peter Eisentraut
 wrote:
> On 8/29/17 20:36, Andres Freund wrote:
>> So the question is whether we want {max,min}_wal_size be sized in
>> multiples of segment sizes or as a proper byte size.  I'm leaning
>> towards the latter.
>
> I'm not sure what the question is or what its impact would be.

FWIW, I get the question as: do we want the in-memory values of
min/max_wal_size to be calculated in MB (which is now the case) or
just bytes. Andres tends for using bytes.
-- 
Michael




Re: [HACKERS] expanding inheritance in partition bound order

2017-08-29 Thread Robert Haas
On Tue, Aug 29, 2017 at 10:36 PM, Amit Langote
 wrote:
>> I keep having the feeling that this is a big patch with a small patch
>> struggling to get out.  Is it really necessary to change
>> RelationGetPartitionDispatchInfo so much or could you just do a really
>> minimal surgery to remove the code that sets the stuff we don't need?
>> Like this:
>
> Sure, done in the attached updated patch.

On first glance, that looks pretty good.  I'll have a deeper look
tomorrow.  It strikes me that if PartitionTupleRoutingInfo is an
executor structure, we should probably (as a separate patch) relocate
FormPartitionKeyDatum and get_partition_for_tuple to someplace in
src/backend/executor and rename them accordingly; maybe it's time for
an execPartition.c?  catalog/partition.c is getting really big, so
it's not a bad thing if some of that stuff gets moved elsewhere.  A
lingering question in my mind, though, is whether it's really correct
to think of PartitionTupleRoutingInfo as executor-specific.  Maybe
we're going to need that for other things too?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] expanding inheritance in partition bound order

2017-08-29 Thread Amit Langote
On 2017/08/29 4:26, Robert Haas wrote:
> I think this patch could be further simplified by continuing to use
> the existing function signature for RelationGetPartitionDispatchInfo
> instead of changing it to return a List * rather than an array.  I
> don't see any benefit to such a change.  The current system is more
> efficient.

OK, restored the array way.

> I keep having the feeling that this is a big patch with a small patch
> struggling to get out.  Is it really necessary to change
> RelationGetPartitionDispatchInfo so much or could you just do a really
> minimal surgery to remove the code that sets the stuff we don't need?
> Like this:

Sure, done in the attached updated patch.

Thanks,
Amit
From 9dd8e6f6bd3636f8c125c71e6d1c65bf606a2a22 Mon Sep 17 00:00:00 2001
From: amit 
Date: Wed, 30 Aug 2017 10:02:05 +0900
Subject: [PATCH 1/2] Decouple RelationGetPartitionDispatchInfo() from executor

Currently, it and the structure it generates (viz. PartitionDispatch
objects) are too tightly coupled with the executor's tuple-routing code.
In particular, it's pretty undesirable that it makes it the caller's
responsibility to release certain resources, such as executor tuple
table slots, tuple-conversion maps, etc.  That makes it harder to use in
places other than where it's currently being used.

After this refactoring, ExecSetupPartitionTupleRouting() now needs to
do some of the work that was previously done in
RelationGetPartitionDispatchInfo().
---
 src/backend/catalog/partition.c| 53 +++-
 src/backend/commands/copy.c| 32 +++--
 src/backend/executor/execMain.c| 88 ++
 src/backend/executor/nodeModifyTable.c | 37 +++---
 src/include/catalog/partition.h| 20 +++-
 src/include/executor/executor.h|  4 +-
 src/include/nodes/execnodes.h  | 40 +++-
 7 files changed, 181 insertions(+), 93 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 96a64ce6b2..c92756ecd5 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1081,7 +1081,6 @@ RelationGetPartitionDispatchInfo(Relation rel,
         Relation    partrel = lfirst(lc1);
         Relation    parent = lfirst(lc2);
         PartitionKey partkey = RelationGetPartitionKey(partrel);
-        TupleDesc   tupdesc = RelationGetDescr(partrel);
         PartitionDesc partdesc = RelationGetPartitionDesc(partrel);
         int         j,
                     m;
@@ -1089,29 +1088,12 @@ RelationGetPartitionDispatchInfo(Relation rel,
         pd[i] = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
         pd[i]->reldesc = partrel;
         pd[i]->key = partkey;
-        pd[i]->keystate = NIL;
         pd[i]->partdesc = partdesc;
         if (parent != NULL)
-        {
-            /*
-             * For every partitioned table other than root, we must store a
-             * tuple table slot initialized with its tuple descriptor and a
-             * tuple conversion map to convert a tuple from its parent's
-             * rowtype to its own. That is to make sure that we are looking at
-             * the correct row using the correct tuple descriptor when
-             * computing its partition key for tuple routing.
-             */
-            pd[i]->tupslot = MakeSingleTupleTableSlot(tupdesc);
-            pd[i]->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-                                                   tupdesc,
-                                                   gettext_noop("could not convert row type"));
-        }
+            pd[i]->parentoid = RelationGetRelid(parent);
         else
-        {
-            /* Not required for the root partitioned table */
-            pd[i]->tupslot = NULL;
-            pd[i]->tupmap = NULL;
-        }
+            pd[i]->parentoid = InvalidOid;
+
         pd[i]->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
 
         /*
@@ -1860,7 +1842,7 @@ generate_partition_qual(Relation rel)
  * Construct values[] and isnull[] arrays for the partition key
  * of a tuple.
  *
- * pd          Partition dispatch object of the partitioned table
+ * ptrinfo     PartitionTupleRoutingInfo object of the table
 * slot        Heap tuple from which to extract partition key
 * estate      executor state for evaluating any partition key
 *

Re: [HACKERS] pgbench: Skipping the creating primary keys after initialization

2017-08-29 Thread Masahiko Sawada
On Tue, Aug 29, 2017 at 4:47 PM, Fabien COELHO  wrote:
>
> Hello,
>
> Patch applies and works.
>
>>> I would like to insist a little bit: the name was declared as an int but
>>> passed to init and used as a bool there before the patch. Conceptually
>>> what
>>> is meant is really a bool, and I see no added value bar saving a few
>>> strokes
>>> to have an int. ISTM that recent pgbench changes have started turning old
>>> int-for-bool habits into using bool when bool is meant.
>>
>>
>> Since is_no_vacuum is an existing one, if we follow the habit we should
>> change other similar variables as well: is_init_mode, do_vacuum_accounts and
>> debug. And I think we should change them in a separate patch.
>
>
> Hmmm. The existing "is_no_vacuum" variable is typed *both* as int (in
> "main") and as bool (in "init"), called by main (yuk!). I see no reason to
> choose the bad one for the global:-)
>

Yeah, I think this might be a good time to reconsider int-for-bool
habits in pgbench. If we decide to change is_no_vacuum to bool, I want
to change other similar variables as well.

>> After more thought, I'm a bit inclined not to have a short option for
>> --custom-initialize because this option will only be used in non-primary
>> cases. It would be better to save short options for future
>> enhancements of pgbench. Thought?
>
>
> I like it as is, especially as the associated value is now a simple and
> short string; I think that it makes sense to have a simple and short option
> to trigger it. Moreover, -I stands cleanly for "initialization", and the
> capital stands for something a little special, which it is. It's too good to
> miss.
>
> I think that "-I" should be added to the "--help" line, as is done
> with other short & long options.

Okay, I'll leave it as is for now. Maybe we can discuss it later.

>
> Repeating "-I f" results in multiple foreign key constraints:
>
>   Foreign-key constraints:
> "pgbench_tellers_bid_fkey" FOREIGN KEY (bid) REFERENCES
> pgbench_branches(bid)
> "pgbench_tellers_bid_fkey1" FOREIGN KEY (bid) REFERENCES
> pgbench_branches(bid)
> "pgbench_tellers_bid_fkey2" FOREIGN KEY (bid) REFERENCES
> pgbench_branches(bid)
> "pgbench_tellers_bid_fkey3" FOREIGN KEY (bid) REFERENCES
> pgbench_branches(bid)
>
> I wonder if this could be avoided easily? Maybe by setting the constraint
> name explicitly so that the second one fails on the existing one, which is
> fine, like for primary keys? Or adding a DROP CONSTRAINT IF EXISTS before
> the CREATE CONSTRAINT, like for tables? Or doing nothing about it? I would
> prefer the first option.

Good point, I agree with the first option.

>
> Maybe the initial cleanup (DROP TABLE) could be made an option added to the
> default, so that cleaning up the database could be achieved with some
> "pgbench -i -I c", instead of connecting and droping the tables one by one
> which I have done quite a few times... What do you think?

Yeah, I have sometimes wanted that. Having an option for cleaning up the
tables would be a good idea.

> Before it is definitely engraved, I'm thinking about the letters:
>
>  c - cleanup
>  t - create table
>  d - data
>  p - primary key
>  f - foreign key
>  v - vacuum
>
> I think it is mostly okay, but this is the last chance to think about it. Using
> "d" for cleanup (drop) would mean finding another letter for filling in
> data... maybe "g" for data generation? "c" may have been chosen for "create
> table", but then it would not be available for "cleanup". Thoughts?

I'd say "g" for data generation would be better. Also, I'm inclined to
add a command for the unlogged tables. How about this?

c - cleanup
t - create table
u - create unlogged table
g - data generation
p - primary key
f - foreign key
v - vacuum

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] A design for amcheck heapam verification

2017-08-29 Thread Thomas Munro
On Wed, Aug 30, 2017 at 1:00 PM, Peter Geoghegan  wrote:
> On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro
>  wrote:
>> Some drive-by comments on the lib patch:
>
> I was hoping that you'd look at this, since you'll probably want to
> use a bloom filter for parallel hash join at some point. I've tried to
> keep this one as simple as possible. I think that there is a good
> chance that it will be usable for parallel hash join with multiple
> batches. You'll need to update the interface a little bit to make that
> work (e.g., bring-your-own-hash interface), but hopefully the changes
> you'll need will be limited to a few small details.

Indeed.  Thank you for working on this!  To summarise a couple of
ideas that Peter and I discussed off-list a while back:  (1) While
building the hash table for a hash join we could build a Bloom filter
per future batch and keep it in memory, and then while reading from
the outer plan we could skip writing tuples out to future batches if
there is no chance they'll find a match when read back in later (works
only for inner joins and only pays off in inverse proportion to the
join's selectivity); (2) We could push a Bloom filter down to scans
(many other databases do this, and at least one person has tried this
with PostgreSQL and found it to pay off[1]).

To use this for anything involving parallelism where a Bloom filter
must be shared we'd probably finish up having to create a shared
version of bloom_init() that either uses caller-provided memory and
avoids the internal pointer, or allocates DSA memory.  I suppose you
could consider splitting your bloom_init() function up into
bloom_estimate() and bloom_init(user_supplied_space, ...) now, and
getting rid of the separate pointer to bitset (ie stick char
bitset[FLEXIBLE_ARRAY_MEMBER] at the end of the struct)?

>> +bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
>>
>> I think the plan is to use size_t for new stuff[1].
>
> I'd forgotten.
>
>> This is another my_log2(), right?
>>
>> It'd be nice to replace both with fls() or flsl(), though it's
>> annoying to have to think about long vs int64 etc.  We already use
>> fls() in two places and supply an implementation in src/port/fls.c for
>> platforms that lack it (Windows?), but not the long version.
>
> win64 longs are only 32-bits, so my_log2() would do the wrong thing
> for me on that platform. pow2_truncate() is provided with a number of
> bits as its argument, not a number of bytes (otherwise this would
> work).

Hmm.  Right.

> Ideally, we'd never use long integers, because its width is platform
> dependent, and yet it is only ever used as an alternative to int
> because it is wider than int. One example of where this causes
> trouble: logtape.c uses long ints, so external sorts can use half the
> temp space on win64.

Agreed, "long" is terrible.

>> +bloom_prop_bits_set(bloom_filter *filter)
>> +{
>> +intbitset_bytes = NBITS(filter) / BITS_PER_BYTE;
>> +int64bits_set = 0;
>> +inti;
>> +
>> +for (i = 0; i < bitset_bytes; i++)
>> +{
>> +unsigned char byte = filter->bitset[i];
>> +
>> +while (byte)
>> +{
>> +bits_set++;
>> +byte &= (byte - 1);
>> +}
>> +}
>
> I don't follow what you mean here.

I was just observing that there is an opportunity for code reuse here.
This function's definition would ideally be something like:

double
bloom_prop_bits_set(bloom_filter *filter)
{
return popcount(filter->bitset, NBYTES(filter)) / (double) NBITS(filter);
}

This isn't an objection to the way you have it, since we don't have a
popcount function yet!  We have several routines in the tree for
counting bits, though not yet the complete set from Hacker's Delight.
Your patch brings us one step closer to that goal.  (The book says
that this approach is good for sparse bitsets, but your comment says
that we expect something near 50%.  That's irrelevant anyway since a
future centralised popcount() implementation would do this in
word-sized chunks with a hardware instruction or branch-free-per-word
lookups in a table and not care at all about sparseness.)
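
Purely for illustration (this isn't something I'm proposing for your
patch), such a centralised popcount() could look roughly like this, using
the compiler builtin per word where available and the same
clear-lowest-set-bit loop as a portable fallback:

/*
 * Illustrative sketch of a centralised popcount(); not part of the patch.
 */
#include <stddef.h>
#include <stdint.h>

static inline int
popcount64(uint64_t word)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcountll(word);
#else
    int         count = 0;

    while (word)
    {
        count++;
        word &= word - 1;       /* clear lowest set bit */
    }
    return count;
#endif
}

static int64_t
popcount(const unsigned char *bytes, size_t nbytes)
{
    int64_t     result = 0;
    size_t      i;

    /* a real version would also process aligned word-sized chunks */
    for (i = 0; i < nbytes; i++)
        result += popcount64(bytes[i]);

    return result;
}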

+ * Test if bloom filter definitely lacks element.

I think where "Bloom filter" appears in prose it should have a capital
letter (person's name).

[1] 
http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)

2017-08-29 Thread Peter Eisentraut
On 8/29/17 20:36, Andres Freund wrote:
> So the question is whether we want {max,min}_wal_size be sized in
> multiples of segment sizes or as a proper byte size.  I'm leaning
> towards the latter.

I'm not sure what the question is or what its impact would be.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Update low-level backup documentation to match actual behavior

2017-08-29 Thread Michael Paquier
On Tue, Aug 29, 2017 at 10:59 PM, David Steele  wrote:
> Hi Robert,
>
> On 8/25/17 4:03 PM, David Steele wrote:
>> On 8/25/17 3:26 PM, Robert Haas wrote:
>>> On Fri, Aug 25, 2017 at 3:21 PM, David Steele 
>>> wrote:
 No problem.  I'll base it on your commit to capture any changes you
 made.
>>>
>>> Thanks, but you incorporated everything I wanted in response to my
>>> first review -- so I didn't tweak it any further.
>>
>> Thank you for committing that.  I'll get the 9.6 patch to you early next
>> week.
>
> Attached is the 9.6 patch.  It required a bit more work in func.sgml
> than I was expecting so have a close look at that.  The rest was mostly
> removing irrelevant hunks.

+ switch to the next WAL segment.  On a standby, it is not possible to
+ automatically switch WAL segments, so you may wish to run
+ pg_switch_wal on the primary to perform a manual
+ switch. The reason for the switch is to arrange for
[...]
+WAL segments have been archived. If write activity on the primary is low, it
+may be useful to run pg_switch_wal on the primary in order to
+trigger an immediate segment switch of the last required WAL
It seems to me that both portions are wrong. There is no wait for
archiving on standbys in 9.6, and pg_stop_backup triggers the segment
switch by itself, so the advice to run pg_switch_wal on the primary is
moot. Also, pg_switch_xlog was only renamed to pg_switch_wal in PG10, so
the former name is the one that applies here.
-- 
Michael




Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)

2017-08-29 Thread Andres Freund
On 2017-08-30 10:14:22 +0900, Michael Paquier wrote:
> On Wed, Aug 30, 2017 at 10:06 AM, Andres Freund  wrote:
> > On 2017-08-30 09:49:14 +0900, Michael Paquier wrote:
> >> Do you think that we should worry about wal segment sizes higher than
> >> 2GB? Support for int64 GUCs is not here yet.
> >
> > 1GB will be the limit anyway.
> 
> Yeah, but imagine that we'd want to raise that even further.

I'm doubtful of that. But even if we did, it'd not be hard to add GUC support.

- Andres




Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)

2017-08-29 Thread Michael Paquier
On Wed, Aug 30, 2017 at 10:06 AM, Andres Freund  wrote:
> On 2017-08-30 09:49:14 +0900, Michael Paquier wrote:
>> Do you think that we should worry about wal segment sizes higher than
>> 2GB? Support for int64 GUCs is not here yet.
>
> 1GB will be the limit anyway.

Yeah, but imagine that we'd want to raise that even further. By using
bytes, int64 GUCs become mandatory, or we'd have to go back to using MBs
for this parameter, bringing us back to the original implementation.
That's something worth thinking about.
-- 
Michael




Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)

2017-08-29 Thread Andres Freund
On 2017-08-30 09:49:14 +0900, Michael Paquier wrote:
> On Wed, Aug 30, 2017 at 9:36 AM, Andres Freund  wrote:
> > So the question is whether we want {max,min}_wal_size be sized in
> > multiples of segment sizes or as a proper byte size.  I'm leaning
> > towards the latter.
> 
> Logically in the code it is just a matter of adjusting multipliers.

No code difficulties here, I think we just need to decide what we want.


> Do you think that we should worry about wal segment sizes higher than
> 2GB? Support for int64 GUCs is not here yet.

1GB will be the limit anyway.


Greetings,

Andres Freund




Re: [HACKERS] A design for amcheck heapam verification

2017-08-29 Thread Peter Geoghegan
On Tue, Aug 29, 2017 at 4:34 PM, Thomas Munro
 wrote:
> Some drive-by comments on the lib patch:

I was hoping that you'd look at this, since you'll probably want to
use a bloom filter for parallel hash join at some point. I've tried to
keep this one as simple as possible. I think that there is a good
chance that it will be usable for parallel hash join with multiple
batches. You'll need to update the interface a little bit to make that
work (e.g., bring-your-own-hash interface), but hopefully the changes
you'll need will be limited to a few small details.

> +bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)
>
> I think the plan is to use size_t for new stuff[1].

I'd forgotten.

> This is another my_log2(), right?
>
> It'd be nice to replace both with fls() or flsl(), though it's
> annoying to have to think about long vs int64 etc.  We already use
> fls() in two places and supply an implementation in src/port/fls.c for
> platforms that lack it (Windows?), but not the long version.

win64 longs are only 32-bits, so my_log2() would do the wrong thing
for me on that platform. pow2_truncate() is provided with a number of
bits as its argument, not a number of bytes (otherwise this would
work).

Ideally, we'd never use long integers, because their width is platform
dependent, and yet they are only ever used as an alternative to int
because they are wider than int. One example of where this causes
trouble: logtape.c uses long ints, so external sorts can use half the
temp space on win64.

> +/*
> + * Hash function is taken from sdbm, a public-domain reimplementation of the
> + * ndbm database library.
> + */
> +static uint32
> +sdbmhash(unsigned char *elem, Size len)
> +{

> I see that this is used in gawk, BerkeleyDB and all over the place[2].
> Nice.  I understand that the point of this is to be a hash function
> that is different from our usual one, for use by k_hashes().

Right. Its only job is to be a credible hash function that isn't
derivative of hash_any().

> Do you
> think it belongs somewhere more common than this?  It seems a bit like
> our hash-related code is scattered all over the place but should be
> consolidated, but I suppose that's a separate project.

Unsure. In its defense, there is also a private murmurhash one-liner
within tidbitmap.c.  I don't mind changing this, but it's slightly odd
to expose a hash function whose only job is to be completely unrelated
to hash_any().

> Unnecessary braces here and elsewhere for single line body of for loops.
>
> +bloom_prop_bits_set(bloom_filter *filter)
> +{
> +intbitset_bytes = NBITS(filter) / BITS_PER_BYTE;
> +int64bits_set = 0;
> +inti;
> +
> +for (i = 0; i < bitset_bytes; i++)
> +{
> +unsigned char byte = filter->bitset[i];
> +
> +while (byte)
> +{
> +bits_set++;
> +byte &= (byte - 1);
> +}
> +}

I don't follow what you mean here.

-- 
Peter Geoghegan




Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)

2017-08-29 Thread Michael Paquier
On Wed, Aug 30, 2017 at 9:36 AM, Andres Freund  wrote:
> So the question is whether we want {max,min}_wal_size be sized in
> multiples of segment sizes or as a proper byte size.  I'm leaning
> towards the latter.

Logically in the code it is just a matter of adjusting multipliers. Do
you think that we should worry about wal segment sizes higher than
2GB? Support for int64 GUCs is not here yet.
-- 
Michael




Re: [HACKERS] [PATCH] Fix drop replication slot blocking instead of returning error

2017-08-29 Thread Craig Ringer
On 29 August 2017 at 22:02, Andres Freund  wrote:

> Hi,
>
> On 2017-08-29 13:42:05 +0200, Simone Gotti wrote:
> > On Tue, Aug 29, 2017 at 12:13 PM, Alvaro Herrera
> >  wrote:
> > >
> >
> > Hi Alvaro,
> >
> > > Simone Gotti wrote:
> > > > Hi all,
> > > >
> > > > I noticed that in postgres 10beta3, calling pg_drop_replication_slot on an
> > > > active slot will block until it's released instead of returning an error like
> > > > done in pg 9.6. Since this is a change in the previous behavior and the docs
> > > > weren't changed I made a patch to restore the previous behavior.
> > >
> > > Changing that behavior was the entire point of the cited commit.
> >
> > Sorry, I was thinking that the new behavior was needed for future
> > internal functions since the docs weren't changed.
>
> FWIW, I also don't think it's ok to just change the behaviour
> unconditionally and without a replacement for existing behaviour.


Seems like it just needs a new argument nowait DEFAULT false


 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)

2017-08-29 Thread Andres Freund
Hi,

Intentionally breaking the thread here; I want this one point to get a
somewhat wider audience.

The excerpt of the relevant discussion is:

On 2017-08-23 12:13:15 +0530, Beena Emerson wrote:
> >> + /* set default max_wal_size and min_wal_size */
> >> + snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
> >> +  pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
> >> + conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
> >> +
> >> + snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
> >> +  pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
> >> + conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
> >> +
> >
> > Hm. So postgresql.conf.sample values are now going to contain misleading
> > information for clusters with non-default segment sizes.
> >
> > Have we discussed instead defaulting min_wal_size/max_wal_size to a
> > constant amount of megabytes and rounding up when it doesn't work for
> > a particular segment size?
> 
> This was not discussed.
> 
> In the original code, the min_wal_size and max_wal_size defaults are computed
> in guc.c from whatever wal_segment_size was set at configure time.
> 
> {
> {"min_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
> gettext_noop("Sets the minimum size to shrink the WAL to."),
> NULL,
> GUC_UNIT_MB
> },
> &min_wal_size_mb,
> 5 * (XLOG_SEG_SIZE / (1024 * 1024)), 2, MAX_KILOBYTES,
> NULL, NULL, NULL
> },
> 
> {
> {"max_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
> gettext_noop("Sets the WAL size that triggers a checkpoint."),
> NULL,
> GUC_UNIT_MB
> },
> &max_wal_size_mb,
> 64 * (XLOG_SEG_SIZE / (1024 * 1024)), 2, MAX_KILOBYTES,
> NULL, assign_max_wal_size, NULL
> },
> 
> Hence I have retained the same calculation for min_wal_size and
> max_wal_size. If we get consensus for fixing a default and updating
> when required, then I will change the code accordingly.

So the question is whether we want {max,min}_wal_size to be sized in
multiples of the segment size or as a proper byte size.  I'm leaning
towards the latter.
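
To make the second option concrete: it'd basically mean giving the GUCs a
constant default in guc.c, independent of the segment size, instead of the
XLOG_SEG_SIZE-derived ones quoted above. Roughly (sketch only):

/* sketch: constant default instead of one derived from the segment size */
{
    {"min_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
        gettext_noop("Sets the minimum size to shrink the WAL to."),
        NULL,
        GUC_UNIT_MB
    },
    &min_wal_size_mb,
    80, 2, MAX_KILOBYTES,       /* fixed 80MB default, not 5 * XLOG_SEG_SIZE */
    NULL, NULL, NULL
},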

- Andres




Re: [HACKERS] A design for amcheck heapam verification

2017-08-29 Thread Michael Paquier
On Wed, Aug 30, 2017 at 8:34 AM, Thomas Munro
 wrote:
> It'd be nice to replace both with fls() or flsl(), though it's
> annoying to have to think about long vs int64 etc.  We already use
> fls() in two places and supply an implementation in src/port/fls.c for
> platforms that lack it (Windows?), but not the long version.

Yes, you can complain about MSVC compilation here.
-- 
Michael




Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken

2017-08-29 Thread Tomas Vondra


On 08/30/2017 02:19 AM, Andres Freund wrote:
> On 2017-08-30 02:11:10 +0200, Tomas Vondra wrote:
>>
>> I'm not really following your reasoning. You may very well be right that
>> the BeginInternalSubTransaction() example is not quite valid on postgres
>> core, but I don't see how that implies there can be no other errors in
>> the PG_TRY block. It was merely an explanation how I noticed this issue.
>>
>> To make it absolutely clear, I claim that the PG_CATCH() block is
>> demonstrably broken as it calls AbortCurrentTransaction() and then
>> accesses already freed memory.
> 
> Yea, but that's not what happens normally. The txn memory isn't
> allocated in the context created by the started [sub]transaction. I
> think you're just seeing what you're seeing because an error thrown
> before the BeginInternalSubTransaction() succeeds, will roll back the
> *containing* transaction (the one that did the SQL SELECT * FROM
> pg_*logical*() call), and free all the entire reorderbuffer stuff.
> 
> 
>> The switch in the PG_TRY() blocks includes multiple elog(ERROR) calls in
>> the switch, and AFAICS hitting any of them will have exactly the same
>> effect as failure in BeginInternalSubTransaction().
> 
> No, absolutely not. Once the subtransaction starts, the
> AbortCurrentTransaction() will abort that subtransaction, rather than
> the containing one.
> 

Ah, I see. You're right that elog(ERROR) calls after the subtransaction
starts are handled correctly and don't trigger a segfault. Sorry for the
noise, and thanks for the explanation.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken

2017-08-29 Thread Andres Freund
On 2017-08-30 02:11:10 +0200, Tomas Vondra wrote:
> 
> 
> On 08/30/2017 01:34 AM, Andres Freund wrote:
> > Hi,
> > 
> > On 2017-08-30 01:27:34 +0200, Tomas Vondra wrote:
> >> I've been investigating some failures in test_decoding regression tests,
> >> and it seems to me the error-handling in ReorderBufferCommit() is
> >> somewhat broken, leading to segfault crashes.
> >>
> >> The problematic part is this:
> >>
> >> PG_CATCH()
> >> {
> >> /*
> >>  * Force cache invalidation to happen outside of a valid transaction
> >>  * to prevent catalog access as we just caught an error.
> >>  */
> >> AbortCurrentTransaction();
> >>
> >> /* make sure there's no cache pollution */
> >> ReorderBufferExecuteInvalidations(rb, txn);
> >>
> >> ...
> >>
> >> }
> >>
> >> Triggering it trivial - just add elog(ERROR,...) at the beginning of the
> >> PG_TRY() block.
> > 
> > That's not really a valid thing to do, you should put it after the
> > BeginInternalSubTransaction()/StartTransactionCommand(). It's basically
> > assumed that those won't fail - arguably they should be outside of a
> > PG_TRY then, but that's a different matter.  If you start decoding
> > outside of SQL failing before those will lead to rolling back the parent
> > tx...
> > 
> > 
> >> I suppose this is not quite intentional, but rather an example that
> >> error-handling code is an order of magnitude more complicated to write
> >> and test. I've only noticed as I'm investigating some regression
> >> failures on Postgres-XL 10, which does not support subtransactions and
> >> so the BeginInternalSubTransaction() call in the try branch always
> >> fails, triggering the issue.
> > 
> > So, IIUC, there's no live problem in postgres core, besides some ugly &
> > undocumented assumptions?
> > 
> 
> I'm not really following your reasoning. You may very well be right that
> the BeginInternalSubTransaction() example is not quite valid on postgres
> core, but I don't see how that implies there can be no other errors in
> the PG_TRY block. It was merely an explanation how I noticed this issue.
> 
> To make it absolutely clear, I claim that the PG_CATCH() block is
> demonstrably broken as it calls AbortCurrentTransaction() and then
> accesses already freed memory.

Yea, but that's not what happens normally. The txn memory isn't
allocated in the context created by the started [sub]transaction. I
think you're just seeing what you're seeing because an error thrown
before BeginInternalSubTransaction() succeeds will roll back the
*containing* transaction (the one that did the SQL SELECT * FROM
pg_*logical*() call), and free the entire reorderbuffer stuff.


> The switch in the PG_TRY() blocks includes multiple elog(ERROR) calls in
> the switch, and AFAICS hitting any of them will have exactly the same
> effect as failure in BeginInternalSubTransaction().

No, absolutely not. Once the subtransaction starts, the
AbortCurrentTransaction() will abort that subtransaction, rather than
the containing one.

- Andres




Re: [HACKERS] Tuple-routing for certain partitioned tables not working as expected

2017-08-29 Thread Amit Langote
On 2017/08/29 20:18, Etsuro Fujita wrote:
> On 2017/08/25 22:26, Robert Haas wrote:
>> On Wed, Aug 23, 2017 at 4:55 AM, Etsuro Fujita
>>  wrote:
>>> Agreed, but I'd vote for fixing this in v10 as proposed; I agree that just
>>> ripping the CheckValidResultRel checks out entirely is not a good idea,
>>> but
>>> that seems OK to me at least as a fix just for v10.
>>
>> I'm still not on-board with having this be the one case where we don't
>> do CheckValidResultRel.  If we want to still call it but pass down
>> some additional information that can selectively skip certain checks,
>> I could probably live with that.
> 
> Another idea would be to not do CheckValidResultRel for partitions in
> ExecSetupPartitionTupleRouting; instead, do that the first time the
> partition is chosen by ExecFindPartition, and if successfully checked,
> initialize the partition's ResultRelInfo and other stuff.  (We could skip
> this after the first time by checking whether we already have a valid
> ResultRelInfo for the chosen partition.)  That could allow us to not only
> call CheckValidResultRel the way it is, but avoid initializing useless
> partitions for which tuples aren't routed at all.

I too have thought about the idea of lazy initialization of the partition
ResultRelInfos.  I think it would be a good idea, but definitely something
for PG 11.
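
To spell out the shape of it (a very rough sketch; the helper names and
call forms here are made up, not the actual executor API):

/*
 * Very rough sketch of lazy partition initialization; names are invented.
 */
leaf_index = ExecFindPartition(resultRelInfo, pd, slot, estate);

if (proute->partitions[leaf_index] == NULL)
{
    /* First tuple routed here: validate and initialize this partition now. */
    ResultRelInfo *leaf_rri = ExecInitPartitionResultRelInfo(proute,
                                                             leaf_index,
                                                             estate);

    CheckValidResultRel(leaf_rri, CMD_INSERT);  /* call form assumed */
    proute->partitions[leaf_index] = leaf_rri;
}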

Thanks,
Amit





Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken

2017-08-29 Thread Tomas Vondra


On 08/30/2017 01:34 AM, Andres Freund wrote:
> Hi,
> 
> On 2017-08-30 01:27:34 +0200, Tomas Vondra wrote:
>> I've been investigating some failures in test_decoding regression tests,
>> and it seems to me the error-handling in ReorderBufferCommit() is
>> somewhat broken, leading to segfault crashes.
>>
>> The problematic part is this:
>>
>> PG_CATCH()
>> {
>> /*
>>  * Force cache invalidation to happen outside of a valid transaction
>>  * to prevent catalog access as we just caught an error.
>>  */
>> AbortCurrentTransaction();
>>
>> /* make sure there's no cache pollution */
>> ReorderBufferExecuteInvalidations(rb, txn);
>>
>> ...
>>
>> }
>>
>> Triggering it trivial - just add elog(ERROR,...) at the beginning of the
>> PG_TRY() block.
> 
> That's not really a valid thing to do, you should put it after the
> BeginInternalSubTransaction()/StartTransactionCommand(). It's basically
> assumed that those won't fail - arguably they should be outside of a
> PG_TRY then, but that's a different matter.  If you start decoding
> outside of SQL failing before those will lead to rolling back the parent
> tx...
> 
> 
>> I suppose this is not quite intentional, but rather an example that
>> error-handling code is an order of magnitude more complicated to write
>> and test. I've only noticed as I'm investigating some regression
>> failures on Postgres-XL 10, which does not support subtransactions and
>> so the BeginInternalSubTransaction() call in the try branch always
>> fails, triggering the issue.
> 
> So, IIUC, there's no live problem in postgres core, besides some ugly &
> undocumented assumptions?
> 

I'm not really following your reasoning. You may very well be right that
the BeginInternalSubTransaction() example is not quite valid in postgres
core, but I don't see how that implies there can be no other errors in
the PG_TRY block. It was merely an explanation of how I noticed this issue.

To make it absolutely clear, I claim that the PG_CATCH() block is
demonstrably broken as it calls AbortCurrentTransaction() and then
accesses already freed memory.

The switch in the PG_TRY() block includes multiple elog(ERROR) calls,
and AFAICS hitting any of them will have exactly the same effect as a
failure in BeginInternalSubTransaction(). And I suppose there are
many other ways to trigger an error and get into the catch block. In
other words, why would we have the block at all?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] A design for amcheck heapam verification

2017-08-29 Thread Thomas Munro
On Wed, Aug 30, 2017 at 7:58 AM, Peter Geoghegan  wrote:
> On Thu, May 11, 2017 at 4:30 PM, Peter Geoghegan  wrote:
>> I spent only a few hours writing a rough prototype, and came up with
>> something that does an IndexBuildHeapScan() scan following the
>> existing index verification steps. Its amcheck callback does an
>> index_form_tuple() call, hashes the resulting IndexTuple (heap TID and
>> all), and tests it for membership of a bloom filter generated as part
>> of the main B-Tree verification phase. The IndexTuple memory is freed
>> immediately following hashing.
>
> I attach a cleaned-up version of this. It has extensive documentation.
> My bloom filter implementation is broken out as a separate patch,
> added as core infrastructure under "lib".

Some drive-by comments on the lib patch:

+bloom_add_element(bloom_filter *filter, unsigned char *elem, Size len)

I think the plan is to use size_t for new stuff[1].

+/*
+ * Which element of the sequence of powers-of-two is less than or equal to n?
+ *
+ * Used to size bitset, which in practice is never allowed to exceed 2 ^ 30
+ * bits (128MB).  This frees us from giving further consideration to int
+ * overflow.
+ */
+static int
+pow2_truncate(int64 target_bitset_size)
+{
+int v = 0;
+
+while (target_bitset_size > 0)
+{
+v++;
+target_bitset_size = target_bitset_size >> 1;
+}
+
+return Min(v - 1, 30);
+}

This is another my_log2(), right?

It'd be nice to replace both with fls() or flsl(), though it's
annoying to have to think about long vs int64 etc.  We already use
fls() in two places and supply an implementation in src/port/fls.c for
platforms that lack it (Windows?), but not the long version.
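
i.e. something along these lines (sketch only; the long vs int64
annoyance I just mentioned obviously applies to the cast here):

/* sketch only: pow2_truncate() expressed in terms of flsl() */
static int
pow2_truncate(int64 target_bitset_size)
{
    /* flsl() returns the 1-based position of the most significant set bit */
    return Min((int) flsl((long) target_bitset_size) - 1, 30);
}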

+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
+static uint32
+sdbmhash(unsigned char *elem, Size len)
+{
+uint32hash = 0;
+inti;
+
+for (i = 0; i < len; elem++, i++)
+{
+hash = (*elem) + (hash << 6) + (hash << 16) - hash;
+}
+
+return hash;
+}

I see that this is used in gawk, BerkeleyDB and all over the place[2].
Nice.  I understand that this point of this is to be a hash function
that is different from our usual one, for use by k_hashes().  Do you
think it belongs somewhere more common than this?  It seems a bit like
our hash-related code is scattered all over the place but should be
consolidated, but I suppose that's a separate project.

Unnecessary braces here and elsewhere for single line body of for loops.

+bloom_prop_bits_set(bloom_filter *filter)
+{
+intbitset_bytes = NBITS(filter) / BITS_PER_BYTE;
+int64bits_set = 0;
+inti;
+
+for (i = 0; i < bitset_bytes; i++)
+{
+unsigned char byte = filter->bitset[i];
+
+while (byte)
+{
+bits_set++;
+byte &= (byte - 1);
+}
+}

Sorry I didn't follow up with my threat[3] to provide a central
popcount() function to replace the implementations all over the tree.

[1] https://www.postgresql.org/message-id/25076.1489699457%40sss.pgh.pa.us
[2] http://www.cse.yorku.ca/~oz/hash.html
[3] 
https://www.postgresql.org/message-id/CAEepm%3D3g1_fjJGp38QGv%2B38BC2HHVkzUq6s69nk3mWLgPHqC3A%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken

2017-08-29 Thread Andres Freund
Hi,

On 2017-08-30 01:27:34 +0200, Tomas Vondra wrote:
> I've been investigating some failures in test_decoding regression tests,
> and it seems to me the error-handling in ReorderBufferCommit() is
> somewhat broken, leading to segfault crashes.
> 
> The problematic part is this:
> 
> PG_CATCH()
> {
> /*
>  * Force cache invalidation to happen outside of a valid transaction
>  * to prevent catalog access as we just caught an error.
>  */
> AbortCurrentTransaction();
> 
> /* make sure there's no cache pollution */
> ReorderBufferExecuteInvalidations(rb, txn);
> 
> ...
> 
> }
> 
> Triggering it trivial - just add elog(ERROR,...) at the beginning of the
> PG_TRY() block.

That's not really a valid thing to do; you should put it after the
BeginInternalSubTransaction()/StartTransactionCommand(). It's basically
assumed that those won't fail - arguably they should be outside of the
PG_TRY then, but that's a different matter.  If you start decoding
outside of SQL, failing before those will lead to rolling back the parent
tx...


> I suppose this is not quite intentional, but rather an example that
> error-handling code is an order of magnitude more complicated to write
> and test. I've only noticed as I'm investigating some regression
> failures on Postgres-XL 10, which does not support subtransactions and
> so the BeginInternalSubTransaction() call in the try branch always
> fails, triggering the issue.

So, IIUC, there's no live problem in postgres core, besides some ugly &
undocumented assumptions?

Greetings,

Andres Freund




[HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken

2017-08-29 Thread Tomas Vondra
Hi,

I've been investigating some failures in test_decoding regression tests,
and it seems to me the error-handling in ReorderBufferCommit() is
somewhat broken, leading to segfault crashes.

The problematic part is this:

PG_CATCH()
{
/*
 * Force cache invalidation to happen outside of a valid transaction
 * to prevent catalog access as we just caught an error.
 */
AbortCurrentTransaction();

/* make sure there's no cache pollution */
ReorderBufferExecuteInvalidations(rb, txn);

...

}

Triggering it trivial - just add elog(ERROR,...) at the beginning of the
PG_TRY() block.

The problem is that AbortCurrentTransaction() apparently releases the
memory where txn is allocated, making it entirely bogus. So in assert
builds txn->invalidations is 0x7f7f7f7f7f7f7f7f, triggering a segfault.

Similar issues apply to subsequent calls in the catch block, that also
use txn in some way (e.g. through snapshot_now).

I suppose this is not quite intentional, but rather an example of how
error-handling code is an order of magnitude more complicated to write
and test. I only noticed it while investigating some regression
failures on Postgres-XL 10, which does not support subtransactions and
so the BeginInternalSubTransaction() call in the try branch always
fails, triggering the issue.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] increasing the default WAL segment size

2017-08-29 Thread Andres Freund

Hi,

On 2017-08-23 12:13:15 +0530, Beena Emerson wrote:
> >> + /*
> >> +  * The calculation of XLOGbuffers requires the run-time 
> >> parameter
> >> +  * XLogSegSize which is set from the control file. This 
> >> value is
> >> +  * required to create the shared memory segment. Hence, 
> >> temporarily
> >> +  * allocate space for reading the control file.
> >> +  */
> >
> > This makes me uncomfortable.  Having to choose the control file multiple
> > times seems wrong.  We're effectively treating the control file as part
> > of the configuration now, and that means we should move it's parsing to
> > an earlier part of startup.
> 
> Yes, this may seem ugly. ControlFile was originally read into the
> shared memory segment, but now we need the XLogSegSize from the
> ControlFile to initialise the shared memory segment. I could not
> figure out any other way to achieve this.

I think reading it once into local memory inside the startup process and
then copying it into shared memory from there should work?
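
Roughly like this, hand-waving over how the control file read ends up
being factored (the function name below is just a placeholder):

/*
 * Hand-wavy sketch, not a patch: read the control file into startup-local
 * memory first, so XLogSegSize is known before shared memory is sized, and
 * only then copy it into the shared-memory copy.
 */
static ControlFileData local_control;

ReadControlFileInto(&local_control);        /* placeholder name */
XLogSegSize = local_control.xlog_seg_size;

/* ... shared memory is sized and created using XLogSegSize here ... */

memcpy(ControlFile, &local_control, sizeof(ControlFileData));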


> >> @@ -8146,6 +8181,9 @@ InitXLOGAccess(void)
> >>   ThisTimeLineID = XLogCtl->ThisTimeLineID;
> >>   Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode());
> >>
> >> + /* set XLogSegSize */
> >> + XLogSegSize = ControlFile->xlog_seg_size;
> >> +
> >
> > Hm, why do we have two variables keeping track of the segment size?
> > wal_segment_size and XLogSegSize? That's bound to lead to confusion.
> >
> 
> wal_segment_size is the GUC which stores the segment size as a number
> of XLOG blocks (XLogSegSize / XLOG_BLCKSZ).

wal_segment_size and XLogSegSize are the same name spelled differently, so
if that's where we want to go, we should name them differently. But
perhaps more fundamentally, I don't see why we need both: What stops us
from just defining the GUC in bytes?

Regards,

Andres




[HACKERS] Improve history file reasons when doing promotion

2017-08-29 Thread Michael Paquier
Hi all,

When triggering a promotion, the history file generated for the
timeline bump records a reason for the bump that is not really
helpful:
$ cat 0002.history
1   0/3000858   no recovery target specified
I was wondering if we could improve that a bit for a promotion. One
thing that immediately comes to mind would be to check for
IsPromoteTriggered() and replace this reason with "standby promotion".
Any thoughts?
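
Roughly what I have in mind, in the place where the reason string for the
new history file gets built (untested sketch; I haven't double-checked
whether the check is spelled exactly IsPromoteTriggered() as written):

/*
 * Untested sketch: check for a promotion before falling back to the
 * generic message, reusing the existing snprintf()-into-reason pattern.
 */
if (IsPromoteTriggered())
    snprintf(reason, sizeof(reason), "standby promotion");
else
    snprintf(reason, sizeof(reason), "no recovery target specified");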
-- 
Michael




Re: [HACKERS] Make pg_regress print a connstring with sockdir

2017-08-29 Thread Robert Haas
On Mon, Aug 28, 2017 at 7:57 AM, Craig Ringer  wrote:
> I'm frequently debugging postmasters that are around long enough. Deadlocks,
> etc.
>
> It's also way easier to debug shmem related issues with a live postmaster vs
> a core.

Yeah.  I don't *frequently* debug postmasters that hang during the
regression tests, but I definitely have done it, and I think something
like this would make it easier.  Right now if something wedges and you
need to connect to the postmaster to see what's going on, you have to
grep for the pid, then lsof to get the socket directory.  This would
simplify things.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] postgres_fdw bug in 9.6

2017-08-29 Thread Tom Lane
Etsuro Fujita  writes:
> [ epqpath-for-foreignjoin-11.patch ]

I started looking at this.  I've not yet wrapped my head around what
CreateLocalJoinPath() is doing, but I noted that Robert's concerns
about ABI breakage in the back branches would not be very difficult
to satisfy.  What we'd need to do is

(1) In the back-branch patch, add the new fields of JoinPathExtraData
at the end, rather than in their more natural locations.  This should
avoid any ABI break for uses of that struct, since there's no reason
why an FDW would be creating its own variables of that type rather
than just using what the core code passes it.

(2) In the back branches, just leave GetExistingLocalJoinPath in place
rather than deleting it.  Existing FDWs would still be subject to the
bug until they were updated, but given how hard it is to trigger, that
doesn't seem like a huge problem.

A variant of (2) is to install a hack fix in GetExistingLocalJoinPath
rather than leaving it unchanged.  I'm not very excited about that though;
it doesn't seem unlikely that we might introduce new bugs that way, and
it would certainly be a distraction from getting the real fix finished.

We'd definitely need to do things that way in 9.6.  I'm not quite sure
whether it's too late to adopt the clean solution in v10.

regards, tom lane




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2017-08-29 Thread Robert Haas
On Tue, Jul 4, 2017 at 12:33 AM, Amit Kapila  wrote:
> I have updated the patch to support wait events and moved it to upcoming CF.

This patch doesn't apply any more, but I made it apply with a hammer
and then did a little benchmarking (scylla, EDB server, Intel Xeon
E5-2695 v3 @ 2.30GHz, 2 sockets, 14 cores/socket, 2 threads/core).
The results were not impressive.  There's basically no clog contention
to remove, so the patch just doesn't really do anything.  For example,
here's a wait event profile with master and using Ashutosh's test
script with 5 savepoints:

  1  Lock| tuple
  2  IO  | SLRUSync
  5  LWLock  | wal_insert
  5  LWLock  | XidGenLock
  9  IO  | DataFileRead
 12  LWLock  | lock_manager
 16  IO  | SLRURead
 20  LWLock  | CLogControlLock
 97  LWLock  | buffer_content
216  Lock| transactionid
237  LWLock  | ProcArrayLock
   1238  IPC | ProcArrayGroupUpdate
   2266  Client  | ClientRead

This is just a 5-minute test; maybe things would change if we ran it
for longer, but if only 0.5% of the samples are blocked on
CLogControlLock without the patch, obviously the patch can't help
much.  I did some other experiments too, but I won't bother
summarizing the results here because they're basically boring.  I
guess I should have used a bigger machine.

Given that we've changed the approach here somewhat, I think we need
to validate that we're still seeing a substantial reduction in
CLogControlLock contention on big machines.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-08-29 Thread Robert Haas
On Tue, Aug 29, 2017 at 1:08 AM, Dilip Kumar  wrote:
> (Time in ms)
> Query    head      patch
>
> 6   2381914571
> 14 1351411183
> 15 49980 32400
> 20204441   188978

These are cool results, but this patch is obviously not ready for
prime time as-is, since there are various other references that will
need to be updated:

 * Since we are called as soon as nentries exceeds maxentries, we should
 * push nentries down to significantly less than maxentries, or else we'll
 * just end up doing this again very soon.  We shoot for maxentries/2.

/*
 * With a big bitmap and small work_mem, it's possible that we cannot get
 * under maxentries.  Again, if that happens, we'd end up uselessly
 * calling tbm_lossify over and over.  To prevent this from becoming a
 * performance sink, force maxentries up to at least double the current
 * number of entries.  (In essence, we're admitting inability to fit
 * within work_mem when we do this.)  Note that this test will not fire if
 * we broke out of the loop early; and if we didn't, the current number of
 * entries is simply not reducible any further.
 */
if (tbm->nentries > tbm->maxentries / 2)
tbm->maxentries = Min(tbm->nentries, (INT_MAX - 1) / 2) * 2;

I suggest defining a TBM_FILLFACTOR constant instead of repeating the
value 0.9 in a bunch of places.  I think it would also be good to try
to find the sweet spot for that constant.  Making it bigger reduces
the number of lossy entries  created, but making it smaller reduces
the number of times we have to walk the bitmap.  So if, for example,
0.75 is sufficient to produce almost all of the gain, then I think we
would want to prefer 0.75 to 0.9.  But if 0.9 is better, then we can
stick with that.

Note that a value higher than 0.9375 wouldn't be sane without some
additional safety precautions because maxentries could be as low as
16.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improving overflow checks when adding tuple to PGresult Re: [GENERAL] Retrieving query results

2017-08-29 Thread Michael Paquier
On Wed, Aug 30, 2017 at 4:24 AM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Mon, Aug 28, 2017 at 3:05 PM, Michael Paquier
>>  wrote:
>>> Attached are two patches:
>>> 1) 0001 refactors the code around pqAddTuple to be able to handle
>>> error messages and assign them in PQsetvalue particularly.
>>> 2) 0002 adds sanity checks in pqAddTuple for overflows, maximizing the
>>> size of what is allocated to INT_MAX but not more.
>
> I've pushed these (as one commit) with some adjustments.

Thanks!

> Mainly,
> I changed PQsetvalue to report failure messages with PQinternalNotice,
> which is what already happens inside check_field_number() for the case
> of an out-of-range field number.  It's possible that storing the
> message into the PGresult in addition would be worth doing, but I'm
> unconvinced about that --- we certainly haven't had any field requests
> for it.

OK, fine for me.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A design for amcheck heapam verification

2017-08-29 Thread Peter Geoghegan
On Thu, May 11, 2017 at 4:30 PM, Peter Geoghegan  wrote:
> I spent only a few hours writing a rough prototype, and came up with
> something that does an IndexBuildHeapScan() scan following the
> existing index verification steps. Its amcheck callback does an
> index_form_tuple() call, hashes the resulting IndexTuple (heap TID and
> all), and tests it for membership of a bloom filter generated as part
> of the main B-Tree verification phase. The IndexTuple memory is freed
> immediately following hashing.

I attach a cleaned-up version of this. It has extensive documentation.
My bloom filter implementation is broken out as a separate patch,
added as core infrastructure under "lib".

I do have some outstanding concerns about V1 of the patch series:

* I'm still uncertain about the question of using IndexBuildHeapScan()
during Hot Standby. It seems safe, since we're only using the
CONCURRENTLY/AccessShareLock path when this happens, but someone might
find that objectionable on general principle. For now, in this first
version, it remains possible to call IndexBuildHeapScan() during Hot
Standby, to allow the new verification to work there.

* The bloom filter has been experimentally verified, and is based on
source material which is directly cited. It would nevertheless be
useful to have the hashing stuff scrutinized, because it's possible
that I've overlooked some subtlety.

This is only the beginning for heapam verification. Comprehensive
coverage can be added later, within routines that specifically target
some table, not some index.

While this patch series only adds index-to-heap verification for
B-Tree indexes, I can imagine someone adopting the same technique to
verifying that other access methods are consistent with their heap
relation. For example, it would be easy to do this with hash indexes.
Any other index access method where the same high-level principle that
I rely on applies can do index-to-heap verification with just a few
tweaks. I'm referring to the high-level principle that comments
specifically point out in the patch: that REINDEX leaves you with an
index structure that has exactly the same entries as the old index
structure had, though possibly with fewer dead index tuples. I like my
idea of reusing IndexBuildHeapScan() for verification. Very few new
LOCs are actually added to amcheck by this patch, and
IndexBuildHeapScan() is itself tested.
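
For reference, this is how the new option would be exercised from SQL, going
by the function signatures in the attached patch (the index name is just a
placeholder):

CREATE EXTENSION IF NOT EXISTS amcheck;
-- structural checks only, as before:
SELECT bt_index_check('some_btree_index');
-- new with this patch: also verify that every heap tuple that ought to have
-- an index entry actually has one:
SELECT bt_index_check('some_btree_index', heapallindexed => true);
SELECT bt_index_parent_check('some_btree_index', heapallindexed => true);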

-- 
Peter Geoghegan
From 48499cfb58b7bf705e93fb12cc5359ec12cd9c51 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.

Add a new, optional capability to bt_index_check() and
bt_index_parent_check():  callers can check that each heap tuple that
ought to have an index entry does in fact have one.  This happens at the
end of the existing verification checks.

This is implemented by using a bloom filter data structure.  The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX).  The
bloom filter is populated during the initial index verification scan.
---
 contrib/amcheck/Makefile |   2 +-
 contrib/amcheck/amcheck--1.0--1.1.sql|  28 
 contrib/amcheck/amcheck.control  |   2 +-
 contrib/amcheck/expected/check_btree.out |  14 +-
 contrib/amcheck/sql/check_btree.sql  |   9 +-
 contrib/amcheck/verify_nbtree.c  | 236 ---
 doc/src/sgml/amcheck.sgml| 103 +++---
 7 files changed, 345 insertions(+), 49 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 

Re: [HACKERS] Improving overflow checks when adding tuple to PGresult Re: [GENERAL] Retrieving query results

2017-08-29 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Aug 28, 2017 at 3:05 PM, Michael Paquier
>  wrote:
>> Attached are two patches:
>> 1) 0001 refactors the code around pqAddTuple to be able to handle
>> error messages and assign them in PQsetvalue particularly.
>> 2) 0002 adds sanity checks in pqAddTuple for overflows, maximizing the
>> size of what is allocated to INT_MAX but not more.

I've pushed these (as one commit) with some adjustments.  Mainly,
I changed PQsetvalue to report failure messages with PQinternalNotice,
which is what already happens inside check_field_number() for the case
of an out-of-range field number.  It's possible that storing the
message into the PGresult in addition would be worth doing, but I'm
unconvinced about that --- we certainly haven't had any field requests
for it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Functions

2017-08-29 Thread Robert Haas
On Tue, Aug 22, 2017 at 8:14 AM, amul sul  wrote:
> Attaching patch 0002 for the reviewer's testing.

I think that this 0002 is not something we can think of committing
because there's no guarantee that hash functions will return the same
results on all platforms.  However, what we could and, I think, should
do is hash some representative values of each data type and verify
that hash(x) and hashextended(x, 0) come out the same at least as to
the low-order 32 bits -- and maybe that hashextend(x, 1) comes out
different as to the low-order 32 bits.  The easiest way to do this
seems to be to cast to bit(32).  For example:

SELECT v, hashint4(v)::bit(32) as standard,
hashint4extended(v, 0)::bit(32) as extended0,
hashint4extended(v, 1)::bit(32) as extended1
FROM (VALUES (0), (1), (17), (42), (550273), (207112489)) x(v)
WHERE hashint4(v)::bit(32) != hashint4extended(v, 0)::bit(32)
   OR hashint4(v)::bit(32) = hashint4extended(v, 1)::bit(32);

I suggest adding a version of this for each data type.
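
As an illustration of what that might look like for another type, here is a
sketch for text, assuming the extended variant in the patch is named
hashtextextended, analogous to hashint4extended:

SELECT v, hashtext(v)::bit(32) as standard,
hashtextextended(v, 0)::bit(32) as extended0,
hashtextextended(v, 1)::bit(32) as extended1
FROM (VALUES (''), ('a'), ('abc'), ('foobar'), ('postgresql')) x(v)
WHERE hashtext(v)::bit(32) != hashtextextended(v, 0)::bit(32)
   OR hashtext(v)::bit(32) = hashtextextended(v, 1)::bit(32);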

From your latest version of 0001, I get:

rangetypes.c:1297:8: error: unused variable 'rngtypid'
  [-Werror,-Wunused-variable]
Oid rngtypid = RangeTypeGetOid(r);

I suggest not duplicating the comments from each regular function into
the extended function, but just writing /* Same approach as hashfloat8
*/ when the implementation is non-trivial (no need for this if the
extended function is a single line or the original function had no
comments anyway).

hash_aclitem seems embarrassingly stupid.  I suggest that we make the
extended version slightly less stupid -- i.e. if the seed is non-zero,
actually call hash_any_extended on the sum and pass the seed through.

  * Reset info about hash function whenever we pick up new info about
  * equality operator.  This is so we can ensure that the hash function
  * matches the operator.
  */
 typentry->flags &= ~(TCFLAGS_CHECKED_HASH_PROC);
+typentry->flags &= ~(TCFLAGS_CHECKED_HASH_EXTENDED_PROC);

Adjust comment: "about hash function" -> "about hash functions", "hash
functions matches" -> "hash functions match".

+extern Datum
+hash_any_extended(register const unsigned char *k, register int
+  keylen, uint64 seed);

Formatting.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] [postgresql 10 beta3] unrecognized node type: 90

2017-08-29 Thread Tom Lane
Amit Kapila  writes:
> On Tue, Aug 29, 2017 at 8:32 AM, Robert Haas  wrote:
>> There's already ExecParallelReinitialize, which could be made to walk
>> the nodes in addition to what it does already, but I don't understand
>> exactly what else needs fixing.

> Sure, but it is not advisable to reset state of all the nodes below
> gather at that place, otherwise, it will be more or less like we are
> forcing rescan of each node.  I think there we can reset the shared
> parallel state of parallel-aware nodes (or anything related) and then
> allow rescan to reset the master backend specific state for all nodes
> beneath gather.

Right, the idea is to make this happen separately from the "rescan"
logic.  In general, it's a good idea for ExecReScanFoo to do as little
as possible, so that you don't pay if a node is rescanned more than
once before it's asked to do anything, or indeed if no rows are ever
demanded from it at all.

Attached is a WIP patch along this line.  It's unfinished because
I've not done the tedium of extending the FDW and CustomScan APIs
to support this new type of per-node operation; but that part seems
straightforward enough.  The core code is complete and survived
light testing.  I'm pretty happy with the results --- note in
particular how we get rid of some very dubious coding in
ExecReScanIndexScan and ExecReScanIndexOnlyScan.

If you try the test case from a2b70c89c on this patch alone, you'll notice
that instead of sometimes reporting too-small counts during the rescans,
it pretty consistently reports too-large counts.  This is because we are
now successfully resetting the shared state for the parallel seqscan, but
we haven't done anything about the leader's HashAgg node deciding that it
can re-use its old hashtable.  So on the first scan, the leader typically
scans all or most of the table because of its startup time advantage, and
saves those counts in its hashtable.  On later scans, the workers read all
of the table while the leader decides it need do no scanning.  So we get
counts that reflect all of the table (from the workers) plus whatever part
of the table the leader read the first time.  So this by no means removes
the need for my other patch.

If no objections, I'll do the additional legwork and push.  As before,
I think we can probably get away without fixing 9.6, even though it's
nominally got the same bug.

regards, tom lane

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ff03c68..e29c5ad 100644
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
*** heap_rescan(HeapScanDesc scan,
*** 1531,1551 
  	 * reinitialize scan descriptor
  	 */
  	initscan(scan, key, true);
- 
- 	/*
- 	 * reset parallel scan, if present
- 	 */
- 	if (scan->rs_parallel != NULL)
- 	{
- 		ParallelHeapScanDesc parallel_scan;
- 
- 		/*
- 		 * Caller is responsible for making sure that all workers have
- 		 * finished the scan before calling this.
- 		 */
- 		parallel_scan = scan->rs_parallel;
- 		pg_atomic_write_u64(&parallel_scan->phs_nallocated, 0);
- 	}
  }
  
  /* 
--- 1531,1536 
*** heap_parallelscan_initialize(ParallelHea
*** 1643,1648 
--- 1628,1646 
  }
  
  /* 
+  *		heap_parallelscan_reinitialize - reset a parallel scan
+  *
+  *		Call this in the leader process.  Caller is responsible for
+  *		making sure that all workers have finished the scan beforehand.
+  * 
+  */
+ void
+ heap_parallelscan_reinitialize(ParallelHeapScanDesc parallel_scan)
+ {
+ 	pg_atomic_write_u64(&parallel_scan->phs_nallocated, 0);
+ }
+ 
+ /* 
   *		heap_beginscan_parallel - join a parallel scan
   *
   *		Caller must hold a suitable lock on the correct relation.
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ce47f1d..d8cdb0e 100644
*** a/src/backend/executor/execParallel.c
--- b/src/backend/executor/execParallel.c
*** static bool ExecParallelInitializeDSM(Pl
*** 109,114 
--- 109,116 
  		  ExecParallelInitializeDSMContext *d);
  static shm_mq_handle **ExecParallelSetupTupleQueues(ParallelContext *pcxt,
  			 bool reinitialize);
+ static bool ExecParallelReInitializeDSM(PlanState *planstate,
+ 			ParallelContext *pcxt);
  static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
  	SharedExecutorInstrumentation *instrumentation);
  
*** ExecParallelSetupTupleQueues(ParallelCon
*** 365,382 
  }
  
  /*
-  * Re-initialize the parallel executor info such that it can be reused by
-  * workers.
-  */
- void
- ExecParallelReinitialize(ParallelExecutorInfo *pei)
- {
- 	ReinitializeParallelDSM(pei->pcxt);
- 	pei->tqueue = ExecParallelSetupTupleQueues(pei->pcxt, true);
- 	pei->finished = false;
- }
- 
- /*
   * Sets up the required infrastructure for backend workers to perform
   * execution and 

[HACKERS] OpenFile() Permissions Refactor

2017-08-29 Thread David Steele
Hackers,

While working on the patch to allow group reads of $PGDATA I refactored
the various OpenFile() functions to use default/global permissions
rather than permissions defined in each call.

I think the patch stands on its own regardless of whether we accept the
patch to allow group permissions (which won't make this CF).  We had a
couple different ways of defining permissions (e.g. 0600 vs S_IRUSR |
S_IWUSR) and in any case it represented quite a bit of duplication.
This way seems simpler and less error-prone.

I have intentionally not touched the open/fopen/mkdir calls in these
files since they are specialized and require per-instance consideration
(if they are changed at all):

backend/postmaster/fork_process.c
backend/postmaster/postmaster.c
backend/utils/error/elog.c
backend/utils/init/miscinit.c

All tests pass but it's possible that I've missed something or changed
something that shouldn't be changed.

I'll submit the patch to the 2017-09 CF.

Thanks,
-- 
-David
da...@pgmasters.net
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c 
b/contrib/pg_stat_statements/pg_stat_statements.c
index fa409d72b7..3ab1fd2db4 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1869,8 +1869,7 @@ qtext_store(const char *query, int query_len,
*query_offset = off;
 
/* Now write the data into the successfully-reserved part of the file */
-   fd = OpenTransientFile(PGSS_TEXT_FILE, O_RDWR | O_CREAT | PG_BINARY,
-  S_IRUSR | S_IWUSR);
+   fd = OpenTransientFile(PGSS_TEXT_FILE, O_RDWR | O_CREAT | PG_BINARY);
if (fd < 0)
goto error;
 
@@ -1934,7 +1933,7 @@ qtext_load_file(Size *buffer_size)
int fd;
struct stat stat;
 
-   fd = OpenTransientFile(PGSS_TEXT_FILE, O_RDONLY | PG_BINARY, 0);
+   fd = OpenTransientFile(PGSS_TEXT_FILE, O_RDONLY | PG_BINARY);
if (fd < 0)
{
if (errno != ENOENT)
diff --git a/src/backend/access/heap/rewriteheap.c 
b/src/backend/access/heap/rewriteheap.c
index bd560e47e1..f93c194e18 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -1013,8 +1013,7 @@ logical_rewrite_log_mapping(RewriteState state, 
TransactionId xid,
src->off = 0;
memcpy(src->path, path, sizeof(path));
src->vfd = PathNameOpenFile(path,
-   O_CREAT 
| O_EXCL | O_WRONLY | PG_BINARY,
-   S_IRUSR 
| S_IWUSR);
+   O_CREAT 
| O_EXCL | O_WRONLY | PG_BINARY);
if (src->vfd < 0)
ereport(ERROR,
(errcode_for_file_access(),
@@ -1133,8 +1132,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
 xlrec->mapped_xid, XLogRecGetXid(r));
 
fd = OpenTransientFile(path,
-  O_CREAT | O_WRONLY | 
PG_BINARY,
-  S_IRUSR | S_IWUSR);
+  O_CREAT | O_WRONLY | 
PG_BINARY);
if (fd < 0)
ereport(ERROR,
(errcode_for_file_access(),
@@ -1258,7 +1256,7 @@ CheckPointLogicalRewriteHeap(void)
}
else
{
-   int fd = OpenTransientFile(path, 
O_RDONLY | PG_BINARY, 0);
+   int fd = OpenTransientFile(path, 
O_RDONLY | PG_BINARY);
 
/*
 * The file cannot vanish due to concurrency since this 
function
diff --git a/src/backend/access/transam/slru.c 
b/src/backend/access/transam/slru.c
index 77edc51e1c..9dd77190ec 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -599,7 +599,7 @@ SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno)
 
SlruFileName(ctl, path, segno);
 
-   fd = OpenTransientFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
+   fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
if (fd < 0)
{
/* expected: file doesn't exist */
@@ -654,7 +654,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 * SlruPhysicalWritePage).  Hence, if we are InRecovery, allow the case
 * where the file doesn't exist, and return zeroes instead.
 */
-   fd = OpenTransientFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
+   fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
if (fd < 0)
{
if (errno != ENOENT || !InRecovery)
@@ -804,8 +804,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, 
SlruFlush 

Re: [HACKERS] Explicit relation name in VACUUM VERBOSE log

2017-08-29 Thread Robert Haas
On Tue, Aug 22, 2017 at 2:23 AM, Simon Riggs  wrote:
> Yes, we can. I'm not sure why you would do this only for VACUUM
> though? I see many messages in various places that need same treatment

I'm skeptical about the idea of doing this too generally.

rhaas=> select * from foo;
ERROR:  permission denied for relation foo

Do we really want to start saying public.foo in all such error
messages?  To me, that's occasionally helpful but more often just
useless chatter.

One problem with making this behavior optional is that we'd then need
two separate translatable strings in every case, one saying "table %s
has problems" and the other saying "table %s.%s has problems".  Maybe
we could avoid that via some clever trick, but that's how we're doing
it in some existing cases.

I have a feeling we're making a small patch with a narrow goal into a
giant patch for which everyone has a different goal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Fix drop replication slot blocking instead of returning error

2017-08-29 Thread Andres Freund
Hi,

On 2017-08-29 13:42:05 +0200, Simone Gotti wrote:
> On Tue, Aug 29, 2017 at 12:13 PM, Alvaro Herrera
>  wrote:
> >
> 
> Hi Alvaro,
> 
> > Simone Gotti wrote:
> > > Hi all,
> > >
> > > I noticed that in postgres 10beta3, calling pg_drop_replication_slot on an
> > > active slot will block until it's released instead of returning an error
> > > like
> > > done in pg 9.6. Since this is a change in the previous behavior and the 
> > > docs
> > > wasn't changed I made a patch to restore the previous behavior.
> >
> > Changing that behavior was the entire point of the cited commit.
> 
> Sorry, I was thinking that the new behavior was needed for internal
> future functions since the doc wasn't changed.

FWIW, I also don't think it's ok to just change the behaviour
unconditionally and without a replacement for existing behaviour.


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Update low-level backup documentation to match actual behavior

2017-08-29 Thread David Steele
Hi Robert,

On 8/25/17 4:03 PM, David Steele wrote:
> On 8/25/17 3:26 PM, Robert Haas wrote:
>> On Fri, Aug 25, 2017 at 3:21 PM, David Steele 
>> wrote:
>>> No problem.  I'll base it on your commit to capture any changes you
>>> made.
>>
>> Thanks, but you incorporated everything I wanted in response to my
>> first review -- so I didn't tweak it any further.
> 
> Thank you for committing that.  I'll get the 9.6 patch to you early next
> week.

Attached is the 9.6 patch.  It required a bit more work in func.sgml
than I was expecting so have a close look at that.  The rest was mostly
removing irrelevant hunks.

Thanks,
-- 
-David
da...@pgmasters.net
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 03c0dbf1cd..456f6f2c98 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -889,8 +889,11 @@ SELECT pg_start_backup('label', false, false);
 
 SELECT * FROM pg_stop_backup(false);
 
- This terminates the backup mode and performs an automatic switch to
- the next WAL segment.  The reason for the switch is to arrange for
+ This terminates backup mode. On a primary, it also performs an automatic
+ switch to the next WAL segment.  On a standby, it is not possible to
+ automatically switch WAL segments, so you may wish to run
+ pg_switch_wal on the primary to perform a manual
+ switch. The reason for the switch is to arrange for
  the last WAL segment file written during the backup interval to be
  ready to archive.
 
@@ -908,7 +911,7 @@ SELECT * FROM pg_stop_backup(false);
  Once the WAL segment files active during the backup are archived, you are
  done.  The file identified by pg_stop_backup's first return
  value is the last segment that is required to form a complete set of
- backup files.  If archive_mode is enabled,
+ backup files.  On a primary, if archive_mode is enabled,
  pg_stop_backup does not return until the last segment has
  been archived.
  Archiving of these files happens automatically since you have
@@ -924,6 +927,13 @@ SELECT * FROM pg_stop_backup(false);
  pg_stop_backup terminates because of this your backup
  may not be valid.
 
+
+
+ Note that on a standby pg_stop_backup does not wait for
+ WAL segments to be archived so the backup process must ensure that all WAL
+ segments required for the backup are successfully archived.
+
+

   
 
@@ -932,9 +942,9 @@ SELECT * FROM pg_stop_backup(false);
 Making an exclusive low level backup
 
  The process for an exclusive backup is mostly the same as for a
- non-exclusive one, but it differs in a few key steps. It does not allow
- more than one concurrent backup to run, and there can be some issues on
- the server if it crashes during the backup. Prior to PostgreSQL 9.6, this
+ non-exclusive one, but it differs in a few key steps. This type of backup
+ can only be taken on a primary and does not allow concurrent backups.
+ Prior to PostgreSQL 9.6, this
  was the only low-level method available, but it is now recommended that
  all users upgrade their scripts to use non-exclusive backups if possible.
 
@@ -992,6 +1002,11 @@ SELECT pg_start_backup('label', true);
   for things to
  consider during this backup.
 
+
+  Note that if the server crashes during the backup it may not be
+  possible to restart until the backup_label file has been
+  manually deleted from the PGDATA directory.
+


 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 8482601294..a803e747b2 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18070,11 +18070,18 @@ postgres=# select pg_start_backup('label_goes_here');
 pg_start_backup. In a non-exclusive backup, the contents of
 the backup_label and tablespace_map are returned
 in the result of the function, and should be written to files in the
-backup (and not in the data directory).
+backup (and not in the data directory).  When executed on a primary
+pg_stop_backup will wait for WAL to be archived when archiving
+is enabled.  On a standby pg_stop_backup will return
+immediately without waiting so it's important to verify that all required
+WAL segments have been archived. If write activity on the primary is low, 
it
+may be useful to run pg_switch_wal on the primary in order to
+trigger an immediate segment switch of the last required WAL.

 

-The function also creates a backup history file in the transaction log
+When executed on a primary, the function also creates a backup history file
+in the write-ahead log
 archive area. The history file includes the label given to
 pg_start_backup, the starting and ending transaction log 
locations for
 the backup, and the starting and ending times of the backup.  The return

-- 
Sent via pgsql-hackers mailing list 

Re: [HACKERS] More replication race conditions

2017-08-29 Thread Michael Paquier
On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier
 wrote:
> Today's run has finished with the same failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi=2017-08-27%2018%3A00%3A13
> Attached is a patch to make this code path wait until the transaction
> has been replayed. We could also use synchronous_commit = apply,
> but I prefer this patch's solution of using a wait query.

I have added an open item for this issue for PG10. The 2PC tests in
src/test/recovery are new in PG10, introduced by 3082098.
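
For illustration, the wait query amounts to something like the following, run
on the standby (a sketch; the LSN is a placeholder for a value captured from
the primary, and how the attached patch actually issues the wait is not shown
here):

SELECT pg_last_wal_replay_lsn() >= '0/3000000'::pg_lsn;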
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Fix drop replication slot blocking instead of returning error

2017-08-29 Thread Simone Gotti
On Tue, Aug 29, 2017 at 12:13 PM, Alvaro Herrera
 wrote:
>

Hi Alvaro,

> Simone Gotti wrote:
> > Hi all,
> >
> > I noticed that in postgres 10beta3, calling pg_drop_replication_slot on an
> > active slot will block until it's released instead of returning an error
> > like
> > done in pg 9.6. Since this is a change in the previous behavior and the docs
> > wasn't changed I made a patch to restore the previous behavior.
>
> Changing that behavior was the entire point of the cited commit.

Sorry, I was thinking that the new behavior was needed for internal
future functions since the doc wasn't changed.

>
> A better fix, from my perspective, is to amend the docs as per the
> attached patch.  This is what would be useful for logical replication,
> which is what replication slots were invented for in the first place.

I don't know the reasons why the new behavior is better for logical
replication so I trust you. We are using repl slots for physical
replication.

> If you disagree, let's discuss what other use cases you have, and we can
> come up with alternatives that satisfy both.

I just faced the opposite problem: in stolon [1] we currently rely on
the previous behavior, i.e. we don't want to block waiting for a slot
to be released (which, under some partitioning conditions, might not
happen for a long time), but prefer to keep retrying the drop later.
Right now we partially avoid blocking by timing out the drop operation
after a few seconds.
Another idea would be to query the slot status before doing the drop,
but that leads to a race condition (probably the opposite of the one
that commit was fixing) if the slot is acquired between the query and
the drop.
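
For clarity, the racy sequence described above is essentially this sketch
(with 'myslot' as a placeholder):

-- the slot can become active between these two statements
SELECT active FROM pg_replication_slots WHERE slot_name = 'myslot';
SELECT pg_drop_replication_slot('myslot');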

> I think a decent answer,
> but one which would create a bit of extra churn, would be to have an
> optional boolean flag in the command/function for "nowait", instead of
> hardcoding either behavior.

I think that this would be the best fix. I'm not sure about the policy
for these commands, or whether backward compatibility would be preferred
(in that case the old behavior should be the default and a new "wait"
flag could be added).

If the default behavior is going to change we have to add different
code for postgres >= 10.


Thanks,
Simone.


[1] https://github.com/sorintlab/stolon

>
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tuple-routing for certain partitioned tables not working as expected

2017-08-29 Thread Etsuro Fujita

On 2017/08/25 22:26, Robert Haas wrote:

On Wed, Aug 23, 2017 at 4:55 AM, Etsuro Fujita
 wrote:

Agreed, but I'd vote for fixing this in v10 as proposed; I agree that just
ripping the CheckValidResultRel checks out entirely is not a good idea, but
that seems OK to me at least as a fix just for v10.


I'm still not on-board with having this be the one case where we don't
do CheckValidResultRel.  If we want to still call it but pass down
some additional information that can selectively skip certain checks,
I could probably live with that.


Another idea would be to not do CheckValidResultRel for partitions in 
ExecSetupPartitionTupleRouting; instead, do that the first time the 
partition is chosen by ExecFindPartition, and if successfully checked, 
initialize the partition's ResultRelInfo and other stuff.  (We could 
skip this after the first time by checking whether we already have a 
valid ResultRelInfo for the chosen partition.)  That could allow us to 
not only call CheckValidResultRel the way it is, but avoid initializing 
useless partitions for which tuples aren't routed at all.



At some point we've got to stop developing v10 and just let it be what it is.


I agree on that point, but ISTM that this is more like a bug.

Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Adding support for Default partition in partitioning

2017-08-29 Thread Ashutosh Bapat
On Mon, Aug 21, 2017 at 4:47 PM, Jeevan Ladhe
 wrote:
>
> Hi,
>
> On Thu, Aug 17, 2017 at 3:29 PM, Jeevan Ladhe
>  wrote:
>>
>> Hi,
>>
>> On Tue, Aug 15, 2017 at 7:20 PM, Robert Haas 
>> wrote:
>>>
>>> On Wed, Jul 26, 2017 at 8:14 AM, Jeevan Ladhe
>>>  wrote:
>>> > I have rebased the patches on the latest commit.
>>>
>>> This needs another rebase.
>>
>>
>> I have rebased the patch and addressed your and Ashutosh comments on last
>> set of patches.

Thanks for the rebased patches.

>>
>> The current set of patches contains 6 patches as below:
>>
>> 0001:
>> Refactoring existing ATExecAttachPartition  code so that it can be used
>> for
>> default partitioning as well

  * Returns an expression tree describing the passed-in relation's partition
- * constraint.
+ * constraint. If there are no partition constraints returns NULL e.g. in case
+ * default partition is the only partition.
The first sentence uses singular constraint. The second uses plural. Given that
partition bounds together form a single constraint we should use singular
constraint in the second sentence as well.

Do we want to add a similar comment in the prologue of
generate_partition_qual()? The current wording there seems to cover this case,
but do we want to explicitly mention this case?

+if (!and_args)
+result = NULL;
While this is correct, I am increasingly seeing (and_args != NIL) usage.

get_partition_qual_relid() is called from pg_get_partition_constraintdef(),
constr_expr = get_partition_qual_relid(relationId);

/* Quick exit if not a partition */
if (constr_expr == NULL)
PG_RETURN_NULL();
The comment is now wrong since a default partition may have no constraints.
Maybe rewrite it simply as "Quick exit if no partition constraint."

generate_partition_qual() has three callers, and all of them are capable of
handling a NIL partition constraint for the default partition. Maybe it's
better to mention in the commit message that we have checked that the callers
of this function can handle a NIL partition constraint.
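
To make the NIL-constraint case concrete, it can be reproduced like this
(a sketch, assuming the syntax and functions proposed by this patch series):

CREATE TABLE plist (a int) PARTITION BY LIST (a);
CREATE TABLE plist_def PARTITION OF plist DEFAULT;
-- the default partition is the only partition, so there is no partition
-- constraint and this is expected to return NULL:
SELECT pg_get_partition_constraintdef('plist_def'::regclass);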

>>
>> 0002:
>> This patch teaches the partitioning code to handle the NIL returned by
>> get_qual_for_list().
>> This is needed because a default partition will not have any constraints
>> in case
>> it is the only partition of its parent.

If the partition being dropped is the default partition,
heap_drop_with_catalog() locks the default partition twice, once as the default
partition and a second time as the partition being dropped. So, it will be
counted as locked twice. There doesn't seem to be any harm in this, since the
lock will be held till the transaction ends, by which time all the locks will
be released.

The same is the case with the cache invalidation message. If we are dropping
the default partition, the cache invalidation message on the "default
partition" is not required. Again this might be harmless, but better to avoid
it.

A similar problem exists with ATExecDetachPartition(): the default partition
will be locked twice if it's being detached.

+/*
+ * If this is a default partition, pg_partitioned_table must have it's
+ * OID as value of 'partdefid' for it's parent (i.e. rel) entry.
+ */
+if (castNode(PartitionBoundSpec, boundspec)->is_default)
+{
+Oidpartdefid;
+
+partdefid = get_default_partition_oid(RelationGetRelid(rel));
+Assert(partdefid == inhrelid);
+}
An accidental change or database corruption may alter the default partition
OID in pg_partitioned_table, and an Assert won't help in such a case. Maybe we
should throw an error or at least report a warning. If we throw an error, the
table will become useless (or even the database will become useless, since
RelationBuildPartitionDesc is called from RelationCacheInitializePhase3() on
such a corrupted table). To avoid that we may raise a warning instead.

I am wondering whether we could avoid the call to get_default_partition_oid()
in the above block, thus avoiding a syscache lookup. The syscache search
shouldn't be expensive since the cache should already have that entry, but
still, if we can avoid it we save some CPU cycles. The default partition OID is
stored in the pg_partitioned_table catalog, which is looked up in
RelationGetPartitionKey(), a function which precedes RelationGetPartitionDesc()
everywhere. What if RelationGetPartitionKey() also returned the default
partition OID and the common caller passed it to RelationGetPartitionDesc()?

+/* A partition cannot be attached if there exists a default partition */
+defaultPartOid = get_default_partition_oid(RelationGetRelid(rel));
+if (OidIsValid(defaultPartOid))
+ereport(ERROR,
+(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+ errmsg("cannot attach a new partition to table
\"%s\" having a default partition",
+

Re: [HACKERS] hash partitioning based on v10Beta2

2017-08-29 Thread yangjie
Hi,

When there are many partitions and a lot of data, adding new partitions runs
into some efficiency problems.

I don't understand how the solution you're talking about would implement hash
partitioning.

---
young
HighGo Database: http://www.highgo.com

On 8/28/2017 22:25, Robert Haas wrote:

On Sat, Aug 26, 2017 at 12:40 AM, yang...@highgo.com  wrote:
> A partition table can be create as bellow:
>
>  CREATE TABLE h1 PARTITION OF h;
>  CREATE TABLE h2 PARTITION OF h;
>  CREATE TABLE h3 PARTITION OF h;

This syntax is very problematic for reasons that have been discussed
on the existing hash partitioning thread.  Fortunately, a solution has
already been implemented... you're the third person to try to write a
patch for this feature.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: [HACKERS] [POC] hash partitioning

2017-08-29 Thread yangjie
Hi,

This is my patch; I forgot to add the attachment before. The following address
is where it is also being discussed:
https://www.postgresql.org/message-id/2017082612390093777512%40highgo.com

---
young
HighGo Database: http://www.highgo.com

On 8/28/2017 16:28, Yugo Nagata wrote:

Hi young,

On Mon, 28 Aug 2017 15:33:46 +0800
"yang...@highgo.com"  wrote:

> Hello
> 
> Looking at your hash partitioning syntax, I implemented a hash partition in a more concise way, with no need to determine the number of sub-tables, and dynamically add partitions.

I think it is great work, but the current consensus about hash-partitioning supports 
Amul's patch[1], in which the syntax is different from my original proposal. 
So, you will have to read Amul's patch and start a discussion if you still want to
propose your implementation.

Regards,

[1] https://www.postgresql.org/message-id/CAAJ_b965A2oog=6efuhelexl3rmgfssb3g7lwkva1bw0wuj...@mail.gmail.com


> 
> Description
> 
> The hash partition's implement is on the basis of the original range / list partition,and using similar syntax.
> 
> To create a partitioned table ,use:
> 
> CREATE TABLE h (id int) PARTITION BY HASH(id);
> 
> The partitioning key supports only one value, and I think the partition key can support multiple values, 
> which may be difficult to implement when querying, but it is not impossible.
> 
> A partition table can be create as bellow:
> 
>  CREATE TABLE h1 PARTITION OF h;
>  CREATE TABLE h2 PARTITION OF h;
>  CREATE TABLE h3 PARTITION OF h;
>  
> FOR VALUES clause cannot be used, and the partition bound is calclulated automatically as partition index of single integer value.
> 
> An inserted record is stored in a partition whose index equals 
> DatumGetUInt32(OidFunctionCall1(lookup_type_cache(key->parttypid[0], TYPECACHE_HASH_PROC)->hash_proc, values[0])) % nparts/* Number of partitions */
> ;
> In the above example, this is DatumGetUInt32(OidFunctionCall1(lookup_type_cache(key->parttypid[0], TYPECACHE_HASH_PROC)->hash_proc, id)) % 3;
> 
> postgres=# insert into h select generate_series(1,20);
> INSERT 0 20
> postgres=# select tableoid::regclass,* from h;
>  tableoid | id 
> --+
>  h1   |  3
>  h1   |  5
>  h1   | 17
>  h1   | 19
>  h2   |  2
>  h2   |  6
>  h2   |  7
>  h2   | 11
>  h2   | 12
>  h2   | 14
>  h2   | 15
>  h2   | 18
>  h2   | 20
>  h3   |  1
>  h3   |  4
>  h3   |  8
>  h3   |  9
>  h3   | 10
>  h3   | 13
>  h3   | 16
> (20 rows)
> 
> The number of partitions here can be dynamically added, and if a new partition is created, the number of partitions changes, the calculated target partitions will change, and the same data is not reasonable in different partitions,So you need to re-calculate the existing data and insert the target partition when you create a new partition.
> 
> postgres=# create table h4 partition of h;
> CREATE TABLE
> postgres=# select tableoid::regclass,* from h;
>  tableoid | id 
> --+
>  h1   |  5
>  h1   | 17
>  h1   | 19
>  h1   |  6
>  h1   | 12
>  h1   |  8
>  h1   | 13
>  h2   | 11
>  h2   | 14
>  h3   |  1
>  h3   |  9
>  h3   |  2
>  h3   | 15
>  h4   |  3
>  h4   |  7
>  h4   | 18
>  h4   | 20
>  h4   |  4
>  h4   | 10
>  h4   | 16
> (20 rows)
> 
> When querying the data, the hash partition uses the same algorithm as the insertion, and filters out the table that does not need to be scanned.
> 
> postgres=# explain analyze select * from h where id = 1;
>  QUERY PLAN 
> 
>  Append  (cost=0.00..41.88 rows=13 width=4) (actual time=0.020..0.023 rows=1 loops=1)
>->  Seq Scan on h3  (cost=0.00..41.88 rows=13 width=4) (actual time=0.013..0.016 rows=1 loops=1)
>  Filter: (id = 1)
>  Rows Removed by Filter: 3
>  Planning time: 0.346 ms
>  Execution time: 0.061 ms
> (6 rows)
> 
> postgres=# explain analyze select * from h where id in (1,5);;
>  QUERY PLAN 
> 
>  Append  (cost=0.00..83.75 rows=52 width=4) (actual time=0.016..0.028 rows=2 loops=1)
>->  Seq Scan on h1  (cost=0.00..41.88 rows=26 width=4) (actual time=0.015..0.018 rows=1 loops=1)
>  Filter: (id = ANY ('{1,5}'::integer[]))
>  Rows Removed by Filter: 6
>->  Seq Scan on h3  (cost=0.00..41.88 rows=26 width=4) (actual time=0.005..0.007 rows=1 loops=1)
> 

Re: [HACKERS] [PATCH] Fix drop replication slot blocking instead of returning error

2017-08-29 Thread Alvaro Herrera
Simone Gotti wrote:
> Hi all,
> 
> I noticed that in postgres 10beta3, calling pg_drop_replication_slot on an
> active slot will block until it's released instead of returning an error
> like
> done in pg 9.6. Since this is a change in the previous behavior and the docs
> wasn't changed I made a patch to restore the previous behavior.

Changing that behavior was the entire point of the cited commit.

A better fix, from my perspective, is to amend the docs as per the
attached patch.  This is what would be useful for logical replication,
which is what replication slots were invented for in the first place.
If you disagree, let's discuss what other use cases you have, and we can
come up with alternatives that satisfy both.  I think a decent answer,
but one which would create a bit of extra churn, would be to have an
optional boolean flag in the command/function for "nowait", instead of
hardcoding either behavior.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 5761b79c6ebc4f6dfaf4d2489e9bff47b9e0c3ad Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 29 Aug 2017 12:08:47 +0200
Subject: [PATCH] Document that DROP_REPLICATION_SLOT now waits

---
 doc/src/sgml/protocol.sgml | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 7c012f59a3..e61d57a234 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2179,7 +2179,8 @@ The commands accepted in walsender mode are:
 
  
   Drops a replication slot, freeing any reserved server-side resources. If
-  the slot is currently in use by an active connection, this command fails.
+  the slot is currently in use by an active connection, this command waits
+  until the slot becomes free.
   If the slot is a logical slot that was created in a database other than
   the database the walsender is connected to, this command fails.
  
-- 
2.11.0


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [PATCH] Fix drop replication slot blocking instead of returning error

2017-08-29 Thread Alvaro Herrera
Noah Misch wrote:
> On Thu, Aug 24, 2017 at 03:38:20PM +0200, Simone Gotti wrote:
> > I noticed that in postgres 10beta3, calling pg_drop_replication_slot on an
> > active slot will block until it's released instead of returning an error
> > like
> > done in pg 9.6. Since this is a change in the previous behavior and the docs
> > wasn't changed I made a patch to restore the previous behavior.

> > after git commit 9915de6c1cb calls to pg_drop_replication_slot or the
> > replication protocol DROP_REPLICATION_SLOT command will block when a
> > slot is in use instead of returning an error as before (as the doc
> > states).
> > 
> > This patch will set nowait to true to restore the previous
> > behavior.

> The above-described topic is currently a PostgreSQL 10 open item.

Looking at it now -- next report tomorrow.


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Separate log file for extension

2017-08-29 Thread Masahiko Sawada
On Fri, Aug 25, 2017 at 4:12 PM, Antonin Houska  wrote:
> Attached is a draft patch to allow extension to write log messages to a
> separate file.

Does it allow postgres core code to write log messages to multiple log
files as well? I can imagine a use case where we want to write log
messages to separate log files by error level or purpose (server log
and SQL log etc).

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-08-29 Thread Mithun Cy
On Thu, Aug 3, 2017 at 2:16 AM, Andres Freund  wrote:
>I think this patch should have a "cover letter" explaining what it's
> trying to achieve, how it does so and why it's safe/correct.

Cache the SnapshotData for reuse:
===
In one of our perf analyses using the perf tool, GetSnapshotData showed a
very high percentage of CPU cycles on a read-only workload with a very
high number of concurrent connections (>= 64).

Machine : cthulhu 8 node machine.
--
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):128
On-line CPU(s) list:   0-127
Thread(s) per core:2
Core(s) per socket:8
Socket(s): 8
NUMA node(s):  8
Vendor ID: GenuineIntel
CPU family:6
Model: 47
Model name:Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:  2
CPU MHz:   2128.000
BogoMIPS:  4266.63
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  24576K
NUMA node0 CPU(s): 0,65-71,96-103
NUMA node1 CPU(s): 72-79,104-111
NUMA node2 CPU(s): 80-87,112-119
NUMA node3 CPU(s): 88-95,120-127
NUMA node4 CPU(s): 1-8,33-40
NUMA node5 CPU(s): 9-16,41-48
NUMA node6 CPU(s): 17-24,49-56
NUMA node7 CPU(s): 25-32,57-64

On further analysis, it appears this is mainly due to cache misses
while iterating through the large proc array to compute the
SnapshotData. Also, on a read-only (read-heavy) workload the
SnapshotData mostly remains the same, as no (or rarely a) transaction
commits or rolls back. Taking advantage of this, we can save the
previously computed SnapshotData in shared memory, and in
GetSnapshotData we can use the saved SnapshotData instead of computing
it again while it is still valid. A saved SnapshotData is valid until
an open transaction ends after it was taken.

[Perf cache-misses analysis of Base for 128 pgbench readonly clients]
==
Samples: 421K of event 'cache-misses', Event count (approx.): 160069274
Overhead  Command   Shared Object   Symbol
 18.63%  postgres  postgres  [.] GetSnapshotData
 11.98%  postgres  postgres  [.] _bt_compare
 10.20%  postgres  postgres  [.] PinBuffer
  8.63%  postgres  postgres  [.] LWLockAttemptLock
  6.50%  postgres  postgres  [.] LWLockRelease


[Perf cache-misses analysis with Patch for 128 pgbench readonly clients]

Samples: 357K of event 'cache-misses', Event count (approx.): 102380622
Overhead  Command   Shared Object   Symbol
  0.27%  postgres  postgres  [.] GetSnapshotData

With the patch, cache-miss events appear to be reduced significantly.

Test Setting:
=
Server configuration:
./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c
max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c
wal_buffers=256MB &

pgbench configuration:
scale_factor = 300
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S  postgres

The machine has 64 cores; with this patch I can see the server start to
improve after 64 clients. I have tested up to 256 clients, which shows a
performance improvement of nearly 39% at best [1]. This is the best case
for the patch, where the once-computed SnapshotData is reused repeatedly.

The worst case seems to be small, quick write transactions which
frequently invalidate the cached SnapshotData before it is reused by
any next GetSnapshotData call. As of now, I tested simple update
tests: pgbench -M Prepared -N on the same machine with the above
server configuration. I do not see much change in TPS numbers.

All TPS are median of 3 runs
Clients   TPS-With Patch 05   TPS-Base        %Diff
1         752.461117          755.186777      -0.3%
64        32171.296537        31202.153576    +3.1%
128       41059.660769        40061.929658    +2.49%

I will do some profiling and find out why this case is not costing us any
performance despite the caching overhead.

>> diff --git a/src/backend/storage/ipc/procarray.c 
>> b/src/backend/storage/ipc/procarray.c
>> index a7e8cf2..4d7debe 100644
>> --- a/src/backend/storage/ipc/procarray.c
>> +++ b/src/backend/storage/ipc/procarray.c
>> @@ -366,11 +366,13 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
>>   (arrayP->numProcs - index - 1) * 
>> sizeof(int));
>>   arrayP->pgprocnos[arrayP->numProcs - 1] = -1;   /* for 
>> debugging */
>>   arrayP->numProcs--;
>> + ProcGlobal->is_snapshot_valid = false;
>>   LWLockRelease(ProcArrayLock);
>>   return;
>>

Re: [HACKERS] postgres_fdw bug in 9.6

2017-08-29 Thread Etsuro Fujita

On 2017/08/21 20:37, Etsuro Fujita wrote:

Attached is an updated version of the patch.


I corrected some comments.  New version attached.

Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 1701,1722  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = 
t2.c1) ORDER BY t1.c3, t
 Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS 
NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) 
END, r2."C 1", CASE WHEN (r2.*)::text IS NOT NULL THEN ROW(r2."C 1", r2.c2, 
r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8) END FROM ("S 1"."T 1" r1 INNER JOIN 
"S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, 
r1."C 1" ASC NULLS LAST FOR UPDATE OF r1
!->  Merge Join
   Output: t1.c1, t1.c3, t1.*, t2.c1, t2.*
!  Merge Cond: (t1.c1 = t2.c1)
!  ->  Sort
 Output: t1.c1, t1.c3, t1.*
!Sort Key: t1.c1
!->  Foreign Scan on public.ft1 t1
!  Output: t1.c1, t1.c3, t1.*
!  Remote SQL: SELECT "C 1", c2, c3, c4, c5, 
c6, c7, c8 FROM "S 1"."T 1" FOR UPDATE
!  ->  Sort
 Output: t2.c1, t2.*
!Sort Key: t2.c1
!->  Foreign Scan on public.ft2 t2
!  Output: t2.c1, t2.*
!  Remote SQL: SELECT "C 1", c2, c3, c4, c5, 
c6, c7, c8 FROM "S 1"."T 1"
! (23 rows)
  
  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY 
t1.c3, t1.c1 OFFSET 100 LIMIT 10 FOR UPDATE OF t1;
   c1  | c1  
--- 1701,1716 
 Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS 
NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) 
END, r2."C 1", CASE WHEN (r2.*)::text IS NOT NULL THEN ROW(r2."C 1", r2.c2, 
r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8) END FROM ("S 1"."T 1" r1 INNER JOIN 
"S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, 
r1."C 1" ASC NULLS LAST FOR UPDATE OF r1
!->  Nested Loop
   Output: t1.c1, t1.c3, t1.*, t2.c1, t2.*
!  Join Filter: (t1.c1 = t2.c1)
!  ->  Foreign Scan on public.ft1 t1
 Output: t1.c1, t1.c3, t1.*
!Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, 
c8 FROM "S 1"."T 1" FOR UPDATE
!  ->  Foreign Scan on public.ft2 t2
 Output: t2.c1, t2.*
!Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, 
c8 FROM "S 1"."T 1"
! (17 rows)
  
  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY 
t1.c3, t1.c1 OFFSET 100 LIMIT 10 FOR UPDATE OF t1;
   c1  | c1  
***
*** 1745,1766  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = 
t2.c1) ORDER BY t1.c3, t
 Output: t1.c1, t2.c1, t1.c3, t1.*, t2.*
 Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Remote SQL: SELECT r1."C 1", r1.c3, CASE WHEN (r1.*)::text IS 
NOT NULL THEN ROW(r1."C 1", r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8) 
END, r2."C 1", CASE WHEN (r2.*)::text IS NOT NULL THEN ROW(r2."C 1", r2.c2, 
r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8) END FROM ("S 1"."T 1" r1 INNER JOIN 
"S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1" ORDER BY r1.c3 ASC NULLS LAST, 
r1."C 1" ASC NULLS LAST FOR UPDATE OF r1 FOR UPDATE OF r2
!->  Merge Join
   Output: t1.c1, t1.c3, t1.*, t2.c1, t2.*
!  Merge Cond: (t1.c1 = t2.c1)
!  ->  Sort
 Output: t1.c1, t1.c3, t1.*
!Sort Key: t1.c1
!->  Foreign Scan on public.ft1 t1
!  Output: t1.c1, t1.c3, t1.*
!  Remote SQL: SELECT "C 1", c2, c3, c4, c5, 
c6, c7, c8 FROM "S 1"."T 1" FOR UPDATE
!  ->  Sort
 Output: t2.c1, t2.*
!Sort Key: t2.c1
!->  Foreign Scan on public.ft2 t2
!  Output: t2.c1, t2.*
!  Remote SQL: SELECT "C 1", c2, c3, c4, c5, 
c6, c7, c8 FROM "S 1"."T 1" FOR UPDATE
! (23 rows)
  
  SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 ON (t1.c1 = t2.c1) ORDER BY 
t1.c3, t1.c1 OFFSET 100 LIMIT 10 FOR 

Re: [HACKERS] pgbench: Skipping the creating primary keys after initialization

2017-08-29 Thread Fabien COELHO


Hello,

Patch applies and works.


I would like to insist a little bit: the name was declared as an int but
passed to init and used as a bool there before the patch. Conceptually what
is meant is really a bool, and I see no added value bar saving a few strokes
to have an int. ISTM that recent pgbench changes have started turning old
int-for-bool habits into using bool when bool is meant.


Since is_no_vacuum is an existing one, if we follow that habit we should 
change other similar variables as well: is_init_mode, do_vacuum_accounts 
and debug. And I think we should change them in a separate patch.


Hmmm. The existing "is_no_vacuum" variable is typed *both* as int (in 
"main") and as bool (in "init"), called by main (yuk!). I see no reason to 
choose the bad one for the global:-)



After more thought, I'm a bit inclined to not have a short option for
--custom-initialize because this option will be used only in non-primary
cases. It would be better to save short options for future
enhancements of pgbench. Thoughts?


I like it as is, especially as now the associated value is a simple and 
short string; I think that it makes sense to have a simple and short 
option to trigger it. Moreover -I stands cleanly for "initialization", and 
the capital stands for something a little special, which it is. It's too 
good to miss.


I think that the "-I" should be added to the "--help" line, as is done 
with the other short & long options.


Repeating "-I f" results in multiple foreign key constraints:

  Foreign-key constraints:
"pgbench_tellers_bid_fkey" FOREIGN KEY (bid) REFERENCES 
pgbench_branches(bid)
"pgbench_tellers_bid_fkey1" FOREIGN KEY (bid) REFERENCES 
pgbench_branches(bid)
"pgbench_tellers_bid_fkey2" FOREIGN KEY (bid) REFERENCES 
pgbench_branches(bid)
"pgbench_tellers_bid_fkey3" FOREIGN KEY (bid) REFERENCES 
pgbench_branches(bid)

I wonder if this could be avoided easily? Maybe by setting the constraint 
name explicitly so that the second one fails on the existing one, which 
is fine, like for primary keys? Or adding a DROP CONSTRAINT IF EXISTS 
before the ADD CONSTRAINT, like for tables? Or doing nothing about it? 
I would prefer the first option.
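
A sketch of that first option: give the constraint an explicit name, so that
a repeated "-I f" fails on the duplicate name instead of silently stacking
another constraint (the exact statement pgbench would issue is an assumption
here):

ALTER TABLE pgbench_tellers
  ADD CONSTRAINT pgbench_tellers_bid_fkey
  FOREIGN KEY (bid) REFERENCES pgbench_branches (bid);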


Maybe the initial cleanup (DROP TABLE) could be made an option added to 
the default, so that cleaning up the database could be achieved with some 
"pgbench -i -I c", instead of connecting and dropping the tables one by one, 
which I have done quite a few times... What do you think?


Before it is definitely engraved, I'm thinking about the letters:

 c - cleanup
 t - create table
 d - data
 p - primary key
 f - foreign key
 v - vacuum

I think it is mostly okay, but now is the last chance to think about it. 
Using "d" for cleanup (drop) would mean finding another letter for filling 
in data... maybe "g" for data generation? "c" may have been chosen for 
"create table", but then it would not be available for "cleanup". Thoughts?


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] [postgresql 10 beta3] unrecognized node type: 90

2017-08-29 Thread Amit Kapila
On Tue, Aug 29, 2017 at 8:32 AM, Robert Haas  wrote:
> On Mon, Aug 28, 2017 at 10:17 PM, Tom Lane  wrote:
>> In the meantime, I think what we should do is commit the bug fix more or
>> less as I have it, and then work on Amit's concern about losing parallel
>> efficiency by separating the resetting of shared parallel-scan state
>> into a new plan tree traversal that's done before launching new worker
>> processes.
>>

Sounds like a reasonable plan to me.

>>  The only real alternative is to lobotomize the existing rescan
>> optimizations, and that seems like a really poor choice from here.
>
> There's already ExecParallelReinitialize, which could be made to walk
> the nodes in addition to what it does already, but I don't understand
> exactly what else needs fixing.
>

Sure, but it is not advisable to reset the state of all the nodes below
the Gather at that place; otherwise, it will be more or less like forcing
a rescan of each node.  I think there we can reset the shared parallel
state of parallel-aware nodes (or anything related) and then allow the
rescan to reset the master-backend-specific state for all nodes beneath
the Gather.
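
A hedged sketch of what such a traversal might look like; the walker shape
is an assumption and the per-node callback name is hypothetical (only
parallel-aware nodes would need one):

    /* Hypothetical sketch: reset only the shared (DSM) scan state of
     * parallel-aware nodes before launching a fresh batch of workers,
     * leaving the master backend's local state to the normal rescan path. */
    static bool
    ExecParallelReInitializeDSM(PlanState *planstate, ParallelContext *pcxt)
    {
        if (planstate == NULL)
            return false;

        switch (nodeTag(planstate))
        {
            case T_SeqScanState:
                /* hypothetical per-node hook for parallel-aware seq scans */
                ExecSeqScanReInitializeDSM((SeqScanState *) planstate, pcxt);
                break;
            default:
                break;
        }

        return planstate_tree_walker(planstate, ExecParallelReInitializeDSM, pcxt);
    }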

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] assorted code cleanup

2017-08-29 Thread Ryan Murphy
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

I've reviewed the code changes, and it's pretty clear to me that they clean 
things up a bit while not changing any behavior.  They simplify things in a 
way that makes the code more comprehensible.  I've run all the tests and 
they behave the same way as they did before the patch.  I also tried 
manually playing around with the function in question, `metaphone`, and it 
seems to behave the same as before.

I think it's ready to commit!

The new status of this patch is: Ready for Committer

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash Functions

2017-08-29 Thread amul sul
On Tue, Aug 22, 2017 at 5:44 PM, amul sul  wrote:

> On Fri, Aug 18, 2017 at 11:01 PM, Robert Haas 
> wrote:
>
>> On Fri, Aug 18, 2017 at 1:12 PM, amul sul  wrote:
>> > I have a small query: what if I want a cache entry with the extended
>> > hash function instead of the standard one? I might require that while
>> > adding the hash_array_extended function. Do you think we need to extend
>> > lookup_type_cache() as well?
>>
>> Hmm, I thought you had changed the hash partitioning stuff so that it
>> didn't rely on lookup_type_cache().  You have to look up the function
>> using the opclass provided in the partition key definition;
>> lookup_type_cache() will give you the default one for the datatype.
>> Maybe just use get_opfamily_proc?
>>
>>
> Yes, we can do that for the partitioning code, but my concern is a little
> bit different.  I apologize, I wasn't clear enough.
>
> I am trying to extend the hash_array & hash_range functions. hash_array
> and hash_range compute a hash by using the respective hash function for
> the given argument type (i.e. the array/range element type), and those
> hash functions are made available in the TypeCacheEntry via
> lookup_type_cache(). But the extended versions of hash_array & hash_range
> require the respective extended hash function for those element types.
>
> I have added hash_array_extended & hash_range_extended functions in the
> attached patch 0001, which maintains a local copy of the TypeCacheEntry
> with the extended hash functions. But I am a little bit skeptical about
> this logic; any advice/suggestions will be greatly appreciated.
>
>
Instead, in the attached patch, I have modified lookup_type_cache() so that
callers can request the extended hash function in the TypeCacheEntry.

For that, I've introduced new flags, TYPECACHE_HASH_EXTENDED_PROC,
TYPECACHE_HASH_EXTENDED_PROC_FINFO & TCFLAGS_CHECKED_HASH_EXTENDED_PROC, and
additional fields in the TypeCacheEntry structure to hold the extended hash
proc information.
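
A minimal sketch of how a caller such as hash_array_extended() might then
request it; the field name and error wording below are assumptions modeled
on the existing hash_proc path, not necessarily what the patch does:

    /* Hypothetical caller-side lookup of the extended hash support proc. */
    TypeCacheEntry *typentry;

    typentry = lookup_type_cache(element_type,
                                 TYPECACHE_HASH_EXTENDED_PROC_FINFO);
    if (!OidIsValid(typentry->hash_extended_proc_finfo.fn_oid))
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_FUNCTION),
                 errmsg("could not identify an extended hash function for type %s",
                        format_type_be(element_type))));
    /* typentry->hash_extended_proc_finfo can now be called with the
     * element datum and the 64-bit seed. */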


> The logic in the rest of the extended hash functions is the same as in
> the standard ones.


Same for the hash_array_extended() & hash_range_extended() functions as
well.

Regards,
Amul


0001-add-optional-second-hash-proc-v2.patch
Description: Binary data


0002-test-Hash_functions_wip.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Quorum commit for multiple synchronous replication.

2017-08-29 Thread Masahiko Sawada
On Thu, Aug 24, 2017 at 4:27 PM, Masahiko Sawada  wrote:
> On Thu, Aug 24, 2017 at 3:11 AM, Josh Berkus  wrote:
>> On 08/22/2017 11:04 PM, Masahiko Sawada wrote:
>>> WARNING:  what you did is ok, but you might have wanted to do something else
>>>
>>> First of all, whether or not that can properly be called a warning is
>>> highly debatable.  Also, if you do that sort of thing to your spouse
>>> and/or children, they call it "nagging".  I don't think users will
>>> like it any more than family members do.
>>
>> Realistically, we'll support the backwards-compatible syntax for 3-5
>> years.  Which is fine.
>>
>> I suggest that we just gradually deprecate the old syntax from the docs,
>> and then around Postgres 16 eliminate it.  I posit that that's better
>> than changing the meaning of the old syntax out from under people.
>>
>
> It seems to me that nobody is actively voting for making the quorum
> commit behavior the default. Some folks suggest keeping backward
> compatibility in PG10 and gradually deprecating the old syntax, and only
> a deprecation notice in the docs is possible for PG10.
>

According to the discussion so far, it seems to me that keeping backward
compatibility, and noting in the docs that the old syntax could be changed
or removed in a future release, is the most acceptable way for PG10. There
has been no objection to that so far, and I have already posted a patch to
add such a warning to the docs [1]. I'll wait for the committer's decision.

[1] 
https://www.postgresql.org/message-id/CAD21AoAe%2BoGSFi3bjZ%2BfW6Q%3DTK7avPdDCZLEr02zM_c-U0JsRA%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-08-29 Thread Mithun Cy
Hi all, please ignore this mail; it is incomplete and I have to add more
information. Sorry, I accidentally pressed the send button while replying.


On Tue, Aug 29, 2017 at 11:27 AM, Mithun Cy  wrote:
> Cache the SnapshotData for reuse:
> ===
> In one of our perf analysis using perf tool it showed GetSnapshotData
> takes a very high percentage of CPU cycles on readonly workload when
> there is very high number of concurrent connections >= 64.
>
> Machine : cthulhu 8 node machine.
> --
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                128
> On-line CPU(s) list:   0-127
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             8
> NUMA node(s):          8
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 47
> Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
> Stepping:              2
> CPU MHz:               2128.000
> BogoMIPS:              4266.63
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              24576K
> NUMA node0 CPU(s):     0,65-71,96-103
> NUMA node1 CPU(s):     72-79,104-111
> NUMA node2 CPU(s):     80-87,112-119
> NUMA node3 CPU(s):     88-95,120-127
> NUMA node4 CPU(s):     1-8,33-40
> NUMA node5 CPU(s):     9-16,41-48
> NUMA node6 CPU(s):     17-24,49-56
> NUMA node7 CPU(s):     25-32,57-64
>
> Perf CPU cycle 128 clients.
>
> On further analysis, it appears this is mainly due to cache misses
> happening while iterating through the large proc array to compute the
> SnapshotData. Also, on a read-only (read-heavy) workload the SnapshotData
> mostly stays the same, as no (or only rarely a) transaction commits or
> rolls back. Taking advantage of this, we can save the previously computed
> SnapshotData in shared memory, and in GetSnapshotData we can use the
> saved SnapshotData instead of computing it again while it is still valid.
> A saved SnapshotData remains valid until some transaction ends.
>
> [Perf analysis Base]
>
> Samples: 421K of event 'cache-misses', Event count (approx.): 160069274
> Overhead  Command   Shared Object  Symbol
>   18.63%  postgres  postgres       [.] GetSnapshotData
>   11.98%  postgres  postgres       [.] _bt_compare
>   10.20%  postgres  postgres       [.] PinBuffer
>    8.63%  postgres  postgres       [.] LWLockAttemptLock
>    6.50%  postgres  postgres       [.] LWLockRelease
>
>
> [Perf analysis with Patch]
>
>
> Server configuration:
> ./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c
> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c
> wal_buffers=256MB &
>
> pgbench configuration:
> scale_factor = 300
> ./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S  
> postgres
>
> The machine has 64 cores, and with this patch I can see the server start
> to improve beyond 64 clients. I have tested up to 256 clients, which
> shows a performance improvement of nearly 39% at best [1]. This is the
> best case for the patch, where the once-computed SnapshotData is reused
> again and again.
>
> The worst case seems to be small, quick write transactions, which
> frequently invalidate the cached SnapshotData before it is reused by any
> subsequent GetSnapshotData call. As of now, I have tested simple update
> runs (pgbench -M prepared -N) on the same machine with the above server
> configuration. I do not see much change in the TPS numbers.
>
> All TPS are median of 3 runs
> Clients   TPS-With Patch 05   TPS-Base        %Diff
> 1         752.461117          755.186777      -0.3%
> 64        32171.296537        31202.153576    +3.1%
> 128       41059.660769        40061.929658    +2.49%
>
> I will do some profiling to find out why this case does not cost us any
> performance despite the caching overhead.
>
>
> []
>
> On Thu, Aug 3, 2017 at 2:16 AM, Andres Freund  wrote:
>> Hi,
>>
>> I think this patch should have a "cover letter" explaining what it's
>> trying to achieve, how it does so and why it's safe/correct.  I think
>> it'd also be good to try to show some of the worst cases of this patch
>> (i.e. where the caching only adds overhead, but never gets used), and
>> some of the best cases (where the cache gets used quite often, and
>> reduces contention significantly).
>>
>> I wonder if there's a way to avoid copying the snapshot into the cache
>> in situations where we're barely ever going to need it. But right now
>> the only way I can think of is to look at the length of the
>> ProcArrayLock wait queue and count the exclusive locks therein - which'd
>> be expensive and intrusive...
>>
>>
>> On 2017-08-02 15:56:21 +0530, Mithun Cy wrote:
>>> diff --git a/src/backend/storage/ipc/procarray.c 
>>> 

Re: [HACKERS] MAIN, Uncompressed?

2017-08-29 Thread Simon Riggs
On 26 August 2017 at 05:40, Mark Kirkwood  wrote:
> On 26/08/17 12:18, Simon Riggs wrote:
>
>> On 25 August 2017 at 20:53, Tom Lane  wrote:
>>>
>>> Greg Stark  writes:

 I think this is a particularly old piece of code and we're lucky the
 default heuristics have served well for all this time because I doubt
 many people fiddle with these storage attributes. The time may have
 come to come up with a better UI for the storage attributes because
 people are doing new things (like json) and wanting more control over
 this heuristic.
>>>
>>> Yeah, I could get behind a basic rethinking of the tuptoaster control
>>> knobs.  I'm just not in love with Simon's specific proposal, especially
>>> not if we can't think of a better name for it than "MAINU".
>>
>> Extended/External would be just fine if you could set the toast
>> target, so I think a better suggestion would be to make "toast_target"
>> a per-attribute option .
>>
>
> +1, have thought about this myself previously ... thank you for bringing
> it up!

OK, so a table-level option for "toast_tuple_target", not an
attribute-level option.

The attached patch and test show this concept is useful and doesn't
affect existing data.

For 4 x 4000-byte rows:
* by default we use 1 heap block and 3 toast blocks
* with toast_tuple_target=4080 we use 2 heap blocks and 0 toast blocks
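
For reference, a hedged sketch of how the reloption might be used (the
syntax is an assumption modeled on other table-level storage parameters,
not taken from the attached patch):

    ALTER TABLE wide_docs SET (toast_tuple_target = 4080);
    -- and back to the default behaviour:
    ALTER TABLE wide_docs RESET (toast_tuple_target);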

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


toast_tuple_target.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] standby server crashes hard on out-of-disk-space in HEAD

2017-08-29 Thread Michael Paquier
On Mon, Aug 28, 2017 at 9:06 PM, Michael Paquier
 wrote:
> On Tue, Jun 13, 2017 at 4:21 AM, Andres Freund  wrote:
>> On 2017-06-12 15:12:23 -0400, Robert Haas wrote:
>>> Commit 4b4b680c3d6d8485155d4d4bf0a92d3a874b7a65 (Make backend local
>>> tracking of buffer pins memory efficient., vintage 2014) seems like a
>>> likely culprit here, but I haven't tested.
>>
>> I'm not that sure. As written above, the Assert isn't new, and given
>> this hasn't been reported before, I'm a bit doubtful that it's a general
>> refcount tracking bug.  The FPI code has been whacked around more
>> heavily, so it could well be a bug in it somewhere.
>
> Something doing a bisect could just use a VM that puts the standby on
> a tiny partition. I remember seeing this assertion failure some time
> ago on a test deployment, and that was really surprising. I think that
> this may be hiding something, so we should really try to investigate
> more what's wrong here.

I have been playing a bit with the builds and the attached script
triggering out-of-space errors on a standby (adapt to your
environment), and while looking for a good commit, I have found that
this thing is a bit older than the 2014 vintage... Down to the
merge-base of REL9_4_STABLE and REL9_3_STABLE, the assertion failure
is still the same.
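
The attached script is not reproduced here; a rough, hypothetical sketch of
the kind of setup it implies (paths, sizes and ports are assumptions) could
look like:

    # Put the standby's data directory on a deliberately tiny tmpfs so that
    # WAL replay runs out of disk space quickly.
    sudo mkdir -p /mnt/tiny_standby
    sudo mount -t tmpfs -o size=64M tmpfs /mnt/tiny_standby
    pg_basebackup -D /mnt/tiny_standby/data -R -X stream -p 5432
    pg_ctl -D /mnt/tiny_standby/data -o "-p 5433" -l standby.log start
    # On the primary, generate enough WAL to fill the standby's partition:
    pgbench -i -s 100 postgres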

The failure is older than even 9.2, for example by testing at the
merge-base of 9.2 and 9.3:
CONTEXT:  xlog redo insert(init): rel 1663/16384/16385; tid 181441/1
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c",
Line: 1788)
But well this assertion got changed in dcafdbcd.
-- 
Michael


crash_standby.bash
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] show "aggressive" or not in autovacuum logs

2017-08-29 Thread Masahiko Sawada
On Tue, Aug 29, 2017 at 10:16 AM, Robert Haas  wrote:
> On Mon, Aug 28, 2017 at 5:26 AM, Kyotaro HORIGUCHI
>  wrote:
>> Currently the closing message shows a '%d skipped-frozen' figure, but
>> that alone is insufficient to verify the true effect. This is a patch to
>> show the mode as 'aggressive' or 'normal' in the closing message of
>> vacuum; the skipped-frozen count shows the true effect of ALL_FROZEN
>> only when the vacuum was run in aggressive mode.
>>
>> I will add this patch to CF2017-09.
>
> I would be a bit inclined to somehow show aggressive if it's
> aggressive and not insert anything at all otherwise.  That'd probably
> require two separate translatable strings in each case, but maybe
> that's OK.
>
> What do other people think?

FWIW I prefer Robert's idea: insert nothing at all for a normal vacuum.
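
A hedged sketch of the "two complete translatable strings" approach inside
lazy_vacuum_rel(); the variable names and message text below are
assumptions, not the actual patch:

    /* Hypothetical: pick a whole format string up front so translators
     * always see a complete sentence rather than an inserted word. */
    const char *msgfmt;

    if (aggressive)
        msgfmt = _("automatic aggressive vacuum of table \"%s.%s.%s\": index scans: %d\n");
    else
        msgfmt = _("automatic vacuum of table \"%s.%s.%s\": index scans: %d\n");

    appendStringInfo(&buf, msgfmt,
                     get_database_name(MyDatabaseId),
                     get_namespace_name(RelationGetNamespace(onerel)),
                     RelationGetRelationName(onerel),
                     vacrelstats->num_index_scans);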

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers