Re: [HACKERS] postgres_fdw cost estimation defaults and documentation

2017-06-05 Thread Ashutosh Bapat
On Tue, Jun 6, 2017 at 12:07 AM, Jeff Janes  wrote:
>
> The default value for fdw_tuple_cost is 0.01, which seems way too low.  If I
> set up a loop-back foreign server with a large fetch_size, then tests like:
>
> select * from pgbench_accounts except select * from
> loopback.pgbench_accounts
>
> vs
>
> select * from pgbench_accounts except select * from pgbench_accounts
>
> indicate that 0.1 is about the lowest value for fdw_tuple_cost that could
> make sense, and a reasonable default would probably be 0.25.  Yes, it is
> only a default, but the default should make sense for at least some
> situation, and I can't imagine any situation in which 0.01 makes sense

The per-tuple cost of transferring 10 rows and 1M rows is going to be
different, and it will also differ when the widths of the rows vary.
It depends upon the packet sizes and how many rows a packet can
contain, so a single value is insufficient to represent that cost.
Until the FDW gets smarter and adds some logic to amortize the cost
across rows, the current value looks better. 0.01 probably keeps our
regression tests stable, and users will mostly have to change those
values for each server based on the type of network and topology
anyway. That said, I am not objecting to changing it as long as the
regression tests stay stable across multiple platforms and machines.
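For illustration, this is roughly how those knobs are adjusted per server
today (the server name comes from Jeff's loopback test; the numbers are
only examples, not recommendations):

-- use SET instead of ADD if the option is already defined on the server
ALTER SERVER loopback OPTIONS (ADD fdw_startup_cost '100');
ALTER SERVER loopback OPTIONS (ADD fdw_tuple_cost '0.25');

-- then check how the remote scan is costed
EXPLAIN SELECT * FROM loopback.pgbench_accounts;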

>
> In the documentation for fdw_startup_cost, it says "This represents the
> additional overhead of establishing a connection, parsing and planning the
> query on the remote side, etc.".  I think that "establishing a connection"
> should be stricken. Either you need a connection or you don't, there is
> nothing the optimizer can do about this.  And if do need one, you only
> establish one once (at most), not once per query sent to the remote side.  I
> think the implementation correctly doesn't try to account for the overhead
> of establishing a connection, so the docs should remove that claim.

Yes. Instead of "establishing a connection", I think we should mention
the protocol overhead of running a query.

>
> In regards to use_remote_estimate, the documentation says "Running ANALYZE
> on the foreign table is the way to update the local statistics; this will
> perform a scan of the remote table and then calculate and store statistics
> just as though the table were local. Keeping local statistics can be a
> useful way to reduce per-query planning overhead for a remote table — but if
> the remote table is frequently updated, the local statistics will soon be
> obsolete."  This makes it send like local stats is basically equivalent to
> use_remote_estimate, other than the staleness issue.  But they are far from
> equivalent.  use_remote_estimate has implicit knowledge of the indexes on
> the foreign server (implicit via the reduced cost estimates derived from the
> foreign side for parameterized queries), whereas local stats of foreign
> tables just assumes there are no indexes for planning purposes. Perhaps
> adding something like "Also, local statistics do not contain information on
> the available indexes on the remote side, while use_remote_estimate does
> take these into account"?

That's not the impression I got when I read the complete paragraph at [1]
--
When use_remote_estimate is true, postgres_fdw obtains row count and
cost estimates from the remote server and then adds fdw_startup_cost
and fdw_tuple_cost to the cost estimates. When use_remote_estimate is
false, postgres_fdw performs local row count and cost estimation and
then adds fdw_startup_cost and fdw_tuple_cost to the cost estimates.
This local estimation is unlikely to be very accurate unless local
copies of the remote table's statistics are available. Running ANALYZE
on the foreign table is the way to update the local statistics; this
will perform a scan of the remote table and then calculate and store
statistics just as though the table were local. Keeping local
statistics can be a useful way to reduce per-query planning overhead
for a remote table — but if the remote table is frequently updated,
the local statistics will soon be obsolete.
--

The paragraph you quoted should be read in the context of "When
use_remote_estimate is false, ". It just says what happens when
use_remote_estimate is turned off. It doesn't imply that local
statistics are a substitute for 'use_remote_estimate = true'.
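To spell out the two modes that paragraph contrasts (server and table
names are taken from Jeff's loopback example; treat this as a sketch):

-- ask the remote server for row counts and costs (sees remote indexes,
-- at the price of extra round trips during planning)
ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'true');

-- or rely on local statistics instead (cheap planning, may go stale)
ALTER FOREIGN TABLE loopback.pgbench_accounts
    OPTIONS (ADD use_remote_estimate 'false');
ANALYZE loopback.pgbench_accounts;  -- scans the remote table to build local stats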

[1] https://www.postgresql.org/docs/10/static/postgres-fdw.html
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-06-05 Thread Ashutosh Bapat
On Tue, Jun 6, 2017 at 10:00 AM, Tom Lane  wrote:
> Ashutosh Bapat  writes:
>> On Tue, Jun 6, 2017 at 9:48 AM, Craig Ringer  wrote:
>>> Storing an epoch implies that rows can't have (xmin,xmax) different by
>>> more than one epoch. So if you're updating/deleting an extremely old
>>> tuple you'll presumably have to set xmin to FrozenTransactionId if it
>>> isn't already, so you can set a new epoch and xmax.
>
>> If the page has multiple such tuples, updating one tuple will mean
>> updating headers of other tuples as well? This means that those tuples
>> need to be locked for concurrent scans?
>
> Locks for tuple header updates are taken at page level anyway, so in
> principle you could run around and freeze other tuples on the page
> anytime you had to change the page's high-order-XID value.  Holding
> the lock for long enough to do that is slightly annoying, but it
> should happen so seldom as to not represent a real performance problem.
>
> In my mind the harder problem is where to find another 32 bits for the
> new page header field.  You could convert the header format on-the-fly
> if there's free space in the page, but what if there isn't?

I guess we will have to reserve 32 bits in the page header. That's much
better than increasing the tuple header by 32 bits.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-06-05 Thread Tom Lane
Ashutosh Bapat  writes:
> On Tue, Jun 6, 2017 at 9:48 AM, Craig Ringer  wrote:
>> Storing an epoch implies that rows can't have (xmin,xmax) different by
>> more than one epoch. So if you're updating/deleting an extremely old
>> tuple you'll presumably have to set xmin to FrozenTransactionId if it
>> isn't already, so you can set a new epoch and xmax.

> If the page has multiple such tuples, updating one tuple will mean
> updating headers of other tuples as well? This means that those tuples
> need to be locked for concurrent scans?

Locks for tuple header updates are taken at page level anyway, so in
principle you could run around and freeze other tuples on the page
anytime you had to change the page's high-order-XID value.  Holding
the lock for long enough to do that is slightly annoying, but it
should happen so seldom as to not represent a real performance problem.

In my mind the harder problem is where to find another 32 bits for the
new page header field.  You could convert the header format on-the-fly
if there's free space in the page, but what if there isn't?

regards, tom lane




[HACKERS] Minor fix for EventCacheLookup()

2017-06-05 Thread Amit Langote
It should return NIL when no entry is found in the cache, not NULL.

Attached patch fixes that.

Thanks,
Amit
diff --git a/src/backend/utils/cache/evtcache.c b/src/backend/utils/cache/evtcache.c
index 54ddc55f76..6faf4ae354 100644
--- a/src/backend/utils/cache/evtcache.c
+++ b/src/backend/utils/cache/evtcache.c
@@ -68,7 +68,7 @@ EventCacheLookup(EventTriggerEvent event)
 	if (EventTriggerCacheState != ETCS_VALID)
 		BuildEventTriggerCache();
 	entry = hash_search(EventTriggerCache, &event, HASH_FIND, NULL);
-	return entry != NULL ? entry->triggerlist : NULL;
+	return entry != NULL ? entry->triggerlist : NIL;
 }
 
 /*



Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-06-05 Thread Ashutosh Bapat
On Tue, Jun 6, 2017 at 9:48 AM, Craig Ringer  wrote:
> On 6 June 2017 at 12:13, Ashutosh Bapat  
> wrote:
>
>> What happens when the epoch is so low that the rest of the XID does
>> not fit in 32bits of tuple header? Or such a case should never arise?
>
> Storing an epoch implies that rows can't have (xmin,xmax) different by
> more than one epoch. So if you're updating/deleting an extremely old
> tuple you'll presumably have to set xmin to FrozenTransactionId if it
> isn't already, so you can set a new epoch and xmax.

If the page has multiple such tuples, updating one tuple will mean
updating the headers of other tuples as well? This means that those
tuples need to be locked against concurrent scans? Maybe not, since
such tuples will be visible to any concurrent scans anyway, and
updating xmin/xmax doesn't change the visibility. But we might have to
prevent multiple updates to the xmin/xmax because of concurrent updates
on the same page.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-06-05 Thread Craig Ringer
On 6 June 2017 at 12:13, Ashutosh Bapat  wrote:

> What happens when the epoch is so low that the rest of the XID does
> not fit in 32bits of tuple header? Or such a case should never arise?

Storing an epoch implies that rows can't have (xmin,xmax) different by
more than one epoch. So if you're updating/deleting an extremely old
tuple you'll presumably have to set xmin to FrozenTransactionId if it
isn't already, so you can set a new epoch and xmax.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_dump issues

2017-06-05 Thread Craig Ringer
On 6 June 2017 at 11:37, Дмитрий Воронин  wrote:
> Hello,
>
> We are working on one project with postgres as engineers.
>
> Below is a list of inconveniences that we are having with postgresql. We
> would like to solve them if possible.
>
> We can't create a schema dump under another (user-defined) name. E.g. we
> dump schema test and want to save its dump under the name test2 in any
> format. The same applies to database dumps.

This is a pretty common request. I expect a patch to add a --transform
or --rename option to pg_dump (or maybe pg_restore) might be accepted.
I suggest posting a detailed design for how you plan to do it and
asking for feedback before proceeding to implement it. You should
search the mailing list for past discussions and ideas too.

Otherwise, consulting outfits can do this sort of thing for you; see
https://www.postgresql.org/support/professional_support/ . (Note, I
work for one of them).
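In the meantime, a rough sketch of the usual manual workaround, with
invented database and file names (restore into a scratch database,
rename the schema there, and dump it again):

--   pg_dump -n test -f test_schema.sql sourcedb
--   createdb scratch
--   psql -d scratch -f test_schema.sql
ALTER SCHEMA test RENAME TO test2;   -- run in the scratch database
--   pg_dump -n test2 -f test2_schema.sql scratch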

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-06-05 Thread Ashutosh Bapat
On Mon, Jun 5, 2017 at 2:38 PM, Heikki Linnakangas  wrote:
> On 06/05/2017 11:49 AM, Tianzhou Chen wrote:
>>
>> Hi Pg Hackers,
>>
>> XID wraparound seems to be quite a big concern and we introduce
>> changes like “adding another frozen bit to each page”
>> [http://rhaas.blogspot.com/2016/03/no-more-full-table-vacuums.html]
>> to tackle this. I am just wondering what’s the challenges preventing
>> us from moving to 64 bit xid?  This is the previous thread I find:
>>
>> https://www.postgresql.org/message-id/CAEYLb_UfC%2BHZ4RAP7XuoFZr%2B2_ktQmS9xqcQgE-rNf5UCqEt5A%40mail.gmail.com
>>
>> The only answer there is:
>>
>> “ The most obvious reason for not using 64-bit xid values is that
>> they require more storage than 32-bit values. There is a patch
>> floating around that makes it safe to not forcibly safety shutdown
>> the server where currently it is necessary, but it doesn't work by
>> making xids 64-bit.
>> "
>>
>> I am personally not quite convinced that is the main reason, since I
>> feel for database hitting this issue, the schema is mostly
>> non-trivial and doesn’t matter so much with 8 more bytes. Could some
>> postgres experts share more insights about the challenges?
>
>
> That quote is accurate. We don't want to just expand XIDs to 64 bits,
> because it would significantly bloat the tuple header. PostgreSQL's
> per-tuple overhead is already quite large, compared to many other systems.
>
> The most promising approach to tackle this is to switch to 64-bit XIDs in
> in-memory structures, and add some kind of an extra epoch field to the page
> header. That would effectively give you 64-bit XIDs, but would only add
> one field to each page, not every tuple.
>

What happens when the epoch is so low that the rest of the XID does
not fit in the 32 bits of the tuple header? Or should such a case never arise?
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




[HACKERS] pg_dump issues

2017-06-05 Thread Дмитрий Воронин
Hello,

We are working on one project with postgres as engineers.

Below is a list of inconveniences that we are having with postgresql. We
would like to solve them if possible.

We can't create a schema dump under another (user-defined) name. E.g. we
dump schema test and want to save its dump under the name test2 in any
format. The same applies to database dumps.

So, there is no mechanism to copy one schema to another or to make
aliases for any database object.

How can we solve these?

Thank you.



Re: [HACKERS] Server ignores contents of SASLInitialResponse

2017-06-05 Thread Michael Paquier
On Thu, Jun 1, 2017 at 4:58 AM, Heikki Linnakangas  wrote:
> I bisected that; the culprit was commit 61bf96cab0, where I refactored the
> libpq authentication code in preparation for SCRAM. The logic around that
> free() was always a bit wonky, but the refactoring made it outright broken.
> Attached is a patch for that, see commit message for details. (Review of
> that would be welcome.)

That looks fine to me.

> So, after fixing that, back to the original question; don't we have a
> similar "duplicate authentication request" problem with GSS? Yes, turns out
> that we do, even on stable branches:
>
> psql "sslmode=prefer dbname=postgres hostaddr=127.0.0.1 krbsrvname=postgres
> host=localhost user=krbtestuser"
> psql: duplicate GSS authentication request
>
> To fix, I suppose we can do what you did for SASL in your patch, and move
> the cleanup of conn->gctx from closePGconn to pgDropConnection. And I
> presume we need to do the same for the SSPI state too, but I don't have a
> Windows set up to test that at the moment.

SSPI does not complain with sslmode=prefer because each time
pg_SSPI_startup() is called, conn->sspictx is forced to NULL. This
looks wrong to me by the way as pg_SSPI_startup() is invoked only once
per authentication, and it leaks memory this way. That's also
inconsistent with SASL and GSS. At the same time this inconsistency is
not causing actual problems except a leak with SSPI in libpq, so not
doing anything except on HEAD looks fine to me.
-- 
Michael


libpq-auth-cleanup.patch
Description: Binary data



Re: [HACKERS] Do we need the gcc feature "__builtin_expect" to promote the branches prediction?

2017-06-05 Thread Craig Ringer
On 6 June 2017 at 10:44, Hao Lee  wrote:
> Another thing that comes to mind: in some scenarios, as we know, the
> native library functions are not the most effective way to do certain
> things, such as allocating a large amount of memory with "alloc()" or
> moving it with "memmove()". As SIMD instructions have become standard in
> CPUs, we could use them to do such things more effectively, for example
> implementing 'alloc', "memmove" and other database operations with SIMD.
> It seems that someone has already done this kind of work.

Compilers are already pretty good at this.

https://en.wikipedia.org/wiki/Program_optimization


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Do we need the gcc feature "__builtin_expect" to promote the branches prediction?

2017-06-05 Thread Hao Lee
Another thing that comes to mind: in some scenarios, as we know, the
native library functions are not the most effective way to do certain
things, such as allocating a large amount of memory with "alloc()" or
moving it with "memmove()". As SIMD instructions have become standard in
CPUs, we could use them to do such things more effectively, for example
implementing 'alloc', "memmove" and other database operations with SIMD.
It seems that someone has already done this kind of work.


https://pdfs.semanticscholar.org/eff8/5ca6394eb2ed1e48fe48aa73d34d5a14760c.pdf



On Fri, Jun 2, 2017 at 11:32 PM, Andres Freund  wrote:

> Hi,
>
> On 2017-06-02 16:40:56 +0800, Hao Lee wrote:
> > Hi all,
> > There are a lot of "if statements" in the system, and GCC provides a
> > feature, "__builtin_expect", which lets the compiler know which branch
> > is most likely to be taken. As we know, a misprediction costs
> > performance (because the CPU has to throw away some instructions and
> > re-fetch new ones), so we can tell GCC how to produce more efficient
> > code, for example as follows. I think it would gain a performance
> > improvement. As far as I know, the Linux kernel already applies this
> > feature.
> >
> >  #define likely(cond) __builtin_expect(cond,true)
> > #define unlikely(cond)  __builtin_expect(cond,false)
> >
> > if (likely(cond)) {
> > //most likely run.
> >
> > } else //otherwise.
> > {
> >
> > }
>
> We already do this in a few cases, that are performance critical enough
> to matter.  But in most cases the CPUs branch predictor does a good
> enough job on its own.
>
> - Andres
>


Re: [HACKERS] Use of non-restart-safe storage by temp_tablespaces

2017-06-05 Thread Bruce Momjian
On Mon, Jun  5, 2017 at 07:38:43PM -0700, Andres Freund wrote:
> On 2017-06-05 22:34:17 -0400, Bruce Momjian wrote:
> > On Mon, Jun  5, 2017 at 04:38:32PM -0500, Jerry Sievers wrote:
> > > The SAN snaps capture the entire pgdata and WAL pg_xlog area but there
> > > is no attempt to copy the NVME device when the snaps are made.
> > > 
> > > There's an event trigger plus batch job now running to avoid this risk.
> > > 
> > > We realize too that there are implications here if a backup is
> > > instantiated and PITR is done.
> > > 
> > > Just FYI that there could be others running like this ignorant of the
> > > potential gotchas.
> > 
> > Yes, if we implement the TODO you will create a TEMPORARY tablespace
> > that can't contain non-temporary and/or non-unlogged tables.
> 
> FWIW, allowing UNLOGGED tables, rather than just TEMPORARY ones,
> increases the complexity of that project noticeably.  For TEMPORARY you
> basically don't need to do much but to recreate the structure inside the
> tablespace at start - fairly simple.  But for UNLOGGED you need to find
> a way to recreate the relevant file and init forks - otherwise we might
> not notice what needs to be reset at a crash restart, and we might error
> out when executing selects etc. and then the table's not there.
> Presumably recreating files & init forks at first table access is
> doable, but it's not entirely trivial to do locking-wise.

Agreed, that is why they are separate adjacent items on the TODO list. 
:-)

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Use of non-restart-safe storage by temp_tablespaces

2017-06-05 Thread Andres Freund
On 2017-06-05 22:34:17 -0400, Bruce Momjian wrote:
> On Mon, Jun  5, 2017 at 04:38:32PM -0500, Jerry Sievers wrote:
> > The SAN snaps capture the entire pgdata and WAL pg_xlog area but there
> > is no attempt to copy the NVME device when the snaps are made.
> > 
> > There's an event trigger plus batch job now running to avoid this risk.
> > 
> > We realize too that there are implications here if a backup is
> > instantiated and PITR is done.
> > 
> > Just FYI that there could be others running like this ignorant of the
> > potential gotchas.
> 
> Yes, if we implement the TODO you will create a TEMPORARY tablespace
> that can't contain non-temporary and/or non-unlogged tables.

FWIW, allowing UNLOGGED tables, rather than just TEMPORARY ones,
increases the complexity of that project noticeably.  For TEMPORARY you
basically don't need to do much but to recreate the structure inside the
tablespace at start - fairly simple.  But for UNLOGGED you need to find
a way to recreate the relevant file and init forks - otherwise we might
not notice what needs to be reset at a crash restart, and we might error
out when executing selects etc. and then the table's not there.
Presumably recreating files & init forks at first table access is
doable, but it's not entirely trivial to do locking-wise.

- Andres




Re: [HACKERS] Use of non-restart-safe storage by temp_tablespaces

2017-06-05 Thread Bruce Momjian
On Mon, Jun  5, 2017 at 04:38:32PM -0500, Jerry Sievers wrote:
> The SAN snaps capture the entire pgdata and WAL pg_xlog area but there
> is no attempt to copy the NVME device when the snaps are made.
> 
> There's an event trigger plus batch job now running to avoid this risk.
> 
> We realize too that there are implications here if a backup is
> instantiated and PITR is done.
> 
> Just FYI that there could be others running like this ignorant of the
> potential gotchas.

Yes, if we implement the TODO you will create a TEMPORARY tablespace
that can't contain non-temporary and/or non-unlogged tables.
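For context, a sketch of what the TODO targets: today one can already
point temp_tablespaces at fast, non-restart-safe storage, but nothing
stops regular tables from landing there too (paths and names below are
made up):

CREATE TABLESPACE nvme_temp LOCATION '/mnt/nvme/pg_temp';
SET temp_tablespaces = 'nvme_temp';
CREATE TEMPORARY TABLE scratch (id int, payload text);  -- placed in nvme_temp
-- the proposed TEMPORARY tablespace would reject this:
CREATE TABLE permanent_data (id int) TABLESPACE nvme_temp;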

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Why does logical replication launcher set application_name?

2017-06-05 Thread Peter Eisentraut
On 6/2/17 23:23, Peter Eisentraut wrote:
> On 6/2/17 15:08, Peter Eisentraut wrote:
>> On 5/30/17 23:10, Peter Eisentraut wrote:
>>> Here is a proposed solution that splits bgw_name into bgw_type and
>>> bgw_name_extra.  bgw_type shows up in pg_stat_activity.backend_type.
>>> Uses of application_name are removed, because they are no longer
>>> necessary to identify the process type.
>>>
>>> This code appears to be buggy because I sometimes get NULL results of
>>> the backend_type lookup, implying that it couldn't find the background
>>> worker slot.  This needs another look.
>>
>> I would like some more input on this proposal, especially from those
>> who have engineered the extended pg_stat_activity content.
>>
>> If we don't come to a quick conclusion on this, I'd be content to leave
>> PG10 as is and put this patch into the next commit fest.
> 
> If there are no new insights into this by Monday, I will commit patches
> that remove the setting of application_name, which was originally
> complained about, and postpone the rest of this patch.

Done, and added the rest of the patch to the next commit fest:
https://commitfest.postgresql.org/14/1165/

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-05 Thread Andres Freund
Hi,

On 2017-06-05 15:30:38 +0900, Michael Paquier wrote:
> + * This will trigger walsenders to send the remaining WAL, prevent them from
> + * accepting further commands. After that they'll wait till the last WAL is
> + * written.
> s/prevent/preventing/?
> I would rephrase the last sentence a bit:
> "After that each WAL sender will wait until the end-of-checkpoint
> record has been flushed on the receiver side."

I didn't like your proposed phrasing much, but I agree that what I had
wasn't good either.  I've tried to improve it.

Thanks for the review.


I pushed this series, this should resolve the issue in this thread
entirely, and should fix a good chunk of the issues in the 'walsender
and parallelism' thread.

Greetings,

Andres Freund




[HACKERS] inconsistent application_name use in logical workers

2017-06-05 Thread Peter Eisentraut
The logical replication code is supposed to use the subscription name as
the fallback_application_name, but in some cases it uses the slot name,
which could be different.  See attached patch to correct this.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From d134d4eae3bd176e10e418746202e69e3f6e7e28 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 5 Jun 2017 22:14:40 -0400
Subject: [PATCH] Consistently use subscription name as application name

---
 src/backend/replication/logical/tablesync.c | 2 +-
 src/backend/replication/logical/worker.c    | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 6e268f3521..e8540f291c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -800,7 +800,7 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
 						MySubscription->oid,
 						MyLogicalRepWorker->relid);
 
-	wrconn = walrcv_connect(MySubscription->conninfo, true, slotname, &err);
+	wrconn = walrcv_connect(MySubscription->conninfo, true, MySubscription->name, &err);
 	if (wrconn == NULL)
 		ereport(ERROR,
 				(errmsg("could not connect to the publisher: %s", err)));
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index a570900a42..b87b6921c0 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1595,7 +1595,7 @@ ApplyWorkerMain(Datum main_arg)
 		origin_startpos = replorigin_session_get_progress(false);
 		CommitTransactionCommand();
 
-		wrconn = walrcv_connect(MySubscription->conninfo, true, myslotname,
+		wrconn = walrcv_connect(MySubscription->conninfo, true, MySubscription->name,
 								&err);
 		if (wrconn == NULL)
 			ereport(ERROR,
-- 
2.13.1




Re: retry shm attach for windows (WAS: Re: [HACKERS] OK, so culicidae is *still* broken)

2017-06-05 Thread Amit Kapila
On Mon, Jun 5, 2017 at 7:26 PM, Tom Lane  wrote:
> Amit Kapila  writes:
>> Sure.  I think it is slightly tricky because specs don't say clearly
>> how ASLR can impact the behavior of any API and in my last attempt I
>> could not reproduce the issue.
>
>> I can try to do basic verification with the patch you have proposed,
>> but I fear, to do the actual tests as suggested by you is difficult as
>> it is not reproducible on my machine, but I can still try.
>
> Yeah, being able to reproduce the problem reliably enough to say whether
> it's fixed or not is definitely the sticking point here.

Agreed.  By the way, while reading up on this problem, I found that
another open source project (nginx) has used a solution similar to the
one Andres was proposing upthread.  Refer:
https://github.com/nginx/nginx/commit/2ec8cfcd7d34415a99c3f3db3024ae954c00d0dd

Just to be clear, I don't mean that we should adopt a solution merely
because another open source project uses it, but I think it is worth
considering -- especially since it appears to be a proven solution,
having been committed in 2015 and sustained in their code since.

>  I have some
> ideas about that:
>
> 1. Be sure to use Win32 not Win64 --- the odds of a failure in the larger
> address space are so small you'd never prove anything.  (And of course
> it has to be a version that has ASLR enabled.)
>

I don't have access to Win32, so I request others who are reading this
and have access to Win32 to see if they can help in reproducing the
problem.

> 2. Revert 7f3e17b48 so that you have an ASLR-enabled build.
>
> 3. Crank shared_buffers to the maximum the machine will allow, reducing
> the amount of free address space and improving the odds of a collision.
>
> 4. Spawn lots of sessions --- pgbench with -C option might be a useful
> testing tool.
>

All are very good points and I think together they will certainly
improve the chances of reproducing this problem.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] ALTER SUBSCRIPTION ..SET PUBLICATION refresh is not throwing error.

2017-06-05 Thread Peter Eisentraut
On 6/2/17 22:13, Peter Eisentraut wrote:
> On 5/27/17 06:54, Petr Jelinek wrote:
>> On 27/05/17 04:00, Euler Taveira wrote:
>>> 2017-05-26 21:29 GMT-03:00 Petr Jelinek:
>>>
>>>
>>> Actually another possibility would be to remove the REFRESH keyword
>>> completely and just have [ WITH (...) ] and have the refresh option
>>> there, ie simplified version of what you have suggested (without the
>>> ugliness of specifying refresh twice to disable).
>>>
>>>
>>> It will cause confusion. It seems that WITH sets ALTER SUBSCRIPTION
>>> properties. Indeed, they are REFRESH properties. I think we shouldn't
>>> exclude REFRESH keyword. Syntax leaves no doubt that WITH are REFRESH
>>> properties.
>>>
>>
>> Maybe, I don't know, it might not be that confusing when SET PUBLICATION
>> and REFRESH PUBLICATION have same set of WITH options.
> 
> I'm not sure what the conclusion from the above discussion was supposed
> to be, but here is a patch.

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] PG10 transition tables, wCTEs and multiple operations on the same table

2017-06-05 Thread Thomas Munro
On Tue, Jun 6, 2017 at 12:58 PM, Craig Ringer  wrote:
> On 4 June 2017 at 06:41, Kevin Grittner  wrote:
>> On Fri, Jun 2, 2017 at 8:46 PM, Thomas Munro
>>  wrote:
>>
>>> So, afterTriggers.query_stack is used to handle the reentrancy that
>>> results from triggers running further statements that might fire
>>> triggers.  It isn't used for dealing with extra ModifyTable nodes that
>>> can appear in a plan because of wCTEs.  Could it also be used for that
>>> purpose?  I think that would only work if it is the case that each
>>> ModifyTable node begin and then runs to completion (ie no interleaving
>>> of wCTE execution) and then its AS trigger fires, which I'm not sure
>>> about.
>>
>> I don't think we want to commit to anything that depends on a CTE
>> creating an optimization fence, although *maybe* that would be OK in
>> the case of DML as a CTE.  That's a pretty special case; I'm not
>> sure whether the standard discusses it.
>
> It's definitely fine to require a fence for wCTEs. They're an
> extension to the standard, and it's pretty much necessary that we not
> pull up / push down across them since we don't want to affect the
> side-effects (row changes). If there are any cases where it's safe,
> they'll take some careful thought.
>
> It's only standard CTEs (SELECT-based) that I think matter for the
> optimisation fence behaviour.

After sleeping on it, I don't think we need to make that decision here
though.  I think it's better to just move the tuplestores into
ModifyTableState so that each embedded DML statement has its own, and
have ModifyTable pass them to the trigger code explicitly.  I think
I'd like to do that via the TransitionCaptureState object that I
proposed elsewhere, but I'll hold off on doing anything until I hear
from interested committers on which way we're going here, time being
short.

Call me an anti-globalisation (of variables) protestor.
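For anyone following along, a minimal sketch of the combination under
discussion (all names made up): a statement-level trigger with a
transition table on a table that a single statement modifies twice
through a wCTE, i.e. two ModifyTable nodes in one plan.

CREATE TABLE t (id int, val text);

CREATE FUNCTION t_audit() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  RAISE NOTICE 'new rows: %', (SELECT count(*) FROM newrows);
  RETURN NULL;
END $$;

CREATE TRIGGER t_ins AFTER INSERT ON t
  REFERENCING NEW TABLE AS newrows
  FOR EACH STATEMENT EXECUTE PROCEDURE t_audit();

-- two INSERTs into t in a single statement
WITH ins AS (INSERT INTO t VALUES (1, 'a') RETURNING id)
INSERT INTO t SELECT id + 1, 'b' FROM ins;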

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Error while creating subscription when server is running in single user mode

2017-06-05 Thread Peter Eisentraut
On 6/2/17 23:06, Peter Eisentraut wrote:
> On 6/2/17 15:41, Tom Lane wrote:
>> It's certainly plausible that we could have the latch code just ignore
>> WL_POSTMASTER_DEATH if not IsUnderPostmaster.  I think that the original
>> reasoning for not doing that was that the calling code should know which
>> environment it's in, and not pass an unimplementable wait-exit reason;
>> so silently ignoring the bit could mask a bug.  Perhaps that argument is
>> no longer attractive.  Alternatively, we could fix the relevant call sites
>> to do "(IsUnderPostmaster ? WL_POSTMASTER_DEATH : 0)", and keep the strict
>> behavior for the majority of call sites.
> 
> There are a lot of those call sites.  (And a lot of duplicate code for
> what to do if postmaster death actually happens.)  I doubt we want to
> check them all.
> 
> The attached patch fixes the reported issue for me.

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-05 Thread Michael Paquier
On Tue, Jun 6, 2017 at 9:47 AM, Andres Freund  wrote:
> On 2017-06-05 15:30:38 +0900, Michael Paquier wrote:
>> I think that it would be interesting to be able to
>> trigger a feedback message using SIGHUP in WAL receivers, refactoring
>> at the same time SIGHUP handling for WAL receivers. It is possible for
>> example to abuse SIGHUP in autovacuum for cost parameters.
>
> Could you clarify a bit here, I can't follow?  Do you think it's
> actually a good idea to combine that with the largely mechanical patch?

Sort of. The thought here is to be able to trigger
XLogWalRcvSendReply() using a SIGHUP, even if force_reply is not
enforced. But looking again at the code, XLogWalRcvSendReply() is
processed only if data is received, so sending the same message to the
server multiple times would be pointless. Still, don't you think that
it would be helpful to be able to wake up the WAL receiver at will on
SIGHUP by setting its latch? XLogWalRcvSendHSFeedback() could be
triggered at will using that.

ProcessWalRcvInterrupts() could include the checks for SIGHUP by the way...
-- 
Michael




Re: [HACKERS] PG10 transition tables, wCTEs and multiple operations on the same table

2017-06-05 Thread Craig Ringer
On 4 June 2017 at 06:41, Kevin Grittner  wrote:
> On Fri, Jun 2, 2017 at 8:46 PM, Thomas Munro
>  wrote:
>
>> So, afterTriggers.query_stack is used to handle the reentrancy that
>> results from triggers running further statements that might fire
>> triggers.  It isn't used for dealing with extra ModifyTable nodes that
>> can appear in a plan because of wCTEs.  Could it also be used for that
>> purpose?  I think that would only work if it is the case that each
>> ModifyTable node begin and then runs to completion (ie no interleaving
>> of wCTE execution) and then its AS trigger fires, which I'm not sure
>> about.
>
> I don't think we want to commit to anything that depends on a CTE
> creating an optimization fence, although *maybe* that would be OK in
> the case of DML as a CTE.  That's a pretty special case; I'm not
> sure whether the standard discusses it.

It's definitely fine to require a fence for wCTEs. They're an
extension to the standard, and it's pretty much necessary that we not
pull up / push down across them since we don't want to affect the
side-effects (row changes). If there are any cases where it's safe,
they'll take some careful thought.

It's only standard CTEs (SELECT-based) that I think matter for the
optimisation fence behaviour.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-05 Thread Andres Freund
On 2017-06-05 15:30:38 +0900, Michael Paquier wrote:
> I have looked at all those patches. The set looks solid to me.

Thanks!


> Here are some comments about 0003.
> +   /*
> +* Have WalSndLoop() terminate the connection in an orderly
> +* manner, after writing out all the pending data.
> +*/
> +   if (got_STOPPING)
> +   got_SIGUSR2 = true;
> I think that for correctness the state of the WAL sender should be
> switched to WALSNDSTATE_STOPPING in XLogSendLogical() as well.

No, that would be wrong.  If we switched here, the checkpointer would
finish waiting, even though XLogSendLogical() might get called again.
That could happen, e.g., if the TCP socket was full and
XLogSendLogical() gets called again.


> A more appropriate name would be ConfigReloadPending perhaps?

Hm, ok.


> 0005 looks like a fine one-liner to me.
> 
> For 0006, you could include as well the removal of worker_spi_sighup()
> in the refactoring.

Ok.  I'll leave that patch for now, since I think it's probably better
to apply it only to master once v10 branched off.


> I think that it would be interesting to be able to
> trigger a feedback message using SIGHUP in WAL receivers, refactoring
> at the same time SIGHUP handling for WAL receivers. It is possible for
> example to abuse SIGHUP in autovacuum for cost parameters.

Could you clarify a bit here, I can't follow?  Do you think it's
actually a good idea to combine that with the largely mechanical patch?

- Andres




Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

2017-06-05 Thread Robert Haas
On Mon, Jun 5, 2017 at 3:34 PM, Tom Lane  wrote:
> I think that a single count in a 30K-row sample is noise by definition.

Not if the table only *has* 30K rows.  Or even 100K.  And anyway we're
talking about what to do with a value we hit at least twice, which is
not quite the same thing.

> As long as they actually are MCVs, sure.  The problem I've got with the
> current behavior is that it manufactures a spiky distribution where there
> is none.  That leads directly to bad estimates, as shown in Marko's
> example.  We'd be much better off, both as to planning time and accuracy,
> if we'd concluded that the table had no MCVs.

That is so, but that example is also constructed to show the case
where the current code falls down.  The case where the distribution
actually is spiky but the frequency is near the minimum that ANALYZE
can detect isn't tested by that example.

>> Another way to state it is: is this problem one-sided?
>
> You know as well as I do that there's no free lunch in this area.
> Anything we change at all will make things worse for somebody, if only
> by accident.

Yep.

> But I do not think that choosing a bunch of values entirely
> at random and claiming (incorrectly) that they are more common than other
> values in the table can possibly lead to better results except by
> accident.

But accidents happen, and sometimes they happen frequently.  I
maintain that the root of this problem is that you want to pretend
like a 30k row sample on (in this instance) a 100m row table ought to
be expected to be enough to reliably produce good results, and I think
that's doomed.  If you tweak the algorithm to remove what is in this
case noise, you'll just break some other case instead.  Maybe it'll
take somebody a few years to find and post an example of such a case,
but surely you can see that there must be cases where most ANALYZE
runs will find 2 of some value precisely because it is a lot more
common than average.

I think what we ought to do is install some kind of auto-scaling
behavior so that when we discover that a table contains a very large
number of rows, we use a higher statistics target.  Admittedly there
are some circularity problems there -- if we haven't sampled the table
yet, we don't know how wide the rows are.  But we could estimate based
on the on-disk size for the first ANALYZE and based on the
previously-observed row width thereafter, and probably improve things
measurably.
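To make that concrete, the manual equivalent of such auto-scaling today
is a per-column bump (table and column names below are invented):

ALTER TABLE big_table ALTER COLUMN category SET STATISTICS 1000;  -- default 100
ANALYZE big_table;

SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'big_table' AND attname = 'category';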

Now if, in conjunction with that, we also get more aggressive about
filtering out low-order MCVs, fine.  I think there are plenty of small
tables where the low-order MCVs make very little difference and
chucking them will do nothing but save planning time.  But I believe
that for very large tables, shortening the MCV list will hurt -- in
some cases, probably a lot -- so I'd like to see some compensating
change to avoid that.

> I'm not by any means wedded to the proposition that we have to fix it
> simply by changing the filter rule.  One idea that seems worth considering
> is to keep a track list that's a bit longer than the maximum allowed
> number of MCVs, and then to say that we accept only MCVs whose counts are
> significantly greater than what we find at the tail of the list.  I'm not
> sure what "significantly" should be exactly, but surely a distribution
> as flat as the ones I was showing upthread should be a red flag.

I'm not sure exactly which bit upthread (or on the other thread)
you're referring to here, but I'm worried that your thinking is being
excessively shaped by the example presently in front of us.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Is ECPG's SET CONNECTION really not thread-aware?

2017-06-05 Thread Tsunakawa, Takayuki
From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tsunakawa,
> Takayuki
> The following page says:
> 
> https://www.postgresql.org/docs/devel/static/ecpg-connect.html#ecpg-se
> t-connection
> 
> --
> EXEC SQL AT connection-name SELECT ...;
> 
> If your application uses multiple threads of execution, they cannot share
> a connection concurrently. You must either explicitly control access to
> the connection (using mutexes) or use a connection for each thread. If each
> thread uses its own connection, you will need to use the AT clause to specify
> which connection the thread will use.
> 
> EXEC SQL SET CONNECTION connection-name;
> 
> ...It is not thread-aware.
> --
> 
> 
> What does this mean by "not thread-aware?"  Does SET CONNECTION in one
> thread change the current connection in another thread?
> It doesn't look so, because the connection management and SQL execution
> in ECPG uses thread-specific data (TSD) to get and set the current connection,
> like this:

The ECPG code sets and gets the current connection to/from the TSD, so SET 
CONNECTION is thread-aware.  I could confirm that like this:

threadA: CONNECT AS conA
threadB: CONNECT AS conB
threadA: CONNECT AS conC
threadA & threadB: SELECT pg_backend_pid() ... each thread displays a different pid
threadA: SET CONNECTION conA
threadA & threadB: SELECT pg_backend_pid() ... each thread still displays a different pid, with threadB outputting the same pid as before

So the doc seems to need a fix.  The patch is attached.

Regards
Takayuki Tsunakawa



ecpg_setconn.patch
Description: ecpg_setconn.patch



Re: [HACKERS] PG10 transition tables, wCTEs and multiple operations on the same table

2017-06-05 Thread Thomas Munro
On Tue, Jun 6, 2017 at 10:11 AM, Tom Lane  wrote:
> Thomas Munro  writes:
>> On Tue, Jun 6, 2017 at 2:15 AM, Tom Lane  wrote:
>>> FWIW, parse analysis is surely NOT the time for such a check.  Triggers
>>> might get added to a table between analysis and execution.  I think you
>>> might have to do it during executor startup.
>
>> Hmm.  My understanding:  In analyzeCTE(), where the check is done in
>> that patch, we hold a lock on the relation because we have a
>> RangeTblEntry.  That means that no triggers can be added concurrently
>> in the case where you go on to execute the plan.  If you save the plan
>> for later execution, triggers created between then and lock
>> reacquisition (because creating triggers touches pg_class) cause
>> reanalysis.
>
> No, they don't necessarily.  One case that will certainly not be
> covered by such a test is if the wCTE is an UPDATE or DELETE on
> an inheritance hierarchy: parse analysis has no idea whether the
> target table has children at all, let alone whether any of the
> children have triggers that use transition tables.  You really want
> the test to happen after the planner has expanded inheritance.

If Kevin accepts my other patch or at least the approach, we won't
allow any child table triggers with transition tables, so we wouldn't
need to consider them here (triggers on the named result table see
tuples collected from children, but ROW triggers attached to children
can't have transition tables and STATEMENT triggers attached to
children don't fire).  Though those decisions probably wouldn't be
future-proof, we'd only want this block in place until we had support
for wCTEs.

> Another problem is that the result of parse analysis is frozen for much
> longer than you're thinking about in the case of stored rules or views.
> We currently disallow views that contain writable CTEs, but I do not
> think there's such a prohibition for rules.

Ugh.  Right.  I can indeed circumvent the above check with a wCTE
hiding in a rule, so parse analysis is not the right time for the
check.  The whole rewrite area is clearly something I need to go and
learn about.  Thanks for the explanation.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

2017-06-05 Thread Gavin Flower

On 06/06/17 10:12, Gavin Flower wrote:

On 06/06/17 09:41, Mark Kirkwood wrote:

On 05/06/17 09:30, Tom Lane wrote:


I've been thinking about the behavior discussed in
https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
and it seems to me that there are a couple of things we ought to do
about it.

First, I think we need a larger hard floor on the number of occurrences
of a value that're required to make ANALYZE decide it is a "most common
value".  The existing coding is willing to believe that anything that
appears at least twice in the sample is a potential MCV, but that design
originated when we were envisioning stats samples of just a few thousand
rows --- specifically, default_statistics_target was originally just 10,
leading to a 3000-row sample size.  So accepting two-appearance values
as MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now
it could be a tenth or a hundredth of that.

As a round number, I'm thinking that a good floor would be a frequency
estimate of 1/1000.  With today's typical sample size of 30000 rows,
a value would have to appear at least 30 times in the sample to be
believed to be an MCV.  That seems like it gives us a reasonable margin
of error against the kind of sampling noise seen in the above-cited
thread.

Second, the code also has a rule that potential MCVs need to have an
estimated frequency at least 25% larger than what it thinks the "average"
value's frequency is.  A rule of that general form seems like a good idea,
but I now think the 25% threshold is far too small to do anything useful.
In particular, in any case like this where there are more distinct values
than there are sample rows, the "average frequency" estimate will
correspond to less than one occurrence in the sample, so that this rule is
totally useless to filter anything that we would otherwise consider as an
MCV.  I wonder if we shouldn't make it be "at least double the estimated
average frequency".



Or possibly calculate the sample standard deviation and make use of 
that to help decide on a more flexible cutoff than twice the avg 
frequency?


Are there any research papers that might help us here (I'm drowning 
in a sea of barely relevant search results for most phrases I've 
tried so far)? I recall there were some that Tom referenced when this 
stuff was originally written.


On the other hand I do have access to some mathematicians 
specializing in statistics - so can get their thoughts on this issue 
if you feel it would be worthwhile.


Cheers

Mark


The standard deviation (sd) is proportional to the square root of the 
number in the sample in a Normal Distribution.


In a Normal Distribution, about 2/3 the values will be within plus or 
minus one sd of the mean.


There seems to be an implicit assumption that the distribution of 
values follows the Normal Distribution - has this been verified? I 
suspect that real data will have a skewed distribution of values, and 
may even be multimodal (multiple peaks).  The Normal Distribution has
one central peak with 2 tails of the same shape & size.


So in a sample of 100 the sd is proportional to 10%,
for 10,000 the sd is proportional to 1%.

So essentially, the larger the sample size the more reliable is 
knowledge of the most common values (ignoring pathologically extreme 
distributions!) - the measure of reliability depends on the distribution.


How about selecting the cut off as the mean plus one sd, or something 
of that nature?  Note that the cut off point may result in no mcv's 
being selected - especially for small samples.


If practicable, it would be good to sample real datasets. Suggest
looking at datasets where the current mechanism looks reasonable, and
ones where the estimates are too far off.  Also, if possible, try any
new selection method on the datasets and see what the difference is.


The above is based on what I remember from my university statistics 
papers, I took it up to 4th year level many moons ago.



Cheers,
Gavin




Suddenly realized that the distribution of the FREQUENCIES of values in
a Normal Distribution is probably NOT normally distributed!


For some shapes, and fancy names for them (like 'leptokurtic'), see: 
http://onlinestatbook.com/2/introduction/distributions.html


For multi modal examples: 
http://www.statisticshowto.com/multimodal-distribution


I tried looking for useful stuff to clarify things without success - so 
I think that asking a practising statistician is probably a very good 
idea!  :-)



Cheers,
Gavin







Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

2017-06-05 Thread Gavin Flower

On 06/06/17 09:41, Mark Kirkwood wrote:

On 05/06/17 09:30, Tom Lane wrote:


I've been thinking about the behavior discussed in
https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
and it seems to me that there are a couple of things we ought to do
about it.

First, I think we need a larger hard floor on the number of occurrences
of a value that're required to make ANALYZE decide it is a "most common
value".  The existing coding is willing to believe that anything that
appears at least twice in the sample is a potential MCV, but that design
originated when we were envisioning stats samples of just a few thousand
rows --- specifically, default_statistics_target was originally just 10,
leading to a 3000-row sample size.  So accepting two-appearance values
as MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now
it could be a tenth or a hundredth of that.

As a round number, I'm thinking that a good floor would be a frequency
estimate of 1/1000.  With today's typical sample size of 30000 rows,
a value would have to appear at least 30 times in the sample to be
believed to be an MCV.  That seems like it gives us a reasonable margin
of error against the kind of sampling noise seen in the above-cited
thread.

Second, the code also has a rule that potential MCVs need to have an
estimated frequency at least 25% larger than what it thinks the "average"
value's frequency is.  A rule of that general form seems like a good idea,
but I now think the 25% threshold is far too small to do anything useful.
In particular, in any case like this where there are more distinct values
than there are sample rows, the "average frequency" estimate will
correspond to less than one occurrence in the sample, so that this rule is
totally useless to filter anything that we would otherwise consider as an
MCV.  I wonder if we shouldn't make it be "at least double the estimated
average frequency".



Or possibly calculate the sample standard deviation and make use of 
that to help decide on a more flexible cutoff than twice the avg 
frequency?


Are there any research papers that might help us here (I'm drowning in 
a sea of barely relevant search results for most phrases I've tried so 
far)? I recall there were some that Tom referenced when this stuff was 
originally written.


On the other hand I do have access to some mathematicians specializing 
in statistics - so can get their thoughts on this issue if you feel it 
would be worthwhile.


Cheers

Mark


The standard deviation (sd) is proportional to the square root of the 
number in the sample in a Normal Distribution.


In a Normal Distribution, about 2/3 the values will be within plus or 
minus one sd of the mean.


There seems to be an implicit assumption that the distribution of values 
follows the Normal Distribution - has this been verified? I suspect that 
real data will have a skewed distribution of values, and may even be 
multimodal (multiple peaks).  The Normal Distribution has one central
peak with 2 tails of the same shape & size.


So for a sample of 100 the relative error is on the order of 10%,
and for 10,000 it is on the order of 1%.

So essentially, the larger the sample size, the more reliable our
knowledge of the most common values becomes (ignoring pathologically
extreme distributions!) - though the measure of reliability depends on
the distribution.


How about selecting the cutoff as the mean plus one sd, or something of
that nature?  Note that the cutoff point may result in no MCVs being
selected - especially for small samples.
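
A very rough sketch of the sort of thing I mean, assuming the per-value
counts from the sample are already in hand (invented names, certainly not
the real analyze.c code):

#include <math.h>

typedef struct { int count; } TrackItem;   /* stand-in for the per-value sample counts */

/* Return how many tracked values clear a "mean plus one sd" cutoff. */
static int
count_mcvs_by_stddev(const TrackItem *track, int ntracked)
{
    double  sum = 0.0, sumsq = 0.0, mean, var, sd, cutoff;
    int     i, nmcv = 0;

    if (ntracked <= 0)
        return 0;

    for (i = 0; i < ntracked; i++)
    {
        sum += track[i].count;
        sumsq += (double) track[i].count * track[i].count;
    }
    mean = sum / ntracked;
    var = sumsq / ntracked - mean * mean;       /* population variance of the counts */
    sd = sqrt(var > 0 ? var : 0);
    cutoff = mean + sd;

    for (i = 0; i < ntracked; i++)
        if (track[i].count > cutoff)
            nmcv++;                             /* value qualifies as an MCV */
    return nmcv;
}

With a flat distribution every count sits close to the mean, so nothing
clears the cutoff and no MCVs get kept - which seems like the right
outcome for that case.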


If practicable, it would be good to sample real datasets.  I suggest
looking at datasets where the current mechanism looks reasonable, and
ones where the estimates are too far off.  Also, if possible, try any new
selection method on those datasets and see what the difference is.


The above is based on what I remember from my university statistics
papers; I took it up to 4th-year level many moons ago.



Cheers,
Gavin




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG10 transition tables, wCTEs and multiple operations on the same table

2017-06-05 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Jun 6, 2017 at 2:15 AM, Tom Lane  wrote:
>> FWIW, parse analysis is surely NOT the time for such a check.  Triggers
>> might get added to a table between analysis and execution.  I think you
>> might have to do it during executor startup.

> Hmm.  My understanding:  In analyzeCTE(), where the check is done in
> that patch, we hold a lock on the relation because we have a
> RangeTblEntry.  That means that no triggers can be added concurrently
> in the case where you go on to execute the plan.  If you save the plan
> for later execution, triggers created between then and lock
> reacquisition (because creating triggers touches pg_class) cause
> reanalysis.

No, they don't necessarily.  One case that will certainly not be
covered by such a test is if the wCTE is an UPDATE or DELETE on
an inheritance hierarchy: parse analysis has no idea whether the
target table has children at all, let alone whether any of the
children have triggers that use transition tables.  You really want
the test to happen after the planner has expanded inheritance.

Another problem is that the result of parse analysis is frozen for much
longer than you're thinking about in the case of stored rules or views.
We currently disallow views that contain writable CTEs, but I do not
think there's such a prohibition for rules.

More generally, the problem I've got with this proposal is that it causes
the presence of triggers to be significant much earlier in query execution
than they used to be.  I'm concerned that this may break optimizations
that are already in place somewhere, or that we might want to add later.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

2017-06-05 Thread Mark Kirkwood

On 05/06/17 09:30, Tom Lane wrote:


I've been thinking about the behavior discussed in
https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
and it seems to me that there are a couple of things we ought to do about
it.

First, I think we need a larger hard floor on the number of occurrences
of a value that're required to make ANALYZE decide it is a "most common
value".  The existing coding is willing to believe that anything that
appears at least twice in the sample is a potential MCV, but that design
originated when we were envisioning stats samples of just a few thousand
rows --- specifically, default_statistics_target was originally just 10,
leading to a 3000-row sample size.  So accepting two-appearance values as
MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now it
could be a tenth or a hundredth of that.

As a round number, I'm thinking that a good floor would be a frequency
estimate of 1/1000.  With today's typical sample size of 30000 rows,
a value would have to appear at least 30 times in the sample to be
believed to be an MCV.  That seems like it gives us a reasonable margin
of error against the kind of sampling noise seen in the above-cited
thread.

Second, the code also has a rule that potential MCVs need to have an
estimated frequency at least 25% larger than what it thinks the "average"
value's frequency is.  A rule of that general form seems like a good idea,
but I now think the 25% threshold is far too small to do anything useful.
In particular, in any case like this where there are more distinct values
than there are sample rows, the "average frequency" estimate will
correspond to less than one occurrence in the sample, so that this rule is
totally useless to filter anything that we would otherwise consider as an
MCV.  I wonder if we shouldn't make it be "at least double the estimated
average frequency".



Or possibly calculate the sample standard deviation and make use of that 
to help decide on a more flexible cutoff than twice the avg frequency?


Are there any research papers that might help us here (I'm drowning in a 
sea of barely relevant search results for most phrases I've tried so 
far)? I recall there were some that Tom referenced when this stuff was 
originally written.


On the other hand I do have access to some mathematicians specializing 
in statistics - so can get their thoughts on this issue if you feel it 
would be worthwhile.


Cheers

Mark


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use of non-restart-safe storage by temp_tablespaces

2017-06-05 Thread Jerry Sievers
Bruce Momjian  writes:

> On Wed, May 31, 2017 at 08:28:51AM -0700, Mark Dilger wrote:
>
>> > Uh, I thought only the sessions that created the temporary objects could
>> > see them, and since they are not in WAL and autovacuum can't see them,
>> > their non-existence in a temporary tablespace would not be a problem.
>> 
>> You are correct.  I was thinking about an extension to allow unlogged
>> tablespaces on temporary filesystems, but got the words "unlogged" and
>> "temporary" mixed up in my thinking and in what I wrote.  I should have
>> written that unlogged tablespaces would only host unlogged tables and
>> unlogged indexes, such that users are not surprised to find their data
>> missing.

Very late to the discussion here...

At my site, we have an NVME disk as temp_tablespace.

It's used not only for disk-spilling temp operations; there are temp
tables and UNLOGGED persistent tables in there as well.

And you guessed it.  In spite of our warnings, users have created LOGGED
tables in there.

On a few occasions this has made our SAN snapshots, when started on a
different system, fail, apparently due to something in the crash recovery
data trying to update pages of one of the real tables :-)

The SAN snaps capture the entire pgdata and WAL pg_xlog area but there
is no attempt to copy the NVME device when the snaps are made.

There's an event trigger plus a batch job now running to avoid this risk.

We realize too that there are implications here if a backup is
instantiated and PITR is done.

Just FYI that there could be others running like this, ignorant of the
potential gotchas.

>> 
>> On reflection, I think both features are worthwhile, and not at all exclusive
>> of each other, though unlogged tablespaces is probably considerably more
>> work to implement.
>
> TODO item added:
>
>   Allow tablespaces on RAM-based partitions for temporary objects 
>
> and I wrote a blog entry about this:
>
>   https://momjian.us/main/blogs/pgblog/2017.html#June_2_2017
>
> -- 
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
> + As you are, so once was I.  As I am, so you will be. +
> +  Ancient Roman grave inscription +

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consult...@comcast.net
p: 312.241.7800


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG10 transition tables, wCTEs and multiple operations on the same table

2017-06-05 Thread Thomas Munro
On Tue, Jun 6, 2017 at 2:15 AM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Sat, Jun 3, 2017 at 10:39 PM, Thomas Munro
>>  wrote:
>>> In the meantime, it seems like you agree that rejecting wCTEs that
>>> affect tables with triggers with transition tables is the best
>>> response to this bug report?  Do you think that parse analysis is the
>>> right time to do the check?  Here's a first attempt at that.
>
> FWIW, parse analysis is surely NOT the time for such a check.  Triggers
> might get added to a table between analysis and execution.  I think you
> might have to do it during executor startup.

Hmm.  My understanding:  In analyzeCTE(), where the check is done in
that patch, we hold a lock on the relation because we have a
RangeTblEntry.  That means that no triggers can be added concurrently
in the case where you go on to execute the plan.  If you save the plan
for later execution, triggers created between then and lock
reacquisition (because creating triggers touches pg_class) cause
reanalysis.  This is not fundamentally different from altering the
table.  For example, with that patch:

postgres=# prepare ps as with t as (insert into table1 values (1))
insert into table2 values ('hello');
PREPARE
postgres=# create trigger trig1 after insert on table1 referencing new
table as oo for each row execute procedure trigfunc();
CREATE TRIGGER
postgres=# execute ps;
ERROR:  WITH clause cannot modify a table that has triggers with
transition tables

What am I missing?

>> I'm starting to like the approach of reverting the entire transition
>> tables patch.  Failing to consider the possibility of a plan with
>> multiple ModifyTable nodes seems like a pretty fundamental design
>> mistake, and I'm not eager either to ship this with that broken or try
>> to fix it at this stage of the release cycle.

Not handling wCTEs or inheritance definitely counts as a howler, but
we seem tantalisingly close and it would be a shame to push the whole
feature back IMHO.  I really hope we can move on to talking about
incremental matviews in the next cycle, which is why I've jumped on
every problem report so far.

I'm still waiting to hear from Kevin whether he thinks what I proposed
for the inheritance bug has legs.  If so, my vote from the peanut
gallery would be to keep transition tables in the tree but block wCTEs
with transition tables, and then add support in the next release
(which I'd be happy to work on).  Support will certainly be needed
ASAP, because it'd suck if incremental matview maintenance didn't work
just because you use wCTEs, and the 'action at a distance' factor
isn't a great user experience (adding a new trigger can in general
break existing queries, but breaking them just because they use wCTEs
is arguably surprising).  On the other hand, if he thinks that the
inheritance bug is not adequately fixed by my proposal (with minor
tweaks) and needs a completely different approach, then I'm out (but
will be happy to help work on this stuff for PG11).

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts

2017-06-05 Thread Jim Van Fleet
NP, Sokolov --

pgsql-hackers-ow...@postgresql.org wrote on 06/05/2017 03:26:46 PM:

> From: Sokolov Yura 
> To: Jim Van Fleet 
> Cc: pgsql-hackers@postgresql.org
> Date: 06/05/2017 03:28 PM
> Subject: Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into 
> multiple parts
> Sent by: pgsql-hackers-ow...@postgresql.org
> 
> Excuse me, Jim.
> 
> I was tired and misunderstood the proposal: I thought of ProcArray
> sharding, but the proposal is about ProcArrayLock sharding.
> 
> BTW, I just posted an improvement to LWLock:
> 
> https://www.postgresql.org/message-id/2968c0be065baab8865c4c95de3f435c%40postgrespro.ru
> 
> Would you mind testing against that, and together with that?

I will give them a try ..

Jim



Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts

2017-06-05 Thread Sokolov Yura
Excuse me, Jim.

I was tired and misunderstood the proposal: I thought of ProcArray
sharding, but the proposal is about ProcArrayLock sharding.

BTW, I just posted an improvement to LWLock:

https://www.postgresql.org/message-id/2968c0be065baab8865c4c95de3f435c%40postgrespro.ru

Would you mind testing against that, and together with that?

On 5 June 2017 at 11:11 PM, Sokolov Yura wrote:
> Hi, Jim.
>
> How do you ensure transaction order?
>
> Example:
> - you lock shard A and gather info.  You find transaction T1 in-progress.
> - Then you unlock shard A.
> - T1 completes.  T2, which depends on T1, also completes.  But T2 was on shard B.
> - you lock shard B, and gather info from it.
> - You didn't see T2 as in progress, so you will then look it up in clog and
>   find it committed.
>
> Now you see T2 as committed, but T1 as in-progress - a clear violation of
> transaction order.
>
> Probably you've already solved this issue.  If so, it would be great to
> learn the solution.
>
> On 5 June 2017 at 10:30 PM, Jim Van Fleet wrote:
>> Hi,
>>
>> I have been experimenting with splitting the ProcArrayLock into parts.
>> That is, to acquire the ProcArrayLock in shared mode, it is only necessary
>> to acquire one of the parts in shared mode; to acquire the lock in
>> exclusive mode, all of the parts must be acquired in exclusive mode.  For
>> those interested, I have attached a design description of the change.
>>
>> This approach has been quite successful on large systems with the hammerdb
>> benchmark.  With a prototype based on 10 master source and running on
>> power8 (model 8335-GCA with 2 sockets, 20 cores), hammerdb improved by 16%;
>> on Intel (Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 2 sockets, 44 cores)
>> with a 9.6 base and the prototype, hammerdb improved by 4%.  (Attached is
>> a set of spreadsheets for power8.)
>>
>> The downside is that on smaller configurations (single socket), where
>> there is less "lock thrashing" in the storage subsystem and there are
>> multiple LWLocks to take for an exclusive acquire, there is a decided
>> downturn in performance.  On hammerdb, the prototype was 6% worse than the
>> base on a single socket power configuration.
>>
>> If there is interest in this approach, I will submit a patch.
>>
>> Jim Van Fleet

Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts

2017-06-05 Thread Sokolov Yura
Hi, Jim.

How do you ensure transaction order?

Example:
- you lock shard A and gather info.  You find transaction T1 in-progress.
- Then you unlock shard A.
- T1 completes.  T2, which depends on T1, also completes.  But T2 was on shard B.
- you lock shard B, and gather info from it.
- You didn't see T2 as in progress, so you will then look it up in clog and
  find it committed.

Now you see T2 as committed, but T1 as in-progress - a clear violation of
transaction order.

Probably you've already solved this issue.  If so, it would be great to
learn the solution.

On 5 June 2017 at 10:30 PM, Jim Van Fleet wrote:
> Hi,
>
> I have been experimenting with splitting the ProcArrayLock into parts.
> That is, to acquire the ProcArrayLock in shared mode, it is only necessary
> to acquire one of the parts in shared mode; to acquire the lock in
> exclusive mode, all of the parts must be acquired in exclusive mode.  For
> those interested, I have attached a design description of the change.
>
> This approach has been quite successful on large systems with the hammerdb
> benchmark.  With a prototype based on 10 master source and running on
> power8 (model 8335-GCA with 2 sockets, 20 cores), hammerdb improved by 16%;
> on Intel (Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 2 sockets, 44 cores)
> with a 9.6 base and the prototype, hammerdb improved by 4%.  (Attached is
> a set of spreadsheets for power8.)
>
> The downside is that on smaller configurations (single socket), where
> there is less "lock thrashing" in the storage subsystem and there are
> multiple LWLocks to take for an exclusive acquire, there is a decided
> downturn in performance.  On hammerdb, the prototype was 6% worse than the
> base on a single socket power configuration.
>
> If there is interest in this approach, I will submit a patch.
>
> Jim Van Fleet

Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

2017-06-05 Thread Tom Lane
Robert Haas  writes:
> On Sun, Jun 4, 2017 at 5:30 PM, Tom Lane  wrote:
>> First, I think we need a larger hard floor on the number of occurrences
>> of a value that're required to make ANALYZE decide it is a "most common
>> value".

> This kind of math isn't my strong point, but it seems to me that,
> while sampling noise is a problem, it's not obvious how to tell
> whether a given result is signal or noise.

I think that a single count in a 30K-row sample is noise by definition.
We really ought to be setting the threshold for "what is an MCV" high
enough that it's not drastically affected by variations that are clearly
at the sampling-error level.

> What makes me a bit cautious about this approach overall is that I've
> seen cases where the length of the MCV list turns out to be key to
> getting a good plan, and I needed to make it longer in order for
> things to work.

As long as they actually are MCVs, sure.  The problem I've got with the
current behavior is that it manufactures a spiky distribution where there
is none.  That leads directly to bad estimates, as shown in Marko's
example.  We'd be much better off, both as to planning time and accuracy,
if we'd concluded that the table had no MCVs.

> Another way to state it is: is this problem one-sided?

You know as well as I do that there's no free lunch in this area.
Anything we change at all will make things worse for somebody, if only
by accident.  But I do not think that choosing a bunch of values entirely
at random and claiming (incorrectly) that they are more common than other
values in the table can possibly lead to better results except by
accident.

> In general, I've pretty skeptical of the idea that sampling 30,000
> rows out of an arbitrarily large table will produce a
> sufficiently-accurate MCV list.

Perhaps not, but allowing two occurrences to define an MCV surely isn't
helping with that.  More to the point maybe, it's a behavior that doesn't
go away simply by making the sample larger.  Making the sample larger
just allows us to falsely invent more pseudo-MCVs.

I'm not by any means wedded to the proposition that we have to fix it
simply by changing the filter rule.  One idea that seems worth considering
is to keep a track list that's a bit longer than the maximum allowed
number of MCVs, and then to say that we accept only MCVs whose counts are
significantly greater than what we find at the tail of the list.  I'm not
sure what "significantly" should be exactly, but surely a distribution
as flat as the ones I was showing upthread should be a red flag.
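
For illustration, something along these lines (where the 1.25 factor is
just a stand-in for whatever "significantly" should mean, and the struct
is a stand-in for the tracking data analyze.c already keeps):

typedef struct { int count; } TrackItem;   /* stand-in for analyze.c's tracking entry */

/*
 * track[] holds the most common sample values in descending count order,
 * deliberately a few entries longer than num_mcv so the tail provides a
 * baseline.  Accept an entry as an MCV only if it is well above that tail.
 */
static int
choose_num_mcvs(const TrackItem *track, int ntracked, int num_mcv)
{
    double  tail = track[ntracked - 1].count;
    int     i;

    for (i = 0; i < num_mcv && i < ntracked; i++)
    {
        if (track[i].count < tail * 1.25)
            break;
    }
    return i;           /* number of entries kept as MCVs */
}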

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why does logical replication launcher set application_name?

2017-06-05 Thread Petr Jelinek
On 03/06/17 05:18, Peter Eisentraut wrote:
> On 6/2/17 16:44, Petr Jelinek wrote:
>> However, I am not sure about the bgw_name_extra. I think I would have
>> preferred keeping full bgw_name field which would be used where full
>> name is needed and bgw_type where only the worker type is used. The
>> concatenation just doesn't sit well with me, especially if it requires
>> the bgw_name_extra to start with space.
> 
> I see your point.  There are also some i18n considerations to think through.
> 

So thinking a bit more, I wonder if we could simply do the following:
- remove the application_name from logical workers
- add bgw_type and use it for the worker type (if empty, use 'bgworker' like
now); it would probably be nice if parallel workers added something there
to indicate that they are parallel workers as well
- remove the 'bgworker:' prefix for ps display and just use the bgw_name
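
To illustrate the second point, the registration struct would grow
something along these lines (a rough sketch only, not a worked-out
definition):

typedef struct BackgroundWorker
{
    char        bgw_name[BGW_MAXLEN];   /* full name of this particular worker */
    char        bgw_type[BGW_MAXLEN];   /* generic type, e.g. "logical replication
                                         * worker"; empty means plain "bgworker" */
    /* ... the existing fields (bgw_flags, bgw_start_time, bgw_restart_time,
     * bgw_library_name, bgw_function_name, bgw_main_arg, bgw_extra,
     * bgw_notify_pid) would stay as they are ... */
} BackgroundWorker;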

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts

2017-06-05 Thread Jim Van Fleet
Hi,

I have been experimenting with splitting the ProcArrayLock into parts.
That is, to acquire the ProcArrayLock in shared mode, it is only necessary
to acquire one of the parts in shared mode; to acquire the lock in
exclusive mode, all of the parts must be acquired in exclusive mode.  For
those interested, I have attached a design description of the change.
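
As a rough sketch of the idea (the number of parts and the part-selection
rule below are placeholders, not what the attached design document
specifies):

#include "storage/lwlock.h"
#include "storage/proc.h"

#define PROCARRAY_LOCK_PARTS    4       /* placeholder; a tuning question */

static LWLock *ProcArrayLockParts[PROCARRAY_LOCK_PARTS];

/* Pick the part this backend uses for shared acquires. */
static inline int
MyProcArrayLockPart(void)
{
    return MyProc->pgprocno % PROCARRAY_LOCK_PARTS;
}

static void
ProcArrayLockAcquire(LWLockMode mode)
{
    if (mode == LW_SHARED)
        LWLockAcquire(ProcArrayLockParts[MyProcArrayLockPart()], LW_SHARED);
    else
    {
        int     i;

        /* Exclusive acquire must take every part, always in the same order. */
        for (i = 0; i < PROCARRAY_LOCK_PARTS; i++)
            LWLockAcquire(ProcArrayLockParts[i], LW_EXCLUSIVE);
    }
}

An exclusive acquire therefore pays for one LWLockAcquire per part, which
is where the single-socket regression described below comes from.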

This approach has been quite successful on large systems with the hammerdb
benchmark.  With a prototype based on 10 master source and running on
power8 (model 8335-GCA with 2 sockets, 20 cores), hammerdb improved by 16%;
on Intel (Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 2 sockets, 44 cores)
with a 9.6 base and the prototype, hammerdb improved by 4%.  (Attached is a
set of spreadsheets for power8.)

The downside is that on smaller configurations (single socket), where
there is less "lock thrashing" in the storage subsystem and there are
multiple LWLocks to take for an exclusive acquire, there is a decided
downturn in performance.  On hammerdb, the prototype was 6% worse than the
base on a single-socket power configuration.

If there is interest in this approach, I will submit a patch.

Jim Van Fleet





compare_10base_toPrototype.ods
Description: Binary data


patchDescription.odt
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shm_toc_lookup API

2017-06-05 Thread Andres Freund
On 2017-06-05 15:06:06 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-06-05 14:57:10 -0400, Tom Lane wrote:
> >> Meh.  Without volatile, I think that the compiler would be within its
> >> rights to elide the nentry local variable and re-fetch toc->toc_nentry
> >> each time through the loop.
> 
> > I don't think that's true.
> 
> Perhaps not, but I'm not quite convinced.  Anyway it doesn't matter,
> at least not here.

If not, we have a bunch of bugs in other files.  So if there are
concerns, we had better resolve them.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PROVE_FLAGS

2017-06-05 Thread Andrew Dunstan


On 05/23/2017 06:59 AM, Andrew Dunstan wrote:
> On 17 May 2017 at 14:30, Robert Haas  wrote:
>> On Tue, May 16, 2017 at 9:20 PM, Andrew Dunstan
>>  wrote:
>>> Inheriting variables from the environment is a part of make by design.
>>> We have PG_PROVE_FLAGS for our own forced settings.
>> I don't buy this argument.  We've had previous cases where we've gone
>> through and added -X to psql invocations in the regression tests
>> because, if you don't, some people's .psqlrc files may cause tests to
>> fail, which nobody wants.  The more general principle is that the
>> regression tests should ideally pass regardless of the local machine
>> configuration.  It's proven impractical to make that universally true,
>> but that doesn't make it a bad goal.  Now, for all I know it may be
>> that setting PROVE_FLAGS can never cause a regression failure, but I
>> still tend to agree with Peter that the regression tests framework
>> should insulate us from the surrounding environment rather than
>> absorbing settings from it.
>>
> In that case we're going to need to invent a way to do this similarly
> in vcregress.pl. I'm not simply going to revert to the situation where
> it and the makefiles are completely out of sync on this. The previous
> patch was made more or less by ignoring the existence of vcregress.pl.
>
> I will look at this again probably after pgcon. I don't think it's
> terribly urgent.
>


Here's a patch that should do that. If this is acceptable I'll do this
and put the makefiles back the way they were.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/tools/msvc/vcregress.pl b/src/tools/msvc/vcregress.pl
index 468a62d..e2d225e 100644
--- a/src/tools/msvc/vcregress.pl
+++ b/src/tools/msvc/vcregress.pl
@@ -178,12 +178,18 @@ sub tap_check
 	die "Tap tests not enabled in configuration"
 	  unless $config->{tap_tests};
 
+	my @flags;
+	foreach my $arg (0 .. $#_)
+	{
+		next unless $_[$arg] =~ /^PROVE_FLAGS=(.*)/;
+		@flags = split(/\s+/, $1);
+		splice(@_,$arg,1);
+		last;
+	}
+
 	my $dir = shift;
 	chdir $dir;
 
-	my @flags;
-	@flags = split(/\s+/, $ENV{PROVE_FLAGS}) if exists $ENV{PROVE_FLAGS};
-
 	my @args = ("prove", @flags, "t/*.pl");
 
 	# adjust the environment for just this test

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shm_toc_lookup API

2017-06-05 Thread Tom Lane
Andres Freund  writes:
> On 2017-06-05 14:57:10 -0400, Tom Lane wrote:
>> Meh.  Without volatile, I think that the compiler would be within its
>> rights to elide the nentry local variable and re-fetch toc->toc_nentry
>> each time through the loop.

> I don't think that's true.

Perhaps not, but I'm not quite convinced.  Anyway it doesn't matter,
at least not here.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shm_toc_lookup API

2017-06-05 Thread Andres Freund
On 2017-06-05 14:57:10 -0400, Tom Lane wrote:
> > If it doesn't prevent both the hardware and the compiler from
> > reordering, it's broken.  See the comments for pg_read_barrier() in
> > atomics.h.
> 
> Meh.  Without volatile, I think that the compiler would be within its
> rights to elide the nentry local variable and re-fetch toc->toc_nentry
> each time through the loop.

I don't think that's true. Excerpt from the docs of the macros:
About pg_read_barrier()
 * A read barrier must act as a compiler barrier, and in addition must
About pg_compiler_barrier():
 * A compiler barrier need not (and preferably should not) emit any actual
 * machine code, but must act as an optimization fence: the compiler must not
 * reorder loads or stores to main memory around the barrier.  However, the
 * CPU may still reorder loads or stores at runtime, if the architecture's
 * memory model permits this.
 */

Given that I don't see how it'd be permissible to elide the local
variable.  Are you saying that's permitted, or that our implementations
don't guarantee that?

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shm_toc_lookup API

2017-06-05 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 5, 2017 at 12:19 PM, Tom Lane  wrote:
>> In practice it probably can't fail even if 64-bit reads aren't atomic,
>> simply because we'll never have enough entries in a shm_toc to make the
>> high-order half ever change.  But that just begs the question why the
>> field is declared Size rather than int.  I think we should make it the
>> latter.

> Yeah.  I think a shm_toc with more than 2^10 entries would probably
> perform badly enough that somebody would rewrite this entire module,
> so we don't really need to worry about having more than 2^31.
> Changing to int (or uint32) seems fine.

Done with uint32.

>> I am also thinking that most of the shm_toc functions need to have the
>> toc pointers declared as "volatile *", but particularly shm_toc_lookup.

(actually, they do already use volatile pointers, except for shm_toc_lookup)

>> That read_barrier call might prevent the hardware from reordering
>> accesses, but I don't think it stops the compiler from doing so.

> If it doesn't prevent both the hardware and the compiler from
> reordering, it's broken.  See the comments for pg_read_barrier() in
> atomics.h.

Meh.  Without volatile, I think that the compiler would be within its
rights to elide the nentry local variable and re-fetch toc->toc_nentry
each time through the loop.  It'd be unlikely to do so, granted, but
I'm not convinced that pg_read_barrier() would prevent that.

However, as long as the write barrier in shm_toc_insert does what it's
supposed to, I think we'd be safe even if that happened.  So probably
it's a moot point.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] postgres_fdw cost estimation defaults and documentation

2017-06-05 Thread Jeff Janes
The default value for fdw_tuple_cost is 0.01, which seems way too low.  If
I set up a loop-back foreign server with a large fetch_size, then tests
like:

select * from pgbench_accounts except select * from
loopback.pgbench_accounts

vs

select * from pgbench_accounts except select * from pgbench_accounts

indicate that 0.1 is about the lowest value for fdw_tuple_cost that could
make sense, and a reasonable default would probably be 0.25.  Yes, it is
only a default, but the default should make sense for at least some
situation, and I can't imagine any situation in which 0.01 makes sense.

In the documentation for fdw_startup_cost, it says "This represents the
additional overhead of establishing a connection, parsing and planning the
query on the remote side, etc.".  I think that "establishing a connection"
should be stricken. Either you need a connection or you don't, there is
nothing the optimizer can do about this.  And if you do need one, you only
establish it once (at most), not once per query sent to the remote side.
I think the implementation correctly doesn't try to account for the
overhead of establishing a connection, so the docs should remove that claim.

In regards to use_remote_estimate, the documentation says "Running ANALYZE
on the foreign table is the way to update the local statistics; this will
perform a scan of the remote table and then calculate and store statistics
just as though the table were local. Keeping local statistics can be a
useful way to reduce per-query planning overhead for a remote table — but
if the remote table is frequently updated, the local statistics will soon
be obsolete."  This makes it send like local stats is basically equivalent
to use_remote_estimate, other than the staleness issue.  But they are far
from equivalent.  use_remote_estimate has implicit knowledge of the indexes
on the foreign server (implicit via the reduced cost estimates derived from
the foreign side for parameterized queries), whereas local stats of foreign
tables just assumes there are no indexes for planning purposes. Perhaps
adding something like "Also, local statistics do not contain information on
the available indexes on the remote side, while use_remote_estimate does
take these into account"?

Cheers,

Jeff


[HACKERS] GSoC 2017 weekly progress reports ("Explicitly support predicate locks in index access methods besides b-tree")

2017-06-05 Thread Shubham Barai
GSoC (week 1)


Hi,

Here is the list of things I have done during this week.

1. read documentation on how to set up development environment

2. installed PostgreSQL on Ubuntu from source code

3. read documentation on gist index (http://www.sai.msu.su/~megera/postgres/gist/)

4. went through source code of gist index to understand its implementation

5. found appropriate places in gist index AM to insert calls to existing
functions (PredicateLockPage(), CheckForSerializableConflictIn(), etc.)
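
For example, the kind of call I am planning to insert looks roughly like
this at a point where a scan examines a GiST page (an illustrative sketch
only; deciding the exact places is the actual task):

/*
 * While reading a GiST page during a serializable index scan, take a
 * predicate lock on the page so that later conflicting writes to it are
 * detected.  "buffer" and "scan" are assumed from the surrounding code.
 */
PredicateLockPage(scan->indexRelation,
                  BufferGetBlockNumber(buffer),
                  scan->xs_snapshot);

/* And on the insert side, before modifying a page: */
CheckForSerializableConflictIn(rel, NULL, buffer);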


Re: [HACKERS] shm_toc_lookup API

2017-06-05 Thread Robert Haas
On Mon, Jun 5, 2017 at 12:19 PM, Tom Lane  wrote:
> While I'm looking at this ... seems like there's a pretty basic coding-
> rule violation here, namely that shm_toc_lookup() thinks it can read
> toc->toc_nentry without any sort of locking.  Since that field is declared
> Size, this amounts to an assumption that 64-bit reads are atomic, which
> is not per project practice.
>
> In practice it probably can't fail even if 64-bit reads aren't atomic,
> simply because we'll never have enough entries in a shm_toc to make the
> high-order half ever change.  But that just begs the question why the
> field is declared Size rather than int.  I think we should make it the
> latter.

Yeah.  I think a shm_toc with more than 2^10 entries would probably
perform badly enough that somebody would rewrite this entire module,
so we don't really need to worry about having more than 2^31.
Changing to int (or uint32) seems fine.

> I am also thinking that most of the shm_toc functions need to have the
> toc pointers declared as "volatile *", but particularly shm_toc_lookup.
> That read_barrier call might prevent the hardware from reordering
> accesses, but I don't think it stops the compiler from doing so.

If it doesn't prevent both the hardware and the compiler from
reordering, it's broken.  See the comments for pg_read_barrier() in
atomics.h.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

2017-06-05 Thread Robert Haas
On Sun, Jun 4, 2017 at 5:30 PM, Tom Lane  wrote:
> I've been thinking about the behavior discussed in
> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
> and it seems to me that there are a couple of things we ought to do about
> it.
>
> First, I think we need a larger hard floor on the number of occurrences
> of a value that're required to make ANALYZE decide it is a "most common
> value".  The existing coding is willing to believe that anything that
> appears at least twice in the sample is a potential MCV, but that design
> originated when we were envisioning stats samples of just a few thousand
> rows --- specifically, default_statistics_target was originally just 10,
> leading to a 3000-row sample size.  So accepting two-appearance values as
> MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now it
> could be a tenth or a hundredth of that.
>
> As a round number, I'm thinking that a good floor would be a frequency
> estimate of 1/1000.  With today's typical sample size of 30000 rows,
> a value would have to appear at least 30 times in the sample to be
> believed to be an MCV.  That seems like it gives us a reasonable margin
> of error against the kind of sampling noise seen in the above-cited
> thread.

This kind of math isn't my strong point, but it seems to me that,
while sampling noise is a problem, it's not obvious how to tell
whether a given result is signal or noise.  I mean, if we sample
30,000 rows from a table with a billion rows and we see the same
value twice, then it could be the case that the row occurs just twice
in the whole table and we unluckily saw both occurrences.
Contrariwise, it could also be that the typical value in that table
occurs 1000 times and the value in question occurs 50,000 times, so
our sample was right on the money.  There's no way to know whether, if
we were to take a bunch more samples, we'd see that value coming up
twice in most of those samples or whether the fact that it showed up
twice this time is a statistical fluke.

What makes me a bit cautious about this approach overall is that I've
seen cases where the length of the MCV list turns out to be key to
getting a good plan, and I needed to make it longer in order for
things to work.  Any non-MCV is, IIRC, assumed to be less frequent
than the least-common MCV.  If you raise the minimum frequency that is
required for something to be considered an MCV, then the cap on the
frequency of a non-MCV also goes up, and I think it's not impossible
to imagine that being a problem for somebody.  Now maybe you're going
to say those people can fix it by twiddling n_distinct using DDL.  I'm
not sure whether that works in all cases, but even if it does, some of
those people might not need to twiddle n_distinct today, so from their
point of view it would be a regression.

Another way to state it is: is this problem one-sided?  If we *only*
have a problem with things that aren't really MCVs being considered as
MCVs, then tightening up the rules for including something in the MCV
list will fix it.  Also, the statistics will be more stable and
planning will be faster, which all sounds like a win.  However, I'm
guessing (on the basis of some personal experience) that we *also*
have a problem with things that really are MCVs *not* being considered
MCVs.  If it's true that both of those things are problems, then the
only real fix is to increase the sample size -- which I notice you
recommended on the original thread.

In general, I'm pretty skeptical of the idea that sampling 30,000
rows out of an arbitrarily large table will produce a
sufficiently-accurate MCV list.  There's a comment in std_typanalyze()
citing a 1998 paper by Chaudhuri, Motwani, and Narasayya for the
proposition that we'll get a reasonably accurate histogram with that
sample size, but histogram bins are not MCVs.  Moreover, in practice,
we need *increased* accuracy, not just *equal* accuracy, as the table
size increases.  On a billion row table, a minimum MCV frequency of
1/1000 means that we can't distinguish between a value that occurs 1
time and a value that occurs 999,999 times -- they are both non-MCVs.
As you shrink the table that sort of thing becomes a smaller and
smaller problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shm_toc_lookup API

2017-06-05 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jun 5, 2017 at 11:19 AM, Tom Lane  wrote:
>> I think that shm_toc_lookup() ought to be made to throw elog(ERROR) on an
>> unexpected failure.  To satisfy the one caller that doesn't want an error,
>> we could either add a "bool noError" parameter to it, or split it into two
>> functions shm_toc_lookup() and shm_toc_lookup_noerror().  The latter would
>> require touching less code, but the former is probably more like what we'd
>> have had if this were designed more carefully to begin with.

> Either is OK with me.

Done with the extra parameter.  If we discover a reason to back-patch
this, we'd likely need to do it the other way in back branches, but
for now I'll refrain.

While I'm looking at this ... seems like there's a pretty basic coding-
rule violation here, namely that shm_toc_lookup() thinks it can read
toc->toc_nentry without any sort of locking.  Since that field is declared
Size, this amounts to an assumption that 64-bit reads are atomic, which
is not per project practice.

In practice it probably can't fail even if 64-bit reads aren't atomic,
simply because we'll never have enough entries in a shm_toc to make the
high-order half ever change.  But that just begs the question why the
field is declared Size rather than int.  I think we should make it the
latter.

I am also thinking that most of the shm_toc functions need to have the
toc pointers declared as "volatile *", but particularly shm_toc_lookup.
That read_barrier call might prevent the hardware from reordering
accesses, but I don't think it stops the compiler from doing so.

These problems are probably just latent, because it looks to me like
in our current usage no process would ever be doing lookups on shm_tocs
that are being changed concurrently.  But they'll bite us someday.
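
For reference, the pattern in shm_toc_lookup is roughly this (simplified
sketch; field names approximate, not verbatim from shm_toc.c):

Size        nentry;
Size        i;

nentry = toc->toc_nentry;       /* unlocked read of a Size field -- the 64-bit-read question */
pg_read_barrier();              /* meant to pair with the write barrier in shm_toc_insert */

for (i = 0; i < nentry; i++)
{
    if (toc->toc_entry[i].key == key)
        return ((char *) toc) + toc->toc_entry[i].offset;
}
return NULL;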

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] make check false success

2017-06-05 Thread Robert Haas
On Mon, Jun 5, 2017 at 5:23 AM, Sandro Santilli  wrote:
> Why not ? The caller is attempting to make an unsupported target,
> how's that different from calling `make unexistent` ?

That's a good point, but what Tom wrote is along the lines of my
concerns also, especially his last paragraph about REGRESS not being
defined at all.  I think we have a convention that 'make check'
succeeds if it runs all of the tests, even if the set of all tests
happens to be the empty set.

What was your motivation for wanting this changed in the first place?
It seems like either behavior could be more convenient for someone,
depending on the context.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] shm_toc_lookup API

2017-06-05 Thread Robert Haas
On Mon, Jun 5, 2017 at 11:19 AM, Tom Lane  wrote:
> shm_toc_lookup() returns NULL if it fails to find the desired key.
> Out of the 30 or so call sites, there is exactly one that has any
> use for that.  Some of the rest have Asserts that they get back
> a non-null result, but the majority just blithely dereference the
> pointer.  I do not find this cool at all; given that we're accessing
> a shared data structure, we should assign more than zero probability
> to the idea that we might not find what we expect to find when we
> expect to find it.
>
> I think that shm_toc_lookup() ought to be made to throw elog(ERROR) on an
> unexpected failure.  To satisfy the one caller that doesn't want an error,
> we could either add a "bool noError" parameter to it, or split it into two
> functions shm_toc_lookup() and shm_toc_lookup_noerror().  The latter would
> require touching less code, but the former is probably more like what we'd
> have had if this were designed more carefully to begin with.

Either is OK with me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] shm_toc_lookup API

2017-06-05 Thread Tom Lane
shm_toc_lookup() returns NULL if it fails to find the desired key.
Out of the 30 or so call sites, there is exactly one that has any
use for that.  Some of the rest have Asserts that they get back
a non-null result, but the majority just blithely dereference the
pointer.  I do not find this cool at all; given that we're accessing
a shared data structure, we should assign more than zero probability
to the idea that we might not find what we expect to find when we
expect to find it.

I think that shm_toc_lookup() ought to be made to throw elog(ERROR) on an
unexpected failure.  To satisfy the one caller that doesn't want an error,
we could either add a "bool noError" parameter to it, or split it into two
functions shm_toc_lookup() and shm_toc_lookup_noerror().  The latter would
require touching less code, but the former is probably more like what we'd
have had if this were designed more carefully to begin with.
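
To make the first option concrete, the shape would be roughly this (a
sketch only; error wording invented):

extern void *shm_toc_lookup(shm_toc *toc, uint64 key, bool noError);

/* ...and at the end of the existing search, instead of just returning NULL: */
if (!noError)
    elog(ERROR, "could not find key " UINT64_FORMAT " in shm TOC", key);
return NULL;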

Any preferences?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Default Partition for Range

2017-06-05 Thread Dilip Kumar
On Mon, Jun 5, 2017 at 1:43 AM, Beena Emerson  wrote:
> The new patch is rebased over default_partition_v18.patch
> [http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg315831.html]

I have done the initial review of the patch, I have few comments.


+ if ((lower->content[0] == RANGE_DATUM_DEFAULT))
+ {
+ if (partition_bound_has_default(partdesc->boundinfo))
+ {
+ overlap = true;
+ with = partdesc->boundinfo->default_index;
+ }

I think we can change the if ((lower->content[0] == RANGE_DATUM_DEFAULT))
check to if (spec->is_default); that way it will be consistent with the
check in the PARTITION_STRATEGY_LIST case.  Or we can move this complete
check outside of the switch.

-

- PartitionRangeDatum *datum = castNode(PartitionRangeDatum, lfirst(lc));
+ Node   *node = lfirst(lc);
+ PartitionRangeDatum *datum;
+
+ datum = castNode(PartitionRangeDatum, node);

Why do you need to change this?

--

+ if (!datums)
+ bound->content[i] = RANGE_DATUM_DEFAULT;
+

It would be better to check if (datums != NULL) for non-boolean types.

-
+ if (content1[i] == RANGE_DATUM_DEFAULT ||
+ content2[i] == RANGE_DATUM_DEFAULT)
+ {
+ if (content1[i] == content2[i])
+ return 0;
+ else if (content1[i] == RANGE_DATUM_DEFAULT)
+ return -1;

I don't see any comment explaining why the default partition should be
considered smallest in the index comparison.  For negative infinity, it's
pretty obvious from the enum name itself.

-

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-05 Thread Tom Lane
Andrew Dunstan  writes:
> Buildfarm member lorikeet is failing occasionally with a failed
> assertion during the select_parallel regression tests like this:

> 2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG:  statement: select 
> count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
> TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: 
> "/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c",
>  Line: 221)

> I'll see if I can find out why, but if anyone has any ideas why this might be 
> happening (started about 3 weeks ago) that would be helpful.

Well, this seems like it has to indicate an incorrect call of
shm_mq_set_sender.  I have no great insights as to what might be causing
that, but I sure find it to be pretty unacceptable coding practice that
the call sites are not checking for failure returns from shm_toc_lookup.
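
For context, shm_mq_set_sender is essentially this (paraphrased from
shm_mq.c, details elided), so tripping that Assert means something
attached as sender to a queue that already had one:

void
shm_mq_set_sender(shm_mq *mq, PGPROC *proc)
{
    volatile shm_mq *vmq = mq;
    PGPROC     *receiver;

    SpinLockAcquire(&mq->mq_mutex);
    Assert(vmq->mq_sender == NULL);     /* this is the assertion that fires */
    vmq->mq_sender = proc;
    receiver = vmq->mq_receiver;
    SpinLockRelease(&mq->mq_mutex);

    if (receiver != NULL)
        SetLatch(&receiver->procLatch); /* wake the receiver, if one is attached */
}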

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] intermittent failures in Cygwin from select_parallel tests

2017-06-05 Thread Andrew Dunstan

Buildfarm member lorikeet is failing occasionally with a failed
assertion during the select_parallel regression tests like this:


2017-06-03 05:12:37.382 EDT [59327d84.1160:38] LOG:  statement: select 
count(*) from tenk1, tenk2 where tenk1.hundred > 1 and tenk2.thousand=0;
TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: 
"/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c",
 Line: 221)

I'll see if I can find out why, but if anyone has any ideas why this might be 
happening (started about 3 weeks ago) that would be helpful.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #14682: row level security not work with partitioned table

2017-06-05 Thread Robert Haas
On Mon, Jun 5, 2017 at 10:20 AM, Joe Conway  wrote:
> Unless Robert objects, I'll work with Mike to get a fix posted and
> committed in the next day or two.

That would be great.  Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-05 Thread Robert Haas
On Mon, Jun 5, 2017 at 7:58 AM, Rafia Sabih
 wrote:
> I had a look at the patch from stylistic/formatting point of view,
> please find the attached patch for the suggested modifications.

Many of these seem worse, like these ones:

- * Quit if we've reached records for another database. Unless the
+ * Quit if we've reached records of another database. Unless the

- * When we reach a new relation, close the old one.  Note, however,
- * that the previous try_relation_open may have failed, in which case
- * rel will be NULL.
+ * On reaching a new relation, close the old one.  Note, that the
+ * previous try_relation_open may have failed, in which case rel will
+ * be NULL.

- * Try to open each new relation, but only once, when we first
- * encounter it.  If it's been dropped, skip the associated blocks.
+ * Each relation is open only once at it's first encounter. If it's
+ * been dropped, skip the associated blocks.

Others are better, like these:

-(errmsg("could not continue autoprewarm worker is
already running under PID %d",
+(errmsg("autoprewarm worker is already running under PID %d",

- * Start of prewarm per-database worker. This will try to load blocks of one
+ * Start prewarm per-database worker, which will load blocks of one

Others don't really seem better or worse, like:

- * Register a per-database worker to load new database's block. And
- * wait until they finish their job to launch next one.
+ * Register a per-database worker to load new database's block. Wait
+ * until they finish their job to launch next one.

IMHO, there's still a good bit of work needed here to make this sound
like American English.  For example:

- *It is a bgworker which automatically records information about blocks
- *which were present in buffer pool before server shutdown and then
- *prewarm the buffer pool upon server restart with those blocks.
+ *It is a bgworker process that automatically records information about
+ *blocks which were present in buffer pool before server
shutdown and then
+ *prewarms the buffer pool upon server restart with those blocks.

This construction "It is a..." without a clear referent seems to be
standard in Indian English, but it looks wrong to English speakers
from other parts of the world, or at least to me.

+ * Since there could be at max one worker who could do a prewarm, hence,
+ * acquiring locks is not required before setting skip_prewarm_on_restart.

To me, adding a comma before hence looks like a significant
improvement, but the word hence itself seems out-of-place.  Also, I'd
change "at max" to "at most" and maybe reword the sentence a little.
There's a lot of little things like this which I have tended to be quite
strict about changing before commit; I occasionally wonder whether
it's really worth the effort.  It's not really wrong, it just sounds
weird to me as an American.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [BUGS] BUG #14682: row level security not work with partitioned table

2017-06-05 Thread Joe Conway
On 06/04/2017 03:33 PM, Noah Misch wrote:
> On Fri, Jun 02, 2017 at 09:28:16AM +0900, Michael Paquier wrote:
>> On Thu, Jun 1, 2017 at 11:13 AM, Mike Palmiotto
>>  wrote:
>> > This is indeed a bug. fireRIRrules is currently skipping the RLS
>> > policy check when relkind == PARTITIONED_TABLES, so RLS policies are
>> > not applied. The attached patch fixes the behavior.
>> 
>> I would expect RLS to trigger as well in this context. Note that there
>> should be regression tests to avoid this failure again in the future.
>> I have added an open item.
> 
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Robert,
> since you committed the patch believed to have created it, you own this open
> item.

Unless Robert objects, I'll work with Mike to get a fix posted and
committed in the next day or two.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] PG10 transition tables, wCTEs and multiple operations on the same table

2017-06-05 Thread Robert Haas
On Mon, Jun 5, 2017 at 10:15 AM, Tom Lane  wrote:
>> I'm starting to like the approach of reverting the entire transition
>> tables patch.  Failing to consider the possibility of a plan with
>> multiple ModifyTable nodes seems like a pretty fundamental design
>> mistake, and I'm not eager either to ship this with that broken or try
>> to fix it at this stage of the release cycle.
>
> Postponing the feature to v11 might be a viable solution.  We don't
> have any other major work that depends on it do we?

I don't think anything that depends on it was committed to v10.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG10 transition tables, wCTEs and multiple operations on the same table

2017-06-05 Thread Tom Lane
Robert Haas  writes:
> On Sat, Jun 3, 2017 at 10:39 PM, Thomas Munro
>  wrote:
>> In the meantime, it seems like you agree that rejecting wCTEs that
>> affect tables with triggers with transition tables is the best
>> response to this bug report?  Do you think that parse analysis is the
>> right time to do the check?  Here's a first attempt at that.

FWIW, parse analysis is surely NOT the time for such a check.  Triggers
might get added to a table between analysis and execution.  I think you
might have to do it during executor startup.

> I'm starting to like the approach of reverting the entire transition
> tables patch.  Failing to consider the possibility of a plan with
> multiple ModifyTable nodes seems like a pretty fundamental design
> mistake, and I'm not eager either to ship this with that broken or try
> to fix it at this stage of the release cycle.

Postponing the feature to v11 might be a viable solution.  We don't
have any other major work that depends on it do we?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PG10 transition tables, wCTEs and multiple operations on the same table

2017-06-05 Thread Robert Haas
On Sat, Jun 3, 2017 at 10:39 PM, Thomas Munro
 wrote:
> In the meantime, it seems like you agree that rejecting wCTEs that
> affect tables with triggers with transition tables is the best
> response to this bug report?  Do you think that parse analysis is the
> right time to do the check?  Here's a first attempt at that.

I'm starting to like the approach of reverting the entire transition
tables patch.  Failing to consider the possibility of a plan with
multiple ModifyTable nodes seems like a pretty fundamental design
mistake, and I'm not eager either to ship this with that broken or try
to fix it at this stage of the release cycle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: retry shm attach for windows (WAS: Re: [HACKERS] OK, so culicidae is *still* broken)

2017-06-05 Thread Tom Lane
Amit Kapila  writes:
> Sure.  I think it is slightly tricky because specs don't say clearly
> how ASLR can impact the behavior of any API and in my last attempt I
> could not reproduce the issue.

> I can try to do basic verification with the patch you have proposed,
> but I fear, to do the actual tests as suggested by you is difficult as
> it is not reproducible on my machine, but I can still try.

Yeah, being able to reproduce the problem reliably enough to say whether
it's fixed or not is definitely the sticking point here.  I have some
ideas about that:

1. Be sure to use Win32 not Win64 --- the odds of a failure in the larger
address space are so small you'd never prove anything.  (And of course
it has to be a version that has ASLR enabled.)

2. Revert 7f3e17b48 so that you have an ASLR-enabled build.

3. Crank shared_buffers to the maximum the machine will allow, reducing
the amount of free address space and improving the odds of a collision.

4. Spawn lots of sessions --- pgbench with -C option might be a useful
testing tool.

With luck, that will get failures often enough that you can be pretty
sure whether a patch improves the situation or not.

regards, tom lane




Re: [HACKERS] make check false success

2017-06-05 Thread Tom Lane
Sandro Santilli  writes:
> On Fri, Jun 02, 2017 at 08:20:25AM -0400, Robert Haas wrote:
>> On Thu, Jun 1, 2017 at 10:18 AM, Sandro Santilli  wrote:
>>> I noticed that the `check` Makefile rule imported by PGXS is giving
>>> a success exit code even when it is unsupported.

>> Hmm.  I'm not 100% sure that the existing behavior is wrong.

> Why not ? The caller is attempting to make an unsupported target,
> how's that different from calling `make unexistent` ?

The key question here is whether you break recursive behavior.
We wouldn't be happy if, for instance, contrib/Makefile had to know
exactly which subdirectories it was safe to recurse into when doing
"make check".  Now, that doesn't apply directly because we're not
using the PGXS logic when we do that --- but it seems entirely
possible to me that a PGXS-using module might be a subdirectory
of some larger project, so that it would not want to fail just
because it had no tests it could run.

A related point is that if you try "make check" in a subdirectory
that doesn't have REGRESS defined at all, you don't get an error
(and this patch wouldn't change that).  It would be pretty inconsistent
to throw an error if REGRESS is defined but not if it isn't, IMO.

regards, tom lane




[HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-06-05 Thread Sokolov Yura

Good day, everyone.

This patch improves performance of contended LWLock.
It was tested on 4 socket 72 core x86 server (144 HT) Centos 7.1
gcc 4.8.5

Results:

pgbench -i -s 300 + pgbench --skip-some-updates
Clients |  master | patched
--------+---------+--------
     30 |     32k |     32k
     50 |     53k |     53k
    100 |    102k |    107k
    150 |    120k |    131k
    200 |    129k |    148k
    252 |    121k |    159k
    304 |    101k |    163k
    356 |     79k |    166k
    395 |     70k |    166k
    434 |     62k |    166k

pgbench -i -s 900 + pgbench
Clients |  master | patched
--------+---------+--------
     30 |     31k |     31k
     50 |     50k |     49k
    100 |     75k |     78k
    150 |     92k |     96k
    200 |     87k |    106k
    252 |     69k |    108k
    304 |     55k |    110k
    356 |     48k |    110k
    395 |     44k |    108k
    434 |     40k |    109k

I cannot claim that read-only benchmarks are affected in any way: results
are unstable between PostgreSQL restarts and vary within the same interval
for both versions, although some micro-optimizations were made to ensure
this parity.
Since there are ongoing investigations into changing the shared buffers
structure (by removing the LWLock from the buffer hash partition lock [1]),
I stopped trying to gather more reliable statistics for read-only benchmarks.

[1] Takashi Horikawa "Don't stop the world"
https://www.pgcon.org/2017/schedule/events/1078.en.html

Also, it looks like the patch doesn't significantly affect single-NUMA-node
performance.

postgresql.conf is tuned with following parameters:
shared_buffers = 32GB
max_connections = 500
fsync = on
synchronous_commit = on
full_page_writes = off
wal_buffers = 16MB
wal_writer_flush_after = 16MB
commit_delay = 2
max_wal_size = 16GB

The patch changes the way an LWLock is acquired.

Before the patch, the lock is acquired using the following procedure:
1. First LWLock->state is checked for the ability to take the lock, and a
  CAS is performed (the lock either could be acquired or not). This is
  retried if the state has changed.
2. If the lock is not acquired at step 1, the Proc is put into the wait
  list, using the LW_FLAG_LOCKED bit of LWLock->state as a spin-lock for
  the wait list.
  In fact, the profile shows that LWLockWaitListLock spends a lot of CPU
  on a contended LWLock (up to 20%).
  An additional CAS loop is inside pg_atomic_fetch_or_u32 for setting
  LW_FLAG_HAS_WAITERS. And releasing the wait list lock is another CAS
  loop on the same LWLock->state.
3. After that the state is checked again, because the lock could have been
  released before the Proc was queued into the wait list.
4. If the lock was acquired at step 3, the Proc has to be dequeued from
  the wait list (another spin lock and CAS loop). This operation can be
  quite heavy, because the whole list has to be traversed.

The patch makes lock acquisition a single CAS loop (see the C sketch below):
1. LWLock->state is read, and the ability to acquire the lock is checked.
  If the lock can be taken, a CAS is attempted.
  If the CAS succeeds, the lock is acquired (same as the original version).
2. If the lock is currently held by another backend, we check whether the
  wait list lock can be taken. If the wait list lock is not held by anyone,
  a CAS is performed that takes the wait list lock and sets
  LW_FLAG_HAS_WAITERS at once.
  If the CAS succeeds, then the LWLock was still held at the moment the
  wait list lock was taken - this proves the correctness of the new
  algorithm. The Proc is then queued to the wait list.
3. Otherwise a spin delay is performed, and the loop returns to step 1.

So the invariant is preserved:
- either we grab the lock,
- or we set HAS_WAITERS while the lock was held by someone, and queue
  ourselves into the wait list.
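
To make the single-loop structure concrete, here is a rough C sketch of the
idea (an illustration of the description above, not the actual patch:
pg_atomic_read_u32(), pg_atomic_compare_exchange_u32(), LWLock->state and
the LW_FLAG_* bits are the existing lwlock.c facilities, while the
lower-case helper functions are hypothetical placeholders):

static bool
LWLockAttemptLockOrQueueSelf(LWLock *lock, LWLockMode mode)
{
    uint32      old_state = pg_atomic_read_u32(&lock->state);

    for (;;)
    {
        uint32      desired_state = old_state;

        if (lock_is_free(old_state, mode))
        {
            /* try to grab the lock itself in one CAS */
            desired_state = add_lock_bits(desired_state, mode);
            if (pg_atomic_compare_exchange_u32(&lock->state,
                                               &old_state, desired_state))
                return true;    /* lock acquired */
        }
        else if ((old_state & LW_FLAG_LOCKED) == 0)
        {
            /* lock is busy: take the wait-list lock and announce a waiter */
            desired_state |= LW_FLAG_LOCKED | LW_FLAG_HAS_WAITERS;
            if (pg_atomic_compare_exchange_u32(&lock->state,
                                               &old_state, desired_state))
            {
                /* the lock was still held when we got the wait-list lock */
                queue_self_on_wait_list(lock, mode);
                release_wait_list_lock(lock);
                return false;   /* caller must sleep on its semaphore */
            }
        }
        else
        {
            /* someone else holds the wait-list lock: spin and retry */
            perform_spin_delay_once();
            old_state = pg_atomic_read_u32(&lock->state);
        }
        /* on a failed CAS, old_state was refreshed by the CAS itself */
    }
}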

The algorithm for LWLockWaitForVar is also refactored. The new version is:
1. If the lock is not held by anyone, exit immediately.
2. Otherwise check whether the wait list lock can be taken, because the
  variable is protected by it. If so, a CAS is performed, and if it
  succeeds, the loop breaks to step 4.
3. Otherwise a spin delay is performed, and the loop returns to step 1.
4. The variable is checked for a change.
  If it has changed, we release the wait list lock and exit.
  Note: it cannot change in the following steps because we are holding
  the wait list lock.
5. Otherwise a CAS setting the necessary flags is performed. If it
  succeeds, the Proc is queued to the wait list and we exit.
6. If the CAS failed, it is possible that the LWLock has already been
  released - if so, we should release the wait list lock and exit.
7. Otherwise the loop returns to step 5.

So the invariant is preserved:
- either we found the LWLock free,
- or we found the variable changed,
- or we set the flags and queued ourselves while the LWLock was held.

No spin delay is performed at step 7, because we should release the wait
list lock as soon as possible.

Additionally, taking the wait list lock is skipped in both algorithms if
SpinDelayStatus.spins is less than a fraction of the spins_per_delay loop
iterations (i.e. the initial iterations, and some iterations after
pg_usleep, because SpinDelayStatus.spins is reset after sleeping). This
effectively turns the LWLock into a spinlock in the low-contention case.
It was made 

Re: [HACKERS] Fix a typo in README.dependencies

2017-06-05 Thread Ashutosh Bapat
On Mon, Jun 5, 2017 at 8:22 AM, atorikoshi
 wrote:
> Hi,
>
> I found below formula to compute selectivities, but
> I think the last Probability 'P(b=?)' should be 'P(c=?)'.
>
>>  P(a=?,b=?,c=?) = P(a=?,b=?) * (d + (1-d)*P(b=?))
>
>
> Attached patch fixes it, and it also adds some spaces
> following another formula which is on line 86 and
> computes P(a=?, b=?).

Agreed. Also, using "d" for the degree of functional dependence (b=>c) as
well is confusing, since we are already using "d" for the degree of
functional dependence (a=>b). Here's a patch that uses "d'" instead of "d".
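
As a made-up illustration of the corrected formulas (numbers chosen
arbitrarily): with P(a=?) = 0.1, P(b=?) = 0.05, P(c=?) = 0.02, d = 0.8 for
(a=>b) and d' = 0.5 for (b=>c):

  P(a=?,b=?)     = P(a=?) * (d + (1-d)*P(b=?))
                 = 0.1 * (0.8 + 0.2*0.05)         = 0.081
  P(a=?,b=?,c=?) = P(a=?,b=?) * (d' + (1-d')*P(c=?))
                 = 0.081 * (0.5 + 0.5*0.02)       = 0.04131

i.e. the last factor uses P(c=?), not P(b=?).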

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


fix_typo_in_README_v2.dependencies
Description: Binary data



Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-05 Thread Rafia Sabih
On Sun, Jun 4, 2017 at 12:45 PM, Mithun Cy  wrote:
> On Wed, May 31, 2017 at 10:18 PM, Robert Haas  wrote:
>> + *contrib/autoprewarm.c
>>
>> Wrong.
>
> -- Oops Sorry fixed.
>
>> +Oid         database;       /* database */
>> +Oid         spcnode;        /* tablespace */
>> +Oid         filenode;       /* relation's filenode. */
>> +ForkNumber  forknum;        /* fork number */
>> +BlockNumber blocknum;       /* block number */
>>
>> Why spcnode rather than tablespace?  Also, the third line has a
>> period, but not the others.  I think you could just drop these
>> comments; they basically just duplicate the structure names, except
>> for spcnode which doesn't but probably should.
>
> -- Dropped comments and changed spcnode to tablespace.
>
>> +bool        can_do_prewarm; /* if set can't do prewarm task. */
>>
>> The comment and the name of the variable imply opposite meanings.
>
> -- Sorry a typo. Now this variable has been removed as you have
> suggested with new variables in AutoPrewarmSharedState.
>
>> +/*
>> + * detach_blkinfos - on_shm_exit detach the dsm allocated for blockinfos.
>> + */
>> I assume you don't really need this.  Presumably process exit is going
>> to detach the DSM anyway.
>
> -- Yes considering process exit will detach the dsm, I have removed
> that function.
>
>> +if (seg == NULL)
>> +    ereport(ERROR,
>> +            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +             errmsg("unable to map dynamic shared memory segment")));
>>
>> Let's use the wording from the message in parallel.c rather than this
>> wording.  Actually, we should probably (as a separate patch) change
>> test_shm_mq's worker.c to use the parallel.c wording also.
>
> -- I have corrected the message with "could not map dynamic shared
> memory segment" as in parallel.c
>
>> +SetCurrentStatementStartTimestamp();
>>
>> Why do we need this?
>
> -- Removed Sorry forgot to remove same when I removed the SPI connection code.
>
>> +StartTransactionCommand();
>>
>> Do we need a transaction?  If so, how about starting a new transaction
>> for each relation instead of using a single one for the entire
>> prewarm?
>
> -- We do relation_open hence need a transaction. As suggested now we
> start a new transaction on every new relation.
>
>> +if (status == BGWH_STOPPED)
>> +    return;
>> +
>> +if (status == BGWH_POSTMASTER_DIED)
>> +{
>> +    ereport(ERROR,
>> +            (errcode(ERRCODE_INSUFFICIENT_RESOURCES),
>> +             errmsg("cannot start bgworker autoprewarm without postmaster"),
>> +             errhint("Kill all remaining database processes and restart"
>> +                     " the database.")));
>> +}
>> +
>> +Assert(0);
>>
>> Instead of writing it this way, remove the first "if" block and change
>> the second one to Assert(status == BGWH_STOPPED).  Also, I'd ditch the
>> curly braces in this case.
>
> -- Fixed as suggested.
>
>>
>> +file = fopen(AUTOPREWARM_FILE, PG_BINARY_R);
>>
>> Use AllocateFile() instead.  See the comments for that function and at
>> the top of fd.c.  Then you don't need to worry about leaking the file
>> handle on an error, so you can remove the fclose() before ereport()
>> stuff -- which is incomplete anyway, because you could fail e.g.
>> inside dsm_create().
>
> -- Using AllocateFile now.
>
>>
>> + errmsg("Error reading num of elements in \"%s\" for"
>> +" autoprewarm : %m", AUTOPREWARM_FILE)));
>>
>> As I said in a previous review, please do NOT split error messages
>> across lines like this.  Also, this error message is nothing close to
>> PostgreSQL style. Please read
>> https://www.postgresql.org/docs/devel/static/error-style-guide.html
>> and learn to follow all those guidelines written therein.  I see at
>> least 3 separate problems with this message.
>
> -- Thanks, I have tried to fix it now.
>
>> +num_elements = i;
>>
>> I'd do something like if (i != num_elements) elog(ERROR, "autoprewarm
>> block dump has %d entries but expected %d", i, num_elements);  It
>> seems OK for this to be elog() rather than ereport() because the file
>> should never be corrupt unless the user has cheated by hand-editing
>> it.
>
> -- Fixed as suggested. Now eloged as an ERROR.
>
>> I think you can get rid of next_db_pos altogether, and this
>> prewarm_elem thing too.  Design sketch:
>>
>> 1. Move all the stuff that's in prewarm_elem into
>> AutoPrewarmSharedState.  Rename start_pos to prewarm_start_idx and
>> end_of_blockinfos to prewarm_stop_idx.
>>
>> 2. Instead of computing all of the database ranges first and then
>> launching workers, do it all in one loop.  So figure out where the
>> records for the current database end and set prewarm_start_idx and
>> prewarm_end_idx appropriately.  Launch a worker.  When the worker
>> terminates, set prewarm_start_idx = 

Re: retry shm attach for windows (WAS: Re: [HACKERS] OK, so culicidae is *still* broken)

2017-06-05 Thread Magnus Hagander
On Mon, Jun 5, 2017 at 1:35 PM, Amit Kapila  wrote:

> On Mon, Jun 5, 2017 at 4:58 PM, Magnus Hagander 
> wrote:
> > On Mon, Jun 5, 2017 at 1:16 PM, Amit Kapila 
> wrote:
> >>
> >> On Mon, Jun 5, 2017 at 9:15 AM, Tom Lane  wrote:
> >> > Amit Kapila  writes:
> >> >
> >> >> I think the same problem can happen during reattach as well.
> >> >> Basically, MapViewOfFileEx can fail to load image at predefined
> >> >> address (UsedShmemSegAddr).
> >> >
> >> > Once we've successfully done the VirtualAllocEx call, that should hold
> >> > until the VirtualFree call in PGSharedMemoryReAttach, allowing the
> >> > immediately-following MapViewOfFileEx call to succeed.  Were that not
> >> > the
> >> > case, we'd have had problems even without ASLR.  We did have problems
> >> > exactly like that before the pgwin32_ReserveSharedMemoryRegion code
> was
> >> > added.
> >> >
> >>
> >> I could not find anything directly in specs which could prove the
> >> theory either way.  However, in one of the StackOverflow discussions,
> >> it has been indicated that MapViewOfFile can opt to load the image at
> >> an address different than the predefined address due to ASLR.
> >>
> >> >  So my feeling about this is that retrying the process creation as
> >> > in my prototype patch ought to be sufficient; if you think it isn't,
> the
> >> > burden of proof is on you.
> >> >
> >>
> >> Sure.  I think it is slightly tricky because specs don't say clearly
> >> how ASLR can impact the behavior of any API and in my last attempt I
> >> could not reproduce the issue.
> >>
> >> I can try to do basic verification with the patch you have proposed,
> >> but I fear, to do the actual tests as suggested by you is difficult as
> >> it is not reproducible on my machine, but I can still try.
> >>
> >>
> >> [1] -
> >> https://stackoverflow.com/questions/9718616/what-does-
> mapviewoffile-return/11233456
> >> Refer below text:
> >>
> >> "Yes, MapViewOfFile returns the virtual memory base address where the
> >> image has been loaded. The value (content) of this address depends on
> >> whether the image has been successfully loaded at its predefined
> >> address (which has been setup by the linker) or whether the image has
> >> been relocated (because the desired, predefined address is already
> >> occupied or because the image has opt-in to support ASLR)."
> >>
> >
> > That statements refers to mapping executables though, like DLL and EXE.
> Not
> > mapping of data segments.
> >
> > It does randomize the entire location of the heap, in which case it might
> > also change. But not for the individual block.
> >
> > But in neither of those cases does it help to retry without restarting
> the
> > process,
> >
>
> Okay, the question here is do we need some handling during reattach
> operation where we do MapViewOfFileEx at the predefined location?
>
>
As long as we restart in a new process, I don't think so. ASLR does not
randomize away our addresses *after we've been given them*. It does it at
process (or thread) start, so as long as we get it initially, I think we
are safe.

(Obviously nobody can be 100% sure without seeing the source code for it,
but all references I've seen to the implementation seem to indicate that)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: retry shm attach for windows (WAS: Re: [HACKERS] OK, so culicidae is *still* broken)

2017-06-05 Thread Amit Kapila
On Mon, Jun 5, 2017 at 4:58 PM, Magnus Hagander  wrote:
> On Mon, Jun 5, 2017 at 1:16 PM, Amit Kapila  wrote:
>>
>> On Mon, Jun 5, 2017 at 9:15 AM, Tom Lane  wrote:
>> > Amit Kapila  writes:
>> >
>> >> I think the same problem can happen during reattach as well.
>> >> Basically, MapViewOfFileEx can fail to load image at predefined
>> >> address (UsedShmemSegAddr).
>> >
>> > Once we've successfully done the VirtualAllocEx call, that should hold
>> > until the VirtualFree call in PGSharedMemoryReAttach, allowing the
>> > immediately-following MapViewOfFileEx call to succeed.  Were that not
>> > the
>> > case, we'd have had problems even without ASLR.  We did have problems
>> > exactly like that before the pgwin32_ReserveSharedMemoryRegion code was
>> > added.
>> >
>>
>> I could not find anything directly in specs which could prove the
>> theory either way.  However, in one of the StackOverflow discussions,
>> it has been indicated that MapViewOfFile can opt to load the image at
>> an address different than the predefined address due to ASLR.
>>
>> >  So my feeling about this is that retrying the process creation as
>> > in my prototype patch ought to be sufficient; if you think it isn't, the
>> > burden of proof is on you.
>> >
>>
>> Sure.  I think it is slightly tricky because specs don't say clearly
>> how ASLR can impact the behavior of any API and in my last attempt I
>> could not reproduce the issue.
>>
>> I can try to do basic verification with the patch you have proposed,
>> but I fear, to do the actual tests as suggested by you is difficult as
>> it is not reproducible on my machine, but I can still try.
>>
>>
>> [1] -
>> https://stackoverflow.com/questions/9718616/what-does-mapviewoffile-return/11233456
>> Refer below text:
>>
>> "Yes, MapViewOfFile returns the virtual memory base address where the
>> image has been loaded. The value (content) of this address depends on
>> whether the image has been successfully loaded at its predefined
>> address (which has been setup by the linker) or whether the image has
>> been relocated (because the desired, predefined address is already
>> occupied or because the image has opt-in to support ASLR)."
>>
>
> That statements refers to mapping executables though, like DLL and EXE. Not
> mapping of data segments.
>
> It does randomize the entire location of the heap, in which case it might
> also change. But not for the individual block.
>
> But in neither of those cases does it help to retry without restarting the
> process,
>

Okay, the question here is do we need some handling during reattach
operation where we do MapViewOfFileEx at the predefined location?




-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: retry shm attach for windows (WAS: Re: [HACKERS] OK, so culicidae is *still* broken)

2017-06-05 Thread Magnus Hagander
On Mon, Jun 5, 2017 at 1:16 PM, Amit Kapila  wrote:

> On Mon, Jun 5, 2017 at 9:15 AM, Tom Lane  wrote:
> > Amit Kapila  writes:
> >
> >> I think the same problem can happen during reattach as well.
> >> Basically, MapViewOfFileEx can fail to load image at predefined
> >> address (UsedShmemSegAddr).
> >
> > Once we've successfully done the VirtualAllocEx call, that should hold
> > until the VirtualFree call in PGSharedMemoryReAttach, allowing the
> > immediately-following MapViewOfFileEx call to succeed.  Were that not the
> > case, we'd have had problems even without ASLR.  We did have problems
> > exactly like that before the pgwin32_ReserveSharedMemoryRegion code was
> > added.
> >
>
> I could not find anything directly in specs which could prove the
> theory either way.  However, in one of the StackOverflow discussions,
> it has been indicated that MapViewOfFile can opt to load the image at
> an address different than the predefined address due to ASLR.
>
> >  So my feeling about this is that retrying the process creation as
> > in my prototype patch ought to be sufficient; if you think it isn't, the
> > burden of proof is on you.
> >
>
> Sure.  I think it is slightly tricky because specs don't say clearly
> how ASLR can impact the behavior of any API and in my last attempt I
> could not reproduce the issue.
>
> I can try to do basic verification with the patch you have proposed,
> but I fear, to do the actual tests as suggested by you is difficult as
> it is not reproducible on my machine, but I can still try.
>
>
> [1] - https://stackoverflow.com/questions/9718616/what-does-
> mapviewoffile-return/11233456
> Refer below text:
>
> "Yes, MapViewOfFile returns the virtual memory base address where the
> image has been loaded. The value (content) of this address depends on
> whether the image has been successfully loaded at its predefined
> address (which has been setup by the linker) or whether the image has
> been relocated (because the desired, predefined address is already
> occupied or because the image has opt-in to support ASLR)."
>
>
That statements refers to mapping executables though, like DLL and EXE. Not
mapping of data segments.

It does randomize the entire location of the heap, in which case it might
also change. But not for the individual block.

But in neither of those cases does it help to retry without restarting the
process, because the heap will be mapped into the same place, and any DLLs
loading prior to our code will already have been loaded.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: retry shm attach for windows (WAS: Re: [HACKERS] OK, so culicidae is *still* broken)

2017-06-05 Thread Amit Kapila
On Mon, Jun 5, 2017 at 9:15 AM, Tom Lane  wrote:
> Amit Kapila  writes:
>
>> I think the same problem can happen during reattach as well.
>> Basically, MapViewOfFileEx can fail to load image at predefined
>> address (UsedShmemSegAddr).
>
> Once we've successfully done the VirtualAllocEx call, that should hold
> until the VirtualFree call in PGSharedMemoryReAttach, allowing the
> immediately-following MapViewOfFileEx call to succeed.  Were that not the
> case, we'd have had problems even without ASLR.  We did have problems
> exactly like that before the pgwin32_ReserveSharedMemoryRegion code was
> added.
>

I could not find anything directly in specs which could prove the
theory either way.  However, in one of the StackOverflow discussions,
it has been indicated that MapViewOfFile can opt to load the image at
an address different than the predefined address due to ASLR.

>  So my feeling about this is that retrying the process creation as
> in my prototype patch ought to be sufficient; if you think it isn't, the
> burden of proof is on you.
>

Sure.  I think it is slightly tricky because specs don't say clearly
how ASLR can impact the behavior of any API and in my last attempt I
could not reproduce the issue.

I can try to do basic verification with the patch you have proposed,
but I fear, to do the actual tests as suggested by you is difficult as
it is not reproducible on my machine, but I can still try.


[1] - 
https://stackoverflow.com/questions/9718616/what-does-mapviewoffile-return/11233456
Refer below text:

"Yes, MapViewOfFile returns the virtual memory base address where the
image has been loaded. The value (content) of this address depends on
whether the image has been successfully loaded at its predefined
address (which has been setup by the linker) or whether the image has
been relocated (because the desired, predefined address is already
occupied or because the image has opt-in to support ASLR)."



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-06-05 Thread Etsuro Fujita

On 2017/06/05 17:39, Ashutosh Bapat wrote:

On Fri, Jun 2, 2017 at 2:40 PM, Etsuro Fujita



While updating the patch, I noticed the patch rewrites the UPDATE targetList
incorrectly in some cases; rewrite_inherited_tlist I added to
adjust_appendrel_attrs (1) removes all junk items from the targetList and
(2) adds junk items for the child table using rewriteTargetListUD, but it's
wrong to drop all junk items in cases where there are junk items for some
other reasons than rewriteTargetListUD.  Consider junk items containing
MULTIEXPR SubLink.  One way I came up with to fix this is to change (1) to
only remove junk items with resname; since junk items added by
rewriteTargetListUD should have resname (note: we would need resname to call
ExecFindJunkAttributeInTlist at execution time!) while other junk items
wouldn't have resname (see transformUpdateTargetList), we could correctly
replace junk items added by rewriteTargetListUD for the parent with ones for
the child, by that change.  I might be missing something, though.  Comments
welcome.


I haven't looked at the patch, but that doesn't look right. In future
some code path other than rewriteTargetListUD() may add junk items
with resname and this fix will remove those junk items as well.


Yeah, I thought that too.  I think we could replace junk tlist entries 
added by rewriteTargetListUD() more safely, by adding a lot more code, 
but I'm not sure it's worth complicating the code at the current stage.


Best regards,
Etsuro Fujita





Re: [HACKERS] make check false success

2017-06-05 Thread Sandro Santilli
On Fri, Jun 02, 2017 at 08:20:25AM -0400, Robert Haas wrote:
> On Thu, Jun 1, 2017 at 10:18 AM, Sandro Santilli  wrote:
> > I noticed that the `check` Makefile rule imported by PGXS is giving
> > a success exit code even when it is unsupported.
> >
> > The attached patch fixes that.
> 
> Hmm.  I'm not 100% sure that the existing behavior is wrong.

Why not ? The caller is attempting to make an unsupported target,
how's that different from calling `make unexistent` ?

--strk;




Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-06-05 Thread Heikki Linnakangas

On 06/05/2017 11:49 AM, Tianzhou Chen wrote:

Hi Pg Hackers,

XID wraparound seems to be quite a big concern and we introduce
changes like “adding another frozen bit to each page”
[http://rhaas.blogspot.com/2016/03/no-more-full-table-vacuums.html

to tackle this. I am just wondering what’s the challenges preventing
us from moving to 64 bit xid?  This is the previous thread I find
https://www.postgresql.org/message-id/CAEYLb_UfC%2BHZ4RAP7XuoFZr%2B2_ktQmS9xqcQgE-rNf5UCqEt5A%40mail.gmail.com
,
the only answer there is:

“ The most obvious reason for not using 64-bit xid values is that
they require more storage than 32-bit values. There is a patch
floating around that makes it safe to not forcibly safety shutdown
the server where currently it is necessary, but it doesn't work by
making xids 64-bit.
"

I am personally not quite convinced that is the main reason, since I
feel for database hitting this issue, the schema is mostly
non-trivial and doesn’t matter so much with 8 more bytes. Could some
postgres experts share more insights about the challenges?


That quote is accurate. We don't want to just expand XIDs to 64 bits, 
because it would significantly bloat the tuple header. PostgreSQL's 
per-tuple overhead is already quite large, compared to many other systems.


The most promising approach to tackle this is to switch to 64-bit XIDs 
in in-memory structures, and add some kind of an extra epoch field to 
the page header. That would effectively give you 64-bit XIDs, but would 
only add one field to each page, not every tuple.
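
As a rough illustration (made-up code, not from any proposed patch), the
in-memory 64-bit XID would simply combine the per-page epoch with the
existing 32-bit xid stored in the tuple:

#include <stdint.h>

/* hypothetical: high 32 bits from the page header, low 32 bits from the tuple */
static inline uint64_t
full_transaction_id(uint32_t page_epoch, uint32_t tuple_xid)
{
    return ((uint64_t) page_epoch << 32) | (uint64_t) tuple_xid;
}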


- Heikki




[HACKERS] Update comments in nodeModifyTable.c

2017-06-05 Thread Etsuro Fujita

While working on [1], I noticed that the comment in ExecModifyTable:

 * Foreign table updates have a wholerow attribute when the
 * relation has an AFTER ROW trigger.

is not 100% correct because a foreign table has a wholerow attribute 
when the relation has an AFTER ROW or BEFORE ROW trigger (see 
rewriteTargetListUD).  So I'd propose s/an AFTER ROW trigger/a row-level 
trigger/.  Attached is a patch for that.


Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/a31f779e-9cb8-1ea5-71a6-bca6adbfa6a4%40lab.ntt.co.jp
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index cf555fe..5dde93c 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1554,7 +1554,7 @@ ExecModifyTable(ModifyTableState *node)
 				 * the old relation tuple.
 				 *
 				 * Foreign table updates have a wholerow attribute when the
-				 * relation has an AFTER ROW trigger.  Note that the wholerow
+				 * relation has a row-level trigger.  Note that the wholerow
 				 * attribute does not carry system columns.  Foreign table
 				 * triggers miss seeing those, except that we know enough here
 				 * to set t_tableOid.  Quite separately from this, the FDW may
@@ -2093,7 +2093,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 			else if (relkind == RELKIND_FOREIGN_TABLE)
 			{
 				/*
-				 * When there is an AFTER trigger, there should be a
+				 * When there is a row-level trigger, there should be a
 				 * wholerow attribute.
 				 */
 				j->jf_junkAttNo = ExecFindJunkAttribute(j, "wholerow");



[HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-06-05 Thread Tianzhou Chen
Hi Pg Hackers,

XID wraparound seems to be quite a big concern and we introduce changes
like “adding another frozen bit to each page”
[http://rhaas.blogspot.com/2016/03/no-more-full-table-vacuums.html] to tackle
this. I am just wondering what are the challenges preventing us from moving to
64-bit xids?  This is the previous thread I found:
https://www.postgresql.org/message-id/CAEYLb_UfC%2BHZ4RAP7XuoFZr%2B2_ktQmS9xqcQgE-rNf5UCqEt5A%40mail.gmail.com
and the only answer there is:

“
The most obvious reason for not using 64-bit xid values is that they
require more storage than 32-bit values. There is a patch floating
around that makes it safe to not forcibly safety shutdown the server
where currently it is necessary, but it doesn't work by making xids
64-bit.

"
   
I am personally not quite convinced that is the main reason, since I feel
that for databases hitting this issue the schema is mostly non-trivial, so 8
more bytes per tuple wouldn't matter that much. Could some postgres experts
share more insights about the challenges?


Thanks
Tianzhou



Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-06-05 Thread Ashutosh Bapat
On Fri, Jun 2, 2017 at 2:40 PM, Etsuro Fujita
 wrote:
> On 2017/05/16 21:36, Etsuro Fujita wrote:
>>
>> One approach I came up with to fix this issue is to rewrite the targetList
>> entries of an inherited UPDATE/DELETE properly using rewriteTargetListUD,
>> when generating a plan for each child table in inheritance_planner.
>> Attached is a WIP patch for that.  Maybe I am missing something, though.
>
>
> While updating the patch, I noticed the patch rewrites the UPDATE targetList
> incorrectly in some cases; rewrite_inherited_tlist I added to
> adjust_appendrel_attrs (1) removes all junk items from the targetList and
> (2) adds junk items for the child table using rewriteTargetListUD, but it's
> wrong to drop all junk items in cases where there are junk items for some
> other reasons than rewriteTargetListUD.  Consider junk items containing
> MULTIEXPR SubLink.  One way I came up with to fix this is to change (1) to
> only remove junk items with resname; since junk items added by
> rewriteTargetListUD should have resname (note: we would need resname to call
> ExecFindJunkAttributeInTlist at execution time!) while other junk items
> wouldn't have resname (see transformUpdateTargetList), we could correctly
> replace junk items added by rewriteTargetListUD for the parent with ones for
> the child, by that change.  I might be missing something, though.  Comments
> welcome.

I haven't looked at the patch, but that doesn't look right. In future
some code path other than rewriteTargetListUD() may add junk items
with resname and this fix will remove those junk items as well.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-06-05 Thread Etsuro Fujita

On 2017/06/02 18:10, Etsuro Fujita wrote:

On 2017/05/16 21:36, Etsuro Fujita wrote:
One approach I came up with to fix this issue is to rewrite the 
targetList entries of an inherited UPDATE/DELETE properly using 
rewriteTargetListUD, when generating a plan for each child table in 
inheritance_planner.  Attached is a WIP patch for that.  Maybe I am 
missing something, though.


While updating the patch, I noticed the patch rewrites the UPDATE 
targetList incorrectly in some cases; rewrite_inherited_tlist I added to 
adjust_appendrel_attrs (1) removes all junk items from the targetList 
and (2) adds junk items for the child table using rewriteTargetListUD, 
but it's wrong to drop all junk items in cases where there are junk 
items for some other reasons than rewriteTargetListUD.  Consider junk 
items containing MULTIEXPR SubLink.  One way I came up with to fix this 
is to change (1) to only remove junk items with resname; since junk 
items added by rewriteTargetListUD should have resname (note: we would 
need resname to call ExecFindJunkAttributeInTlist at execution time!) 
while other junk items wouldn't have resname (see 
transformUpdateTargetList), we could correctly replace junk items added 
by rewriteTargetListUD for the parent with ones for the child, by that 
change.  I might be missing something, though.  Comments welcome.


I updated the patch that way.  Please find attached an updated version.

Other changes:
* Moved the initialization for "tupleid" I added in ExecModifyTable as 
discussed before, which I think is essentially the same as proposed by 
Ildus in [1], since I think that would be more consistent with "oldtuple".

* Added regression tests.

Anyway I'll add this to the next commitfest.

Best regards,
Etsuro Fujita

[1] 
https://www.postgresql.org/message-id/20170514150525.0346ba72%40postgrespro.ru
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 6924,6929  update bar set f2 = f2 + 100 returning *;
--- 6924,7038 
7 | 277
  (6 rows)
  
+ -- Test that UPDATE/DELETE with inherited target works with row-level triggers
+ CREATE TRIGGER trig_row_before
+ BEFORE UPDATE OR DELETE ON bar2
+ FOR EACH ROW EXECUTE PROCEDURE trigger_data(23,'skidoo');
+ CREATE TRIGGER trig_row_after
+ AFTER UPDATE OR DELETE ON bar2
+ FOR EACH ROW EXECUTE PROCEDURE trigger_data(23,'skidoo');
+ explain (verbose, costs off)
+ update bar set f2 = f2 + 100;
+                                       QUERY PLAN
+ ------------------------------------------------------------------------------------------
+  Update on public.bar
+Update on public.bar
+Foreign Update on public.bar2
+      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1 RETURNING f1, f2, f3
+->  Seq Scan on public.bar
+  Output: bar.f1, (bar.f2 + 100), bar.ctid
+->  Foreign Scan on public.bar2
+  Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, bar2.*
+  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+ (9 rows)
+ 
+ update bar set f2 = f2 + 100;
+ NOTICE:  trig_row_before(23, skidoo) BEFORE ROW UPDATE ON bar2
+ NOTICE:  OLD: (3,333,33),NEW: (3,433,33)
+ NOTICE:  trig_row_before(23, skidoo) BEFORE ROW UPDATE ON bar2
+ NOTICE:  OLD: (4,344,44),NEW: (4,444,44)
+ NOTICE:  trig_row_before(23, skidoo) BEFORE ROW UPDATE ON bar2
+ NOTICE:  OLD: (7,277,77),NEW: (7,377,77)
+ NOTICE:  trig_row_after(23, skidoo) AFTER ROW UPDATE ON bar2
+ NOTICE:  OLD: (3,333,33),NEW: (3,433,33)
+ NOTICE:  trig_row_after(23, skidoo) AFTER ROW UPDATE ON bar2
+ NOTICE:  OLD: (4,344,44),NEW: (4,444,44)
+ NOTICE:  trig_row_after(23, skidoo) AFTER ROW UPDATE ON bar2
+ NOTICE:  OLD: (7,277,77),NEW: (7,377,77)
+ explain (verbose, costs off)
+ update bar set (f2,f1) = (select f1, f2 from bar2 where f3 = 77) where f1 = 7;
+                                            QUERY PLAN
+ -----------------------------------------------------------------------------------------------
+  Update on public.bar
+Update on public.bar
+Foreign Update on public.bar2
+      Remote SQL: UPDATE public.loct2 SET f1 = $2, f2 = $3 WHERE ctid = $1 RETURNING f1, f2, f3
+InitPlan 1 (returns $0,$1)
+  ->  Foreign Scan on public.bar2 bar2_1
+Output: bar2_1.f1, bar2_1.f2
+Remote SQL: SELECT f1, f2 FROM public.loct2 WHERE ((f3 = 77))
+->  Seq Scan on public.bar
+  Output: $1, $0, NULL::record, bar.ctid
+  Filter: (bar.f1 = 7)
+->  Foreign Scan on public.bar2
+  Output: $1, $0, bar2.f3, NULL::record, bar2.ctid, bar2.*
+          Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 WHERE ((f1 = 7)) FOR UPDATE
+ (14 rows)
+ 
+ update bar set (f2,f1) = (select f1, f2 from bar2 where f3 = 77) where f1 = 7;
+ NOTICE:  trig_row_before(23, skidoo) BEFORE ROW UPDATE ON bar2
+ NOTICE:  OLD: (7,377,77),NEW: (377,7,77)
+ 

Re: [HACKERS] simplehash.h typo

2017-06-05 Thread Heikki Linnakangas

On 06/05/2017 11:26 AM, Andres Freund wrote:

On 2017-06-05 11:10:12 +0300, Heikki Linnakangas wrote:

Fixed, thanks. I also fixed and clarified some other comments in the file
that seemed wrong or confusing to me.


Thanks for looking - I don't see any commit though?


Pushed now. Note to self: remove --dry-run when you intend to commit for 
real :-).


- Heikki





Re: [HACKERS] simplehash.h typo

2017-06-05 Thread Andres Freund
On 2017-06-05 11:10:12 +0300, Heikki Linnakangas wrote:
> On 05/28/2017 04:50 AM, Jeff Janes wrote:
> > /* round up size to the next power of 2, that's the bucketing works  */
> > 
> > 
> > That should probably be "that's the **way** bucketing works".  Or maybe it
> > is an idiom I don't grok.

No, you're perfectly right!


> Fixed, thanks. I also fixed and clarified some other comments in the file
> that seemed wrong or confusing to me.

Thanks for looking - I don't see any commit though?

- Andres




Re: [HACKERS] Server ignores contents of SASLInitialResponse

2017-06-05 Thread Heikki Linnakangas

On 06/05/2017 09:34 AM, Michael Paquier wrote:

On Mon, Jun 5, 2017 at 7:49 AM, Noah Misch  wrote:

IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
for your status update.  Please reacquaint yourself with the policy on open
item ownership[1] and then reply immediately.  If I do not hear from you by
2017-06-05 23:00 UTC, I will transfer this item to release management team
ownership without further notice.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


I am planning to look and comment about Heikki's patch tomorrow my
time. The GSS issue requires a back-patch and is actually linked to
what is discussed for SASL. The testing behind it also needs some care
and attention.


Thanks. I'll wait for Michael's update tomorrow, and will have a look at 
this after that. Expect a new status update by Wednesday.


- Heikki





Re: [HACKERS] simplehash.h typo

2017-06-05 Thread Heikki Linnakangas

On 05/28/2017 04:50 AM, Jeff Janes wrote:

/* round up size to the next power of 2, that's the bucketing works  */


That should probably be "that's the **way** bucketing works".  Or maybe it
is an idiom I don't grok.


Fixed, thanks. I also fixed and clarified some other comments in the 
file that seemed wrong or confusing to me.


- Heikki





Re: [HACKERS] proposal psql \gdesc

2017-06-05 Thread Fabien COELHO



new update - rebase, changed message


Thanks. New patch applies cleanly, make check still ok.

--
Fabien.




Re: [HACKERS] UPDATE of partition key

2017-06-05 Thread Amit Khandekar
On 5 June 2017 at 11:27, Amit Kapila  wrote:
> On Fri, Jun 2, 2017 at 4:37 PM, Amit Khandekar  wrote:
>> On 2 June 2017 at 01:17, Robert Haas  wrote:
>>> On Thu, Jun 1, 2017 at 7:41 AM, Amit Khandekar  
>>> wrote:
> Regarding the trigger issue, I can't claim to have a terribly strong
> opinion on this.  I think that practically anything we do here might
> upset somebody, but probably any halfway-reasonable thing we choose to
> do will be OK for most people.  However, there seems to be a
> discrepancy between the approach that got the most votes and the one
> that is implemented by the v8 patch, so that seems like something to
> fix.

 Yes, I have started working on updating the patch to use that approach
 (BR and AR update triggers on the source and destination partitions
 respectively, instead of delete+insert).  The approach taken by the
 patch (BR update + delete+insert triggers) didn't require any changes
 in the way ExecDelete() and ExecInsert() were called. Now we would
 need to skip the delete/insert triggers, so some flags need to be
 passed to these functions,

>
> I thought you already need to pass an additional flag for special
> handling of ctid in Delete case.

Yeah that was unavoidable.

> For Insert, a new flag needs to be
> passed and need to have a check for that in few places.

For skipping the delete and insert triggers, we need to include yet
another flag, plus checks in both ExecDelete() and ExecInsert() for
skipping both the BR and AR triggers, and then in ExecUpdate(), again a
call to ExecARUpdateTriggers() before quitting.

>
>> or else have stripped down versions of
 ExecDelete() and ExecInsert() which don't do other things like
 RETURNING handling and firing triggers.
>>>
>>> See, that strikes me as a pretty good argument for firing the
>>> DELETE+INSERT triggers...
>>>
>>> I'm not wedded to that approach, but "what makes the code simplest?"
>>> is not a bad tiebreak, other things being equal.
>>
>> Yes, that sounds good to me.
>>
>
> I am okay if we want to go ahead with firing BR UPDATE + DELETE +
> INSERT triggers for an Update statement (when row movement happens) on
> the argument of code simplicity, but it sounds slightly odd behavior.

Ok. Will keep this behaviour, which is already present in the patch. I
also feel that code simplicity can be used as a tie-breaker if a single
behaviour that completely satisfies all aspects cannot be agreed upon.

>
>> But I think we want to wait for other's
>> opinion because it is quite understandable that two triggers firing on
>> the same partition sounds odd.
>>
>
> Yeah, but I think we have to rely on docs in this case as behavior is
> not intuitive.

Agreed. The doc changes in the patch already has explained in detail
this behaviour.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] UPDATE of partition key

2017-06-05 Thread Amit Kapila
On Thu, Jun 1, 2017 at 3:25 AM, Robert Haas  wrote:
> On Mon, May 29, 2017 at 5:26 AM, Amit Kapila  wrote:
>>> But I think, we can also take step-by-step approach even for v11. If
>>> we agree that it is ok to silently do the updates as long as we
>>> document the behaviour, we can go ahead and do this, and then as a
>>> second step, implement error handling as a separate patch. If that
>>> patch does not materialize, we at least have the current behaviour
>>> documented.
>>
>> I think that is sensible approach if we find the second step involves
>> big or complicated changes.
>
> I think it is definitely a good idea to separate the two patches.
> UPDATE tuple routing without any special handling for the EPQ issue is
> just a partitioning feature.  The proposed handling for the EPQ issue
> is an *on-disk format change*.  That turns a patch which is subject
> only to routine bugs into one which can eat your data permanently --
> so having the "can eat your data permanently" separated out for both
> review and commit seems only prudent.  For me, it's not a matter of
> which patch is big or complicated, but rather a matter of one of them
> being a whole lot riskier than the other.  Even UPDATE tuple routing
> could mess things up pretty seriously if we end up with tuples in the
> wrong partition, of course, but the other thing is still worse.
>
> In terms of a development plan, I think we would need to have both
> patches before either could be committed.  I believe that everyone
> other than me who has expressed an opinion on this issue has said that
> it's unacceptable to just ignore the issue, so it doesn't sound like
> there will be much appetite for having #1 go into the tree without #2.
> I'm still really concerned about that approach because we do not have
> very much bit space left and WARM wants to use quite a bit of it.  I
> think it's quite possible that we'll be sad in the future if we find
> that we can't implement feature XYZ because of the bit-space consumed
> by this feature.  However, I don't have the only vote here and I'm not
> going to try to shove this into the tree over multiple objections
> (unless there are a lot more votes the other way, but so far there's
> no sign of that).
>
> Greg/Amit's idea of using the CTID field rather than an infomask bit
> seems like a possibly promising approach.  Not everything that needs
> bit-space can use the CTID field, so using it is a little less likely
> to conflict with something else we want to do in the future than using
> a precious infomask bit.  However, I'm worried about this:
>
> /* Make sure there is no forward chain link in t_ctid */
> tp.t_data->t_ctid = tp.t_self;
>
> The comment does not say *why* we need to make sure that there is no
> forward chain link, but it implies that some code somewhere in the
> system does or at one time did depend on no forward link existing.
>

I think it is to ensure that the EvalPlanQual mechanism gets invoked in
the right case.  The visibility routine returns HeapTupleUpdated both
when the tuple has been deleted and when it has been updated (i.e. a
newer version of the tuple exists), so we use the ctid to decide whether
we need to follow the tuple chain to a newer version of the tuple.
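
For illustration, that decision boils down to a TID comparison; a minimal
sketch (not the actual nodeModifyTable.c code, which also handles locking,
isolation levels and the EvalPlanQual recheck itself):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * After heap_update()/heap_delete() reports HeapTupleUpdated: heap_delete()
 * sets t_ctid to the tuple's own TID, while heap_update() points it at the
 * successor version, so comparing the returned ctid with the TID we tried
 * to modify tells the two cases apart.
 */
static bool
updated_tuple_has_successor(ItemPointer target_tid, ItemPointer returned_ctid)
{
    return !ItemPointerEquals(target_tid, returned_ctid);
}

If it returns true, the executor follows the chain and runs EvalPlanQual on
the newer version; otherwise the tuple was deleted and the row is skipped.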

> Any such code that still exists will need to be updated.
>

Yeah.

> The other potential issue I see here is that I know the WARM code also
> tries to use the bit-space in the CTID field; in particular, it uses
> the CTID field of the last tuple in a HOT chain to point back to the
> root of the chain.  That seems like it could conflict with the usage
> proposed here, but I'm not totally sure.
>

The proposed change in WARM tuple patch uses ip_posid field of CTID
and we are planning to use ip_blkid field.  Here is the relevant text
and code from WARM tuple patch:

"Store the root line pointer of the WARM chain in the t_ctid.ip_posid
field of the last tuple in the chain and mark the tuple header with
HEAP_TUPLE_LATEST flag to record that fact."

+#define HeapTupleHeaderSetHeapLatest(tup, offnum) \
+do { \
+ AssertMacro(OffsetNumberIsValid(offnum)); \
+ (tup)->t_infomask2 |= HEAP_LATEST_TUPLE; \
+ ItemPointerSetOffsetNumber(&(tup)->t_ctid, (offnum)); \
+} while (0)

For further details, refer patch 0001-Track-root-line-pointer-v23_v26
in the below e-mail:
https://www.postgresql.org/message-id/CABOikdOTstHK2y0rDk%2BY3Wx9HRe%2BbZtj3zuYGU%3DVngneiHo5KQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Server ignores contents of SASLInitialResponse

2017-06-05 Thread Michael Paquier
On Mon, Jun 5, 2017 at 7:49 AM, Noah Misch  wrote:
> IMMEDIATE ATTENTION REQUIRED.  This PostgreSQL 10 open item is long past due
> for your status update.  Please reacquaint yourself with the policy on open
> item ownership[1] and then reply immediately.  If I do not hear from you by
> 2017-06-05 23:00 UTC, I will transfer this item to release management team
> ownership without further notice.
>
> [1] 
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I am planning to look and comment about Heikki's patch tomorrow my
time. The GSS issue requires a back-patch and is actually linked to
what is discussed for SASL. The testing behind it also needs some care
and attention.
-- 
Michael




Re: [HACKERS] logical replication and PANIC during shutdown checkpoint in publisher

2017-06-05 Thread Michael Paquier
On Mon, Jun 5, 2017 at 10:29 AM, Andres Freund  wrote:
> On 2017-06-02 17:20:23 -0700, Andres Freund wrote:
>> Attached is a *preliminary* patch series implementing this.  I've first
>> reverted the previous patch, as otherwise backpatchable versions of the
>> necessary patches would get too complicated, due to the signals used and
>> such.

That makes sense.

> I went again through this, and the only real thing I found that there
> was a leftover prototype in walsender.h.  I've in interim worked on
> backpatch versions of that series, annoying conflicts, but nothing
> really problematic.  The only real difference is adding SetLatch() calls
> to HandleWalSndInitStopping() < 9.6, and guarding SetLatch with an if <
> 9.5.
>
> As an additional patch (based on one by Petr), even though it more
> belongs to
> http://archives.postgresql.org/message-id/20170421014030.fdzvvvbrz4nckrow%40alap3.anarazel.de
> attached is a patch unifying SIGHUP between normal and walsender
> backends.  This needs to be backpatched all the way.  I've also attached
> a second patch, again based on Petr's, that unifies SIGHUP handling
> across all the remaining backends, but that's something that probably
> more appropriate for v11, although I'm still tempted to commit it
> earlier.

I have looked at all those patches. The set looks solid to me.

0001 and 0002 are straight-forward things. It makes sense to unify the
SIGUSR1 handling.

Here are some comments about 0003.

+ * This will trigger walsenders to send the remaining WAL, prevent them from
+ * accepting further commands. After that they'll wait till the last WAL is
+ * written.
s/prevent/preventing/?
I would rephrase the last sentence a bit:
"After that each WAL sender will wait until the end-of-checkpoint
record has been flushed on the receiver side."

+   /*
+* Have WalSndLoop() terminate the connection in an orderly
+* manner, after writing out all the pending data.
+*/
+   if (got_STOPPING)
+   got_SIGUSR2 = true;
I think that for correctness the state of the WAL sender should be
switched to WALSNDSTATE_STOPPING in XLogSendLogical() as well.

About 0004... This definitely meritates a backpatch, PostgresMain() is
taken by WAL senders as well when executing queries.

-   if (got_SIGHUP)
+   if (ConfigRereadPending)
{
-   got_SIGHUP = false;
+   ConfigRereadPending = false;
A more appropriate name would be ConfigReloadPending perhaps?
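
(As an aside, a minimal sketch of the flag-plus-latch pattern being unified
here, using the proposed flag name; the handler name and placement are
illustrative only, not the actual patch:)

static volatile sig_atomic_t ConfigReloadPending = false;

static void
SignalHandlerForConfigReload(SIGNAL_ARGS)
{
    int         save_errno = errno;

    ConfigReloadPending = true;
    SetLatch(MyLatch);          /* wake the main loop promptly */

    errno = save_errno;
}

/* ... and later, in the backend's main loop ... */
if (ConfigReloadPending)
{
    ConfigReloadPending = false;
    ProcessConfigFile(PGC_SIGHUP);
}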

0005 looks like a fine one-liner to me.

For 0006, you could include as well the removal of worker_spi_sighup()
in the refactoring. I think that it would be interesting to be able to
trigger a feedback message using SIGHUP in WAL receivers, refactoring
at the same time SIGHUP handling for WAL receivers. It is possible for
example to abuse SIGHUP in autovacuum for cost parameters.
-- 
Michael

