Re: [HACKERS] Error with index on unlogged table
On December 10, 2015 5:02:27 AM GMT+01:00, Michael Paquier wrote:
>On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund wrote:
>> On 2015-12-09 21:03:47 +0900, Michael Paquier wrote:
>>> Oh, OK. I didn't read through your lines correctly. So you basically
>>> mean that we would look at the init files that are on disk, and check
>>> if they are empty. If they are, we simply use XLogReadBufferExtended
>>> to fetch the INIT_FORKNUM content and fill in another buffer for the
>>> MAIN_FORKNUM. More or less right?
>>
>> We would not just do so if they're empty, we would just generally copy
>> the file via shared buffers, instead of copy_file(). But we'd get the
>> file size from the filesystem (which is fine, we make sure it is
>> correct during replay).
>
>So, this suggestion is basically implementing the reverse operation of
>GetRelationPath() to be able to rebuild a RelFileNode from scratch and
>then look at the shared buffers needed. Isn't it a bit brittle for
>back-branches? RelFileNode stuff is available easily through records
>when replaying individual FPIs, but not really in this code path.

Oh, sorry. I'm definitely not thinking about doing this for the back branches. That was more musing about a way to optimize this.

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
Hello,

At Thu, 10 Dec 2015 15:28:14 +0900, Michael Paquier wrote in
> On Thu, Dec 10, 2015 at 4:40 AM, Robert Haas wrote:
> > On Mon, Dec 7, 2015 at 2:41 AM, Amit Langote wrote:
> >> Hm, I guess progress messages would not change that much over the course
> >> of a long-running command. How about we pass only the array(s) that we
> >> change and NULL for others:
> >>
> >> With the following prototype for pgstat_report_progress:
> >>
> >> void pgstat_report_progress(uint32 *uint32_param, int num_uint32_param,
> >>                             bool *message_param[], int num_message_param);
> >>
> >> If we just wanted to change, say scanned_heap_pages, then we report that
> >> using:
> >>
> >> uint32_param[1] = scanned_heap_pages;
> >> pgstat_report_progress(uint32_param, 3, NULL, 0);
> >>
> >> Also, pgstat_report_progress() should check each of its parameters for
> >> NULL before iterating over to copy. So in most reports over the course of
> >> a command, the message_param would be NULL and hence not copied.
> >
> > It's going to be *really* important that this facility provides a
> > lightweight way of updating progress, so I think this whole API is
> > badly designed. VACUUM, for example, is going to want a way to update
> > the individual counter for the number of pages it's scanned after
> > every page. It should not have to pass all of the other information
> > that is part of a complete report. It should just be able to say
> > pgstat_report_progress_update_counter(1, pages_scanned) or something
> > of this sort. Don't marshal all of the data and then make
> > pgstat_report_progress figure out what's changed. Provide a series of
> > narrow APIs where the caller tells you exactly what they want to
> > update, and you update only that. It's fine to have a
> > pgstat_report_progress() function to update everything at once, for
> > the use at the beginning of a command, but the incremental updates
> > within the command should do something lighter-weight.
> [first time looking really at the patch and catching up with this thread]
>
> Agreed. As far as I can guess from the code, those should be as
> follows, to bloat the pgstat message queue a minimum:
>
> +       pgstat_report_activity_flag(ACTIVITY_IS_VACUUM);
>         /*
>          * Loop to process each selected relation.
>          */
>
> Defining a new routine for this purpose is a bit surprising. Cannot we
> just use pgstat_report_activity with a new BackendState STATE_INVACUUM
> or similar if we pursue the progress tracking approach?

The author might want to know the vacuum status *after* it has finished, but it is reset just after the end of a vacuum. One concern is that a new BackendState adds a new value for pg_stat_activity.state, which might confuse someone using it, but I don't see any other issue with it.

> A couple of comments:
> - The relation OID should be reported and not its name. In case of a
> relation rename that would not be cool for tracking, and most users
> are surely going to join with other system tables using it.

+1

> - The progress tracking facility adds a whole level of complexity for
> very little gain, and IMO this should *not* be part of PgBackendStatus
> because in most cases its data finishes wasted. We don't expect
> backends to run such progress reports frequently, do we?

I strongly thought the same thing, but I haven't come up with a better place for it to be stored. But,

> My opinion on
> the matter is that we should define a different collector data for
> vacuum, with something like PgStat_StatVacuumEntry, then have on top
> of it a couple of routines dedicated to feeding data into it when
> some work is done on a vacuum job.

+many. But I can't guess the appropriate number of entries for it, or a suitable replacement policy for an excessive number of vacuums. Although sane users won't run vacuum in so many backends.

> In short, it seems to me that this patch needs a rework, and should be
> returned with feedback. Other opinions?
This is an important feature for DBAs, so I agree with you.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Speedup twophase transactions
On Wed, Dec 9, 2015 at 10:44 AM, Stas Kelvich wrote:
> Hello.
>
> While working with cluster stuff (DTM, tsDTM) we noted that postgres 2pc
> transactions are approximately two times slower than an ordinary commit on
> a workload with fast transactions — few single-row updates and COMMIT or
> PREPARE/COMMIT. Perf top showed that a lot of time is spent in the kernel
> on fopen/fclose, so it is worth trying to reduce file operations with 2pc tx.

I've tested this through my testing harness which forces the database to go through endless runs of crash recovery and checks for consistency, and so far it has survived perfectly.

...

> Now the results of the benchmark are as follows (dual 6-core xeon server):
>
> Current master without 2PC: ~42 ktps
> Current master with 2PC: ~22 ktps
> Patched master with 2PC: ~36 ktps

Can you give the full command line? -j, -c, etc.

> Benchmark done with following script:
>
> \set naccounts 10 * :scale
> \setrandom from_aid 1 :naccounts
> \setrandom to_aid 1 :naccounts
> \setrandom delta 1 100
> \set scale :scale+1

Why are you incrementing :scale ? I very rapidly reach a point where most of the updates are against tuples that don't exist, and then get integer overflow problems.

> BEGIN;
> UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid;
> UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid;
> PREPARE TRANSACTION ':client_id.:scale';
> COMMIT PREPARED ':client_id.:scale';

Cheers,

Jeff
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
On Thu, Dec 10, 2015 at 4:40 AM, Robert Haas wrote:
> On Mon, Dec 7, 2015 at 2:41 AM, Amit Langote wrote:
>> Hm, I guess progress messages would not change that much over the course
>> of a long-running command. How about we pass only the array(s) that we
>> change and NULL for others:
>>
>> With the following prototype for pgstat_report_progress:
>>
>> void pgstat_report_progress(uint32 *uint32_param, int num_uint32_param,
>>                             bool *message_param[], int num_message_param);
>>
>> If we just wanted to change, say scanned_heap_pages, then we report that
>> using:
>>
>> uint32_param[1] = scanned_heap_pages;
>> pgstat_report_progress(uint32_param, 3, NULL, 0);
>>
>> Also, pgstat_report_progress() should check each of its parameters for
>> NULL before iterating over to copy. So in most reports over the course of
>> a command, the message_param would be NULL and hence not copied.
>
> It's going to be *really* important that this facility provides a
> lightweight way of updating progress, so I think this whole API is
> badly designed. VACUUM, for example, is going to want a way to update
> the individual counter for the number of pages it's scanned after
> every page. It should not have to pass all of the other information
> that is part of a complete report. It should just be able to say
> pgstat_report_progress_update_counter(1, pages_scanned) or something
> of this sort. Don't marshal all of the data and then make
> pgstat_report_progress figure out what's changed. Provide a series of
> narrow APIs where the caller tells you exactly what they want to
> update, and you update only that. It's fine to have a
> pgstat_report_progress() function to update everything at once, for
> the use at the beginning of a command, but the incremental updates
> within the command should do something lighter-weight.

[first time looking really at the patch and catching up with this thread]

Agreed.
As far as I can guess from the code, those should be as follows, to bloat the pgstat message queue a minimum:

+       pgstat_report_activity_flag(ACTIVITY_IS_VACUUM);
        /*
         * Loop to process each selected relation.
         */

Defining a new routine for this purpose is a bit surprising. Cannot we just use pgstat_report_activity with a new BackendState STATE_INVACUUM or similar if we pursue the progress tracking approach?

A couple of comments:
- The relation OID should be reported and not its name. In case of a relation rename that would not be cool for tracking, and most users are surely going to join with other system tables using it.
- The progress tracking facility adds a whole level of complexity for very little gain, and IMO this should *not* be part of PgBackendStatus because in most cases its data finishes wasted. We don't expect backends to run such progress reports frequently, do we? My opinion on the matter is that we should define a different collector data for vacuum, with something like PgStat_StatVacuumEntry, then have on top of it a couple of routines dedicated to feeding data into it when some work is done on a vacuum job.

In short, it seems to me that this patch needs a rework, and should be returned with feedback. Other opinions?
--
Michael
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Wed, Dec 9, 2015 at 2:17 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote:
> On Tue, Dec 8, 2015 at 6:00 PM, Amit Kapila wrote:
>> On Tue, Dec 8, 2015 at 3:56 PM, Alexander Korotkov wrote:
>>> Agree. This patch needs to be carefully verified. Current experiments
>>> just show that it is a promising direction for improvement. I'll come
>>> up with a better version of this patch.
>>>
>>> Also, after testing on large machines I have another observation to
>>> share. For now, LWLock doesn't guarantee that an exclusive lock would
>>> ever be acquired (assuming each shared lock duration is finite). That
>>> is because when there is no exclusive lock, new shared locks aren't
>>> queued and the LWLock state is changed directly. Thus, a process which
>>> tries to acquire an exclusive lock has to wait for a gap in shared
>>> locks.
>>
>> I think this has the potential to starve exclusive lockers in the worst
>> case.
>>
>>> But with high concurrency for the shared lock that could happen very
>>> rarely, maybe never.
>>>
>>> We did see this on a big Intel machine in practice. pgbench -S takes
>>> the shared ProcArrayLock very frequently. Once some number of
>>> connections is reached, new connections hang on getting the exclusive
>>> ProcArrayLock. I think we could do some workaround for this problem.
>>> For instance, when an exclusive lock waiter reaches some timeout it
>>> could set some special bit which prevents others from getting new
>>> shared locks.
>>
>> I think a timeout based solution would lead to giving priority to
>> exclusive lock waiters (assume a case where each of the exclusive
>> lock waiters times out one after another) and make shared lockers
>> wait, and a timer based solution might turn out to be costly for
>> general cases where the wait is not so long.
>
> Since all lwlock waiters are ordered in the queue, we can let only the
> first waiter set this bit.

That's okay, but still, every time an exclusive locker wakes up, the threshold time for its wait might already be over and it will set the bit. In theory that looks okay, but compared to the current algorithm it will make more shared lockers be added to the wait queue.

> Anyway, once the bit is set, shared lockers would be added to the queue.
> They would get the lock in queue order.

Yes, that's right, but I think in general the solution to this problem should be: don't let any exclusive locker starve, and still allow as many shared lockers as possible. I think here it is important how we define starving; should it be based on time or something else? I find a timer based solution somewhat less suitable, but maybe it is okay, if there is no other better way.

>> Another way could be to check if the exclusive locker needs to go for
>> repeated waits a couple of times, then we can set such a bit.
>
> I'm not sure what you mean by repeated wait. Do you mean the exclusive
> locker was woken up twice by timeout?

I mean to say that once the exclusive locker is woken up, it again re-tries to acquire the lock as it does today, but if it finds that the number of retries is greater than a certain threshold (let us say 10), then we set the bit.

> Because now, without a timeout, the exclusive locker wouldn't be woken
> up until all shared locks are released.

Does LWLockWakeup() work that way? I thought it works such that once an exclusive locker is encountered in the wait queue, it just wakes that one and won't try to wake any further waiters.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
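[The retry-threshold idea discussed above can be sketched as a single-threaded toy model. This is purely illustrative — the names (ToyLWLock, FAIRNESS_RETRIES, try_shared, ...) are invented and this is not the real LWLock code — but it shows the mechanism: an exclusive waiter that has been passed over too many times sets a fairness bit, which makes new shared lockers queue instead of joining directly.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Single-threaded toy model of the proposed LWLock fairness bit.
 * Hypothetical names; the real lock uses atomics and a wait queue. */
#define FAIRNESS_RETRIES 10

typedef struct {
    uint32_t shared_count;       /* number of shared holders */
    bool     exclusive_held;
    bool     fairness_bit;       /* set by a starving exclusive waiter */
    int      exclusive_retries;  /* failed exclusive-acquire attempts */
} ToyLWLock;

static bool try_shared(ToyLWLock *l) {
    if (l->exclusive_held || l->fairness_bit)
        return false;            /* must queue behind the exclusive waiter */
    l->shared_count++;
    return true;
}

static void release_shared(ToyLWLock *l) {
    l->shared_count--;
}

static bool try_exclusive(ToyLWLock *l) {
    if (l->shared_count == 0 && !l->exclusive_held) {
        l->exclusive_held = true;
        l->fairness_bit = false;     /* lock obtained, clear the bit */
        l->exclusive_retries = 0;
        return true;
    }
    if (++l->exclusive_retries >= FAIRNESS_RETRIES)
        l->fairness_bit = true;      /* stop the stream of new shared lockers */
    return false;
}
```

Once the bit is set, the stream of new shared lockers dries up, the existing holders drain, and the exclusive waiter gets its turn — matching the "set the bit after N retries" variant rather than the timeout variant.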
Re: [HACKERS] pg_hba_lookup function to get all matching pg_hba.conf entries
On Thu, Dec 10, 2015 at 4:33 PM, Amit Kapila wrote:
> On Thu, Dec 10, 2015 at 9:51 AM, Haribabu Kommi wrote:
>> On Thu, Dec 10, 2015 at 2:29 PM, Amit Kapila wrote:
>> > How about creating "hba parser context" and "ident parser context"
>> > at the beginning of their respective functions and don't change
>> > anything in tokenize_file()?
>>
>> The tokenize file cxt is deleted after a successful load of pg_hba.conf or
>> pg_ident.conf files. we don't need this memory once the pg_hba.conf
>> or pg_ident file is loaded, because of this reason, it is created as a
>> separate context and deleted later.
>
> What about the error case?

Yes. One error case is possible, when the length of the string crosses the MAX_LINE size. If we allocate the tokenize file cxt inside CurrentMemoryContext (i.e. MessageContext) instead of TopMemoryContext, it will automatically be freed later in the error case.

Regards,
Hari Babu
Fujitsu Australia
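[The point about parenting the context under CurrentMemoryContext can be illustrated with a toy model of PostgreSQL's context hierarchy — hypothetical names, not the actual mcxt.c implementation: deleting a context recursively reclaims its children, which is why a "tokenize file cxt" created under a short-lived parent such as MessageContext is cleaned up automatically when the error path resets the parent.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model of parent/child memory contexts (illustrative only). */
typedef struct ToyContext {
    struct ToyContext *first_child;
    struct ToyContext *next_sibling;
    bool *deleted_flag;          /* lets the caller observe deletion */
} ToyContext;

static ToyContext *context_create(ToyContext *parent, bool *flag) {
    ToyContext *c = calloc(1, sizeof(ToyContext));
    c->deleted_flag = flag;
    if (parent) {
        /* link into the parent's child list */
        c->next_sibling = parent->first_child;
        parent->first_child = c;
    }
    return c;
}

static void context_delete(ToyContext *c) {
    /* children go down with the parent, recursively */
    while (c->first_child) {
        ToyContext *child = c->first_child;
        c->first_child = child->next_sibling;
        context_delete(child);
    }
    if (c->deleted_flag)
        *c->deleted_flag = true;
    free(c);
}
```

With this behavior, an explicit delete of the tokenize context remains the normal path, but an error that resets the parent reclaims the child for free.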
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
On 2015/12/10 4:40, Robert Haas wrote:
> It's going to be *really* important that this facility provides a
> lightweight way of updating progress, so I think this whole API is
> badly designed. VACUUM, for example, is going to want a way to update
> the individual counter for the number of pages it's scanned after
> every page. It should not have to pass all of the other information
> that is part of a complete report. It should just be able to say
> pgstat_report_progress_update_counter(1, pages_scanned) or something
> of this sort. Don't marshal all of the data and then make
> pgstat_report_progress figure out what's changed. Provide a series of
> narrow APIs where the caller tells you exactly what they want to
> update, and you update only that. It's fine to have a
> pgstat_report_progress() function to update everything at once, for
> the use at the beginning of a command, but the incremental updates
> within the command should do something lighter-weight.

How about something like the following:

/*
 * index: in the array of uint32 counters in the beentry
 * counter: new value of the (index+1)th counter
 */
void pgstat_report_progress_update_counter(int index, uint32 counter);

/*
 * msg: new value of the (index+1)th message (with trailing null byte)
 */
void pgstat_report_progress_update_message(int index, const char *msg);

Actually updating a counter or message would look like:

pgstat_increment_changecount_before(beentry);
// update the counter or message at index in beentry->st_progress_*
pgstat_increment_changecount_after(beentry);

Other interface functions which are called at the beginning:

void pgstat_report_progress_set_command(int commandId);
void pgstat_report_progress_set_command_target(const char *target_name);

or

void pgstat_report_progress_set_command_target(Oid target_oid);

And then a SQL-level,

void pgstat_reset_local_progress();

which simply sets beentry->st_command to some invalid value which signals a progress view function to ignore this backend.
Thanks,
Amit
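[The update functions sketched above lean on the changecount protocol already used for PgBackendStatus: the writer bumps the counter to an odd value before touching the slot and back to an even value afterwards, and a lock-free reader retries until it sees the same even value on both sides of its copy. A minimal self-contained model of that protocol — hypothetical names, not the actual pgstat.c code — looks like this:]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model of the st_changecount protocol (illustrative names). */
typedef struct {
    volatile uint32_t changecount;
    uint32_t progress_param[4];
} ToyBackendSlot;

static void update_counter(ToyBackendSlot *s, int index, uint32_t value) {
    s->changecount++;                 /* odd: update in progress */
    s->progress_param[index] = value;
    s->changecount++;                 /* even: consistent again */
}

/* Returns 1 and stores a consistent snapshot in *out, or 0 if the
 * writer kept the slot busy for the whole retry budget. */
static int read_counter(const ToyBackendSlot *s, int index, uint32_t *out) {
    for (int tries = 0; tries < 1000; tries++) {
        uint32_t before = s->changecount;
        uint32_t value = s->progress_param[index];
        uint32_t after = s->changecount;
        if (before == after && (before & 1) == 0) {
            *out = value;
            return 1;
        }
    }
    return 0;
}
```

The per-page update is then just two increments and one store — no locking and no marshalling of the full report, which is what makes the narrow API cheap enough to call once per heap page. (The real protocol also needs memory barriers between the steps; this single-threaded model omits them.)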
Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
On Thu, Dec 10, 2015 at 6:46 AM, Alvaro Herrera wrote:
> I've been giving RecoveryTest.pm a look. I wonder if we really need that
> as a separate package. My first thought was that we could have another
> class that inherits from PostgresNode (say RecoveryNode). But later it
> occurred to me that we could have the new functions just be part of
> PostgresNode itself directly; so we would have some new PostgresNode
> methods:
> $node->enable_streaming
> $node->enable_restoring
> $node->enable_archiving

Sure.

> $node->wait (your RecoveryTest::wait_for_node; better name for this?)

wait_for_access?

> and some additional constructors:
> make_master
> make_stream_standby
> make_archive_standby

I have done that a little bit differently. Those are completely removed, then init() and init_from_backup() are extended with a new set of parameters to enable archiving, streaming or restore on a node. Which gives the patch attached.
--
Michael

20151210_recovery_suite.patch
Description: binary/octet-stream
Re: [HACKERS] pg_hba_lookup function to get all matching pg_hba.conf entries
On Thu, Dec 10, 2015 at 9:51 AM, Haribabu Kommi wrote: > > On Thu, Dec 10, 2015 at 2:29 PM, Amit Kapila wrote: > > How about creating "hba parser context" and "ident parser context" > > at the beginning of their respective functions and don't change > > anything in tokenize_file()? > > The tokenize file cxt is deleted after a successful load of pg_hba.conf or > pg_ident.conf files. we don't need this memory once the pg_hba.conf > or pg_ident file is loaded, because of this reason, it is created as a > separate context and deleted later. > What about the error case? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] parallel joins, and better parallel explain
On Wed, Dec 9, 2015 at 11:51 PM, Robert Haas wrote:
> On Fri, Dec 4, 2015 at 3:07 AM, Amit Kapila wrote:
> > I think the problem is at the Gather node: the number of buffers (read + hit)
> > is greater than the number of pages in the relation. The reason why it
> > is doing so is that in workers (ParallelQueryMain()), it starts the buffer
> > usage accumulation before ExecutorStart(), whereas in the master backend
> > it always calculates it for the ExecutorRun() phase; on changing it to
> > accumulate for the ExecutorRun() phase the above problem is fixed. The
> > attached patch fixes the problem.
>
> Why is it a bad thing to capture the cost of doing ExecutorStart() in
> the worker? I can see there's an argument that changing this would be
> more consistent, but I'm not totally convinced. The overhead of
> ExecutorStart() in the leader isn't attributable to any specific
> worker, but the overhead of ExecutorStart() in the worker can fairly
> be blamed on Gather, I think.

This boils down to the question of why, currently, buffer usage or other similar stats (time for node execution) for a node don't include the cost of ExecutorStart(). I think the reason is that ExecutorStart() does some other miscellaneous work like accessing system tables or initialization of nodes which we might not even execute, so accumulating the cost of such work doesn't seem to be meaningful. Looking at references to InstrStartNode(), which actually marks the beginning of buffer usage stats, it is clear that buffer usage stats are not counted for the ExecutorStart() phase, so I think following the same for worker stats seems to be more accurate.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
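[The distinction being argued — whether a worker's accumulation window covers ExecutorStart() too, or only ExecutorRun() as in the leader — can be shown with a toy model. All names here are illustrative, not the real executor or instrument.c API; the point is only that the reported figure depends on where the window opens.]

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: a session-global counter, like pgBufferUsage. */
static uint64_t session_buffers_read;

static void toy_executor_start(void) { session_buffers_read += 5; }   /* catalog access, node init */
static void toy_executor_run(void)   { session_buffers_read += 100; } /* the actual scan */

/* Accumulate only the delta of the global counter across one phase. */
static uint64_t measure_phase(void (*phase)(void)) {
    uint64_t before = session_buffers_read;
    phase();
    return session_buffers_read - before;
}
```

A worker that opens its window before ExecutorStart() reports the combined figure, while the leader reports only the run-phase figure for the same plan — which is exactly the mismatch the patch removes.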
Re: [HACKERS] Passing initially_valid values instead of !skip_validation to StoreRelCheck() in AddRelationNewConstraints()
>On Thursday, 10 December 2015 10:13 AM, Amit Langote wrote:
>On 2015/12/10 13:38, Amit Langote wrote:
>>
>> Take a look at a similar code block in transformFKConstraints
>> (parse_utilcmd.c: line 1935), or transformIndexConstraint
>> (parse_utilcmd.c: line 1761).
>
>Oops, forget the second one.

No issue; the first one makes sense. The updated patch is attached.

Thanks & Regards,
Amul Sul

transformCheckConstraints-function-to-overrid-not-valid_V2.patch
Description: Binary data
Re: [HACKERS] Patch: Implement failover on libpq connect level.
On Mon, 07 Dec 2015 15:26:33 -0500 Korry Douglas wrote:
> The problem seems to be in PQconnectPoll() in the case for
> CONNECTION_AUTH_OK, specifically this code:
>
>    /* We can release the address list now. */
>    pg_freeaddrinfo_all(conn->addrlist_family, conn->addrlist);
>    conn->addrlist = NULL;
>    conn->addr_cur = NULL;
>
> That frees up the list of alternative host addresses. The state
> machine then progresses to CONNECTION_CHECK_RO (which invokes
> pg_is_in_recovery()), then CONNECTION_CHECK_RW (waiting for the

Thank you for pointing out this problem; I had overlooked it. Probably I should improve my testing scenario. I'm attaching a new version of the patch, which, hopefully, handles address list freeing correctly.

--
Victor Wagner

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 9c0e4c8..162bd4f 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -792,7 +792,7 @@
 host=localhost port=5432 dbname=mydb connect_timeout=10
 The general form for a connection URI is:
-postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1&...]
+postgresql://[user[:password]@][netloc][:port][,netloc[:port]...][/dbname][?param1=value1&...]
@@ -809,6 +809,7 @@
 postgresql://localhost/mydb
 postgresql://user@localhost
 postgresql://user:secret@localhost
 postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp
+postgresql://node1,node2:5433,node3:4432,node4/mydb?hostorder=random&readonly=1
 Components of the hierarchical part of the URI can also be given as parameters. For example:
@@ -831,7 +832,9 @@
 postgresql:///mydb?host=localhost&port=5433
 For improved compatibility with JDBC connection URIs, instances of
 parameter ssl=true are translated into
-sslmode=require.
+sslmode=require and
+loadBalanceHosts=true into
+hostorder=random.
@@ -841,6 +844,10 @@
 postgresql:///mydb?host=localhost&port=5433
 postgresql://[2001:db8::1234]/database
+
+   There can be several host specifications, optionally accompanied
+   by a port, separated by commas.
+
 The host component is interpreted as described for the parameter
 ... PostgreSQL was built). On machines without Unix-domain sockets, the default is to connect to localhost.
+
+   There can be more than one host parameter in
+   the connect string. In this case these hosts are considered
+   alternate entries into the same database, and if connecting to the
+   first one fails, the library tries to connect to the second, etc.
+   This can be used for a high-availability cluster or for load
+   balancing. See the hostorder parameter.
+
+   A network host name can be accompanied by a port number, separated
+   by a colon. If so, this port number is used only when connecting to
+   this host. If there is no port number, the port specified in the
+   port parameter is used.
+
@@ -942,8 +963,44 @@
 postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
+
+   hostorder
+
+   Specifies how to choose a host from the list of alternate hosts
+   specified in the host parameter.
+
+   If the value of this argument is sequential (the
+   default), the library connects to the hosts in the order they are
+   specified, and tries to connect to the second one only if connection
+   to the first fails.
+
+   If the value is random, the host to connect to is randomly
+   picked from the list. This allows the load to be balanced between
+   several cluster nodes. However, currently PostgreSQL doesn't support
+   multimaster clusters, so without use of third-party products,
+   only read-only connections can take advantage of the
+   load balancing. See the readonly parameter.
+
+   readonly
+
+   If this parameter is 0 (the default), upon successful connection the
+   library checks whether the host is in recovery state, and if so,
+   tries the next host in the connect string. If this parameter is
+   non-zero, connections to warm standby nodes are considered
+   successful.
+
+   port
@@ -985,7 +1042,6 @@
 postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
-
 connect_timeout
@@ -996,7 +1052,27 @@
 postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
+
+   failover_timeout
+
+   Maximum time to cyclically retry all the hosts in the connect string
+   (as a decimal integer number of seconds). If not specified, the
+   hosts are tried just once.
+
+   If we have a replicating cluster, and the master node fails, it
+   might take some time to promote one of the standby nodes to the new
+   master. So clients which notice that connecting to the master fails
+   can already give up attempt to reestablish
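[For illustration, the multi-host syntax the patch introduces — host[:port],host[:port],... with a fallback to a default port — could be parsed along these lines. This is a hypothetical standalone sketch, not the patch's actual libpq code:]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy parser for a "host[:port],host[:port],..." list (illustrative). */
#define MAX_HOST_LEN 64

typedef struct {
    char host[MAX_HOST_LEN];
    int  port;
} HostSpec;

static int parse_host_list(const char *spec, int default_port,
                           HostSpec *out, int max)
{
    int n = 0;
    while (*spec != '\0' && n < max) {
        /* each element runs up to the next comma (or end of string) */
        const char *comma = strchr(spec, ',');
        size_t len = comma ? (size_t) (comma - spec) : strlen(spec);
        /* an optional ":port" suffix applies only to this element */
        const char *colon = memchr(spec, ':', len);
        size_t hostlen = colon ? (size_t) (colon - spec) : len;

        if (hostlen >= MAX_HOST_LEN)
            hostlen = MAX_HOST_LEN - 1;
        memcpy(out[n].host, spec, hostlen);
        out[n].host[hostlen] = '\0';
        out[n].port = colon ? atoi(colon + 1) : default_port;
        n++;
        spec = comma ? comma + 1 : spec + len;
    }
    return n;
}
```

Applied to the example URI from the patch's documentation, `node1,node2:5433,node3:4432,node4` yields four entries, with `node1` and `node4` falling back to the default port.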
Re: [HACKERS] Passing initially_valid values instead of !skip_validation to StoreRelCheck() in AddRelationNewConstraints()
On 2015/12/10 13:38, Amit Langote wrote:
> Take a look at a similar code block in transformFKConstraints
> (parse_utilcmd.c: line 1935), or transformIndexConstraint
> (parse_utilcmd.c: line 1761).

Oops, forget the second one.

Thanks,
Amit
Re: [HACKERS] Passing initially_valid values instead of !skip_validation to StoreRelCheck() in AddRelationNewConstraints()
On 2015/12/10 13:12, amul sul wrote:
>> On Thursday, 10 December 2015 8:22 AM, Amit Langote wrote:
>> You forgot to put braces around the if block.
>
> Is this really required?

If nothing else, for consistency with surrounding code. Take a look at a similar code block in transformFKConstraints (parse_utilcmd.c: line 1935), or transformIndexConstraint (parse_utilcmd.c: line 1761).

Thanks,
Amit
Re: [HACKERS] pg_hba_lookup function to get all matching pg_hba.conf entries
On Thu, Dec 10, 2015 at 2:29 PM, Amit Kapila wrote: > On Thu, Dec 10, 2015 at 6:46 AM, Haribabu Kommi > wrote: >> >> On Thu, Dec 10, 2015 at 4:22 AM, Alvaro Herrera >> wrote: >> > Haribabu Kommi wrote: >> > >> >> Reverting the context release patch is already provided in the first >> >> mail of this >> >> thread [1]. Forgot to mention about the same in further mails. >> >> >> >> Here I attached the same patch. This patch needs to be applied first >> >> before >> >> pg_hba_lookup patch. I tested it in windows version also. >> > >> > So if you change the file and reload repeatedly, we leak all the memory >> > allocated for HBA lines in TopMemoryContext? This doesn't sound great. >> > Perhaps we need a dedicated context which can be reset at will so that >> > it can be refilled with the right info when we reload the file. >> >> No. There is no leaks associated with pg_hba.conf parsing. we already have >> a memory context called "hba parser context" allocated from Postmaster >> context. The "revert_hba_context_release_in_backend" patch changes it to >> TopMemoryContext. The memory required for parsing and storing parsed >> hba lines is obtained from this context. >> > > tokenize_file() is called before creation of hba parser context, so below > change would be problem. > > *** 386,392 tokenize_file(const char *filename, FILE *file, > > MemoryContext linecxt; > > MemoryContext oldcxt; > > > > ! linecxt = AllocSetContextCreate(CurrentMemoryContext, > > "tokenize file cxt", > > ALLOCSET_DEFAULT_MINSIZE, > > ALLOCSET_DEFAULT_INITSIZE, > > --- 386,392 > > MemoryContext linecxt; > > MemoryContext oldcxt; > > > > ! linecxt = AllocSetContextCreate(TopMemoryContext, > > "tokenize file cxt", > > ALLOCSET_DEFAULT_MINSIZE, > > ALLOCSET_DEFAULT_INITSIZE, > > > How about creating "hba parser context" and "ident parser context" > at the beginning of their respective functions and don't change > anything in tokenize_file()? 
The tokenize file cxt is deleted after a successful load of the pg_hba.conf or pg_ident.conf files. We don't need this memory once the pg_hba.conf or pg_ident file is loaded; for this reason, it is created as a separate context and deleted later.

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] Passing initially_valid values instead of !skip_validation to StoreRelCheck() in AddRelationNewConstraints()
>On Thursday, 10 December 2015 8:22 AM, Amit Langote wrote:
>You forgot to put braces around the if block.

Is this really required?

Regards,
Amul Sul
Re: [HACKERS] Error with index on unlogged table
On Wed, Dec 9, 2015 at 9:07 PM, Andres Freund wrote:
> On 2015-12-09 21:03:47 +0900, Michael Paquier wrote:
>> Oh, OK. I didn't read through your lines correctly. So you basically
>> mean that we would look at the init files that are on disk, and check
>> if they are empty. If they are, we simply use XLogReadBufferExtended
>> to fetch the INIT_FORKNUM content and fill in another buffer for the
>> MAIN_FORKNUM. More or less right?
>
> We would not just do so if they're empty, we would just generally copy
> the file via shared buffers, instead of copy_file(). But we'd get the
> file size from the filesystem (which is fine, we make sure it is
> correct during replay).

So, this suggestion is basically implementing the reverse operation of GetRelationPath() to be able to rebuild a RelFileNode from scratch and then look at the shared buffers needed. Isn't it a bit brittle for back-branches? RelFileNode stuff is available easily through records when replaying individual FPIs, but not really in this code path.
--
Michael
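[The "reverse GetRelationPath()" operation described above amounts to parsing an on-disk relation path back into its spcNode/dbNode/relNode triple. A toy sketch covering only the default-tablespace layout — hypothetical code, not from PostgreSQL — shows that the mapping itself is mechanical, while hinting at the brittleness: each tablespace layout (global/, pg_tblspc/ with version directories) needs its own case.]

```c
#include <assert.h>
#include <stdio.h>

/* Toy reverse mapping of a relation path (illustrative only). */
typedef struct {
    unsigned spcNode;   /* tablespace OID */
    unsigned dbNode;    /* database OID */
    unsigned relNode;   /* relfilenode */
} ToyRelFileNode;

#define DEFAULTTABLESPACE_OID 1663  /* pg_default's OID */

/* Handles only the "base/<db>/<rel>" layout; returns 1 on success. */
static int parse_relation_path(const char *path, ToyRelFileNode *node)
{
    unsigned db, rel;
    if (sscanf(path, "base/%u/%u", &db, &rel) == 2) {
        node->spcNode = DEFAULTTABLESPACE_OID;
        node->dbNode = db;
        node->relNode = rel;
        return 1;
    }
    return 0;  /* global/ and pg_tblspc/ layouts would need more cases */
}
```

Even this trivial case ignores fork suffixes (`_init`, `_fsm`, `_vm`) and segment numbers, which is part of why doing it in the back branches looks brittle.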
Re: [HACKERS] pg_hba_lookup function to get all matching pg_hba.conf entries
On Wed, Dec 9, 2015 at 2:35 PM, Haribabu Kommi wrote: > > > Reverting the context release patch is already provided in the first > mail of this > thread [1]. Forgot to mention about the same in further mails. > Thanks, that is helpful. However, I think it is better if you always include links to the related patches at the end of the mail. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] pg_hba_lookup function to get all matching pg_hba.conf entries
On Thu, Dec 10, 2015 at 6:46 AM, Haribabu Kommi wrote: > > On Thu, Dec 10, 2015 at 4:22 AM, Alvaro Herrera > wrote: > > Haribabu Kommi wrote: > > > >> Reverting the context release patch is already provided in the first > >> mail of this > >> thread [1]. Forgot to mention about the same in further mails. > >> > >> Here I attached the same patch. This patch needs to be applied first before > >> pg_hba_lookup patch. I tested it in windows version also. > > > > So if you change the file and reload repeatedly, we leak all the memory > > allocated for HBA lines in TopMemoryContext? This doesn't sound great. > > Perhaps we need a dedicated context which can be reset at will so that > > it can be refilled with the right info when we reload the file. > > No. There are no leaks associated with pg_hba.conf parsing. We already have > a memory context called "hba parser context" allocated from the Postmaster > context. The "revert_hba_context_release_in_backend" patch changes it to > TopMemoryContext. The memory required for parsing and storing parsed > hba lines is obtained from this context. >

tokenize_file() is called before creation of the hba parser context, so the below change would be a problem.

*** 386,392 ****
  tokenize_file(const char *filename, FILE *file,
  	MemoryContext linecxt;
  	MemoryContext oldcxt;
! 	linecxt = AllocSetContextCreate(CurrentMemoryContext,
  									"tokenize file cxt",
  									ALLOCSET_DEFAULT_MINSIZE,
  									ALLOCSET_DEFAULT_INITSIZE,
--- 386,392 ----
  	MemoryContext linecxt;
  	MemoryContext oldcxt;
! 	linecxt = AllocSetContextCreate(TopMemoryContext,
  									"tokenize file cxt",
  									ALLOCSET_DEFAULT_MINSIZE,
  									ALLOCSET_DEFAULT_INITSIZE,

How about creating the "hba parser context" and "ident parser context" at the beginning of their respective functions and not changing anything in tokenize_file()? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] psql: add \pset true/false
On Wed, Dec 9, 2015 at 8:50 PM, Michael Paquier wrote: > On Tue, Dec 8, 2015 at 8:51 PM, Michael Paquier > wrote: >> On Tue, Dec 8, 2015 at 7:18 PM, Kyotaro HORIGUCHI >> wrote: >>> At Sat, 5 Dec 2015 21:05:29 +0900, Michael Paquier >>> wrote in >>> >>> > Regarding the patch, I >>> > would tend to think that we should just reject it and try to cruft >>> > something that could be more pluggable if there is really a need. >>> > Thoughts? >>> >>> Honestly saying, I feel similarly with you:p I personally will do >>> something like the following for the original objective. >> >> Are there other opinions? The -1 team is in majority at the end of this >> thread.. > > So, marking the patch as rejected? Any objections? Done so. Alea jacta est, as one guy 2000 years ago would have said. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
On Mon, Dec 07, 2015 at 02:34:39PM +0900, Michael Paquier wrote: > On Mon, Dec 7, 2015 at 12:06 PM, Noah Misch wrote: > > On Wed, Dec 02, 2015 at 06:59:09PM -0300, Alvaro Herrera wrote:

> >> +sub DESTROY
> >> +{
> >> +	my $self = shift;
> >> +	return if not defined $self->{_pid};
> >> +	print "# signalling QUIT to $self->{_pid}\n";
> >> +	kill 'QUIT', $self->{_pid};
> >
> > On Windows, that kill() is the wrong thing. I suspect "pg_ctl kill" will be
> > the right thing, but that warrants specific testing.
>
> I don't directly see any limitation with the use of kill on Windows..
> http://perldoc.perl.org/functions/kill.html
> But indeed using pg_ctl kill directly seems like a better fit for the
> PG infrastructure.

From http://perldoc.perl.org/perlwin32.html, "Using signals under this port should currently be considered unsupported." Windows applications cannot handle SIGQUIT: https://msdn.microsoft.com/en-us/library/xdkz3x12.aspx. The PostgreSQL backend does not generate or expect Windows signals; see its signal.c emulation facility. > > Postmaster log file names became less informative. Before the commit: > > Should nodes get a name, so we instead see master_57834.log? > > I am not sure that this is necessary. In general, you'd need to cross-reference the main log file to determine which postmaster log corresponds to which action within the test. I did plenty of "grep $PATTERN src/bin/pg_rewind/tmp_check/log/master.log" while debugging that test. I'd like to be able to use /*master*.log, not rely on timestamps or on scraping regress_log_002_databases to determine which logs are master logs. Feel free to skip this point if I'm the only one minding, though. > There is definitely a limitation > regarding the fact that log files of nodes get overwritten after each > test, hence I would tend with just appending the test name in front of > the node_* files similarly to the other files. They got appended, not overwritten.
I like how you changed that to not happen, but it doesn't address what I was reporting. nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Passing initially_valid values instead of !skip_validation to StoreRelCheck() in AddRelationNewConstraints()
On 2015/12/09 18:07, amul sul wrote: >> On Wednesday, 9 December 2015 12:55 PM, Amit Langote >> wrote: > >> Thoughts? > > Wondering, have you noticed failed regression tests? Here is the updated patch. > I have tried with the new transformCheckConstraints() function & there will be a > small fix in gram.y. > > > Have a look at the attached patch & please share your thoughts and/or suggestions.

The transformCheckConstraints approach may be better after all. By the way,

> @@ -1915,6 +1922,32 @@ transformIndexConstraint(Constraint *constraint,
> CreateStmtContext *cxt)
...
> +	if (skipValidation)
> +		foreach(ckclist, cxt->ckconstraints)
> +		{
> +			Constraint *constraint = (Constraint *) lfirst(ckclist);
> +
> +			constraint->skip_validation = true;
> +			constraint->initially_valid = true;
> +		}

You forgot to put braces around the if block.

Thanks,
Amit

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 7d7d062..04c4f8f 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -2349,7 +2349,7 @@ AddRelationNewConstraints(Relation rel,
 			 * OK, store it.
 			 */
 			constrOid =
-				StoreRelCheck(rel, ccname, expr, !cdef->skip_validation, is_local,
+				StoreRelCheck(rel, ccname, expr, cdef->initially_valid, is_local,
 							  is_local ? 0 : 1, cdef->is_no_inherit, is_internal);
 			numchecks++;
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 344a40c..f0c3e76 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -562,6 +562,11 @@ transformColumnDefinition(CreateStmtContext *cxt, ColumnDef *column)
 			break;
 		case CONSTR_CHECK:
+			/*
+			 * When being added as part of a column definition, the
+			 * following always holds.
+			 */
+			constraint->initially_valid = true;
 			cxt->ckconstraints = lappend(cxt->ckconstraints, constraint);
 			break;
@@ -687,6 +692,10 @@ transformTableConstraint(CreateStmtContext *cxt, Constraint *constraint)
 			break;
 		case CONSTR_CHECK:
+			/* Is this better done in a transformCheckConstraint? */
+			if (!cxt->isalter)
+				constraint->initially_valid = true;
+
 			cxt->ckconstraints = lappend(cxt->ckconstraints, constraint);
 			break;
@@ -935,6 +944,7 @@ transformTableLikeClause(CreateStmtContext *cxt, TableLikeClause *table_like_cla
 	n->conname = pstrdup(ccname);
 	n->raw_expr = NULL;
 	n->cooked_expr = nodeToString(ccbin_node);
+	n->initially_valid = true;
 	cxt->ckconstraints = lappend(cxt->ckconstraints, n);
 	/* Copy comment on constraint */

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rework the way multixact truncations work
On Wed, Dec 09, 2015 at 11:08:32AM -0500, Robert Haas wrote: > On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch wrote: > > I hope those who have not already read commit 4f627f8 will not waste time > > reading it. They should instead ignore multixact changes from commit > > 4f627f8 > > through its revert. The 2015-09-26 commits have not appeared in a supported > > release, and no other work has built upon them. They have no tenure. (I am > > glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10 > > would have introduced a data loss bug.) Nobody reported a single defect > > before my review overturned half the patch. A revert will indeed impose on > > those who invested time to understand commit 4f627f8, but the silence about > > its defects suggests the people understanding it number either zero or one. > > Even as its author and reviewers, you would do better to set aside what you > > thought you knew about this code. > > I just don't find this a realistic model of how people use the git > log. Maybe you use it this way; I don't. I don't *want* git blame to > make it seem as if 4f627f8 is not part of the history. For better or > worse, it is. I would like to understand how you use git, then. What's one of your models of using "git log" and/or "git blame" in which you foresee the revert making history less clear, not more clear? By the way, it occurs to me that I should also make pg_upgrade blacklist the range of catversions that might have data loss. No sense in putting ourselves in the position of asking whether data files of a 9.9.3 cluster spent time in a 9.5beta2 cluster. > Ripping it out and replacing it monolithically will not > change that; it will only make the detailed history harder to > reconstruct, and I *will* want to reconstruct it. What's something that might happen six months from now and lead you to inspect master or 9.5 multixact.c between 4f627f8 and its revert? 
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_hba_lookup function to get all matching pg_hba.conf entries
On Thu, Dec 10, 2015 at 4:22 AM, Alvaro Herrera wrote: > Haribabu Kommi wrote: > >> Reverting the context release patch is already provided in the first >> mail of this >> thread [1]. Forgot to mention about the same in further mails. >> >> Here I attached the same patch. This patch needs to be applied first before >> pg_hba_lookup patch. I tested it in windows version also. > > So if you change the file and reload repeatedly, we leak all the memory > allocated for HBA lines in TopMemoryContext? This doesn't sound great. > Perhaps we need a dedicated context which can be reset at will so that > it can be refilled with the right info when we reload the file. No. There are no leaks associated with pg_hba.conf parsing. We already have a memory context called "hba parser context" allocated from the Postmaster context. The "revert_hba_context_release_in_backend" patch changes it to TopMemoryContext. The memory required for parsing and storing parsed hba lines is obtained from this context. Regards, Hari Babu Fujitsu Australia -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should TIDs be typbyval = FLOAT8PASSBYVAL to speed up CREATE INDEX CONCURRENTLY?
On Tue, Nov 17, 2015 at 7:33 PM, Corey Huinker wrote: > I'm willing, but I'm too new to the codebase to be an effective reviewer > (without guidance). The one thing I can offer in the mean time is this: my > company/client nearly always has a few spare AWS machines on the largish > side where I can compile uncommitted patches and benchmark stuff for y'all. I think that this particular patch is close to being a slam-dunk, so I don't think it's particularly needed here. But thanks. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Given a view relation OID, how to construct a Query?
On Wed, Dec 9, 2015 at 5:07 PM Tom Lane wrote: > > FWIW, it's exposed in 9.4 and up. But in older branches you could > probably just copy it, it's not that big. > That's good to know, thanks. I did copy it and it's almost 3x faster than going through SPI. Thanks again for the pointer. eric
Re: [HACKERS] Given a view relation OID, how to construct a Query?
Eric Ridge writes: > On Wed, Dec 9, 2015 at 4:04 PM Tom Lane wrote: >> Open the relation and use get_view_query(), perhaps. > I figured there was something simple, but I couldn't find it. Thanks! > Sadly, it's static. FWIW, it's exposed in 9.4 and up. But in older branches you could probably just copy it, it's not that big. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries
Merlin Moncure writes: > Aside from the functional issues, could your changes result in > performance regressions? (if so, I'd argue not to backpatch unless > there were cases that returned bad data as opposed to spurious > errors). I can't say that I think planner failures on valid queries is something that's optional to fix. However, I believe that this will typically not change the selected plan in cases where the planner didn't fail before. I did notice one change in an existing regression test, where the planner pushed a qual clause further down in the plan than it did before; but that seems like a better plan anyway. (The reason that happened is that the changes to enlarge the minimum parameterization of some base rels result in choosing to push qualifiers further down, since a qual clause will be evaluated at the lowest plan level that the selected parameterization allows.) It's a little bit harder to gauge the impact on planner speed. The transitive closure calculation could be expensive in a query with many lateral references, but that doesn't seem likely to be common; and anyway we'll buy back some of that cost due to simpler tests later. I'm optimistic that we'll come out ahead in HEAD/9.5 after the removal of LateralJoinInfo setup. It might be roughly a wash in the back branches. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Given a view relation OID, how to construct a Query?
On Wed, Dec 9, 2015 at 4:04 PM Tom Lane wrote: > Eric Ridge writes: > > I'm doing some extension development (in C) and have a situation where I > > need to examine the target list of a view, but all I have is the view's > oid. > > Open the relation and use get_view_query(), perhaps. I figured there was something simple, but I couldn't find it. Thanks! Sadly, it's static. eric
Re: [HACKERS] Speedup twophase transactions
On Thu, Dec 10, 2015 at 3:44 AM, Stas Kelvich wrote: > Most of these ideas were already mentioned in a 2009 thread by Michael Paquier > http://www.postgresql.org/message-id/c64c5f8b0908062031k3ff48428j824a9a46f2818...@mail.gmail.com > where he suggested to store 2pc data in shared memory. > At that time the patch was declined because no significant speedup was observed. > Now I see performance improvements from my patch of about 60%. Probably the old > benchmark's overall tps was lower and it was harder to hit filesystem > fopen/fclose limits. Glad to see this patch is given a second life 6 years later.

> Now results of benchmark are following (dual 6-core xeon server):
>
> Current master without 2PC: ~42 ktps
> Current master with 2PC: ~22 ktps
> Current master with 2PC: ~36 ktps

That's nice.

+	XLogRecPtr	prepare_xlogptr;	/* XLOG offset of prepare record start
+									 * or NULL if twophase data moved to file
+									 * after checkpoint.
+									 */

This had better be InvalidXLogRecPtr if unused.

+	if (gxact->prepare_lsn)
+	{
+		XlogReadTwoPhaseData(gxact->prepare_xlogptr, &buf, NULL);
+	}

Perhaps you mean prepare_xlogptr here? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries
On Wed, Dec 9, 2015 at 4:18 PM, Tom Lane wrote: > David Fetter writes: >> On Tue, Dec 08, 2015 at 12:13:41PM -0500, Tom Lane wrote: >>> Hm. At least in the first of these cases, the problem is that the >>> code I committed yesterday doesn't account for indirect lateral >>> dependencies. That is, if S1 depends on S2 which depends on the >>> inner side of an outer join, it now knows not to join S2 directly to >>> the outer side of the outer join, but it doesn't realize that the >>> same must apply to S1. >>> >>> Maybe we should redefine lateral_relids as the transitive closure of >>> a rel's lateral dependencies? Not sure. > >> That seems like it would fix the problem once and for good. Is there >> any reason to believe that the lateral dependencies could go in a >> loop, i.e. is there a reason to put in checks for same in the >> transitive closure construction? > > It should not be possible to have a loop, since rels can only have lateral > references to rels that appeared syntactically before them. But an Assert > about it doesn't seem a bad thing. > > After working on this for awhile, I've found that indeed converting > lateral_relids to a transitive closure makes things better. But there was > another bug (or two other bugs, depending on how you count) exposed by > Andreas' examples. I was right after all to suspect that the "hazardous > PHV" condition needs to be accounted for by join_is_legal; as things > stood, join_is_legal might allow a join for which every possible path > would be rejected by check_hazardous_phv. More, if we were insisting that > a join PHV be calculated before we could pass it to some other relation, > we didn't actually do anything to ensure that that join would get built. > Given an appropriate set of conditions, the clauseless-join heuristics > could decide that that join wasn't interesting and never build it, again > possibly leading to planner failure. 
So join_is_legal's cousins > have_join_order_restriction and has_join_restriction also need to know > about this issue. > > (This is a bit of a mess; I'm beginning to wonder if we shouldn't bite > the bullet and teach the executor how to handle non-Var NestLoopParams, > so that the planner doesn't have to work around that. But I'd rather > not back-patch such a change.) > > I also noticed that lateral_referencers should be a transitive closure > as well. I think that oversight doesn't lead to any observable bug, > but it would lead to constructing index paths that cannot be used, so > fixing it should save some planner cycles. > > So that leads to the attached patch, which I think is good as-is for > the back branches. In this state of the code, the LateralJoinInfo > list is darn near useless: the only thing we use it for is detecting > whether two relations have a direct (not transitive closure) lateral > reference. I'm strongly tempted to replace that logic by keeping a > copy of the pre-closure lateral_relids sets, and then we could drop > LateralJoinInfo altogether, which would make create_lateral_join_info > quite a bit shorter and faster. That could go into HEAD/9.5, but > I'm afraid to back-patch it further than 9.5 for fear that third-party > code somewhere might be relying on the LateralJoinInfo data structure. Aside from the functional issues, could your changes result in performance regressions? (if so, I'd argue not to backpatch unless there were cases that returned bad data as opposed to spurious errors). merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Given a view relation OID, how to construct a Query?
Eric Ridge writes: > I'm doing some extension development (in C) and have a situation where I > need to examine the target list of a view, but all I have is the view's oid. Open the relation and use get_view_query(), perhaps. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Making tab-complete.c easier to maintain
On Thu, Dec 10, 2015 at 12:49 AM, Greg Stark wrote: > On Wed, Dec 9, 2015 at 2:27 PM, David Fetter wrote: >> Agreed that the "whole new language" aspect seems like way too big a >> hammer, given what it actually does. > > Which would be easier to update when things change? Regarding that, both patches are equal compared to the current method using strcmp. > Which would be possible to automatically generate from gram.y? Neither of those patches takes this approach. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Given a view relation OID, how to construct a Query?
I'm doing some extension development (in C) and have a situation where I need to examine the target list of a view, but all I have is the view's oid. An approach that works is (pseudocode):

SPI_connect();
"SELECT ev_action FROM pg_catalog.pg_rewrite WHERE rulename = '_RETURN' and ev_class=?oid";
Query *query = linitial(stringToNode(ev_action));
...
SPI_finish();

I backed into this by tracing through pg_getviewdef(). Is there a more direct way to do this without going through SPI? I also looked at using postgres.c#pg_analyze_and_rewrite() against a query like "SELECT * FROM viewname" but the target list of the actual query wasn't what I was expecting (individual entry tags don't match those of the SPI approach above). Thanks for your time! eric
Re: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates
On Wed, Dec 9, 2015 at 2:15 PM, Peter Geoghegan wrote: > I think that you're missing that patch 0001 formally forbids > abbreviated keys that are pass-by-value Sorry. I do of course mean it forbids abbreviated keys that are *not* pass-by-value (that are pass-by-reference). -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries
David Fetter writes: > On Tue, Dec 08, 2015 at 12:13:41PM -0500, Tom Lane wrote: >> Hm. At least in the first of these cases, the problem is that the >> code I committed yesterday doesn't account for indirect lateral >> dependencies. That is, if S1 depends on S2 which depends on the >> inner side of an outer join, it now knows not to join S2 directly to >> the outer side of the outer join, but it doesn't realize that the >> same must apply to S1. >> >> Maybe we should redefine lateral_relids as the transitive closure of >> a rel's lateral dependencies? Not sure. > That seems like it would fix the problem once and for good. Is there > any reason to believe that the lateral dependencies could go in a > loop, i.e. is there a reason to put in checks for same in the > transitive closure construction? It should not be possible to have a loop, since rels can only have lateral references to rels that appeared syntactically before them. But an Assert about it doesn't seem a bad thing. After working on this for awhile, I've found that indeed converting lateral_relids to a transitive closure makes things better. But there was another bug (or two other bugs, depending on how you count) exposed by Andreas' examples. I was right after all to suspect that the "hazardous PHV" condition needs to be accounted for by join_is_legal; as things stood, join_is_legal might allow a join for which every possible path would be rejected by check_hazardous_phv. More, if we were insisting that a join PHV be calculated before we could pass it to some other relation, we didn't actually do anything to ensure that that join would get built. Given an appropriate set of conditions, the clauseless-join heuristics could decide that that join wasn't interesting and never build it, again possibly leading to planner failure. So join_is_legal's cousins have_join_order_restriction and has_join_restriction also need to know about this issue. 
(This is a bit of a mess; I'm beginning to wonder if we shouldn't bite the bullet and teach the executor how to handle non-Var NestLoopParams, so that the planner doesn't have to work around that. But I'd rather not back-patch such a change.) I also noticed that lateral_referencers should be a transitive closure as well. I think that oversight doesn't lead to any observable bug, but it would lead to constructing index paths that cannot be used, so fixing it should save some planner cycles. So that leads to the attached patch, which I think is good as-is for the back branches. In this state of the code, the LateralJoinInfo list is darn near useless: the only thing we use it for is detecting whether two relations have a direct (not transitive closure) lateral reference. I'm strongly tempted to replace that logic by keeping a copy of the pre-closure lateral_relids sets, and then we could drop LateralJoinInfo altogether, which would make create_lateral_join_info quite a bit shorter and faster. That could go into HEAD/9.5, but I'm afraid to back-patch it further than 9.5 for fear that third-party code somewhere might be relying on the LateralJoinInfo data structure. Comments? regards, tom lane

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 0f04033..53d8fdd 100644
*** a/src/backend/optimizer/path/joinpath.c
--- b/src/backend/optimizer/path/joinpath.c
*** allow_star_schema_join(PlannerInfo *root
*** 255,308 ****
  }

  /*
- * There's a pitfall for creating parameterized nestloops: suppose the inner
- * rel (call it A) has a parameter that is a PlaceHolderVar, and that PHV's
- * minimum eval_at set includes the outer rel (B) and some third rel (C).
- * We might think we could create a B/A nestloop join that's parameterized by
- * C. But we would end up with a plan in which the PHV's expression has to be
- * evaluated as a nestloop parameter at the B/A join; and the executor is only
- * set up to handle simple Vars as NestLoopParams. Rather than add complexity
- * and overhead to the executor for such corner cases, it seems better to
- * forbid the join. (Note that existence of such a PHV probably means there
- * is a join order constraint that will cause us to consider joining B and C
- * directly; so we can still make use of A's parameterized path with B+C.)
- * So we check whether any PHVs used in the query could pose such a hazard.
- * We don't have any simple way of checking whether a risky PHV would actually
- * be used in the inner plan, and the case is so unusual that it doesn't seem
- * worth working very hard on it.
- *
- * This case can occur whether or not the join's remaining parameterization
- * overlaps param_source_rels, so we have to check for it separately from
- * allow_star_schema_join, even though it looks much like a star-schema case.
- */
- static inline bool
- check_hazardous_phv(PlannerInfo *root,
- 					Path *outer_path,
- 					Path *inner_path)
- {
- 	Relids i
Re: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates
On Wed, Dec 9, 2015 at 11:31 AM, Robert Haas wrote: > I find the references to a "void" representation in this patch to be > completely opaque. I see that there are some such references in > tuplesort.c already, and most likely they were put there by commits > that I did, so I guess I have nobody but myself to blame, but I don't > know what this means, and I don't think we should let this terminology > proliferate. > > My understanding is that the "void" representation is intended to be > whatever Datum we originally got, which might be a pointer. Why not > just say that instead of referring to it this way? That isn't what is intended. "void" is the state that macros like index_getattr() leave NULL leading attributes (that go in the SortTuple.datum1 field) in. However, the function tuplesort_putdatum() requires callers to initialize their Datum to 0 now, which is new. A "void" representation is a would-be NULL pointer in the case of pass-by-value types, and a NULL pointer for pass-by-reference types. > My understanding is also that it's OK if the abbreviated key stays the > same even though the value has changed, but that the reverse could > cause queries to return wrong answers. The first part of that > justifies why this is safe when no abbreviation is available: we'll > return an abbreviated value of 0 for everything, but that's fine. > However, using the original Datum (which might be a pointer) seems > unsafe, because two binary-identical values could be stored at > different addresses and thus have different pointer representations. > > I'm probably missing something here, so clue me in... I think that you're missing that patch 0001 formally forbids abbreviated keys that are pass-by-value
This is already something that is all but forbidden, although the datum case does tacitly acknowledge the possibility by not allowing abbreviation to work with the pass-by-value-and-yet-abbreviated case. I think that this revision is also useful for putting abbreviated keys in indexes, something that may happen yet. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speedup twophase transactions
Thanks, Kevin.

> I assume that last one should have been *Patched master with 2PC"?

Yes, this list should look like this:

Current master without 2PC: ~42 ktps
Current master with 2PC: ~22 ktps
Patched master with 2PC: ~36 ktps

I have also created a CommitFest entry for this patch. -- Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

> On 10 Dec 2015, at 00:37, Kevin Grittner wrote:
>
> On Wed, Dec 9, 2015 at 12:44 PM, Stas Kelvich
> wrote:
>
>> Now 2PC in postgres does the following:
>> * on prepare, 2pc data (subxacts, commitrels, abortrels, invalmsgs) is saved to
>> xlog and to a file, but the file is not fsynced
>> * on commit, the backend reads the data from the file
>> * if a checkpoint occurs before commit, then the files are fsynced during the
>> checkpoint
>> * in case of crash, replay will move data from xlog to files
>>
>> In this patch I've changed these procedures to the following:
>> * on prepare, the backend writes data only to xlog and stores a pointer to the start
>> of the xlog record
>> * if commit occurs before a checkpoint, then the backend reads data from xlog by
>> this pointer
>> * on checkpoint, 2pc data is copied to files and fsynced
>> * if commit happens after a checkpoint, then the backend reads the files
>> * in case of crash, replay will move data from xlog to files (as it was
>> before the patch)
>
> That sounds like a very good plan to me.
>
>> Now results of benchmark are following (dual 6-core xeon server):
>>
>> Current master without 2PC: ~42 ktps
>> Current master with 2PC: ~22 ktps
>> Current master with 2PC: ~36 ktps
>
> I assume that last one should have been *Patched master with 2PC"?
>
> Please add this to the January CommitFest.
>
> --
> Kevin Grittner
> EDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()
Robert Haas writes: > The comment in mdnblocks.c says this: > * Because we pass O_CREAT, we will create the > next segment (with > * zero length) immediately, if the last > segment is of length > * RELSEG_SIZE. While perhaps not strictly > necessary, this keeps > * the logic simple. > I don't really see how this "keeps the logic simple". My (vague) recollection is that this is defending some assumptions elsewhere in md.c. You sure you haven't broken anything with this? Relation extension across a segment boundary, perhaps? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Redundant sentence within INSERT documentation page (exclusion constraints)
On Wed, Dec 9, 2015 at 11:13 AM, Robert Haas wrote: > I don't know nearly as much about ON CONFLICT DO UPDATE as Andres, but > even I can spot a redundancy when somebody shoves the evidence right > under my nose. So, committed and back-patched to 9.5. Thanks. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] mdnblocks() sabotages error checking in _mdfd_getseg()
This is per a report by an EnterpriseDB customer and a bunch of off-list analysis by Kevin Grittner and Rahila Syed. Suppose you have a large relation with OID 123456. There are segment files 123456, 123456.1, and 123456.2. Due to some kind of operating system malfeasance, 123456.1 disappears; you are officially in trouble. Now, a funny thing happens. The next time you call mdnblocks() on this relation, which will probably happen pretty quickly since every sequential scan does one, it will see that 123456 is a complete segment and it will create an empty 123456.1. It and any future mdnblocks() calls will report that the length of the relation is equal to the length of one full segment, because they don't check for the next segment unless the current segment is completely full. Now, if subsequent to this an index scan happens to sweep through and try to fetch a block in 123456.2, it will work! This happens because _mdfd_getseg() doesn't care about the length of the segments; it only cares whether or not they exist. If 123456.1 were actually missing, then we'd never test whether 123456.2 exists and we'd get an error. But because mdnblocks() created 123456.1, _mdfd_getseg() is now quite happy to fetch blocks in 123456.2; it considers the empty 123456.1 file to be a sufficient reason to look for 123456.2, and seeing that file, and finding the block it wants therein, it happily returns that block, blithely ignoring the fact that it passed over a completely empty .1 segment before returning a block from .2. This is maybe not the smartest thing ever. The comment in mdnblocks.c says this: * Because we pass O_CREAT, we will create the next segment (with * zero length) immediately, if the last segment is of length * RELSEG_SIZE. While perhaps not strictly necessary, this keeps * the logic simple. I don't really see how this "keeps the logic simple". 
What it does is allow sequential scans and index scans to have two completely different notions of how many accessible blocks there are in the relation. Granted, this kind of thing really shouldn't happen, but sometimes bad things do happen. Therefore, I propose the attached patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company RM36310.patch Description: binary/octet-stream -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
I've been giving RecoveryTest.pm a look. I wonder if we really need that as a separate package. My first thought was that we could have another class that inherits from PostgresNode (say RecoveryNode). But later it occurred to me that we could have the new functions just be part of PostgresNode itself directly; so we would have some new PostgresNode methods: $node->enable_streaming $node->enable_restoring $node->enable_archiving $node->wait (your RecoveryTest::wait_for_node; better name for this?) and some additional constructors: make_master make_stream_standby make_archive_standby -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speedup twophase transactions
On Wed, Dec 9, 2015 at 12:44 PM, Stas Kelvich wrote: > Now 2PC in postgres does following: > * on prepare 2pc data (subxacts, commitrels, abortrels, invalmsgs) saved to > xlog and to file, but file not is not fsynced > * on commit backend reads data from file > * if checkpoint occurs before commit, then files are fsynced during checkpoint > * if case of crash replay will move data from xlog to files > > In this patch I’ve changed this procedures to following: > * on prepare backend writes data only to xlog and store pointer to the start > of the xlog record > * if commit occurs before checkpoint then backend reads data from xlog by > this pointer > * on checkpoint 2pc data copied to files and fsynced > * if commit happens after checkpoint then backend reads files > * in case of crash replay will move data from xlog to files (as it was before > patch) That sounds like a very good plan to me. > Now results of benchmark are following (dual 6-core xeon server): > > Current master without 2PC: ~42 ktps > Current master with 2PC: ~22 ktps > Current master with 2PC: ~36 ktps I assume that last one should have been *Patched master with 2PC"? Please add this to the January CommitFest. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [DOCS] max_worker_processes on the standby
Robert Haas wrote: > On Mon, Dec 7, 2015 at 8:33 AM, Fujii Masao wrote: > > So firstly you will push those "latest" changes soon? > > It seems like these changes haven't been pushed yet, and unfortunately > that's probably a beta blocker. I'm on this. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Some questions about the array.
On Wed, Dec 2, 2015 at 3:01 PM, Merlin Moncure wrote: > On Tue, Dec 1, 2015 at 8:46 AM, YUriy Zhuravlev > wrote: >> On Tuesday 01 December 2015 08:38:21 you wrote: >>> it (zero >>> based indexing support) doesn't meet the standard of necessity for >>> adding to the core API and as stated it's much to magical. >> >> We do not touch the arrays, we simply create a function to access them with a >> comfortable behavior. Creating a separate array types in the form extension >> is >> very difficult IMHO. > > Correct; what I'm saying is that we don't need core API support for > zero based array indexing. Yes. I think adding new functions that use an indexing convention inconsistent with the one we're using for everything else ought to be completely out of the question. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [DOCS] max_worker_processes on the standby
On Mon, Dec 7, 2015 at 8:33 AM, Fujii Masao wrote: > On Sat, Dec 5, 2015 at 12:56 AM, Alvaro Herrera > wrote: >> Fujii Masao wrote: >> >>> Sorry for not reviewing the patch before you push it... >>> >>> In HEAD, I ran very simple test case: >>> >>> 1. enable track_commit_timestamp >>> 2. start the server >>> 3. run some transactions >>> 4. execute pg_last_committed_xact() -- returns non-null values >>> 5. shutdown the server with immdiate mode >>> 6. restart the server -- crash recovery happens >>> 7. execute pg_last_committed_xact() >>> >>> The last call of pg_last_committed_xact() returns NULL values, which means >>> that the xid and timestamp information of the last committed transaction >>> disappeared by crash recovery. Isn't this a bug? >> >> Hm, not really, because the status of the "last" transaction is kept in >> shared memory as a cache and not expected to live across a restart. >> However, I tested the equivalent scenario: >> >> alvherre=# create table fg(); >> CREATE TABLE >> >> alvherre=# select ts.* from pg_class,pg_xact_commit_timestamp(xmin) ts where >> relname = 'fg'; >> ts >> --- >> 2015-12-04 12:41:48.017976-03 >> (1 fila) >> >> then crash the server, and after recovery the data is gone: >> >> alvherre=# select ts.*, xmin, c.relname from pg_class >> c,pg_xact_commit_timestamp(xmin) ts where relname = 'fg'; >> ts | xmin | relname >> +--+- >> | 630 | fg >> (1 fila) >> >> Not sure what is going on; my reading of the code certainly says that >> the data should be there. I'm looking into it. >> >> I also noticed that I didn't actually push the whole of the patch >> yesterday -- I neglected to "git add" the latest changes, the ones that >> fix the promotion scenario :-( so the commit messages is misleading >> because it describes something that's not there. > > So firstly you will push those "latest" changes soon? It seems like these changes haven't been pushed yet, and unfortunately that's probably a beta blocker. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.
On Mon, Dec 7, 2015 at 2:41 AM, Amit Langote wrote: > On 2015/12/03 19:05, Kyotaro HORIGUCHI wrote: >> At Thu, 3 Dec 2015 16:24:32 +0900, Amit Langote >> wrote >>> By the way, there are some non-st_progress_* fields that I was talking >>> about in my previous message. For example, st_activity_flag (which I have >>> suggested to rename to st_command instead). It needs to be set once at the >>> beginning of the command processing and need not be touched again. I think >>> it may be a better idea to do the same for table name or OID. It also >>> won't change over the duration of the command execution. So, we could set >>> it once at the beginning where that becomes known. Currently in the patch, >>> it's reported in st_progress_message[0] by every pgstat_report_progress() >>> invocation. So, the table name will be strcpy()'d to shared memory for >>> every scanned block that's reported. >> >> If we don't have dedicate reporting functions for each phase >> (that is, static data phase and progress phase), the one function >> pgstat_report_progress does that by having some instruction from >> the caller and it would be a bitfield. >> >> Reading the flags for making decision whether every field to copy >> or not and branching by that seems too much for int32 (or maybe >> 64?) fields. Alhtough it would be in effective when we have >> progress_message fields, I don't think it is a good idea without >> having progress_message. >> >>> pgstat_report_progress(uint32 *param1, uint16 param1_bitmap, >>>char param2[][..], uint16 param2_bitmap) >>> { >>> ... >>> for(i = 0; i < 16 ; i++) >>> { >>> if (param1_bitmap & (1 << i)) >>>beentry->st_progress_param[i] = param1[i]; >>> if (param2_bitmap & (1 << i)) >>>strcpy(beentry->..., param2[i]); >>> } >> >> Thoughts? > > Hm, I guess progress messages would not change that much over the course > of a long-running command. 
How about we pass only the array(s) that we > change and NULL for others: > > With the following prototype for pgstat_report_progress: > > void pgstat_report_progress(uint32 *uint32_param, int num_uint32_param, >bool *message_param[], int num_message_param); > > If we just wanted to change, say scanned_heap_pages, then we report that > using: > > uint32_param[1] = scanned_heap_pages; > pgstat_report_progress(uint32_param, 3, NULL, 0); > > Also, pgstat_report_progress() should check each of its parameters for > NULL before iterating over to copy. So in most reports over the course of > a command, the message_param would be NULL and hence not copied. It's going to be *really* important that this facility provides a lightweight way of updating progress, so I think this whole API is badly designed. VACUUM, for example, is going to want a way to update the individual counter for the number of pages it's scanned after every page. It should not have to pass all of the other information that is part of a complete report. It should just be able to say pgstat_report_progress_update_counter(1, pages_scanned) or something of this sort. Don't marshal all of the data and then make pgstat_report_progress figure out what's changed. Provide a series of narrow APIs where the caller tells you exactly what they want to update, and you update only that. It's fine to have a pgstat_report_progress() function to update everything at once, for the use at the beginning of a command, but the incremental updates within the command should do something lighter-weight. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Reusing abbreviated keys during second pass of ordered [set] aggregates
On Sat, Oct 10, 2015 at 9:03 PM, Peter Geoghegan wrote: > On Fri, Sep 25, 2015 at 2:39 PM, Jeff Janes wrote: >> This needs a rebase, there are several conflicts in >> src/backend/executor/nodeAgg.c > > I attached a revised version of the second patch in the series, fixing > this bitrot. > > I also noticed a bug in tuplesort's TSS_SORTEDONTAPE case with the > previous patch, where no final on-the-fly merge step is required (no > merge step is required whatsoever, because replacement selection > managed to produce only one run). The function mergeruns() previously > only "abandoned" abbreviated ahead of any merge step iff there was > one. When there was only one run (not requiring a merge) it happened > to continue to keep its state consistent with abbreviated keys still > being in use. It didn't matter before, because abbreviated keys were > only for tuplesort to compare, but that's different now. > > That bug is fixed in this revision by reordering things within > mergeruns(). The previous order of the two things that were switched > is not at all significant (I should know, I wrote that code). I find the references to a "void" representation in this patch to be completely opaque. I see that there are some such references in tuplesort.c already, and most likely they were put there by commits that I did, so I guess I have nobody but myself to blame, but I don't know what this means, and I don't think we should let this terminology proliferate. My understanding is that the "void" representation is intended to be whatever Datum we originally got, which might be a pointer. Why not just say that instead of referring to it this way? My understanding is also that it's OK if the abbreviated key stays the same even though the value has changed, but that the reverse could cause queries to return wrong answers. The first part of that justifies why this is safe when no abbreviation is available: we'll return an abbreviated value of 0 for everything, but that's fine. 
However, using the original Datum (which might be a pointer) seems unsafe, because two binary-identical values could be stored at different addresses and thus have different pointer representations. I'm probably missing something here, so clue me in... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication and multimaster
On 2015-12-03 09:54:23 +0300, konstantin knizhnik wrote: > But right now performance of Multimaster is not limited by logical > replication protocol - if I remove DTM and use asynchronous replication > (lightweight version of BDR:) > then I get 38k TPS instead of 12k. My guess is that that's to a large degree because BDR 'batches' WAL flushes / fsyncs over several connections. As the data is only applied in one connection, whereas the primary can use multiple backends, it is important not to constantly flush, as that's synchronous. What I did for bdr was to register the 'desired' flush position whenever replaying a commit (c.f. dlist_push_tail(&bdr_lsn_association, &flushpos->node); in process_remote_commit()) and whenever feedback is sent figure out how far the WAL actually has been flushed (c.f. bdr_get_flush_position()). Now that cannot trivially be done with 2PC, but it doesn't look all that hard to change the 2PC API to allow at least some batching of the fsyncs. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication and multimaster
On Sun, Dec 6, 2015 at 10:24 PM, Craig Ringer wrote: > * An API to enumerate currently registered bgworkers > * A way to securely make a libpq connection from a bgworker without messing > with passwords etc. Generate one-time cookies, sometihng like that. > * (unimportant but nice API fix): BGW_NO_RESTART_ON_CRASH Why would you have the bgworker connect to the database via TCP instead of just doing whatever it wants to do directly? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Redundant sentence within INSERT documentation page (exclusion constraints)
On Sun, Dec 6, 2015 at 9:36 PM, Peter Geoghegan wrote: > There is a redundant second appearance of more or less the same > sentence at one point in the new INSERT documentation -- I missed this > during the recent overhaul of the 9.5+ INSERT documentation. > > Attached is a trivial patch that fixes this issue. I don't know nearly as much about ON CONFLICT DO UPDATE as Andres, but even I can spot a redundancy when somebody shoves the evidence right under my nose. So, committed and back-patched to 9.5. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_hba_lookup function to get all matching pg_hba.conf entries
Haribabu Kommi wrote: > Reverting the context release patch is already provided in the first > mail of this > thread [1]. Forgot to mention about the same in further mails. > > Here I attached the same patch. This patch needs to be applied first before > pg_hba_lookup patch. I tested it in windows version also. So if you change the file and reload repeatedly, we leak all the memory allocated for HBA lines in TopMemoryContext? This doesn't sound great. Perhaps we need a dedicated context which can be reset at will so that it can be refilled with the right info when we reload the file. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Speedup twophase transactions
Hello. While working with cluster stuff (DTM, tsDTM) we noted that postgres 2pc transactions are approximately two times slower than an ordinary commit on a workload with fast transactions — few single-row updates and COMMIT or PREPARE/COMMIT. Perf top showed that a lot of time is spent in the kernel on fopen/fclose, so it was worth a try to reduce file operations for 2pc transactions. Now 2PC in postgres does the following: * on prepare, 2pc data (subxacts, commitrels, abortrels, invalmsgs) is saved to xlog and to a file, but the file is not fsynced * on commit, the backend reads data from the file * if a checkpoint occurs before commit, then the files are fsynced during the checkpoint * in case of crash, replay will move data from xlog to files In this patch I’ve changed these procedures to the following: * on prepare, the backend writes data only to xlog and stores a pointer to the start of the xlog record * if commit occurs before a checkpoint, then the backend reads data from xlog by this pointer * on checkpoint, 2pc data is copied to files and fsynced * if commit happens after a checkpoint, then the backend reads the files * in case of crash, replay will move data from xlog to files (as it was before the patch) Most of these ideas were already mentioned in a 2009 thread by Michael Paquier http://www.postgresql.org/message-id/c64c5f8b0908062031k3ff48428j824a9a46f2818...@mail.gmail.com where he suggested to store 2pc data in shared memory. At that time the patch was declined because no significant speedup was observed. Now I see performance improvements from my patch of about 60%. Probably the old benchmark's overall tps was lower, so it was harder to hit filesystem fopen/fclose limits. 
Now results of benchmark are following (dual 6-core xeon server): Current master without 2PC: ~42 ktps Current master with 2PC: ~22 ktps Current master with 2PC: ~36 ktps Benchmark done with following script: \set naccounts 10 * :scale \setrandom from_aid 1 :naccounts \setrandom to_aid 1 :naccounts \setrandom delta 1 100 \set scale :scale+1 BEGIN; UPDATE pgbench_accounts SET abalance = abalance - :delta WHERE aid = :from_aid; UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :to_aid; PREPARE TRANSACTION ':client_id.:scale'; COMMIT PREPARED ':client_id.:scale'; 2pc_xlog.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unicode collations in FreeBSD 11, DragonFly BSD 4.4 without ICU
On Sun, Dec 6, 2015 at 4:45 PM, Thomas Munro wrote: > Since the ICU patch from the BSD ports trees has been discussed on > this list a few times I thought people might be interested in this > news: > > https://lists.freebsd.org/pipermail/freebsd-current/2015-October/057781.html > https://www.dragonflybsd.org/release44/ > https://forums.freebsd.org/threads/this-is-how-i-like-open-source.53943/ > > (Thanks to pstef on #postgresql for pointing me at this). Nice. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Freeze avoidance of very large table.
On Mon, Nov 30, 2015 at 12:58 PM, Bruce Momjian wrote: > On Mon, Nov 30, 2015 at 10:48:04PM +0530, Masahiko Sawada wrote: >> On Sun, Nov 29, 2015 at 2:21 AM, Jeff Janes wrote: >> > On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada >> > wrote: >> >> >> >> Yeah, we need to consider to compute checksum if enabled. >> >> I've changed the patch, and attached. >> >> Please review it. >> > >> > Thanks for the update. This now conflicts with the updates doesn to >> > fix pg_upgrade out-of-space issue on Windows. I've fixed (I think) the >> > conflict in order to do some testing, but I'd like to get an updated >> > patch from the author in case I did it wrong. I don't want to find >> > bugs that I just introduced myself. >> > >> >> Thank you for having a look. > > I would not bother mentioning this detail in the pg_upgrade manual page: > > + Since the format of visibility map has been changed in version 9.6, > + pg_upgrade creates and rewrite new > '_vm' > + file even if upgrading from 9.5 or before to 9.6 or later with link mode > (-k). Really? I know we don't always document things like this, but it seems like a good idea to me that we do so. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Making tab-complete.c easier to maintain
On Wed, Dec 09, 2015 at 03:49:20PM +, Greg Stark wrote: > On Wed, Dec 9, 2015 at 2:27 PM, David Fetter wrote: > > Agreed that the "whole new language" aspect seems like way too big a > > hammer, given what it actually does. > > Which would be easier to update when things change? This question seems closer to being on point with the patch sets proposed here. > Which would be possible to automatically generate from gram.y? This seems like it goes to a wholesale context-aware reworking of tab completion rather than the myopic ("What has happened within the past N tokens?", for slowly increasing N) versions of tab completions in both the current code and in the two proposals here. A context-aware tab completion wouldn't care how many columns you were into a target list, or a FROM list, or whatever, as it would complete based on the (possibly nested) context ("in a target list", e.g.) rather than on inferences made from some slightly variable number of previous tokens. Cheers, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallel joins, and better parallel explain
On Fri, Dec 4, 2015 at 3:07 AM, Amit Kapila wrote: > Do you think it will be useful to display in a similar way if worker > is not able to execute plan (like before it starts execution, the other > workers have already finished the work)? Maybe, but it would clutter the output a good deal. I think we should instead have the Gather node itself display the number of workers that it actually managed to launch, and then the user can infer that any execution nodes that don't mention those workers were not executed by them. > Other than that parallel-explain-v2.patch looks good. OK, I'm going to go ahead and commit that part. > I think the problem is at Gather node, the number of buffers (read + hit) > are greater than the number of pages in relation. The reason why it > is doing so is that in Workers (ParallelQueryMain()), it starts the buffer > usage accumulation before ExecutorStart() whereas in master backend > it always calculate it for ExecutorRun() phase, on changing it to accumulate > for ExecutorRun() phase above problem is fixed. Attached patch fixes the > problem. Why is it a bad thing to capture the cost of doing ExecutorStart() in the worker? I can see there's an argument that changing this would be more consistent, but I'm not totally convinced. The overhead of ExecutorStart() in the leader isn't attributable to any specific worker, but the overhead of ExecutorStart() in the worker can fairly be blamed on Gather, I think. I'm not dead set against this change but I'm not totally convinced, either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Using quicksort for every external sort run
On 09/12/15 00:02, Jeff Janes wrote: > The second one consumes that giant tape run along with 232 small tape > runs. In terms of number of comparisons, binary merge works best when the inputs are of similar length. I'd assume the same goes for n-ary merge, but I don't know if comparison count is an issue here. -- Cheers, Jeremy -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Confusing results with lateral references
On Fri, Dec 4, 2015 at 10:23 AM, Tom Lane wrote: > Ashutosh Bapat writes: >> I am seeing different results with two queries which AFAIU have same >> semantics and hence are expected to give same results. > >> postgres=# select * from t1, (select distinct val, val2 from t2) t2ss where >> t1.val = t2ss.val for update of t1; > >> postgres=# select * from t1, lateral (select distinct val, val2 from t2 >> where t2.val = t1.val) t2ss for update of t1; > > (I renamed your inline sub-selects to avoid confusion between them and the > table t2.) > > I'm skeptical that those should be claimed to have identical semantics. > > In the first example, after we've found the join row (1,1,1,1), we block > to see if the pending update on t1 will commit. After it does, we recheck > the join condition using the updated row from t1 (and the original row > from t2ss). The condition fails, so the updated row is not output. Check. > The same thing happens in the second example, ie, we consider the updated > row from t1 and the non-updated row from t2ss (NOT t2). There are no join > conditions to recheck (in the outer query level), so the row passes, and > we output it. What's surprising is that t2.val = t1.val isn't rechecked here. I think that's not really possible, because of the DISTINCT operation, which prevents us from identifying a single row from t2 that accounts for the subquery's output row. Not sure whether it would work without the DISTINCT. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
On Wed, Dec 9, 2015 at 3:22 AM, Etsuro Fujita wrote: > Sorry, my explanation might be not enough, but I'm not saying to hide the > subplan. I think it would be better to show the subplan somewhere in the > EXPLAIN outout, but I'm not sure that it's a good idea to show that in the > current form. We have two plan trees; one for normal query execution and > another for EvalPlanQual testing. I think it'd be better to show the > EXPLAIN output the way that allows users to easily identify each of the plan > trees. It's hard to do that because we don't identify that internally anywhere. Like I said before, the possibility of a ForeignScan having an outer subplan is formally independent of the new EPQ stuff, and I'd prefer to maintain that separation and just address this with documentation. Getting this bug fixed has been one of the more exhausting experiences of my involvement with PostgreSQL, and to be honest, I think I'd like to stop spending too much time on this now and work on getting the feature that this is intended to support working. Right now, the only people who can have an opinion on this topic are those who are following this thread in detail, and there really aren't that many of those. If we get the feature - join pushdown for postgres_fdw - working, then we might get some feedback from users about what they like about it or don't, and certainly if this is a frequent complaint then that bolsters the case for doing something about it, and possibly also helps us figure out what that thing should be. On the other hand, if we don't get the feature because we're busy debating interface details related to this patch, then none of these details matter anyway because nobody except developers is actually running the code in question. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Rework the way multixact truncations work
On 2015-12-09 11:18:39 -0500, Robert Haas wrote: > If I correctly understand the scenario that you are describing, that > does happen - not for the same MXID, but for different ones. At least > the last time I checked, and I'm not sure if we've fixed this, it > could happen because the SLRU page that contains the multixact wasn't > flushed out of the SLRU buffers yet. That should be fixed, with the brute force solution of flushing buffers before searching for files on disk. > But apart from that, it could > happen any time there's a gap in the sequence of files, and that sure > doesn't seem like a can't-happen situation. We know that, on 9.3, > there's definitely a sequence of events that leads to a file > followed by a gap followed by the series of files that are still live. > Given the number of other bugs we've fixed in this area, I would not > like to bet on that being the only scenario where this crops up. It > *shouldn't* happen, and as far as we know, if you start and end on a > version newer than 4f627f8 and aa29c1c, it won't. Older branches, > though, I wouldn't like to bet on. Ok, fair enough. andres
Re: [HACKERS] Rework the way multixact truncations work
On Wed, Dec 9, 2015 at 10:41 AM, Andres Freund wrote: >> (I am glad you talked the author out of back-patching; otherwise, >> 9.4.5 and 9.3.10 would have introduced a data loss bug.) > > Isn't that a bug in a, as far as we know, impossible scenario? Unless I > miss something there's no known case where it's "expected" that > find_multixact_start() fails after initially succeeding? Sure, it sucks > that the bug survived review and that it was written in the first > place. But it not showing up during testing isn't meaningful, given it's > a should-never-happen scenario. If I correctly understand the scenario that you are describing, that does happen - not for the same MXID, but for different ones. At least the last time I checked, and I'm not sure if we've fixed this, it could happen because the SLRU page that contains the multixact wasn't flushed out of the SLRU buffers yet. But apart from that, it could happen any time there's a gap in the sequence of files, and that sure doesn't seem like a can't-happen situation. We know that, on 9.3, there's definitely a sequence of events that leads to a file followed by a gap followed by the series of files that are still live. Given the number of other bugs we've fixed in this area, I would not like to bet on that being the only scenario where this crops up. It *shouldn't* happen, and as far as we know, if you start and end on a version newer than 4f627f8 and aa29c1c, it won't. Older branches, though, I wouldn't like to bet on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Rework the way multixact truncations work
On Wed, Dec 9, 2015 at 9:43 AM, Noah Misch wrote: > Andres writing the patch that became commit 4f627f8 and you reviewing it were > gifts to Alvaro and to the community. Aware of that, I have avoided[1] saying > that I was shocked to see that commit's defects. Despite a committer-author > and _two_ committer reviewers, the patch was rife with wrong new comments, > omitted updates to comments it caused to become wrong, and unsolicited > whitespace churn. (Anyone could have missed the data loss bug, but these > collectively leap off the page.) This in beleaguered code critical to data > integrity. You call this thread's latest code a patch submission, but I call > it bandaging the tree after a recent commit that never should have reached the > tree. Hey, if you'd like me to post the traditional patch files, that's easy. > It would have been easier for me. I posted branches because it gives more > metadata to guide review. As for the choice to revert and redo ... Yes, I'd like patch files, one per topic. I wasn't very happy with the way that patch was written; it seemed to me that it touched too much code and moved a lot of things around unnecessarily, and I said so at the time. I would have preferred something more incremental, and I asked for it and didn't get it. Well, I'm not giving up: I'm asking for the same thing here. I didn't think it was a good idea for Andres to rearrange that much code in a single commit, because it was hard to review, and I don't think it's a good idea for you to do it, either. To the extent that you found bugs, I think that proves the point that large commits are hard to review and small commits that change things just a little bit at a time are the way to go. > I hope those who have not already read commit 4f627f8 will not waste time > reading it. They should instead ignore multixact changes from commit 4f627f8 > through its revert.
The 2015-09-26 commits have not appeared in a supported > release, and no other work has built upon them. They have no tenure. (I am > glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10 > would have introduced a data loss bug.) Nobody reported a single defect > before my review overturned half the patch. A revert will indeed impose on > those who invested time to understand commit 4f627f8, but the silence about > its defects suggests the people understanding it number either zero or one. > Even as its author and reviewers, you would do better to set aside what you > thought you knew about this code. I just don't find this a realistic model of how people use the git log. Maybe you use it this way; I don't. I don't *want* git blame to make it seem as if 4f627f8 is not part of the history. For better or worse, it is. Ripping it out and replacing it monolithically will not change that; it will only make the detailed history harder to reconstruct, and I *will* want to reconstruct it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Making tab-complete.c easier to maintain
On Wed, Dec 9, 2015 at 2:27 PM, David Fetter wrote: > Agreed that the "whole new language" aspect seems like way too big a > hammer, given what it actually does. Which would be easier to update when things change? Which would be possible to automatically generate from gram.y? -- greg
Re: [HACKERS] Rework the way multixact truncations work
On 2015-12-09 09:43:19 -0500, Noah Misch wrote: > Aware of that, I have avoided[1] saying that I was shocked to see that > commit's defects. Despite a committer-author and _two_ committer > reviewers, the patch was rife with wrong new comments, omitted updates > to comments it caused to become wrong, It's not like that patch wasn't posted for review for months... > and unsolicited whitespace churn. Whitespace churn? The commit includes a pgindent run, because Alvaro asked me to do that, but that just affected a handful of lines. If you mean the variable ordering: given several variables were renamed anyway, additionally putting them in an easier-to-understand order seems rather painless. If you mean 'pgindent immune' long lines - multixact.c is far from the only one with those, and they're pretty harmless. > You call this thread's latest code a patch > submission, but I call it bandaging the tree after a recent commit > that never should have reached the tree. Oh, for Christ's sake. > Hey, if you'd like me to > post the traditional patch files, that's easy. It would have been > easier for me. You've been asked that, repeatedly. At least if you take 'traditional patch files' to include traditional, iterative patches on top of the current tree. > I hope those who have not already read commit 4f627f8 will not waste time > reading it. We have to, who knows what's hiding in there. Your git log even shows that you had conflicts in your approach (83cb04 Conflicts: src/backend/access/transam/multixact.c). > They should instead ignore multixact changes from commit 4f627f8 > through its revert. The 2015-09-26 commits have not appeared in a supported > release, and no other work has built upon them. > They have no tenure. Man. > (I am glad you talked the author out of back-patching; otherwise, > 9.4.5 and 9.3.10 would have introduced a data loss bug.) Isn't that a bug in a, as far as we know, impossible scenario?
Unless I miss something there's no known case where it's "expected" that find_multixact_start() fails after initially succeeding? Sure, it sucks that the bug survived review and that it was written in the first place. But it not showing up during testing isn't meaningful, given it's a should-never-happen scenario. I'm actually kinda inclined to rip the whole "previous pass" logic out altogether, and replace it with a PANIC. It's a hard-to-test, should-never-happen scenario. If it happens, things have already seriously gone sour. > > That's not a one-time problem for Andres during the course of review; > > that is a problem for every single person who looks at the commit > > history from now until the end of time. > > It's a service to future readers that no line of "git blame master <...>" will > refer to a 2015-09-26 multixact commit. And a disservice for everyone doing git log, or git blame for intermediate states of the tree. The benefits for git blame are almost nonexistent; not seeing a couple of newlines changed, or not seeing some intermediate commits, isn't really important. > Blame reports will instead refer to > replacement commits designed to be meaningful for study in isolation. If I > instead structured the repairs as you ask, the blame would find one of 4f627f8 > or its various repair commits, none of which would be a self-contained unit of > development. So what? That's how development in general works. And how it actually happened in this specific case.
Re: [HACKERS] Include ppc64le build type for back branches
On Wed, Dec 9, 2015 at 2:54 AM, Tom Lane wrote: > Robert Haas writes: >> With respect to this particular thing, it's hard for me to imagine >> that anything will go wrong on ppcle that we wouldn't consider a >> back-patchable fix, so there might be no harm in adjusting >> config.guess and config.sub. > > FWIW, I also suspect that supporting ppc64le would not really take > much more than updating config.guess/config.sub; there's no evidence > in the git logs that we had to fix anything else in the newer branches. > > My concern here is about establishing project policy about whether > we will or won't consider back-patching support for newer platforms. > I think that the default answer should be "no", and I'd like to > see us set down some rules about what it takes to override that. > > Obviously, setting up a buildfarm member helps a good deal. But > is that sufficient, or necessary? I would say it's necessary but not sufficient. The other criterion I'd lay down is that you shouldn't need to really change anything in the code except maybe to fix up s_lock.h breakage or some equally minor wart. For example, suppose that somebody wanted a new platform which works just like UNIX except that open() takes 4 arguments instead of 3, and we've always got to pass 0 for the last one. Well, even though that's minor and easily done, I'm not going to argue for back-patching that to all supported branches. I don't think that would be good material for a back-patch; the necessary changes could flummox third-party code or just turn out to be buggy, and apparently this new platform wasn't that important up until now, so whatever. But I feel differently about this case because we basically already do support the platform, or so it seems. We just didn't know we supported it. Really, we expect our stuff to work anywhere that has a reasonably UNIX-like environment and, hey, it'll even use atomics automatically if it has a reasonably modern gcc. So, win. 
I don't think that we really gain anything by refusing to admit that the product works on a platform where it does work. We've put a lot of effort into being portable and I don't think we should feel bad about that. If we test PostgreSQL on a new platform and it works with no problems (or only trivial adjustments that seem like good back-patch candidates anyway), then I think it's fine to just say, yep, it works. I don't think that sets a precedent that we'll be willing to accept arbitrary changes in back-branches to make it work when it doesn't already. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Wed, Nov 18, 2015 at 2:06 PM, Masahiko Sawada wrote: > On Tue, Nov 17, 2015 at 7:52 PM, Kyotaro HORIGUCHI > wrote: >> Oops. >> >> At Tue, 17 Nov 2015 19:40:10 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI >> wrote in >> <20151117.194010.17198448.horiguchi.kyot...@lab.ntt.co.jp> >>> Hello, >>> >>> At Tue, 17 Nov 2015 18:13:11 +0900, Masahiko Sawada >>> wrote in >>> >>> > >> One question is that what is different between the leading "n" in >>> > >> s_s_names and the leading "n" of "n-priority"? >>> > > >>> > > Ah. Sorry for the ambiguous description. 'n' in s_s_names >>> > > representing an arbitrary integer number and that in "n-priority" >>> > > is literally an "n", meaning "a format with any number of >>> > > priority hosts" as a whole. As an instance, >>> > > >>> > > synchronous_replication_method = "n-priority" >>> > > synchronous_standby_names = "2, mercury, venus, earth, mars, jupiter" >>> > > >>> > > I added "n-" of "n-priority" to distinguish with "1-priority" so >>> > > if we won't provide "1-priority" for backward compatibility, >>> > > "priority" would be enough to represent the type. >>> > > >>> > > By the way, s_r_method is not essentially necessary but it would >>> > > be important to avoid complexity of autodetection of formats >>> > > including currently undefined ones. >>> > >>> > Thank you for your explanation, I understood that. >>> > >>> > It means that the format of s_s_names will be changed, which would be not >>> > good. >>> >>> I believe that the format of definition of "replication set"(?) >>> is not fixed and it would be more complex format to support >>> nested definition. This should be in very different format from >>> the current simple list of names. This is a selection among three >>> or possibly more designs in order to be tolerable for future >>> changes, I suppose. >>> >>> 1. Additional formats of definition in future will be stored in >>> elsewhere of s_s_names. >>> >>> 2.
Additional format will be stored in s_s_names, the format will >>> be automatically detected. >>> >>> 3. (ditto), the format is designated by s_r_method. >>> >>> 4. Any other way? >>> >>> I chose the third way. What do you think about future expansion >>> of the format? >>> > > I agree with the #3 way and the s_s_name format you suggested. > I think that it's extensible and is tolerable for future changes. > I'm going to implement the patch based on this idea if other hackers > agree with this design. > Please find the attached draft patch which supports multiple synchronous replication. This patch adds a GUC parameter synchronous_replication_method, which represents the method of synchronous replication. [Design of replication method] synchronous_replication_method has two values, 'priority' and '1-priority', for now. We can expand the kinds of value (e.g. 'quorum', 'json' etc.) in the future. * s_r_method = '1-priority' This method is for backward compatibility, so the syntax of s_s_names is the same as today. The behavior is the same as well. * s_r_method = 'priority' This method is for multiple synchronous replication using the priority method. The syntax of s_s_names is: <num_sync>, <standby_name> [, ...] For example, s_r_method = 'priority' and s_s_names = '2, node1, node2, node3' means that the master waits for acknowledgement from at least the 2 lowest-priority-number servers. If 4 standbys (node1 - node4) are available, the master server waits for acknowledgement from 'node1' and 'node2'. The status of each WAL sender is:

=# select application_name, sync_state from pg_stat_replication order by application_name;
 application_name | sync_state
------------------+------------
 node1            | sync
 node2            | sync
 node3            | potential
 node4            | async
(4 rows)

After 'node2' crashed, the master will wait for acknowledgement from 'node1' and 'node3'.
The status of each WAL sender is:

=# select application_name, sync_state from pg_stat_replication order by application_name;
 application_name | sync_state
------------------+------------
 node1            | sync
 node3            | sync
 node4            | async
(3 rows)

[Changing replication method] When we want to change the replication method, we have to change s_r_method first and then do pg_reload_conf(). After changing the replication method, we can change s_s_names. [Expanding replication method] If we want to add a new replication method, we need to implement two functions for each replication method: * int SyncRepGetSynchronousStandbysXXX(int *sync_standbys) This function obtains the list of standbys considered as synchronous at that time, and returns its length. * bool SyncRepGetSyncLsnXXX(XLogRecPtr *write_pos, XLogRecPtr *flush_pos) This function obtains the LSNs (write, flush) considered as synced. Also, debug code remains in this patch; you can trace this behavior by enabling the DEBUG_REPLICATION macro. Please give me feedback. Regards, -- Masahiko Sawada 000_multi_sync_replication_v1.patch Description: Binary data
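The selection rule described above (the first `<num_sync>` connected standbys in list order count as "sync", later connected ones are "potential", unlisted ones are "async") can be sketched as follows. This is an illustrative model of the proposed behavior, not code from the patch; the function name and data shapes are invented for the example.

```python
# Hypothetical model of the proposed 'priority' method: given the patch's
# s_s_names format '<num_sync>, name1, name2, ...' and the set of currently
# connected standbys, classify each standby the way pg_stat_replication's
# sync_state column would show it.
def classify_standbys(s_s_names, connected):
    parts = [p.strip() for p in s_s_names.split(",")]
    num_sync = int(parts[0])          # leading number: how many sync standbys
    priority_list = parts[1:]         # earlier position = higher priority
    states = {}
    synced = 0
    for name in priority_list:
        if name not in connected:
            continue                  # crashed/absent standbys are skipped
        if synced < num_sync:
            states[name] = "sync"
            synced += 1
        else:
            states[name] = "potential"
    for name in connected - set(priority_list):
        states[name] = "async"        # not listed: never synchronous
    return states
```

With s_s_names = '2, node1, node2, node3' and node1-node4 connected this reproduces the first table above, and dropping node2 reproduces the second (node3 is promoted from "potential" to "sync").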
Re: [HACKERS] [sqlsmith] Failed to generate plan on lateral subqueries
On Tue, Dec 08, 2015 at 12:13:41PM -0500, Tom Lane wrote: > Andreas Seltenreich writes: > >> I no longer see "failed to build any n-way joins" after pulling, > >> but there are still instances of "could not devise a query plan". > >> Samples below. > > > sorry, I spoke too soon: nine of the former have been logged > > through the night. I'm attaching a larger set of sample queries > > this time in case that there are still multiple causes for the > > observed errors. > > Hm. At least in the first of these cases, the problem is that the > code I committed yesterday doesn't account for indirect lateral > dependencies. That is, if S1 depends on S2 which depends on the > inner side of an outer join, it now knows not to join S2 directly to > the outer side of the outer join, but it doesn't realize that the > same must apply to S1. > > Maybe we should redefine lateral_relids as the transitive closure of > a rel's lateral dependencies? Not sure. That seems like it would fix the problem once and for all. Is there any reason to believe that the lateral dependencies could go in a loop, i.e. is there a reason to put in checks for same in the transitive closure construction? Cheers, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fet...@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
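What "redefine lateral_relids as the transitive closure" means can be sketched with a small fixed-point computation. This is not planner code, just an illustration under invented names: each relation maps to the set of relations it laterally references, and we keep propagating until nothing changes. A cycle would simply leave relations in their own closure rather than loop forever, which is one way to see why an explicit cycle check is a separate question from termination.

```python
# Hypothetical sketch: transitive closure of lateral dependencies.
# deps maps a rel to the set of rels it directly laterally references.
def lateral_closure(deps):
    closure = {r: set(d) for r, d in deps.items()}
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for r in closure:
            extra = set()
            for d in closure[r]:
                extra |= closure.get(d, set())
            if not extra <= closure[r]:
                closure[r] |= extra     # pull in indirect dependencies
                changed = True
    return closure
```

In Tom's example, S1 depends on S2 and S2 depends on the outer join's inner side J; the closure gives S1 the dependency on J as well, so the "don't join directly to the outer side" rule applies to S1 too.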
Re: [HACKERS] Rework the way multixact truncations work
On Tue, Dec 08, 2015 at 01:05:03PM -0500, Robert Haas wrote: > On Tue, Dec 8, 2015 at 6:43 AM, Andres Freund wrote: > > On 2015-12-04 21:55:29 -0500, Noah Misch wrote: > >> On Thu, Dec 03, 2015 at 07:03:21PM +0100, Andres Freund wrote: > >> > Sorry, but I really just want to see these changes as iterative patches > >> > ontop of 9.5/HEAD instead of this process. I won't revert the reversion > >> > if you push it anyway, but I think it's a rather bad idea. > >> > >> I hear you. > > > > Not just me. > > > >> I evaluated your request and judged that the benefits you cited > >> did not make up for the losses I cited. Should you wish to change my mind, > >> your best bet is to find defects in the commits I proposed. If I > >> introduced > >> juicy defects, that discovery would lend much weight to your conjectures. > > > > I've absolutely no interest in "proving you wrong". And my desire to > > review patches that are in a, in my opinion, barely reviewable format is > > pretty low as well. > > I agree. Noah, it seems to me that you are offering a novel theory of > how patches should be submitted, reviewed, and committed, but you've > got three people, two of them committers, telling you that we don't > like that approach. I seriously doubt you're going to find anyone who > does. Andres writing the patch that became commit 4f627f8 and you reviewing it were gifts to Alvaro and to the community. Aware of that, I have avoided[1] saying that I was shocked to see that commit's defects. Despite a committer-author and _two_ committer reviewers, the patch was rife with wrong new comments, omitted updates to comments it caused to become wrong, and unsolicited whitespace churn. (Anyone could have missed the data loss bug, but these collectively leap off the page.) This in beleaguered code critical to data integrity. You call this thread's latest code a patch submission, but I call it bandaging the tree after a recent commit that never should have reached the tree. 
Hey, if you'd like me to post the traditional patch files, that's easy. It would have been easier for me. I posted branches because it gives more metadata to guide review. As for the choice to revert and redo ... > When stuff gets committed to the tree, people want to be > able to answer the question "what has just now changed?" and it is > indisputable that what you want to do here will make that harder. I hope those who have not already read commit 4f627f8 will not waste time reading it. They should instead ignore multixact changes from commit 4f627f8 through its revert. The 2015-09-26 commits have not appeared in a supported release, and no other work has built upon them. They have no tenure. (I am glad you talked the author out of back-patching; otherwise, 9.4.5 and 9.3.10 would have introduced a data loss bug.) Nobody reported a single defect before my review overturned half the patch. A revert will indeed impose on those who invested time to understand commit 4f627f8, but the silence about its defects suggests the people understanding it number either zero or one. Even as its author and reviewers, you would do better to set aside what you thought you knew about this code. > That's not a one-time problem for Andres during the course of review; > that is a problem for every single person who looks at the commit > history from now until the end of time. It's a service to future readers that no line of "git blame master <...>" will refer to a 2015-09-26 multixact commit. Blame reports will instead refer to replacement commits designed to be meaningful for study in isolation. If I instead structured the repairs as you ask, the blame would find one of 4f627f8 or its various repair commits, none of which would be a self-contained unit of development. What's to enjoy about discovering that history? > I don't think you have the > right to force your proposed approach through in the face of concerted > opposition. That's correct; I do not have that right.
Your objection still worries me. nm [1] http://www.postgresql.org/message-id/20151029065903.gc770...@tornado.leadboat.com
Re: [HACKERS] Making tab-complete.c easier to maintain
On Wed, Dec 09, 2015 at 08:49:22PM +0900, Michael Paquier wrote: > On Wed, Dec 9, 2015 at 8:17 PM, Thomas Munro > wrote: > > On Mon, Dec 7, 2015 at 8:41 PM, Michael Paquier > > wrote: > >> On Tue, Nov 17, 2015 at 12:19 AM, Alvaro Herrera > >> wrote: > >>> Thomas Munro wrote: > New version attached, merging recent changes. > >>> > >>> I wonder about the TailMatches and Matches macros --- wouldn't it be > >>> better to have a single one, renaming TailMatches to Matches and > >>> replacing the current Matches() with an initial token that corresponds > >>> to anchoring to start of command? Just wondering, not terribly attached > >>> to the idea. > >> > >> + /* TODO:TM -- begin temporary, not part of the patch! */ > >> + Assert(!word_matches(NULL, "")); > >> + [...] > >> + Assert(!word_matches("foo", "")); > >> + /* TODO:TM -- end temporary */ > >> > >> Be sure to not forget to remove that later. > > > > Thanks for looking at this Michael. It's probably not much fun to > > review! Here is a new version with that bit removed. More responses > > inline below. > > I had a hard time not sleeping when reading it... That's very mechanical. > > > I agree that would probably be better but Alvaro suggested following > > the existing logic in the first pass, which was mostly based on tails, > > and then considering simpler head-based patterns in a future pass. > > Fine with me. > > So what do we do now? There is your patch, which is already quite big, > but as well a second patch based on regexps, which is far bigger. 
And > at the end they provide a similar result: > > Here is for example what the regexp patch does for some complex > checks, like ALTER TABLE RENAME: > /* ALTER TABLE xxx RENAME yyy */ > -else if (pg_strcasecmp(prev4_wd, "TABLE") == 0 && > - pg_strcasecmp(prev2_wd, "RENAME") == 0 && > - pg_strcasecmp(prev_wd, "CONSTRAINT") != 0 && > - pg_strcasecmp(prev_wd, "TO") != 0) > +else if (MATCH("TABLE #id RENAME !CONSTRAINT|TO")) > > And what Thomas's patch does: > +else if (TailMatches5("ALTER", "TABLE", MatchAny, "RENAME", MatchAny) && > + !TailMatches1("CONSTRAINT|TO")) > > The regexp patch makes the negative checks somewhat easier to read > (there are 19 positions in tab-complete.c doing that), but still, inventing > a new language and doing a heavy refactoring just for tab completion of > psql seems a bit too much IMO, so my heart balances in favor of > Thomas' stuff. Thoughts from others? Agreed that the "whole new language" aspect seems like way too big a hammer, given what it actually does. Cheers, David.
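The semantics of the TailMatchesN style quoted above (the real thing is a set of C preprocessor macros in tab-complete.c) can be modeled in a few lines: compare the last N words of the input case-insensitively against a pattern list, where a wildcard matches anything and `|` separates alternatives. This is a hedged Python model for illustration, not the patch's code.

```python
# Rough model of Thomas's TailMatches idea: MATCH_ANY plays the role of
# the patch's MatchAny, and "A|B" patterns match either alternative,
# case-insensitively, as in the quoted examples.
MATCH_ANY = object()

def word_matches(pattern, word):
    if pattern is MATCH_ANY:
        return True
    return any(alt.lower() == word.lower() for alt in pattern.split("|"))

def tail_matches(words, *patterns):
    """True if the last len(patterns) input words match the patterns."""
    if len(words) < len(patterns):
        return False
    tail = words[-len(patterns):]
    return all(word_matches(p, w) for p, w in zip(patterns, tail))
```

The ALTER TABLE RENAME check above then reads as: tail matches ("ALTER", "TABLE", any, "RENAME", any) and the last word does not match "CONSTRAINT|TO".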
Re: [HACKERS] Patch: ResourceOwner optimization for tables with many partitions
Hello, Robert Thanks for your review. I believe I fixed items 1, 2 and 3 (see attachment). Also I would like to clarify item 4. > 4. It mixes together multiple ideas in a single patch, not only > introducing a hashing concept but also striping a brand-new layer of > abstraction across the resource-owner mechanism. I am not sure that > layer of abstraction is a very good idea, but if it needs to be done, > I think it should be a separate patch. Do I understand correctly that you suggest the following? The current patch should be split in two parts. In the first patch we create and use ResourceArray with an array-based implementation (abstraction layer). Then we apply a second patch which changes the ResourceArray implementation to a hashing-based one (optimization). Best regards, Aleksander On Tue, 8 Dec 2015 17:30:33 -0500 Robert Haas wrote: > On Fri, Dec 4, 2015 at 7:15 AM, Aleksander Alekseev > wrote: > > Current implementation of ResourceOwner uses arrays to store > > resources like TupleDescs, Snapshots, etc. When we want to release > > one of these resources ResourceOwner finds it with linear scan. > > Granted, resource arrays are usually small so it's not a problem most > > of the time. But it appears to be a bottleneck when we are working > > with tables which have a lot of partitions. > > > > To reproduce this issue: > > 1. run `./gen.pl 1 | psql my_database postgres` > > 2. run `pgbench -j 8 -c 8 -f q.sql -T 100 my_database` > > 3. in second terminal run `sudo perf top -u postgres` > > > > Both gen.pl and q.sql are attached to this message. > > > > You will see that postgres spends a lot of time in > > ResourceOwnerForget* procedures: > > > > 32.80% postgres [.] list_nth > > 20.29% postgres [.] ResourceOwnerForgetRelationRef > > 12.87% postgres [.] find_all_inheritors > > 7.90% postgres [.] get_tabstat_entry > > 6.68% postgres [.] ResourceOwnerForgetTupleDesc > > 1.17% postgres [.] hash_search_with_hash_value > > ... < 1% ...
> > > > I would like to suggest a patch (see attachment) which fixes this > > bottleneck. Also I discovered that there is a lot of code > > duplication in ResourceOwner. Patch fixes this too. The idea is > > quite simple. I just replaced arrays with something that could be > > considered hash tables, but with some specific optimizations. > > > > After applying this patch we can see that bottleneck is gone: > > > > 42.89% postgres [.] list_nth > > 18.30% postgres [.] find_all_inheritors > > 10.97% postgres [.] get_tabstat_entry > > 1.82% postgres [.] hash_search_with_hash_value > > 1.21% postgres [.] SearchCatCache > > ... < 1% ... > > > > For tables with thousands of partitions we gain on average 32.5% more > > TPS. > > > > As far as I can see, at the same time the patch doesn't make anything > > worse. `make check` passes with asserts enabled and disabled. There > > is no performance degradation according to both standard pgbench > > benchmark and benchmark described above for tables with 10 and 100 > > partitions. > > I do think it's interesting to try to speed up the ResourceOwner > mechanism when there are many resources owned. We've had some forays > in that direction in the past, and I think it's useful to continue > that work. But I also think that this patch, while interesting, is > not something we can seriously consider committing in its current > form, because: > > 1. It lacks adequate comments and submission notes to explain the > general theory of operation of the patch. > > 2. It seems to use uint8 * rather freely as a substitute for char * or > void *, which doesn't seem like a great idea. > > 3. It doesn't follow PostgreSQL formatting conventions (uncuddled > curly braces, space after if and before open paren, etc.). > > 4. It mixes together multiple ideas in a single patch, not only > introducing a hashing concept but also striping a brand-new layer of > abstraction across the resource-owner mechanism.
I am not sure that > layer of abstraction is a very good idea, but if it needs to be done, > I think it should be a separate patch. > diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c index 0e7acbf..d2b37d5 100644 --- a/src/backend/utils/resowner/resowner.c +++ b/src/backend/utils/resowner/resowner.c @@ -29,6 +29,323 @@ #include "utils/snapmgr.h" /* + * ResourceArray is a common structure for storing different types of resources. + * + * ResourceOwner can own `HeapTuple`s, `Relation`s, `Snapshot`s, etc. For + * each resource type there are procedures ResourceOwnerRemember* and + * ResourceOwnerForget*. There are also ResourceOwnerEnlarge* procedures + * which should be called before corresponding ResourceOwnerRemember* calls + * (see below). Internally each type of resource is stored in separate + * ResourceArray. + * + * There are two major reasons for u
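The core of the optimization discussed in this thread is replacing the linear-scan forget path with a hash table, so that releasing one of thousands of remembered resources no longer costs O(n). The sketch below is a toy open-addressing table in Python, not the patch's C code; names, the growth policy, and the simplifying assumption that each resource is remembered at most once are all invented for illustration.

```python
# Toy illustration of a hash-based resource array: remember/forget in
# roughly O(1) instead of the O(n) linear scan of a plain array.
class ResourceArray:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.count = 0

    def _probe(self, res):
        # linear probing: walk until we find res or an empty slot
        i = hash(res) % self.capacity
        while self.slots[i] is not None and self.slots[i] != res:
            i = (i + 1) % self.capacity
        return i

    def remember(self, res):
        if (self.count + 1) * 2 > self.capacity:  # keep load factor <= 0.5
            self._grow()
        self.slots[self._probe(res)] = res
        self.count += 1

    def forget(self, res):
        i = self._probe(res)
        if self.slots[i] is None:
            raise KeyError(res)                   # forgetting an unknown resource
        self.slots[i] = None
        self.count -= 1
        # re-insert the rest of the probe chain so later lookups still work
        i = (i + 1) % self.capacity
        while self.slots[i] is not None:
            moved, self.slots[i] = self.slots[i], None
            self.count -= 1
            self.remember(moved)
            i = (i + 1) % self.capacity

    def _grow(self):
        old = [r for r in self.slots if r is not None]
        self.capacity *= 2
        self.slots = [None] * self.capacity
        self.count = 0
        for r in old:
            self.remember(r)
```

This also shows why Robert's item 4 is easy to satisfy: the abstraction (remember/forget over an opaque container) and the container's implementation (array scan vs. hash) are separable, which is exactly the two-patch split Aleksander proposes.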
Re: [HACKERS] W-TinyLfu for cache eviction
On Wed, Dec 9, 2015 at 11:31 AM, Konstantin Knizhnik wrote: > I expect synchronization issues with implementation of this algorithm. It > seems to be hard to avoid some global critical section which can cause > significant performance degradation at MPP systems (see topic "Move > PinBuffer and UnpinBuffer to atomics"). The sketch updates don't really need to be globally consistent; anything that is faster than cycling of buffers through the LRU tier would be enough. In principle atomic increments could be used, but funneling all buffer hits onto a small number of cache lines and then doing 4x the amount of writes would result in extremely nasty cache line ping-ponging. The Caffeine implementation appears to solve this by batching the frequency sketch updates using a striped set of circular buffers. Re-implementing this is not exactly trivial, and some prior experience with lock-free programming seems well advised. Based on the paper it looks like this algorithm can work with a low-quality primary eviction algorithm. Swapping our GCLOCK out for something simpler like plain CLOCK could simplify an atomics-based PinBuffer approach. Another interesting avenue for research would be to use ideas in TinyLFU to implement a tier of "nailed" buffers that have backend-local or striped pin-unpin accounting. But for checking whether the replacement policy implemented by W-TinyLFU is good, it isn't necessary to have a performant locking solution. I think a global lock would be good enough for a proof of concept that only evaluates cache hit ratios. Regards, Ants Aasma
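For reference, the frequency sketch mentioned above is a count-min sketch with periodic aging, which TinyLFU uses to estimate access frequency in constant space. A minimal single-threaded sketch follows (the hash choice, width, and depth are arbitrary for illustration; a concurrent implementation would batch updates through striped ring buffers as Caffeine does):

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: per-key frequency estimates that never underestimate."""
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # Derive one column index per row from independent slices of a hash.
        digest = hashlib.blake2b(key.encode(), digest_size=16).digest()
        for r in range(len(self.rows)):
            yield int.from_bytes(digest[4 * r:4 * r + 4], "little") % self.width

    def increment(self, key):
        for r, c in enumerate(self._columns(key)):
            self.rows[r][c] += 1

    def estimate(self, key):
        # Minimum across rows bounds the damage from hash collisions.
        return min(self.rows[r][c] for r, c in enumerate(self._columns(key)))

    def halve(self):
        # TinyLFU-style aging: periodically halve every counter so stale
        # popularity decays and the sketch tracks the recent workload.
        for row in self.rows:
            for c in range(len(row)):
                row[c] >>= 1
```

Because each estimate is a minimum over independently hashed counters, slightly stale per-CPU copies only make estimates a little noisier, which is why globally consistent updates are unnecessary here.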
Re: [HACKERS] exposing pg_controldata and pg_config as functions
On Thu, Sep 17, 2015 at 7:12 AM, Andres Freund wrote: > On 2015-08-20 09:59:25 -0400, Andrew Dunstan wrote: >> Is there any significant interest in either of these? >> >> Josh Berkus tells me that he would like pg_controldata information, and I >> was a bit interested in pg_config information, for this reason: I had a >> report of someone who had configured using --with-libxml but the xml tests >> actually returned the results that are expected without xml being >> configured. The regression tests thus passed, but should not have. It >> occurred to me that if we had a test like >> >> select pg_config('configure') ~ '--with-libxml' as has_xml; >> >> in the xml tests then this failure mode would be detected. > > On my reading of the thread there seems to be a tentative agreement that > pg_controldata is useful and still controversy around pg_config. Can we > split committing this? Yeah, the last version of the patch dates from August, and there is visible agreement that the pg_controldata information provided at the SQL level is useful, while the data of pg_config is proving to be less interesting for remote users. Could the patch be rebased and split as suggested above? -- Michael
Re: [HACKERS] Error with index on unlogged table
On 2015-12-09 21:03:47 +0900, Michael Paquier wrote: > Oh, OK. I didn't read though your lines correctly. So you basically > mean that we would look at the init files that are on disk, and check > if they are empty. If they are, we simply use XLogReadBufferExtended > to fetch the INIT_FORKNUM content and fill in another buffer for the > MAIN_FORKNUM. More or less right? We'd not just do so if they're empty, we'd just generally copy the file via shared buffers, instead of copy_file(). But we'd get the file size from the filesystem (which is fine, we make sure it is correct during replay).
Re: [HACKERS] Error with index on unlogged table
On Wed, Dec 9, 2015 at 8:04 PM, Andres Freund wrote: > On 2015-12-09 19:36:11 +0900, Michael Paquier wrote: >> On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund wrote: >> > On 2015-12-09 16:30:47 +0900, Michael Paquier wrote: >> >> > I'm kinda wondering if it wouldn't have been better to go through shared >> >> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using >> >> > copy_file(). >> >> >> >> For deployment with large shared_buffers settings, wouldn't that be >> >> actually more costly than the current way of doing? We would need to >> >> go through all the buffers at least once and look for the INIT_FORKNUM >> >> present to flush them. >> > >> > We could just check the file sizes on disk, and the check for the >> > contents of all the pages for each file. >> >> By doing it at replay, the flushes are spread across time. And by >> doing it at the end of recovery, all the flushes would be grouped. Do >> you think that's fine? > > The point is that we'd no flushes, because the data would come directly > from shared buffers... Oh, OK. I didn't read though your lines correctly. So you basically mean that we would look at the init files that are on disk, and check if they are empty. If they are, we simply use XLogReadBufferExtended to fetch the INIT_FORKNUM content and fill in another buffer for the MAIN_FORKNUM. More or less right? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql: add \pset true/false
On Tue, Dec 8, 2015 at 8:51 PM, Michael Paquier wrote: > On Tue, Dec 8, 2015 at 7:18 PM, Kyotaro HORIGUCHI > wrote: >> At Sat, 5 Dec 2015 21:05:29 +0900, Michael Paquier >> wrote in >> >> > Regarding the patch, I >> > would tend to think that we should just reject it and try to cruft >> > something that could be more pluggable if there is really a need. >> > Thoughts? >> >> Honestly saying, I feel similarly with you:p I personally will do >> something like the following for the original objective. > > Are there other opinions? The -1 team is in majority at the end of this > thread.. So, marking the patch as rejected? Any objections? -- Michael
Re: [HACKERS] Making tab-complete.c easier to maintain
On Wed, Dec 9, 2015 at 8:17 PM, Thomas Munro wrote: > On Mon, Dec 7, 2015 at 8:41 PM, Michael Paquier > wrote: >> On Tue, Nov 17, 2015 at 12:19 AM, Alvaro Herrera >> wrote: >>> Thomas Munro wrote: New version attached, merging recent changes. >>> >>> I wonder about the TailMatches and Matches macros --- wouldn't it be >>> better to have a single one, renaming TailMatches to Matches and >>> replacing the current Matches() with an initial token that corresponds >>> to anchoring to start of command? Just wondering, not terribly attached >>> to the idea. >> >> + /* TODO:TM -- begin temporary, not part of the patch! */ >> + Assert(!word_matches(NULL, "")); >> + [...] >> + Assert(!word_matches("foo", "")); >> + /* TODO:TM -- end temporary */ >> >> Be sure to not forget to remove that later. > > Thanks for looking at this Michael. It's probably not much fun to > review! Here is a new version with that bit removed. More responses > inline below. I had a hard time not sleeping when reading it... That's very mechanical. > I agree that would probably be better but Alvaro suggested following > the existing logic in the first pass, which was mostly based on tails, > and then considering simpler head-based patterns in a future pass. Fine with me. So what do we do now? There is your patch, which is already quite big, but as well a second patch based on regexps, which is far bigger. 
And at the end they provide a similar result: Here is for example what the regexp patch does for some complex checks, like ALTER TABLE RENAME: /* ALTER TABLE xxx RENAME yyy */ -else if (pg_strcasecmp(prev4_wd, "TABLE") == 0 && - pg_strcasecmp(prev2_wd, "RENAME") == 0 && - pg_strcasecmp(prev_wd, "CONSTRAINT") != 0 && - pg_strcasecmp(prev_wd, "TO") != 0) +else if (MATCH("TABLE #id RENAME !CONSTRAINT|TO")) And what Thomas's patch does: +else if (TailMatches5("ALTER", "TABLE", MatchAny, "RENAME", MatchAny) && + !TailMatches1("CONSTRAINT|TO")) The regexp patch makes the negative checks somewhat easier to read (there are 19 positions in tab-complete.c doing that); still, inventing a new language and heavily refactoring just the tab completion of psql seems a bit too much IMO, so my heart balances in favor of Thomas' stuff. Thoughts from others? -- Michael
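For readers following along, the semantics both patches implement — match the last N words of the buffer case-insensitively, with a wildcard and '|'-separated alternatives — can be modeled independently of the C macros. A toy Python sketch (the names mirror Thomas's TailMatches/MatchAny, but this is not the patch code):

```python
MatchAny = object()  # wildcard sentinel: matches any single word

def word_matches(pattern, word):
    """Case-insensitive word match; '|' in the pattern separates alternatives."""
    if pattern is MatchAny:
        return True
    return word.upper() in pattern.upper().split("|")

def tail_matches(words, *patterns):
    """True if the last len(patterns) words match the patterns pairwise."""
    if len(words) < len(patterns):
        return False
    tail = words[-len(patterns):]
    return all(word_matches(p, w) for p, w in zip(patterns, tail))
```

The ALTER TABLE example from the thread then reads as: tail_matches(words, "ALTER", "TABLE", MatchAny, "RENAME", MatchAny) and not tail_matches(words, "CONSTRAINT|TO").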
Re: [HACKERS] Passing initially_valid values instead of !skip_validation to StoreRelCheck() in AddRelationNewConstraints()
On Wednesday, 9 December 2015, amul sul wrote: > >On Wednesday, 9 December 2015 12:55 PM, Amit Langote < > langote_amit...@lab.ntt.co.jp > wrote: > > >Thoughts? > > Wondering, have you notice failed regression tests? I did, I couldn't send an updated patch today before leaving for the day. > I have tried with new transformCheckConstraints() function & there will be > small fix in gram.y. > > Have look into attached patch & please share your thoughts and/or > suggestions. > Will take a look. Thanks, Amit
Re: [HACKERS] Making tab-complete.c easier to maintain
On Mon, Dec 7, 2015 at 8:41 PM, Michael Paquier wrote: > On Tue, Nov 17, 2015 at 12:19 AM, Alvaro Herrera > wrote: >> Thomas Munro wrote: >>> New version attached, merging recent changes. >> >> I wonder about the TailMatches and Matches macros --- wouldn't it be >> better to have a single one, renaming TailMatches to Matches and >> replacing the current Matches() with an initial token that corresponds >> to anchoring to start of command? Just wondering, not terribly attached >> to the idea. > > + /* TODO:TM -- begin temporary, not part of the patch! */ > + Assert(!word_matches(NULL, "")); > + [...] > + Assert(!word_matches("foo", "")); > + /* TODO:TM -- end temporary */ > > Be sure to not forget to remove that later. Thanks for looking at this Michael. It's probably not much fun to review! Here is a new version with that bit removed. More responses inline below. > - else if (pg_strcasecmp(prev5_wd, "DEFAULT") == 0 && > -pg_strcasecmp(prev4_wd, "PRIVILEGES") == 0 && > -(pg_strcasecmp(prev3_wd, "FOR") == 0 || > - pg_strcasecmp(prev3_wd, "IN") == 0)) > - { > - static const char *const > list_ALTER_DEFAULT_PRIVILEGES_REST[] = > - {"GRANT", "REVOKE", NULL}; > - > - COMPLETE_WITH_LIST(list_ALTER_DEFAULT_PRIVILEGES_REST); > - } > + else if (TailMatches5("DEFAULT", "PRIVILEGES", "FOR", "ROLE", > MatchAny) || > +TailMatches5("DEFAULT", "PRIVILEGES", "IN", > "SCHEMA", MatchAny)) > + CompleteWithList2("GRANT", "REVOKE"); > For this chunk I think that you need to check for ROLE|USER and not only > ROLE. Right, done. > + else if (TailMatches4("ALTER", "DOMAIN", MatchAny, "RENAME")) > { > static const char *const list_ALTERDOMAIN[] = > {"CONSTRAINT", "TO", NULL}; > I think you should remove COMPLETE_WITH_LIST here for consistency with the > rest. Right, done. 
> - else if (pg_strcasecmp(prev5_wd, "DOMAIN") == 0 && > -pg_strcasecmp(prev3_wd, "RENAME") == 0 && > -pg_strcasecmp(prev2_wd, "CONSTRAINT") == 0) > + else if (TailMatches3("RENAME", "CONSTRAINT", MatchAny)) > COMPLETE_WITH_CONST("TO"); > Perhaps you are missing DOMAIN here? Right, done. > - else if (pg_strcasecmp(prev4_wd, "ALTER") == 0 && > -pg_strcasecmp(prev3_wd, "SEQUENCE") == 0 && > -pg_strcasecmp(prev_wd, "NO") == 0) > - { > - static const char *const list_ALTERSEQUENCE2[] = > - {"MINVALUE", "MAXVALUE", "CYCLE", NULL}; > - > - COMPLETE_WITH_LIST(list_ALTERSEQUENCE2); > - } > + else if (TailMatches4("ALTER", "SEQUEMCE", MatchAny, "NO")) > + CompleteWithList3("MINVALUE", "MAXVALUE", "CYCLE"); > Typo here: s/SEQUEMCE/SEQUENCE. Oops, fixed. > - else if (pg_strcasecmp(prev5_wd, "TABLE") == 0 && > -pg_strcasecmp(prev3_wd, "RENAME") == 0 && > -(pg_strcasecmp(prev2_wd, "COLUMN") == 0 || > - pg_strcasecmp(prev2_wd, "CONSTRAINT") == 0) && > -pg_strcasecmp(prev_wd, "TO") != 0) > + else if (TailMatches6("ALTER", "TABLE", MatchAny, "RENAME", > "COLUMN|CONSTRAINT", MatchAny) && > +!TailMatches1("TO")) > This should use TailMatches5 without ALTER for consistency with the existing > code? Ok, done. > - else if (pg_strcasecmp(prev_wd, "CLUSTER") == 0 && > -pg_strcasecmp(prev2_wd, "WITHOUT") != 0) > + else if (TailMatches1("CLUSTER") && !TailMatches2("WITHOUT", > "CLUSTER")) > Here removing CLUSTER should be fine. Ok. > - else if (pg_strcasecmp(prev2_wd, "CLUSTER") == 0 && > -pg_strcasecmp(prev_wd, "ON") != 0 && > -pg_strcasecmp(prev_wd, "VERBOSE") != 0) > - { > + else if (TailMatches2("CLUSTER", MatchAny) && > !TailMatches1("VERBOSE")) > Handling of ON has been forgotten. Right, fixed. 
> - else if (pg_strcasecmp(prev3_wd, "CREATE") == 0 && > -!(pg_strcasecmp(prev2_wd, "USER") == 0 && > pg_strcasecmp(prev_wd, "MAPPING") == 0) && > -(pg_strcasecmp(prev2_wd, "ROLE") == 0 || > - pg_strcasecmp(prev2_wd, "GROUP") == 0 || > pg_strcasecmp(prev2_wd, "USER") == 0)) > + else if (TailMatches3("CREATE", "ROLE|GROUP|USER", MatchAny) && > +!TailMatches3("CREATE", "USER", "MAPPING")) > !TailMatches2("USER", "MAPPING")? Ok. > - else if (pg_strcasecmp(prev4_wd, "CREATE") == 0 && > -pg_strcasecmp(prev3_wd, "MATERIALIZED") == 0 && > -pg_strcasecmp(prev2_wd, "VIEW") == 0) > + else if (TailMatches3("CREATE", "MATERIALIZED", "VIEW")) > Fo
Re: [HACKERS] Error with index on unlogged table
On 2015-12-09 19:36:11 +0900, Michael Paquier wrote: > On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund wrote: > > On 2015-12-09 16:30:47 +0900, Michael Paquier wrote: > >> > I'm kinda wondering if it wouldn't have been better to go through shared > >> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using > >> > copy_file(). > >> > >> For deployment with large shared_buffers settings, wouldn't that be > >> actually more costly than the current way of doing? We would need to > >> go through all the buffers at least once and look for the INIT_FORKNUM > >> present to flush them. > > > > We could just check the file sizes on disk, and the check for the > > contents of all the pages for each file. > > By doing it at replay, the flushes are spread across time. And by > doing it at the end of recovery, all the flushes would be grouped. Do > you think that's fine? The point is that we'd no flushes, because the data would come directly from shared buffers... -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Error with index on unlogged table
On Wed, Dec 9, 2015 at 4:41 PM, Andres Freund wrote: > On 2015-12-09 16:30:47 +0900, Michael Paquier wrote: >> > I'm kinda wondering if it wouldn't have been better to go through shared >> > buffers in ResetUnloggedRelationsInDbspaceDir() instead of using >> > copy_file(). >> >> For deployment with large shared_buffers settings, wouldn't that be >> actually more costly than the current way of doing? We would need to >> go through all the buffers at least once and look for the INIT_FORKNUM >> present to flush them. > > We could just check the file sizes on disk, and the check for the > contents of all the pages for each file. By doing it at replay, the flushes are spread across time. And by doing it at the end of recovery, all the flushes would be grouped. Do you think that's fine? -- Michael
Re: [HACKERS] W-TinyLfu for cache eviction
On 03.12.2015 10:27, Vladimir Sitnikov wrote: I've recently noticed the W-TinyLfu cache admission policy (see [1]) being used for caffeine, a "high performance caching library for Java 8". It demonstrates high cache hit ratios (see [2]) and makes it possible to build high-throughput caches (see caffeine in [3]). The authors explicitly allow implementations of the algorithm (see [4]). Does it make sense to evaluate the algorithm for buffer replacement? I expect synchronization issues with the implementation of this algorithm. It seems to be hard to avoid some global critical section which can cause significant performance degradation on MPP systems (see the topic "Move PinBuffer and UnpinBuffer to atomics"). [1]: http://arxiv.org/pdf/1512.00727v1.pdf [2]: https://github.com/ben-manes/caffeine/wiki/Efficiency [3]: https://github.com/ben-manes/caffeine/wiki/Benchmarks [4]: https://github.com/ben-manes/caffeine/issues/23#issuecomment-161536706 Vladimir Sitnikov
Re: [HACKERS] Passing initially_valid values instead of !skip_validation to StoreRelCheck() in AddRelationNewConstraints()
>On Wednesday, 9 December 2015 12:55 PM, Amit Langote > wrote: >Thoughts? Wondering, have you noticed the failed regression tests? I have tried with a new transformCheckConstraints() function, and there will be a small fix in gram.y. Have a look at the attached patch and please share your thoughts and/or suggestions. Thanks, Amul Sul transformCheckConstraints-function-to-overrid-not-valid.patch Description: Binary data
Re: [HACKERS] pg_hba_lookup function to get all matching pg_hba.conf entries
On Wed, Dec 9, 2015 at 5:31 PM, Amit Kapila wrote: > > Another bigger issue I see in the above part of code is that it doesn't > seem to be safe to call load_hba() at that place in PostgresMain() as > currently load_hba() is using a context created from PostmasterContext > to perform the parsing and some other stuff, the PostmasterContext > won't be available at that time. It is deleted immediately after > InitPostgres > is completed. So either we need to make PostmasterContext don't go > away after InitPostgres() or load_hba shouldn't use it and rather use > CurrentMemroyContext similar to ProcessConfigFile or may be use > TopMemoryContext instead of PostmasterContext if possible. I think > this needs some more thoughts. > > Apart from above, this patch doesn't seem to work on Windows or > for EXEC_BACKEND builds as we are loading the hba file in a > temporary context (PostmasterContext, refer PerformAuthentication) > which won't be alive for the backends life. This works on linux because > of fork/exec mechanism which allows to inherit the parsed file > (parsed_hba_lines). This is slightly tricky problem to solve and we > have couple of options (a) use TopMemoryContext instead of Postmaster > Context to load hba; (b) Use CurrentMemoryContext (c) pass the parsed > hba file for Windows/Exec_Backend using save_backend_variables/ > restore_backend_variables mechanism or if you have any other idea. > If you don't have any better idea, then you can evaluate above ideas > and see which one makes more sense. Reverting the context release patch is already provided in the first mail of this thread [1]. Forgot to mention about the same in further mails. Here I attached the same patch. This patch needs to be applied first before pg_hba_lookup patch. I tested it in windows version also. 
[1] - http://www.postgresql.org/message-id/cajrrpgffyf45mfk7ub+qhwhxn_ttmknrvhtudefqzuzzrwe...@mail.gmail.com Regards, Hari Babu Fujitsu Australia revert_hba_context_release_in_backend.patch Description: Binary data
Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics
On Tue, Dec 8, 2015 at 6:00 PM, Amit Kapila wrote: > On Tue, Dec 8, 2015 at 3:56 PM, Alexander Korotkov < > a.korot...@postgrespro.ru> wrote: >> >> Agree. This patch need to be carefully verified. Current experiments >> just show that it is promising direction for improvement. I'll come with >> better version of this patch. >> >> Also, after testing on large machines I have another observation to >> share. For now, LWLock doesn't guarantee that exclusive lock would be ever >> acquired (assuming each shared lock duration is finite). It because when >> there is no exclusive lock, new shared locks aren't queued and LWLock state >> is changed directly. Thus, process which tries to acquire exclusive lock >> have to wait for gap in shared locks. >> > > I think this has the potential to starve exclusive lockers in worst case. > > >> But with high concurrency for shared lock that could happen very rare, >> say never. >> >> We did see this on big Intel machine in practice. pgbench -S gets shared >> ProcArrayLock very frequently. Since some number of connections is >> achieved, new connections hangs on getting exclusive ProcArrayLock. I think >> we could do some workaround for this problem. For instance, when exclusive >> lock waiter have some timeout it could set some special bit which prevents >> others to get new shared locks. >> >> > I think timeout based solution would lead to giving priority to > exclusive lock waiters (assume a case where each of exclusive > lock waiter timesout one after another) and make shared lockers > wait and a timer based solution might turn out to be costly for > general cases where wait is not so long. > Since all lwlock waiters are ordered in the queue, we can let only first waiter to set this bit. Anyway, once bit is set, shared lockers would be added to the queue. They would get the lock in queue order. 
> Another way could be to > check if the Exclusive locker needs to go for repeated wait for a > couple of times, then we can set such a bit. I'm not sure what you mean by repeated wait. Do you mean the exclusive locker was woken up twice by timeout? Because now, without timeout, the exclusive locker wouldn't be woken up until all shared locks are released. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
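The scheme sketched in this exchange — the first queued exclusive waiter sets a bit that stops new shared lockers, so the writer only waits for the readers already inside — can be modeled with a condition variable. PostgreSQL's LWLocks use atomics and an explicit wait queue instead; this sketch shows only the policy, not the mechanism:

```python
import threading

class WriterPriorityLock:
    """Shared/exclusive lock where a pending exclusive waiter blocks *new*
    shared acquisitions, bounding how long the writer can starve."""
    def __init__(self):
        self._cond = threading.Condition()
        self._shared = 0
        self._writer_pending = False
        self._writer_active = False

    def acquire_shared(self):
        with self._cond:
            # New readers queue behind a pending writer instead of streaming in.
            while self._writer_pending or self._writer_active:
                self._cond.wait()
            self._shared += 1

    def release_shared(self):
        with self._cond:
            self._shared -= 1
            if self._shared == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer_active or self._writer_pending:
                self._cond.wait()
            self._writer_pending = True   # the "special bit" from the thread
            while self._shared > 0:       # drain only the readers already in
                self._cond.wait()
            self._writer_pending = False
            self._writer_active = True

    def release_exclusive(self):
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()
```

Without the pending flag, a steady stream of shared lockers can keep the count above zero indefinitely; with it, the writer's wait is bounded by the readers that were already holding the lock when it queued.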
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
On 2015/12/09 13:26, Kouhei Kaigai wrote: On Tue, Dec 8, 2015 at 10:00 PM, Etsuro Fujita wrote: I think the actual regression test outputs are fine, and that your desire to suppress part of the plan tree from showing up in the EXPLAIN output is misguided. I like it just the way it is. To prevent user confusion, I think that when we add support to postgres_fdw for this we might also want to add some documentation explaining how to interpret this EXPLAIN output, but I don't think there's any problem with the output itself. I'm not sure that that's a good idea. One reason for that is I think that that would be more confusing to users when more than two foreign tables are involved in a foreign join, as shown in the following example. Note that the outer plans will be shown recursively. Another reason is there is no consistency between the costs of the outer plans and that of the main plan. I still don't really see a problem here, but, regardless, the solution can't be to hide nodes that are in fact present from the user. We can talk about making further changes here, but hiding the nodes altogether is categorically out in my mind. If you really want to hide the alternative sub-plan, you can move the outer planstate into some private field in BeginForeignScan, then kick ExecProcNode() at the ForeignRecheck callback by itself. Explain walks down the sub-plan if outerPlanState(planstate) is valid. So, as long as your extension keeps the planstate privately, it is not visible from the EXPLAIN. Of course, I don't recommend it. Sorry, my explanation might not be enough, but I'm not saying we should hide the subplan. I think it would be better to show the subplan somewhere in the EXPLAIN output, but I'm not sure that it's a good idea to show it in the current form. We have two plan trees; one for normal query execution and another for EvalPlanQual testing. I think it'd be better to show the EXPLAIN output in a way that allows users to easily identify each of the plan trees.
Best regards, Etsuro Fujita