Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Amit Kapila
On Thu, Apr 7, 2016 at 10:02 AM, Fujii Masao  wrote:
>
> On Thu, Apr 7, 2016 at 1:22 PM, Amit Kapila wrote:
> >
> > But for that, I think we don't need to do anything extra.  I mean
> > write_nondefault_variables() will automatically write the non-default
> > value of the variable, and then during backend initialization, it will
> > call read_nondefault_variables, which will call set_config_option for
> > non-default parameters, and that should set the required value if we
> > have an assign_* function defined for the variable.
>
> Yes, if the variable that we'd like to pass to a backend is BOOL, INT,
> REAL, STRING or ENUM.  But the SyncRepConfig variable is a bit more
> complicated.
>

SyncRepConfig is the parsed result of SyncRepStandbyNames, so why do you
want to pass that?  I assume that the current non-default value of
SyncRepStandbyNames will be passed via write_nondefault_variables(), so we
can use that to regenerate SyncRepConfig.
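
Roughly, something like this (the hook name and body are illustrative,
following the pattern of existing string-GUC assign hooks, not a committed
API):

    /* illustrative: an assign hook for synchronous_standby_names */
    static void
    assign_synchronous_standby_names(const char *newval, void *extra)
    {
        /*
         * set_config_option() runs this hook whenever the value is set,
         * including when a backend replays non-default values via
         * read_nondefault_variables(), so SyncRepConfig would be
         * regenerated in EXEC_BACKEND children for free.
         */
        SyncRepUpdateConfig();  /* assumed to reparse the current value */
    }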

>
> So ISTM that write_one_nondefault_variable() needs to
> be updated so that SyncRepConfig is written to a file.
>


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] pg_ctl promote wait

2016-04-06 Thread Michael Paquier
On Wed, Mar 23, 2016 at 1:47 AM, David Steele  wrote:
> On 3/16/16 12:19 PM, David Steele wrote:
>> Hi Peter,
>>
>> On 3/9/16 3:08 PM, Michael Paquier wrote:
>>
>>> Here are some comments about 0002
>> <...>
>>> I think that we had better do something like the attached first.
>>> Thoughts?
>>
>> It's been a week since Michael reviewed this patch.  Could you please
>> comment on his proposed changes as soon as possible?
>
> Bump.

Feature freeze is getting close.  Once the deadline is reached, I
guess we had better return the patch.
-- 
Michael




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-04-06 Thread Andres Freund
Hi,

On 2016-04-07 09:14:00 +0530, Amit Kapila wrote:
> On Sat, Apr 2, 2016 at 5:25 PM, Amit Kapila  wrote:
> I have run exactly the same test on an Intel x86 m/c and the results are as below:

Thanks for running these tests!

> Patch_ver / client count (tps)          2     128     256
> HEAD (commit 2143f5e1)               2832   35001   26756
> clog_buf_128                         2909   50685   40998
> clog_buf_128 + group_update_clog_v8  2981   53043   50779
> clog_buf_128 + content_lock          2843   56261   54059
> clog_buf_128 + nocontent_lock        2630   56554   54429

Interesting.

Could you perhaps also run a test with -btpcb-like@1 -bselect-only@3?
That is more representative of real-world loads, and it's where I saw
Simon's approach outshining yours considerably...
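
For example (using pgbench's -b scriptname@weight syntax for weighted
builtin scripts, new in 9.6; client counts as in your runs above):

    pgbench -c 128 -j 128 -T 1200 -b tpcb-like@1 -b select-only@3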

Greetings,

Andres Freund




Re: [HACKERS] Patch to implement pg_current_logfile() function

2016-04-06 Thread Karl O. Pinc
On Wed, 6 Apr 2016 22:26:13 -0500
"Karl O. Pinc"  wrote:
> On Wed, 23 Mar 2016 23:22:26 +0100
> Gilles Darold  wrote:
> 
> > Thanks for the reminder, here is the v3 of the patch after a deeper
> > review and testing. It is now registered to the next commit fest
> > under the System Administration topic. 

> I've not yet even tried building it, 

Ok.  I've built it (but not tested it).  Some comments.

The docs don't build.  You are missing a " at line
15197 of func.sgml.  (Patched against current HEAD.)

Is there any rationale for the placement of your function
in that documentation table?  I don't see any organizing principle
to the table so am wondering where it might best fit.
(http://www.postgresql.org/docs/devel/static/functions-info.html)

Perhaps you want to write?:

   
   pg_current_logfile returns the name of the
   current log file used by the logging collector, as
   text.  Log collection must be active or the
   return value is undefined.

(Removed "a" before "text", and added "or..." to the end.)

Unless you want to define the return value to be NULL when
log collection is not active.  This might be cleaner.
I say this because I'm tempted to add "This can be tested
for with current_setting('logging_collector')='on'." to
the end of the paragraph.  But adding such text means
that the "logging_collector" setting is documented in multiple
places, in some sense, and such redundancy is best
avoided when possible.

I don't see a problem with defining the return value to be
NULL -- so long as it's defined to be NULL when there is
no current log file.  This would be different from defining
it to be NULL when log collection is off.  Although not
very different, it should be clear that using pg_current_logfile()
to test whether log collection is on is not good practice.
Perhaps some text like?:

   
   pg_current_logfile returns the name of the
   current log file used by the logging collector, as
   text.  NULL is returned
   and a NOTICE raised when no log file exists.

(I'm going to have to think some more about the raising of the
notice, and of the other error handling in your code.
I've not paid it any attention and error handling is always
problematic.)
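
In code terms the NULL-returning behavior is a one-liner.  A sketch, with
made-up internal names, using the usual fmgr idioms:

    Datum
    pg_current_logfile(PG_FUNCTION_ARGS)
    {
        /* current_logfile_name stands in for whatever state the patch keeps */
        if (!Logging_collector || current_logfile_name == NULL)
            PG_RETURN_NULL();
        PG_RETURN_TEXT_P(cstring_to_text(current_logfile_name));
    }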

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Fujii Masao
On Thu, Apr 7, 2016 at 1:22 PM, Amit Kapila  wrote:
> On Wed, Apr 6, 2016 at 8:11 PM, Fujii Masao  wrote:
>>
>> On Wed, Apr 6, 2016 at 11:14 PM, Amit Kapila 
>> wrote:
>> > On Wed, Apr 6, 2016 at 7:03 PM, Fujii Masao 
>> > wrote:
>> >>
>> >> On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila 
>> >> wrote:
>> >> >
>> >> >> BTW, we can move SyncRepUpdateConfig() just after
>> >> >> ProcessConfigFile()
>> >> >> from pg_stat_get_wal_senders() and every backends always parse the
>> >> >> value
>> >> >> of s_s_names when the setting is changed.
>> >> >>
>> >> >
> >> >> > That sounds appropriate, but I'm not sure of the exact place to call it.
>> >>
>> >> Maybe just after the following ProcessConfigFile().
>> >>
>> >> -
>> >> /*
>> >> * (6) check for any other interesting events that happened while we
>> >> * slept.
>> >> */
>> >> if (got_SIGHUP)
>> >> {
>> >> got_SIGHUP = false;
>> >> ProcessConfigFile(PGC_SIGHUP);
>> >> }
>> >> -
>> >>
>> >> If we do the move, we also need to either (1) make postmaster call
>> >> SyncRepUpdateConfig() and pass the parsed result to any forked backends
>> >> via a file like write_nondefault_variables() does for EXEC_BACKEND
>> >> environment, or (2) make a backend call SyncRepUpdateConfig() during
>> >> its initialization phase so that the first call of pg_stat_replication
>> >> can use the parsed result. (1) seems complicated and overkill.
>> >> (2) may add very small overhead into the fork of a backend. It would
>> >> be almost negligible, though. So which logic should we adopt?
>> >>
>> >
>> > Won't it be possible to have assign_* function for
>> > synchronous_standby_names
>> > as we have for some of the other settings like assign_XactIsoLevel and
>> > then
>> > call SyncRepUpdateConfig() in that function?
>>
>> It's possible, but still seems to need (1), i.e., the variable that
>> assign_XXX
>> function assigned needs to be passed to a backend via file for
>> EXEC_BACKEND
>> environment.
>>
>
> But for that, I think we don't need to do anything extra.  I mean
> write_nondefault_variables() will automatically write the non-default value
> of variable and then during backend initialization, it will call
> read_nondefault_variables which will call set_config_option for non-default
> parameters and that should set the required value if we have assign_*
> function defined for the variable.

Yes, if the variable that we'd like to pass to a backend is BOOL, INT,
REAL, STRING or ENUM.  But the SyncRepConfig variable is a bit more
complicated.  So ISTM that write_one_nondefault_variable() needs to
be updated so that SyncRepConfig is written to a file.
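
For context: that function dispatches on the five GUC vartypes, which is
why a parsed struct like SyncRepConfig never makes it into the file.  The
shape, from memory rather than verbatim:

    switch (gconf->vartype)
    {
        case PGC_BOOL:      /* value written as "true"/"false" */
        case PGC_INT:       /* value written as decimal text */
        case PGC_REAL:      /* value written as decimal text */
        case PGC_STRING:    /* string value written directly */
        case PGC_ENUM:      /* enum label written as text */
            break;          /* only scalar values, never parsed structs */
    }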

Regards,

-- 
Fujii Masao




Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Amit Kapila
On Wed, Apr 6, 2016 at 8:11 PM, Fujii Masao  wrote:
>
> > On Wed, Apr 6, 2016 at 11:14 PM, Amit Kapila wrote:
> > On Wed, Apr 6, 2016 at 7:03 PM, Fujii Masao wrote:
> >>
> >> On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila 
> >> wrote:
> >> >
> >> >> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile()
> >> >> from pg_stat_get_wal_senders() and every backends always parse the
> >> >> value
> >> >> of s_s_names when the setting is changed.
> >> >>
> >> >
> >> > That sounds appropriate, but I'm not sure of the exact place to call it.
> >>
> >> Maybe just after the following ProcessConfigFile().
> >>
> >> -
> >> /*
> >> * (6) check for any other interesting events that happened while we
> >> * slept.
> >> */
> >> if (got_SIGHUP)
> >> {
> >> got_SIGHUP = false;
> >> ProcessConfigFile(PGC_SIGHUP);
> >> }
> >> -
> >>
> >> If we do the move, we also need to either (1) make postmaster call
> >> SyncRepUpdateConfig() and pass the parsed result to any forked backends
> >> via a file like write_nondefault_variables() does for EXEC_BACKEND
> >> environment, or (2) make a backend call SyncRepUpdateConfig() during
> >> its initialization phase so that the first call of pg_stat_replication
> >> can use the parsed result. (1) seems complicated and overkill.
> >> (2) may add very small overhead into the fork of a backend. It would
> >> be almost negligible, though. So which logic should we adopt?
> >>
> >
> > Won't it be possible to have an assign_* function for
> > synchronous_standby_names, as we have for some of the other settings
> > like assign_XactIsoLevel, and then call SyncRepUpdateConfig() in that
> > function?
>
> It's possible, but it still seems to need (1), i.e., the variable that
> the assign_XXX function assigned needs to be passed to a backend via a
> file for EXEC_BACKEND environments.
>

But for that, I think we don't need to do anything extra.  I mean
write_nondefault_variables() will automatically write the non-default
value of the variable, and then during backend initialization, it will
call read_nondefault_variables, which will call set_config_option for
non-default parameters, and that should set the required value if we have
an assign_* function defined for the variable.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Proposal: Generic WAL logical messages

2016-04-06 Thread Andres Freund


On April 7, 2016 2:26:41 AM GMT+02:00, Michael Paquier wrote:
>On Thu, Apr 7, 2016 at 12:55 AM, Andres Freund 
>wrote:
>> On 2016-04-06 16:49:17 +0100, Simon Riggs wrote:
>>> Perhaps easy to solve, but how do we test it is solved?
>>
>> Maybe something like
>>
>> -- drain
>> pg_logical_slot_get_changes(...);
>> -- generate message in a different database, to ensure it's not
>> -- processed in this database
>> \c template1
>> SELECT pg_logical_emit_message(...);
>> \c postgres
>> -- check
>> pg_logical_slot_get_changes(..);
>>
>> It's a bit ugly to hardcode database names :/
>
>When running installcheck, there is no way to be sure that databases
>template1 and/or postgres exist on a server, so this test would fail
>because of that.

No need to hardcode postgres, see Petr's reply. I'm not concerned about
template1 not being there; if you tinkered with things at that level, it's
unlikely that tests will succeed.  Also, remember, this is in a test cluster
created by the regression script, and there's no installcheck support anyway
(because of the required settings for logical decoding).
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2016-04-06 Thread Amit Kapila
On Sat, Apr 2, 2016 at 5:25 PM, Amit Kapila  wrote:

> On Thu, Mar 31, 2016 at 3:48 PM, Andres Freund  wrote:
>
> Here is the performance data (configuration of machine used to perform
> this test is mentioned at end of mail):
>
> Non-default parameters
> 
> max_connections = 300
> shared_buffers=8GB
> min_wal_size=10GB
> max_wal_size=15GB
> checkpoint_timeout=35min
> maintenance_work_mem = 1GB
> checkpoint_completion_target = 0.9
> wal_buffers = 256MB
>
> median of 3, 20-min pgbench tpc-b results for --unlogged-tables
>

I have run exactly the same test on an Intel x86 m/c and the results are as below:

Patch_ver / client count (tps)          2     128     256
HEAD (commit 2143f5e1)               2832   35001   26756
clog_buf_128                         2909   50685   40998
clog_buf_128 + group_update_clog_v8  2981   53043   50779
clog_buf_128 + content_lock          2843   56261   54059
clog_buf_128 + nocontent_lock        2630   56554   54429


On this m/c, I don't see any run-to-run variation; however, the trend of
the results seems somewhat similar to the power m/c.  Clearly the first
patch, increasing clog bufs to 128, shows up to a 50% performance
improvement at 256 clients.  We can also observe that the group clog patch
gives a ~24% gain on top of the increase-clog-bufs patch at 256 clients.
Both the content lock and no content lock patches show similar gains, and
their performance is 6~7% better than the group clog patch.  Also, as on
the power m/c, the no content lock patch seems to show some regression at
lower client counts (2 clients in this case).

Based on the above results, increase_clog_bufs to 128 is a clear winner,
and I think we might not want to proceed with the no-content-lock patch,
as it shows some regression and is no better than the content-lock patch.
Now, I think we need to decide between the group-clog-update patch and the
content-lock patch.  It seems to me that the difference between the two is
not large (6~7%), and I think that when sub-transactions are involved
(sub-transactions on the same page as the main transaction), the group
clog patch should give better performance, as the content lock will itself
start becoming a bottleneck.  Now, I think we could address that case for
the content-lock approach by using a grouping technique on the content
lock or something similar, but I am not sure that is worth the effort.
Also, I see some variation in the performance data with the content-lock
patch on the power m/c, but again that might be attributed to m/c
characteristics.  So, I think we can proceed with either the group clog
patch or the content lock patch; if we want to proceed with the
content-lock approach, then we need to do some more work on it.


Note: for both the content-lock and no-content-lock runs, I applied
the 0001-Improve-64bit-atomics-support patch.


m/c config (lscpu)
---
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):128
On-line CPU(s) list:   0-127
Thread(s) per core:2
Core(s) per socket:8
Socket(s): 8
NUMA node(s):  8
Vendor ID: GenuineIntel
CPU family:6
Model: 47
Model name:Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:  2
CPU MHz:   1064.000
BogoMIPS:  4266.62
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  24576K
NUMA node0 CPU(s): 0,65-71,96-103
NUMA node1 CPU(s): 72-79,104-111
NUMA node2 CPU(s): 80-87,112-119
NUMA node3 CPU(s): 88-95,120-127
NUMA node4 CPU(s): 1-8,33-40
NUMA node5 CPU(s): 9-16,41-48
NUMA node6 CPU(s): 17-24,49-56
NUMA node7 CPU(s): 25-32,57-64

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] Re: [COMMITTERS] pgsql: Use GRANT system to manage access to sensitive functions

2016-04-06 Thread Stephen Frost
* Fujii Masao (masao.fu...@gmail.com) wrote:
> On Thu, Apr 7, 2016 at 10:45 AM, Stephen Frost  wrote:
> > Use GRANT system to manage access to sensitive functions
> >
> > Now that pg_dump will properly dump out any ACL changes made to
> > functions which exist in pg_catalog, switch to using the GRANT system
> > to manage access to those functions.
> >
> > This means removing 'if (!superuser()) ereport()' checks from the
> > functions themselves and then REVOKEing EXECUTE right from 'public' for
> > these functions in system_views.sql.
> 
> This commit revokes the execution privilege on pg_start_backup() from
> a replication role. Doesn't this affect many systems that a replication
> role is used to take a backup? This commit forces administrators of
> those systems to manually grant the privilege to a replication role
> when upgrading the system to 9.6.

That's possible, but I actually consider it pretty unlikely.  barman,
pgBackRest, and our own documentation (at least under the backup
documentation chapter; the function documentation itself does say that
replication roles could also call these functions, though it provides no
guidance as to how one would go about creating a replication role) state
that the user used must be a superuser account.  There may be
hand-hacked scripts where the user has gone to the effort of setting up
an additional non-superuser account which has the replication attribute
but, frankly, those scripts are almost certainly broken in far worse
ways when you consider the complications around setting up a proper
backup solution for PG.

Further, currently, you have to have access to read the files on the
filesystem as the postgres user as we don't allow group privileges on
the data directory, so those same hand-hacked scripts are almost
certainly logging in from the postgres user account.

Additionally, just to be clear, pg_basebackup uses the replication
protocol which doesn't go through the GRANT system, and that is
therefore not impacted by this change.

There had been discussion about the replication role during this patch's
lifetime and I had figured that at least pg_start/stop_backup were
reasonably uncontroversial changes.  I plan to propose patches tomorrow
for making the same changes to the various pg_logical_* functions, which
I expect to spark additional discussion.

I do agree that this needs to be mentioned in the release notes, along
with the other changes to pg_start/stop_backup.
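
(For an affected system, the remedy should be a single statement along the
lines of "GRANT EXECUTE ON FUNCTION pg_start_backup(text, boolean, boolean)
TO backup_role;", with the signature matching the 9.6 catalogs.)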

We could certainly hack into the pg_dump code to explicitly GRANT access
to these functions to all replication roles, but I'd really rather not
as there's no reason for a role which is issuing pg_start/stop_backup
commands to also be a replication role, with these changes.  Ultimately,
I'd like to make replication roles only able to connect using the
replication protocol.  Using it to provide access to the
pg_start/stop_backup and pg_logical_* functions really isn't a good
approach as the target user ends up with much more access than
necessary.  Propagating that into future versions doesn't strike me as a
good idea.

Thanks!

Stephen




Re: [HACKERS] Patch to implement pg_current_logfile() function

2016-04-06 Thread Karl O. Pinc
Hi Gilles,

On Wed, 23 Mar 2016 23:22:26 +0100
Gilles Darold  wrote:

> Thanks for the reminder, here is the v3 of the patch after a deeper
> review and testing. It is now registered to the next commit fest under
> the System Administration topic.

I am going to try reviewing your patch.  I don't feel entirely
confident, but should be able to provide at least some help.

I've not yet even tried building it, but the first thing I notice
is that you're going to need to use pgrename() of src/port/dirmod.c
in order to get an atomic update of the pg_log_file file.

I believe this is the right approach here.  Any other program
will always see a fully-formed file content.

The typical way this works is: you write a new file with the new
content, then rename the new file to the old file name.  The rename
makes the new file name and the old file content go away and,
effectively, replaces the content of your file atomically.
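
In other words, a sketch like this (file and variable names are purely
illustrative; pgrename() itself is the real function from src/port/dirmod.c):

    /* write the new content to a temporary file in the same directory... */
    FILE *fh = fopen("pg_log_file.tmp", "w");

    if (fh != NULL)
    {
        fprintf(fh, "%s\n", new_logfile_name);  /* illustrative variable */
        fclose(fh);

        /* ...then swap it into place; readers see either the old or the
         * new content, never a partial write */
        if (pgrename("pg_log_file.tmp", "pg_log_file") != 0)
            elog(LOG, "could not rename pg_log_file.tmp into place");
    }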

You'll want to look at other places where pg uses pgrename()
to see what sort of name you should give to the new file.
If it was me, I'd just stick a dot in front, calling it
".pg_log_file" but we want to be consistent with existing
practice.

I'd imagine that you'd create the new file in the same directory
as pg_log_file, that being the usual way to ensure that both
files are in the same file system (a requirement).  But when
you're looking at other uses of pgrename() in the existing code,
it wouldn't hurt to check what that code does regarding
placement of the new file in the file system.

If you have any guidance/scripts/whatever which support
testing your patch please send it my way.  Thanks.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein




Re: [HACKERS] Updated backup APIs for non-exclusive backups

2016-04-06 Thread Noah Misch
On Wed, Apr 06, 2016 at 09:17:22AM +0200, Magnus Hagander wrote:
> On Wed, Apr 6, 2016 at 6:42 AM, Noah Misch  wrote:
> > On Tue, Apr 05, 2016 at 08:15:16PM +0200, Magnus Hagander wrote:
> > > I've pushed this version, and also added the item from the Brussels
> > > developer meeting to actually rewrite the main backup docs to the open
> > > items so they are definitely not forgotten for 9.6.
> >
> > Here's that PostgreSQL 9.6 open item:
> >
> > * Update of backup documentation (assigned to Bruce at [
> > https://wiki.postgresql.org/wiki/FOSDEM/PGDay_2016_Developer_Meeting
> > Brussels Developer Meeting], but others are surely allowed to work on it as
> > well)
> > ** Made required by 7117685461af50f50c03f43e6a622284c8d54694 since the
> > current documentation is now incorrect
> >
> > Unless Bruce endorsed this development strategy, I think it unfair for
> > commit
> > 7117685 to impose a deadline on his backup.sgml project.  Did commit
> > 7117685
> >
> 
> I agree that's horrible wording. What it meant to imply was a deadline
> for *me* to help take that off Bruce's shoulders before then, while also
> opening the door for others to work on it should they want. Re-reading it
> now, I can see that's not at all what it says. I'll reword (or rather,
> remove most of it, since it's been discussed separately and doesn't
> actually need to go on the open items list, which is a list and not a
> discussion).

Ahh.  Your original wording did admit your interpretation; I just didn't think
of that interpretation.

> > The chapter already does describe pg_basebackup before describing
> > pg_start_backup; what else did the plan entail?

> Recommending people to primarily use tried and tested ways of doing backups
> (including a reference to a list, probably on the wiki, of known external
> tools that "do things the right way", such as backrest or barman) rather
> than focusing on examples that aren't necessarily even correct, and
> certainly not complete.
> 
> Mentioning the actual problems of exclusive base backups. (And obviously
> talk about how using pg_start_backup/pg_stop_backup now makes it possible
> to do "low level" backups without the risks of exclusive backups -- which
> is the "incomplete" part from my note.)
> 
> More clearly outlining backup vs dump, and possibly (I can't actually
> remember if we decided the second step) put base backups before pg_dump
> since the topic is backups.

Unless you especially want to self-impose the same tight resolution schedule
that 9.6 regressions experience, let's move this to the "Non-bugs" section.
Which do you prefer?  I don't think the opportunity for more documentation in
light of 7117685 constitutes a regression, and I don't want "Open Issues" to
double as a parking lot for slow-moving non-regressions.




Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2016-04-06 Thread Robert Haas
On Wed, Apr 6, 2016 at 10:04 AM, Dilip Kumar  wrote:
> On Wed, Apr 6, 2016 at 3:22 PM, Andres Freund  wrote:
>> Which scale did you initialize with? I'm trying to reproduce the
>> workload on hydra as precisely as possible...
>
> I tested with scale factor 300, shared buffer 8GB.
>
> My test script is attached with the mail (perf_pgbench_ro.sh).
>
> I have done some more test on power (same machine)

I spent a lot of time testing things on power2 today and also
discussed with Andres via IM.  Andres came up with a few ideas to
reduce the variability, which I tried.  One of those was to run the
server under numactl --interleave=all (that is, numactl
--interleave=all pg_ctl start etc.) and another was to set
kernel.numa_balancing = 0 (it was set to 1).  Neither of those things
seemed to prevent the problem of run-to-run variability.  Andres also
suggested running pgbench with "-P 1", which revealed that it was
generally possible to tell what the overall performance of a run was
going to be like within 10-20 seconds.  Runs that started out fast
stayed fast, and those that started out slower remained slower.
Therefore, long runs didn't seem to be necessary for testing, so I
switched to using 2-minute test runs launched via pgbench -T 120 -c 64
-j 64 -n -M prepared -S -P1.

After quite a bit of experimentation, Andres hit on an idea that did
succeed in drastically reducing the run-to-run variation: prewarming
all of the relations in a deterministic order before starting the
test.  I used this query:

psql -c "select sum(x.x) from (select pg_prewarm(oid) as x from
pg_class where relkind in ('i', 'r') order by oid) x;"

With that change to my test script, the results became much more
stable.  I tested four different builds of the server: commit 3fed4174
(that is, the one just before where you reported the variability began),
commit 6150a1b0 (the one where you reported it actually did begin),
master as of this morning
(actually commit cac0e366), and master + pinunpin-cas-9.patch +
0001-WIP-Avoid-the-use-of-a-separate-spinlock-to-protect-.patch +
backoff.patch (herein called "andres").  The first and last of these
have 64-byte BufferDescs, and the others have 80-byte BufferDescs.
Without prewarming, I see high and low results on *all* of these
builds, even 3fed4174.  I did nine test runs with each configuration
with and without prewarming, and here are the results.  With each
result I have reported the raw numbers, plus the median and the range
(highest result - lowest result).

-- without prewarming --
3fed4174
tps by run: 249165.928992 300958.039880 501281.083247
488073.289603 251033.038275 272086.063197
522287.023374 528855.922328 266057.502255
median: 300958.039880, range: 206687.132100

6150a1b0
tps by run: 455853.061092 438628.888369 353643.017507
419850.232715 424353.870515 440219.581180
431805.001465 237528.175877 431789.666417
median: 431789.666417, range: 218324.885215

master
tps by run: 427808.559919 366625.640433 376165.188508
441349.141152 363381.095198 352925.252345
348975.712841 446626.284308 444633.921009
median: 376165.188508, range: 97650.571467

andres
tps by run: 391123.866928 423525.415037 496103.017599
346707.246825 531194.321999 466100.337795
517708.470146 355392.837942 510817.921728
median: 466100.337795, range: 184487.075174

-- with prewarming --
3fed4174
tps by run: 413239.471751 428996.202541 488206.103941
493788.533491 497953.728875 498074.092686
501719.406720 508880.505416 509868.266778
median: 497953.728875, range: 96628.795027

6150a1b0
tps by run: 421589.299889 438765.851782 440601.270742
440649.818900 443033.460295 447317.269583
452831.480337 456387.316178 460140.159903
median: 443033.460295, range: 38550.860014

master
tps by run: 427211.917303 427796.174209 435601.396857
436581.431219 442329.269335 446438.814270
449970.003595 450085.123059 456231.229966
median: 442329.269335, range: 29019.312663

andres
tps by run: 425513.771259 429219.907952 433838.084721
451354.257738 494735.045808 495301.319716
517166.054466 531655.887669 546984.476602
median: 494735.045808, range: 121470.705343

My belief is that the first set of numbers has so much jitter that you
can't really draw any meaningful conclusions.  For two of the four
branches, the range is more than 50% of the median, which is enormous.
You could probably draw some conclusions if you took enough
measurements, but it's pretty hard. Notice that that set of numbers
makes 6150a1b0 look like a performance improvement, whereas the second
set makes it pretty clear that 6150a1b0 was a regression.

Also, the patch set is clearly winning here.  It picks up 90k TPS
median on the first set of numbers and 50k TPS median on the second
set.

It's fairly mysterious to me why there is so much jitter in the
results on this machine. By doing prewarming in a consistent fashion,
we make sure that every run puts the same disk blocks in the same
buffers.  Andres guessed 

Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-06 Thread David Rowley
On 7 April 2016 at 08:01, David Rowley  wrote:
> On 7 April 2016 at 04:05, Tom Lane  wrote:
>> Starting to look at this again.  I wonder, now that you have the generic
>> caching mechanism for remembering whether join inner sides have been
>> proven unique, is it still worth having the is_unique_join field in
>> SpecialJoinInfo?  It seems like that's creating a separate code path
>> for special joins vs. inner joins that may not be buying us much.
>> It does potentially save lookups in the unique_rels cache, if you already
>> have the SpecialJoinInfo at hand, but I'm not sure what that's worth.
>
> I quite like that field where it is, as it should make
> remove_useless_joins() a bit more efficient, as after a LEFT JOIN is
> removed, the previous code would go off and try to make sure all the
> joins are unique again, but now we cache that, and save it from having
> to bother doing that again, on joins already marked as unique.
>
> Certainly changing that would mean one less special case in
> joinpath.c, as the JOIN_LEFT case can be handled the same as the other
> cases, although it looks like probably, if I do change that, then I'd
> probably move is_innerrel_unique_for() into analyzejoins.c, and put
> the special case for JOIN_LEFT in that function, so that it calls
> specialjoin_is_unique_join(), then cache the sjinfo->min_righthand in
> the unique_rels cache if the result comes back positive, and in the
> non_unique_rels cache if negative... But it seems a bit crazy to go to
> the trouble or all that caching, when we can just throw the result in
> a struct field in the case of Special Joins.  Maybe we could just hide
> both the new joinpath.c functions in analyzejoins.c and call it quits.
> It's not as if there's no special cases for JOIN_LEFT in that file.
>
>> Also, as I'm looking around at the planner some more, I'm beginning to get
>> uncomfortable with the idea of using JOIN_SEMI this way.  It's fine so far
>> as the executor is concerned, no doubt, but there's a lot of planner
>> expectations about the use of JOIN_SEMI that we'd be violating.  One is
>> that there should be a SpecialJoinInfo for any SEMI join.  Another is that
>> JOIN_SEMI can be implemented by unique-ifying the inner side and then
>> doing a regular inner join; that's not a code path we wish to trigger in
>> these situations.  The patch might avoid tripping over these hazards as it
>> stands, but it seems fragile, and third-party FDWs could easily contain
>> code that'll be broken.  So I'm starting to feel that we'd better invent
>> two new JoinTypes after all, to make sure we can distinguish plain-join-
>> with-inner-side-known-unique from a real SEMI join when we need to.
>>
>> What's your thoughts on these matters?
>
> I was a bit uncomfortable with JOIN_INNER becoming JOIN_SEMI before,
> but that was for EXPLAIN reasons. So I do think it's better to have
> JOIN_INNER_UNIQUE and JOIN_LEFT_UNIQUE instead. I can go make that
> change, but unsure on how EXPLAIN will display these now. I'd need to
> pull out my tests if we don't have anything to show in EXPLAIN, and
> I'd really rather have tests for this.

I've attached an updated patch which introduces JOIN_INNER_UNIQUE and
JOIN_LEFT_UNIQUE. So unique inner joins no longer borrow JOIN_SEMI.
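
Schematically, relative to the existing enum in nodes.h (where the new
members end up is of course the patch's business):

    typedef enum JoinType
    {
        JOIN_INNER,         /* matching tuple pairs only */
        JOIN_LEFT,          /* pairs + unmatched LHS tuples */
        JOIN_FULL,          /* pairs + unmatched LHS + unmatched RHS */
        JOIN_RIGHT,         /* pairs + unmatched RHS tuples */
        JOIN_SEMI,          /* at most one result per LHS tuple */
        JOIN_ANTI,          /* unmatched LHS tuples only */
        JOIN_UNIQUE_OUTER,  /* LHS path must be made unique */
        JOIN_UNIQUE_INNER,  /* RHS path must be made unique */
        JOIN_INNER_UNIQUE,  /* new: inner join, inner side proven unique */
        JOIN_LEFT_UNIQUE    /* new: left join, inner side proven unique */
    } JoinType;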

I also made some changes around the is_unique_join flag in
SpecialJoinInfo. I've changed this to become optimal_jointype, which
is initially set to jointype and updated by what used to be called
mark_unique_joins(), now called optimize_outerjoin_types(). The LEFT
JOIN removal code now only bothers with special joins with
optimal_jointype as JOIN_LEFT_UNIQUE. This still saves any double work
analyzing the unique properties of LEFT JOINs. I've moved the two new
functions I put in joinpath.c into analyzejoins.c.

In EXPLAIN, I named these new join types "Unique Inner" and "Unique
Left". Going by a comment in explain.c, there's some historic reason
that we display "Hash Join" rather than "Hash Inner Join", and "Nested
Loop", rather than "Nested Loop Join" or "Nested Loop Inner Join". I
wasn't quite sure if Unique Inner Joins should use a similar shorthand
form, so I didn't touch that code. We'll get "Nested Loop Unique Inner
Join" now instead of "Nested Loop". I wasn't able to think of some
equivalent shorthand form that made sense.

I know we're close to the feature freeze, so I just want to say that
I'll be AFK starting 5:30am Friday New Zealand time (13:30 on the 8th
New York time), until Sunday ~4pm.  I really hope that if this needs
any tweaking it will be minor. I'll check my email before I leave on
Friday in the hope that if there's anything, then I can quickly fix it
up before I go, but time will be short.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


unique_joins_2016-04-07.patch
Description: Binary data


Re: [HACKERS] postgres_fdw : altering foreign table not invalidating prepare statement execution plan.

2016-04-06 Thread Kyotaro HORIGUCHI
Hi,

At Tue, 5 Apr 2016 19:46:04 +0900, Amit Langote  
wrote in <5703976c.30...@lab.ntt.co.jp>
> On 2016/04/05 18:44, Kyotaro HORIGUCHI wrote:
> > At Tue, 5 Apr 2016 14:24:52 +0900, Amit Langote wrote:
> > With this patch, making any change to foreign servers or user
> > mappings corresponding to an existing connection causes
> > disconnection.  This could be moved into core, with an API
> > like the following:
> > 
> > routine->NotifyObjectModification(ObjectAccessType access,
> >                                   Oid classId, Oid objId);
> > 
> > - Is using an object access hook (which is installed by the core
> >   itself) not bad?
> >
> > - If OK, the API above looks bad.  Any better API?
> 
> Although it helps achieve our goal here, the object access hook stuff
> seems to be targeted at different users:
> 
> /*
>  * Hook on object accesses.  This is intended as infrastructure for security
>  * and logging plugins.
>  */
> object_access_hook_type object_access_hook = NULL;

Yeah; furthermore, it doesn't notify other backends, which is
what I forgot at the time :(

So it seems reasonable for all of this to ride on the cache
invalidation mechanism.

The object class id and object id need to be adequately
communicated to FDWs, but the current invalidation mechanism
doesn't convey the latter; it is hidden in the hash value.  This
could be resolved by an additional member in PlanInvalItem, or by
resolving the OID from the hash value (which would need a catalog
scan..).

PlanCache*Callback() just does the invalidation and throws the
causal PlanInvalItem away.  If the plancache had OID lists of
foreign servers, foreign tables, user mappings or FDWs, we could
notify FDWs of an invalidation along with its causal object.


- Add some additional OID lists to CachedPlanSource, to be built
  by extract_query_dependencies.

- In PlanCache*Callback(), notify the FDW of the class id and
  object id of the causal object.


> I just noticed the following comment above GetConnection():
> 
>  *
>  * XXX Note that caching connections theoretically requires a mechanism to
>  * detect change of FDW objects to invalidate already established connections.
>  * We could manage that by watching for invalidation events on the relevant
>  * syscaches.  For the moment, though, it's not clear that this would really
>  * be useful and not mere pedantry.  We could not flush any active connections
>  * mid-transaction anyway.
> 
> 
> So, the hazard of flushing the connection mid-transaction sounds like it
> may be a show-stopper here, even if we research some approach based on
> syscache invalidation.  Although I see that your patch takes care of it.

Yeah, although cache invalidation could occur mid-transaction..
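
For concreteness, the syscache route that comment alludes to would look
roughly like the following (the callback name is made up; note that only
the hash value arrives, not the OID, which is exactly the problem above):

    static void
    fdw_inval_cb(Datum arg, int cacheid, uint32 hashvalue)
    {
        /*
         * Only the syscache hash value is available here, so the FDW can
         * at best drop connections whose entries match it, or drop all.
         */
    }

    void
    _PG_init(void)
    {
        /* register once at library load time */
        CacheRegisterSyscacheCallback(FOREIGNSERVEROID, fdw_inval_cb, (Datum) 0);
        CacheRegisterSyscacheCallback(USERMAPPINGOID, fdw_inval_cb, (Datum) 0);
    }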

> > By the way, AlterUserMapping seems missing calling
> > InvokeObjectPostAlterHook. Is this a bug?
> 
> Probably, yes.

But I don't see a proper way to "fix" this, since we don't use
it here.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] Proposal: Generic WAL logical messages

2016-04-06 Thread Michael Paquier
On Thu, Apr 7, 2016 at 12:55 AM, Andres Freund  wrote:
> On 2016-04-06 16:49:17 +0100, Simon Riggs wrote:
>> Perhaps easy to solve, but how do we test it is solved?
>
> Maybe something like
>
> -- drain
> pg_logical_slot_get_changes(...);
> -- generate message in different database, to ensure it's not processed
> -- in this database
> \c template1
> SELECT pg_logical_emit_message(...);
> \c postgres
> -- check
> pg_logical_slot_get_changes(..);
>
> It's a bit ugly to hardcode database names :/

When running installcheck, there is no way to be sure that databases
template1 and/or postgres exist on a server, so this test would fail
because of that.
-- 
Michael




Re: [HACKERS] [PATCH v12] GSSAPI encryption support

2016-04-06 Thread Tom Lane
Robbie Harwood  writes:
> Tom Lane  writes:
>> Wait a second.  So the initial connection-request packet is necessarily
>> unencrypted under this scheme?

> Yes, by necessity.  The username must be sent in the clear, even if only
> as part of the GSSAPI handshake (i.e., the GSSAPI username will appear
> in plaintext in the GSSAPI blobs which are otherwise encrypted).  GSSAPI
> performs authentication before it can start encryption.

Ugh.  I had thought we were putting work into this because it represented
something we could recommend as best practice, but now you're telling me
that it's always going to be inferior to what we have already.

> In this design, the contents of the Startup Message are the only
> non-authentication related information sent in the clear.  This
> contains: username (which we need anyway), database, application_name,
> and I add gss_encrypt.

And any other GUC value that the user has decided to send via PGOPTIONS.
Whatever the merits of assuming that the username is okay to expose to
eavesdroppers, I dislike having to assume that there are not and never
will be any GUCs whose settings are potentially security-sensitive.

I really think you need to fix this so that the true startup packet
is not sent until the connection has been encrypted.  People are used
to assuming that's true with SSL encryption, and if GSSAPI is less
secure that's likely to lead to unexpected security weaknesses somewhere
down the line.
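
For reference, the SSL precedent is a magic protocol version reserved in
pqcomm.h for the pre-startup negotiation request; a GSSAPI analog could
claim a neighboring code (the second define below is invented):

    /* existing, from src/include/libpq/pqcomm.h */
    #define NEGOTIATE_SSL_CODE PG_PROTOCOL(1234,5679)

    /* hypothetical GSSAPI counterpart */
    #define NEGOTIATE_GSS_CODE PG_PROTOCOL(1234,5680)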

regards, tom lane




Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2016-04-06 Thread Tomas Vondra

Hi,

Attached is the patch split into two parts, as proposed by Simon: 0001
just adds the stuff to the relcache, 0002 actually uses it for estimation.


On 04/04/2016 12:03 PM, Amit Langote wrote:

On 2016/04/04 17:25, Simon Riggs wrote:

The rel cache code you're adding uses a flag called "rd_fkeyvalid"
which indicates that the relcache is correctly filled. That is
confusing, since it has nothing to do with the concept of
constraint validity. We should rename that to rd_fkeycachefilled or
similar.


Maybe I'm missing something, but is a separate bool required at all
in this case? Wouldn't simply doing the following suffice?

/* Quick exit if we already computed the list. */
if (relation->rd_fkeylist)
return list_copy(relation->rd_fkeylist);

ISTM, rd_fkeyvalid is modeled on rd_indexvalid, where the latter
serves to convey more info than simply whether the index list is
valid or not, so the extra field is justified.


I think you're right. I've copied the logic for indexes, but clearly for 
foreign keys we don't need this flag. I've removed it.




Also, it seems the patch forgot to update RelationDestroyRelation().


Right. Fixed.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-add-foreign-key-info-into-relcache.patch
Description: binary/octet-stream


0002-use-fkeys-in-join-estimation.patch
Description: binary/octet-stream



Re: [HACKERS] VS 2015 support in src/tools/msvc

2016-04-06 Thread Michael Paquier
On Thu, Apr 7, 2016 at 7:44 AM, Michael Paquier wrote:
> On Thu, Apr 7, 2016 at 6:11 AM, Petr Jelinek  wrote:
>> On 06/04/16 22:50, Andrew Dunstan wrote:
>>> I have spent way too much time on this and don't have it working yet.
>>> I'm setting up a sacrificial VM from scratch in a last ditch attempt to
>>> get it working.
>>>
>>> Things to note so far:
>>>
>>>   * VS2015 appears to create version 12 solution files, not version 14,
>>> and the tools complained about version 14.
>>>   * Windows git (the successor to msysGit) apparently no longer ships
>>> with bison and flex. So if you need those (i.e. to built from git,
>>> not tarball) you're probably going to need to install the MsysDTK
>>> even if you're not using its compiler.
>>
>> It's fun to set it up yes. I do have the machine with buildfarm client ready
>> still (although now also traveling so slightly complicated to get to it) but
>> I didn't activate it yet as I don't want it to just report failures forever.
>
> Petr, did you actually try the patches I sent? I did my tests using
> the community version of Visual Studio 2015 and a full install of it.
> If I am the only one able to make them work, well, we surely have a
> problem, captain.

By the way, if I look at the vcxproj files generated in my case, they
correctly have Tools=14.0 or PlatformToolSet=v140.  Perhaps not
using Win10 changes the way things are generated.
-- 
Michael




Re: [HACKERS] VS 2015 support in src/tools/msvc

2016-04-06 Thread Petr Jelinek

On 07/04/16 00:50, Michael Paquier wrote:

On Thu, Apr 7, 2016 at 7:44 AM, Michael Paquier wrote:

On Thu, Apr 7, 2016 at 6:11 AM, Petr Jelinek  wrote:

On 06/04/16 22:50, Andrew Dunstan wrote:

I have spent way too much time on this and don't have it working yet.
I'm setting up a sacrificial VM from scratch in a last ditch attempt to
get it working.

Things to note so far:

   * VS2015 appears to create version 12 solution files, not version 14,
 and the tools complained about version 14.
   * Windows git (the successor to msysGit) apparently no longer ships
 with bison and flex. So if you need those (i.e. to built from git,
 not tarball) you're probably going to need to install the MsysDTK
 even if you're not using its compiler.


It's fun to set it up yes. I do have the machine with buildfarm client ready
still (although now also traveling so slightly complicated to get to it) but
I didn't activate it yet as I don't want it to just report failures forever.


Petr, did you actually try the patches I sent? I did my tests using
the community version of Visual Studio 2015 and a full install of it.
If I am the only one able to make them work, well, we surely have a
problem, captain.


By the way, if I look at the vcxproj files generated in my case, they
correctly have Tools=14.0 or PlatformToolSet=v140.  Perhaps not
using Win10 changes the way things are generated.



I have the community edition on Win10 as well; it worked fine there, yes.
I used just the command-line tools from it.


For me, VS2015 creates version-12 solution files only before the patches
are applied; once they are applied, it works fine.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] VS 2015 support in src/tools/msvc

2016-04-06 Thread Michael Paquier
On Thu, Apr 7, 2016 at 6:11 AM, Petr Jelinek  wrote:
> On 06/04/16 22:50, Andrew Dunstan wrote:
>> I have spent way too much time on this and don't have it working yet.
>> I'm setting up a sacrificial VM from scratch in a last ditch attempt to
>> get it working.
>>
>> Things to note so far:
>>
>>   * VS2015 appears to create version 12 solution files, not version 14,
>> and the tools complained about version 14.
>>   * Windows git (the successor to msysGit) apparently no longer ships
>> with bison and flex. So if you need those (i.e. to built from git,
>> not tarball) you're probably going to need to install the MsysDTK
>> even if you're not using its compiler.
>
> It's fun to set it up yes. I do have the machine with buildfarm client ready
> still (although now also traveling so slightly complicated to get to it) but
> I didn't activate it yet as I don't want it to just report failures forever.

Petr, did you actually try the patches I sent? I did my tests using
the community version of Visual Studio 2015 and a full install of it.
If I am the only one able to make them work, well, we surely have a
problem, captain.
-- 
Michael




Re: [HACKERS] IF (NOT) EXISTS in psql-completion

2016-04-06 Thread Tom Lane
Pavel Stehule  writes:
> 1. We want this patch - it increases the functionality of autocomplete

TBH, I do not think that is an agreed-to statement.  I concur with
Peter's comments upthread questioning how much use-case there is for
interactive completion of IF (NOT) EXISTS.  If it were a small and
uncomplicated patch, I wouldn't object ... but this is neither.

It increases the size of tab-complete.c by nearly 10%, and would
increase it more than that if it were adequately documented as to
what all these new macros and variables do.  (To take just the first
example, APPEND_COMP_CHARP and SET_COMP_CHARP not only lack any
documentation, but have been inserted between a comment documenting
some macros and those macros.  Another complaint in the same vein is
that MatchesN() no longer means even approximately what it did before,
but the comment for those macros hasn't been changed.)  On top of that,
it seems like there's an awful lot of klugery going on here.  An example
is the use of the COLLAPSE macro (hidden inside MidMatchAndRemoveN),
which seems like a seriously bad idea because it destroys data even if
the match the macro is embedded in ultimately fails.  That will create
order dependencies between match rules, which is not something we want
IMO, most especially when it's not clearly marked in the match rules
what's dependent on what.

Seeing things like "if (something-with-side-effects && false)" doesn't
fill me with any great admiration for the cleanliness of the code, either.

In short, I'm not sold that we need autocomplete for IF EXISTS,
and if the price we have to pay for it is klugery on this scale,
it's no sale.  I think this needs to be sent back for a considerable
amount of rethinking.

One thing that might be worth considering to get rid of this
side-effects-in-IF-conditions mess is to move the match rules into
a separate function so that after doing a successful match, we just
"return".  This could be implemented in most places by adding
return statements into the COMPLETE_WITH_FOO macros.  Then we would
not need the enormous else-if chain, but just simple independent
if statements, where we know a successful match will end with a
"return" instead of falling through to the next statement.  The
big advantage of that is that then you can do operations with
side-effects explicitly as separate statements, instead of having
to make them look like phony else-if cases.  So for example the
CREATE SCHEMA case might be handled like

if (Matches2("CREATE", "SCHEMA"))
{
... handle possible autocompletions of CREATE SCHEMA itself here ...

/* Else, move head match point past CREATE SCHEMA */
if ((n = find_last_index_of("CREATE")) > 0)
HEAD_SHIFT(n);
}

/*
 * Statements that can appear in CREATE SCHEMA should be considered here!
 */

if (Matches2("CREATE", "TABLE"))
... handle CREATE TABLE here ...

... handle other statements that can appear in CREATE SCHEMA here ...

After exhausting the possibilities for sub-statements of CREATE SCHEMA,
we could either return failure if we'd seen CREATE SCHEMA:

/*
 * Fail if we saw CREATE SCHEMA; no rules below here should be considered.
 */
if (head_shift > 0)
return false;

or reset the head match point before continuing with unrelated rules:

/*
 * Done considering CREATE SCHEMA sub-rules, so forget about
 * whether we saw CREATE SCHEMA.
 */
HEAD_SHIFT(0);

Immediately failing seems like the right thing for CREATE SCHEMA, but
maybe in other cases we just undo the head_shift change and keep trying.
This is still order dependent, but at least the places where we change
the match basis can be made to be fairly obvious actions instead of
being disguised as just-another-candidate-match.

I don't immediately have a better idea to replace COLLAPSE, but I really
don't like that as it stands.  I wonder whether we could dodge the need
for it by just increasing head_shift when deciding to skip over an
IF (NOT) EXISTS clause.  Otherwise, I think what I'd want to see is
some way to explicitly undo the effects of COLLAPSE when done with
rules that could benefit from it.  Or at least a coding convention
that makes it clear that you return failure once you've considered
all subsidiary rules, rather than continuing on with unrelated rules
that would have a risk of false match thanks to the data removal.

regards, tom lane




Re: [HACKERS] [patch] Proposal for \crosstabview in psql

2016-04-06 Thread Alvaro Herrera
I've been looking at this patch.  First thing was to rebase on top of
recent psql code restructuring; second, pgindent; third, reordered the
code in crosstabview.c more sensibly (had to add prototypes).  New
version attached.

Then I looked at the docs to try to figure out exactly how it works.
I'm surprised that there's not a single example added to the psql
manpage.  Please add one.

I then tested it a bit, "kick the tires" so to speak.  I noticed that
error handling is broken.  For instance, observe the query prompt after
the error:

regression=# select * from pg_class \crosstabview relnatts 
\crosstabview: missing second argument
regression-# 

At this point the query buffer contains the query (you can see it with
\e), which seems bogus to me.  The query buffer needs to be reset.
Compare \gexec:
alvherre=# select 1 \gexec
ERROR:  syntax error at or near "1"
LINE 1: 1
 ^
alvherre=# 


Also, using bogus column names as arguments causes state to get all
bogus:

alvherre=# select * from pg_class \crosstabview relnatts relkinda
Invalid column name: relkinda
alvherre=# select 1;
The query must return at least two columns to be shown in crosstab

Note that the second query is not crosstab at all, yet the error message
is entirely bogus.  This one is probably the same bug:

alvherre=# select 'one', 'two';
Invalid column name: relnatts

Apparently, once in that state, not even a successful query crosstab
display resets the state correctly:

alvherre=# select * from pg_class \crosstabview relnatts relkinda
Invalid column name: relkinda
alvherre=# select 'one' as relnatts, 'two' as relkinda \crosstabview
 relnatts | two 
--+-
 one  | X
(1 row)

alvherre=# select 1;
The query must return at least two columns to be shown in crosstab

Please fix this.


Some additional items:

* A few examples in docs.  The psql manpage should have at least two new
examples showing the crosstab features, one with the simplest case you
can think of, and another one showing all the features.

* Add regression test cases somewhere for the regression database.
Probably use "FROM tenk1 WHERE hundred < 5", which provides you with 500
rows, enough for many interesting games.  Make sure to test all the
provided features.  I would use a new psql.sql file for this.

* How did you come up with the 1600 value?  Whatever it is, please use a
#define instead of hardcoding it.

* In the "if (cont.cells[idx] != NULL && cont.cells[idx][0] != '\0')"
block (line 497 in the attached), can't we do the same thing by using
psprintf?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index d8b9a03..536141c 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -990,6 +990,113 @@ testdb=
   
 
   
+\crosstabview [
+colV
+colH
+[:scolH]
+[colG1[,colG2...]]
+] 
+
+
+Execute the current query buffer (like \g) and show
+the results inside a crosstab grid.
+The output column colV
+becomes a vertical header
+and the output column colH
+becomes a horizontal header, optionally sorted by ranking data obtained
+from scolH.
+
+colG1[,colG2...]
+is the list of output columns to project into the grid.
+By default, all output columns of the query except 
+colV and
+colH
+are included in this list.
+
+
+
+All columns can be referred to by their position (starting at 1), or by
+their name. Normal case folding and quoting rules apply on column
+names. By default,
+colV corresponds to column 1
+and colH to column 2.
+A query having only one output column cannot be viewed in crosstab, and
+colH must differ from
+colV.
+
+
+
+The vertical header, displayed as the leftmost column,
+contains the deduplicated values found in
+column colV, in the same
+order as in the query results.
+
+
+The horizontal header, displayed as the first row,
+contains the deduplicated values found in
+column colH, in
+the order of appearance in the query results.
+If specified, the optional scolH
+argument refers to a column whose values should be integer numbers
+by which colH will be sorted
+to be positioned in the horizontal header.
+
+
+
+Inside the crosstab grid,
+given a query output with N columns
+(including colV and
+colH),
+for each distinct value x of
+colH
+and each distinct value y of
+colV,
+the contents of a cell located at the intersection
+(x,y) is determined by these rules:
+  

Re: [HACKERS] [PATCH v12] GSSAPI encryption support

2016-04-06 Thread Robbie Harwood
Tom Lane  writes:

> Robbie Harwood  writes:
>> I need to flush this any time we might be doing encryption because it
>> needs to be in a separate request to _secure_write() from what follows
>> it.  We don't know whether we should be doing encryption until
>> connection parameters are parsed; to put it another way,
>> port->gss->encrypt will never be true here because it hasn't been parsed
>> out of port->gss->gss_encrypt yet.
>
> Wait a second.  So the initial connection-request packet is necessarily
> unencrypted under this scheme?  That seems like a pretty substantial
> step backwards from what happens with SSL.  Even granting that stuff
> like passwords won't be sent till later, the combination of user name
> and database name might already be useful info to an eavesdropper.
>
> I would think a design similar to the SSL one (special protocol version
> to cause encryption negotiation before the actual connection request
> is sent) would be better.

(Apologies for the wall of text that follows.  My GSSAPI encryption
support has gone through three major redesigns, so I've got a fair bit
to say about it at this point, and it's probably better that I say too
much than too little.  The short version is that GSSAPI works
differently than SSL and username is sent in the clear no matter what,
but this isn't a problem.)

Yes, by necessity.  The username must be sent in the clear, even if only
as part of the GSSAPI handshake (i.e., the GSSAPI username will appear
in plaintext in the GSSAPI blobs which are otherwise encrypted).  GSSAPI
performs authentication before it can start encryption.

In this design, the contents of the Startup Message are the only
non-authentication related information sent in the clear.  This
contains: username (which we need anyway), database, application_name,
and I add gss_encrypt.

Why does it look this way?  Fallback support.  We already have GSSAPI
authentication code in the project, and unfortunately we can't fix the
multitude of older clients that won't have encryption support, whatever
form it takes.

What if we didn't need fallback support, though?  Doing it with a
special protocol version, as in the SSL/TLS case, would cause a separate
path for authentication to occur, during which everything would look
pretty much the same, except we wouldn't send database.  We would then
complete the auth handshake, and in a separate exchange, pass in the
database information.  Only then could we perform authorization
checking.  Authorization checking is currently coupled with the
authentication as well; we would need a way to bypass the normal auth
sequence and enter encryption.

Bottom line is that designing similarly to SSL/TLS doesn't really make
sense because the two schemes work differently.  Typically, usernames
are not considered sensitive information unless one is worried about the
security of the authentication information that goes with them.  For
instance, MIT Kerberos will reveal the difference between "username not
found" and the equivalent of "bad password".  I don't know how much
admins care about the database names being in the clear; I suspect it
doesn't matter much because just knowing their names isn't enough to
connect.  Even if it's something people care about, this is still far
better than no encryption at all (which is the current GSSAPI behavior),
and would be better addressed by supporting connecting without
specifying a database immediately anyway.

>>> I'm aware that enlargeStringInfo() does check and handle the case where
>>> the length ends up >1G, but that feels a bit grotty to me- are you sure
>>> you want the generic enlargeStringInfo() to handle that case?
>
>> This is a good point.  We definitely have to draw the line somewhere; 1G
>> is a high upper bound.  Digging around the server code I don't see a
>> precedent for what's a good size to stop at.  There's
>> PG_MAX_AUTH_TOKEN_LENGTH, which is 65k, and the password length in
>> auth.c is 1k.  Beyond that, SocketBackend() calls pq_getmessage() with
>> no maximum length, which causes the enlargeStringInfo() size
>> restrictions to be the only control (wrapped in a PG_TRY()).
>
> Note that SocketBackend() only runs *after* we've accepted the user as
> authorized.  We should be a lot charier of what we're willing to accept
> before authorization, IMO.  Note MAX_STARTUP_PACKET_LENGTH, which is
> only 10K.

Correct, we're only talking about packets that are sent after
authentication+authorization are complete.  Before authentication is
complete, the GSSAPI encryption codepaths are not taken.




Re: [HACKERS] Combining Aggregates

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 22:28, David Rowley  wrote:

> On 7 April 2016 at 09:25, Simon Riggs  wrote:
> > On 5 April 2016 at 19:33, Robert Haas  wrote:
> >
> >>
> >> Committed 0002+0003 with those changes, some minor cosmetic stuff, and
> >> of course the obligatory catversion bump.  Oh, and fixed an OID
> >> conflict with the patch Magnus just committed.
> >
> >
> > Is that everything now? I don't see anything about combining aggs in the
> git
> > log and this is still showing as UnCommitted in the CF app.
>
> There's just a documentation patch and two combine functions for
> floating point aggs left now (Haribabu's patch)
>

Ah, I see, it was the commit about "multi-stage aggregation".

So SELECT sum(x), avg(x) is now optimized?

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] VS 2015 support in src/tools/msvc

2016-04-06 Thread Alvaro Herrera
Petr Jelinek wrote:

> It's fun to set it up yes. I do have the machine with buildfarm client ready
> still (although now also traveling so slightly complicated to get to it) but
> I didn't activate it yet as I don't want it to just report failures forever.

Maybe you should just activate it regardless, and we'll see it go from
red to green once things are good.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Combining Aggregates

2016-04-06 Thread David Rowley
On 7 April 2016 at 09:25, Simon Riggs  wrote:
> On 5 April 2016 at 19:33, Robert Haas  wrote:
>
>>
>> Committed 0002+0003 with those changes, some minor cosmetic stuff, and
>> of course the obligatory catversion bump.  Oh, and fixed an OID
>> conflict with the patch Magnus just committed.
>
>
> Is that everything now? I don't see anything about combining aggs in the git
> log and this is still showing as UnCommitted in the CF app.

There's just a documentation patch and two combine functions for
floating point aggs left now (Haribabu's patch)


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Combining Aggregates

2016-04-06 Thread Simon Riggs
On 5 April 2016 at 19:33, Robert Haas  wrote:


> Committed 0002+0003 with those changes, some minor cosmetic stuff, and
> of course the obligatory catversion bump.  Oh, and fixed an OID
> conflict with the patch Magnus just committed.


Is that everything now? I don't see anything about combining aggs in the
git log and this is still showing as UnCommitted in the CF app.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Truncating/vacuuming relations on full tablespaces

2016-04-06 Thread Jim Nasby

On 4/6/16 11:06 AM, Robert Haas wrote:

This is too late for 9.6 at this point and certainly requires
discussion anyway, so please add it to the next CommitFest.


If the goal here is to free up space via truncation when there's a real 
emergency, perhaps there are some other steps that should be taken as 
well. What immediately comes to mind is scanning the heap backwards and 
stopping on the first page we can't truncate.


There might be some other non-critical things we could skip in emergency 
mode, with an eye towards making it as fast as possible.


BTW, if someone really wanted to put some effort into this, it would be 
possible to better handle filling up a single filesystem that has both 
data and WAL by slowly shrinking the VM/FSM to make room in the WAL for 
vacuum records. ISTM that's a much more common problem people run into 
than filling up a separate tablespace.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] VS 2015 support in src/tools/msvc

2016-04-06 Thread Petr Jelinek

On 06/04/16 22:50, Andrew Dunstan wrote:



On 03/29/2016 09:38 PM, Robert Haas wrote:

On Tue, Mar 29, 2016 at 9:29 PM, Andrew Dunstan 
wrote:

I am currently travelling, but my intention is to deal with the
remaining
patches when I'm back home this weekend, unless someone beats me to it.

Cool.



Progress report:


I have spent way too much time on this and don't have it working yet.
I'm setting up a sacrificial VM from scratch in a last ditch attempt to
get it working.

Things to note so far:

  * VS2015 appears to create version 12 solution files, not version 14,
and the tools complained about version 14.
  * Windows git (the successor to msysGit) apparently no longer ships
with bison and flex. So if you need those (i.e. to build from git,
not tarball) you're probably going to need to install the MsysDTK
even if you're not using its compiler.



It's fun to set it up yes. I do have the machine with buildfarm client 
ready still (although now also traveling so slightly complicated to get 
to it) but I didn't activate it yet as I don't want it to just report 
failures forever.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH v12] GSSAPI encryption support

2016-04-06 Thread Tom Lane
Robbie Harwood  writes:
> I need to flush this any time we might be doing encryption because it
> needs to be in a separate request to _secure_write() from what follows
> it.  We don't know whether we should be doing encryption until
> connection parameters are parsed; to put it another way,
> port->gss->encrypt will never be true here because it hasn't been parsed
> out of port->gss->gss_encrypt yet.

Wait a second.  So the initial connection-request packet is necessarily
unencrypted under this scheme?  That seems like a pretty substantial
step backwards from what happens with SSL.  Even granting that stuff
like passwords won't be sent till later, the combination of user name
and database name might already be useful info to an eavesdropper.

I would think a design similar to the SSL one (special protocol version
to cause encryption negotiation before the actual connection request
is sent) would be better.

(If I'm totally misunderstanding the context here, my apologies.  I've
not been following this thread.)

>> I'm aware that enlargeStringInfo() does check and handle the case where
>> the length ends up >1G, but that feels a bit grotty to me- are you sure
>> you want the generic enlargeStringInfo() to handle that case?

> This is a good point.  We definitely have to draw the line somewhere; 1G
> is a high upper bound.  Digging around the server code I don't see a
> precedent for what's a good size to stop at.  There's
> PG_MAX_AUTH_TOKEN_LENGTH, which is 65k, and the password length in
> auth.c is 1k.  Beyond that, SocketBackend() calls pq_getmessage() with
> no maximum length, which causes the enlargeStringInfo() size
> restrictions to be the only control (wrapped in a PG_TRY()).

Note that SocketBackend() only runs *after* we've accepted the user as
authorized.  We should be a lot charier of what we're willing to accept
before authorization, IMO.  Note MAX_STARTUP_PACKET_LENGTH, which is
only 10K.

regards, tom lane




Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-06 Thread David Rowley
On 7 April 2016 at 08:01, David Rowley  wrote:
> On 7 April 2016 at 04:05, Tom Lane  wrote:
>> Starting to look at this again.  I wonder, now that you have the generic
>> caching mechanism for remembering whether join inner sides have been
>> proven unique, is it still worth having the is_unique_join field in
>> SpecialJoinInfo?  It seems like that's creating a separate code path
>> for special joins vs. inner joins that may not be buying us much.
>> It does potentially save lookups in the unique_rels cache, if you already
>> have the SpecialJoinInfo at hand, but I'm not sure what that's worth.
>
> I quite like that field where it is, as it should make
> remove_useless_joins() a bit more efficient: after a LEFT JOIN is
> removed, the previous code would go off and try to make sure all the
> joins are unique again, but now we cache that, and save it from having
> to bother doing that again on joins already marked as unique.
>
> Certainly changing that would mean one less special case in
> joinpath.c, as the JOIN_LEFT case can be handled the same as the other
> cases. But it looks like, if I do change that, I'd probably move
> is_innerrel_unique_for() into analyzejoins.c and put the special case
> for JOIN_LEFT in that function, so that it calls
> specialjoin_is_unique_join(), then caches sjinfo->min_righthand in
> the unique_rels cache if the result comes back positive, and in the
> non_unique_rels cache if negative... But it seems a bit crazy to go to
> the trouble of all that caching when we can just throw the result in
> a struct field in the case of special joins.  Maybe we could just hide
> both the new joinpath.c functions in analyzejoins.c and call it quits.
> It's not as if there are no special cases for JOIN_LEFT in that file.

We could also get rid of SpecialJoinInfo.is_unique_join and just store
this as optimal_jointype, which would be initialised to jointype in
make_outerjoininfo() and then set in mark_unique_joins().  This would
simplify the test in get_optimal_jointype() to perhaps:
if (IS_OUTER_JOIN(jointype)) return sjinfo->optimal_jointype;
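
Roughly sketched, that might come out like this (a sketch of the proposal
only, not existing code; the argument list of is_innerrel_unique_for() is a
guess):

static JoinType
get_optimal_jointype(PlannerInfo *root,
		     RelOptInfo *outerrel,
		     RelOptInfo *innerrel,
		     JoinType jointype,
		     SpecialJoinInfo *sjinfo,
		     List *restrictlist)
{
	/* For outer joins, mark_unique_joins() has already decided. */
	if (IS_OUTER_JOIN(jointype))
		return sjinfo->optimal_jointype;

	/* For inner joins, consult (and populate) the unique_rels caches. */
	if (is_innerrel_unique_for(root, outerrel, innerrel, restrictlist))
		return JOIN_SEMI;	/* inner side proven unique */

	return jointype;
}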

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: Covering + unique indexes.

2016-04-06 Thread Peter Geoghegan
On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan  wrote:
> Personally, I like documenting assertions, and will sometimes write
> assertions that the compiler could easily optimize away. Maybe going
> *that* far is more a matter of personal style, but I think an
> assertion about the new index tuple size being <= the old one is just
> a good idea. It's not about a problem in your code at all.

You should make index_truncate_tuple()/index_reform_tuple() promise to
always do this in its comments/contract with the caller as part of this,
IMV.

-- 
Peter Geoghegan




Re: [HACKERS] VS 2015 support in src/tools/msvc

2016-04-06 Thread Andrew Dunstan



On 03/29/2016 09:38 PM, Robert Haas wrote:

On Tue, Mar 29, 2016 at 9:29 PM, Andrew Dunstan  wrote:

I am currently travelling, but my intention is to deal with the remaining
patches when I'm back home this weekend, unless someone beats me to it.

Cool.



Progress report:


I have spent way too much time on this and don't have it working yet. 
I'm setting up a sacrificial VM from scratch in a last ditch attempt to 
get it working.


Things to note so far:

 * VS2015 appears to create version 12 solution files, not version 14,
   and the tools complained about version 14.
 * Windows git (the successor to msysGit) apparently no longer ships
   with bison and flex. So if you need those (i.e. to build from git,
   not tarball) you're probably going to need to install the MsysDTK
   even if you're not using its compiler.

cheers

andrew





Re: [HACKERS] WIP: Covering + unique indexes.

2016-04-06 Thread Peter Geoghegan
On Wed, Apr 6, 2016 at 6:15 AM, Anastasia Lubennikova
 wrote:
>> * I would like to see index_reform_tuple() assert that the new,
>> truncated index tuple is definitely <= the original (I worry about the
>> 1/3 page restriction issue). Maybe you should also change the name of
>> index_reform_tuple(), per David.
>
> Is it possible that the new tuple, containing fewer attributes than the old
> one, will have a greater size?
> Maybe you can give an example?
> I think that  Assert(indnkeyatts <= indnatts); covers this kind of errors.

I don't think it is possible, because you aren't e.g. making an
attribute's value NULL where it wasn't NULL before (making the
IndexTuple contain a NULL bitmap where it didn't before). But that's
kind of subtle, and it certainly seems worth an assertion. It could
change tomorrow, when someone optimizes heap_deform_tuple(), which has
been proposed more than once.

Personally, I like documenting assertions, and will sometimes write
assertions that the compiler could easily optimize away. Maybe going
*that* far is more a matter of personal style, but I think an
assertion about the new index tuple size being <= the old one is just
a good idea. It's not about a problem in your code at all.
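
Concretely, something as small as this at the end of the truncation
function would do (a sketch; "newtup" and "origtup" are placeholder names,
IndexTupleSize() is the existing macro from access/itup.h):

	/* A truncated tuple must never be larger than the original. */
	Assert(IndexTupleSize(newtup) <= IndexTupleSize(origtup));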

> I do not mind renaming this function, but what name would be better?
> index_truncate_tuple()?

That seems better, yes.

-- 
Peter Geoghegan




Re: [HACKERS] IF (NOT) EXISTS in psql-completion

2016-04-06 Thread Pavel Stehule
Hi

2016-04-04 7:58 GMT+02:00 Kyotaro HORIGUCHI :

> Thank you for testing. That is a silly mistake, sorry.
>
> The attached is the fixed version.
>
> # Can I add a suffix to format-patch's output files?
>
> At Sat, 2 Apr 2016 07:18:32 +0200, Pavel Stehule 
> wrote in <
> cafj8pradf3rmq3y33aer1c7woi2qss65c8bbtirnqu0zwvw...@mail.gmail.com>
> > >> Finally I settled it by replacing comma expression with logical
> > >> OR or AND expression. gcc 4.9 complains about some unused variables
> > >> in flex output but it is another issue.
> > >>
> > >> I forgot to address COMPLETE_WITH_ATTTR but it needed an overhaul
> > >> of some macros and changing the type of completion_charp. The
> > >> third patch does it but it might be unacceptable..
> > >>
> > >>
> > > something is wrong, autocomplete for CREATE TABLE IF NOT EXISTS doesn't
> > > work
> > >
> >
> > CREATE UNLOGGED/TEMP table is working.
>
> Mmm. I mistakenly refactored multi-step matching.
>
> >   else if (HeadMatches3("CREATE", MatchAny, "TABLE") &&
> >MidMatchAndRemove1(1, "TEMP|TEMPORARY|UNLOGGED") &&
> >Matches2("CREATE", "TABLE"))
> > COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tables,
> >ADDLIST1("IF NOT EXISTS"));
>
> The completion runs only for CREATE AnyKeyword TABLE when
> AnyKeyword is removable. It is wrong to do so when any of the
> prev_words[] that match the last Matches() can be filtered out
> by the first HeadMatches(). The same kind of mistake was found in
> the following syntaxes. CREATE SEQUENCE had one more mistake.
>
>
> "CREATE [UNIQUE] INDEX"
> "CREATE [TEMP] SEQUENCE"
> "CREATE [TEMP..] TABLE"
>
> It is arguable whether it is proper to suggest existing objects for a
> CREATE statement, but most of the statement is suggested anyway. It is
> semantically wrong but practically useful to know what kind of
> word should be there.
>

I tested this patch and didn't find any problems.

1. We want this patch - it increases the functionality of autocompletion - IF
(NOT) EXISTS is a relatively long phrase and completing it is nice. A next
step could be completing CREATE "OR REPLACE" FUNCTION.

2. The patch applies cleanly, with no problems compiling.

3. All regression tests passed without problems.

4. The patch respects PostgreSQL's coding style and is sufficiently commented.

5. Regression tests are not possible - this is an interactive feature.

6. No documentation changes are necessary.

7. It should not have any impact on SQL or performance.


I'll mark this patch as ready for committer

Thank you for the patch

Regards

Pavel



> regards,
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-06 Thread David Rowley
On 7 April 2016 at 04:05, Tom Lane  wrote:
> David Rowley  writes:
>> In the last patch I failed to notice that there's an alternative
>> expected results file for one of the regression tests.
>> The attached patch includes the fix to update that file to match the
>> new expected EXPLAIN output.
>
> Starting to look at this again.  I wonder, now that you have the generic
> caching mechanism for remembering whether join inner sides have been
> proven unique, is it still worth having the is_unique_join field in
> SpecialJoinInfo?  It seems like that's creating a separate code path
> for special joins vs. inner joins that may not be buying us much.
> It does potentially save lookups in the unique_rels cache, if you already
> have the SpecialJoinInfo at hand, but I'm not sure what that's worth.

I quite like that field where it is, as it should make
remove_useless_joins() a bit more efficient: after a LEFT JOIN is
removed, the previous code would go off and try to make sure all the
joins are unique again, but now we cache that, and save it from having
to bother doing that again on joins already marked as unique.

Certainly changing that would mean one less special case in
joinpath.c, as the JOIN_LEFT case can be handled the same as the other
cases. But it looks like, if I do change that, I'd probably move
is_innerrel_unique_for() into analyzejoins.c and put the special case
for JOIN_LEFT in that function, so that it calls
specialjoin_is_unique_join(), then caches sjinfo->min_righthand in
the unique_rels cache if the result comes back positive, and in the
non_unique_rels cache if negative... But it seems a bit crazy to go to
the trouble of all that caching when we can just throw the result in
a struct field in the case of special joins.  Maybe we could just hide
both the new joinpath.c functions in analyzejoins.c and call it quits.
It's not as if there are no special cases for JOIN_LEFT in that file.

> Also, as I'm looking around at the planner some more, I'm beginning to get
> uncomfortable with the idea of using JOIN_SEMI this way.  It's fine so far
> as the executor is concerned, no doubt, but there's a lot of planner
> expectations about the use of JOIN_SEMI that we'd be violating.  One is
> that there should be a SpecialJoinInfo for any SEMI join.  Another is that
> JOIN_SEMI can be implemented by unique-ifying the inner side and then
> doing a regular inner join; that's not a code path we wish to trigger in
> these situations.  The patch might avoid tripping over these hazards as it
> stands, but it seems fragile, and third-party FDWs could easily contain
> code that'll be broken.  So I'm starting to feel that we'd better invent
> two new JoinTypes after all, to make sure we can distinguish plain-join-
> with-inner-side-known-unique from a real SEMI join when we need to.
>
> What's your thoughts on these matters?

I was a bit uncomfortable with JOIN_INNER becoming JOIN_SEMI before,
but that was for EXPLAIN reasons. So I do think it's better to have
JOIN_INNER_UNIQUE and JOIN_LEFT_UNIQUE instead. I can go make that
change, but I'm unsure how EXPLAIN will display these now. I'd need to
pull out my tests if we don't have anything to show in EXPLAIN, and
I'd really rather have tests for this.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH v12] GSSAPI encryption support

2016-04-06 Thread Robbie Harwood
Stephen Frost  writes:

> Just an initial pass over the patch.

Thanks!  In the interest of brevity, if I haven't replied to something,
I plan to fix it.

>>  /*
>> - * Flush message so client will see it, except for AUTH_REQ_OK, which 
>> need
>> - * not be sent until we are ready for queries.
>> + * In most cases, we do not need to send AUTH_REQ_OK until we are ready
>> + * for queries.  However, if we are doing GSSAPI encryption, that 
>> request
>> + * must go out immediately to ensure that all messages which follow the
>> + * AUTH_REQ_OK are not grouped with it and can therefore be encrypted.
>>   */
>> -if (areq != AUTH_REQ_OK)
>> +if (areq != AUTH_REQ_OK || port->gss != NULL)
>>  pq_flush();
>>  
>>  CHECK_FOR_INTERRUPTS();
>
> Do we actually need to send pq_flush *whenever* port->gss is not null?
> Shouldn't this actually be port->gss->encrypt?

I need to flush this any time we might be doing encryption because it
needs to be in a separate request to _secure_write() from what follows
it.  We don't know whether we should be doing encryption until
connection parameters are parsed; to put it another way,
port->gss->encrypt will never be true here because it hasn't been parsed
out of port->gss->gss_encrypt yet.

I could parse it earlier, but then I need another variable in the struct
(i.e., to hold whether AUTH_REQ_OK has been sent yet) to check as well.
Additionally, it seemed to me that there might be some value
security-wise in delaying parsing of connection parameters until after
auth is complete, though of course for just a bool this may not be as
important.

>> +/* recur to send any buffered data */
>> +gss_release_buffer(, );
>> +return be_gssapi_write(port, ptr, len);
>
> This feels a bit odd to be doing, honestly.  We try to take a lot of
> care to consider low-memory situations and to be careful when it comes to
> potential for infinite recursion and there's already a way to ask for
> this function to be called again, isn't there?

This call should be okay because (1) it's a tail call and (2) we know
the buffer isn't empty.

That said, the other recursion is excessively complicated and I think
it's simpler to eliminate it entirely in favor of being called again.
I'll see what I can do.

>> +/* we know the length of the packet at this point */
>> +memcpy((char *), port->gss->buf.data, 4);
>> +input.length = ntohl(input.length);
>> +enlargeStringInfo(>gss->buf, input.length - port->gss->buf.len + 
>> 4);
>
> I'm aware that enlargeStringInfo() does check and handle the case where
> the length ends up >1G, but that feels a bit grotty to me- are you sure
> you want the generic enlargeStringInfo() to handle that case?

This is a good point.  We definitely have to draw the line somewhere; 1G
is a high upper bound.  Digging around the server code I don't see a
precedent for what's a good size to stop at.  There's
PG_MAX_AUTH_TOKEN_LENGTH, which is 65k, and the password length in
auth.c is 1k.  Beyond that, SocketBackend() calls pq_getmessage() with
no maximum length, which causes the enlargeStringInfo() size
restrictions to be the only control (wrapped in a PG_TRY()).

I think what I'm trying to get at here is that I'm open to suggestion,
but don't see a clearly better way to do this.  We could use the
PG_TRY() approach here to preserve sync, though I'm not sure it's worth
it considering PQ_RECV_BUFFER_SIZE and PQ_SEND_BUFFER_SIZE are both 8k.

>> Subject: [PATCH 3/3] GSSAPI authentication cleanup
>
> Why wouldn't this be part of patch #2?

It's a cleanup of existing code, not my new code, so I thought the
separation would make it easier to understand.  I'm fine with it as part
of #2 so long as it happens, but the impression I had gotten from
earlier reviews was that it should have even more division (i.e., move
the gss_display_status() wrappers into their own patch), not less.

Thanks,
--Robbie




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-04-06 Thread Petr Jelinek

On 06/04/16 17:55, Andres Freund wrote:

On 2016-04-06 16:49:17 +0100, Simon Riggs wrote:

Perhaps easy to solve, but how do we test it is solved?


Maybe something like

-- drain
pg_logical_slot_get_changes(...);
-- generate message in different database, to ensure it's not processed
-- in this database
\c template1
SELECT pg_logical_emit_message(...);
\c postgres
-- check
pg_logical_slot_get_changes(..);

It's a bit ugly to hardcode database names :/



The attached patch adds filtering of both database and origin. I added
tests with slightly less hardcoding than what you proposed above.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/contrib/test_decoding/expected/messages.out b/contrib/test_decoding/expected/messages.out
index 70130e9..a5b13c5 100644
--- a/contrib/test_decoding/expected/messages.out
+++ b/contrib/test_decoding/expected/messages.out
@@ -1,6 +1,5 @@
 -- predictability
 SET synchronous_commit = on;
-SET client_encoding = 'utf8';
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
  ?column? 
 --
@@ -71,9 +70,32 @@ SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'for
  message: transactional: 1 prefix: test, sz: 11 content:czechtastic
 (7 rows)
 
-SELECT 'init' FROM pg_drop_replication_slot('regression_slot');
+-- test db filtering
+CREATE DATABASE test_logical_messages;
+\set prevdb :DBNAME
+\c test_logical_messages
+SELECT 'otherdb1' FROM pg_logical_emit_message(false, 'test', 'otherdb1');
  ?column? 
 --
- init
+ otherdb1
+(1 row)
+
+SELECT 'otherdb2' FROM pg_logical_emit_message(true, 'test', 'otherdb2');
+ ?column? 
+--
+ otherdb2
+(1 row)
+
+\c :prevdb
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1');
+ data 
+--
+(0 rows)
+
+DROP DATABASE test_logical_messages;
+SELECT 'cleanup' FROM pg_drop_replication_slot('regression_slot');
+ ?column? 
+--
+ cleanup
 (1 row)
 
diff --git a/contrib/test_decoding/expected/replorigin.out b/contrib/test_decoding/expected/replorigin.out
index c0f5125..8e8889d 100644
--- a/contrib/test_decoding/expected/replorigin.out
+++ b/contrib/test_decoding/expected/replorigin.out
@@ -59,6 +59,12 @@ SELECT pg_replication_origin_session_setup('test_decoding: regression_slot');
 -- ensure we prevent duplicate setup
 SELECT pg_replication_origin_session_setup('test_decoding: regression_slot');
 ERROR:  cannot setup replication origin when one is already setup
+SELECT 'this message will not be decoded' FROM pg_logical_emit_message(false, 'test', 'this message will not be decoded');
+ ?column? 
+--
+ this message will not be decoded
+(1 row)
+
 BEGIN;
 -- setup transaction origin
 SELECT pg_replication_origin_xact_setup('0/aabbccdd', '2013-01-01 00:00');
diff --git a/contrib/test_decoding/sql/messages.sql b/contrib/test_decoding/sql/messages.sql
index 4aedb04..eba7d7a 100644
--- a/contrib/test_decoding/sql/messages.sql
+++ b/contrib/test_decoding/sql/messages.sql
@@ -1,6 +1,5 @@
 -- predictability
 SET synchronous_commit = on;
-SET client_encoding = 'utf8';
 
 SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
 
@@ -22,4 +21,17 @@ SELECT 'ignorethis' FROM pg_logical_emit_message(true, 'test', 'czechtastic');
 
 SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1');
 
-SELECT 'init' FROM pg_drop_replication_slot('regression_slot');
+-- test db filtering
+CREATE DATABASE test_logical_messages;
+\set prevdb :DBNAME
+\c test_logical_messages
+
+SELECT 'otherdb1' FROM pg_logical_emit_message(false, 'test', 'otherdb1');
+SELECT 'otherdb2' FROM pg_logical_emit_message(true, 'test', 'otherdb2');
+
+\c :prevdb
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'force-binary', '0', 'skip-empty-xacts', '1');
+
+DROP DATABASE test_logical_messages;
+
+SELECT 'cleanup' FROM pg_drop_replication_slot('regression_slot');
diff --git a/contrib/test_decoding/sql/replorigin.sql b/contrib/test_decoding/sql/replorigin.sql
index e12404e..a33e460bb 100644
--- a/contrib/test_decoding/sql/replorigin.sql
+++ b/contrib/test_decoding/sql/replorigin.sql
@@ -31,6 +31,8 @@ SELECT pg_replication_origin_session_setup('test_decoding: regression_slot');
 -- ensure we prevent duplicate setup
 SELECT pg_replication_origin_session_setup('test_decoding: regression_slot');
 
+SELECT 'this message will not be decoded' FROM pg_logical_emit_message(false, 'test', 'this message will not be decoded');
+
 BEGIN;
 -- setup transaction origin
 SELECT pg_replication_origin_xact_setup('0/aabbccdd', '2013-01-01 00:00');
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 3e80c4a..0cdb0b8 100644
--- 
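
For orientation, the shape of the check being added there is roughly this
(a sketch only; the dbId field and surrounding details are assumptions, not
quoted from the attachment):

static void
DecodeLogicalMsgOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
{
	XLogReaderState *r = buf->record;
	xl_logical_message *message;

	message = (xl_logical_message *) XLogRecGetData(r);

	/*
	 * Skip messages emitted in another database, and messages filtered
	 * out by replication origin, before handing anything to the output
	 * plugin.
	 */
	if (message->dbId != ctx->slot->data.database ||
		FilterByOrigin(ctx, XLogRecGetOrigin(r)))
		return;

	/* ... pass the message to the output plugin as before ... */
}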

Re: [HACKERS] Choosing parallel_degree

2016-04-06 Thread Julien Rouhaud
On 06/04/2016 07:38, Amit Kapila wrote:
> On Tue, Apr 5, 2016 at 11:55 PM, Julien Rouhaud
>>
>> In alter_table.sgml, I didn't comment the lock level needed to modify
>> parallel_degree since it requires an access exclusive lock for now.
>> While thinking about it, I think it's safe to use a share update
>> exclusive lock but I may be wrong.  What do you think?
>>
> 
> We need to keep AccessExclusiveLock for operations which can impact
> Select operations, which I think this one does, so let's
> retain AccessExclusiveLock for now.  If somebody else thinks we should
> not bother about Selects, then we can change it.
> 

Ok. Aren't there also some considerations about forcing replanning of
prepared statements using the table, for instance?

>>
>> I find your version better once again, but should we keep some
>> consistency between them or it's not important?
>>
> 
> I think consistency is good, but this is different from
> max_parallel_degree, so I would prefer to use something along the lines
> of what I have mentioned.
> 

Agreed, changed in attached v8 (including fix for previous mail).

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index cd234db..0eab2be 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -909,6 +909,17 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI

 

+    parallel_degree (integer)
+    
+     
+      Sets the degree of parallelism for an individual relation.  The requested
+      number of workers will be limited by .
+     
+    
+   
+
+   
     autovacuum_enabled, toast.autovacuum_enabled (boolean)
     
      
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index ea0755a..758457c 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -26,6 +26,7 @@
 #include "commands/tablespace.h"
 #include "commands/view.h"
 #include "nodes/makefuncs.h"
+#include "postmaster/postmaster.h"
 #include "utils/array.h"
 #include "utils/attoptcache.h"
 #include "utils/builtins.h"
@@ -267,6 +268,15 @@ static relopt_int intRelOpts[] =
 		0, 0, 0
 #endif
 	},
+	{
+		{
+			"parallel_degree",
+			"Number of parallel processes that can be used per executor node for this relation.",
+			RELOPT_KIND_HEAP,
+			AccessExclusiveLock
+		},
+		-1, 0, MAX_BACKENDS
+	},
 
 	/* list terminator */
 	{{NULL}}
@@ -1251,8 +1261,8 @@ fillRelOptions(void *rdopts, Size basesize,
 
 
 /*
- * Option parser for anything that uses StdRdOptions (i.e. fillfactor and
- * autovacuum)
+ * Option parser for anything that uses StdRdOptions (i.e. fillfactor,
+ * autovacuum and parallel_degree)
  */
 bytea *
 default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
@@ -1291,7 +1301,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		{"autovacuum_analyze_scale_factor", RELOPT_TYPE_REAL,
 		offsetof(StdRdOptions, autovacuum) +offsetof(AutoVacOpts, analyze_scale_factor)},
 		{"user_catalog_table", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, user_catalog_table)}
+		offsetof(StdRdOptions, user_catalog_table)},
+		{"parallel_degree", RELOPT_TYPE_INT,
+		offsetof(StdRdOptions, parallel_degree)}
 	};
 
 	options = parseRelOptions(reloptions, validate, kind, );
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index cc77ff9..38233bc 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -659,31 +659,42 @@ set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 static void
 create_parallel_paths(PlannerInfo *root, RelOptInfo *rel)
 {
-	int			parallel_threshold = 1000;
-	int			parallel_degree = 1;
+	int			parallel_threshold = 1000;
+	int			parallel_degree = 1;
 
 	/*
 	 * If this relation is too small to be worth a parallel scan, just return
-	 * without doing anything ... unless it's an inheritance child.  In that case,
-	 * we want to generate a parallel path here anyway.  It might not be worthwhile
-	 * just for this relation, but when combined with all of its inheritance siblings
-	 * it may well pay off.
+	 * without doing anything ... unless parallel_degree has been set for this
+	 * relation, or if it's an inheritance child.  In the latter case, we want
+	 * to generate a parallel path here anyway.  It might not be worthwhile
+	 * just for this relation, but when combined with all of its inheritance
+	 * siblings it 
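
For reference, a StdRdOptions-backed option like this is normally read
through a rel.h macro in the style of RelationGetFillFactor; something
like the following sketch (the macro name is illustrative, not necessarily
what the patch defines):

#define RelationGetParallelDegree(relation, defaultpd) \
	((relation)->rd_options ? \
	 ((StdRdOptions *) (relation)->rd_options)->parallel_degree : \
	 (defaultpd))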

Re: [HACKERS] insufficient qualification of some objects in dump files

2016-04-06 Thread Peter Eisentraut

On 04/05/2016 09:46 PM, Robert Haas wrote:

On Fri, Mar 18, 2016 at 11:22 AM, Robert Haas  wrote:

On Fri, Mar 18, 2016 at 2:13 AM, Michael Paquier
 wrote:

On Fri, Mar 18, 2016 at 5:38 AM, Tom Lane  wrote:

Given the lack of any other complaints about this, I'm okay with the
approach as presented.  (I haven't read the patch in detail, though.)


FWIW, I am still of the opinion that the last patch sent by Peter is
in a pretty good shape.


Great.  I've marked this patch as "Ready for Committer" in the
CommitFest application.


Peter, are you going to commit this?  Feature freeze is in just a few days.


done





Re: [HACKERS] GIN data corruption bug(s) in 9.6devel

2016-04-06 Thread Teodor Sigaev

I've tested v2, v3 and v3.1 of the patch, to see if there are any
differences. The v2 no longer applies, so I tested it on ee943004. The following
table shows the total duration of the data load, and also sizes of the two GIN
indexes.

          duration (sec)  subject   body
          --------------  -------  -------
   v2               1290    23 MB   684 MB
   v3               1360    24 MB   487 MB
   v3.1             1360    24 MB   488 MB

Thank you very much.

Hmm. v3 is just a rebased version of v2; v3.1 drops the unlock/lock cycle
during cleanup, just to become similar to Jeff's pending-lock patch. In
theory, v2 and v3 should be very close, and v3.1 should be close to
pending_lock.


I'm inclined to push v3.1, as one of the two winners by size/performance;
unlike the pending-lock patch, it doesn't change the internal logic of the
lock machinery.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/




Re: [HACKERS] [PATH] Jsonb, insert a new value into an array at arbitrary position

2016-04-06 Thread Teodor Sigaev

Thank you, pushed with Petr's notice

Dmitry Dolgov wrote:


On 6 April 2016 at 03:29, Andrew Dunstan wrote:


Yeah, keeping it but rejecting update of an existing key is my preference 
too.

cheers

andrew


Yes, it sounds quite reasonable. Here is a new version of the patch (it will throw
an error for an existing key). Is it better now?





--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/




Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 15:17, Andres Freund  wrote:

> On 2016-04-06 14:30:21 +0100, Simon Riggs wrote:
> > On 6 April 2016 at 14:15, Craig Ringer  wrote:
> >  ...
> >
> > Nice summary
> >
> > Failover slots are optional. And they work on master.
> >
> > While the other approach could also work, it will work later and still
> > require a slot on the master.
> >
> >
> > => I don't see why having Failover Slots in 9.6 would prevent us from
> > having something else later, if someone else writes it.
> >
> >
> > We don't need to add this to core. Each plugin can independently write its
> > own failover code. Works, but doesn't seem like the right approach for
> open
> > source.
> >
> >
> > => I think we should add Failover Slots to 9.6.
>
> Simon, please don't take this personal; because of the other ongoing
> thread.
>

Thanks for the review. Rational technical comments are exactly why we are
here and they are always welcome.


> For one I think this is architecturally the wrong choice.

> But even leaving that fact aside, and
> considering this a temporary solution (we can't easily remove),


As I observed above, the alternate solution doesn't sound particularly good
either, but the main point is that we wouldn't need to remove it; it can
coexist happily. I would add that I did think of the alternate solution
previously as well; this one seemed simpler, which is always key for me in
code aimed at robustness.


> there appears to have been very little code level review


That is potentially fixable. At this point I don't claim it is committable;
I only say it is important and the alternate solution is not significantly
better, so if the patch can be beaten into shape we should commit it.

I will spend some time on this and see if we have something viable, which
will be posted here for discussion, as would have happened even before our
other discussions.

Thanks for the points below

> (one early from Petr
> in [1], two by Oleksii mostly focusing on error messages [2] [3]). The
> whole patch was submitted late to the 9.6 cycle.
>
> Quickly skimming 0001 in [4] there appear to be a number of issues:
> * LWLockHeldByMe() is only for debugging, not functional differences
> * ReplicationSlotPersistentData is now in an xlog related header
> * The code and behaviour around name conflicts of slots seems pretty
>   raw, and not discussed
> * Taking spinlocks dependent on InRecovery() seems like a seriously bad
>   idea
> * I doubt that the archive based switches around StartupReplicationSlots
>   do what they intend. Afaics that'll not work correctly for basebackups
>   taken with -X, without recovery.conf
>
> That's from a ~5 minute skim, of one patch in the series.
>
>
> [1]
> http://archives.postgresql.org/message-id/CALLjQTSCHvcsF6y7%3DZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw%40mail.gmail.com
> [2]
> http://archives.postgresql.org/message-id/FA68178E-F0D1-47F6-9791-8A3E2136C119%40hintbits.com
> [3]
> http://archives.postgresql.org/message-id/9B503EB5-676A-4258-9F78-27FC583713FE%40hintbits.com
> [4]
> http://archives.postgresql.org/message-id/camsr+ye6lny2e0tbuaqb+ntvb6w-dhjaflq0-zbal7g7hjh...@mail.gmail.com
>

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Truncating/vacuuming relations on full tablespaces

2016-04-06 Thread Robert Haas
On Wed, Apr 6, 2016 at 3:32 AM, Asif Naeem  wrote:
>> Oh, I see.  I think it's probably not a good idea to skip truncating
>> those maps, but perhaps the option could be defined as making no new
>> entries, rather than ignoring them altogether, so that they never
>> grow.  It seems that both of those are defined in such a way that if
>> page Y follows page X in the heap, the entries for page Y in the
>> relation fork will follow the one for page X.  So truncating them
>> should be OK; it's just growing them that we need to avoid.
>
> Thank you Robert. PFA a basic patch; it introduces an EMERGENCY option to
> VACUUM that avoids extending any entries in the VM or FSM. It seems to
> work fine in simple test scenarios, e.g.
>
>> postgres=# create table test1 as (select generate_series(1,10));
>> SELECT 10
>> postgres=# vacuum  EMERGENCY test1;
>> VACUUM
>> postgres=# select pg_relation_filepath('test1');
>>  pg_relation_filepath
>> --
>>  base/13250/16384
>> (1 row)
>> [asif@centos66 inst_96]$ find . | grep base/13250/16384
>> ./data/base/13250/16384
>> postgres=# vacuum test1;
>> VACUUM
>> [asif@centos66 inst_96]$ find . | grep base/13250/16384
>> ./data/base/13250/16384
>> ./data/base/13250/16384_fsm
>> ./data/base/13250/16384_vm
>
>
> Please do let me know if I missed something or more information is required.
> Thanks.

This is too late for 9.6 at this point and certainly requires
discussion anyway, so please add it to the next CommitFest.  But in
the meantime, here are a few quick comments:

You should only support EMERGENCY using the new-style, parenthesized
options syntax.  That way, you won't need to make EMERGENCY a keyword.
We don't want to do that, and we especially don't want it to be
partially reserved, as you have done here.

Passing the EMERGENCY flag around in a global variable is probably not
a good idea; use parameter lists.  That's what they are for.  Also,
calling the variable that decides whether EMERGENCY was set
Extend_VM_FSM is confusing.

I see why you changed the calling convention for visibilitymap_pin()
and RecordPageWithFreeSpace(), but that's awfully invasive.  I wonder
if there's a better way to do that, like maybe having vacuumlazy.c ask
the VM and FSM for their length in pages and then not trying to use
those functions for block numbers that are too large.
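
To sketch that idea (names here are placeholders, and HEAPBLK_TO_MAPBLOCK
is currently private to visibilitymap.c, so it would need exposing or an
equivalent):

	/* Once, early in vacuum: remember how long the VM fork is now. */
	BlockNumber vm_nblocks;

	RelationOpenSmgr(onerel);
	vm_nblocks = smgrexists(onerel->rd_smgr, VISIBILITYMAP_FORKNUM) ?
		smgrnblocks(onerel->rd_smgr, VISIBILITYMAP_FORKNUM) : 0;

	/* Later, before each VM update in emergency mode: */
	if (!emergency || HEAPBLK_TO_MAPBLOCK(blkno) < vm_nblocks)
		visibilitymap_pin(onerel, blkno, &vmbuffer);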

Don't forget to update the documentation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Performance improvement for joins where outer side is unique

2016-04-06 Thread Tom Lane
David Rowley  writes:
> In the last patch I failed to notice that there's an alternative
> expected results file for one of the regression tests.
> The attached patch includes the fix to update that file to match the
> new expected EXPLAIN output.

Starting to look at this again.  I wonder, now that you have the generic
caching mechanism for remembering whether join inner sides have been
proven unique, is it still worth having the is_unique_join field in
SpecialJoinInfo?  It seems like that's creating a separate code path
for special joins vs. inner joins that may not be buying us much.
It does potentially save lookups in the unique_rels cache, if you already
have the SpecialJoinInfo at hand, but I'm not sure what that's worth.

Also, as I'm looking around at the planner some more, I'm beginning to get
uncomfortable with the idea of using JOIN_SEMI this way.  It's fine so far
as the executor is concerned, no doubt, but there's a lot of planner
expectations about the use of JOIN_SEMI that we'd be violating.  One is
that there should be a SpecialJoinInfo for any SEMI join.  Another is that
JOIN_SEMI can be implemented by unique-ifying the inner side and then
doing a regular inner join; that's not a code path we wish to trigger in
these situations.  The patch might avoid tripping over these hazards as it
stands, but it seems fragile, and third-party FDWs could easily contain
code that'll be broken.  So I'm starting to feel that we'd better invent
two new JoinTypes after all, to make sure we can distinguish plain-join-
with-inner-side-known-unique from a real SEMI join when we need to.

What's your thoughts on these matters?

regards, tom lane




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-04-06 Thread Andres Freund
On 2016-04-06 16:49:17 +0100, Simon Riggs wrote:
> Perhaps easy to solve, but how do we test it is solved?

Maybe something like

-- drain
pg_logical_slot_get_changes(...);
-- generate message in different database, to ensure it's not processed
-- in this database
\c template1
SELECT pg_logical_emit_message(...);
\c postgres
-- check
pg_logical_slot_get_changes(..);

It's a bit ugly to hardcode database names :/

Andres




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 15:29, Andres Freund  wrote:

> On 2016-04-06 10:24:52 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2016-04-06 10:15:59 -0400, Tom Lane wrote:
> > >> Well, that's something worth thinking about.  I assume that
> > >> pg_logical_slot_get_changes could be executed in a database different
> from
> > >> the one where a change was originated?
> >
> > > You can execute it, but you'll get an error:
> >
> > Oh good.  I was afraid we had an unrecognized can o' worms here.
>
> As posted nearby, there's a hole in that defense; for the messages
> only. Pretty easy to solve though.
>

My instinct was to put in a test for non-ascii text; even if we can't keep
that test, it has highlighted a hole we wouldn't have spotted for a while,
so I'll call that "good catch" then.

Perhaps easy to solve, but how do we test it is solved?

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [CommitFest App] Feature request -- review e-mail additions

2016-04-06 Thread Magnus Hagander
On Wed, Apr 6, 2016 at 2:11 PM, Magnus Hagander  wrote:

> On Wed, Mar 30, 2016 at 7:47 PM, Alvaro Herrera 
> wrote:
>
>> José Luis Tallón wrote:
>>
> > * Auto-CC the patch author on this e-mail
>> > I guess this should speed up reactions / make communication a bit
>> more
>> > fluid.
>>
>> Yes, strong +1 on this.
>>
>
> Ok, that should be easy enough to add. I've stuck it on my TODO list and
> will try to get it ASAP.
>
>
That part was indeed easy, so it's now done.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] WIP: Covering + unique indexes.

2016-04-06 Thread Anastasia Lubennikova

06.04.2016 16:15, Anastasia Lubennikova :

06.04.2016 03:05, Peter Geoghegan:

* There is some stray whitespace within RelationGetIndexAttrBitmap().
I think you should have updated it with code, though. I don't think
it's necessary for HOT updates to work, but I think it could be
necessary so that we don't need to get a row lock that blocks
non-conflict foreign key locking (see heap_update() callers). I think
you need to be careful for non-key columns within the loop in
RelationGetIndexAttrBitmap(), basically, because it seems to still go
through all columns. UPSERT also must call this code, FWIW.

* I think that a similar omission is also made for the replica
identity stuff in RelationGetIndexAttrBitmap(). Some thought is needed
on how this patch interacts with logical decoding, I guess.


Good point. Indexes are everywhere in the code.
I missed that RelationGetIndexAttrBitmap() is used not only for REINDEX.
I'll discuss it with Theodor and send an updated patch tomorrow.


As promised, the updated patch is attached.
But I'm not an expert in this area, so it needs a 'critical look'.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 9c8e308..891325d 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -100,7 +100,7 @@ static remoteConn *getConnectionByName(const char *name);
 static HTAB *createConnHash(void);
 static void createNewConnection(const char *name, remoteConn *rconn);
 static void deleteConnection(const char *name);
-static char **get_pkey_attnames(Relation rel, int16 *numatts);
+static char **get_pkey_attnames(Relation rel, int16 *indnkeyatts);
 static char **get_text_array_contents(ArrayType *array, int *numitems);
 static char *get_sql_insert(Relation rel, int *pkattnums, int pknumatts, char **src_pkattvals, char **tgt_pkattvals);
 static char *get_sql_delete(Relation rel, int *pkattnums, int pknumatts, char **tgt_pkattvals);
@@ -1485,7 +1485,7 @@ PG_FUNCTION_INFO_V1(dblink_get_pkey);
 Datum
 dblink_get_pkey(PG_FUNCTION_ARGS)
 {
-	int16		numatts;
+	int16		indnkeyatts;
 	char	  **results;
 	FuncCallContext *funcctx;
 	int32		call_cntr;
@@ -1511,7 +1511,7 @@ dblink_get_pkey(PG_FUNCTION_ARGS)
 		rel = get_rel_from_relname(PG_GETARG_TEXT_P(0), AccessShareLock, ACL_SELECT);
 
 		/* get the array of attnums */
-		results = get_pkey_attnames(rel, );
+		results = get_pkey_attnames(rel, );
 
 		relation_close(rel, AccessShareLock);
 
@@ -1531,9 +1531,9 @@ dblink_get_pkey(PG_FUNCTION_ARGS)
 		attinmeta = TupleDescGetAttInMetadata(tupdesc);
 		funcctx->attinmeta = attinmeta;
 
-		if ((results != NULL) && (numatts > 0))
+		if ((results != NULL) && (indnkeyatts > 0))
 		{
-			funcctx->max_calls = numatts;
+			funcctx->max_calls = indnkeyatts;
 
 			/* got results, keep track of them */
 			funcctx->user_fctx = results;
@@ -2023,10 +2023,10 @@ dblink_fdw_validator(PG_FUNCTION_ARGS)
  * get_pkey_attnames
  *
  * Get the primary key attnames for the given relation.
- * Return NULL, and set numatts = 0, if no primary key exists.
+ * Return NULL, and set indnkeyatts = 0, if no primary key exists.
  */
 static char **
-get_pkey_attnames(Relation rel, int16 *numatts)
+get_pkey_attnames(Relation rel, int16 *indnkeyatts)
 {
 	Relation	indexRelation;
 	ScanKeyData skey;
@@ -2036,8 +2036,8 @@ get_pkey_attnames(Relation rel, int16 *numatts)
 	char	  **result = NULL;
 	TupleDesc	tupdesc;
 
-	/* initialize numatts to 0 in case no primary key exists */
-	*numatts = 0;
+	/* initialize indnkeyatts to 0 in case no primary key exists */
+	*indnkeyatts = 0;
 
 	tupdesc = rel->rd_att;
 
@@ -2058,12 +2058,12 @@ get_pkey_attnames(Relation rel, int16 *numatts)
 		/* we're only interested if it is the primary key */
 		if (index->indisprimary)
 		{
-			*numatts = index->indnatts;
-			if (*numatts > 0)
+			*indnkeyatts = index->indnkeyatts;
+			if (*indnkeyatts > 0)
 			{
-result = (char **) palloc(*numatts * sizeof(char *));
+result = (char **) palloc(*indnkeyatts * sizeof(char *));
 
-for (i = 0; i < *numatts; i++)
+for (i = 0; i < *indnkeyatts; i++)
 	result[i] = SPI_fname(tupdesc, index->indkey.values[i]);
 			}
 			break;
diff --git a/contrib/tcn/tcn.c b/contrib/tcn/tcn.c
index 7352b29..142730a 100644
--- a/contrib/tcn/tcn.c
+++ b/contrib/tcn/tcn.c
@@ -138,9 +138,9 @@ triggered_change_notification(PG_FUNCTION_ARGS)
 		/* we're only interested if it is the primary key and valid */
 		if (index->indisprimary && IndexIsValid(index))
 		{
-			int			numatts = index->indnatts;
+			int			indnkeyatts = index->indnkeyatts;
 
-			if (numatts > 0)
+			if (indnkeyatts > 0)
 			{
 int			i;
 
@@ -150,7 +150,7 @@ triggered_change_notification(PG_FUNCTION_ARGS)
 appendStringInfoCharMacro(payload, ',');
 appendStringInfoCharMacro(payload, operation);
 
-for (i = 0; i < numatts; i++)
+for (i = 0; i < indnkeyatts; i++)
 {
 	int			colno = 

Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Fujii Masao
On Wed, Apr 6, 2016 at 11:14 PM, Amit Kapila  wrote:
> On Wed, Apr 6, 2016 at 7:03 PM, Fujii Masao  wrote:
>>
>> On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila 
>> wrote:
>> >
>> >> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile()
>> >> from pg_stat_get_wal_senders() and every backends always parse the
>> >> value
>> >> of s_s_names when the setting is changed.
>> >>
>> >
>> > That sounds appropriate, but not sure what is exact place to call it.
>>
>> Maybe just after the following ProcessConfigFile().
>>
>> -
>> /*
>> * (6) check for any other interesting events that happened while we
>> * slept.
>> */
>> if (got_SIGHUP)
>> {
>> got_SIGHUP = false;
>> ProcessConfigFile(PGC_SIGHUP);
>> }
>> -
>>
>> If we do the move, we also need to either (1) make postmaster call
>> SyncRepUpdateConfig() and pass the parsed result to any forked backends
>> via a file like write_nondefault_variables() does for EXEC_BACKEND
>> environment, or (2) make a backend call SyncRepUpdateConfig() during
>> its initialization phase so that the first call of pg_stat_replication
>> can use the parsed result. (1) seems complicated and overkill.
>> (2) may add very small overhead into the fork of a backend. It would
>> be almost negligible, though. So which logic should we adopt?
>>
>
> Won't it be possible to have an assign_* function for synchronous_standby_names
> as we have for some of the other settings like assign_XactIsoLevel and then
> call SyncRepUpdateConfig() in that function?

It's possible, but it still seems to need (1), i.e., the variable that the
assign_XXX function assigned needs to be passed to a backend via a file for
the EXEC_BACKEND environment.
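
(For concreteness, a minimal sketch of the check/assign pattern being
discussed; syncrep_parse() is a hypothetical stand-in for the s_s_names
parser, and a real check hook must allocate "extra" with guc.c's
malloc-style allocators:)

    static bool
    check_synchronous_standby_names(char **newval, void **extra, GucSource source)
    {
        SyncRepConfigData *config = syncrep_parse(*newval); /* hypothetical */

        if (config == NULL)
            return false;   /* reject a value that does not parse */

        *extra = config;    /* guc.c hands this to the assign hook */
        return true;
    }

    static void
    assign_synchronous_standby_names(const char *newval, void *extra)
    {
        SyncRepConfig = (SyncRepConfigData *) extra;    /* adopt parsed form */
    }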

Regards,

-- 
Fujii Masao




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-04-06 Thread Andres Freund
On 2016-04-06 10:24:52 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2016-04-06 10:15:59 -0400, Tom Lane wrote:
> >> Well, that's something worth thinking about.  I assume that
> >> pg_logical_slot_get_changes could be executed in a database different from
> >> the one where a change was originated?
> 
> > You can execute it, but you'll get an error:
> 
> Oh good.  I was afraid we had an unrecognized can o' worms here.

As posted nearby, there's a hole in that defense; for the messages
only. Pretty easy to solve though.

Allowing logical decoding from a different database would have a lot of
problems; primarily, we couldn't actually look up any catalog state. But
even leaving that aside, it'd end up being pretty hard to interpret
database contents without knowledge of the encoding.

Andres




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-04-06 Thread Andres Freund
On 2016-04-06 16:20:29 +0200, Andres Freund wrote:
> On 2016-04-06 10:15:59 -0400, Tom Lane wrote:
> > > In some ways it seems like the argument to pg_logical_emit_message(...) 
> > > should
> > > be 'bytea'. That'd be much more convenient for application use. But then
> > > it's a pain when using it via the text-format SQL interface calls, where
> > > we've got no sensible way to output it.
> 
> There's a bytea version.
> 
> > Well, that's something worth thinking about.  I assume that
> > pg_logical_slot_get_changes could be executed in a database different from
> > the one where a change was originated?
> 
> You can execute it, but you'll get an error:
>   if (slot->data.database != MyDatabaseId)
>   ereport(ERROR,
>   
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> (errmsg("replication slot \"%s\" was not created in this 
> database",
> NameStr(slot->data.name);

Or so I thought. A look at the code shows a lack of database filtering
in DecodeLogicalMsgOp().  I think it also misses a FilterByOrigin()
check.
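
(A sketch of the missing defense, mirroring what the other Decode*Op()
routines do; the exact variable names are assumed from decode.c context:)

    /* In DecodeLogicalMsgOp(): skip messages that belong to another
     * database, and honor origin filtering like the other routines. */
    if (message->dbId != ctx->slot->data.database ||
        FilterByOrigin(ctx, origin_id))
        return;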

Greetings,

Andres Freund




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-04-06 Thread Tom Lane
Andres Freund  writes:
> On 2016-04-06 10:15:59 -0400, Tom Lane wrote:
>> Well, that's something worth thinking about.  I assume that
>> pg_logical_slot_get_changes could be executed in a database different from
>> the one where a change was originated?

> You can execute it, but you'll get an error:

Oh good.  I was afraid we had an unrecognized can o' worms here.

regards, tom lane




Re: [HACKERS] [PATCH] Phrase search ported to 9.6

2016-04-06 Thread Dmitry Ivanov
> > It seems that everything is settled now, so here's the patch introducing
> > the '<->' and '' operators. I've made the necessary changes to docs &
> > regression tests.
> 
> I noticed that I had accidentally trimmed whitespace in the docs; this is a
> better one.

After a brief but reasonable discussion with Teodor I've come up with a more 
proper piece of code for phrase operator parsing. The patch is included below.

-- 
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-04-06 Thread Andres Freund
On 2016-04-06 10:15:59 -0400, Tom Lane wrote:
> > In some ways it seems like the argument to pg_logical_emit_message(...) 
> > should
> > be 'bytea'. That'd be much more convenient for application use. But then
> > it's a pain when using it via the text-format SQL interface calls, where
> > we've got no sensible way to output it.

There's a bytea version.

> Well, that's something worth thinking about.  I assume that
> pg_logical_slot_get_changes could be executed in a database different from
> the one where a change was originated?

You can execute it, but you'll get an error:
if (slot->data.database != MyDatabaseId)
ereport(ERROR,

(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  (errmsg("replication slot \"%s\" was not created in this 
database",
  NameStr(slot->data.name);


Greetings,

Andres Freund




Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Andres Freund
On 2016-04-06 14:30:21 +0100, Simon Riggs wrote:
> On 6 April 2016 at 14:15, Craig Ringer  wrote:
>  ...
> 
> Nice summary
> 
> Failover slots are optional. And they work on master.
> 
> While the other approach could also work, it will work later and still
> require a slot on the master.
> 
> 
> => I don't see why having Failover Slots in 9.6 would prevent us from
> having something else later, if someone else writes it.
> 
> 
We don't need to add this to core. Each plugin can independently write its
own failover code. Works, but doesn't seem like the right approach for open
source.
> 
> 
> => I think we should add Failover Slots to 9.6.

Simon, please don't take this personally; because of the other ongoing
thread.

I don't think this is commit-ready. For one I think this is
architecturally the wrong choice. But even leaving that fact aside, and
considering this a temporary solution (we can't easily remove), there
appears to have been very little code level review (one early from Petr
in [1], two by Oleksii mostly focusing on error messages [2] [3]). The
whole patch was submitted late to the 9.6 cycle.

Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty
  raw, and not discussed
* Taking spinlocks dependent on InRecovery() seems like a seriously bad
  idea
* I doubt that the archive based switches around StartupReplicationSlots
  do what they intend. Afaics that'll not work correctly for basebackups
  taken with -X, without recovery.conf

That's from a ~5 minute skim, of one patch in the series.


[1] 
http://archives.postgresql.org/message-id/CALLjQTSCHvcsF6y7%3DZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw%40mail.gmail.com
[2] 
http://archives.postgresql.org/message-id/FA68178E-F0D1-47F6-9791-8A3E2136C119%40hintbits.com
[3] 
http://archives.postgresql.org/message-id/9B503EB5-676A-4258-9F78-27FC583713FE%40hintbits.com
[4] 
http://archives.postgresql.org/message-id/camsr+ye6lny2e0tbuaqb+ntvb6w-dhjaflq0-zbal7g7hjh...@mail.gmail.com

Greetings,

Andres Freund




Re: [HACKERS] [PATCH] Phrase search ported to 9.6

2016-04-06 Thread Dmitry Ivanov
> > > It seems that everything is settled now, so here's the patch introducing
> > > the '<->' and '' operators. I've made the necessary changes to docs &
> > > regression tests.
> > 
> > I noticed that I had accidentally trimmed whitespace in the docs; this is a
> > better one.
> 
> After a brief but reasonable discussion with Teodor I've come up with a more
> proper piece of code for phrase operator parsing. The patch is included
> below.

Attached the patch. Sorry for the inconvenience.

-- 
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Companydiff --git a/contrib/tsearch2/expected/tsearch2.out b/contrib/tsearch2/expected/tsearch2.out
index 972f764..97379e7 100644
--- a/contrib/tsearch2/expected/tsearch2.out
+++ b/contrib/tsearch2/expected/tsearch2.out
@@ -278,15 +278,15 @@ SELECT '(!1|2)&3'::tsquery;
 (1 row)
 
 SELECT '1|(2|(4|(5|6)))'::tsquery;
- tsquery 
--
- '1' | ( '2' | ( '4' | ( '5' | '6' ) ) )
+   tsquery   
+-
+ '1' | '2' | '4' | '5' | '6'
 (1 row)
 
 SELECT '1|2|4|5|6'::tsquery;
- tsquery 
--
- ( ( ( '1' | '2' ) | '4' ) | '5' ) | '6'
+   tsquery   
+-
+ '1' | '2' | '4' | '5' | '6'
 (1 row)
 
 SELECT '1&(2&(4&(5&6)))'::tsquery;
@@ -340,7 +340,7 @@ select 'a' > 'b & c'::tsquery;
 select 'a | f' < 'b & c'::tsquery;
  ?column? 
 --
- t
+ f
 (1 row)
 
 select 'a | ff' < 'b & c'::tsquery;
@@ -443,9 +443,9 @@ select count(*) from test_tsquery where keyword >  'new & york';
 
 set enable_seqscan=on;
 select rewrite('foo & bar & qq & new & york',  'new & york'::tsquery, 'big & apple | nyc | new & york & city');
- rewrite  
---
- 'foo' & 'bar' & 'qq' & ( 'city' & 'new' & 'york' | ( 'nyc' | 'big' & 'apple' ) )
+   rewrite
+--
+ 'foo' & 'bar' & 'qq' & ( 'nyc' | 'big' & 'apple' | 'city' & 'new' & 'york' )
 (1 row)
 
 select rewrite('moscow', 'select keyword, sample from test_tsquery'::text );
@@ -461,9 +461,9 @@ select rewrite('moscow & hotel', 'select keyword, sample from test_tsquery'::tex
 (1 row)
 
 select rewrite('bar &  new & qq & foo & york', 'select keyword, sample from test_tsquery'::text );
-   rewrite   
--
- 'citi' & 'foo' & ( 'bar' | 'qq' ) & ( 'nyc' | ( 'big' & 'appl' | 'new' & 'york' ) )
+ rewrite 
+-
+ ( 'nyc' | 'big' & 'appl' | 'new' & 'york' ) & 'citi' & 'foo' & ( 'bar' | 'qq' )
 (1 row)
 
 select rewrite( ARRAY['moscow', keyword, sample] ) from test_tsquery;
@@ -479,9 +479,9 @@ select rewrite( ARRAY['moscow & hotel', keyword, sample] ) from test_tsquery;
 (1 row)
 
 select rewrite( ARRAY['bar &  new & qq & foo & york', keyword, sample] ) from test_tsquery;
-   rewrite   
--
- 'citi' & 'foo' & ( 'bar' | 'qq' ) & ( 'nyc' | ( 'big' & 'appl' | 'new' & 'york' ) )
+ rewrite 
+-
+ ( 'nyc' | 'big' & 'appl' | 'new' & 'york' ) & 'citi' & 'foo' & ( 'bar' | 'qq' )
 (1 row)
 
 select keyword from test_tsquery where keyword @> 'new';
@@ -520,9 +520,9 @@ select rewrite( ARRAY[query, keyword, sample] ) from test_tsquery, to_tsquery('e
 (1 row)
 
 select rewrite( ARRAY[query, keyword, sample] ) from test_tsquery, to_tsquery('english', 'bar &  new & qq & foo & york') as query where keyword <@ query;
-   rewrite   
--
- 'citi' & 'foo' & ( 'bar' | 'qq' ) & ( 'nyc' | ( 'big' & 'appl' | 'new' & 'york' ) )
+ rewrite 
+-
+ ( 'nyc' | 'big' & 'appl' | 'new' & 'york' ) & 'citi' & 'foo' & ( 'bar' | 'qq' )
 (1 row)
 
 select rewrite( ARRAY[query, keyword, sample] ) from test_tsquery, to_tsquery('english', 'moscow') as query where query @> keyword;
@@ -538,9 +538,9 @@ select rewrite( ARRAY[query, keyword, sample] ) from test_tsquery, to_tsquery('e
 (1 

Re: [HACKERS] Proposal: Generic WAL logical messages

2016-04-06 Thread Tom Lane
Craig Ringer  writes:
Interesting issue. Mainly because the "ť" char it complains about
> (utf-8 0xc5 0xa5) is accepted in the SELECT that generates the record.

Uh, no, actually it's the SELECT that's failing.

> The regress script in question sets:
> SET client_encoding = 'utf8';
> but we're apparently round-tripping the data through the database encoding
> at some point, then converting back to client_encoding for output.

The conversion to DB encoding will happen the instant the query string
reaches the database.  You can set client_encoding to whatever you want,
but the only characters that can appear in queries are those that exist
in both the client encoding and the database encoding.

> In some ways it seems like the argument to pg_logical_emit_message(...) should
> be 'bytea'. That'd be much more convenient for application use. But then
> it's a pain when using it via the text-format SQL interface calls, where
> we've got no sensible way to output it.

Well, that's something worth thinking about.  I assume that
pg_logical_slot_get_changes could be executed in a database different from
the one where a change was originated?  What's going to happen if a string
in WAL contains characters unrepresentable in that database?  Do we even
have logic in there that will attempt to perform the necessary conversion?
And it is *necessary*, not optional, if you are going to claim that the
output of pg_logical_slot_get_changes is of type text.

regards, tom lane




Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Amit Kapila
On Wed, Apr 6, 2016 at 7:03 PM, Fujii Masao  wrote:
>
> On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila 
wrote:
> >
> >> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile()
> >> from pg_stat_get_wal_senders() and every backend always parses the
> >> value of s_s_names when the setting is changed.
> >>
> >
> > That sounds appropriate, but I'm not sure what the exact place to call it is.
>
> Maybe just after the following ProcessConfigFile().
>
> -
> /*
> * (6) check for any other interesting events that happened while we
> * slept.
> */
> if (got_SIGHUP)
> {
> got_SIGHUP = false;
> ProcessConfigFile(PGC_SIGHUP);
> }
> -
>
> If we do the move, we also need to either (1) make postmaster call
> SyncRepUpdateConfig() and pass the parsed result to any forked backends
> via a file like write_nondefault_variables() does for EXEC_BACKEND
> environment, or (2) make a backend call SyncRepUpdateConfig() during
> its initialization phase so that the first call of pg_stat_replication
> can use the parsed result. (1) seems complicated and overkill.
> (2) may add very small overhead into the fork of a backend. It would
> be almost negligible, though. So which logic should we adopt?
>

Won't it be possible to have an assign_* function
for synchronous_standby_names, as we have for some of the other settings
like assign_XactIsoLevel and then call SyncRepUpdateConfig() in that
function?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2016-04-06 Thread Dilip Kumar
On Wed, Apr 6, 2016 at 3:22 PM, Andres Freund  wrote:

> Which scale did you initialize with? I'm trying to reproduce the
> workload on hydra as precisely as possible...
>

I tested with scale factor 300, shared buffer 8GB.

My test script is attached with the mail (perf_pgbench_ro.sh).

I have done some more tests on power (same machine).

Test1:

head + pinunpin-cas-9.patch + BufferDesc content lock to pointer (patch
attached: buffer_content_lock_ptr**.patch)

Ashutosh helped me in generating this patch (this is just a temp patch to see
the pin/unpin behaviour when the content lock is a pointer).

64 clients
run1  497684
run2  543366
run3  476988

128 clients
run1  740301
run2  482676
run3  474530
run4  480971
run5  757779

*Summary:*
1. With 64 clients, whether we apply only pinunpin-cas-9.patch or
pinunpin-cas-9.patch + buffer_content_lock_ptr_rebased_head_temp.patch,
the max TPS is ~550,000, with some fluctuation.

2. With 128, we saw in an earlier post that with pinunpin the max TPS was
650,000 (even after converting BufferDesc to 64 bytes it was 650,000).
Now after converting the content lock to a pointer on top of pinunpin I
get a max of ~750,000.

   - One more point to be noted: earlier it was varying from 250,000 to
650,000, but after converting the content lock to a pointer it's varying from
450,000 to 750,000.

*Test2:*

Head + buffer_content_lock_ptr_rebased_head_temp.patch

1. With this test the reading is the same as head, and I can see the same
variance in performance.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


perf_pgbench_ro.sh
Description: Bourne shell script


buffer_content_lock_ptr_rebased_head_temp.patch
Description: Binary data



Re: [HACKERS] Proposal: Generic WAL logical messages

2016-04-06 Thread Craig Ringer
Committed. https://commitfest.postgresql.org/9/468/

Buildfarm error:
http://www.postgresql.org/message-id/cab7npqrod2mxqy_5+czjvhw0whrrz6p8jv_rsblcrxrtwlh...@mail.gmail.com

Interesting issue. Mainly because the "ť" char it complains about
(utf-8 0xc5 0xa5) is accepted in the SELECT that generates the record. If
it's valid input it should be valid output, right? We didn't change the
client_encoding in the meantime.  It makes sense though:

initdb on that animal says:

The database cluster will be initialized with locale "English_United
States.1252".
The default database encoding has accordingly been set to "WIN1252".


The regress script in question sets:

SET client_encoding = 'utf8';

but we're apparently round-tripping the data through the database encoding
at some point, then converting back to client_encoding for output.

Presumably that's when we're forming the text 'data' column in the
tuplestore produced by the get changes function, which will be in the
database encoding.
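
(If that's right, the hop back out is the ordinary text-output path,
roughly this sketch; pg_server_to_client() converts from the server
encoding to client_encoding, returning its argument unchanged when no
conversion is needed:)

    char   *out = pg_server_to_client(data, strlen(data));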

So setting client_encoding is not sufficient to make this work and the
non-7-bit-ascii part should be removed from the test, since it's not
allowed on all machines.


In some ways it seems like the argument to pg_logical_emit_message(...) should
be 'bytea'. That'd be much more convenient for application use. But then
it's a pain when using it via the text-format SQL interface calls, where
we've got no sensible way to output it.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Fujii Masao
On Wed, Apr 6, 2016 at 8:59 PM, Amit Kapila  wrote:
> On Wed, Apr 6, 2016 at 11:17 AM, Fujii Masao  wrote:
>>
>> On Tue, Apr 5, 2016 at 11:40 PM, Amit Kapila 
>> wrote:
>> >>
>> >> > 2.
>> >> > pg_stat_get_wal_senders()
>> >> > {
>> >> > ..
>> >> > /*
>> >> > ! * Allocate and update the config data of synchronous replication,
>> >> > ! * and then get the currently active synchronous standbys.
>> >> >   */
>> >> > + SyncRepUpdateConfig();
>> >> >   LWLockAcquire(SyncRepLock, LW_SHARED);
>> >> > ! sync_standbys = SyncRepGetSyncStandbys();
>> >> >   LWLockRelease(SyncRepLock);
>> >> > ..
>> >> > }
>> >> >
>> >> > Why is it important to update the config with patch?  Earlier also
>> >> > any
>> >> > update to config between calls wouldn't have been visible.
>> >>
>> >> Because a backend has no chance to call SyncRepUpdateConfig() and
>> >> parse the latest value of s_s_names if SyncRepUpdateConfig() is not
>> >> called here. This means that pg_stat_replication may return the
>> >> information
>> >> based on the old value of s_s_names.
>> >>
>> >
> > That's right, but without this patch also, won't pg_stat_replication
> > show old information? If no, why so?
>>
>> Without the patch, when s_s_names is changed and SIGHUP is sent,
>> a backend calls ProcessConfigFile(), parse the configuration file and
>> set the global variable SyncRepStandbyNames to the latest value of
>> s_s_names. When pg_stat_replication is accessed, a backend calculates
>> which standby is synchronous based on that latest value in
>> SyncRepStandbyNames,
>> and then displays the information of sync replication.
>>
>> With the patch, basically the same steps are executed when s_s_names is
>> changed. But the difference is that, with the patch, SyncRepUpdateConfig()
>> must be called after ProcessConfigFile() is called before the calculation
>> of
>> sync standbys. So I just added the call of SyncRepUpdateConfig() to
>> pg_stat_get_wal_senders().
>>
>
> Then why call it just in pg_stat_get_wal_senders()? Isn't it better if we
> call it always after ProcessConfigFile() (after setting SyncRepStandbyNames)?
>
>> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile()
>> from pg_stat_get_wal_senders() and every backends always parse the value
>> of s_s_names when the setting is changed.
>>
>
> That sounds appropriate, but I'm not sure what the exact place to call it is.

Maybe just after the following ProcessConfigFile().

-
/*
* (6) check for any other interesting events that happened while we
* slept.
*/
if (got_SIGHUP)
{
got_SIGHUP = false;
ProcessConfigFile(PGC_SIGHUP);
}
-

If we do the move, we also need to either (1) make postmaster call
SyncRepUpdateConfig() and pass the parsed result to any forked backends
via a file like write_nondefault_variables() does for EXEC_BACKEND
environment, or (2) make a backend call SyncRepUpdateConfig() during
its initialization phase so that the first call of pg_stat_replication
can use the parsed result. (1) seems complicated and overkill.
(2) may add very small overhead into the fork of a backend. It would
be almost negligible, though. So which logic should we adopt?

Regards,

-- 
Fujii Masao




Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 14:15, Craig Ringer  wrote:
 ...

Nice summary

Failover slots are optional. And they work on master.

While the other approach could also work, it will work later and still
require a slot on the master.


=> I don't see why having Failover Slots in 9.6 would prevent us from
having something else later, if someone else writes it.


We don't need to add this to core. Each plugin can independently write is
own failover code. Works, but doesn't seem like the right approach for open
source.


=> I think we should add Failover Slots to 9.6.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Andres Freund
On 2016-04-06 14:00:20 +0100, Simon Riggs wrote:
> On 6 April 2016 at 13:48, Michael Paquier  wrote:
> 
> >
> > Yeah... We have reached a clear consensus about the way things should
> > be done after quite a lot of discussions that has gone for a couple of
> > months. And Andres' design on the matter is what is getting approval
> > from everybody who has worked toward addressing this issue while
> > really taking care of the problem at its root.

> Clearly I am not included in your use of the words "we" and "everybody" !

Uh. You're not serious, right? Obviously that refers to the months of
discussions that already had happened. Where you didn't participate,
hence obviously weren't included in "we".

Andres




Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Andres Freund
On 2016-04-06 13:50:24 +0100, Simon Riggs wrote:
> On 6 April 2016 at 13:27, Andres Freund  wrote:
> 
> > On 2016-04-06 13:11:40 +0100, Simon Riggs wrote:
> > > On 6 April 2016 at 10:09, Andres Freund  wrote:
> > > > On 2016-04-06 10:04:42 +0100, Simon Riggs wrote:
> > > > The issue there is that we continue to issue checkpoints if the only
> > > > activity since the last checkpoint was emitting a standby
> > > > snapshot. That's because:
> > > >
> > >
> > > I agree this is the current situation in 9.4 and 9.5, hence the bug
> > report.
> > >
> > > This no longer occurs with the patches I have proposed. The snapshot is
> > > skipped, so a further checkpoint never triggers.
> >
> > Not if there's a longrunning/idle transaction.
> >
> > Note that skipping the snapshot is actually a *problem* in some
> > cases. As I've brought up upthread, to which you never replied. A
> > xl_running_xacts->xcnt == 0/!overflowed snapshot can be very important
> > for hot standby, because it allows ProcArrayApplyRecoveryInfo() to
> > switch to INITIALIZED state:
> >
> 
> I replied by posting a patch to address your concern; how is that a non-reply?

It doesn't address the problem? It's irrelevant that the last snapshot
had 0 xacts, if you start recovery from a later check/restartpoint;
recovery won't process earlier running_xacts records.


> > This obviously doesn't apply to WAL_LEVEL_LOGICAL as is (the if). And it
> > also obviously repeats to log the same snapshot in a system where the
> > state hasn't changed, but where running->xcnt != 0 or nlocks != 0.

> My understanding from your previous comments was that it would be incorrect
> to do that.

I said:
> For one it breaks cleanup with logical decoding which does *NEED* to
> know that nothing is happening. Although only once, not repeatedly.

The salient point is "Although only once, not repeatedly.". Which is
pretty much the same thing as for HS: to become consistent after a
checkpoint.


> Not true. I have listened to everything you've said and been patient with
> the high number of mistakes in your replies.

Simon, this is utterly ridiculous. Missing an if in a post-commit
review, of a hastily committed patch, which hasn't previously been
posted for review, is entirely normal.

Greetings,

Andres Freund




Re: [HACKERS] WIP: Covering + unique indexes.

2016-04-06 Thread Anastasia Lubennikova

06.04.2016 03:05, Peter Geoghegan:

On Tue, Apr 5, 2016 at 1:31 PM, Peter Geoghegan  wrote:

My new understanding: The extra "included" columns are stored in the
index, but do not affect its sort order at all. They are no more part
of the key than, say, the heap TID that the key points to. They are
just "payload".


It was a really long and complicated discussion. I'm glad that we are
finally in agreement about the patch.
Anyway, I think all the questions raised will be very helpful for
future work on the b-tree.



Noticed a few issues following another pass:

* tuplesort.c should handle the CLUSTER case in the same way as the
btree case. No?

Yes, I just missed that cluster uses index sort.  Fixed.


* Why have a RelationGetNumberOfAttributes(indexRel) call in
tuplesort_begin_index_btree() at all now?

Fixed.

* This critical section is unnecessary, because this happens during
index builds:

+   if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+   {
+       /*
+        * It's essential to truncate High key here.
+        * The purpose is not just to save more space on this particular page,
+        * but to keep whole b-tree structure consistent. Subsequent insertions
+        * assume that hikey is already truncated, and so they should not
+        * worry about it, when copying the high key into the parent page
+        * as a downlink.
+        * NOTE It is not crutial for reliability in present,
+        * but maybe it will be that in the future.
+        * NOTE this code will be changed by the "btree compression" patch,
+        * which is in progress now.
+        */
+       keytup = index_reform_tuple(wstate->index, oitup,
+                                   indnatts, indnkeyatts);
+
+       /* delete "wrong" high key, insert keytup as P_HIKEY. */
+       START_CRIT_SECTION();
+       PageIndexTupleDelete(opage, P_HIKEY);
+
+       if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY))
+           elog(ERROR, "failed to rewrite compressed item in index \"%s\"",
+                RelationGetRelationName(wstate->index));
+       END_CRIT_SECTION();
+   }

Note that START_CRIT_SECTION() promotes any ERROR to PANIC, which
isn't useful here, because we have no buffer lock held, and nothing
must be WAL-logged.
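
(A minimal sketch of the same block with the critical section dropped,
everything else as in the quoted patch:)

    keytup = index_reform_tuple(wstate->index, oitup,
                                indnatts, indnkeyatts);

    /* delete "wrong" high key, insert keytup as P_HIKEY */
    PageIndexTupleDelete(opage, P_HIKEY);
    if (!_bt_pgaddtup(opage, IndexTupleSize(keytup), keytup, P_HIKEY))
        elog(ERROR, "failed to rewrite compressed item in index \"%s\"",
             RelationGetRelationName(wstate->index));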

* Think you forgot to update spghandler(). (You did not add a test for
just that one AM, either)

Fixed.

* I wonder why this restriction needs to exist:

+   else
+       elog(ERROR, "Expressions are not supported in included columns.");

What does not supporting it buy us? Was it just that the pg_index
representation is more complicated, and you wanted to put it off?

An error like this should use ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED ..., btw.
Yes, you got it right. It was a bit complicated to implement, and I
decided to delay it to the next patch.

errmsg is fixed.
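
(For the archives, the corrected report would look something like this
sketch:)

    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("expressions are not supported in included columns")));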


* I would like to see index_reform_tuple() assert that the new,
truncated index tuple is definitely <= the original (I worry about the
1/3 page restriction issue). Maybe you should also change the name of
index_reform_tuple(), per David.
Is it possible that the new tuple, containing fewer attributes than the
old one, will have a greater size?

Maybe you can give an example?
I think that Assert(indnkeyatts <= indnatts); covers this kind of error.
I don't mind renaming this function, but what name would be better?
index_truncate_tuple()?
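
(Something like this sketch would do, whatever the function ends up
being called; the argument list follows the patch's index_reform_tuple():)

    keytup = index_truncate_tuple(wstate->index, oitup,
                                  indnatts, indnkeyatts);

    /* a truncated copy must never be larger than the original */
    Assert(IndexTupleSize(keytup) <= IndexTupleSize(oitup));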


* There is some stray whitespace within RelationGetIndexAttrBitmap().
I think you should have updated it with code, though. I don't think
it's necessary for HOT updates to work, but I think it could be
necessary so that we don't need to get a row lock that blocks
non-conflict foreign key locking (see heap_update() callers). I think
you need to be careful for non-key columns within the loop in
RelationGetIndexAttrBitmap(), basically, because it seems to still go
through all columns. UPSERT also must call this code, FWIW.

* I think that a similar omission is also made for the replica
identity stuff in RelationGetIndexAttrBitmap(). Some thought is needed
on how this patch interacts with logical decoding, I guess.


Good point. Indexes are everywhere in the code.
I missed that RelationGetIndexAttrBitmap() is used not only for REINDEX.
I'll discuss it with Teodor and send an updated patch tomorrow.


* Valgrind shows an error with an aggregate statement I tried:

2016-04-05 17:01:31.129 PDT 12310 LOG:  statement: explain analyze
select count(*) from ab  where b > 5 group by a, b;
==12310== Invalid read of size 4
==12310==at 0x656615: match_clause_to_indexcol 

Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Craig Ringer
A few thoughts on failover slots vs the alternative of pushing catalog_xmin
up to the master via a replica's slot and creating independent slots on
replicas.


Failover slots:
---

+ Failover slots are very easy for applications. They "just work" and are
transparent for failover. This is great especially for things that aren't
complex replication schemes and just want to use logical decoding.

+ Applications don't have to know what replicas exist or be able to reach
them; transparent failover is easier.

- Failover slots can't be used from a cascading standby (where we can fail
down to the standby's own replicas) because they have to write WAL to
advance the slot position. They'd have to send the slot position update
"up" to the master then wait to replay it. Not a disaster, though they'd do
extra work on reconnect until a restart_lsn update replayed. Would require
a whole new feedback-like message on the rep protocol, and couldn't work at
all with archive replication. Ugly as hell.

+ Failover slots exist now, and could be added to 9.6.

- The UI for failover slots can't be re-used for the catalog_xmin push-up
approach to allow replay from failover slots on cascading standbys in 9.7+.
There'd be no way to propagate the creation of failover slots "down" the
replication hierarchy that way, especially to archive standbys, as
failover slots will do. So it'd be semantically different and couldn't
re-use the FS UI. We'd be stuck with failover slots even if we also did the
other way later.

+ Will work for recovery of a master PITR-restored up to the latest
recovery point



Independent slots on replicas + catalog_xmin push-up
---

With this approach we allow creation of replication slots on a replica
independently of the master. The replica is required to connect to the
master via a slot. We send feedback to the master to advance the replica's
slot on the master to the confirmed_lsn of the most-behind slot on the
replica, therefore pinning master's catalog_xmin where needed. Or we just
send a new feedback message type that directly sets a catalog_xmin on the
replica's physical slot in the master. Slots are _not_ cloned from master
to replica automatically.


- More complicated for applications to use. They have to create a slot on
each replica that might be failed over to as well as the master and have to
advance all those slots to stop the master from suffering severe catalog
bloat.  (But see note below).

- Applications must be able to connect to failover-candidate standbys and
know where they are, it's not automagically handled via WAL.  (But see note
below).

- Applications need reconfiguration whenever a standby is rebuilt, moved,
etc. (But see note below).

- Cannot work at all for archive-based replication, requires a slot from
replica to master.

+ Works with replay from cascading standbys

+ Actually solves one of the problems making logical slots on standbys
unsupported at the moment by giving us a way to pin the master's
catalog_xmin to that needed by a replica.

- Won't work for a standby PITR-restored up to latest.

- Vapourware with zero hope for 9.6


Note: I think the application complexity issues can be solved - to a degree
- by having the replicas run a bgworker based helper that connects to the
master and clones the master's slots then advances them automatically.
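
(A sketch of how such a helper could be registered, using the 9.6-era
bgworker API; slot_sync_main is an assumed entry point, not existing code:)

    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
                       BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_ConsistentState; /* run in recovery */
    worker.bgw_restart_time = 5;                           /* seconds */
    worker.bgw_main = slot_sync_main;       /* hypothetical entry point */
    snprintf(worker.bgw_name, BGW_MAXLEN, "slot sync helper");
    RegisterBackgroundWorker(&worker);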


Do nothing
---

Drop the idea of being able to follow physical failover on logical slots.

I've already expressed why I think this is a terrible idea. It's hostile to
application developers who'd like to use logical decoding. It makes
integration of logical replication with existing HA systems much harder. It
means we need really solid, performant, well-tested and mature logical rep
based HA before we can take logical rep seriously, which is a long way out
given that we can't do decoding of in-progress xacts, ddl, sequences, 
etc.

Some kind of physical HA for logical slots is needed and will be needed for
some time. Logical rep will be great for selective replication, replication
over WAN, filtered/transformed replication etc. Physical rep is great for
knowing you'll get exactly the same thing on the replica that you have on
the master and it'll Just Work.

In any case, "Do nothing" is the same for 9.6 as pursuing the catalog_xmin
push-up idea; in both cases we don't commit anything in 9.6.


Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 13:48, Michael Paquier  wrote:

>
> Yeah... We have reached a clear consensus about the way things should
> be done after quite a lot of discussions that has gone for a couple of
> months. And Andres' design on the matter is what is getting approval
> from everybody who has worked toward addressing this issue while
> really taking care of the problem at its root.


Clearly I am not included in your use of the words "we" and "everybody" !

What you mean is that you and Andres agree, so using the word consensus is
not appropriate and certainly not clear. We did all agree on the point that
the earlier fix cannot be backpatched, so if it is as grievous a problem as
described, many users will not now benefit.

I will revert my earlier patch now.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 13:27, Andres Freund  wrote:

> On 2016-04-06 13:11:40 +0100, Simon Riggs wrote:
> > On 6 April 2016 at 10:09, Andres Freund  wrote:
> > > On 2016-04-06 10:04:42 +0100, Simon Riggs wrote:
> > > The issue there is that we continue to issue checkpoints if the only
> > > activity since the last checkpoint was emitting a standby
> > > snapshot. That's because:
> > >
> >
> > I agree this is the current situation in 9.4 and 9.5, hence the bug
> report.
> >
> > This no longer occurs with the patches I have proposed. The snapshot is
> > skipped, so a further checkpoint never triggers.
>
> Not if there's a longrunning/idle transaction.
>
> Note that skipping the snapshot is actually a *problem* in some
> cases. As I've brought up upthread, to which you never replied. A
> xl_running_xacts->xcnt == 0/!overflowed snapshot can be very important
> for hot standby, because it allows ProcArrayApplyRecoveryInfo() to
> switch to INITIALIZED state:
>

I replied by posting a patch to address your concern; how is that a non-reply?


> > > > What issue is that? Previously you said it must not skip it at all
> for
> > > > logical.
> > >
> > > It's fine to skip the records iff nothing important has happened since
> > > the last time a snapshot has been logged. Again, the proposed approach
> > > allowed to detect that.
>
> > Agreed, both proposals do that.
>
> No, the currently committed patch, even including your proposed followup
> patch, doesn't do that. As you just commented about as quoted above. The
> code currently reads like:
>
> if (wal_level < WAL_LEVEL_LOGICAL)
> {
> LWLockRelease(ProcArrayLock);
>
> /*
>  * Don't bother to log anything if nothing is happening,
> if we are
>  * using archive_timeout > 0 and we didn't overflow
> snapshot last time.
>  *
>  * This ensures that we don't issue an empty WAL record,
> which can
>  * be annoying when used in conjunction with archive
> timeout.
>  */
> if (running->xcnt == 0 &&
> nlocks == 0 &&
> XLogArchiveTimeout > 0 &&
> !last_snapshot_overflowed)
> {
> LWLockRelease(XidGenLock);
> return InvalidXLogRecPtr;
> }
>
> last_snapshot_overflowed = running->subxid_overflow;
> }
>
> This obviously doesn't apply to WAL_LEVEL_LOGICAL as is (the if). And it
> also obviously repeats to log the same snapshot in a system where the
> state hasn't changed, but where running->xcnt != 0 or nlocks != 0.


My understanding from your previous comments was that it would be incorrect
to do that.


> > > > > We now log more WAL with
> > > > > XLogArchiveTimeout > 0 than without.
> > >
> > > > And the problem with that is what?
> > >
> > > That an idle system unnecessarily produces WAL? Waking up disks and
> > > everything?
> > >
> >
> > The OP discussed a problem with archive_timeout > 0. Are you saying we
> > should put in a fix that applies whatever the setting of archive_timeout?
>
> Yes. We should strive to fix the full scope of an issue; not just the
> part complained about.
>

I believe that's what I did. You didn't mention that point earlier, nor do
I now think it important.


> You seem to ignore valid points made against your approach, apparently
> because you dismiss them as "emotive language". Stop.
>

Not true. I have listened to everything you've said and been patient with
the high number of mistakes in your replies.

Calling something a "bandaid" is not a "valid point" against it.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Michael Paquier
On Wed, Apr 6, 2016 at 9:27 PM, Andres Freund  wrote:
> On 2016-04-06 13:11:40 +0100, Simon Riggs wrote:
>> On 6 April 2016 at 10:09, Andres Freund  wrote:
>> > On 2016-04-06 10:04:42 +0100, Simon Riggs wrote:
>> > The issue there is that we continue to issue checkpoints if the only
>> > activity since the last checkpoint was emitting a standby
>> > snapshot. That's because:
>> >
>>
>> I agree this is the current situation in 9.4 and 9.5, hence the bug report.
>>
>> This no longer occurs with the patches I have proposed. The snapshot is
>> skipped, so a further checkpoint never triggers.
>
> Not if there's a longrunning/idle transaction.
>
> Note that skipping the snapshot is actually a *problem* in some
> cases. As I've brought up upthread, to which you never replied. A
> xl_running_xacts->xcnt == 0/!overflowed snapshot can be very important
> for hot standby, because it allows ProcArrayApplyRecoveryInfo() to
> switch to INITIALIZED state:
> if (standbyState == STANDBY_SNAPSHOT_PENDING)
> {
> /*
>  * If the snapshot isn't overflowed or if its empty we can 
> reset our
>  * pending state and use this snapshot instead.
>  */
> if (!running->subxid_overflow || running->xcnt == 0)
> {
> /*
>  * If we have already collected known assigned xids, 
> we need to
>  * throw them away before we apply the recovery 
> snapshot.
>  */
> KnownAssignedXidsReset();
> standbyState = STANDBY_INITIALIZED;
> }

Yes. Such a snapshot allows a hot standby to initialize more quickly...
And that's useful in some cases if the recovery is on a pending
snapshot (bgwriter standby snapshots help a lot with that, btw).

>> > > > We now log more WAL with
>> > > > XLogArchiveTimeout > 0 than without.
>> >
>> > > And the problem with that is what?
>> >
>> > That an idle system unnecessarily produces WAL? Waking up disks and
>> > everything?
>> >
>>
>> The OP discussed a problem with archive_timeout > 0. Are you saying we
>> should put in a fix that applies whatever the setting of archive_timeout?
>
> Yes. We should strive to fix the full scope of an issue; not just the
> part complained about.
>
> You seem to ignore valid points made against your approach, apparently
> because you dismiss them as "emotive language". Stop.

Yeah... We have reached a clear consensus about the way things should
be done after quite a lot of discussion that has gone on for a couple of
months. And Andres' design on the matter is what is getting approval
from everybody who has worked toward addressing this issue while
really taking care of the problem at its root. The patch, as currently
proposed, is a bandaid, and has been committed to the surprise of
everybody without any discussion.

Please let's revert this patch and really move toward fixing this
problem... which is, in short, a way to fix the decision-making
for checkpoint skipping.
-- 
Michael




Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Andres Freund
On 2016-04-06 13:11:40 +0100, Simon Riggs wrote:
> On 6 April 2016 at 10:09, Andres Freund  wrote:
> > On 2016-04-06 10:04:42 +0100, Simon Riggs wrote:
> > The issue there is that we continue to issue checkpoints if the only
> > activity since the last checkpoint was emitting a standby
> > snapshot. That's because:
> >
> 
> I agree this is the current situation in 9.4 and 9.5, hence the bug report.
> 
> This no longer occurs with the patches I have proposed. The snapshot is
> skipped, so a further checkpoint never triggers.

Not if there's a longrunning/idle transaction.

Note that skipping the snapshot is actually a *problem* in some
cases. As I've brought up upthread, to which you never replied. A
xl_running_xacts->xcnt == 0/!overflowed snapshot can be very important
for hot standby, because it allows ProcArrayApplyRecoveryInfo() to
switch to INITIALIZED state:
if (standbyState == STANDBY_SNAPSHOT_PENDING)
{
/*
 * If the snapshot isn't overflowed or if its empty we can 
reset our
 * pending state and use this snapshot instead.
 */
if (!running->subxid_overflow || running->xcnt == 0)
{
/*
 * If we have already collected known assigned xids, we 
need to
 * throw them away before we apply the recovery 
snapshot.
 */
KnownAssignedXidsReset();
standbyState = STANDBY_INITIALIZED;
}



> > > What issue is that? Previously you said it must not skip it at all for
> > > logical.
> >
> > It's fine to skip the records iff nothing important has happened since
> > the last time a snapshot has been logged. Again, the proposed approach
> > allowed to detect that.

> Agreed, both proposals do that.

No, the currently committed patch, even including your proposed followup
patch, doesn't do that. As you just commented about as quoted above. The
code currently reads like:

if (wal_level < WAL_LEVEL_LOGICAL)
{
LWLockRelease(ProcArrayLock);

/*
 * Don't bother to log anything if nothing is happening, if we 
are
 * using archive_timeout > 0 and we didn't overflow snapshot 
last time.
 *
 * This ensures that we don't issue an empty WAL record, which 
can
 * be annoying when used in conjunction with archive timeout.
 */
if (running->xcnt == 0 &&
nlocks == 0 &&
XLogArchiveTimeout > 0 &&
!last_snapshot_overflowed)
{
LWLockRelease(XidGenLock);
return InvalidXLogRecPtr;
}

last_snapshot_overflowed = running->subxid_overflow;
}

This obviously doesn't apply to WAL_LEVEL_LOGICAL as is (the if). And it
also obviously re-logs the same snapshot in a system where the
state hasn't changed, but where running->xcnt != 0 or nlocks != 0.
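
(A sketch of the shape being argued for instead; GetProgressRecPtr() is
a hypothetical helper, per the competing patch series, returning the LSN
of the last "important" record:)

    /*
     * Skip logging a snapshot, at any wal_level, if no important WAL
     * record has been inserted since the last snapshot we logged.
     */
    if (GetProgressRecPtr() <= last_snapshot_lsn)
    {
        LWLockRelease(XidGenLock);
        return InvalidXLogRecPtr;
    }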

> > > > We now log more WAL with
> > > > XLogArchiveTimeout > 0 than without.
> >
> > > And the problem with that is what?
> >
> > That an idle system unnecessarily produces WAL? Waking up disks and
> > everything?
> >
> 
> The OP discussed a problem with archive_timeout > 0. Are you saying we
> should put in a fix that applies whatever the setting of archive_timeout?

Yes. We should strive to fix the full scope of an issue; not just the
part complained about.


You seem to ignore valid points made against your approach, apparently
because you dismiss them as "emotive language". Stop.


Greetings,

Andres Freund




Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Robert Haas
On Wed, Apr 6, 2016 at 8:02 AM, Simon Riggs  wrote:
>> > We can, if you wish, revert this patch. If we do, we will have nothing,
>> > since I object to the other patch(es).
>>
>> I don't think you have an absolute veto over other patches
>
> Huh? My understanding is I have the same powers as other committers, no more
> but also, no less. If you've seen me claim otherwise, please point where
> that happened.

Uh, that would be in the portion that is still quoted.  "If we do, we
will have nothing, since I object to the other patches."

> Me saying "I object" seems to attract more attention than others for some
> reason. Why is it a discussion point that I object to a patch, whereas if
> you do it, that's fine?

You have every right to object to the patch.  You don't have a right,
nor do I, to say that it won't be committed without your agreement.

> All very strange. People commit changes they didn't post all the time,
> especially on minor bugs such as this.

No, they really don't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 10:09, Andres Freund  wrote:

> On 2016-04-06 10:04:42 +0100, Simon Riggs wrote:
> > On 6 April 2016 at 09:45, Andres Freund  wrote:
> >
> > > On 2016-04-06 09:18:54 +0100, Simon Riggs wrote:
> > > > Rather than take that option, I went to the trouble of writing a
> patch
> > > that
> > > > does the same thing but simpler, less invasive and more maintainable.
> > > > Primarily, I did that for you, to avoid you having wasted your time
> and
> > > to
> > > > allow you to backpatch a solution.
> > >
> > > But it doesn't. It doesn't solve the longstanding problem of
> checkpoints
> > > needlessly being repeated due to standby snapshots.
>
> >  I can't see why you say this. I am willing to listen, but this
> > appears to be wrong.
>
> The issue there is that we continue to issue checkpoints if the only
> activity since the last checkpoint was emitting a standby
> snapshot. That's because:
>

I agree this is the current situation in 9.4 and 9.5, hence the bug report.

This no longer occurs with the patches I have proposed. The snapshot is
skipped, so a further checkpoint never triggers.


> The proposed patch allows to fix that in a more principled manner,
> because we can simply check that no "important" records have been
> emitted since the last checkpoint, and skip if that's the case.


I understand the proposal you have made. The patch to implement it is what
I object to; my comments made last Sunday.


> > What issue is that? Previously you said it must not skip it at all for
> > logical.
>
> It's fine to skip the records iff nothing important has happened since
> the last time a snapshot has been logged. Again, the proposed approach
> allowed to detect that.


Agreed, both proposals do that.


> > > We now log more WAL with
> > > XLogArchiveTimeout > 0 than without.
>
> > And the problem with that is what?
>
> That an idle system unnecessarily produces WAL? Waking up disks and
> everything?
>

The OP discussed a problem with archive_timeout > 0. Are you saying we
should put in a fix that applies whatever the setting of archive_timeout?

Are we concerned that a master sends a small amount of data every few
minutes to a standby it is connected to? I hadn't read that in the thread.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [CommitFest App] Feature request -- review e-mail additions

2016-04-06 Thread Magnus Hagander
On Wed, Mar 30, 2016 at 7:47 PM, Alvaro Herrera 
wrote:

> José Luis Tallón wrote:
>
> > Just wanted to suggest two minor mods to the review e-mails
> > auto-generated by the app:
> >
> > * Prepend a [review] tag to the e-mail subject
> > ... so that e-mails sent to -hackers will read  " [HACKERS] [review]
> > "
>
> Changing the subject of an email causes Gmail to break the threads, so
> anything in that line should be discouraged.  -1 from me.  I would be
> happier if the subject of the submission email is kept intact, i.e. not
> use the patch title that was used in commitfest app but use the one in
> the email.  These often differ.
>

That's what the code tries to do. It will add a Re: if there isn't one
already, but otherwise should reuse it.  At least that's how it's trying to
work - do you have a pointer to an example where it doesn't, so I can
investigate if something is off?



> > * Auto-CC the patch author on this e-mail
> > I guess this should speed up reactions / make communication a bit
> more
> > fluid.
>
> Yes, strong +1 on this.
>

Ok, that should be easy enough to add. I've stuck it on my TODO list and
will try to get to it ASAP.



> 3) Have the auto-generated emails insert In-Reply-To headers (and
> perhaps References).
>

It already does. Both. Again, do you have a pointer to where it doesn't?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 12:24, Robert Haas  wrote:

> On Wed, Apr 6, 2016 at 4:18 AM, Simon Riggs  wrote:
> >> FWIW, I vote also for reverting this patch. This has been committed
> >> without any real discussions..
> >
> > Michael, its a shame to hear you say that, so let me give full context.
> >
> > The patches under review in the CF are too invasive and not worth the
> > trouble for such a minor problem. After full review, I would simply
> reject
> > those patches (already discussed on list).
> >
> > Rather than take that option, I went to the trouble of writing a patch
> that
> > does the same thing but simpler, less invasive and more maintainable.
> > Primarily, I did that for you, to avoid you having wasted your time and
> to
> > allow you to backpatch a solution.
> >
> > We can, if you wish, revert this patch. If we do, we will have nothing,
> > since I object to the other patch(es).
>
> I don't think you have an absolute veto over other patches


Huh? My understanding is I have the same powers as other committers, no
more but also, no less. If you've seen me claim otherwise, please point
where that happened.

Me saying "I object" seems to attract more attention than others for some
reason. Why is it a discussion point that I object to a patch, whereas if
you do it, that's fine?

> , though you
> certainly have the right to object, and you certainly don't have to
> commit them yourself.  But even more than that, the fact that you
> don't like those other patches does not mean that you can commit
> something without discussion.  Even if every argument you are making
> here is correct, which I'm not sure about, other people obviously
> don't think so.  That stuff should be worked out, as far as possible,
> in pre-commit review, which is only possible when you post the patch
> before committing it.  I think it is fine to commit things
> occasionally without posting them ahead of time if they are obviously
> uncontroversial, but that isn't the case here.
>

All very strange. People commit changes they didn't post all the time,
especially on minor bugs such as this.

Obviously if I knew that it would be controversial I would not have done
it. We are discussing it now, so I don't see any problem.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Amit Kapila
On Wed, Apr 6, 2016 at 11:17 AM, Fujii Masao  wrote:
>
> On Tue, Apr 5, 2016 at 11:40 PM, Amit Kapila 
wrote:
> >>
> >> > 2.
> >> > pg_stat_get_wal_senders()
> >> > {
> >> > ..
> >> > /*
> >> > ! * Allocate and update the config data of synchronous replication,
> >> > ! * and then get the currently active synchronous standbys.
> >> >   */
> >> > + SyncRepUpdateConfig();
> >> >   LWLockAcquire(SyncRepLock, LW_SHARED);
> >> > ! sync_standbys = SyncRepGetSyncStandbys();
> >> >   LWLockRelease(SyncRepLock);
> >> > ..
> >> > }
> >> >
> >> > Why is it important to update the config with the patch?  Earlier also
> >> > any update to the config between calls wouldn't have been visible.
> >>
> >> Because a backend has no chance to call SyncRepUpdateConfig() and
> >> parse the latest value of s_s_names if SyncRepUpdateConfig() is not
> >> called here. This means that pg_stat_replication may return the
> >> information
> >> based on the old value of s_s_names.
> >>
> >
> > That's right, but without this patch also, won't pg_stat_replication
> > show old information? If no, why so?
>
> Without the patch, when s_s_names is changed and SIGHUP is sent,
> a backend calls ProcessConfigFile(), parses the configuration file and
> sets the global variable SyncRepStandbyNames to the latest value of
> s_s_names. When pg_stat_replication is accessed, a backend calculates
> which standby is synchronous based on that latest value in
> SyncRepStandbyNames, and then displays the information of sync replication.
>
> With the patch, basically the same steps are executed when s_s_names is
> changed. But the difference is that, with the patch, SyncRepUpdateConfig()
> must be called after ProcessConfigFile() and before the calculation of
> sync standbys. So I just added the call of SyncRepUpdateConfig() to
> pg_stat_get_wal_senders().
>

Then why call it just in pg_stat_get_wal_senders()? Isn't it better if
we always call it after ProcessConfigFile() (after
setting SyncRepStandbyNames)?

> BTW, we can move SyncRepUpdateConfig() just after ProcessConfigFile()
> from pg_stat_get_wal_senders() and every backend always parses the value
> of s_s_names when the setting is changed.
>

That sounds appropriate, but I'm not sure what the exact place to call it is.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Craig Ringer
On 6 April 2016 at 17:43, Simon Riggs  wrote:

> On 25 January 2016 at 14:25, Craig Ringer  wrote:
>
>
>> I'd like to get failover slots in place for 9.6 since the're fairly
>> self-contained and meet an immediate need: allowing replication using slots
>> (physical or logical) to follow a failover event.
>>
>
> I'm a bit confused about this now.
>
> We seem to have timeline following, yet no failover slot. How do we now
> follow a failover event?
>

>

> There are many and varied users of logical decoding now and a fix is
> critically important for 9.6.
>

I agree with you, but I haven't been able to convince enough people of that.


>  Do all decoding plugins need to write their own support code??
>

We'll be able to write a bgworker based extension that handles it by
running in the standby. So no, I don't think so.


> Please explain how we cope without this, so if a problem remains we can
> fix by the freeze.
>

The TL;DR: Create a slot on the master to hold catalog_xmin where the
replica needs it. Advance it using client or bgworker on replica based on
the catalog_xmin of the oldest slot on the replica. Copy slot state from
the master using an extension that keeps the slots on the replica
reasonably up to date.

All of this is an ugly workaround for not having true slot failover support.
I'm not going to pretend it's nice, or anything that should go anywhere
near core. Petr outlined the approach we want to take for core in 9.7 on
the logical timeline following thread.


Details:

Logical decoding on a slot can follow timeline switches now - or rather,
the xlogreader knows how to follow timeline switches, and the read page
callback used by logical decoding uses that functionality now.

This doesn't help by itself because slots aren't synced to replicas, so
they're lost on failover promotion.

Nor can a client just create a backup slot for its self on the replica to
be ready for failover:

- it has no way to create a new slot at a consistent point on the replica
since logical decoding isn't supported on replicas yet;
- it can't advance a logical slot on the replica once created since
decoding isn't permitted on a replica, so it can't just decode from the
replica in lockstep with the master;
- it has no way to stop the master from removing catalog tuples still
needed by the slot's catalog_xmin since catalog_xmin isn't propagated from
standby to master.

So we have to help the client out. To do so, we have a
function/worker/whatever on the replica that grabs the slot state from the
master and copies it to the replica, and we have to hold the master's
catalog_xmin down to the catalog_xmin required by the slots on the replica.

Holding the catalog_xmin down is the easier bit. We create a dummy logical
slot on the master, maintained by a function/bgworker/whatever on the
replica. It gets advanced so that its restart_lsn and catalog_xmin are
those of the oldest slot on the replica. We can do that by requesting
replay on it up to the confirmed_lsn of the lowest confirmed_lsn on the
replica. Ugly, but workable. Or we can abuse the infrastructure more deeply
by simply setting the catalog_xmin and restart_lsn on the slot directly,
but I'd rather not.
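
To make that concrete, a rough sketch of how the replica side might
compute the value to feed back to the master's dummy slot (names as in
slot.h; the locking is kept simple, so treat this as illustrative only):

    static TransactionId
    replica_oldest_catalog_xmin(void)
    {
        TransactionId result = InvalidTransactionId;
        int           i;

        LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
        for (i = 0; i < max_replication_slots; i++)
        {
            ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
            TransactionId xmin;

            if (!s->in_use)
                continue;
            SpinLockAcquire(&s->mutex);
            xmin = s->data.catalog_xmin;
            SpinLockRelease(&s->mutex);
            if (!TransactionIdIsValid(xmin))
                continue;
            if (!TransactionIdIsValid(result) ||
                TransactionIdPrecedes(xmin, result))
                result = xmin;
        }
        LWLockRelease(ReplicationSlotControlLock);
        return result;
    }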

Just copying slot state is pretty simple too, as at the C level you can
create a physical or logical slot with whatever state you want.

However, that lets you copy/create any number of bogus ones, many of which
will appear to work fine but will be subtly broken. Since the replica is an
identical copy of the master we know that a slot state that was valid on
the master at a given xlog insert lsn is also valid on the replica at the
same replay lsn, but we've got no reliable way to ensure that when the
master updates a slot at LSN A/B the replica also updates the slot at
replay of LSN A/B. That's what failover slots did. Without that we need to
use some external channel - but there's no way to capture knowledge of "at
exactly LSN A/B, master saved a new copy of slot X" since we can't hook
ReplicationSlotSave(). At least we *can* now inject slot state updates as
generic WAL messages though, so we can ensure they happen at exactly the
desired point in replay.
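
For instance (sketch only; this uses the 9.6 logical message facility as
the "generic WAL message" mentioned above, and slot_state is a
hypothetical serialized copy of the slot):

    /* emit the serialized slot state at a well-defined point in WAL */
    XLogRecPtr  lsn = LogLogicalMessage("failover_slot_update",
                                        (char *) &slot_state,
                                        sizeof(slot_state),
                                        false); /* non-transactional */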

As Andres explained on the timeline following thread it's not safe for the
slot on the replica to be behind the state the slot on the master was at
the same LSN. At least unless we can protect catalog_xmin via some other
mechanism so we can make sure no catalogs still needed by the slots on the
replica are vacuumed away. It's vital that the catalog_xmin of any slots on
the replica be >= the catalog_xmin the master had for the lowest
catalog_xmin of any of its slots at the same LSN.

So what I figure we'll do is poll slot shmem on the master. When we notice
that a slot has changed we'll dump it into xlog via the generic xlog
mechanism to be applied on the replica, much like failover slots. The slot
update might arrive a bit late on the replica, but that's OK because we're

Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Robert Haas
On Wed, Apr 6, 2016 at 4:18 AM, Simon Riggs  wrote:
>> FWIW, I vote also for reverting this patch. This has been committed
>> without any real discussions..
>
> Michael, it's a shame to hear you say that, so let me give full context.
>
> The patches under review in the CF are too invasive and not worth the
> trouble for such a minor problem. After full review, I would simply reject
> those patches (already discussed on list).
>
> Rather than take that option, I went to the trouble of writing a patch that
> does the same thing but simpler, less invasive and more maintainable.
> Primarily, I did that for you, to avoid you having wasted your time and to
> allow you to backpatch a solution.
>
> We can, if you wish, revert this patch. If we do, we will have nothing,
> since I object to the other patch(es).

I don't think you have an absolute veto over other patches, though you
certainly have the right to object, and you certainly don't have to
commit them yourself.  But even more than that, the fact that you
don't like those other patches does not mean that you can commit
something without discussion.  Even if every argument you are making
here is correct, which I'm not sure about, other people obviously
don't think so.  That stuff should be worked out, as far as possible,
in pre-commit review, which is only possible when you post the patch
before committing it.  I think it is fine to commit things
occasionally without posting them ahead of time if they are obviously
uncontroversial, but that isn't the case here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH v12] GSSAPI encryption support

2016-04-06 Thread Stephen Frost
Robbie,

Just an initial pass over the patch.

* Robbie Harwood (rharw...@redhat.com) wrote:
> Here's v12, both here and on my github:
> https://github.com/frozencemetery/postgres/tree/feature/gssencrypt12

I've started taking a look at this as it's a capability I've wanted us
to support for a *long* time.

> Subject: [PATCH 1/3] Move common GSSAPI code into its own files

Didn't look too closely at this as it's mostly just moving stuff around.
I'll review it more closely once the other items are addressed though.

> Subject: [PATCH 2/3] Connection encryption support for GSSAPI

> diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
> index 73d493e..94d95bd 100644
> --- a/src/backend/libpq/auth.c
> +++ b/src/backend/libpq/auth.c
> @@ -596,10 +596,12 @@ sendAuthRequest(Port *port, AuthRequest areq)
>   pq_endmessage(&buf);
>  
>   /*
> -  * Flush message so client will see it, except for AUTH_REQ_OK, which need
> -  * not be sent until we are ready for queries.
> +  * In most cases, we do not need to send AUTH_REQ_OK until we are ready
> +  * for queries.  However, if we are doing GSSAPI encryption, that request
> +  * must go out immediately to ensure that all messages which follow the
> +  * AUTH_REQ_OK are not grouped with it and can therefore be encrypted.
>*/
> - if (areq != AUTH_REQ_OK)
> + if (areq != AUTH_REQ_OK || port->gss != NULL)
>   pq_flush();
>  
>   CHECK_FOR_INTERRUPTS();

Do we actually need to send pq_flush *whenever* port->gss is not null?
Shouldn't this actually be port->gss->encrypt?
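
Something along these lines, perhaps (a sketch of the narrower test
being suggested, not a tested change):

    if (areq != AUTH_REQ_OK || (port->gss != NULL && port->gss->encrypt))
        pq_flush();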

> diff --git a/src/backend/libpq/be-secure-gssapi.c b/src/backend/libpq/be-secure-gssapi.c
[...]
> +/*
> + * Wrapper function indicating whether we are currently performing GSSAPI
> + * connection encryption.
> + *
> + * gss->encrypt is set when connection parameters are processed, which happens
> + * immediately after AUTH_REQ_OK is sent.
> + */
> +static bool
> +be_gssapi_should_encrypt(Port *port)
> +{
> + if (port->gss->ctx == GSS_C_NO_CONTEXT)
> + return false;
> + return port->gss->encrypt;
> +}

be_gssapi_should_encrypt returns bool, which seems entirely reasonable,
but...

> +be_gssapi_write(Port *port, void *ptr, size_t len)
> +{
> + OM_uint32 major, minor;
> + gss_buffer_desc input, output;
> + ssize_t ret;
> + int conf;
> + uint32 netlen;
> + char lenbuf[4];
> +
> + ret = be_gssapi_should_encrypt(port);

Why are we storing the result into an ssize_t?

> + if (ret == -1)
> + return -1;
> + else if (ret == 0)
> + return secure_raw_write(port, ptr, len);

And then testing the result against -1...?  Or a bare 0 for that matter?
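
A minimal sketch of how the call site could look instead, keeping the
boolean a boolean:

    bool        encrypt = be_gssapi_should_encrypt(port);

    if (!encrypt)
        return secure_raw_write(port, ptr, len);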

> + /* encrypt the message */
> + output.value = NULL;
> + output.length = 0;
> +
> + input.value = ptr;
> + input.length = len;
> +
> + conf = 0;
> + major = gss_wrap(&minor, port->gss->ctx, 1, GSS_C_QOP_DEFAULT,
> +  &input, &conf, &output);
> + if (GSS_ERROR(major))
> + {
> + pg_GSS_error(ERROR,
> +  gettext_noop("GSSAPI wrap error"),
> +  major, minor);
> + ret = -1;
> + goto cleanup;
> + }
> + else if (conf == 0)
> + {
> + ereport(FATAL, (errmsg("GSSAPI did not provide confidentiality")));
> + ret = -1;
> + goto cleanup;
> + }
> +
> + /* format for on-wire: 4 network-order bytes of length, then payload */
> + netlen = htonl(output.length);
> + memcpy(lenbuf, &netlen, 4);
> +
> + appendBinaryStringInfo(&port->gss->writebuf, lenbuf, 4);
> + appendBinaryStringInfo(&port->gss->writebuf, output.value,
> +output.length);

That strikes me as a bit of overkill, we tend to just cast the pointer
to a (char *) rather than memcpy'ing the data just to get a different
pointer out of it.
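
For instance (sketch; netlen is already in network byte order at this
point):

    appendBinaryStringInfo(&port->gss->writebuf, (char *) &netlen, 4);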

> + /* recur to send any buffered data */
> + gss_release_buffer(&minor, &output);
> + return be_gssapi_write(port, ptr, len);

This feels a bit odd to be doing, honestly.  We try to take a lot of
care to consider low-memory situations and to be careful when it comes to
potential for infinite recursion and there's already a way to ask for
this function to be called again, isn't there?
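
As a sketch of the non-recursive shape (illustrative only; writebuf is
the StringInfo used above, and a real patch would track a cursor instead
of shifting bytes with memmove):

    while (port->gss->writebuf.len > 0)
    {
        ssize_t n = secure_raw_write(port, port->gss->writebuf.data,
                                     port->gss->writebuf.len);

        if (n < 0)
            return n;           /* caller retries, e.g. on EWOULDBLOCK */
        memmove(port->gss->writebuf.data,
                port->gss->writebuf.data + n,
                port->gss->writebuf.len - n);
        port->gss->writebuf.len -= n;
    }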

> + cleanup:
> + if (output.value != NULL)
> + gss_release_buffer(&minor, &output);
> +
> + return ret;
> +}

There's no need for any of this.  This goto will never be reached as
either a pg_GSS_error(ERROR) or an ereport(FATAL) isn't going to return
control to this path.  That's one of the reasons to be careful with
memory allocation and to use appropriate memory contexts, we're going to
longjmp out of this code path and clean up the memory allocations by
free'ing the contexts that we no longer need; that is, on an ERROR
level failure.  On FATAL, we're just going to exit, so we don't have to
worry about memory cleanup in that case.  Note that 

[HACKERS] Detrimental performance impact of ringbuffers on performance

2016-04-06 Thread Andres Freund
Hi,

While benchmarking on hydra
(c.f. 
http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de),
which has quite slow IO, I was once more annoyed by how incredibly long
the vacuum at the end of a pgbench -i takes.

The issue is that, even for an entirely shared_buffers resident scale,
essentially no data is cached in shared buffers. The COPY to load data
uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means
that copy immediately writes and evicts all data. Then vacuum reads &
writes the data in small chunks; again evicting nearly all buffers. Then
the creation of the ringbuffer has to read that data *again*.

That's fairly idiotic.

While it's not easy to fix this in the general case (we introduced those
ringbuffers for a reason, after all), I think we should at least add a
special case for loads where shared_buffers isn't fully used yet.  Why
not skip using buffers from the ringbuffer if there are buffers on the
freelist? If we add buffers gathered from there to the ringlist, we
should have few cases that regress.
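
Roughly like this (a sketch against the freelist.c conventions; the
FreelistHasBuffers()/GetBufferFromFreelist() helpers are hypothetical
names, not existing functions):

    /* inside the ring-buffer allocation path of StrategyGetBuffer() */
    if (strategy != NULL && FreelistHasBuffers())
    {
        buf = GetBufferFromFreelist();
        AddBufferToRing(strategy, buf); /* keep it in the ring for reuse */
        return buf;
    }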

Additionally, maybe we ought to increase the ringbuffer sizes again one
of these days? 256kb for VACUUM is pretty damn low.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2016-04-06 Thread Andres Freund
On 2016-04-06 11:52:28 +0200, Andres Freund wrote:
> Hi,
> 
> On 2016-04-03 16:47:49 +0530, Dilip Kumar wrote:
> 
> > Summary Of the Run:
> > -
> > 1. Throughout one run if we observe TPS every 30 seconds it's stable in one
> > run.
> > 2. With Head 64 client runs vary between ~250,000 to ~450,000. you can see
> > below results.
> > 
> > run1: 434860(5min)
> > run2: 275815(5min)
> > run3: 437872(5min)
> > run4: 237033   (5min)
> > run5: 347611(10min)
> > run6: 435933   (20min)
> 
> > > [1] -
> > > http://www.postgresql.org/message-id/20160401083518.ge9...@awork2.anarazel.de
> > >
> > 
> > 
> > Non Default Parameter:
> > 
> > shared_buffer 8GB
> > Max Connections 150
> > 
> > ./pgbench -c $threads -j $threads -T 1200 -M prepared -S -P 30 postgres
> 
> Which scale did you initialize with? I'm trying to reproduce the
> workload on hydra as precisely as possible...

On hydra, even after a fair amount of tinkering, I cannot reproduce such
variability.

Pin/Unpin only has a minor effect itself; reducing the size of lwlock
on top improves performance (378829 to 409914 TPS) in
intentionally short 15s tests (warm cache) with 128 clients; as well as
to a lower degree in 120s tests (415921 to 420487).

It appears, over 5 runs, that the alignment fix shortens the rampup
phase from about 100s to about 8s; interesting.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Yet another small patch - reorderbuffer.c:1099

2016-04-06 Thread Andres Freund
On 2016-04-05 11:38:27 -0300, Alvaro Herrera wrote:
> IMO the code is wrong.

I'm a bit confused how an intentionally duplicated block makes code
wrong...

But whatever, I found it to be clearer that way, but apparently I'm alone.


> The current arrangement looks bizantine to me, for no reason.  If we
> think that one of the two branches might do something additional to the
> list deletion, surely that will be in a separate stanza with its own
> comment; and if we ever want to remove the list deletion from one of the
> two cases (something that strikes me as unlikely) then we will need to
> fix the comment, too.

You realize it's two different lists they're deleted in the different
branches?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Yet another small patch - reorderbuffer.c:1099

2016-04-06 Thread Aleksander Alekseev
> > IMO the code is wrong.  There should be a single block comment
> > saying something like "Remove the node from its containing list.
> > In the FOO case, the list corresponds to BAR and therefore we
> > delete it because BAZ.  In the QUUX case the list is PLUGH and we
> > delete in because THUD." Then a single line dlist_delete(...)
> > follows.
> >
> > The current arrangement looks bizantine to me, for no reason.  If we
> > think that one of the two branches might do something additional to
> > the list deletion, surely that will be in a separate stanza with
> > its own comment; and if we ever want to remove the list deletion
> > from one of the two cases (something that strikes me as unlikely)
> > then we will need to fix the comment, too.
> 
> +1 to everything here except for the way byzantine is spelled.
> 

+1 and yet another patch.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 52c6986..360e6fd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1159,17 +1159,17 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->base_snapshot_lsn = InvalidXLogRecPtr;
 	}
 
-	/* delete from list of known subxacts */
-	if (txn->is_known_as_subxact)
-	{
-		/* NB: nsubxacts count of parent will be too high now */
-		dlist_delete(&txn->node);
-	}
-	/* delete from LSN ordered list of toplevel TXNs */
-	else
-	{
-		dlist_delete(&txn->node);
-	}
+	/*
+	 * Remove TXN from its containing list. Note that the following line
+	 * covers two cases.
+	 *
+	 * If TXN is a subxact we delete it from the list of known subxacts.
+	 * NB: the nsubxacts count of the parent becomes too high in this case.
+	 *
+	 * Otherwise we delete TXN from the LSN-ordered list of toplevel TXNs.
+	 *
+	 */
+	dlist_delete(&txn->node);
 
 	/* now remove reference from buffer */
 	hash_search(rb->by_txn,


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2016-04-06 Thread Andres Freund
Hi,

On 2016-04-03 16:47:49 +0530, Dilip Kumar wrote:

> Summary Of the Run:
> -
> 1. Throughout one run if we observe TPS every 30 seconds it's stable in one
> run.
> 2. With Head 64 client runs vary between ~250,000 to ~450,000. you can see
> below results.
> 
> run1: 434860(5min)
> run2: 275815(5min)
> run3: 437872(5min)
> run4: 237033   (5min)
> run5: 347611(10min)
> run6: 435933   (20min)

> > [1] -
> > http://www.postgresql.org/message-id/20160401083518.ge9...@awork2.anarazel.de
> >
> 
> 
> Non Default Parameter:
> 
> shared_buffer 8GB
> Max Connections 150
> 
> ./pgbench -c $threads -j $threads -T 1200 -M prepared -S -P 30 postgres

Which scale did you initialize with? I'm trying to reproduce the
workload on hydra as precisely as possible...

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speedup twophase transactions

2016-04-06 Thread Stas Kelvich
> On Apr 2, 2016, at 3:14 AM, Michael Paquier  wrote:
> 
> On Fri, Apr 1, 2016 at 10:53 PM, Stas Kelvich  
> wrote:
>> I wrote:
>>> While testing the patch, I found a bug in the recovery conflict code
>>> path. You can do the following to reproduce it:
>>> 1) Start a master with a standby
>>> 2) prepare a transaction on master
>>> 3) Stop immediate on standby to force replay
>>> 4) commit prepare transaction on master
>>> 5) When starting the standby, it remains stuck here:
>> 
>> Hm, I wasn’t able to reproduce that. Do you mean following scenario or am I 
>> missing something?
>> 
>> (async replication)
>> 
>> $node_master->psql('postgres', "
>>begin;
>>insert into t values (1);
>>prepare transaction 'x';
>> ");
>> $node_slave->teardown_node;
>> $node_master->psql('postgres',"commit prepared 'x'");
>> $node_slave->start;
>> $node_slave->psql('postgres',"select count(*) from pg_prepared_xacts", 
>> stdout => \$psql_out);
>> is($psql_out, '0', "Commit prepared on master while slave is down.");
> 
> Actually, not exactly, the transaction prepared on master created a
> table. Sorry for the lack of precisions in my review.

Sorry for delay.

Actually I can’t reproduce that again; I tried with the following tx:

begin;
insert into t values(0);
create table t1(id int);
insert into t1 values(1);
create table t2(id int);
insert into t2 values(2);
savepoint s1;
drop table t1;
select * from t for update;
select * from t2 for share;
prepare transaction 'x';

use strict;
use warnings;
use PostgresNode;
use TestLib;
use Test::More tests => 2;

# Setup master node
my $node_master = get_new_node("master");
$node_master->init(allows_streaming => 1);
$node_master->append_conf('postgresql.conf', qq(
max_prepared_transactions = 10
));
$node_master->start;
$node_master->backup('master_backup');
$node_master->psql('postgres', "create table t(id int)");

# Setup slave node
my $node_slave = get_new_node('slave');
$node_slave->init_from_backup($node_master, 'master_backup', has_streaming => 1);
$node_slave->start;

my $psql_out = '';
my $psql_rc = '';

$node_master->psql('postgres', "
	begin;
	insert into t values(0);
	create table t1(id int);
	insert into t1 values(1);
	create table t2(id int);
	insert into t2 values(2);
	savepoint s1;
	drop table t1;
	select * from t for update;
	select * from t2 for share;
	prepare transaction 'x';
");
sleep 2; # wait for changes to arrive on slave
$node_slave->teardown_node;
$node_master->psql('postgres',"commit prepared 'x'");
$node_slave->start;
$node_slave->psql('postgres',"select count(*) from pg_prepared_xacts", stdout => \$psql_out);
is($psql_out, '0', "Commit prepared on master while slave is down.");
$node_slave->psql('postgres',"select sum(id) from t2", stdout => \$psql_out);
is($psql_out, '2', "Check that tx changes are visible.");


-- 
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Simon Riggs
On 25 January 2016 at 14:25, Craig Ringer  wrote:


> I'd like to get failover slots in place for 9.6 since they're fairly
> self-contained and meet an immediate need: allowing replication using slots
> (physical or logical) to follow a failover event.
>

I'm a bit confused about this now.

We seem to have timeline following, yet no failover slot. How do we now
follow a failover event?

There are many and varied users of logical decoding now and a fix is
critically important for 9.6.

Do all decoding plugins need to write their own support code??

Please explain how we cope without this, so if a problem remains we can fix
by the freeze.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2016-04-06 Thread Andres Freund
On 2016-04-05 12:56:46 +0530, Dilip Kumar wrote:
> On Mon, Apr 4, 2016 at 2:28 PM, Andres Freund  wrote:
> 
> > Hm, interesting. I suspect that's because of the missing backoff in my
> > experimental patch. If you apply the attached patch ontop of that
> > (requires infrastructure from pinunpin), how does performance develop?
> >
> 
> I have applied this patch also, but still results are same, I mean around
> 550,000 with 64 threads and 650,000 with 128 clients, with a lot of
> fluctuations..
> 
> *128 client
> **(head+**0001-WIP-Avoid-the-use-of-a-separate-spinlock-to-protect
> +pinunpin-cas-9+backoff)*
> 
> run1 645769
> run2 643161
> run3 *285546*
> run4 *289421*
> run5 630772
> run6 *284363*

I wonder what 
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=09adc9a8c09c9640de05c7023b27fb83c761e91c
does to all these numbers. It seems entirely possible that "this" is
mainly changing the alignment of some common datastructures...
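
If mis-alignment is the suspicion, one cheap experiment is to force a
known alignment on the hot structure (illustrative C only, not a
proposed patch):

    /* pad out to a full 64-byte cache line so two instances can never
     * share a line (avoids false sharing) */
    typedef struct
    {
        pg_atomic_uint32 state;
        char        pad[64 - sizeof(pg_atomic_uint32)];
    } PaddedLockState;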

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2016-04-06 Thread Alexander Korotkov
On Tue, Apr 5, 2016 at 5:45 PM, Andres Freund  wrote:

> On 2016-04-05 17:36:49 +0300, Alexander Korotkov wrote:
> > Could the reason be that we're increasing concurrency for LWLock state
> > atomic variable by placing queue spinlock there?
>
> Don't think so, it's the same cache-line either way.
>

Yes, it's very unlikely.

> > But I wonder why this could happen during "pgbench -S", because it doesn't
> > seem to have high traffic of exclusive LWLocks.
>
> Yea, that confuses me too. I suspect there's some mis-aligned
> datastructures somewhere. It's hard to investigate such things without
> access to hardware.
>

But it's quite easy to check if it is an alignment issue.  We can try your
patch but without removing the mutex from the LWLock struct.  If it's an
alignment issue, then TPS should become stable again.

(FWIW, I'm working on getting pinunpin committed)
>

Sounds good.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] large regression for parallel COPY

2016-04-06 Thread Andres Freund
On 2016-04-05 17:12:11 -0400, Robert Haas wrote:
> On Wed, Mar 30, 2016 at 4:10 PM, Andres Freund  wrote:
> > Indeed. On SSDs I see about a 25-35% gain, on HDDs about 5%. If I
> > increase the size of backend_flush_after to 64 (like it's for bgwriter)
> > I however do get about 15% for HDDs as well.
> 
> I tried the same test mentioned in the original post on cthulhu (EDB
> machine, CentOS 7.2, 8 sockets, 8 cores per socket, 2 threads per
> core, Xeon E7-8830 @ 2.13 GHz).  I attempted to test both the effects
> of multi_extend_v21 and the *_flush_after settings.  The machine has
> both HD and SSD, but I used HD for this test.

> master, logged tables, 4 parallel copies: 
> 1m15.411s, 1m14.248s, 1m15.040s
> master, logged tables, 1 copy:
> 0m28.336s, 0m28.040s, 0m29.576s
> multi_extend_v21, logged tables, 4 parallel copies:   
> 0m46.058s, 0m44.515s, 0m45.688s
> multi_extend_v21, logged tables, 1 copy:  
> 0m28.440s, 0m28.129s, 0m30.698s
> master, logged tables, 4 parallel copies, {backend,bgwriter}_flush_after=0:   
> 1m2.817s, 1m4.467s, 1m12.319s
> multi_extend_v21, logged tables, 4 parallel copies, 
> {backend,bgwriter}_flush_after=0: 0m41.301s, 0m41.104s, 0m41.342s
> master, logged tables, 1 copy, {backend,bgwriter}_flush_after=0:  
> 0m26.948s, 0m26.829s, 0m26.616s

Any chance you could repeat with backend_flush_after set to 64? I wonder
if the current value isn't just too small a default for HDDs due to
their increased latency.
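
For reference, the setting is in units of blocks, so assuming the
default 8kB block size this test would be (postgresql.conf sketch):

    backend_flush_after = 64    # 64 pages = 512kB of dirty data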

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 09:23, Fujii Masao  wrote:


> Okay, I pushed the patch!
> Many thanks to all involved in the development of this feature!
>

Very good.

I think the description in the commit message that we don't support "quorum
commit" is sufficient to cover my concerns about what others might expect
from this feature. Could we add similar wording to the docs?

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Michael Paquier
On Wed, Apr 6, 2016 at 5:23 PM, Fujii Masao  wrote:
> Okay, I pushed the patch!
> Many thanks to all involved in the development of this feature!

I think that I am crying... Really cool to see this milestone accomplished.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Andres Freund
On 2016-04-06 10:04:42 +0100, Simon Riggs wrote:
> On 6 April 2016 at 09:45, Andres Freund  wrote:
> 
> > On 2016-04-06 09:18:54 +0100, Simon Riggs wrote:
> > > Rather than take that option, I went to the trouble of writing a patch
> > that
> > > does the same thing but simpler, less invasive and more maintainable.
> > > Primarily, I did that for you, to avoid you having wasted your time and
> > to
> > > allow you to backpatch a solution.
> >
> > But it doesn't. It doesn't solve the longstanding problem of checkpoints
> > needlessly being repeated due to standby snapshots.

>  I can't see why you say this. I am willing to listen, but this
> appears to be wrong.

The issue there is that we continue to issue checkpoints if the only
activity since the last checkpoint was emitting a standby
snapshot. That's because:
/*
 * If this isn't a shutdown or forced checkpoint, and we have not 
inserted
 * any XLOG records since the start of the last checkpoint, skip the
 * checkpoint.  The idea here is to avoid inserting duplicate 
checkpoints
 * when the system is idle. That wastes log space, and more importantly 
it
 * exposes us to possible loss of both current and previous checkpoint
 * records if the machine crashes just as we're writing the update.
 * (Perhaps it'd make even more sense to checkpoint only when the 
previous
 * checkpoint record is in a different xlog page?)
 *
 * If the previous checkpoint crossed a WAL segment, however, we create
 * the checkpoint anyway, to have the latest checkpoint fully contained 
in
 * the new segment. This is for a little bit of extra robustness: it's
 * better if you don't need to keep two WAL segments around to recover 
the
 * checkpoint.
 */
if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
  CHECKPOINT_FORCE)) == 0)
{
if (prevPtr == ControlFile->checkPointCopy.redo &&
prevPtr / XLOG_SEG_SIZE == curInsert / XLOG_SEG_SIZE)
{
WALInsertLockRelease();
LWLockRelease(CheckpointLock);
END_CRIT_SECTION();
return;
}
}
doesn't trigger anymore.

The proposed patch allows fixing that in a more principled manner,
because we can simply check that no "important" records have been
emitted since the last checkpoint, and skip if that's the case.
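
Roughly like this (a sketch; GetLastImportantRecPtr() is an assumed
helper returning the end of the newest WAL record *not* flagged as
unimportant, with standby snapshots flagged unimportant at insert time):

    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                  CHECKPOINT_FORCE)) == 0)
    {
        if (GetLastImportantRecPtr() <= ControlFile->checkPoint)
        {
            /* nothing important since the last checkpoint: skip it */
            WALInsertLockRelease();
            LWLockRelease(CheckpointLock);
            END_CRIT_SECTION();
            return;
        }
    }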


> What issue is that? Previously you said it must not skip it at all for
> logical.

It's fine to skip the records iff nothing important has happened since
the last time a snapshot has been logged. Again, the proposed approach
allows detecting that.


> > We now log more WAL with
> > XLogArchiveTimeout > 0 than without.

> And the problem with that is what?

That an idle system unnecessarily produces WAL? Waking up disks and
everything?


> I'm not much concerned with what emotive language you choose to support
> your arguments

Err.  You're side-tracking the discussion.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 09:45, Andres Freund  wrote:

> On 2016-04-06 09:18:54 +0100, Simon Riggs wrote:
> > Rather than take that option, I went to the trouble of writing a patch
> that
> > does the same thing but simpler, less invasive and more maintainable.
> > Primarily, I did that for you, to avoid you having wasted your time and
> to
> > allow you to backpatch a solution.
>
> But it doesn't. It doesn't solve the longstanding problem of checkpoints
> needlessly being repeated due to standby snapshots.


 I can't see why you say this. I am willing to listen, but this
appears to be wrong.


> It doesn't fix the
> issue for wal_level=logical.


What issue is that? Previously you said it must not skip it at all for
logical.


> We now log more WAL with
> XLogArchiveTimeout > 0 than without.
>

And the problem with that is what?


> The other was an architectural fix, this is a selectively applied
> bandaid.
>

It was an attempt at an architectural fix, which went wrong by being too
much code and too invasive for such a minor issue.

I'm not much concerned with what emotive language you choose to support
your arguments, but I am concerned about clear, maintainable code and I
object to the previous patch.

There are other problems worthy of our attention and I will attend to those
now.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Andres Freund
On 2016-04-06 09:18:54 +0100, Simon Riggs wrote:
> Rather than take that option, I went to the trouble of writing a patch that
> does the same thing but simpler, less invasive and more maintainable.
> Primarily, I did that for you, to avoid you having wasted your time and to
> allow you to backpatch a solution.

But it doesn't. It doesn't solve the longstanding problem of checkpoints
needlessly being repeated due to standby snapshots. It doesn't fix the
issue for wal_level=logical. We now log more WAL with
XLogArchiveTimeout > 0 than without.

The other was an architectural fix, this is a selectively applied
bandaid.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Fujii Masao
On Wed, Apr 6, 2016 at 5:07 PM, Fujii Masao  wrote:
> On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier
>  wrote:
>> On Wed, Apr 6, 2016 at 3:29 PM, Fujii Masao  wrote:
>>> On Wed, Apr 6, 2016 at 2:18 PM, Kyotaro HORIGUCHI
>>>  wrote:
 At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao  
 wrote in 
 

Re: [HACKERS] [COMMITTERS] pgsql: Avoid archiving XLOG_RUNNING_XACTS on idle server

2016-04-06 Thread Simon Riggs
On 5 April 2016 at 01:18, Michael Paquier  wrote:

> On Mon, Apr 4, 2016 at 4:50 PM, Andres Freund  wrote:
> > On 2016-04-04 08:44:47 +0100, Simon Riggs wrote:
> >> That patch does exactly the same thing as the patch you prefer, just
> >> does it differently;
> >
> > No, it doesn't; as explained above.
>

I think these few changes are all we need. (attached)


> FWIW, I vote also for reverting this patch. This has been committed
> without any real discussions..
>

Michael, it's a shame to hear you say that, so let me give full context.

The patches under review in the CF are too invasive and not worth the
trouble for such a minor problem. After full review, I would simply reject
those patches (already discussed on list).

Rather than take that option, I went to the trouble of writing a patch that
does the same thing but simpler, less invasive and more maintainable.
Primarily, I did that for you, to avoid you having wasted your time and to
allow you to backpatch a solution.

We can, if you wish, revert this patch. If we do, we will have nothing,
since I object to the other patch(es).

My recommendation is that we apply the attached patch and leave it there.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


avoid_running_xacts_log.v1plus.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Kyotaro HORIGUCHI
Sorry, my code was wrong in the case where the total number of
synchronous standbys exceeds the required number and the walsender is
at priority 1.

Sorry for the noise.

At Wed, 06 Apr 2016 17:01:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160406.170151.246853881.horiguchi.kyot...@lab.ntt.co.jp>
> You must have misread the patch. am_sync is originally set in the loop
> just after that for the case.

regards,


-- 
Kyotaro Horiguchi
NTT Open Source Software Center




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Fujii Masao
On Wed, Apr 6, 2016 at 5:01 PM, Kyotaro HORIGUCHI
 wrote:
> At Wed, 6 Apr 2016 15:29:12 +0900, Fujii Masao  wrote 
> in 

Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Fujii Masao
On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier
 wrote:
> On Wed, Apr 6, 2016 at 3:29 PM, Fujii Masao  wrote:
>> On Wed, Apr 6, 2016 at 2:18 PM, Kyotaro HORIGUCHI
>>  wrote:
>>> At Tue, 5 Apr 2016 20:17:21 +0900, Fujii Masao  
>>> wrote in 
>>> 

Re: [HACKERS] Support for N synchronous standby servers - take 2

2016-04-06 Thread Michael Paquier
On Wed, Apr 6, 2016 at 4:08 PM, Michael Paquier
 wrote:
> Here are few things I have noticed:
> +   for (i = 0; i < max_wal_senders; i++)
> +   {
> +   walsnd = &WalSndCtl->walsnds[i];
> No volatile pointer to prevent code reordering?
>
>   */
>  typedef struct WalSnd
>  {
> +   int slotno; /* index of this slot in WalSnd array */
> pid_t   pid;/* this walsender's process id, or 0 */
> slotno is used nowhere.
>
> I'll grab the tests and look at them.

So I had a look at those tests and finished with the attached:
- patch 1 adds a reload routine to PostgresNode
- patch 2 the list of tests.

I took the tests from patch 21 and did many tweaks on them:
- Use of qq() instead of quotes
- Removal of hardcoded newlines
- typo fixes and sanity fixes
- etc.
Regards,
-- 
Michael


2_n_sync_tests.patch
Description: invalid/octet-stream


1_add_reload_routine.patch
Description: invalid/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >