Re: [HACKERS] pageinspect patch, for showing tuple data

2015-09-10 Thread Michael Paquier
On Wed, Sep 9, 2015 at 8:39 PM, Nikolay Shaplov
 wrote:
> In a message from September 8, 2015 11:53:24, you wrote:
>> Hence, instead of all those complications, why not simply introducing two
>> functions that take as input the tuple data and the OID of the relation,
>> returning those bytea arrays? It seems to be that this would be a more
>> handy interface, and as this is for educational purposes I guess that the
>> addition of the overhead induced by the extra function call would be
>> acceptable.
>
> When I looked at this task I considered doing the same thing. But
> unfortunately it is not possible. (Or, to be more correct, it is possible,
> but if I did it that way, it would be hard to use.)
> The thing is that to properly split tuple data into attributes, we need some
> values from the tuple header:
> t_infomask2: where postgres stores the actual number of stored attributes,
> which may differ from the one in the tuple data. That allows us to properly
> parse tuples after ALTER TABLE ADD COLUMN when it was done without the
> SET DEFAULT option.
> t_infomask: has a bit that indicates that there are some null attributes in
> the tuple.
> t_bits: has a bit mask that shows which attributes were set to null.
>
> So if we move tuple data parsing into a separate function, then we have to
> pass these values alongside the tuple data. This is not convenient at all.
> So I did it in a way you see in the patch.

Why is it not convenient at all? Yes, you have a point: we need those
fields to be able to parse the t_data properly. Still, the possibility
to show individual fields of a tuple as a bytea array, either with
toasted or detoasted values, is a concept completely different from
simply showing the page items, which is, it seems to me, all that
heap_page_items is aimed to do. Hence, as t_infomask2, t_infomask
and t_bits are already available as return fields of heap_page_items,
we should simply add a function like this:
heap_page_item_parse(Oid relid, bytea data, t_infomask2 int,
t_infomask int, t_bits int, bool force_detoast, warning_mode bool)
returns bytea[]

Note that the data corruption checks apply only to this function as
far as I understand, so I think that things could be actually split
into two independent patches:
1) Upgrade heap_page_items to add the tuple data as bytea.
2) Add the new function able to parse those fields appropriately.

As this code, as you justly mentioned, is aimed mainly at educational
purposes, to understand a page structure, we should definitely make it
as simple as possible at the code level, and it seems to me that this
comes with a separate SQL interface to control tuple data parsing as a
bytea[]. We do our users no favor by complicating the code of
pageinspect.c as this patch does in its current state, especially
if beginners want to have a look at it.

>> Actually not two functions, but just one, with an extra flag to be
>> able to enforce detoasting on the field values where this can be done.
>
> Yeah, I also thought about it, but did not come to any final conclusion.
> Should we add a fourth argument that enables detoasting, instead of adding a
> separate function?

This is definitely something you want to control with a switch.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-10 Thread dinesh kumar
Hi All,

Thanks for your inputs on this.

Here, I see a conflict between what is doable {RAISE} and what is convenient
{a SQL function}, and will follow your inputs on this.

Also, I was under the impression that all our TODO items are filtered for
real use cases. Is my impression wrong? If I wanted to work on another TODO
item, where do I need to look?

Thanks in advance.

-- 

Regards,
Dinesh
manojadinesh.blogspot.com


Re: [HACKERS] [PATCH v2] GSSAPI encryption support

2015-09-10 Thread Michael Paquier
On Thu, Sep 10, 2015 at 1:44 AM, Robbie Harwood  wrote:
> Michael Paquier  writes:
>
>> On Wed, Sep 9, 2015 at 4:12 AM, Robbie Harwood wrote:
>>> Michael Paquier writes:
>>> As promised, here's a V2 to address your issues with comments.  I
>>> haven't heard back on the issues you found in testing, so no other
>>> changes are present.
>>
>> Well, the issue is still here: login through gssapi fails with your
>> patch, not with HEAD. This patch is next on my review list by the way
>> so I'll see what I can do about it soon even if I am in the US for
>> Postgres Open next week. Still, how did you test it? I am just
>> creating by myself a KDC, setting up a valid credential with kinit,
>> and after setting up Postgres for this purpose the protocol
>> communication just fails.
>
> My KDC is setup through freeIPA; I create a service for postgres,
> acquire a keytab, set it in the config file, and fire up the server.  It
> should go without saying that this is working for me, which is why I
> asked you for more information so I could try to debug.  I wrote a post
> on this back in June when this was still in development:
> http://mivehind.net/page/view-page-slug/16/postgres-kerberos

Hm, OK. I'll give it a try with FreeIPA and your patch, with Fedora for
example. Could you as well try the configuration I have used? In any
case, it seems to me that we have a real problem with your patch: the
GSS authentication protocol is broken with your patch and *not* with HEAD
when using a custom KDC like the one I have set up manually on one of
my VMs.
-- 
Michael




Re: [HACKERS] Waits monitoring

2015-09-10 Thread Kyotaro HORIGUCHI
Hi,

> Great - let's try to deal with [1] first, then.
> 
> Does anyone wish to object to me committing that part?

I have no objection to committing this either, but some fixes would be
needed.

===
The generated lwlocknames.[ch] don't have a header comment because
generate-lwlocknames.pl writes it into the wrong place.

lmgr/Makefile also appears to have some mistakes.

 - lwlocknames.c is not generated from (or using) lwlocknames.h,
   so the entry "lwlocknames.c: lwlocknames.h" doesn't look
   appropriate.

 - maintainer-clean in lmgr/Makefile forgets to remove lwlocknames.c.

lwlock.c now includes lwlocknames.c, so such an entry would be
needed. (But it might not be needed, because it is naturally
generated along with the .h file.)

Perhaps the uncommenting in pg_config_manual.h should be left alone.
(This is not included in the diff below.)


regards,


At Tue, 8 Sep 2015 13:01:23 -0400, Robert Haas  wrote in 

> On Mon, Sep 7, 2015 at 7:33 AM, Ildus Kurbangaliev
>  wrote:
> >> > Ildus, could you please review Amit & Robert's patch?
> >> [1] -
> >> http://www.postgresql.org/message-id/caa4ek1kdec1tm5ya9gkv85vtn4qqsrxzkjru-tu70g_tl1x...@mail.gmail.com
> > About [1] and [2]. They are slightly conflicting, but if [1] is
> > going to be committed I can easily use it in [2]. And it will simplify
> > my patch, so I have no objections here. And the same time [3] can be
> > significantly simplified and improved on top of [1] and [2].
> 
> Great - let's try to deal with [1] first, then.
> 
> Does anyone wish to object to me committing that part?



--- generate-lwlocknames.pl.org	2015-09-10 15:46:02.291079423 +0900
+++ generate-lwlocknames.pl	2015-09-10 15:53:22.177903045 +0900
@@ -8,5 +8,6 @@
 
-print
-  "/* autogenerated from src/backend/storage/lmgr/lwlocknames.txt, do not edit */\n";
-print "/* there is deliberately not an #ifndef LWLOCKNAMES_H here */\n";
+my $head_comment_c =
+  "/* autogenerated from src/backend/storage/lmgr/lwlocknames.txt, do not edit */\n\n";
+my $head_comment_h = $head_comment_c .
+  "/* there is deliberately not an #ifndef LWLOCKNAMES_H here */\n\n";
 
@@ -23,2 +24,6 @@
 
+# Write the caution comment at the beginning of the files
+print H $head_comment_h;
+print C $head_comment_c;
+
 print C "static char *MainLWLockNames[] = {";

--- Makefile.org	2015-09-10 16:16:38.610660526 +0900
+++ Makefile	2015-09-10 16:39:09.718916378 +0900
@@ -26,6 +26,6 @@
 
-# see explanation in ../../parser/Makefile
-lwlocknames.c: lwlocknames.h ;
+lwlock.c: lwlocknames.c
 
-lwlocknames.h,lwlocknames.c: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt generate-lwlocknames.pl
+# see explanation in ../../parser/Makefile
+lwlocknames.h lwlocknames.c: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt generate-lwlocknames.pl
 	$(PERL) $(srcdir)/generate-lwlocknames.pl $<
@@ -39,2 +39,2 @@
 maintainer-clean: clean
-	rm -f lwlocknames.h
+	rm -f lwlocknames.[ch]

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





[HACKERS] Partitioned checkpointing

2015-09-10 Thread Takashi Horikawa
Hi All,

Recently, I found a paper titled "Segmented Fuzzy Checkpointing for
Main Memory Databases," published in 1996 at the ACM Symposium on Applied
Computing, which inspired me to implement a similar mechanism in PostgreSQL.
Since the early evaluation results obtained from a 16-core server were beyond
my expectations, I have decided to submit a patch to be open for discussion
by community members interested in this mechanism.

The attached patch is a PoC (or maybe prototype) implementation of
partitioned checkpointing on 9.5alpha2. The term 'partitioned' is used here
instead of 'segmented' because I feel 'segmented' is somewhat easy to confuse
with 'xlog segment,' etc. In contrast, the term 'partitioned' is not, as it
implies almost the same concept as 'buffer partition'; thus I think it is
suitable.

The background and my motivation: the performance dip due to checkpoints is a
major concern, and thus it is valuable to mitigate this issue. In fact, many
countermeasures have been attempted against it. As far as I know, those
countermeasures so far focus mainly on mitigating the adverse impact of the
disk writes that implement the buffer sync; the recent, highly regarded
'checkpointer continuous flushing' is a typical example. On the other hand,
I don't feel that another source of the performance dip has been seriously
addressed: the full-page-write rush, as I call it here, would be a major
issue. That is, the average size of transaction log (XLOG) records jumps up
sharply immediately after the beginning of each checkpoint, resulting in the
saturation of the WAL write path, including the disk(s) for $PGDATA/pg_xlog
and the WAL buffers.

In the following, I will describe early evaluation results and mechanism of
the partitioned checkpointing briefly.

1. Performance evaluation
1.1 Experimental setup
The configuration of the server machine was as follows.

CPU: Intel E5-2650 v2 (8 cores/chip) @ 2.60GHz x 2
Memory: 64GB
OS: Linux 2.6.32-504.12.2.el6.x86_64 (CentOS)
Storage: raid1 of 4 HDs (write back assumed using BBU) for $PGDATA/pg_xlog
 raid1 of 2 SSDs for $PGDATA (other than pg_xlog)

PostgreSQL settings
 shared_buffers = 28GB
 wal_buffers = 64MB
 checkpoint_timeout = 10min
 max_wal_size = 128GB
 min_wal_size = 8GB
 checkpoint_completion_target = 0.9

benchmark
 pgbench -M prepared -N -P 1 -T 3600

The scaling factor was 1000.  Both the number of clients (-c option) and
threads (-j option) were 120 for the sync commit case and 96 for the async
commit (synchronous_commit = off) case. These were chosen because maximum
throughputs were obtained under these conditions.

The server was connected via 1G Ethernet to a client machine on which the
pgbench client program ran. Since the client machine was not saturated in the
measurements and thus hardly affected the results, details of the client
machine are not described here.


1.2 Early results
The measurement results shown here are latency average, latency stddev, and
throughput (tps), which are the output of the pgbench program.

1.2.1 synchronous_commit = on
(a) 9.5alpha2(original)
latency average: 2.852 ms
latency stddev: 6.010 ms
tps = 42025.789717 (including connections establishing)
tps = 42026.137247 (excluding connections establishing)

(b) 9.5alpha2 with partitioned checkpointing
latency average: 2.815 ms
latency stddev: 2.317 ms
tps = 42575.301137 (including connections establishing)
tps = 42575.677907 (excluding connections establishing)

1.2.2 synchronous_commit = off
(a) 9.5alpha2(original)
latency average: 2.136 ms
latency stddev: 5.422 ms
tps = 44870.897907 (including connections establishing)
tps = 44871.215792 (excluding connections establishing)

(b) 9.5alpha2 with partitioned checkpointing
latency average: 2.085 ms
latency stddev: 1.529 ms
tps = 45974.617724 (including connections establishing)
tps = 45974.973604 (excluding connections establishing)

1.3 Summary
Partitioned checkpointing produced a great improvement (reduction) in
latency stddev and a slight improvement in latency average and tps; there was
no performance degradation. Therefore, partitioned checkpointing has a
stabilizing effect on operation. In fact, the throughput variation, obtained
with the -P 1 option, shows that the dips were mitigated in both magnitude
and frequency.

# Since I'm not sure whether it is OK to send an email to this mailing list
with files other than a patch attached, I refrain for now from attaching the
raw results (200K bytes of text per case) and result graphs in .jpg or .epsf
format illustrating the throughput variations. If it is OK, I'd be pleased
to show the results in those formats.


2. Mechanism
As the name suggests, 'partitioned checkpointing' conducts the buffer sync
not for all buffers at once but for the buffers belonging to one partition at
each invocation of the checkpointer. In the following description, the number
of partitions is expressed by N. (N is fixed to 16 in the attached patch.)

2.1 Principles of operations
In order to preserve the semantics of the traditional 

Re: [HACKERS] Partitioned checkpointing

2015-09-10 Thread Fabien COELHO




I don't feel that another source of the performance dip has been heartily
addressed; full-page-write rush, which I call here, would be a major issue.
That is, the average size of transaction log (XLOG) records jumps up sharply
immediately after the beginning of each checkpoint, resulting in the
saturation of WAL write path including disk(s) for $PGDATA/pg_xlog and WAL
buffers.


On this point, you may have a look at this item:

https://commitfest.postgresql.org/5/283/

--
Fabien.




Re: [HACKERS] Partitioned checkpointing

2015-09-10 Thread Fabien COELHO


Hello Takashi-san,

I suggest that you have a look at the following patch submitted in June:

https://commitfest.postgresql.org/6/260/

And these two threads:

http://www.postgresql.org/message-id/flat/alpine.DEB.2.10.1408251900211.11151@sto/
http://www.postgresql.org/message-id/flat/alpine.DEB.2.10.150601132.28433@sto/

The second thread includes many performance figures under different
conditions. I'm not sure how it would interact with what you are proposing,
but it also focuses on improving postgres availability, especially on
HDD systems, by changing how the checkpointer writes buffers, with sorting
and flushing.


Also, you may consider running some tests of the latest version of this
patch on your hardware, to complement Amit's tests?


--
Fabien.




[HACKERS] Another typo in comment in setrefs.c

2015-09-10 Thread Etsuro Fujita
Hi,

I'm attaching a small patch to fix another comment typo in setrefs.c:
s/TIDs/OIDs/

Best regards,
Etsuro Fujita
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index ee8710d..093d925 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1243,7 +1243,7 @@ copyVar(Var *var)
  * This is code that is common to all variants of expression-fixing.
  * We must look up operator opcode info for OpExpr and related nodes,
  * add OIDs from regclass Const nodes into root->glob->relationOids, and
- * add catalog TIDs for user-defined functions into root->glob->invalItems.
+ * add catalog OIDs for user-defined functions into root->glob->invalItems.
  * We also fill in column index lists for GROUPING() expressions.
  *
  * We assume it's okay to update opcode info in-place.  So this could possibly



Re: [HACKERS] Small patch to fix an O(N^2) problem in foreign keys

2015-09-10 Thread Jan Wieck

On 09/08/2015 08:49 AM, Kevin Grittner wrote:

Jan Wieck  wrote:


The problem is a cache introduced in commit 45ba4247 that improves


That's a bit off; 45ba424f seems to be what you mean.


Copy paste over paper.





foreign key lookups during bulk updates when the FK value does not
change. When restoring a schema dump from a database with many (say
100,000) foreign keys, this cache grows very big, and every ALTER
TABLE command causes an InvalidateConstraintCacheCallBack(), which
does a sequential hash table scan.

The patch uses a heuristic method of detecting when the hash table
should be destroyed and recreated. InvalidateConstraintCacheCallBack()
adds the current size of the hash table to a counter. When that sum
reaches 1,000,000, the hash table is flushed. This improves the schema
restore of a database with 100,000 foreign keys by a factor of 3.

According to my tests the patch does not interfere with the bulk
updates that the original feature was supposed to improve.


In the real-world customer case that caused you to look into this,
I thought 45ba424f drove schema-only restore time from 2 hours to
about 25 hours, and that this patch takes it back down to 2 hours.
Am I remembering right?  And this came about because it added
20-some hours to a pg_upgrade run?


From 2 hours to >50, but yes, this is that case.



If there are no objections, I will push this as a bug fix back to
9.3, where the performance regression was introduced.



Regards, Jan


--
Jan Wieck
Senior Software Engineer
http://slony.info




Re: [HACKERS] pgsql: Improve logging of TAP tests.

2015-09-10 Thread Stephen Frost
* Stephen Frost (sfr...@snowman.net) wrote:
> Patch attached for review.  Barring objections, I'll commit this in a
> few hours.

Done.

Thanks!

Stephen




Re: [HACKERS] Another typo in comment in setrefs.c

2015-09-10 Thread Stephen Frost
* Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote:
> I'm attaching a small patch to fix another comment typo in setrefs.c:
> s/TIDs/OIDs/

Fixed.

Thanks!

Stephen




Re: [HACKERS] Another typo in comment in setrefs.c

2015-09-10 Thread Tom Lane
Stephen Frost  writes:
> * Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote:
>> I'm attaching a small patch to fix another comment typo in setrefs.c:
>> s/TIDs/OIDs/

> Fixed.

I do not think "typo" is the right characterization.  I'm too lazy to
check for sure, but I think what was accumulated was indeed TIDs at one
time.  The proposed patch is not correct either: what we accumulate now is
syscache hash values.  Might be best to just say "add PlanInvalItems for
user-defined functions", which is the wording used in some other places,
eg line 173.

regards, tom lane




Re: [HACKERS] pageinspect patch, for showing tuple data

2015-09-10 Thread Bruce Momjian
On Thu, Sep 10, 2015 at 03:46:25PM +0900, Michael Paquier wrote:
> Why is it not convenient at all? Yes, you have a point, we need those
> fields to be able to parse the t_data properly. Still the possibility
> to show individual fields of a tuple as a bytea array either with
> toasted or detoasted values is a concept completely different from
> simply showing the page items, which is what, it seems to me,
> heap_page_items is aimed to only do. Hence, As t_infomask2, t_infomask
> and t_bits are already available as return fields of heap_page_items,
> we should simply add a function like that:
> heap_page_item_parse(Oid relid, bytea data, t_infomask2 int,
> t_infomask int, t_bits int, bool force_detoast, warning_mode bool)
> returns bytea[]

Should pageinspect create a table that contains some of the constants
used to interpret infomask?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] Another typo in comment in setrefs.c

2015-09-10 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Etsuro Fujita (fujita.ets...@lab.ntt.co.jp) wrote:
> >> I'm attaching a small patch to fix another comment typo in setrefs.c:
> >> s/TIDs/OIDs/
> 
> > Fixed.
> 
> I do not think "typo" is the right characterization.  I'm too lazy to
> check for sure, but I think what was accumulated was indeed TIDs at one
> time.  The proposed patch is not correct either: what we accumulate now is
> syscache hash values.  Might be best to just say "add PlanInvalItems for
> user-defined functions", which is the wording used in some other places,
> eg line 173.

Perhaps it was.  I had looked at what was being called (which is
record_plan_function_dependency) and noted that it was taking OIDs and
certainly not TIDs.

I agree that rewording it to refer to PlanInvalItems is better than just
saying OIDs when we're actually looking up the OID and then adding a
PlanInvalItem which includes PROCOID and the syscache hash value.

Attached is a patch with the proposed change (against master, the back
branches require slightly different patches due to nearby wording
changes).

Thanks!

Stephen
From a562831ed53e73071243b4226c02ba1f0fbb93cb Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Thu, 10 Sep 2015 10:16:57 -0400
Subject: [PATCH] Use PlanInvalItems instead of OIDs

As pointed out by Tom, we're not really adding OIDs (though that's what
is passed to record_plan_function_dependency), we're adding a
PlanInvalItem which contains PROCOID and the syscache hash value to
invalItems.  Rather than go into any of those details here, refer to
what's added as PlanInvalItems, which will hopefully minimize any
confusion.
---
 src/backend/optimizer/plan/setrefs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 093d925..daeb584 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1243,7 +1243,7 @@ copyVar(Var *var)
  * This is code that is common to all variants of expression-fixing.
  * We must look up operator opcode info for OpExpr and related nodes,
  * add OIDs from regclass Const nodes into root->glob->relationOids, and
- * add catalog OIDs for user-defined functions into root->glob->invalItems.
+ * add PlanInvalItems for user-defined functions into root->glob->invalItems.
  * We also fill in column index lists for GROUPING() expressions.
  *
  * We assume it's okay to update opcode info in-place.  So this could possibly
-- 
1.9.1





Re: [HACKERS] Another typo in comment in setrefs.c

2015-09-10 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Attached is a patch with the proposed change (against master, the back
> > branches require slightly different patches due to nearby wording
> > changes).
> 
> Already done, after a bit of research into when things actually changed.

Awesome, thanks!

Stephen




Re: [HACKERS] Another typo in comment in setrefs.c

2015-09-10 Thread Tom Lane
Stephen Frost  writes:
> Attached is a patch with the proposed change (against master, the back
> branches require slightly different patches due to nearby wording
> changes).

Already done, after a bit of research into when things actually changed.

regards, tom lane




Re: [HACKERS] checkpointer continuous flushing

2015-09-10 Thread Fabien COELHO



Thanks for the hints!  Two-part v12 attached fixes these.


Here is a v13, which is just a rebase after 1aba62ec.

--
Fabien.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 9e7bcf5..2ef21fb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2457,6 +2457,28 @@ include_dir 'conf.d'
   
  
 
+ 
+  checkpoint_sort (bool)
+  
+   checkpoint_sort configuration parameter
+  
+  
+  
+   
+Whether to sort buffers before writing them out to disk on checkpoint.
+For HDD storage, this setting allows grouping together
+neighboring pages written to disk, thus improving performance by
+reducing random write activity.
+This sorting should have limited performance effects on SSD backends
+as such storage has good random write performance, but it may
+help with wear-leveling, so it may be worth keeping anyway.
+The default is on.
+This parameter can only be set in the postgresql.conf
+file or on the server command line.
+   
+  
+ 
+
  
   checkpoint_warning (integer)
   
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index e3941c9..f538698 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -546,6 +546,18 @@
   
 
   
+   When hard-disk drives (HDD) are used for terminal data storage,
+   checkpoint_sort allows pages to be sorted
+   so that neighboring pages on disk will be flushed together by
+   checkpoints, reducing the random write load and improving performance.
+   If solid-state drives (SSD) are used, sorting pages induces no benefit
+   as their random write I/O performance is good: this feature could then
+   be disabled by setting checkpoint_sort to off.
+   It is possible that sorting may help with SSD wear leveling, so it may
+   be kept on that account.
+  
+
+  
The number of WAL segment files in pg_xlog directory depends on
min_wal_size, max_wal_size and
the amount of WAL generated in previous checkpoint cycles. When old log
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 152d4ed..7291447 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7999,11 +7999,13 @@ LogCheckpointEnd(bool restartpoint)
 sync_secs,
 total_secs,
 longest_secs,
+sort_secs,
 average_secs;
 	int			write_usecs,
 sync_usecs,
 total_usecs,
 longest_usecs,
+sort_usecs,
 average_usecs;
 	uint64		average_sync_time;
 
@@ -8034,6 +8036,10 @@ LogCheckpointEnd(bool restartpoint)
 		CheckpointStats.ckpt_end_t,
 		&total_secs, &total_usecs);
 
+	TimestampDifference(CheckpointStats.ckpt_sort_t,
+		CheckpointStats.ckpt_sort_end_t,
+		&sort_secs, &sort_usecs);
+
 	/*
 	 * Timing values returned from CheckpointStats are in microseconds.
 	 * Convert to the second plus microsecond form that TimestampDifference
@@ -8052,8 +8058,8 @@ LogCheckpointEnd(bool restartpoint)
 
 	elog(LOG, "%s complete: wrote %d buffers (%.1f%%); "
 		 "%d transaction log file(s) added, %d removed, %d recycled; "
-		 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; "
-		 "sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
+		 "sort=%ld.%03d s, write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s;"
+		 " sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; "
 		 "distance=%d kB, estimate=%d kB",
 		 restartpoint ? "restartpoint" : "checkpoint",
 		 CheckpointStats.ckpt_bufs_written,
@@ -8061,6 +8067,7 @@ LogCheckpointEnd(bool restartpoint)
 		 CheckpointStats.ckpt_segs_added,
 		 CheckpointStats.ckpt_segs_removed,
 		 CheckpointStats.ckpt_segs_recycled,
+		 sort_secs, sort_usecs / 1000,
 		 write_secs, write_usecs / 1000,
 		 sync_secs, sync_usecs / 1000,
 		 total_secs, total_usecs / 1000,
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 3ae2848..3bd5eab 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -65,7 +65,8 @@ void
 InitBufferPool(void)
 {
 	bool		foundBufs,
-foundDescs;
+foundDescs,
+foundCpid;
 
 	/* Align descriptors to a cacheline boundary. */
 	BufferDescriptors = (BufferDescPadded *) CACHELINEALIGN(
@@ -77,10 +78,14 @@ InitBufferPool(void)
 		ShmemInitStruct("Buffer Blocks",
 		NBuffers * (Size) BLCKSZ, &foundBufs);
 
-	if (foundDescs || foundBufs)
+	CheckpointBufferIds = (CheckpointSortItem *)
+		ShmemInitStruct("Checkpoint BufferIds",
+		NBuffers * sizeof(CheckpointSortItem), &foundCpid);
+
+	if (foundDescs || foundBufs || foundCpid)
 	{
-		/* both should be present or neither */
-		Assert(foundDescs && foundBufs);
+		/* all should be present or neither */
+		Assert(foundDescs && foundBufs && foundCpid);
 		/* note: this path is only taken in EXEC_BACKEND case */
 	}
 	else
@@ -144,5 +149,8 @@ BufferShmemSize(void)
 	/* size of stuff controlled by freelist.c */
 	size = add_size(size, StrategyShmemSize());
 
+	/* size of checkpoint sort array 

Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-10 Thread dinesh kumar
On Thu, Sep 10, 2015 at 8:44 PM, Alvaro Herrera 
wrote:

> dinesh kumar wrote:
>
> > Also, I was under impression that, all our TODO
> >  items are filtered for the real
> use
> > cases. Is my impression wrong. If I wanted to work on another TODO item,
> > where i need to take a look.
>
> Your impression is completely, absolutely, horribly wrong.
>
>
:-)


> The TODO contains some ideas that are good but postponed, other ideas
> that are bad but we didn't know at the time they were recorded, other
> ideas that we don't know either way.  Before doing anything on an item
> from the TODO list, you should first read the linked threads (if any),
> and keep track when they end with an email saying "what an awful idea".
> If this doesn't happen, _search_ for other threads not linked on the
> TODO list that also deal with the same topic; note if they end the same
> way (if you find such threads, it's useful to add a link to them in the
> TODO item).
>
> Even if you can't find overly negative opinions about some item, discuss
> it here before doing any actual coding.
>
>
Sure.


> I wonder if we need a new page TONOTDO or something like that.
>
> --
> Álvaro Herrera    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>



-- 

Regards,
Dinesh
manojadinesh.blogspot.com


Re: [HACKERS] pageinspect patch, for showing tuple data

2015-09-10 Thread Nikolay Shaplov
In a message from September 10, 2015 15:46:25, Michael Paquier wrote:

> > So if move tuple data parsing into separate function, then we have to pass
> > these values alongside the tuple data. This is not convenient at all.
> > So I did it in a way you see in the patch.
> 
> Why is it not convenient at all? Yes, you have a point, we need those
> fields to be able to parse the t_data properly. Still the possibility
> to show individual fields of a tuple as a bytea array either with
> toasted or detoasted values is a concept completely different from
> simply showing the page items, which is what, it seems to me,
> heap_page_items is aimed to only do. Hence, As t_infomask2, t_infomask
> and t_bits are already available as return fields of heap_page_items,
> we should simply add a function like that:
> heap_page_item_parse(Oid relid, bytea data, t_infomask2 int,
> t_infomask int, t_bits int, bool force_detoast, warning_mode bool)
> returns bytea[]

Just compare two expressions:

select * from heap_page_item_attrs(get_raw_page('test', 0),'test'::regclass);

and 

select  lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, 
t_infomask2, t_infomask, t_hoff, t_bits, t_oid,  tuple_data_parse (
t_data,  t_infomask, t_infomask2, t_bits, 'test'::regclass, true, false) 
from heap_page_item_attrs(get_raw_page('test', 0));

The second variant is a total mess and almost unusable...

Though I've discussed this issue with friends, and we came to the conclusion that 
it might be good to implement tuple_data_parse and then implement an 
easy-to-use heap_page_item_attrs in pure SQL, using heap_page_items and 
tuple_data_parse.

That would keep usage simple, and make the code simpler, as you suggest.
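For illustration, such a wrapper could be just a few lines of SQL. This is a sketch only: tuple_data_parse does not exist yet, its argument order is taken from the example query above, and the output columns are trimmed down for brevity:

```sql
-- Hypothetical pure-SQL wrapper over heap_page_items plus the proposed
-- tuple_data_parse; callers keep the simple two-argument interface.
CREATE FUNCTION heap_page_item_attrs(page bytea, relid regclass,
                                     do_detoast bool DEFAULT false)
RETURNS TABLE(lp smallint, t_xmin xid, t_xmax xid, t_attrs bytea[]) AS
$$
    SELECT lp, t_xmin, t_xmax,
           tuple_data_parse(t_data, t_infomask, t_infomask2, t_bits,
                            relid, do_detoast, false)
    FROM heap_page_items(page)
$$ LANGUAGE sql;

-- Usage stays as simple as the first variant:
-- SELECT * FROM heap_page_item_attrs(get_raw_page('test', 0), 'test'::regclass);
```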

The only argument against it is that in the internal representation t_bits is 
binary, while in the SQL output it is a string of '0' and '1' characters. We would 
have to convert it back to binary. This is not difficult, just a useless 
conversion there and back again.

What do you think about this solution?


> Note that the data corruption checks apply only to this function as
> far as I understand, so I think that things could be actually split
> into two independent patches:
> 1) Upgrade heap_page_items to add the tuple data as bytea.
> 2) Add the new function able to parse those fields appropriately.
> 
> As this code, as you justly mentioned, is aimed mainly for educational
> purposes to understand a page structure, we should definitely make it
> as simple as possible at code level, and it seems to me that this
> comes with a separate SQL interface to control tuple data parsing as a
> bytea[]. We are doing no favor to our users to complicate the code of
> pageinspect.c as this patch is doing in its current state, especially
> if beginners want to have a look at it.
> 
> >> Actually not two functions, but just one, with an extra flag to be
> >> able to enforce detoasting on the field values where this can be done.
> > 
> > Yeah, I also thought about it. But did not come to any final conclusion.
> > Should we add forth argument, that enables detoasting, instead of adding
> > separate function?
> 
> This is definitely something you want to control with a switch.
OK. Let's come to a final decision on tuple_data_parse, and I will add this 
switch there and to the pure-SQL heap_page_item_attrs.


-- 
Nikolay Shaplov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-10 Thread Alvaro Herrera
dinesh kumar wrote:

> Also, I was under impression that, all our TODO
>  items are filtered for the real use
> cases. Is my impression wrong. If I wanted to work on another TODO item,
> where i need to take a look.

Your impression is completely, absolutely, horribly wrong.

The TODO contains some ideas that are good but postponed, other ideas
that are bad but we didn't know at the time they were recorded, other
ideas that we don't know either way.  Before doing anything on an item
from the TODO list, you should first read the linked threads (if any),
and keep track when they end with an email saying "what an awful idea".
If this doesn't happen, _search_ for other threads not linked on the
TODO list that also deal with the same topic; note if they end the same
way (if you find such threads, it's useful to add a link to them in the
TODO item).

Even if you can't find overly negative opinions about some item, discuss
it here before doing any actual coding.

I wonder if we need a new page TONOTDO or something like that.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-10 Thread Andres Freund
On 2015-09-09 23:45:08 -0400, Tom Lane wrote:
> Were all twenty of them exactly alike?  Were they identical to Andres'
> several dozen attempts?

Mine were pretty much alike and trivial - which is why I never even
bothered to standardize on a variant and store it somewhere.

> The problem I've got with this proposal is that by the time you get to
> a function that could satisfy every possible use-case, you do not have
> something that is easier to use than "write your own function that
> addresses just your use-case".

That's a valid concern. I think the answer there is that we shouldn't
design something usable for every use-case, but rather for 90% of the
cases. Which is a tradeoff we very frequently make.

> The only complaint I've seen in this thread that seems like a valid
> deficiency is that RAISE can't deal with treating the error severity level
> as a variable.  But surely we should address that as a new RAISE feature,
> not by inventing a SQL wrapper that will need to reproduce every existing
> RAISE feature before it can think about solving anything new.

That seems like something independently useful.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-09-10 Thread Alvaro Herrera
Michael Paquier wrote:
> On Wed, Sep 9, 2015 at 7:13 PM, Alexander Korotkov wrote:

> > I was running manual tests like a noob :) When you attached your bash script,
> > I've switched to it.
> 
> [product_placement]The patch to improve the recovery test suite
> submitted in this CF would have eased the need of bash-ing test cases
> here, and we could have test cases directly included in your
> patch.[/product_placement]

The problem of bash test scripts is that they don't run on Windows,
which is why we settled on Perl-based TAP tests.  I think having tests
for various aspects of recovery is useful, because being relatively new
it's still being hacked in various ways and it's good to ensure that
things that were previously working continue to work.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] SQL function to report log message

2015-09-10 Thread Jeff Janes
On Thu, Sep 10, 2015 at 8:14 AM, Alvaro Herrera wrote:

> dinesh kumar wrote:
>
> > Also, I was under impression that, all our TODO
> >  items are filtered for the real
> use
> > cases. Is my impression wrong. If I wanted to work on another TODO item,
> > where i need to take a look.
>
> Your impression is completely, absolutely, horribly wrong.
>
> The TODO contains some ideas that are good but postponed, other ideas
> that are bad but we didn't know at the time they were recorded, other
> ideas that we don't know either way.  Before doing anything on an item
> from the TODO list, you should first read the linked threads (if any),
> and keep track when they end with an email saying "what an awful idea".
> If this doesn't happen, _search_ for other threads not linked on the
> TODO list that also deal with the same topic; note if they end the same
> way (if you find such threads, it's useful to add a link to them in the
> TODO item).
>
> Even if you can't find overly negative opinions about some item, discuss
> it here before doing any actual coding.
>
> I wonder if we need a new page TONOTDO or something like that.
>


The bottom of the TODO list already has "Features We Do Not Want".  I think
it would just be a matter of moving them down to that section.

Or maybe a new section of "Features I lost an argument over, but hope to
eventually accumulate enough allies to re-fight that battle".

I've thought in the past that maybe we should create a "stale" badge and
go through and add it to every open item.  Once someone has re-evaluated
that it is still desirable (and that the description even makes sense in
the first place), they can remove the stale badge.  Currently, the TODO
seems to be a collection of belly button lint more than anything else.

Cheers,

Jeff


Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-09-10 Thread Robert Haas
On Wed, Sep 9, 2015 at 11:49 PM, Amit Kapila  wrote:
> - errdetail("Could not rename \"%s\" to \"%s\": %m.",
> + errdetail("\"%s\" could not be renamed to \"%s\": %m.",
>
> Is there any reason to change this message?
> I think you have changed this message to make it somewhat similar with
> the new message we are planning to use in this function, but I don't see
> that as compelling reason to change this message.

The old message better follows the guidelines.  See section 51.3.7:
Avoid Passive Voice.  The old message is what's called
"telegram-style", with PostgreSQL itself as the implicit subject.  The
proposed replacement is just the regular old passive voice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Hooking at standard_join_search (Was: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual)

2015-09-10 Thread Robert Haas
On Wed, Sep 9, 2015 at 2:30 AM, Etsuro Fujita wrote:
>> But that path might have already been discarded on the basis of cost.
>> I think Tom's idea is better: let the FDW consult some state cached
>> for this purpose in the RelOptInfo.
>
> Do you have an idea of what information would be collected into the state
> and how the FDW would derive parameterizations to consider producing
> pushed-down joins with from that information?  What I'm concerned about that
> is to reduce the number of parameterizations to consider, to reduce overhead
> in costing the corresponding queries.  I'm missing something, though.

I think the thing we'd want to store in the state would be enough
information to reconstruct a valid join nest.  For example, the
reloptinfo for (A B) might note that A needs to be left-joined to B.
When we go to construct paths for (A B C), and there is no
SpecialJoinInfo that mentions C, we know that we can construct (A LJ
B) IJ C rather than (A IJ B) IJ C.  If any paths survived, we could
find a way to pull that information out of the path, but pulling it
out of the RelOptInfo should always work.

I am not sure what to do about parameterizations.  That's one of my
remaining concerns about moving the hook.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-09-10 Thread Michael Paquier
On Fri, Sep 11, 2015 at 3:41 AM, Beena Emerson  wrote:
> Please find attached the WIP patch for the proposed feature. It is built
> based on the already discussed design.
>
> Changes made:
> - add new parameter "sync_file" to provide the location of the pg_syncinfo
> file. The default is 'ConfigDir/pg_syncinfo.conf', same as for pg_hba and
> pg_ident file.

I am not sure that's really necessary. We could just hardcode its location.

> - pg_syncinfo file will hold the sync rep information in the approved JSON
> format.

OK. Have you considered as well the approach to add support for
multi-line GUC parameters? This has been mentioned a couple of time
above as well, with something like that I imagine:
param = 'value1,' \
'value2,' \
'value3'
and this reads as 'value1,value2,value3'. This would benefit as well
for other parameters.

> - synchronous_standby_names can be set to 'pg_syncinfo.conf'  to read the
> JSON value stored in the file.

Check.

> - All the standbys mentioned in the s_s_names or the pg_syncinfo file
> currently get the priority as 1 and all others as  0 (async)
> - Various functions in syncrep.c to read the json file and store the values
> in a struct to be used in checking the quorum status of syncrep standbys
> (SyncRepGetQuorumRecPtr function).
> It does not support the current behavior for synchronous_standby_names = '*'.
> I am yet to thoroughly test the patch.

As this patch adds a whole new infrastructure, it is going to need
complex test setups with many configurations that will require
bash-ing a bunch of new things, and we are not protected from bugs in
those scripts or from manual manipulation mistakes during the tests.
What I think looks really necessary with this patch is to have
included a set of tests to prove that the patch actually does what it
should with complex scenarios and that it does it correctly. So we had
better perhaps move on with this patch first:
https://commitfest.postgresql.org/6/197/

And it would be really nice to get the tests of this patch integrated
with it as well. We are not protected from bugs in this patch either,
but if we have an infrastructure centralized this will add a level of
confidence that we are doing things the right way. Your patch offers
as well a good occasion to see if there would be some generic routines
that would be helpful in this recovery test suite.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [PATCH] pg_upgrade fails when postgres/template1 isn't in default tablespace

2015-09-10 Thread Hyongtae Lim
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:not tested

This looks good to me.
I've done some tests (pg_dumpall & restore with SQL, upgrade tests with 8.4, 9.x, 
...) and there were no problems,
so I'm marking this as ready for committer.

Regards,
Hyongtae

The new status of this patch is: Ready for Committer


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Autonomous Transaction is back

2015-09-10 Thread Noah Misch
On Wed, Sep 09, 2015 at 10:04:01AM -0400, Robert Haas wrote:
> On Sun, Sep 6, 2015 at 1:56 AM, Noah Misch  wrote:
> > What design principle(s) have you been using to decide how autonomous
> > transactions should behave?
> 
> I have to admit to a complete lack of principle.  I'm quite frightened
> of what this is going to need from the lock manager, and I'm trying to
> wriggle out of having to do things there that are going to be nastily
> hard.  My wriggling isn't going very well, though.

It's an exceptionally-challenging development project, agreed.  So much code
assumes the 1:1 relationship between backends and top-level transactions.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Partitioned checkpointing

2015-09-10 Thread Takashi Horikawa

> Feel free to do that.  200kB is well below this list's limits.  (I'm not
> sure how easy it is to open .epsf files in today's systems, but .jpg or
> other raster image formats are pretty common.)
Thanks for your comment.

Please find two result graphs (for sync and async commit cases) in .jpg format.
I think it is obvious that performance dips were mitigated in both magnitude 
and frequency by introducing the partitioned checkpointing.


(In passing) one correction for my previous email.
> Storage: raid1 of 4 HDs (write back assumed using BBU) for $PGDATA/pg_xlog
>  raid1 of 2 SSDs for $PGDATA (other than pg_xlog)
'raid1' is wrong and 'raid0' is correct.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories

> -Original Message-
> From: Alvaro Herrera [mailto:pgsql-hackers-ow...@postgresql.org]
> Sent: Friday, September 11, 2015 12:03 AM
> To: Horikawa Takashi(堀川 隆)
> Subject: Re: [HACKERS] Partitioned checkpointing
> 
> Takashi Horikawa wrote:
> 
> > # Since I'm not sure whether it is OK to send an email to this mailing
> > with attaching some files other than patch, I refrain now from
> > attaching raw results (200K bytes of text/case) and result graphs in
> > .jpg or .epsf format illustrating the throughput variations to this
> > email. If it is OK, I'm pleased to show the results in those formats.
> 
> Feel free to do that.  200kB is well below this list's limits.  (I'm not
> sure how easy it is to open .epsf files in today's systems, but .jpg or
> other raster image formats are pretty common.)
> 
> --
> Álvaro Herrera




Re: [HACKERS] 9.3.9 and pg_multixact corruption

2015-09-10 Thread Alvaro Herrera
Bernd Helmle wrote:
> A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
> instance.
> 
> The database crashed with the following log messages:
> 
> 2015-09-08 00:49:16 CEST [2912] PANIC:  could not access status of
> transaction 1068235595
> 2015-09-08 00:49:16 CEST [2912] DETAIL:  Could not open file
> "pg_multixact/members/5FC4": No such file or directory.
> 2015-09-08 00:49:16 CEST [2912] STATEMENT:  delete from StockTransfer
> where oid = $1 and tanum = $2 

I wonder if these bogus page and offset numbers are just
SlruReportIOError being confused because pg_multixact/members is so
weird (I don't think it should be the case, since this stuff is using
page numbers only, not anything related to how each page is laid out).

Anyway, can you please request pg_controldata to be run on the failed
cluster and paste it here?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3.9 and pg_multixact corruption

2015-09-10 Thread Thomas Munro
On Fri, Sep 11, 2015 at 10:45 AM, Alvaro Herrera wrote:

> Bernd Helmle wrote:
> > A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
> > instance.
> >
> > The database crashed with the following log messages:
> >
> > 2015-09-08 00:49:16 CEST [2912] PANIC:  could not access status of
> > transaction 1068235595
> > 2015-09-08 00:49:16 CEST [2912] DETAIL:  Could not open file
> > "pg_multixact/members/5FC4": No such file or directory.
> > 2015-09-08 00:49:16 CEST [2912] STATEMENT:  delete from StockTransfer
> > where oid = $1 and tanum = $2
>
> I wonder if these bogus page and offset numbers are just
> SlruReportIOError being confused because pg_multixact/members is so
> weird (I don't think it should be the case, since this stuff is using
> page numbers only, not anything related to how each page is laid out).
>

But SlruReportIOError uses the same macro to build the filename as
SlruReadPhysicalPage and other functions, namely SlruFileName which uses
sprintf with %04X (unsigned integer uppercase hex) and gives it segno
(which is always an int), so I don't think the problem is in error
reporting only.

Assuming default block size, to get 5FC4 from SlruFileName you need
segno == -41020.

We have int segno = pageno / 32 (that's SLRU_PAGES_PER_SEGMENT), so to get
segno == -41020 you need pageno between -1312640 and -1312609 (whose bit
patterns  reinterpreted as unsigned are 4293654656 and 4293654687).

In various places we have int pageno = offset / (uint32) 1636, expanded
from this macro (which calls the offset an xid):

#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId)
MULTIXACT_MEMBERS_PER_PAGE)

I don't really see how any uint32 value could produce such a pageno via
that macro.  Even if called in an environment where (xid) is accidentally
an int, the int / unsigned expression would convert it to unsigned first
(unless (xid) is a bigger type like int64_t: by the rules of int promotion
you'd get signed division in that case, hmm...).  But it's always called
with a MultiXactOffset AKA uint32 variable.

So via that route, there is no MultiXactOffset value that can't be mapped
to a segment in the range "", "14078".  Famously, it wraps after that.

Maybe the negative pageno came from somewhere else.  Where?  Inside SLRU
code we can see pageno = shared->page_number[slotno]... maybe the SLRU
slots got corrupted somehow?

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Re: [PATCH] pg_upgrade fails when postgres/template1 isn't in default tablespace

2015-09-10 Thread Bruce Momjian
On Fri, Sep 11, 2015 at 12:41:09AM +, Hyongtae Lim wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:not tested
> 
> This looks good to me.
> I've done some tests(pg_dumpall & restore with sql, upgrade test(8.4, 9.x, 
> ...)) and there was no problems.
> so I'm marking this as ready for committer.

Thanks.  I am going to commit it in the next 24 hours.  Thanks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3.9 and pg_multixact corruption

2015-09-10 Thread Alvaro Herrera
Bernd Helmle wrote:

> 2015-09-08 11:40:59 CEST [27047] DETAIL:  Could not seek in file
> "pg_multixact/members/5FC4" to offset 4294950912: Invalid argument.
> 2015-09-08 11:40:59 CEST [27047] CONTEXT:  xlog redo create mxid 1068235595
> offset 2147483648 nmembers 2: 2896635220 (upd) 2896635510 (keysh) 

I just noticed that this offset number 2147483648 is exactly 2^31.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Multi-tenancy with RLS

2015-09-10 Thread Joe Conway
On 09/01/2015 11:25 PM, Haribabu Kommi wrote:
> If any user is granted any permissions on that object then that user
> can view it's meta data of that object from the catalog tables.
> To check the permissions of the user on the object, instead of
> checking each and every available option, I just added a new
> privilege check option called "any". If user have any permissions on
> the object, the corresponding permission check function returns
> true. Patch attached for the same.
> 
> Any thoughts/comments?

Thanks for working on this! Overall I like the concept and know of use
cases where it is critical and should be supported. Some comments:

1.) "... WITH ROW LEVEL SECURITY ...": name is not really accurate. This
feature does not, for example automatically add row security to all
tables in the database. It is really a specific set of RLS policies on
system tables to accomplish one very specific goal -- restrict metadata
access to those objects on which the user has some access. So maybe
something like one of these?
  ... WITH METADATA SECURITY ...
  ... WITH CATALOG SECURITY ...
  ... WITH METADATA RESTRICT ...
  ... WITH CATALOG RESTRICT ...

2.) Related to #1, I wonder if we should even provide a set of policies,
or should we just allow RLS on system tables and let people create their
own policies to suit their own needs. Maybe that could be dangerous, but
so are a lot of things we allow superusers do.

3.) In RelationBuildRowSecurity:

8<--
+   /* */
+   if (!criticalRelcachesBuilt || !criticalSharedRelcachesBuilt ||
relation->rd_id == PolicyRelationId)
+   return;
8<--

Bypassing RelationBuildRowSecurity() for relation->rd_id ==
PolicyRelationId causes a segfault for me and prevents me from even
logging into the database. Additionally I would want row security to
apply to pg_policy as well. Maybe not the best approach, but I solved
the infinite recursion issue like this:

8<--
diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index 45326a3..49db767 100644
*** a/src/backend/commands/policy.c
--- b/src/backend/commands/policy.c
*** static char parse_policy_command(const c
*** 52,57 
--- 105,116 
  static Datum *policy_role_list_to_array(List *roles, int *num_roles);

  /*
+  * used to prevent infinite recursion when building
+  * row security for pg_policy itself
+  */
+ int rdepth = 0;
+
+ /*
   * Callback to RangeVarGetRelidExtended().
   *
   * Checks the following:
*** RelationBuildRowSecurity(Relation relati
*** 197,202 
--- 257,273 
MemoryContext oldcxt = CurrentMemoryContext;
RowSecurityDesc *volatile rsdesc = NULL;

+   /* */
+   if (!criticalRelcachesBuilt || !criticalSharedRelcachesBuilt)
+   return;
+   if (relation->rd_id == PolicyRelationId && rdepth > 0)
+   {
+   rdepth = 0;
+   return;
+   }
+   if (relation->rd_id == PolicyRelationId)
+   rdepth = 1;
+
/*
 * Create a memory context to hold everything associated with this
 * relation's row security policy.  This makes it easy to clean
up during
*** RelationBuildRowSecurity(Relation relati
*** 366,377 
--- 437,450 
/* Delete rscxt, first making sure it isn't active */
MemoryContextSwitchTo(oldcxt);
MemoryContextDelete(rscxt);
+   rdepth = 0;
PG_RE_THROW();
}
PG_END_TRY();

/* Success --- attach the policy descriptor to the relcache entry */
relation->rd_rsdesc = rsdesc;
+   rdepth = 0;
  }

  /*
8<--

4.) Currently, policies are always installed in the current database
rather than the target database. Didn't try to figure out how to fix:
8<--
postgres=# create database test with ROW LEVEL SECURITY = true;
CREATE DATABASE
postgres=# select count(1) from pg_policy;
 count
---
30
(1 row)

postgres=# \c test
You are now connected to database "test" as user "postgres".
test=# select count(1) from pg_policy;
 count
---
 0
(1 row)
8<--

5.) I'm not thrilled with all the additional included headers added to
policy.c for this, but I don't offhand see a better way either.

6.) I would like to see this work in conjunction with sepgsql (and/or
other security label providers) as well. That could be accomplished
either as mentioned in #2 above (let me create custom policies to suit
my needs), or by providing security label hooks somewhere that they
affect the output of the has__privilege*() functions.

HTH,

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [HACKERS] Moving SS_finalize_plan processing to the end of planning

2015-09-10 Thread Robert Haas
On Sun, Aug 9, 2015 at 3:50 PM, Tom Lane  wrote:
> I've started to work on path-ification of the upper planner (finally),
> and since that's going to be a large patch in any case, I've been looking
> for pieces that could be bitten off and done separately.  One such piece
> is the fact that SS_finalize_plan (computation of extParams/allParams)
> currently gets run at the end of subquery_planner; as long as that's true,
> we cannot have subquery_planner returning paths rather than concrete
> plans.  The attached patch moves that processing to occur just before
> set_plan_references is run.

Tom,

Do you expect to do more work on the upper planner path-ification stuff soon?

Just checking.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3.9 and pg_multixact corruption

2015-09-10 Thread Thomas Munro
On Fri, Sep 11, 2015 at 11:51 AM, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:

> On Fri, Sep 11, 2015 at 10:45 AM, Alvaro Herrera wrote:
>
>> Bernd Helmle wrote:
>> > A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
>> > instance.
>> >
>> > The database crashed with the following log messages:
>> >
>> > 2015-09-08 00:49:16 CEST [2912] PANIC:  could not access status of
>> > transaction 1068235595
>> > 2015-09-08 00:49:16 CEST [2912] DETAIL:  Could not open file
>> > "pg_multixact/members/5FC4": No such file or directory.
>> > 2015-09-08 00:49:16 CEST [2912] STATEMENT:  delete from StockTransfer
>> > where oid = $1 and tanum = $2
>>
>> I wonder if these bogus page and offset numbers are just
>> SlruReportIOError being confused because pg_multixact/members is so
>> weird (I don't think it should be the case, since this stuff is using
>> page numbers only, not anything related to how each page is laid out).
>>
>
> But SlruReportIOError uses the same macro to build the filename as
> SlruReadPhysicalPage and other functions, namely SlruFileName which uses
> sprintf with %04X (unsigned integer uppercase hex) and gives it segno
> (which is always an int), so I don't think the problem is in error
> reporting only.
>
> Assuming default block size, to get 5FC4 from SlruFileName you need
> segno == -41020.
>

Oops, I meant to attach the proviso "Assuming default block size" to the
assumption further down that MULTIXACT_MEMBERS_PER_PAGE == 1636.


> We have int segno = pageno / 32 (that's SLRU_PAGES_PER_SEGMENT), so to get
> segno == -41020 you need pageno between -1312640 and -1312609 (whose bit
> patterns  reinterpreted as unsigned are 4293654656 and 4293654687).
>
> In various places we have int pageno = offset / (uint32) 1636, expanded
> from this macro (which calls the offset an xid):
>
> #define MXOffsetToMemberPage(xid) ((xid) / (TransactionId)
> MULTIXACT_MEMBERS_PER_PAGE)
>
> I don't really see how any uint32 value could produce such a pageno via
> that macro.  Even if called in an environment where (xid) is accidentally
> an int, the int / unsigned expression would convert it to unsigned first
> (unless (xid) is a bigger type like int64_t: by the rules of int promotion
> you'd get signed division in that case, hmm...).  But it's always called
> with a MultiXactOffset AKA uint32 variable.
>
> So via that route, there is no MultiXactOffset value that can't be mapped
> to a segment in the range "", "14078".  Famously, it wraps after that.
>
> Maybe the negative pageno came from somewhere else.  Where?  Inside SLRU
> code we can see pageno = shared->page_number[slotno]... maybe the SLRU
> slots got corrupted somehow?
>

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Partitioned checkpointing

2015-09-10 Thread Takashi Horikawa
Fabien,

Thanks for your comment.
I'll check them and try to examine what is the same and what is different.
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Fabien COELHO
> Sent: Thursday, September 10, 2015 7:33 PM
> To: Horikawa Takashi(堀川 隆)
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Partitioned checkpointing
> 
> 
> 
> > I don't feel that another source of the performance dip has been
> > heartily addressed; full-page-write rush, which I call here, would be
> a major issue.
> > That is, the average size of transaction log (XLOG) records jumps up
> > sharply immediately after the beginning of each checkpoint, resulting
> > in the saturation of WAL write path including disk(s) for
> > $PGDATA/pg_xlog and WAL buffers.
> 
> On this point, you may have a look at this item:
> 
>   https://commitfest.postgresql.org/5/283/
> 
> --
> Fabien.
> 
> 


smime.p7s
Description: S/MIME cryptographic signature


Re: [HACKERS] 9.3.9 and pg_multixact corruption

2015-09-10 Thread Alvaro Herrera
Bernd Helmle wrote:
> A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
> instance.
> 

> 2015-09-08 11:40:59 CEST [27047] FATAL:  could not access status of
> transaction 1068235595
> 2015-09-08 11:40:59 CEST [27047] DETAIL:  Could not seek in file
"pg_multixact/members/FFFF5FC4" to offset 4294950912: Invalid argument.
> 2015-09-08 11:40:59 CEST [27047] CONTEXT:  xlog redo create mxid 1068235595
> offset 2147483648 nmembers 2: 2896635220 (upd) 2896635510 (keysh) 

I think the math to compute the segment number and byte offset of the member
might be bogus here.  The file names in pg_multixact/members are supposed
to go up to 14078 (hex) in an 8kB-BLCKSZ build, and of course they go a bit
higher in builds with smaller page sizes, but nowhere near as high as
FFFF5FC4.  And the offset is way too close to 2^32 (exactly 16384 less
than 2^32, to be precise).

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-09-10 Thread Fujii Masao
On Fri, Sep 11, 2015 at 5:43 AM, Robert Haas  wrote:
> On Thu, Sep 10, 2015 at 3:52 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> On Wed, Sep 9, 2015 at 11:49 PM, Amit Kapila  
>>> wrote:
 - errdetail("Could not rename \"%s\" to \"%s\": %m.",
 + errdetail("\"%s\" could not be renamed to \"%s\": %m.",

 Is there any reason to change this message?
 I think you have changed this message to make it somewhat similar with
 the new message we are planning to use in this function, but I don't see
 that as compelling reason to change this message.
>>
>>> The old message better follows the guidelines.  See section 51.3.7:
>>> Avoid Passive Voice.  The old message is what's called
>>> "telegram-style", with PostgreSQL itself as the implicit subject.  The
>>> proposed replacement is just the regular old passive voice.
>>
>> Neither version is following the guidelines very well, in particular they
>> should be mentioning what kind of object %s is (file? table? tablespace?).
>> But to me the "could not be renamed" version seems to be closer to the
>> spirit of the "use complete sentences" rule for errdetail.  The other one
>> seems better fit for a primary error message, which is supposed to be
>> kept short.
>
> Hmm, I did miss the fact that this was an errdetail().  I agree that
> the object type should be mentioned either way.

So I added the object type, i.e., file in this case, to the errdetail
messages. Attached is the updated version of the patch.

I also changed other log messages related to tablespace_map
so that they follow the style guide.

Regards,

-- 
Fujii Masao
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 152d4ed..a092aad 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6160,13 +6160,13 @@ StartupXLOG(void)
 ereport(LOG,
 	(errmsg("ignoring \"%s\" file because no \"%s\" file exists",
 			TABLESPACE_MAP, BACKUP_LABEL_FILE),
-	 errdetail("\"%s\" was renamed to \"%s\".",
+	 errdetail("File \"%s\" was renamed to \"%s\".",
 			   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 			else
 ereport(LOG,
 		(errmsg("ignoring \"%s\" file because no \"%s\" file exists",
 TABLESPACE_MAP, BACKUP_LABEL_FILE),
-		 errdetail("Could not rename file \"%s\" to \"%s\": %m.",
+		 errdetail("File \"%s\" could not be renamed to \"%s\": %m.",
    TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 		}
 
@@ -10911,32 +10911,32 @@ CancelBackup(void)
 {
 	struct stat stat_buf;
 
-	/* if the file is not there, return */
+	/* if the backup_label file is not there, return */
	if (stat(BACKUP_LABEL_FILE, &stat_buf) < 0)
 		return;
 
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(BACKUP_LABEL_OLD);
 
-	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) == 0)
-	{
-		ereport(LOG,
-(errmsg("online backup mode canceled"),
- errdetail("\"%s\" was renamed to \"%s\".",
-		   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));
-	}
-	else
+	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD) != 0)
 	{
 		ereport(WARNING,
 (errcode_for_file_access(),
  errmsg("online backup mode was not canceled"),
- errdetail("Could not rename \"%s\" to \"%s\": %m.",
+ errdetail("File \"%s\" could not be renamed to \"%s\": %m.",
 		   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));
+		return;
 	}
 
 	/* if the tablespace_map file is not there, return */
	if (stat(TABLESPACE_MAP, &stat_buf) < 0)
+	{
+		ereport(LOG,
+(errmsg("online backup mode canceled"),
+ errdetail("File \"%s\" was renamed to \"%s\".",
+		   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD)));
 		return;
+	}
 
 	/* remove leftover file from previously canceled backup if it exists */
 	unlink(TABLESPACE_MAP_OLD);
@@ -10945,15 +10945,19 @@ CancelBackup(void)
 	{
 		ereport(LOG,
 (errmsg("online backup mode canceled"),
- errdetail("\"%s\" was renamed to \"%s\".",
-		   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
+ errdetail("Files \"%s\" and \"%s\" were renamed to "
+		   "\"%s\" and \"%s\", respectively.",
+		   BACKUP_LABEL_FILE, TABLESPACE_MAP,
+		   BACKUP_LABEL_OLD, TABLESPACE_MAP_OLD)));
 	}
 	else
 	{
 		ereport(WARNING,
 (errcode_for_file_access(),
- errmsg("online backup mode was not canceled"),
- errdetail("Could not rename \"%s\" to \"%s\": %m.",
+ errmsg("online backup mode canceled"),
+ errdetail("File \"%s\" was renamed to \"%s\", but "
+		   "file \"%s\" could not be renamed to \"%s\": %m.",
+		   BACKUP_LABEL_FILE, BACKUP_LABEL_OLD,
 		   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 	}
 }



Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-10 Thread Kouhei Kaigai
Hello,

> -Original Message-
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> Sent: Friday, September 11, 2015 2:05 PM
> To: robertmh...@gmail.com
> Cc: fujita.ets...@lab.ntt.co.jp; Kaigai Kouhei(海外 浩平);
> pgsql-hackers@postgresql.org; shigeru.han...@gmail.com
> Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
> 
> Hello,
> 
> At Thu, 10 Sep 2015 17:24:00 -0400, Robert Haas  wrote
> in 
> > On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
> >  wrote:
> > >> I'm wondering if there's another approach.  If I understand correctly,
> > >> there are two reasons why the current situation is untenable.  The
> > >> first is that ForeignRecheck always returns true, but we could instead
> > >> call an FDW-supplied callback routine there.  The callback could be
> > >> optional, so that we just return true if there is none, which is nice
> > >> for already-existing FDWs that then don't need to do anything.
> > >
> > > My question about this is, is the callback really needed?  If there are 
> > > any
> > > FDWs that want to do the work *in their own way*, instead of just doing
> > > ExecProcNode for executing a local join execution plan in case of foreign
> > > join (or just doing ExecQual for checking remote quals in case of foreign
> > > table), I'd agree with introducing the callback, but if not, I don't think
> > > that that makes much sense.
> >
> > It doesn't seem to me that it hurts much of anything to add the
> > callback there, and it does provide some flexibility.  Actually, I'm
> > not really sure why we're thinking we need a subplan here at all,
> > rather than just having a ForeignRecheck callback that can do whatever
> > it needs to do with no particular help from the core infrastructure.
> > I think you wrote some code to show how postgres_fdw would use the API
> > you are proposing, but I can't find it.  Can you point me in the right
> > direction?
> 
> I've heard that the reason for the (fs_)subplan is that it should
> be initialized using create_plan_recurse, set_plan_refs and
> finalize_plan (or others), which are static functions in the
> planner, unavailable in fdw code.
>
This was discussed when the custom-scan/join interface got merged. I
primarily designed the interface to call create_plan_recurse() from the
extension; however, we concluded that we should keep this function static
and instead tell the core which path-nodes need to be initialized.
That also reduced the interface complexity, because we could omit callbacks
that would otherwise have to be placed in setrefs.c and subselect.c.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-09-10 Thread Amit Kapila
On Fri, Sep 11, 2015 at 10:10 AM, Fujii Masao  wrote:
>
> So I added the object type, i.e., file in this case, to the errdetail
> messages. Attached is the updated version of the patch.
>
> I also changed other log messages related to tablespace_map
> so that they follow the style guide.
>

Looks good to me.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-10 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 10 Sep 2015 17:24:00 -0400, Robert Haas  wrote 
in 
> On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
>  wrote:
> >> I'm wondering if there's another approach.  If I understand correctly,
> >> there are two reasons why the current situation is untenable.  The
> >> first is that ForeignRecheck always returns true, but we could instead
> >> call an FDW-supplied callback routine there.  The callback could be
> >> optional, so that we just return true if there is none, which is nice
> >> for already-existing FDWs that then don't need to do anything.
> >
> > My question about this is, is the callback really needed?  If there are any
> > FDWs that want to do the work *in their own way*, instead of just doing
> > ExecProcNode for executing a local join execution plan in case of foreign
> > join (or just doing ExecQual for checking remote quals in case of foreign
> > table), I'd agree with introducing the callback, but if not, I don't think
> > that that makes much sense.
> 
> It doesn't seem to me that it hurts much of anything to add the
> callback there, and it does provide some flexibility.  Actually, I'm
> not really sure why we're thinking we need a subplan here at all,
> rather than just having a ForeignRecheck callback that can do whatever
> it needs to do with no particular help from the core infrastructure.
> I think you wrote some code to show how postgres_fdw would use the API
> you are proposing, but I can't find it.  Can you point me in the right
> direction?

I've heard that the reason for the (fs_)subplan is that it should
be initialized using create_plan_recurse, set_plan_refs and
finalize_plan (or others), which are static functions in the
planner, unavailable in fdw code.

Is this pointless?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-10 Thread Kyotaro HORIGUCHI
Sorry, that was quite wrong. Please let me fix it.

- Is this pointless?
+ Does it make sense?

=
Hello,

At Thu, 10 Sep 2015 17:24:00 -0400, Robert Haas  wrote 
in 
> On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
>  wrote:
> >> I'm wondering if there's another approach.  If I understand correctly,
> >> there are two reasons why the current situation is untenable.  The
> >> first is that ForeignRecheck always returns true, but we could instead
> >> call an FDW-supplied callback routine there.  The callback could be
> >> optional, so that we just return true if there is none, which is nice
> >> for already-existing FDWs that then don't need to do anything.
> >
> > My question about this is, is the callback really needed?  If there are any
> > FDWs that want to do the work *in their own way*, instead of just doing
> > ExecProcNode for executing a local join execution plan in case of foreign
> > join (or just doing ExecQual for checking remote quals in case of foreign
> > table), I'd agree with introducing the callback, but if not, I don't think
> > that that makes much sense.
> 
> It doesn't seem to me that it hurts much of anything to add the
> callback there, and it does provide some flexibility.  Actually, I'm
> not really sure why we're thinking we need a subplan here at all,
> rather than just having a ForeignRecheck callback that can do whatever
> it needs to do with no particular help from the core infrastructure.
> I think you wrote some code to show how postgres_fdw would use the API
> you are proposing, but I can't find it.  Can you point me in the right
> direction?

I've heard that the reason for the (fs_)subplan is that it should
be initialized using create_plan_recurse, set_plan_refs and
finalyze_plan (or others), which are static functions in the
planner, unavailable in fdw code.

Does it make sense?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center







[HACKERS] Parser emits mysterious error message for very long tokens

2015-09-10 Thread Kyotaro HORIGUCHI
Hello, this is a problem in an extreme situation.

When the parser encounters a very long token, it returns an error
message which is mysterious to users.

> $ perl -e "print \"select '\" . \"x\"x(512*1024*1024) . \"'\"" | psql postgres
> ERROR:  invalid memory alloc request size 1073741824

Since the parser grows its buffer by repeated repallocs, doubling in size
starting from 1024 bytes, it practically fails for tokens longer than
512MiB. Most tokens don't come near the limit, but text literals may.

If we don't consider this a problem, the following discussion
is moot.


1. Supplying a context message or something similar would be needed anyway.

  > ERROR:  invalid memory alloc request size 1073741824
  > DETAIL: possibly encountered a too-long token while parsing the query


2. Edit the documentation, or modify the behavior?

 http://www.postgresql.org/docs/devel/static/datatype-character.html 

 >  In any case, the longest possible character string that can be
 >  stored is about 1 GB.

 The maximum possible text length is about 1GB, but it is far
 shorter when parsing a text literal.

 I think it is better to avoid the 512MiB limit than to add a
 description of it to the documentation.  It could easily be done
 using MaxAllocSize (I don't like this, but enlargeStringInfo
 already does it), or starting from (1024 - 1) and repallocing with
 ((last_size + 1) * 2 - 1) would do, assuming MaxAllocSize is of the
 form 2^n - 1 (this looks unwantedly complex), or simply starting
 from 1023 works well enough.

 On the other hand, all of these fixes caused an "out of memory"
 error in the receive buffer in my environment, so such a fix
 might be useless.

Thoughts? Suggestions? Or is this totally useless?


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] Another typo in comment in setrefs.c

2015-09-10 Thread Etsuro Fujita

On 2015/09/10 23:31, Stephen Frost wrote:

* Tom Lane (t...@sss.pgh.pa.us) wrote:

Stephen Frost  writes:

Attached is a patch with the proposed change (against master, the back
branches require slightly different patches due to nearby wording
changes).


Already done, after a bit of research into when things actually changed.


Awesome, thanks!


Thank you, Stephen and Tom.

Best regards,
Etsuro Fujita





Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-10 Thread Etsuro Fujita

On 2015/09/11 6:24, Robert Haas wrote:

On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
 wrote:

I'm wondering if there's another approach.  If I understand correctly,
there are two reasons why the current situation is untenable.  The
first is that ForeignRecheck always returns true, but we could instead
call an FDW-supplied callback routine there.  The callback could be
optional, so that we just return true if there is none, which is nice
for already-existing FDWs that then don't need to do anything.


My question about this is, is the callback really needed?  If there are any
FDWs that want to do the work *in their own way*, instead of just doing
ExecProcNode for executing a local join execution plan in case of foreign
join (or just doing ExecQual for checking remote quals in case of foreign
table), I'd agree with introducing the callback, but if not, I don't think
that that makes much sense.


It doesn't seem to me that it hurts much of anything to add the
callback there, and it does provide some flexibility.  Actually, I'm
not really sure why we're thinking we need a subplan here at all,
rather than just having a ForeignRecheck callback that can do whatever
it needs to do with no particular help from the core infrastructure.
I think you wrote some code to show how postgres_fdw would use the API
you are proposing, but I can't find it.  Can you point me in the right
direction?


I've proposed the following API changes:

* I modified create_foreignscan_path, which is called from 
postgresGetForeignJoinPaths/postgresGetForeignPaths, so that a path, 
subpath, is passed as the eighth argument of the function. subpath 
represents a local join execution path if scanrelid==0, but NULL if 
scanrelid>0.


* I modified make_foreignscan, which is called from 
postgresGetForeignPlan, so that a list of quals, fdw_quals, is passed as 
the last argument of the function.  fdw_quals represents remote quals if 
scanrelid>0, but NIL if scanrelid==0.


You can find that code in the postgres_fdw patch 
(foreign_join_v16_efujita.patch) attached to [1].


Best regards,
Etsuro Fujita

[1] http://www.postgresql.org/message-id/55cb2d45.7040...@lab.ntt.co.jp





Re: [HACKERS] more RLS oversights

2015-09-10 Thread Stephen Frost
Noah,

First off, thanks again for your review and comments on RLS.  Hopefully
this addresses the last remaining open item from that review.

* Noah Misch (n...@leadboat.com) wrote:
> On Wed, Feb 25, 2015 at 11:37:24PM -0500, Stephen Frost wrote:
> > +  
> > +   Referential integrity checks, such as unique or primary key constraints
> > +   and foreign key references, will bypass row security to ensure that
> > +   data integrity is maintained.  Care must be taken when developing
> > +   schemas and row level policies to avoid a "covert channel" leak of
> > +   information through these referential integrity checks.
> ...
> > +   /*
> > +* Row-level security should be disabled in the case where a foreign-key
> > +* relation is queried to check existence of tuples that references the
> > +* primary-key being modified.
> > +*/
> > +   temp_sec_context = save_sec_context | SECURITY_LOCAL_USERID_CHANGE;
> > +   if (qkey->constr_queryno == RI_PLAN_CHECK_LOOKUPPK
> > +   || qkey->constr_queryno == RI_PLAN_CHECK_LOOKUPPK_FROM_PK
> > +   || qkey->constr_queryno == RI_PLAN_RESTRICT_DEL_CHECKREF
> > +   || qkey->constr_queryno == RI_PLAN_RESTRICT_UPD_CHECKREF)
> > +   temp_sec_context |= SECURITY_ROW_LEVEL_DISABLED;
> 
> (5) This code does not use SECURITY_ROW_LEVEL_DISABLED for foreign keys with
> CASCADE, SET NULL or SET DEFAULT actions.  The associated documentation says
> nothing about this presumably-important distinction.

I believe the original intent was to only include the queries which were
against the primary table, but reviewing what ends up happening, that
clearly doesn't actually make sense.  As you note below, this is only
to address the 'row_security = force' mode, which I suspect is why it
wasn't picked up in earlier testing.

> Is SECURITY_ROW_LEVEL_DISABLED mode even needed?  We switch users to the owner
> of the FROM-clause table before running an RI query.  That means use of this
> mode can only matter under row_security=force, right?  I would rest easier if
> this mode went away, because it is a security landmine.  If user code managed
> to run in this mode, it would bypass every policy in the database.  (I find no
> such vulnerability today, because we use the mode only for parse analysis of
> ri_triggers.c queries.)

You're right, the reason for it to exist was to address the
'row_security = force' case.  I spent a bit of time considering that and
have come up with the attached for consideration (which needs additional
work before being committed, but should get the idea across clearly).

This removes the notion of SECURITY_ROW_LEVEL_DISABLED entirely and
instead ignores 'row_security = force' if InLocalUserIdChange() is true.
Further, this changes row_security to have GUC_NOT_WHILE_SEC_REST set.
I didn't set GUC_NO_RESET_ALL for it though as the default is for
row_security to be 'on'.

The biggest drawback that I can see to this is that users won't be able
to set the row_security GUC inside of a security definer function.  I
can imagine use-cases where someone might want to change it in a
security definer function but it doesn't seem like they're valuable
enough to counter your argument (which I agree with) that
SECURITY_ROW_LEVEL_DISABLED is a security landmine.

Another approach which I considered was to add a new 'RLS_IGNORE_FORCE'
flag, which would, again, ignore 'row_security=force' when set (meaning
owners would bypass RLS regardless of the actual row_security setting).
I didn't like adding that to sec_context though and it wasn't clear
where a good place would be.  Further, it seems like it'd be nice to
have a generic flag that says "we're running an internal referential
integrity operation, don't get in the way", though that could also be a
risky flag to have.  Such an approach would allow us to differentiate RI
queries from operations run under a security definer function though.

Thoughts?

Thanks!

Stephen
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
new file mode 100644
index 61edde9..647ea77
*** a/src/backend/utils/adt/ri_triggers.c
--- b/src/backend/utils/adt/ri_triggers.c
*** ri_PlanCheck(const char *querystr, int n
*** 2984,3001 
  	/* Switch to proper UID to perform check as */
  	GetUserIdAndSecContext(&save_userid, &save_sec_context);
  
- 	/*
- 	 * Row-level security should be disabled in the case where a foreign-key
- 	 * relation is queried to check existence of tuples that references the
- 	 * primary-key being modified.
- 	 */
  	temp_sec_context = save_sec_context | SECURITY_LOCAL_USERID_CHANGE;
- 	if (qkey->constr_queryno == RI_PLAN_CHECK_LOOKUPPK
- 		|| qkey->constr_queryno == RI_PLAN_CHECK_LOOKUPPK_FROM_PK
- 		|| qkey->constr_queryno == RI_PLAN_RESTRICT_DEL_CHECKREF
- 		|| qkey->constr_queryno == RI_PLAN_RESTRICT_UPD_CHECKREF)
- 		temp_sec_context |= SECURITY_ROW_LEVEL_DISABLED;
- 
  
  	SetUserIdAndSecContext(RelationGetForm(query_rel)->relowner,
  		   

Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-09-10 Thread Tom Lane
Robert Haas  writes:
> On Wed, Sep 9, 2015 at 11:49 PM, Amit Kapila  wrote:
>> - errdetail("Could not rename \"%s\" to \"%s\": %m.",
>> + errdetail("\"%s\" could not be renamed to \"%s\": %m.",
>> 
>> Is there any reason to change this message?
>> I think you have changed this message to make it somewhat similar with
>> the new message we are planning to use in this function, but I don't see
>> that as compelling reason to change this message.

> The old message better follows the guidelines.  See section 51.3.7:
> Avoid Passive Voice.  The old message is what's called
> "telegram-style", with PostgreSQL itself as the implicit subject.  The
> proposed replacement is just the regular old passive voice.

Neither version is following the guidelines very well, in particular they
should be mentioning what kind of object %s is (file? table? tablespace?).
But to me the "could not be renamed" version seems to be closer to the
spirit of the "use complete sentences" rule for errdetail.  The other one
seems better fit for a primary error message, which is supposed to be
kept short.

regards, tom lane




Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-09-10 Thread Robert Haas
On Thu, Sep 10, 2015 at 3:52 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Wed, Sep 9, 2015 at 11:49 PM, Amit Kapila  wrote:
>>> - errdetail("Could not rename \"%s\" to \"%s\": %m.",
>>> + errdetail("\"%s\" could not be renamed to \"%s\": %m.",
>>>
>>> Is there any reason to change this message?
>>> I think you have changed this message to make it somewhat similar with
>>> the new message we are planning to use in this function, but I don't see
>>> that as compelling reason to change this message.
>
>> The old message better follows the guidelines.  See section 51.3.7:
>> Avoid Passive Voice.  The old message is what's called
>> "telegram-style", with PostgreSQL itself as the implicit subject.  The
>> proposed replacement is just the regular old passive voice.
>
> Neither version is following the guidelines very well, in particular they
> should be mentioning what kind of object %s is (file? table? tablespace?).
> But to me the "could not be renamed" version seems to be closer to the
> spirit of the "use complete sentences" rule for errdetail.  The other one
> seems better fit for a primary error message, which is supposed to be
> kept short.

Hmm, I did miss the fact that this was an errdetail().  I agree that
the object type should be mentioned either way.  Actually, it's sort
of surprising that this message is a detail message rather than a
primary message.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] RLS open items are vague and unactionable

2015-09-10 Thread Robert Haas
The open items list here:

https://wiki.postgresql.org/wiki/PostgreSQL_9.5_Open_Items

...currently has the following items related to RLS:

CREATE POLICY and RETURNING
Arguable RLS security bug, EvalPlanQual() paranoia
Dean's latest round of RLS refactoring.
more RLS oversights

Those items don't seem to have changed much in the last month, but I
think what I'm actually more concerned about is that I don't really
know what they mean or how serious they are.  Is there a way that we
can rephrase this list into a form such that, if somebody were
motivated to work on these issues, they'd be able to contribute?  And
if they were not motivated to work on them but at least wanted to
understand the situation, that would be easy?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-10 Thread Robert Haas
On Thu, Sep 3, 2015 at 6:25 AM, Etsuro Fujita
 wrote:
> I gave it another thought; the following changes to ExecInitNode would make
> the patch much simpler, i.e., we would no longer need to call the new code in
> ExecInitForeignScan, ExecForeignScan, ExecEndForeignScan, and
> ExecReScanForeignScan.  I think that would resolve the name problem also.
>
> *** a/src/backend/executor/execProcnode.c
> --- b/src/backend/executor/execProcnode.c
> ***
> *** 247,254  ExecInitNode(Plan *node, EState *estate, int eflags)
> break;
>
> case T_ForeignScan:
> !   result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
> !  estate, eflags);
> break;
>
> case T_CustomScan:
> --- 247,269 
> break;
>
> case T_ForeignScan:
> !   {
> !       Index   scanrelid = ((ForeignScan *) node)->scan.scanrelid;
> !
> !       if (estate->es_epqTuple != NULL && scanrelid == 0)
> !       {
> !           /*
> !            * We are in foreign join inside an EvalPlanQual recheck.
> !            * Initialize local join execution plan, instead.
> !            */
> !           Plan   *subplan = ((ForeignScan *) node)->fs_subplan;
> !
> !           result = ExecInitNode(subplan, estate, eflags);
> !       }
> !       else
> !           result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
> !                                                      estate, eflags);
> !   }
> break;

I don't think that's a good idea.  The Plan tree and the PlanState
tree should be mirror images of each other; breaking that equivalence
will cause confusion, at least.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-09-10 Thread Robert Haas
On Thu, Sep 3, 2015 at 1:22 AM, Etsuro Fujita
 wrote:
>> I'm wondering if there's another approach.  If I understand correctly,
>> there are two reasons why the current situation is untenable.  The
>> first is that ForeignRecheck always returns true, but we could instead
>> call an FDW-supplied callback routine there.  The callback could be
>> optional, so that we just return true if there is none, which is nice
>> for already-existing FDWs that then don't need to do anything.
>
> My question about this is, is the callback really needed?  If there are any
> FDWs that want to do the work *in their own way*, instead of just doing
> ExecProcNode for executing a local join execution plan in case of foreign
> join (or just doing ExecQual for checking remote quals in case of foreign
> table), I'd agree with introducing the callback, but if not, I don't think
> that that makes much sense.

It doesn't seem to me that it hurts much of anything to add the
callback there, and it does provide some flexibility.  Actually, I'm
not really sure why we're thinking we need a subplan here at all,
rather than just having a ForeignRecheck callback that can do whatever
it needs to do with no particular help from the core infrastructure.
I think you wrote some code to show how postgres_fdw would use the API
you are proposing, but I can't find it.  Can you point me in the right
direction?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] 9.3.9 and pg_multixact corruption

2015-09-10 Thread Bernd Helmle
A customer had a severe issue with a PostgreSQL 9.3.9/sparc64/Solaris 11
instance.

The database crashed with the following log messages:

2015-09-08 00:49:16 CEST [2912] PANIC:  could not access status of
transaction 1068235595
2015-09-08 00:49:16 CEST [2912] DETAIL:  Could not open file
"pg_multixact/members/5FC4": No such file or directory.
2015-09-08 00:49:16 CEST [2912] STATEMENT:  delete from StockTransfer
where oid = $1 and tanum = $2 

When they called us later, it turned out that the crash had happened during a
base backup, leaving a backup_label behind that prevented the database from
coming up again due to an invalid checkpoint location. However, removing the
backup_label still didn't let the database get through recovery; it failed
again with the former error, this time during recovery:

2015-09-08 11:40:04 CEST [27047] LOG:  database system was interrupted
while in recovery at 2015-09-08 11:19:44 CEST
2015-09-08 11:40:04 CEST [27047] HINT:  This probably means that some data
is corrupted and you will have to use the last backup for recovery.
2015-09-08 11:40:04 CEST [27047] LOG:  database system was not properly
shut down; automatic recovery in progress
2015-09-08 11:40:05 CEST [27047] LOG:  redo starts at 1A52/2313FEF8
2015-09-08 11:40:47 CEST [27082] FATAL:  the database system is starting up
2015-09-08 11:40:59 CEST [27047] FATAL:  could not access status of
transaction 1068235595
2015-09-08 11:40:59 CEST [27047] DETAIL:  Could not seek in file
"pg_multixact/members/5FC4" to offset 4294950912: Invalid argument.
2015-09-08 11:40:59 CEST [27047] CONTEXT:  xlog redo create mxid 1068235595
offset 2147483648 nmembers 2: 2896635220 (upd) 2896635510 (keysh) 
2015-09-08 11:40:59 CEST [27045] LOG:  startup process (PID 27047) exited
with exit code 1
2015-09-08 11:40:59 CEST [27045] LOG:  aborting startup due to startup
process failure

Some side notes:

An additional restore from a base backup followed by archive recovery yielded
the same error as soon as the affected tuple was touched with a DELETE. The
affected table was fully dumpable via pg_dump, though.

We also have a core dump, but no direct access to the machine. If more
information is required (and I believe it is), let me know where to dig
deeper. I would also like to get a backtrace from the existing core dump,
but in the absence of a sparc64 machine here we will need to ask the
customer to produce one.

-- 
Thanks

Bernd




Re: [HACKERS] statistics for array types

2015-09-10 Thread Alvaro Herrera
Jeff Janes wrote:

> The attached patch forces there to be at least one element in MCE, keeping
> the one element with the highest predicted frequency if the MCE would
> otherwise be empty.  Then any other element queried for is assumed to be no
> more common than this most common element.

Hmm, what happens if a common-but-not-an-MCE element is pruned out of
the array when a bucket is filled?  I imagine it's going to mis-estimate
the selectivity (though the effect is probably pretty benign anyway;
it's still going to be better than the stock 0.5%).

> I'd also briefly considered just having the part of the code that pulls the
> stats out of pg_stats interpret a MCE array as meaning that nothing is more
> frequent than the threshold, but that would mean that that part of the code
> needs to know about how the threshold is chosen, which just seems wrong.

I wonder if we shouldn't add a separate STATISTIC_KIND for this,
instead of trying to transfer knowledge.


Given how simple this patch is, I am tempted to apply it anyway.  It
needs a few additional comments to explain what is going on, though.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] checkpointer continuous flushing

2015-09-10 Thread Fabien COELHO


Hello Amit,


 - T1 flush=off  27480 -> 27482 :    +0.0%
 - T1 flush=on   25214 -> 26819 :    +6.3%
 - T2 flush=off   5050 ->  6194 :   +22.6%
 - T2 flush=on    2771 ->  6110 :  +120.4%


There is a clear win only in cases where sort is used together with flush;
apart from that, using sort alone doesn't have any clear advantage.


Indeed, I agree that the improvement is much smaller without flushing,
although it is still there (+0.0% & +22.6% => +11.3% on average).


Well, at least we may agree that it is "somewhat significantly better"? :-)

--
Fabien.




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-09-10 Thread Beena Emerson
Hello,

Please find attached the WIP patch for the proposed feature. It is built
based on the already discussed design.

Changes made:
- Add a new parameter "sync_file" to provide the location of the pg_syncinfo
file. The default is 'ConfigDir/pg_syncinfo.conf', as for the pg_hba and
pg_ident files.
- The pg_syncinfo file holds the sync rep information in the approved JSON
format.
- synchronous_standby_names can be set to 'pg_syncinfo.conf' to read the
JSON value stored in that file.
- All the standbys mentioned in s_s_names or the pg_syncinfo file currently
get priority 1, and all others get 0 (async).
- Various functions in syncrep.c read the JSON file and store the values in
a struct used for checking the quorum status of syncrep standbys
(the SyncRepGetQuorumRecPtr function).

It does not support the current behavior of synchronous_standby_names =
'*'. I have yet to thoroughly test the patch.

Thoughts?

--
Beena Emerson


WIP_multiple_syncrep.patch
Description: Binary data
