Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-20 Thread Michael Paquier
On Fri, Jul 21, 2017 at 12:23 AM, Alvaro Herrera
 wrote:
> Kyotaro HORIGUCHI wrote:
>> Finally, I added a new TAP test library PsqlSession. It offers
>> interactive psql sessions. Then added a simple test to
>> postgres_fdw using it.
>
> Hmm, I think this can be very useful for other things.  Let's keep this
> in mind to use in the future, even if we find another way to fix the
> issue at hand.  In fact, I had a problem a couple of weeks ago in which
> I needed two concurrent sessions and one of them disconnected in the
> middle of the test.

Agreed, I have already wanted the ability to keep a session at hand a
couple of times for tests. And I agree with having a separate
discussion for such things, outside the scope of a bug fix. Thinking
larger, I think it would be even more helpful to be able to hold
processes and run commands in parallel, say for pg_receivewal.

> Can't do that with isolationtester ...

In the pglogical fork of Postgres, you guys improved isolationtester
to handle multiple hosts, right? That sounds harder to integrate than
a Perl module though, as isolationtester starts only one server.
-- 
Michael




Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-20 Thread Kyotaro HORIGUCHI
At Thu, 20 Jul 2017 18:23:05 -0400, Alvaro Herrera  
wrote in <20170720222305.ij3pk7qw5im3wozr@alvherre.pgsql>
> Kyotaro HORIGUCHI wrote:
> 
> > Finally, I added a new TAP test library PsqlSession. It offers
> > interactive psql sessions. Then added a simple test to
> > postgres_fdw using it.
> 
> Hmm, I think this can be very useful for other things.  Let's keep this
> in mind to use in the future, even if we find another way to fix the
> issue at hand.  In fact, I had a problem a couple of weeks ago in which
> I needed two concurrent sessions and one of them disconnected in the
> middle of the test.  Can't do that with isolationtester ...

Thanks. I agree that it is still useful for writing more complex
tests. The most significant issue with this (PsqlSession.pm) comes
from the fact that I didn't find a way to detect the end of a query's
execution without attaching a bogus query, and that kind of thing
tends to be unstable in a high-load environment.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-20 Thread Thomas Munro
On Fri, Jul 21, 2017 at 4:16 PM, Neha Sharma
 wrote:
>
> Attached is the core dump file received on PG 10beta2 version.

Thanks Neha.  It'd be best to post the backtrace and, if possible,
print oldestXact and ShmemVariableCache->oldestXid from the stack
frame for TruncateCLOG.

The failing assertion in TruncateCLOG() has a comment that says
"vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
*after* it calls TruncateCLOG().  What am I missing here?

What actually prevents ShmemVariableCache->oldestXid from going
backwards anyway?  Suppose there are two or more autovacuum processes
that reach vac_truncate_clog() concurrently.  They do a scan of
pg_database whose tuples they access without locking through a
pointer-to-volatile because they expect concurrent in-place writers,
come up with a value for frozenXID, and then arrive at
SetTransactionIdLimit() in whatever order and clobber
ShmemVariableCache->oldestXid.  What am I missing here?

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-20 Thread Kyotaro HORIGUCHI
At Thu, 20 Jul 2017 18:15:42 -0400, Tom Lane  wrote in 
<18927.1500588...@sss.pgh.pa.us>
> This seems like overkill.  We can test it reasonably easily within the
> existing framework, as shown in the attached patch.  I'm also fairly

It checks for a disconnection caused within a single session. I
thought that its inter-process behavior was the important part
(since I had forgotten that in the previous version), but it is
reasonable enough if we can rely on the fact that it surely works
through the invalidation mechanism.

In short, I agree with the test in your patch.

> concerned that what you're showing here would be unstable in the buildfarm
> as a result of race conditions between the multiple sessions.

Sure. It is what I meant by 'fragile'.

> I made some cosmetic updates to the code patch, as well.

Thank you for leaving the hashvalue stuff and revising the comment.

By the way, I had mistakenly left the following code in the
previous patch.

+ /* hashvalue == 0 means a cache reset, must clear all state */
+ if (hashvalue == 0)
+   entry->invalidated = true;
+ else if ((cacheid == FOREIGNSERVEROID &&
+   entry->server_hashvalue == hashvalue) ||
+  (cacheid == USERMAPPINGOID &&
+   entry->mapping_hashvalue == hashvalue))
+   entry->invalidated = true;

The reason for the redundancy was that a switch-case was used in
the else block just before. However, that is no longer the case,
so I'd like to change it as follows.

+ /* hashvalue == 0 means a cache reset, must clear all state */
+ if ((hashvalue == 0) ||
+ ((cacheid == FOREIGNSERVEROID &&
+   entry->server_hashvalue == hashvalue) ||
+  (cacheid == USERMAPPINGOID &&
+   entry->mapping_hashvalue == hashvalue)))
+   entry->invalidated = true;

The attached patch differs only in this point.

> I think this is actually a bug fix, and should not wait for the next
> commitfest.

Agreed.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/contrib/postgres_fdw/connection.c
--- b/contrib/postgres_fdw/connection.c
***
*** 22,27 
--- 22,28 
  #include "pgstat.h"
  #include "storage/latch.h"
  #include "utils/hsearch.h"
+ #include "utils/inval.h"
  #include "utils/memutils.h"
  #include "utils/syscache.h"
  
***
*** 48,58  typedef struct ConnCacheEntry
--- 49,63 
  {
  	ConnCacheKey key;			/* hash key (must be first) */
  	PGconn	   *conn;			/* connection to foreign server, or NULL */
+ 	/* Remaining fields are invalid when conn is NULL: */
  	int			xact_depth;		/* 0 = no xact open, 1 = main xact open, 2 =
   * one level of subxact open, etc */
  	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
  	bool		have_error;		/* have any subxacts aborted in this xact? */
  	bool		changing_xact_state;	/* xact state change in process */
+ 	bool		invalidated;	/* true if reconnect is pending */
+ 	uint32		server_hashvalue;	/* hash value of foreign server OID */
+ 	uint32		mapping_hashvalue;	/* hash value of user mapping OID */
  } ConnCacheEntry;
  
  /*
***
*** 69,74  static bool xact_got_connection = false;
--- 74,80 
  
  /* prototypes of private functions */
  static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+ static void disconnect_pg_server(ConnCacheEntry *entry);
  static void check_conn_params(const char **keywords, const char **values);
  static void configure_remote_session(PGconn *conn);
  static void do_sql_command(PGconn *conn, const char *sql);
***
*** 78,83  static void pgfdw_subxact_callback(SubXactEvent event,
--- 84,90 
  	   SubTransactionId mySubid,
  	   SubTransactionId parentSubid,
  	   void *arg);
+ static void pgfdw_inval_callback(Datum arg, int cacheid, uint32 hashvalue);
  static void pgfdw_reject_incomplete_xact_state_change(ConnCacheEntry *entry);
  static bool pgfdw_cancel_query(PGconn *conn);
  static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
***
*** 130,135  GetConnection(UserMapping *user, bool will_prep_stmt)
--- 137,146 
  		 */
  		RegisterXactCallback(pgfdw_xact_callback, NULL);
  		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
+ 		CacheRegisterSyscacheCallback(FOREIGNSERVEROID,
+ 	  pgfdw_inval_callback, (Datum) 0);
+ 		CacheRegisterSyscacheCallback(USERMAPPINGOID,
+ 	  pgfdw_inval_callback, (Datum) 0);
  	}
  
  	/* Set flag that we did GetConnection during the current transaction */
***
*** 144,161  GetConnection(UserMapping *user, bool will_prep_stmt)
  	entry = hash_search(ConnectionHash, , HASH_ENTER, );
  	if (!found)
  	{
! 		/* initialize new hashtable entry (key is already filled in) */
  		entry->conn = NULL;
- 		entry->xact_depth = 0;
- 		entry->have_prep_stmt = false;
- 		entry->have_error = false;
- 		entry->changing_xact_state = false;
  	}
  
  	/* Reject further use of connections 

Re: [HACKERS] Better error message for trying to drop a DB with open subscriptions?

2017-07-20 Thread Jeff Janes
On Thu, Jul 20, 2017 at 4:09 PM, Josh Berkus  wrote:

> All:
>
> The problem:
>
> postgres=# drop database bookdata;
> ERROR:  database "bookdata" is being accessed by other users
> DETAIL:  There is 1 other session using the database.
> postgres=# \c bookdata
> You are now connected to database "bookdata" as user "postgres".
> bookdata=# drop subscription wholedb;
> NOTICE:  dropped replication slot "wholedb" on publisher
> DROP SUBSCRIPTION
> bookdata=# \c postgres
> You are now connected to database "postgres" as user "postgres".
> postgres=# drop database bookdata;
> DROP DATABASE
>
> Is there any easy way for us to detect that the "user" accessing the
> target database is actually a logical replication subscription, and give
> the DBA a better error message (e.g. "database 'bookdata' still has open
> subscriptions")?
>


+1

Better yet would be to just cascade the drop, but I assume that would be
harder to do.

Cheers,

Jeff


Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

2017-07-20 Thread Mark Kirkwood

On 21/07/17 15:58, Joshua D. Drake wrote:


On 07/19/2017 07:57 PM, Tom Lane wrote:

Peter Geoghegan  writes:

My argument for the importance of index bloat to the more general
bloat problem is simple: any bloat that accumulates, that cannot be
cleaned up, will probably accumulate until it impacts performance
quite noticeably.


But that just begs the question: *does* it accumulate indefinitely, or
does it eventually reach a more-or-less steady state?  The traditional
wisdom about btrees, for instance, is that no matter how full you pack
them to start with, the steady state is going to involve something like
1/3rd free space.  You can call that bloat if you want, but it's not
likely that you'll be able to reduce the number significantly without
paying exorbitant costs.

I'm not claiming that we don't have any problems, but I do think it's
important to draw a distinction between bloat and normal operating
overhead.


Agreed, but I don't think we are talking about 30%. Here is where I
am at. It took until 30 minutes ago for the tests to finish:


 name                                | setting
-------------------------------------+---------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 2
 autovacuum_max_workers              | 3
 autovacuum_multixact_freeze_max_age | 4
 autovacuum_naptime                  | 60
 autovacuum_vacuum_cost_delay        | 20
 autovacuum_vacuum_cost_limit        | -1
 autovacuum_vacuum_scale_factor      | 0.2
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
 log_autovacuum_min_duration         | -1


Test 1: 55G /srv/main
TPS: 955

Test 2: 112G /srv/main
TPS: 531 (Not sure what happened here, long checkpoint?)

Test 3: 109G /srv/main
TPS: 868

Test 4: 143G
TPS: 840

Test 5: 154G
TPS: 722

I am running the query here:

https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index 



And will post a followup. Once the query finishes I am going to launch 
the tests with autovacuum_vacuum_cost_limit of 5000. Is there anything 
else you folks would like me to change?

I usually advise setting autovacuum_naptime = 10s (or even 5s) for 
workloads that do a lot of updates (or inserts + deletes) - as on modern 
HW a lot of churn can happen in 1 minute, and that just makes vacuum's 
job harder.


regards
Mark





Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-20 Thread Neha Sharma
Hi,

Attached is the core dump file received on PG 10beta2 version.

myfile.tgz

Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 2:45 PM, Neha Sharma 
wrote:

>
>
> Regards,
> Neha Sharma
>
> On Thu, Jul 20, 2017 at 1:28 PM, Craig Ringer 
> wrote:
>
>> On 20 July 2017 at 15:00, Neha Sharma 
>> wrote:
>>
>>> Hi Craig,
>>>
>>> I had done a fresh initdb,the default parameter configuration was used.
>>> I was setting few set of parameters while startup by the below command.
>>>
>>> ./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
>>> min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
>>> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
>>>
>>> Now I have modified the script a bit with Robert's suggestion as below.
>>> Instead of starting it with postgres binary i have set it in conf file and
>>> starting the server with pg_ctl. I am waiting for the results,once the core
>>> dump is generated will share the details.
>>>
>>
>> Thanks.
>>
>> To verify that you do get a coredump, you might want to consider sending
>> a kill -SEGV to a backend and make sure that it actually dumps core and you
>> can find the core.
>>
>> Ideally you'd actually set the coredumps to include shmem (see
>> coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html),
>> but with 8GB shared_buffers that may not be practical. It'd be very useful
>> if possible.
>>
>> If this is wraparound-related, as it appears to be, you might get faster
>> results by using a custom pgbench script for one or more workers that just
>> runs txid_current() a whole lot. Or jump the server's xid space forward.
>>
> Thanks. Will put together suggestions to get the result.
>
>>
>> I've got a few other things on right now but I'll keep an eye out and
>> hope for a core dump.
>>
>> --
>>  Craig Ringer   http://www.2ndQuadrant.com/
>>  PostgreSQL Development, 24x7 Support, Training & Services
>>
>
>


Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

2017-07-20 Thread Joshua D. Drake

On 07/19/2017 07:57 PM, Tom Lane wrote:

Peter Geoghegan  writes:

My argument for the importance of index bloat to the more general
bloat problem is simple: any bloat that accumulates, that cannot be
cleaned up, will probably accumulate until it impacts performance
quite noticeably.


But that just begs the question: *does* it accumulate indefinitely, or
does it eventually reach a more-or-less steady state?  The traditional
wisdom about btrees, for instance, is that no matter how full you pack
them to start with, the steady state is going to involve something like
1/3rd free space.  You can call that bloat if you want, but it's not
likely that you'll be able to reduce the number significantly without
paying exorbitant costs.

I'm not claiming that we don't have any problems, but I do think it's
important to draw a distinction between bloat and normal operating
overhead.


Agreed, but I don't think we are talking about 30%. Here is where I am
at. It took until 30 minutes ago for the tests to finish:


 name                                | setting
-------------------------------------+---------
 autovacuum                          | on
 autovacuum_analyze_scale_factor     | 0.1
 autovacuum_analyze_threshold        | 50
 autovacuum_freeze_max_age           | 2
 autovacuum_max_workers              | 3
 autovacuum_multixact_freeze_max_age | 4
 autovacuum_naptime                  | 60
 autovacuum_vacuum_cost_delay        | 20
 autovacuum_vacuum_cost_limit        | -1
 autovacuum_vacuum_scale_factor      | 0.2
 autovacuum_vacuum_threshold         | 50
 autovacuum_work_mem                 | -1
 log_autovacuum_min_duration         | -1


Test 1: 55G /srv/main
TPS: 955

Test 2: 112G /srv/main
TPS: 531 (Not sure what happened here, long checkpoint?)

Test 3: 109G /srv/main
TPS: 868

Test 4: 143G
TPS: 840

Test 5: 154G
TPS: 722

I am running the query here:

https://wiki.postgresql.org/wiki/Index_Maintenance#Summarize_keyspace_of_a_B-Tree_index

And will post a followup. Once the query finishes I am going to launch 
the tests with autovacuum_vacuum_cost_limit of 5000. Is there anything 
else you folks would like me to change?


JD




--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
* Unless otherwise stated, opinions are my own.   *




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-07-20 Thread Craig Ringer
On 20 July 2017 at 19:09, Ashutosh Sharma  wrote:

> I had a quick look at this patch and also tested it, and the following
> are my observations.
>
> 1) I am seeing a server crash when passing any non meaningful value
> for t_infomask2 to heap_infomask_flags().
>
> postgres=# SELECT heap_infomask_flags(2816, 3);
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> !> \q
>
> Following is the backtrace,
>
> (gdb) bt
> #0  0x00d9c55b in pg_detoast_datum (datum=0x0) at fmgr.c:1833
> #1  0x00b87374 in construct_md_array (elems=0x2ad74c0,
> nulls=0x0, ndims=1, dims=0x7ffc0b0bbcd0, lbs=0x7ffc0b0bbcc0,
> elmtype=25, elmlen=-1,
> elmbyval=0 '\000', elmalign=105 'i') at arrayfuncs.c:3382
> #2  0x00b8709f in construct_array (elems=0x2ad74c0, nelems=10,
> elmtype=25, elmlen=-1, elmbyval=0 '\000', elmalign=105 'i') at
> arrayfuncs.c:3316
> #3  0x7fb8001603a5 in heap_infomask_flags (fcinfo=0x2ad3b88) at
> heapfuncs.c:597
>


Fixed.


> 2) I can see the documentation for heap_infomask(). But, I do not see
> it being defined or used anywhere in the patch.
>
> + 
> +  The heap_infomask function can be used to
> unpack the
> +  recognised bits of the infomasks of heap tuples.
> + 
>

Fixed. Renamed the function, missed a spot.


> 3) If the show_combined flag is set to its default value and a tuple is
> frozen, then may I know the reason for not showing it as a frozen tuple
> when t_infomask2 is passed as zero?


It was a consequence of (1). Fixed.


> 4) I think it would be better to use the same argument name for the
> newly added function, i.e. heap_infomask_flags(), in both the documentation
> and the SQL file. I am basically referring to the 'include_combined'
> argument. If you see the function definition, the argument name used is
> 'include_combined', whereas in the documentation you have mentioned
> 'show_combined'.
>

Fixed, thanks.

I want to find time to expand the tests on this some more and look more
closely, but here it is for now.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From a25c1e50e3b3ff493a018ab594a71dc3d64832ee Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 20 Jul 2017 11:20:21 +0800
Subject: [PATCH v2] Introduce heap_infomask_flags to decode infomask and
 infomask2

---
 contrib/pageinspect/Makefile  |   3 +-
 contrib/pageinspect/expected/page.out |  85 ++
 contrib/pageinspect/heapfuncs.c   | 120 ++
 contrib/pageinspect/pageinspect--1.6--1.7.sql |   9 ++
 contrib/pageinspect/pageinspect.control   |   2 +-
 contrib/pageinspect/sql/page.sql  |  24 ++
 doc/src/sgml/pageinspect.sgml |  32 +++
 7 files changed, 273 insertions(+), 2 deletions(-)
 create mode 100644 contrib/pageinspect/pageinspect--1.6--1.7.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index 0a3cbee..de114c7 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -5,7 +5,8 @@ OBJS		= rawpage.o heapfuncs.o btreefuncs.o fsmfuncs.o \
 		  brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES)
 
 EXTENSION = pageinspect
-DATA = pageinspect--1.5.sql pageinspect--1.5--1.6.sql \
+DATA = pageinspect--1.6--1.7.sql \
+	pageinspect--1.5.sql pageinspect--1.5--1.6.sql \
 	pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
 	pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
 	pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql
diff --git a/contrib/pageinspect/expected/page.out b/contrib/pageinspect/expected/page.out
index 8e15947..b335265 100644
--- a/contrib/pageinspect/expected/page.out
+++ b/contrib/pageinspect/expected/page.out
@@ -82,6 +82,91 @@ SELECT * FROM fsm_page_contents(get_raw_page('test1', 'fsm', 0));
  
 (1 row)
 
+-- If we freeze the only tuple on test1, the infomask should
+-- always be the same in all test runs.
+VACUUM FREEZE test1;
+SELECT t_infomask, t_infomask2, flags
+FROM heap_page_items(get_raw_page('test1', 0)),
+ LATERAL heap_infomask_flags(t_infomask, t_infomask2, true) m(flags);
+ t_infomask | t_infomask2 |   flags
++-+
+   2816 |   2 | {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID,HEAP_XMIN_FROZEN}
+(1 row)
+
+SELECT t_infomask, t_infomask2, flags
+FROM heap_page_items(get_raw_page('test1', 0)),
+ LATERAL heap_infomask_flags(t_infomask, t_infomask2, false) m(flags);
+ t_infomask | t_infomask2 |   flags   

Re: [HACKERS] Mishandling of WCO constraints in direct foreign table modification

2017-07-20 Thread Etsuro Fujita

On 2017/07/21 3:24, Robert Haas wrote:

I think that's reasonable.  This should be committed and back-patched
to 9.6, right?


Yeah, because direct modify was introduced in 9.6.

Attached is the second version, which updates the docs in
postgres-fdw.sgml as well.


Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 5856,5861  INSERT INTO ft1(c1, c2) VALUES(, 2);
--- 5856,5921 
  UPDATE ft1 SET c2 = c2 + 1 WHERE c1 = 1;
  ALTER FOREIGN TABLE ft1 DROP CONSTRAINT ft1_c2negative;
  -- ===
+ -- test WITH CHECK OPTION constraints
+ -- ===
+ CREATE TABLE base_tbl (a int, b int);
+ CREATE FOREIGN TABLE foreign_tbl (a int, b int)
+   SERVER loopback OPTIONS(table_name 'base_tbl');
+ CREATE VIEW rw_view AS SELECT * FROM foreign_tbl
+   WHERE a < b WITH CHECK OPTION;
+ \d+ rw_view
+View "public.rw_view"
+  Column |  Type   | Collation | Nullable | Default | Storage | Description 
+ +-+---+--+-+-+-
+  a  | integer |   |  | | plain   | 
+  b  | integer |   |  | | plain   | 
+ View definition:
+  SELECT foreign_tbl.a,
+ foreign_tbl.b
+FROM foreign_tbl
+   WHERE foreign_tbl.a < foreign_tbl.b;
+ Options: check_option=cascaded
+ 
+ INSERT INTO rw_view VALUES (0, 10); -- ok
+ INSERT INTO rw_view VALUES (10, 0); -- should fail
+ ERROR:  new row violates check option for view "rw_view"
+ DETAIL:  Failing row contains (10, 0).
+ EXPLAIN (VERBOSE, COSTS OFF)
+ UPDATE rw_view SET b = 20 WHERE a = 0; -- not pushed down
+ QUERY PLAN

+ 
--
+  Update on public.foreign_tbl
+Remote SQL: UPDATE public.base_tbl SET b = $2 WHERE ctid = $1
+->  Foreign Scan on public.foreign_tbl
+  Output: foreign_tbl.a, 20, foreign_tbl.ctid
+  Remote SQL: SELECT a, ctid FROM public.base_tbl WHERE ((a < b)) AND 
((a = 0)) FOR UPDATE
+ (5 rows)
+ 
+ UPDATE rw_view SET b = 20 WHERE a = 0; -- ok
+ EXPLAIN (VERBOSE, COSTS OFF)
+ UPDATE rw_view SET b = -20 WHERE a = 0; -- not pushed down
+ QUERY PLAN

+ 
--
+  Update on public.foreign_tbl
+Remote SQL: UPDATE public.base_tbl SET b = $2 WHERE ctid = $1
+->  Foreign Scan on public.foreign_tbl
+  Output: foreign_tbl.a, '-20'::integer, foreign_tbl.ctid
+  Remote SQL: SELECT a, ctid FROM public.base_tbl WHERE ((a < b)) AND 
((a = 0)) FOR UPDATE
+ (5 rows)
+ 
+ UPDATE rw_view SET b = -20 WHERE a = 0; -- should fail
+ ERROR:  new row violates check option for view "rw_view"
+ DETAIL:  Failing row contains (0, -20).
+ SELECT * FROM foreign_tbl;
+  a | b  
+ ---+
+  0 | 20
+ (1 row)
+ 
+ DROP FOREIGN TABLE foreign_tbl CASCADE;
+ NOTICE:  drop cascades to view rw_view
+ DROP TABLE base_tbl;
+ -- ===
  -- test serial columns (ie, sequence-based defaults)
  -- ===
  create table loc1 (f1 serial, f2 text);
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
***
*** 1158,1163  UPDATE ft1 SET c2 = c2 + 1 WHERE c1 = 1;
--- 1158,1187 
  ALTER FOREIGN TABLE ft1 DROP CONSTRAINT ft1_c2negative;
  
  -- ===
+ -- test WITH CHECK OPTION constraints
+ -- ===
+ 
+ CREATE TABLE base_tbl (a int, b int);
+ CREATE FOREIGN TABLE foreign_tbl (a int, b int)
+   SERVER loopback OPTIONS(table_name 'base_tbl');
+ CREATE VIEW rw_view AS SELECT * FROM foreign_tbl
+   WHERE a < b WITH CHECK OPTION;
+ \d+ rw_view
+ 
+ INSERT INTO rw_view VALUES (0, 10); -- ok
+ INSERT INTO rw_view VALUES (10, 0); -- should fail
+ EXPLAIN (VERBOSE, COSTS OFF)
+ UPDATE rw_view SET b = 20 WHERE a = 0; -- not pushed down
+ UPDATE rw_view SET b = 20 WHERE a = 0; -- ok
+ EXPLAIN (VERBOSE, COSTS OFF)
+ UPDATE rw_view SET b = -20 WHERE a = 0; -- not pushed down
+ UPDATE rw_view SET b = -20 WHERE a = 0; -- should fail
+ SELECT * FROM foreign_tbl;
+ 
+ DROP FOREIGN TABLE foreign_tbl CASCADE;
+ DROP TABLE base_tbl;
+ 
+ -- ===
  -- test serial columns (ie, sequence-based defaults)
  -- ===
  create table loc1 (f1 serial, f2 text);
*** 

Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table heirarchy

2017-07-20 Thread Amit Langote
On 2017/07/20 22:19, Tom Lane wrote:
> Greg Stark  writes:
>> On 19 July 2017 at 00:26, Tom Lane  wrote:
>>> It's probably a bit late in the v10 cycle to be taking any risks in
>>> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
>>> as soon as the v11 cycle opens, unless someone can show an example
>>> of non-broken coding that requires it.  (And if so, there ought to
>>> be a regression test incorporating that.)
> 
>> Would it be useful to keep in one of the memory checking assertion builds?
> 
> Why?  Code that expects to continue accessing a relcache entry's tupdesc
> after closing the relcache entry is broken, independently of whether it's
> in a debug build or not.

Am I wrong in thinking that TupleDesc refcounting (along with resowner
tracking) allows one to use a TupleDesc without worrying about the
lifetime of its parent relcache entry?

I'm asking this independently of the concerns being discussed in this
thread about having the RememberToFreeTupleDescAtEOX() mechanism on top of
TupleDesc refcounting.

Thanks,
Amit





Re: [HACKERS] xlogfilename

2017-07-20 Thread Craig Ringer
On 20 July 2017 at 21:33, Yugo Nagata  wrote:

> On Thu, 20 Jul 2017 11:02:25 +0200
> Michael Paquier  wrote:
>
> > On Thu, Jul 20, 2017 at 10:58 AM, DEV_OPS  wrote:
> > > I think you may reference to function: pg_xlogfile_name   in
> > > src/backend/access/transam/xlogfuncs.c,  it use XLogFileName  defined
> in
> > > src/include/access/xlog_internal.h
> > >
> > > #define XLogFileName(fname, tli, logSegNo)  \
> > > snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli,   \
> > >  (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
> > >  (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
> > >
> > >
> > > hope it's helpful for you
> >
> > The first 8 characters are the timeline number in hexadecimal format.
> > The next 8 characters indicate a segment number, which gets
> > incremented every 256 segments in hexa format. The last 8 characters
> > indicate the current segment number in hexa format.
>
> As far as I understand, XLOG is a big logical file of 256 * 16 MB,
> and this is split into multiple physical files of 16 MB which are called
> WAL segments. The second 8 characters indicate the id of the logical
> xlog file, and the last 8 characters indicate the sequential number of
> the segment within that xlog file.
> 
>

You missed the timeline ID, which is the first 8 digits.

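For concreteness, here is a minimal standalone sketch of how the XLogFileName
formula quoted above maps a timeline ID and logical segment number to a WAL
file name. It assumes the default 16 MB segment size (so XLogSegmentsPerXLogId
is 256), and the tli and logSegNo values are purely illustrative:

#include <stdio.h>

/* With the default 16 MB segments: 0x100000000 / (16 * 1024 * 1024) = 256 */
#define XLogSegmentsPerXLogId 256

int
main(void)
{
	unsigned int tli = 1;			/* timeline ID: first 8 hex digits */
	unsigned long logSegNo = 0x234;	/* illustrative logical segment number */
	char	fname[64];

	/* Same formula as the XLogFileName macro in xlog_internal.h */
	snprintf(fname, sizeof(fname), "%08X%08X%08X", tli,
			 (unsigned int) (logSegNo / XLogSegmentsPerXLogId),
			 (unsigned int) (logSegNo % XLogSegmentsPerXLogId));

	printf("%s\n", fname);		/* prints 000000010000000200000034 */
	return 0;
}

Running it makes the three 8-digit groups easy to see: timeline 1, logical
xlog file 2, and segment 0x34 within that file.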

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Definitional questions for pg_sequences view

2017-07-20 Thread Craig Ringer
On 20 July 2017 at 22:36, Tom Lane  wrote:

>
> This could be fixed if it were possible to translate to
> select * from pg_sequences where seqoid = 'my_seq'::regclass;
> but the view isn't exposing the sequence OID.  Should it?
>

It probably should. It's not part of information_schema, it's in
pg_catalog, and it's entirely reasonable to join on oids.

The relfilenode for the sequence can change, but the sequence oid won't
unless we actually drop and re-create it, so the weird issues with alter
sequence operations being partly transactional and partly not shouldn't be
a concern.

If it's to be a convenience view, it should possibly also expose the OWNED
BY relation oid IMO, if any. Once you have the sequence oid you can join on
pg_class and grab the relowner, so it's not a great hassle if it's missing,
but if it's a view to help users out, exposing that would seem sensible.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] More optimization effort?

2017-07-20 Thread Craig Ringer
On 21 July 2017 at 07:11, Tatsuo Ishii  wrote:

> Currently following query does not use an index:
>
> t-ishii@localhost: psql -p 5433 test
> Pager usage is off.
> psql (9.6.3)
> Type "help" for help.
>
> test=# explain select * from pgbench_accounts where aid*100 < 1;
>QUERY PLAN
> 
>  Seq Scan on pgbench_accounts  (cost=0.00..3319.00 rows=3 width=97)
>Filter: ((aid * 100) < 1)
> (2 rows)
>
> While following one does use the index.
>
> test=# explain select * from pgbench_accounts where aid < 1/100;
> QUERY PLAN
> 
> --
>  Index Scan using pgbench_accounts_pkey on pgbench_accounts
> (cost=0.29..11.08 rows=102 width=97)
>Index Cond: (aid < 100)
> (2 rows)
>
> Is it worth making our optimizer a little bit smarter, so that it converts
> the first query into the second form?
>

If I understand correctly, you're proposing that the optimiser should
attempt algebraic simplification to fold more constants, rather than
stopping pre-evaluation of constant expressions as soon as we see a
non-constant, like we do now. Right?

I'm sure there are documented algorithms out there for algebraic
manipulations like that, taking account of precedence etc. But will they be
cheap enough to run in the optimiser? And likely to benefit many queries?

There's also the hiccup of partial index matching. If Pg simplifies and
rearranges expressions more, will we potentially fail to match partial
indexes that we would've originally matched? I'm not sure it's a blocker,
but it bears consideration, and Pg might have to do more work on partial
index matching too.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Better error message for trying to drop a DB with open subscriptions?

2017-07-20 Thread Craig Ringer
On 21 July 2017 at 07:09, Josh Berkus  wrote:

> All:
>
> The problem:
>
> postgres=# drop database bookdata;
> ERROR:  database "bookdata" is being accessed by other users
> DETAIL:  There is 1 other session using the database.
> postgres=# \c bookdata
> You are now connected to database "bookdata" as user "postgres".
> bookdata=# drop subscription wholedb;
> NOTICE:  dropped replication slot "wholedb" on publisher
> DROP SUBSCRIPTION
> bookdata=# \c postgres
> You are now connected to database "postgres" as user "postgres".
> postgres=# drop database bookdata;
> DROP DATABASE
>
> Is there any easy way for us to detect that the "user" accessing the
> target database is actually a logical replication subscription, and give
> the DBA a better error message (e.g. "database 'bookdata' still has open
> subscriptions")?



Good idea. Also, this affects any active logical (db-specific) replication
slot, not just built-in logical replication.

CountOtherDBBackends reports prepared xacts separately already, and
errdetail_busy_db uses that to report the two separately. Since we have
slot attachment data I expect reporting attached replication slots would
not be hard either; you might be able to prep a patch for that in a few
hours.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [PATCH] A hook for session start

2017-07-20 Thread Craig Ringer
On 21 July 2017 at 08:42, Robert Haas  wrote:

> On Thu, Jul 20, 2017 at 8:27 AM, Fabrízio de Royes Mello
>  wrote:
> > I'm not sure your real needs but doesn't it material for improve Event
> > Triggers???
>
> I've thought about that, too.  One problem is what to do if the user
> hits ^C while the event trigger procedure is running.  If you respond
> to that by killing the event trigger and letting the user issue
> commands, then the event trigger can't be used for security or
> auditing purposes because the user might prevent it from doing
> whatever it's intended to do with a well-timed interrupt.  If you
> ignore ^C or make it turn into FATAL, then a poorly-crafted trigger
> can lock users out of the database.  Maybe that's OK.  We could say
> "well, if you lock yourself out of the database with your logon
> trigger, you get to shut down the database and restart in single user
> mode to recover".
>
> A hook, as proposed here, is a lot simpler and lacks these concerns.
> Installing code in C into the database is intrinsically risky
> anywhere, and not any moreso here than elsewhere.  But it's also less
> accessible to the average user.
> 


I'd favour the C hook personally. It's a lot more flexible, and can be used
by an extension to implement trigger-like behaviour if anyone wants it,
including the extension's choice of error handling decisions.

It's also a lot simpler and less intrusive for core, which is nice when we
don't have anything compelling destined for core that needs it. (I want to
add a bunch of hooks in the logical replication code in pg11 for similar
reasons, so that features like DDL replication can be prototyped as
extensions more practically.)

That said, isn't ExecutorStart_hook + ProcessUtility_hook able to serve the
same job as a session-start hook, albeit at slightly higher overhead? You
can just test to see if your initial tasks have run yet.
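
To illustrate that last point, here is a minimal sketch (not a worked-out
proposal) of an extension that piggybacks on ExecutorStart_hook and runs its
once-per-session work the first time the backend executes a query; the module
and function names are made up for the example, and a real extension would
presumably chain into ProcessUtility_hook the same way to catch utility-only
sessions:

#include "postgres.h"
#include "fmgr.h"
#include "executor/executor.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

static ExecutorStart_hook_type prev_ExecutorStart = NULL;
static bool session_tasks_done = false;

static void
my_ExecutorStart(QueryDesc *queryDesc, int eflags)
{
	/* Run the "session start" work once per backend */
	if (!session_tasks_done)
	{
		session_tasks_done = true;
		elog(LOG, "running once-per-session initialization");
	}

	/* Then hand off to the next hook, or the standard executor */
	if (prev_ExecutorStart)
		prev_ExecutorStart(queryDesc, eflags);
	else
		standard_ExecutorStart(queryDesc, eflags);
}

void
_PG_init(void)
{
	prev_ExecutorStart = ExecutorStart_hook;
	ExecutorStart_hook = my_ExecutorStart;
}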

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [PATCH] A hook for session start

2017-07-20 Thread Robert Haas
On Thu, Jul 20, 2017 at 8:27 AM, Fabrízio de Royes Mello
 wrote:
> I'm not sure your real needs but doesn't it material for improve Event
> Triggers???

I've thought about that, too.  One problem is what to do if the user
hits ^C while the event trigger procedure is running.  If you respond
to that by killing the event trigger and letting the user issue
commands, then the event trigger can't be used for security or
auditing purposes because the user might prevent it from doing
whatever it's intended to do with a well-timed interrupt.  If you
ignore ^C or make it turn into FATAL, then a poorly-crafted trigger
can lock users out of the database.  Maybe that's OK.  We could say
"well, if you lock yourself out of the database with your logon
trigger, you get to shut down the database and restart in single user
mode to recover".

A hook, as proposed here, is a lot simpler and lacks these concerns.
Installing code in C into the database is intrinsically risky
anywhere, and not any moreso here than elsewhere.  But it's also less
accessible to the average user.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] More optimization effort?

2017-07-20 Thread Tatsuo Ishii
Currently following query does not use an index:

t-ishii@localhost: psql -p 5433 test
Pager usage is off.
psql (9.6.3)
Type "help" for help.

test=# explain select * from pgbench_accounts where aid*100 < 1;
   QUERY PLAN   

 Seq Scan on pgbench_accounts  (cost=0.00..3319.00 rows=3 width=97)
   Filter: ((aid * 100) < 1)
(2 rows)

While following one does use the index.

test=# explain select * from pgbench_accounts where aid < 1/100;
QUERY PLAN  
  
--
 Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.29..11.08 
rows=102 width=97)
   Index Cond: (aid < 100)
(2 rows)

Is it worth making our optimizer a little bit smarter, so that it converts
the first query into the second form?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




[HACKERS] Better error message for trying to drop a DB with open subscriptions?

2017-07-20 Thread Josh Berkus
All:

The problem:

postgres=# drop database bookdata;
ERROR:  database "bookdata" is being accessed by other users
DETAIL:  There is 1 other session using the database.
postgres=# \c bookdata
You are now connected to database "bookdata" as user "postgres".
bookdata=# drop subscription wholedb;
NOTICE:  dropped replication slot "wholedb" on publisher
DROP SUBSCRIPTION
bookdata=# \c postgres
You are now connected to database "postgres" as user "postgres".
postgres=# drop database bookdata;
DROP DATABASE

Is there any easy way for us to detect that the "user" accessing the
target database is actually a logical replication subscription, and give
the DBA a better error message (e.g. "database 'bookdata' still has open
subscriptions")?

-- 
Josh Berkus
Containers & Databases Oh My!




Re: [HACKERS] <> join selectivity estimate question

2017-07-20 Thread Thomas Munro
On Fri, Jul 21, 2017 at 8:21 AM, Tom Lane  wrote:
> Ashutosh Bapat  writes:
>> On Thu, Jul 20, 2017 at 5:30 PM, Thomas Munro
>>  wrote:
>>> Does anyone know how to test a situation where the join is reversed 
>>> according to
>>> get_join_variables, or "complicated cases where we can't tell for sure"?
>
>> explain select * from pg_class c right join pg_type t on (c.reltype =
>> t.oid); would end up with  *join_is_reversed = true; Is that what you
>> want? For a semi-join however I don't know how to induce that. AFAIU,
>> in a semi-join there is only one direction in which join can be
>> specified.
>
> You just have to flip the <> clause around, eg instead of
>
> explain analyze select * from tenk1 t
>   where exists (select 1 from int4_tbl i where t.ten <> i.f1);
>
> do
>
> explain analyze select * from tenk1 t
>   where exists (select 1 from int4_tbl i where i.f1 <> t.ten);
>
> No matter what the surrounding query is like exactly, one or the
> other of those should end up "join_is_reversed".

Ahh, I see.  Thanks for the explanation.

> This would be a bit harder to trigger for equality clauses, where you'd
> have to somehow defeat the EquivalenceClass logic's tendency to rip the
> clauses apart and reassemble them according to its own whims.  But for
> neqjoinsel that's not a problem.
>
>> I didn't get the part about "complicated cases where we can't tell for sure".
>
> You could force that with mixed relation membership on one or both sides
> of the <>, for instance "(a.b + b.y) <> a.c".  I don't think it's
> especially interesting for the present purpose though, since we're going
> to end up with 1.0 selectivity in any case where examine_variable can't
> find stats.

Thanks.  Bearing all that in mind, I ran through a series of test
scenarios and discovered that my handling for JOIN_ANTI was wrong: I
thought that I had to deal with inverting the result, but I now see
that that's handled elsewhere (calc_joinrel_size_estimate() I think).
So neqjoinsel should just treat JOIN_SEMI and JOIN_ANTI exactly the
same way.

That just leaves the question of whether we should try to handle the
empty RHS and single-value RHS cases using statistics.  My intuition
is that we shouldn't, but I'll be happy to change my intuition and
code that up if that is the feedback from planner gurus.

Please find attached a new version, and a test script I used, which
shows a bunch of interesting cases.  I'll add this to the commitfest.

-- 
Thomas Munro
http://www.enterprisedb.com


neqjoinsel-fix-v3.patch
Description: Binary data


test-neqjoinsel.sql
Description: Binary data


test-neqjoinsel.out
Description: Binary data



Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-20 Thread Alvaro Herrera
Kyotaro HORIGUCHI wrote:

> Finally, I added a new TAP test library PsqlSession. It offers
> interactive psql sessions. Then added a simple test to
> postgres_fdw using it.

Hmm, I think this can be very useful for other things.  Let's keep this
in mind to use in the future, even if we find another way to fix the
issue at hand.  In fact, I had a problem a couple of weeks ago in which
I needed two concurrent sessions and one of them disconnected in the
middle of the test.  Can't do that with isolationtester ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-20 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> Here it is. First I tried this using the ordinary regression
> framework, but the effect of this patch shows up only in the log, which
> contains variable parts, so I gave that up before trying anything more
> complex.

> Next I tried the existing TAP tests, but this test needs a continuous
> session to achieve alternating operation on two sessions, and
> PostgresNode::psql doesn't offer such functionality.

> Finally, I added a new TAP test library PsqlSession. It offers
> interactive psql sessions. Then added a simple test to
> postgres_fdw using it.

This seems like overkill.  We can test it reasonably easily within the
existing framework, as shown in the attached patch.  I'm also fairly
concerned that what you're showing here would be unstable in the buildfarm
as a result of race conditions between the multiple sessions.

I made some cosmetic updates to the code patch, as well.

I think this is actually a bug fix, and should not wait for the next
commitfest.

regards, tom lane

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 8c33dea..8eb477b 100644
*** a/contrib/postgres_fdw/connection.c
--- b/contrib/postgres_fdw/connection.c
***
*** 22,27 
--- 22,28 
  #include "pgstat.h"
  #include "storage/latch.h"
  #include "utils/hsearch.h"
+ #include "utils/inval.h"
  #include "utils/memutils.h"
  #include "utils/syscache.h"
  
*** typedef struct ConnCacheEntry
*** 48,58 
--- 49,63 
  {
  	ConnCacheKey key;			/* hash key (must be first) */
  	PGconn	   *conn;			/* connection to foreign server, or NULL */
+ 	/* Remaining fields are invalid when conn is NULL: */
  	int			xact_depth;		/* 0 = no xact open, 1 = main xact open, 2 =
   * one level of subxact open, etc */
  	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
  	bool		have_error;		/* have any subxacts aborted in this xact? */
  	bool		changing_xact_state;	/* xact state change in process */
+ 	bool		invalidated;	/* true if reconnect is pending */
+ 	uint32		server_hashvalue;	/* hash value of foreign server OID */
+ 	uint32		mapping_hashvalue;	/* hash value of user mapping OID */
  } ConnCacheEntry;
  
  /*
*** static bool xact_got_connection = false;
*** 69,74 
--- 74,80 
  
  /* prototypes of private functions */
  static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+ static void disconnect_pg_server(ConnCacheEntry *entry);
  static void check_conn_params(const char **keywords, const char **values);
  static void configure_remote_session(PGconn *conn);
  static void do_sql_command(PGconn *conn, const char *sql);
*** static void pgfdw_subxact_callback(SubXa
*** 78,83 
--- 84,90 
  	   SubTransactionId mySubid,
  	   SubTransactionId parentSubid,
  	   void *arg);
+ static void pgfdw_inval_callback(Datum arg, int cacheid, uint32 hashvalue);
  static void pgfdw_reject_incomplete_xact_state_change(ConnCacheEntry *entry);
  static bool pgfdw_cancel_query(PGconn *conn);
  static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
*** GetConnection(UserMapping *user, bool wi
*** 130,135 
--- 137,146 
  		 */
  		RegisterXactCallback(pgfdw_xact_callback, NULL);
  		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
+ 		CacheRegisterSyscacheCallback(FOREIGNSERVEROID,
+ 	  pgfdw_inval_callback, (Datum) 0);
+ 		CacheRegisterSyscacheCallback(USERMAPPINGOID,
+ 	  pgfdw_inval_callback, (Datum) 0);
  	}
  
  	/* Set flag that we did GetConnection during the current transaction */
*** GetConnection(UserMapping *user, bool wi
*** 144,161 
  	entry = hash_search(ConnectionHash, , HASH_ENTER, );
  	if (!found)
  	{
! 		/* initialize new hashtable entry (key is already filled in) */
  		entry->conn = NULL;
- 		entry->xact_depth = 0;
- 		entry->have_prep_stmt = false;
- 		entry->have_error = false;
- 		entry->changing_xact_state = false;
  	}
  
  	/* Reject further use of connections which failed abort cleanup. */
  	pgfdw_reject_incomplete_xact_state_change(entry);
  
  	/*
  	 * We don't check the health of cached connection here, because it would
  	 * require some overhead.  Broken connection will be detected when the
  	 * connection is actually used.
--- 155,182 
  	entry = hash_search(ConnectionHash, , HASH_ENTER, );
  	if (!found)
  	{
! 		/*
! 		 * We need only clear "conn" here; remaining fields will be filled
! 		 * later when "conn" is set.
! 		 */
  		entry->conn = NULL;
  	}
  
  	/* Reject further use of connections which failed abort cleanup. */
  	pgfdw_reject_incomplete_xact_state_change(entry);
  
  	/*
+ 	 * If the connection needs to be remade due to invalidation, disconnect as
+ 	 * soon as we're out of all transactions.
+ 	 */
+ 	if (entry->conn != NULL && entry->invalidated && entry->xact_depth == 0)
+ 	{
+ 		elog(DEBUG3, "closing 

Re: [HACKERS] <> join selectivity estimate question

2017-07-20 Thread Tom Lane
Ashutosh Bapat  writes:
> On Thu, Jul 20, 2017 at 5:30 PM, Thomas Munro
>  wrote:
>> Does anyone know how to test a situation where the join is reversed 
>> according to
>> get_join_variables, or "complicated cases where we can't tell for sure"?

> explain select * from pg_class c right join pg_type t on (c.reltype =
> t.oid); would end up with  *join_is_reversed = true; Is that what you
> want? For a semi-join however I don't know how to induce that. AFAIU,
> in a semi-join there is only one direction in which join can be
> specified.

You just have to flip the <> clause around, eg instead of

explain analyze select * from tenk1 t
  where exists (select 1 from int4_tbl i where t.ten <> i.f1);

do

explain analyze select * from tenk1 t
  where exists (select 1 from int4_tbl i where i.f1 <> t.ten);

No matter what the surrounding query is like exactly, one or the
other of those should end up "join_is_reversed".

This would be a bit harder to trigger for equality clauses, where you'd
have to somehow defeat the EquivalenceClass logic's tendency to rip the
clauses apart and reassemble them according to its own whims.  But for
neqjoinsel that's not a problem.

> I didn't get the part about "complicated cases where we can't tell for sure".

You could force that with mixed relation membership on one or both sides
of the <>, for instance "(a.b + b.y) <> a.c".  I don't think it's
especially interesting for the present purpose though, since we're going
to end up with 1.0 selectivity in any case where examine_variable can't
find stats.

regards, tom lane




Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Tom Lane
Robert Haas  writes:
> I think that's a valid point.  There are also other concerns here -
> e.g. whether instead of adopting the patch as proposed we ought to (a)
> use some smaller size, or (b) keep the size as-is but reduce the
> maximum fraction of shared_buffers that can be consumed, or (c) divide
> the ring buffer size through by autovacuum_max_workers.  Personally,
> of those approaches, I favor (b).  I think a 16MB ring buffer is
> probably just fine if you've got 8GB of shared_buffers but I'm
> skeptical about it when you've got 128MB of shared_buffers.

WFM.  I agree with *not* dividing the basic ring buffer size by
autovacuum_max_workers.  If you have allocated more AV workers, I think
you expect AV to go faster, not for the workers to start fighting among
themselves.

It might, however, be reasonable for the fraction-of-shared-buffers
limitation to have something to do with autovacuum_max_workers, so that
you can't squeeze yourself out of shared_buffers if you set that number
really high.  IOW, I think the upthread suggestion of
min(shared_buffers/8/autovacuum_workers, 16MB) is basically the right
idea, though we could debate the exact constants.
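
Just to make the arithmetic concrete, here is a trivial sketch of that
suggestion (the helper name is made up, and the constants are exactly the ones
being debated): 128 MB of shared_buffers with three workers comes out around
5 MB per ring, while 8 GB hits the 16 MB ceiling.

#include <stdio.h>

/* min(shared_buffers / 8 / autovacuum_max_workers, 16 MB), in kilobytes */
static long
ring_buffer_kb(long shared_buffers_kb, int autovacuum_max_workers)
{
	long		cap_kb = 16 * 1024;
	long		frac_kb = shared_buffers_kb / 8 / autovacuum_max_workers;

	return (frac_kb < cap_kb) ? frac_kb : cap_kb;
}

int
main(void)
{
	printf("%ld kB\n", ring_buffer_kb(128 * 1024, 3));			/* 5461 kB, ~5.3 MB */
	printf("%ld kB\n", ring_buffer_kb(8L * 1024 * 1024, 3));	/* capped at 16384 kB */
	return 0;
}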

regards, tom lane




Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Jul 20, 2017 at 3:04 PM, Stephen Frost  wrote:
> > I agree that it's a common problem for VACUUM to go too fast, or for
> > VACUUM to go too slow, but that's really what the vacuum_cost_limit
> > mechanism is for.
> 
> I think that's a valid point.  There are also other concerns here -
> e.g. whether instead of adopting the patch as proposed we ought to (a)
> use some smaller size, or (b) keep the size as-is but reduce the
> maximum fraction of shared_buffers that can be consumed, or (c) divide
> the ring buffer size through by autovacuum_max_workers.  Personally,
> of those approaches, I favor (b).  I think a 16MB ring buffer is
> probably just fine if you've got 8GB of shared_buffers but I'm
> skeptical about it when you've got 128MB of shared_buffers.

Right, agreed on that and that (b) looks to be a good option there.

Thanks!

Stephen




Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Robert Haas
On Thu, Jul 20, 2017 at 3:04 PM, Stephen Frost  wrote:
> I agree that it's a common problem for VACUUM to go too fast, or for
> VACUUM to go too slow, but that's really what the vacuum_cost_limit
> mechanism is for.

I think that's a valid point.  There are also other concerns here -
e.g. whether instead of adopting the patch as proposed we ought to (a)
use some smaller size, or (b) keep the size as-is but reduce the
maximum fraction of shared_buffers that can be consumed, or (c) divide
the ring buffer size through by autovacuum_max_workers.  Personally,
of those approaches, I favor (b).  I think a 16MB ring buffer is
probably just fine if you've got 8GB of shared_buffers but I'm
skeptical about it when you've got 128MB of shared_buffers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-20 Thread Robert Haas
On Thu, Jul 20, 2017 at 2:17 AM, Michael Paquier
 wrote:
> Heikki, this bug is rather bad for anybody using pg_rewind with
> relation file sizes larger than 2GB as this corrupts data of
> instances. I think that you would be the best fit as a committer to
> look at this patch as you implemented the tool first, and it would be
> a bad idea to let that sit for a too long time. Could it be possible
> to spare a bit of your time at some point to look at it?

The comment header of receiveFileChunks is updated like this:

 /*
  * Runs a query, which returns pieces of files from the remote source data
  * directory, and overwrites the corresponding parts of target files with
  * the received parts. The result set is expected to be of format:
  *
  * pathtext-- path in the data directory, e.g "base/1/123"
- * begin   int4-- offset within the file
+ * begin   int8-- offset within the file
  * chunk   bytea   -- file content
  *

...but down inside the function there's still this:

    if (PQftype(res, 0) != TEXTOID &&
        PQftype(res, 1) != INT4OID &&
        PQftype(res, 2) != BYTEAOID)
    {
        pg_fatal("unexpected data types in result set while fetching remote files: %u %u %u\n",
                 PQftype(res, 0), PQftype(res, 1), PQftype(res, 2));
    }

I was initially surprised that your testing managed to pass, but then
I noticed that this sanity test is using && where it should really be
using ||; it will only fail if ALL of the data types are wrong.  Oops.
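
For clarity, a sketch of what the corrected test would look like, combining
the || fix with the int8 "begin" column from the comment header above
(illustrative only, not a tested patch):

    /* Fail if ANY returned column has an unexpected type; "begin" is now int8 */
    if (PQftype(res, 0) != TEXTOID ||
        PQftype(res, 1) != INT8OID ||
        PQftype(res, 2) != BYTEAOID)
    {
        pg_fatal("unexpected data types in result set while fetching remote files: %u %u %u\n",
                 PQftype(res, 0), PQftype(res, 1), PQftype(res, 2));
    }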

This code plays pretty fast and loose with off_t vs. size_t vs. uint64
vs. int64, but those definitely don't all agree in signedness and may
not even agree in width (size_t, at least, might be 4 bytes).  We use
stat() to get the size of a file as an off_t, and then store it into a
uint64. Perhaps off_t is always uint64 anyway, but I'm not sure.
Then, that value gets passed to fetch_file_range(), still as a uint64,
and it then gets sent to the server to be stored into an int8 column,
which is signed rather than unsigned.  That will error out if there's
a file size >= 2^63, but filesystems holding more than 8 exabytes are
unusual and will probably remain so for some time.  The server sends
back an int64 which we pass to write_target_range(), whose argument is
declared as off_t, which is where we started.  I'm not sure there's
any real problem with all of that, but it's a little nervous-making
that we keep switching types.  Similarly, libpqProcessFileList gets
the file size as an int64 and then passes it to process_source_file()
which is expecting size_t.  So we manage to use 4 different data types
to represent a file size.  On most systems, all of them except SQL's
int8 are likely to be 64-bit unsigned integers, but I'm not sure what
will happen on obscure platforms.

Still, it can't be worse than the status quo, where instead of int64
we're using int and int32, so maybe we ought to back-patch it as-is
for now and look at any further cleanup that is needed as a
master-only improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Jul 20, 2017 at 12:16 PM, Sokolov Yura
>  wrote:
> > But in fact, vacuum process performs FSYNC! It happens, cause vacuum
> > evicts dirty pages from its ring buffer. And to evict dirty page, it
> > has to be sure WAL record about its modification is FSYNC-ed to WAL.
> > Because ring buffer is damn small, vacuum almost always have to perform
> > FSYNC by itself and have to do it very frequently (cause ring is damn
> > small).
> >
> > With greater ring buffer, there is greater chance that fsync were
> > performed by wal_writer process, or other backend. And even if
> > vacuum does fsync by itself, it syncs records about more modified
> > pages from its ring, so evicting next page is free.
> 
> OK, but I have helped *many* customers whose problem was that vacuum
> ran too fast and blew data out of the OS cache causing query response
> times to go through the roof.  That's a common problem.  Making VACUUM
> run faster will presumably make it more common.  I've also run into
> many customers whose problem was that vacuum ran too slowly, and generally
> raising vacuum_cost_limit fixes that problem just fine.  So I don't
> think it's nearly as clear as you do that making VACUUM run faster is
> desirable.

I agree that it's a common problem for VACUUM to go too fast, or for
VACUUM to go too slow, but that's really what the vacuum_cost_limit
mechanism is for.

I can see an argument that existing tuned systems which have been
expecting the small ring-buffer to help slow down VACUUM may have to be
adjusted to handle a change, though I would think that other changes
we've made might also require changes to vacuum costing, so I'm not sure
that this is particularly different in that regard.

What I don't agree with is holding off on improving VACUUM in the case
where cost delay is set to zero because we think people might be
depending on it only going so fast in that case.  If the cost delay is
set to zero, then VACUUM really should be going as fast as it can and we
should welcome improvements in that area in much the same way that we
welcome performance improvements in sorting and other backend
algorithms.

> > If some fears that increasing vacuum ring buffer will lead to
> > decreasing transaction performance, then why it were not exhaustively
> > tested?
> 
> If you want something changed, it's your job to do that testing.
> Asking why nobody else tested the effects of changing the thing you
> want changed is like asking why nobody else wrote the patch you want
> written.

I do agree with this.  Asking for others to also test is fine, but it's
the patch submitter who needs to ensure that said testing actually
happens and that results are provided to -hackers to support the change.

In particular, multiple different scenarios (DB all in shared_buffers,
DB all in OS cache, DB not able to fit in memory at all, etc) should be
tested.

> > And possible community will decide, that it should be GUC variable:
> > - if one prefers to keep its database unbloated, he could increase
> > vacuum ring buffer,
> > - otherwise just left it in "safe-but-slow" default.
> 
> That's a possible outcome, but I don't think this discussion is really
> going anywhere unless you are willing to admit that increasing VACUUM
> performance could have some downsides.  If you're not willing to admit
> that, there's not a lot to talk about here.

I'd rather we encourage people to use the existing knobs for tuning
VACUUM speed rather than adding another one that ends up being actually
only a proxy for speed.  If there's a memory utilization concern here,
then having a knob for that might make sense, but it sounds like the
concern here is more about the speed and less about coming up with a
reasonable way to scale the size of the ring buffer.

Of course, I'm all for coming up with a good way to size the ring
buffer, and providing a knob if we aren't able to do so, I just don't
want to add unnecessary knobs if we don't need them.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

2017-07-20 Thread Peter Geoghegan
On Thu, Jul 20, 2017 at 7:45 AM, Claudio Freire  wrote:
>> For the purposes of this discussion, I'm mostly talking about
>> duplicates within a page on a unique index. If the keyspace owned by
>> an int4 unique index page only covers 20 distinct values, it will only
>> ever cover 20 distinct values, now and forever, despite the fact that
>> there is room for about 400 (a 90/10 split leaves you with 366 items +
>> 1 high key).
>
> Microvacuum could also help.
>
> If during a scan you find pointers that point to dead (in vacuum terms)
> tuples, the pointers in the index could be deleted. That could be done
> during insert into unique indexes before a split, to avoid the split.
>
> Chances are, if there are duplicates, at least a few of them will be dead.

My whole point is that that could easily fail to happen early enough
to prevent a pagesplit that is only needed because there is a short
term surge in the number of duplicate versions that need to be
available for one old snapshot. A pagesplit can be a permanent
solution to a temporary problem. Page deletion can only occur under
tight conditions that are unlikely to *ever* be met in many cases.

Imagine if it was impossible to insert physical duplicates into unique
indexes. In that world, you'd end up bloating some overflow data
structure in UPDATE heavy cases (where HOT doesn't work out). The
bloat wouldn't go on leaf pages, and so you wouldn't get page splits,
and so you wouldn't end up with leaf pages that can only store 20
distinct values now and forever, because that's the range of values
represented by downlinks and the leaf's high key. That's a situation
we actually saw for the leftmost leaf page in Alik's Zipfian
distribution test.

The way that the keyspace is broken up is supposed to be balanced, and
to have long term utility. Working against that to absorb a short term
bloat problem is penny wise, pound foolish.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Mishandling of WCO constraints in direct foreign table modification

2017-07-20 Thread Robert Haas
On Thu, Jul 20, 2017 at 7:40 AM, Etsuro Fujita
 wrote:
> Here is an example for $subject using postgres_fdw:
>
> postgres=# create foreign table foreign_tbl (a int, b int) server loopback
> options (table_name 'base_tbl');
> CREATE FOREIGN TABLE
> postgres=# create view rw_view as select * from foreign_tbl where a < b with
> check option;
> CREATE VIEW
> postgres=# insert into rw_view values (0, 10);
> INSERT 0 1
> postgres=# explain verbose update rw_view set a = 20 where b = 10;
>   QUERY PLAN
> --
>  Update on public.foreign_tbl  (cost=100.00..146.21 rows=4 width=14)
>->  Foreign Update on public.foreign_tbl  (cost=100.00..146.21 rows=4
> width=14)
>  Remote SQL: UPDATE public.base_tbl SET a = 20 WHERE ((a < b)) AND
> ((b = 10))
> (3 rows)
>
> postgres=# update rw_view set a = 20 where b = 10;
> UPDATE 1
>
> This is wrong!  This should fail.  The reason for that is; direct modify is
> overlooking checking WITH CHECK OPTION constraints from parent views.  I
> think we could do direct modify, even if there are any WITH CHECK OPTIONs,
> in some way or other, but I think that is a feature.  So, I'd like to
> propose to fix this by just giving up direct modify if there are any WITH
> CHECK OPTIONs.  Attached is a patch for that.  I'll add it to the next
> commitfest.

I think that's reasonable.  This should be committed and back-patched
to 9.6, right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Robert Haas
On Thu, Jul 20, 2017 at 12:16 PM, Sokolov Yura
 wrote:
> So, from my point of view, no one just evaluate performance of increased
> ring buffer for vacuum.

I think that argument is clearly incorrect.  In commit
6382448cf96a9b88d418cbaf86027b63f465b5d8, which you cited, Tom even
added a note in the README file about why he didn't increase the value
for vacuum also.  He knew it would have increased performance had he
also done it for BAS_BULKWRITE, and I knew it too, but it wasn't clear
that it was a good idea, and it's still not.

> But in fact, vacuum process performs FSYNC! It happens, cause vacuum
> evicts dirty pages from its ring buffer. And to evict dirty page, it
> has to be sure WAL record about its modification is FSYNC-ed to WAL.
> Because ring buffer is damn small, vacuum almost always have to perform
> FSYNC by itself and have to do it very frequently (cause ring is damn
> small).
>
> With greater ring buffer, there is greater chance that fsync were
> performed by wal_writer process, or other backend. And even if
> vacuum does fsync by itself, it syncs records about more modified
> pages from its ring, so evicting next page is free.

OK, but I have helped *many* customers whose problem was that vacuum
ran too fast and blew data out of the OS cache causing query response
times to go through the roof.  That's a common problem.  Making VACUUM
run faster will presumably make it more common.  I've also run into
many customers whose problem was that vacuum ran too slowly, and generally
raising vacuum_cost_limit fixes that problem just fine.  So I don't
think it's nearly as clear as you do that making VACUUM run faster is
desirable.

> If some fears that increasing vacuum ring buffer will lead to
> decreasing transaction performance, then why it were not exhaustively
> tested?

If you want something changed, it's your job to do that testing.
Asking why nobody else tested the effects of changing the thing you
want changed is like asking why nobody else wrote the patch you want
written.

> And possible community will decide, that it should be GUC variable:
> - if one prefers to keep its database unbloated, he could increase
> vacuum ring buffer,
> - otherwise just left it in "safe-but-slow" default.

That's a possible outcome, but I don't think this discussion is really
going anywhere unless you are willing to admit that increasing VACUUM
performance could have some downsides.  If you're not willing to admit
that, there's not a lot to talk about here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC 2017: weekly progress reports (week 7)

2017-07-20 Thread Robert Haas
On Thu, Jul 20, 2017 at 1:22 PM, Shubham Barai 
wrote:

> I had detailed discussion about this with my mentor. Sorry, I didn't share
> details on hackers list.
>
> B-tree, gist, spgist, and gin are all tree based indexes where we scan and
> acquire predicate lock
> on only some and not all internal pages or leaf pages. So, here we have
> scope to reduce false positives.
> In BRIN index, each tuple stores summarizing values in the consecutive
> group of heap pages.
> So initially I thought we could implement tuple level predicate locking in
> BRIN. But, here we scan
> the whole index which forces us to acquire predicate lock on all tuples.
> Acquiring predicate lock on all
> tuples will be no different than a relation level predicate lock.
>

Ah, right.  Makes sense.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] WIP Patch: Precalculate stable functions, infrastructure v1

2017-07-20 Thread Robert Haas
On Tue, Jul 18, 2017 at 9:16 AM, Marina Polyakova
 wrote:
> Here I have made the 5th version of the patches. I have added the
> precalculation of all primitive nodes that don't return set, are not
> volatile themselves and their arguments are constant or precalculated
> expressions too. There're regression tests for all of them and little notes
> in the documentation. Like for the previous patches it seems that there is
> no obvious performance degradation too on regular queries (according to
> pgbench).

pgbench probably isn't a very good test for this sort of thing - it
only issues very short-running queries where the cost of evaluating
expressions is a relatively small part of the total cost.  Even if
things get worse, I'm not sure if you'd see it.  I'm not sure exactly
how you could construct a test case that could be harmed by this patch
- I guess you'd want to initialize lots of CacheExprs but never make
use of the caching usefully?

It could also be useful to test things like TPC-H to see if you get an
improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-20 Thread Alik Khilazhev



> I think that developing a test would be much simpler with the improved tap 
> test infrastructure, so I would suggest to wait to know the result of the 
> corresponding patch.

Ok, I will wait then.

> Also, could you record the patch to CF 2017-09?
> https://commitfest.postgresql.org/14/ 

Yea, I will send it.

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] GSoC 2017: weekly progress reports (week 7)

2017-07-20 Thread Shubham Barai
Hi Robert,

I had detailed discussion about this with my mentor. Sorry, I didn't share
details on hackers list.

B-tree, gist, spgist, and gin are all tree based indexes where we scan and
acquire predicate lock
on only some and not all internal pages or leaf pages. So, here we have
scope to reduce false positives.
In BRIN index, each tuple stores summarizing values in the consecutive
group of heap pages.
So initially I thought we could implement tuple level predicate locking in
BRIN. But, here we scan
the whole index which forces us to acquire predicate lock on all tuples.
Acquiring predicate lock on all
tuples will be no different than a relation level predicate lock.


Regards,
Shubham




On 20 July 2017 at 21:18, Robert Haas  wrote:

> On Tue, Jul 18, 2017 at 10:36 AM, Shubham Barai 
> wrote:
>
>> During this week, I read documentation and source code of BRIN index to
>> understand its implementation.
>> But later I found out that it is unlikely to implement page level or
>> tuple level predicate locking in BRIN index.
>> In this week, I also fixed a small issue and updated my previous patch
>> for gin index. Currently, I am working on
>> sp-gist index.
>>
>
> It's a shame that nobody seems to be answering emails like this, but you
> also haven't provided much detail here - e.g. *why* is BRIN unlikely to
> work out well?  I think I know the answer, but it'd be good to spell out
> your thoughts on the subject, I think.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-20 Thread Fabien COELHO


Hello Alik,

>> About the maths: As already said, I'm not at ease with a random_zipfian
>> function which does not display a (good) zipfian distribution. At the
>> minimum the documentation should be clear about the approximations
>> implied depending on the parameter value.
>
> I add one more sentence to documentation to emphasize that degree of
> proximity depends on the parameter. And also I made restriction on the
> parameter, now it can be only in range (0; 1).

Hmmm. On second thought, maybe one or the other is enough, either restrict
the parameter to values where the approximation is good, or put out a
clear documentation about when the approximation is not very good, but it
may be still useful even if not precise.

So I would be in favor of expanding the documentation but not restricting
the parameter beyond avoiding value 1.0.

> [...]
> Now it prints warning message if array overflowed. To print message only
> one time, it uses global flag, which is available for all threads. And
> theoretically message can be printed more than one time. [...]
>
> So, should I spend time on solving this issue?

No. Just write a comment.

>> If the zipf cache is constant size, there is no point in using dynamic
>> allocation, just declare an array…
>
> Fixed. Does ZIPF_CACHE_SIZE = 15 is ok?

My guess is yes, till someone complains that it is not the case;-)

>> There should be non regression tests somehow. If the "improve pgbench
>> tap test infrastructure" get through, things should be added there.
>
> I will send tests later, as separate patch.

I think that developing a test would be much simpler with the improved
tap test infrastructure, so I would suggest to wait to know the result of
the corresponding patch.

Also, could you record the patch to CF 2017-09?
https://commitfest.postgresql.org/14/

--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Jeff Janes
On Thu, Jul 20, 2017 at 7:59 AM, Robert Haas  wrote:

> >
> > Initially I wanted to make BAS_BULKWRITE and BAS_VACUUM ring sizes
> > configurable, but after testing I don't see much gain from increasing
> > ring buffer above 16MB. So I propose just 1 line change.
>
> I think the question for this patch is "so, why didn't we do it this
> way originally?".
>
> It's no secret that making the ring buffer larger will improve
> performance -- in fact, not having a ring buffer at all would improve
> performance even more.  But it would also increase the likelihood that
> the background work of vacuum would impact the performance of
> foreground operations, which is already a pretty serious problem that
> we probably don't want to make worse.


But having a very fast sequence of fdatasync calls is not terribly friendly
to the performance of the foreground operations, either.

I think the reason we didn't do it this way originally is tied to the same
reason that autovacuum_vacuum_cost_delay = 20ms by default. If we want it
to be heavily throttled, there isn't much point in using a larger ring
buffer.  It is just wasted space. Maybe we could have it start out at
BAS_VACUUM's default size, then grow by one buffer every time it had to
issue a WAL sync, until it reached BAS_BULKWRITE's size where it would max
out.
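
Just to sketch what I mean (hypothetical names, sizes in 8kB pages, not an
actual patch):

/*
 * Rough sketch: start the vacuum ring at its current default size and
 * grow it by one buffer each time reusing a buffer forced a WAL flush,
 * capping at the bulk-write ring size.
 */
#define VACUUM_RING_START_BUFFERS   32      /* 256kB, current BAS_VACUUM */
#define BULKWRITE_RING_BUFFERS      2048    /* 16MB, current BAS_BULKWRITE */

static int vacuum_ring_nbuffers = VACUUM_RING_START_BUFFERS;

static void
vacuum_ring_note_wal_flush(void)
{
    /* called whenever reusing a ring buffer required flushing WAL */
    if (vacuum_ring_nbuffers < BULKWRITE_RING_BUFFERS)
        vacuum_ring_nbuffers++;
}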

Cheers,

Jeff


Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-07-20 Thread Alik Khilazhev
Hello Fabien,

I am attaching patch v4.

> On 19 Jul 2017, at 17:21, Fabien COELHO  wrote:
>
> About the maths: As already said, I'm not at ease with a random_zipfian
> function which does not display a (good) zipfian distribution. At the
> minimum the documentation should be clear about the approximations
> implied depending on the parameter value.

I add one more sentence to documentation to emphasize that degree of
proximity depends on the parameter. And also I made restriction on the
parameter, now it can be only in range (0; 1).

> In the litterature the theta parameter seems to be often called alpha
> or s (eg see https://en.wikipedia.org/wiki/Zipf%27s_law). I would
> suggest to stick to "s" instead of "theta"?

I have renamed it to "s".

> Functions zipfZeta(n, theta) does not really computes the zeta(n)
> function, so I think that a better name should be chosen. It seems to
> compute H_{n,theta}, the generalized harmonic number. Idem "thetan"
> field in struct.

Renamed zipfZeta to zipfGeneralizedHarmonic, zetan to harmonicn.

> The handling of cache overflow by randomly removing one entry looks like
> a strange idea. Rather remove the oldest entry?

Replaced with Least Recently Used replacement algorithm.

> ISTM that it should print a warning once if the cache array overflows
> as performance would drop heavily.

Now it prints warning message if array overflowed. To print message only
one time, it uses global flag, which is available for all threads. And
theoretically message can be printed more than one time. It could be
solved easily using pg_atomic_test_set_flag() from
src/include/port/atomics.h but it can not be used in pgbench because of
following lines of code there:

#ifdef FRONTEND
#error "atomics.h may not be included from frontend code"
#endif

Or it can be fixed by using mutexes from pthread, but I think code
become less readable and more complex in this case.

So, should I spend time on solving this issue?

> If the zipf cache is constant size, there is no point in using dynamic
> allocation, just declare an array…

Fixed. Does ZIPF_CACHE_SIZE = 15 is ok?

> There should be non regression tests somehow. If the "improve pgbench
> tap test infrastructure" get through, things should be added there.

I will send tests later, as separate patch.


pgbench-zipf-04v.patch
Description: Binary data

—
Thanks and Regards,
Alik Khilazhev
Postgres Professional:
http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Cache lookup errors with functions manipulation object addresses

2017-07-20 Thread Alvaro Herrera
Michael Paquier wrote:
> On Thu, Jul 20, 2017 at 4:04 PM, Alvaro Herrera
>  wrote:
> > I think the addition of checks everywhere for NULL return is worse.
> > Let's add a missing_ok flag instead, so that most callers can just trust
> > that they get a non null value if they don't want to deal with that
> > case.
> 
> If you want to minimize the diffs or such a patch, we could as well
> have an extended version of those APIs. I don't think that for the
> addition of one argument like a missing_ok that's the way to go, but
> some people may like that to make this patch less intrusive.

I think minimizing API churn is worthwhile in some cases but not all.
These functions seem fringe enough that not having an API-compatible
version is unnecessary.  But that's just my opinion.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Claudio Freire
On Thu, Jul 20, 2017 at 1:08 PM, Claudio Freire  wrote:
> So, the 64x increase may be justifiable in absolute terms: it's not
> unlikely that a 16MB buffer will be evicted from the OS cache before
> vacuum is done with it, even in heavily throttled vacuums.

Sorry, that should read "It's not *likely* that a 16MB buffer will be evicted"


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Sokolov Yura

On 2017-07-20 19:04, Tom Lane wrote:
> Claudio Freire  writes:
>> On Thu, Jul 20, 2017 at 11:59 AM, Robert Haas  wrote:
>>> I think the question for this patch is "so, why didn't we do it this
>>> way originally?".
>>>
>>> It's no secret that making the ring buffer larger will improve
>>> performance -- in fact, not having a ring buffer at all would improve
>>> performance even more.  But it would also increase the likelihood that
>>> the background work of vacuum would impact the performance of
>>> foreground operations, which is already a pretty serious problem that
>>> we probably don't want to make worse.  I'm not certain what the right
>>> decision is here, but I think that a careful analysis of possible
>>> downsides is needed.
>>
>> IIRC, originally, the default shared_buffers settings was tiny.
>
> At the time we set the ring buffer size to 256K, the maximum
> shared_buffers that initdb would configure was 32MB; and you often didn't
> get that much due to SHMMAX.  Now of course it's 128MB, and you'll pretty
> much always get that.  So there's certainly room to argue that it's time
> to increase vacuum's ring buffer size, but that line of argument doesn't
> justify more than ~10X increase at most.
>
> Like Robert, I'm afraid of changing this number in a vacuum (ahem).
> If you've got the default number of autovacuum workers going (3), you'd
> have them thrashing a total of 3/8ths of shared memory by default, which
> seems like a lot.  We do need to look at the impact on foreground
> processing, and not just at the speed of vacuum itself.
>
> One idea for addressing this would be to raise the max values in the
> switch, but tighten the fraction-of-shared-buffers limit just below.
> I wouldn't have any objection to a 16MB ring buffer for vacuum when
> it is coming out of a 1GB arena ... it just seems like a rather large
> fraction of 128MB to give to a background process, especially to each
> of several background processes.
>
> Maybe the fraction-of-shared-buffers shouldn't be one size fits all,
> but a different limit for each case?
>
> regards, tom lane

It could be 'min(shared_buffers/8/autovacuum_workers, 16MB)'.
It is quite rare that people leave shared_buffers at 128MB, and those
people don't really care about other settings, so 5MB will be enough
for their autovacuum's ring buffer.

People who care about tuning their postgresql will increase their
shared_buffers, and autovacuum will have its 16MB ring buffer.
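
To be concrete, a sketch of that sizing rule (a hypothetical helper,
sizes in 8kB pages):

static int
autovacuum_ring_pages(int shared_buffers_pages, int autovacuum_workers)
{
    int cap   = (16 * 1024 * 1024) / (8 * 1024);    /* 16MB = 2048 pages */
    int share = shared_buffers_pages / 8 / autovacuum_workers;

    return (share < cap) ? share : cap;
}

With the default shared_buffers of 128MB (16384 pages) and 3 workers that
gives 682 pages per worker, i.e. the roughly 5MB mentioned above.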

With regards,
--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Sokolov Yura

On 2017-07-20 17:59, Robert Haas wrote:
> On Tue, Jul 18, 2017 at 6:09 AM, Sokolov Yura
>  wrote:
>> I investigated autovacuum performance, and found that it suffers a lot
>> from small ring buffer. It suffers in a same way bulk writer suffered
>> before Tom Lane's commit 6382448cf96:
>>
>>> Tom Lane   2009-06-23 00:04:28
>>> For bulk write operations (eg COPY IN), use a ring buffer of 16MB
>>> instead of the 256KB limit originally enforced by a patch committed
>>> 2008-11-06. Per recent test results, the smaller size resulted in an
>>> undesirable decrease in bulk data loading speed, due to COPY
>>> processing frequently getting blocked for WAL flushing. This area
>>> might need more tweaking later, but this setting seems to be good
>>> enough for 8.4.
>>
>> It is especially noticable when database doesn't fit in shared_buffers
>> but fit into OS file cache, and data is intensively updated (ie OLTP
>> load). In this scenario autovacuum with current 256kB (32 pages) ring
>> buffer lasts 3-10 times longer than with increased to 16MB ring buffer.
>>
>> I've tested with synthetic load with 256MB or 1GB shared buffers and
>> 2-6GB (with indices) tables, with different load factor and with/without
>> secondary indices on updated columns. Table were randomly updated with
>> hot and non-hot updates. Times before/after buffer increase (depending
>> on load) were 7500sec/1200sec, 75000sec/11500sec. So benefit is
>> consistently reproducible.
>>
>> I didn't tested cases when database doesn't fit OS file cache. Probably
>> in this case benefit will be smaller cause more time will be spent in
>> disk read.
>> I didn't test intensively OLAP load. I've seen once that increased
>> buffer slows a bit scanning almost immutable huge table, perhaps cause
>> of decreased CPU cache locality. But given such scan is already fast,
>> and autovacuum of "almost immutable table" runs rarely, I don't think
>> it is very important.
>>
>> Initially I wanted to make BAS_BULKWRITE and BAS_VACUUM ring sizes
>> configurable, but after testing I don't see much gain from increasing
>> ring buffer above 16MB. So I propose just 1 line change.
>
> I think the question for this patch is "so, why didn't we do it this
> way originally?".
>
> It's no secret that making the ring buffer larger will improve
> performance -- in fact, not having a ring buffer at all would improve
> performance even more.  But it would also increase the likelihood that
> the background work of vacuum would impact the performance of
> foreground operations, which is already a pretty serious problem that
> we probably don't want to make worse.  I'm not certain what the right
> decision is here, but I think that a careful analysis of possible
> downsides is needed.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Initially, the ring buffer was introduced for sequential scans.
It was added for vacuum "for a company", and before that vacuum used
just 1 page, so giving it 32 pages was a huge improvement:

d526575f893c1a4e05ebd Tom Lane   2007-05-31 00:12:03
"Make large sequential scans and VACUUMs work in a limited-size "ring" of
 buffers, rather than blowing out the whole shared-buffer arena.  Aside from
 avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
 to cause a WAL flush for every page it modified, because we had it hacked to
 use only a single buffer."

Later a ring buffer was added for the bulk writer, but with the same 32 pages:

85e2cedf985bfecaf43a18ca Tom Lane   2008-11-06 23:51:15
" Improve bulk-insert performance by keeping the current target buffer pinned
  (but not locked, as that would risk deadlocks).  Also, make it work in a small
  ring of buffers to avoid having bulk inserts trash the whole buffer arena.

  Robert Haas, after an idea of Simon Riggs'."

And finally, after some real world usage, the buffer for the bulk writer
was increased:

6382448cf96a9b88 Tom Lane   2009-06-23 00:04:28
" For bulk write operations (eg COPY IN), use a ring buffer of 16MB instead
  of the 256KB limit originally enforced by a patch committed 2008-11-06.
  Per recent test results, the smaller size resulted in an undesirable decrease
  in bulk data loading speed, due to COPY processing frequently getting blocked
  for WAL flushing.  This area might need more tweaking later, but this setting
  seems to be good enough for 8.4."

So, from my point of view, no one just evaluate performance of increased
ring buffer for vacuum.

It was discussed a year ago:
https://www.postgresql.org/message-id/flat/CA%2BTgmobmP%3DKE-z5f7-CegXMFGRbV%3DhC%2B%3DFxb2mbhpfD-ZD%3D-bA%40mail.gmail.com#CA+TgmobmP=KE-z5f7-CegXMFGRbV=hC+=Fxb2mbhpfD-ZD=-b...@mail.gmail.com

There was your wrong assumption, Robert:

> But all that does is force the backend to write to the operating
> system, which is where the real buffering happens.

But in fact, vacuum process performs FSYNC! It happens, cause vacuum
evicts dirty pages from its ring buffer. And to evict dirty page, it
has to be sure WAL record about its modification is FSYNC-ed to WAL.
Because ring buffer is damn small, vacuum almost always have to perform
FSYNC by itself and have to do it very frequently (cause ring is damn
small).

With greater ring buffer, there is greater chance that fsync were
performed by wal_writer process, or other backend. And even if
vacuum does fsync by itself, it syncs records about more modified
pages from its ring, so evicting next page is free.

Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

2017-07-20 Thread Jeff Janes
On Thu, Jul 20, 2017 at 6:28 AM, Stephen Frost  wrote:

> Greetings,
>
> * Sokolov Yura (y.soko...@postgrespro.ru) wrote:
> > I wrote two days ago about vacuum ring buffer:
> > https://www.postgresql.org/message-id/8737e9bddb82501da1134f021bf492
> 9a%40postgrespro.ru
> >
> > Increasing Vacuum's ring buffer to size of Bulk Writer's one reduces
> > autovacuum time in 3-10 times.
> > (for both patched and unpatched version I used single non-default
> > setting
> > 'autovacuum_cost_delay=2ms').
> >
> > This is single line change, and it improves things a lot.
>
> Right- when the database fits in the OS cache but not in shared_buffers.
>


On a system with a slow fsync, increasing the ring buffer helps a lot even
if database doesn't fit in the OS cache. When the next buffer allocation
runs into a dirtied buffer in the ring, it needs to sync the WAL up through
that buffer's LSN before it can write it out and reuse it.  With a small
ring, this means a lot of WAL flushing needs to be done.
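
Roughly, when vacuum goes to reuse a dirty buffer from its ring, the
sequence looks like this (simplified pseudo-C with made-up helper names
around the real XLogFlush call, not the actual bufmgr code):

/* simplified sketch of reusing a dirty buffer from the strategy ring */
if (buffer_is_dirty(buf))
{
    XLogFlush(page_lsn(buf));       /* may force an fsync of WAL right here */
    write_page_to_os(buf);          /* only then can the page be written out */
}
reuse_buffer_for_next_block(buf);   /* ring buffer is now free again */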


> I do agree that's a useful improvement to make based on your testing.
>
> It's not clear off-hand how much that would improve this case, as
> the database size appears to pretty quickly get beyond the OS memory
> size (and only in the first test is the DB starting size less than
> system memory to begin with).
>


Also, this system probably has a pretty fast fdatasync, considering it is
SSD.

Cheers,

Jeff


Re: [HACKERS] pg_upgrade failed if view contain natural left join condition

2017-07-20 Thread Tom Lane
"David G. Johnston"  writes:
> Per the docs:
> "If there are no common column names, NATURAL behaves like CROSS JOIN."

> I'm being a bit pedantic here but since NATURAL is a replacement for
> "ON/USING" it would seem more consistent to describe it, when no matching
> columns are found, as "behaves like specifying ON TRUE" instead.

Yeah, the analogy to CROSS JOIN falls down if it's an outer join.
I'll go fix that.

> I find it a bit strange, though not surprising, that it doesn't devolve to
> "ON FALSE".

No, it's normal that an AND of no conditions degenerates to TRUE.
It's like omitting a WHERE clause.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Claudio Freire
On Thu, Jul 20, 2017 at 12:51 PM, Robert Haas  wrote:
> On Thu, Jul 20, 2017 at 11:09 AM, Claudio Freire  
> wrote:
>>> It's no secret that making the ring buffer larger will improve
>>> performance -- in fact, not having a ring buffer at all would improve
>>> performance even more.  But it would also increase the likelihood that
>>> the background work of vacuum would impact the performance of
>>> foreground operations, which is already a pretty serious problem that
>>> we probably don't want to make worse.  I'm not certain what the right
>>> decision is here, but I think that a careful analysis of possible
>>> downsides is needed.
>>
>> IIRC, originally, the default shared_buffers settings was tiny.
>
> It is true that we increased the default shared_buffers value from
> 32MB to 128MB in f358428280953643313ee7756e0a8b8ccfde7660, but it's
> also true ring buffers are capped at 1/8th of shared_buffers
> regardless of anything else, so I don't think that's the explanation
> here.  Even if that weren't the case, how would a 4x increase in the
> default size of shared_buffers (which is probably the most-commonly
> changed GUC of any that PostgreSQL has) justify a 64x increase in the
> size of the ring buffer?

I'm theorizing here, because I've not been involved in any of those
decisions. But I have been stracing and checking on vacuum quite
frequently lately, so my 2 cents:

The 4x increase in shared_buffers acknowledges increases in available
host memory over the years. It's not just about how much of
shared_buffers is dedicated to the ring buffer, but also whether we
can reasonably expect the whole ring to remain in the OS cache while
it's getting dirtied.

Vacuum will almost always dirty pages once and never again, and
flushing dirty pages back to the OS cache ASAP helps avoid a
read-modify-write cycle if the page didn't leave the OS cache. That's
more likely to happen with smaller rings than with bigger rings. But
as memory increases, the ring can be made bigger without fear of it
falling from the OS cache prematurely.

So, the 64x increase may be justifiable in absolute terms: it's not
unlikely that a 16MB buffer will be evicted from the OS cache before
vacuum is done with it, even in heavily throttled vacuums. Memory
pressure on the host would have to be insane to cause that, in modern
systems with GBs of RAM. That might not have been true in the early
days.

Now, whether autovacuum would suck up a big portion of the
shared_buffers or not, is another matter. Perhaps the ring buffer
could tune itself to whatever limit seems comfortable in that regard,
the way it does with other GUCs (like cost_limit): divide it among the
number of workers?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Tom Lane
Claudio Freire  writes:
> On Thu, Jul 20, 2017 at 11:59 AM, Robert Haas  wrote:
>> I think the question for this patch is "so, why didn't we do it this
>> way originally?".
>> 
>> It's no secret that making the ring buffer larger will improve
>> performance -- in fact, not having a ring buffer at all would improve
>> performance even more.  But it would also increase the likelihood that
>> the background work of vacuum would impact the performance of
>> foreground operations, which is already a pretty serious problem that
>> we probably don't want to make worse.  I'm not certain what the right
>> decision is here, but I think that a careful analysis of possible
>> downsides is needed.

> IIRC, originally, the default shared_buffers settings was tiny.

At the time we set the ring buffer size to 256K, the maximum
shared_buffers that initdb would configure was 32MB; and you often didn't
get that much due to SHMMAX.  Now of course it's 128MB, and you'll pretty
much always get that.  So there's certainly room to argue that it's time
to increase vacuum's ring buffer size, but that line of argument doesn't
justify more than ~10X increase at most.

Like Robert, I'm afraid of changing this number in a vacuum (ahem).
If you've got the default number of autovacuum workers going (3), you'd
have them thrashing a total of 3/8ths of shared memory by default, which
seems like a lot.  We do need to look at the impact on foreground
processing, and not just at the speed of vacuum itself.

One idea for addressing this would be to raise the max values in the
switch, but tighten the fraction-of-shared-buffers limit just below.
I wouldn't have any objection to a 16MB ring buffer for vacuum when
it is coming out of a 1GB arena ... it just seems like a rather large
fraction of 128MB to give to a background process, especially to each
of several background processes.

Maybe the fraction-of-shared-buffers shouldn't be one size fits all,
but a different limit for each case?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade failed if view contain natural left join condition

2017-07-20 Thread David G. Johnston
On Thu, Jul 20, 2017 at 6:53 AM, Tom Lane  wrote:

> tushar  writes:
> > postgres=# create table t(n int);
> > CREATE TABLE
> > postgres=# create table t1(a int);
> > CREATE TABLE
> > postgres=# create view ttt1 as SELECT e.n FROM t e NATURAL LEFT JOIN t1
> d;
> > CREATE VIEW
>
> You realize of course that that's a pretty useless join definition.
> Still, yes, we do need to reverse-list the view with correct syntax.
> Probably t LEFT JOIN t1 ON TRUE would do it.
>

Per the docs:

"If there are no common column names, NATURAL behaves like CROSS JOIN."

I'm being a bit pedantic here but since NATURAL is a replacement for
"ON/USING" it would seem more consistent to describe it, when no matching
columns are found, as "behaves like specifying ON TRUE" instead.   Maybe
"behaves like specifying ON TRUE, causing a CROSS JOIN to occur instead."

I find it a bit strange, though not surprising, that it doesn't devolve to
"ON FALSE".

David J.


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Robert Haas
On Thu, Jul 20, 2017 at 11:09 AM, Claudio Freire  wrote:
>> It's no secret that making the ring buffer larger will improve
>> performance -- in fact, not having a ring buffer at all would improve
>> performance even more.  But it would also increase the likelihood that
>> the background work of vacuum would impact the performance of
>> foreground operations, which is already a pretty serious problem that
>> we probably don't want to make worse.  I'm not certain what the right
>> decision is here, but I think that a careful analysis of possible
>> downsides is needed.
>
> IIRC, originally, the default shared_buffers settings was tiny.

It is true that we increased the default shared_buffers value from
32MB to 128MB in f358428280953643313ee7756e0a8b8ccfde7660, but it's
also true ring buffers are capped at 1/8th of shared_buffers
regardless of anything else, so I don't think that's the explanation
here.  Even if that weren't the case, how would a 4x increase in the
default size of shared_buffers (which is probably the most-commonly
changed GUC of any that PostgreSQL has) justify a 64x increase in the
size of the ring buffer?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC 2017: weekly progress reports (week 7)

2017-07-20 Thread Robert Haas
On Tue, Jul 18, 2017 at 10:36 AM, Shubham Barai 
wrote:

> During this week, I read documentation and source code of BRIN index to
> understand its implementation.
> But later I found out that it is unlikely to implement page level or tuple
> level predicate locking in BRIN index.
> In this week, I also fixed a small issue and updated my previous patch for
> gin index. Currently, I am working on
> sp-gist index.
>

It's a shame that nobody seems to be answering emails like this, but you
also haven't provided much detail here - e.g. *why* is BRIN unlikely to
work out well?  I think I know the answer, but it'd be good to spell out
your thoughts on the subject, I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] JSONB - JSONB operator feature request

2017-07-20 Thread Dmitry Dolgov
> On 20 July 2017 at 16:24, David Fetter  wrote:
>
> If we can agree to a definition, we can make this go.  My vague
> memories from graph theory indicate that that "agree to a definition"
> part is the real problem to be solved.

I tried to embody some relevant thoughts in this thread [1], I think it
would be great if
you can take a look and suggest something more.

[1]:
https://www.postgresql.org/message-id/ca+q6zcu+gy1+dxqd09msz8zwqq+sppfs-6gykmynqgvqdfe...@mail.gmail.com


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Claudio Freire
On Thu, Jul 20, 2017 at 11:59 AM, Robert Haas  wrote:
> On Tue, Jul 18, 2017 at 6:09 AM, Sokolov Yura
>  wrote:
>> I investigated autovacuum performance, and found that it suffers a lot
>> from small ring buffer. It suffers in a same way bulk writer suffered
>> before Tom Lane's commit 6382448cf96:
>>
>>> Tom Lane   2009-06-23 00:04:28
>>> For bulk write operations (eg COPY IN), use a ring buffer of 16MB
>>> instead of the 256KB limit originally enforced by a patch committed
>>> 2008-11-06. Per recent test results, the smaller size resulted in an
>>> undesirable decrease in bulk data loading speed, due to COPY
>>> processing frequently getting blocked for WAL flushing. This area
>>> might need more tweaking later, but this setting seems to be good
>>> enough for 8.4.
>>
>> It is especially noticable when database doesn't fit in shared_buffers
>> but fit into OS file cache, and data is intensively updated (ie OLTP
>> load). In this scenario autovacuum with current 256kB (32 pages) ring
>> buffer lasts 3-10 times longer than with increased to 16MB ring buffer.
>>
>> I've tested with synthetic load with 256MB or 1GB shared buffers and
>> 2-6GB (with indices) tables, with different load factor and with/without
>> secondary indices on updated columns. Table were randomly updated with
>> hot and non-hot updates. Times before/after buffer increase (depending
>> on load) were 7500sec/1200sec, 75000sec/11500sec. So benefit is
>> consistently reproducible.
>>
>> I didn't tested cases when database doesn't fit OS file cache. Probably
>> in this case benefit will be smaller cause more time will be spent in
>> disk read.
>> I didn't test intensively OLAP load. I've seen once that increased
>> buffer slows a bit scanning almost immutable huge table, perhaps cause
>> of decreased CPU cache locality. But given such scan is already fast,
>> and autovacuum of "almost immutable table" runs rarely, I don't think
>> it is very important.
>>
>> Initially I wanted to make BAS_BULKWRITE and BAS_VACUUM ring sizes
>> configurable, but after testing I don't see much gain from increasing
>> ring buffer above 16MB. So I propose just 1 line change.
>
> I think the question for this patch is "so, why didn't we do it this
> way originally?".
>
> It's no secret that making the ring buffer larger will improve
> performance -- in fact, not having a ring buffer at all would improve
> performance even more.  But it would also increase the likelihood that
> the background work of vacuum would impact the performance of
> foreground operations, which is already a pretty serious problem that
> we probably don't want to make worse.  I'm not certain what the right
> decision is here, but I think that a careful analysis of possible
> downsides is needed.

IIRC, originally, the default shared_buffers settings was tiny.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Increase Vacuum ring buffer.

2017-07-20 Thread Robert Haas
On Tue, Jul 18, 2017 at 6:09 AM, Sokolov Yura
 wrote:
> I investigated autovacuum performance, and found that it suffers a lot
> from small ring buffer. It suffers in a same way bulk writer suffered
> before Tom Lane's commit 6382448cf96:
>
>> Tom Lane   2009-06-23 00:04:28
>> For bulk write operations (eg COPY IN), use a ring buffer of 16MB
>> instead of the 256KB limit originally enforced by a patch committed
>> 2008-11-06. Per recent test results, the smaller size resulted in an
>> undesirable decrease in bulk data loading speed, due to COPY
>> processing frequently getting blocked for WAL flushing. This area
>> might need more tweaking later, but this setting seems to be good
>> enough for 8.4.
>
> It is especially noticable when database doesn't fit in shared_buffers
> but fit into OS file cache, and data is intensively updated (ie OLTP
> load). In this scenario autovacuum with current 256kB (32 pages) ring
> buffer lasts 3-10 times longer than with increased to 16MB ring buffer.
>
> I've tested with synthetic load with 256MB or 1GB shared buffers and
> 2-6GB (with indices) tables, with different load factor and with/without
> secondary indices on updated columns. Table were randomly updated with
> hot and non-hot updates. Times before/after buffer increase (depending
> on load) were 7500sec/1200sec, 75000sec/11500sec. So benefit is
> consistently reproducible.
>
> I didn't tested cases when database doesn't fit OS file cache. Probably
> in this case benefit will be smaller cause more time will be spent in
> disk read.
> I didn't test intensively OLAP load. I've seen once that increased
> buffer slows a bit scanning almost immutable huge table, perhaps cause
> of decreased CPU cache locality. But given such scan is already fast,
> and autovacuum of "almost immutable table" runs rarely, I don't think
> it is very important.
>
> Initially I wanted to make BAS_BULKWRITE and BAS_VACUUM ring sizes
> configurable, but after testing I don't see much gain from increasing
> ring buffer above 16MB. So I propose just 1 line change.

I think the question for this patch is "so, why didn't we do it this
way originally?".

It's no secret that making the ring buffer larger will improve
performance -- in fact, not having a ring buffer at all would improve
performance even more.  But it would also increase the likelihood that
the background work of vacuum would impact the performance of
foreground operations, which is already a pretty serious problem that
we probably don't want to make worse.  I'm not certain what the right
decision is here, but I think that a careful analysis of possible
downsides is needed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

2017-07-20 Thread Claudio Freire
On Thu, Jul 20, 2017 at 12:08 AM, Peter Geoghegan  wrote:
>> The traditional
>> wisdom about btrees, for instance, is that no matter how full you pack
>> them to start with, the steady state is going to involve something like
>> 1/3rd free space.  You can call that bloat if you want, but it's not
>> likely that you'll be able to reduce the number significantly without
>> paying exorbitant costs.
>
> For the purposes of this discussion, I'm mostly talking about
> duplicates within a page on a unique index. If the keyspace owned by
> an int4 unique index page only covers 20 distinct values, it will only
> ever cover 20 distinct values, now and forever, despite the fact that
> there is room for about 400 (a 90/10 split leaves you with 366 items +
> 1 high key).

Microvacuum could also help.

If during a scan you find pointers that point to dead (in vacuum terms)
tuples, the pointers in the index could be deleted. That could be done
during insert into unique indexes before a split, to avoid the split.

Chances are, if there are duplicates, at least a few of them will be dead.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Definitional questions for pg_sequences view

2017-07-20 Thread Tom Lane
What exactly is the point of the new pg_sequences view?

It seems like it's intended to ease conversion of applications that
formerly did "select * from sequencename", but if so, there are some
fairly annoying discrepancies.  The old way got you these columns:

regression=# \d s1
 Sequence "public.s1"
Column |  Type   |Value
---+-+-
 sequence_name | name| s1
 last_value| bigint  | 1
 start_value   | bigint  | 1
 increment_by  | bigint  | 1
 max_value | bigint  | 9223372036854775807
 min_value | bigint  | 1
 cache_value   | bigint  | 1
 log_cnt   | bigint  | 0
 is_cycled | boolean | f
 is_called | boolean | f

but now we offer

regression=# \d pg_sequences
  View "pg_catalog.pg_sequences"
Column |  Type   | Collation | Nullable | Default 
---+-+---+--+-
 schemaname| name|   |  | 
 sequencename  | name|   |  | 
 sequenceowner | name|   |  | 
 data_type | regtype |   |  | 
 start_value   | bigint  |   |  | 
 min_value | bigint  |   |  | 
 max_value | bigint  |   |  | 
 increment_by  | bigint  |   |  | 
 cycle | boolean |   |  | 
 cache_size| bigint  |   |  | 
 last_value| bigint  |   |  | 

Why aren't sequencename, cache_size, and cycle spelled consistently
with past practice?  And is there a really good reason to order the
columns randomly differently from before?

The big problem, though, is that there's no convenient way to use
this view in a schema-safe manner.  If you try to translate
select * from my_seq;
into
select * from pg_sequences where sequencename = 'my_seq';
then you're going to get burnt if there's more than one my_seq
in different schemas.  There's no easy way to get your search
path incorporated into the result.  Maybe people will always know
how to constrain the schemaname too, but I wouldn't count on it.

This could be fixed if it were possible to translate to
select * from pg_sequences where seqoid = 'my_seq'::regclass;
but the view isn't exposing the sequence OID.  Should it?

As things stand, it's actually considerably easier and safer to
use the pg_sequence catalog directly, because then you *can* do
select * from pg_sequence where seqrelid = 'my_seq'::regclass;
and you only have to deal with the different-from-before column names.
Which pretty much begs the question why we bothered to provide the
view.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSONB - JSONB operator feature request

2017-07-20 Thread David Fetter
On Wed, Jul 19, 2017 at 06:17:35PM -0500, Merlin Moncure wrote:
> On Tue, Jul 18, 2017 at 12:49 PM, David Fetter  wrote:
> > On Tue, Jul 18, 2017 at 01:36:32PM +0200, david.tu...@linuxbox.cz wrote:
> >> Hi,
> >>
> >> some users and me used hstore - hstore for example storing only changed
> >> rows in trigger like:
> >>
> >> hsore(NEW) - hstore(OLD)
> >>
> >> There isn't same operator/function in JSON/JSONB. We can only remove keys
> >> from JSONB, but not equal key-value pairs. Is there any chance to have
> >> same feature with JSON/JSONB in postgres core?
> >
> > Here's one slightly modified from 
> > http://coussej.github.io/2016/05/24/A-Minus-Operator-For-PostgreSQLs-JSONB/
> >
> > CREATE OR REPLACE FUNCTION jsonb_minus ( arg1 jsonb, arg2 jsonb )
> > RETURNS jsonb
> > LANGUAGE sql
> > AS $$
> > SELECT
> > COALESCE(json_object_agg(
> > key,
> > CASE
> > -- if the value is an object and the value of the second 
> > argument is
> > -- not null, we do a recursion
> > WHEN jsonb_typeof(value) = 'object' AND arg2 -> key IS NOT NULL
> > THEN jsonb_minus(value, arg2 -> key)
> > -- for all the other types, we just return the value
> > ELSE value
> > END
> > ), '{}')::jsonb
> > FROM
> > jsonb_each(arg1)
> > WHERE
> > arg1 -> key IS DISTINCT FROM arg2 -> key
> > $$;
> >
> > CREATE OPERATOR - (
> > PROCEDURE = jsonb_minus,
> > LEFTARG   = jsonb,
> > RIGHTARG  = jsonb
> > );
> >
> > I suspect that there's a faster way to do the jsonb_minus function
> > internally.
> 
> yes, please!  I also sorely miss the hstore 'slice' function which is
> very similar.  The main remaining disadvantage with jsonb WRT to
> hstore is that you can't do simple retransformations that these
> operations allow for.  Too often you end up doing multiple '->'
> operations against the same object followed by a rebundling which is a
> real performance killer.

If we can agree to a definition, we can make this go.  My vague
memories from graph theory indicate that that "agree to a definition"
part is the real problem to be solved.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

2017-07-20 Thread Joshua D. Drake

On 07/20/2017 06:28 AM, Stephen Frost wrote:


It's not clear off-hand how much that would improve this case, as
the database size appears to pretty quickly get beyond the OS memory
size (and only in the first test is the DB starting size less than
system memory to begin with).


FYI,

I will be posting new numbers in a few hours. I had been planning on 
posting this last night but... KDE.


JD


--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
* Unless otherwise stated, opinions are my own.   *


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache lookup errors with functions manipulation object addresses

2017-07-20 Thread Michael Paquier
On Thu, Jul 20, 2017 at 4:04 PM, Alvaro Herrera
 wrote:
> I think the addition of checks everywhere for NULL return is worse.
> Let's add a missing_ok flag instead, so that most callers can just trust
> that they get a non null value if they don't want to deal with that
> case.

If you want to minimize the diffs or such a patch, we could as well
have an extended version of those APIs. I don't think that for the
addition of one argument like a missing_ok that's the way to go, but
some people may like that to make this patch less intrusive.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache lookup errors with functions manipulation object addresses

2017-07-20 Thread Alvaro Herrera
Michael Paquier wrote:

> - getObjectDescription and getObjectIdentity are called in quite a
> couple of places. We could have those have a kind of missing_ok, but
> as the status is just for adding cache lookup errors I have kept the
> interface simple as this keeps the code in objectaddress.c more simple
> as well. getObjectIdentity is used mainly in sepgsql, which I have not
> compiled yet so I may have missed something :) getObjectDescription is
> used in more places in the backend code, but I am not much into
> complicating the objaddr API with this patch more.

I think the addition of checks everywhere for NULL return is worse.
Let's add a missing_ok flag instead, so that most callers can just trust
that they get a non null value if they don't want to deal with that
case.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache lookup errors with functions manipulation object addresses

2017-07-20 Thread Michael Paquier
On Wed, Jul 19, 2017 at 7:29 PM, Robert Haas  wrote:
> On Wed, Jul 19, 2017 at 2:25 AM, Michael Paquier
>  wrote:
>> Would we want to improve the error handling of such objects?
>
> +1 for such an improvement.

Attached is a patch for all that. Here are some notes:
- format_type_be and friends use allow_invalid. I have added a flag to
control the NULL-ness as many code paths rely on existing APIs, and
introduced an _extended version of this API. I would argue for the
removal of allow_invalid to give more flexibility to callers, but this
impacts extensions :(
- A similar thing is needed for format_operator().
- We could really add a missing_ok to get_attname, but that does not
seem worth the refactoring with modules potentially calling it..
- GetForeignDataWrapper is extended with a missing_ok, unfortunately
not saving one cache lookup in GetForeignDataWrapperByName.
- Same remark as the previous one for GetForeignServer.
- get_publication_name and get_subscription_name gain a missing_ok.
- getObjectDescription and getObjectIdentity are called in quite a
couple of places. We could have those have a kind of missing_ok, but
as the status is just for adding cache lookup errors I have kept the
interface simple as this keeps the code in objectaddress.c more simple
as well. getObjectIdentity is used mainly in sepgsql, which I have not
compiled yet so I may have missed something :) getObjectDescription is
used in more places in the backend code, but I am not much into
complicating the objaddr API with this patch more.
- I have added tests for all the OCLASS objects, for a total of more or
less 120 cache lookup errors that a user can face.
- Some docs are present as well, but I think that they are a bit
incomplete. I'll review them a bit later.
- The patch is large, 800 lines are used for the tests which is very mechanical:
 32 files changed, 1721 insertions(+), 452 deletions(-)

Thanks,
-- 
Michael


objaddr_nullness_v1.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade failed if view contain natural left join condition

2017-07-20 Thread Tom Lane
tushar  writes:
> postgres=# create table t(n int);
> CREATE TABLE
> postgres=# create table t1(a int);
> CREATE TABLE
> postgres=# create view ttt1 as SELECT e.n FROM t e NATURAL LEFT JOIN t1 d;
> CREATE VIEW

You realize of course that that's a pretty useless join definition.
Still, yes, we do need to reverse-list the view with correct syntax.
Probably t LEFT JOIN t1 ON TRUE would do it.
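
For illustration, here is a minimal sketch (untested, reusing the table
names from the report) of a deparse-safe equivalent of that view
definition:

    -- A NATURAL LEFT JOIN with no common columns degenerates into an
    -- unconditional left join, which should reverse-list cleanly as ON TRUE.
    CREATE TABLE t (n int);
    CREATE TABLE t1 (a int);
    CREATE VIEW ttt1 AS
        SELECT e.n
          FROM t e LEFT JOIN t1 d ON TRUE;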

> I think -this issue should be there in the older branches as well but 
> not checked that.

[experiments]  Seems to be wrong back to 9.3.  Although I have a feeling
this might be a mistake in a back-patched bug fix, so that it'd depend
on which 9.3.x you looked at.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade failed if view is based on sequence

2017-07-20 Thread Tom Lane
Thom Brown  writes:
> On 20 July 2017 at 13:23, tushar  wrote:
>> postgres=# create sequence seq_9166 start 1 increment 1;
>> CREATE SEQUENCE
>> postgres=# create or replace view v3_9166 as select * from seq_9166;
>> CREATE VIEW

> This is because sequence_name, start_value, increment_by, max_value,
> min_value, cache_value and is_cycled are no longer output when
> selecting from sequences.

Yes.  This will not be "fixed"; you'll have to adjust the view before
you can update it to v10.  (If you want those values, you should now
get them out of the pg_sequence catalog.)
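
For example, something along these lines (untested sketch, using the
seq_9166 sequence from the report) retrieves the old per-sequence settings
in v10:

    -- The sequence parameters now live in the pg_sequence catalog.
    SELECT seqstart, seqincrement, seqmax, seqmin, seqcache, seqcycle
      FROM pg_sequence
     WHERE seqrelid = 'seq_9166'::regclass;

    -- last_value (plus log_cnt and is_called) can still be read from the
    -- sequence relation itself.
    SELECT last_value FROM seq_9166;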

This should have been called out as a significant incompatibility
in the v10 release notes, but I see that it's not listed in the
right section.  Will fix that ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade failed if view is based on sequence

2017-07-20 Thread Thom Brown
On 20 July 2017 at 14:04, Thom Brown  wrote:
> On 20 July 2017 at 13:23, tushar  wrote:
>> Steps to reproduce -
>>
>> v9.6
>>
>> postgres=# create sequence seq_9166 start 1 increment 1;
>> CREATE SEQUENCE
>> postgres=# create or replace view v3_9166 as select * from seq_9166;
>> CREATE VIEW
>>
>> v10
>>
>> run pg_upgrade , going to fail with this error
>>
>>
>> command: "./pg_restore" --host
>> /home/centos/pg10_14july/postgresql/edbpsql/bin --port 50432 --username edb
>> --exit-on-error --verbose --dbname 'dbname=postgres'
>> "pg_upgrade_dump_13269.custom" >> "pg_upgrade_dump_13269.log" 2>&1
>> pg_restore: connecting to database for restore
>> pg_restore: creating pg_largeobject "pg_largeobject"
>> pg_restore: creating pg_largeobject_metadata "pg_largeobject_metadata"
>> pg_restore: creating COMMENT "postgres"
>> pg_restore: creating SCHEMA "public"
>> pg_restore: creating COMMENT "SCHEMA "public""
>> pg_restore: creating TABLE "public.fb17136_tab1"
>> pg_restore: creating SEQUENCE "public.seq_9166"
>> pg_restore: creating VIEW "public.v3_9166"
>> pg_restore: [archiver (db)] Error while PROCESSING TOC:
>> pg_restore: [archiver (db)] Error from TOC entry 187; 1259 16392 VIEW
>> v3_9166 edb
>> pg_restore: [archiver (db)] could not execute query: ERROR:  column
>> seq_9166.sequence_name does not exist
>> LINE 14:  SELECT "seq_9166"."sequence_name",
>>  ^
>> Command was:
>> -- For binary upgrade, must preserve pg_type oid
>> SELECT
>> pg_catalog.binary_upgrade_set_next_pg_type_oid('16394'::pg_catalog.oid);
>>
>>
>> -- For binary upgrade, must preserve pg_type array oid
>> SELECT
>> pg_catalog.binary_upgrade_set_next_array_pg_type_oid('16393'::pg_catalog.oid);
>>
>>
>> -- For binary upgrade, must preserve pg_class oids
>> SELECT
>> pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('16392'::pg_catalog.oid);
>>
>> CREATE VIEW "v3_9166" AS
>>  SELECT "seq_9166"."sequence_name",
>> "seq_9166"."last_value",
>> "seq_9166"."start_value",
>> "seq_9166"."increment_by",
>> "seq_9166"."max_value",
>> "seq_9166"."min_value",
>> "seq_9166"."cache_value",
>> "seq_9166"."log_cnt",
>> "seq_9166"."is_cycled",
>> "seq_9166"."is_called"
>>FROM "seq_9166";
>
> This is because sequence_name, start_value, increment_by, max_value,
> min_value, cache_value and is_cycled are no longer output when
> selecting from sequences.  Commit
> 1753b1b027035029c2a2a1649065762fafbf63f3 didn't take into account
> upgrading sequences to 10.

Actually, I'm not sure we need to bother fixing this.  In the view
creation, * has to be expanded to whatever columns exist at the time
of creating the view, and since most of those columns no longer exist
in v10, there's no way to get the view ported over without rewriting
it.  Anything that depends on the output of those columns would be
broken anyway.
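
For example, a rewrite along these lines on the 9.6 side (untested sketch,
keeping only the columns that still exist in a v10 sequence relation) should
let pg_upgrade restore the view:

    -- Only last_value, log_cnt and is_called survive in v10, so list them
    -- explicitly instead of using *.  The column list changes, so the view
    -- has to be dropped and recreated rather than replaced.
    DROP VIEW v3_9166;
    CREATE VIEW v3_9166 AS
        SELECT last_value, log_cnt, is_called FROM seq_9166;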

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Incorrect comment of XLByteToSeg() and XLByteToPrevSeg()

2017-07-20 Thread Yugo Nagata
Hi,

I found a typo in the comment for XLByteToSeg() and XLByteToPrevSeg().
This says "Compute ID and segment from an XLogRecPtr", but these
macros compute only segment numbers. I think "Compute a segment number
from an XLogRecPtr" is correct.

The definitions of these macros were modified by the following commit,
but the comment was not.

commit dfda6ebaec6763090fb78b458a979b558c50b39b
Author: Heikki Linnakangas 
Date:   Sun Jun 24 18:06:38 2012 +0300

Regards,

-- 
Yugo Nagata 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index a661ec0..7453dcb 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -96,7 +96,7 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 		(dest) = (segno) * XLOG_SEG_SIZE + (offset)
 
 /*
- * Compute ID and segment from an XLogRecPtr.
+ * Compute a segment number from an XLogRecPtr.
  *
  * For XLByteToSeg, do the computation at face value.  For XLByteToPrevSeg,
  * a boundary byte is taken to be in the previous segment.  This is suitable

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] xlogfilename

2017-07-20 Thread Yugo Nagata
On Thu, 20 Jul 2017 11:02:25 +0200
Michael Paquier  wrote:

> On Thu, Jul 20, 2017 at 10:58 AM, DEV_OPS  wrote:
> > I think you may reference to function: pg_xlogfile_name   in
> > src/backend/access/transam/xlogfuncs.c,  it use XLogFileName  defined in
> > src/include/access/xlog_internal.h
> >
> > #define XLogFileName(fname, tli, logSegNo)  \
> > snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli,   \
> >  (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
> >  (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
> >
> >
> > hope it's helpful for you
> 
> The first 8 characters are the timeline number in hexadecimal format.
> The next 8 characters indicate a segment number, which gets
> incremented every 256 segments in hexa format. The last 8 characters
> indicate the current segment number in hexa format.

As far as I understand, XLOG is a logical big file of 256 * 16 MB,
and this is split into multiple physical files of 16MB which are called
WAL segments. The second 8 characters indicate the id of the logical
xlog file, and the last 8 characters indicate the sequential number of
the segment in this xlog.
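
For example (illustrative only), the three fields can be peeled off a
24-character segment file name like this:

    -- Split a WAL segment file name into timeline, logical xlog file id,
    -- and segment number within that xlog file (all hexadecimal).
    SELECT substring(f from 1 for 8)  AS timeline,
           substring(f from 9 for 8)  AS xlog_id,
           substring(f from 17 for 8) AS seg_in_xlog
      FROM (VALUES ('000000010000000000000001')) AS v(f);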

Regards,

> -- 
> Michael
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Yugo Nagata 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

2017-07-20 Thread Stephen Frost
Greetings,

* Sokolov Yura (y.soko...@postgrespro.ru) wrote:
> I wrote two days ago about vacuum ring buffer:
> https://www.postgresql.org/message-id/8737e9bddb82501da1134f021bf4929a%40postgrespro.ru
> 
> Increasing Vacuum's ring buffer to size of Bulk Writer's one reduces
> autovacuum time in 3-10 times.
> (for both patched and unpatched version I used single non-default
> setting
> 'autovacuum_cost_delay=2ms').
> 
> This is single line change, and it improves things a lot.

Right- when the database fits in the OS cache but not in shared_buffers.

I do agree that's a useful improvement to make based on your testing.

It's not clear off-hand how much that would improve this case, as
the database size appears to pretty quickly get beyond the OS memory
size (and only in the first test is the DB starting size less than
system memory to begin with).

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table heirarchy

2017-07-20 Thread Tom Lane
Greg Stark  writes:
> On 19 July 2017 at 00:26, Tom Lane  wrote:
>> It's probably a bit late in the v10 cycle to be taking any risks in
>> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
>> as soon as the v11 cycle opens, unless someone can show an example
>> of non-broken coding that requires it.  (And if so, there ought to
>> be a regression test incorporating that.)

> Would it be useful to keep in one of the memory checking assertion builds?

Why?  Code that expects to continue accessing a relcache entry's tupdesc
after closing the relcache entry is broken, independently of whether it's
in a debug build or not.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade failed if view is based on sequence

2017-07-20 Thread Thom Brown
On 20 July 2017 at 13:23, tushar  wrote:
> Steps to reproduce -
>
> v9.6
>
> postgres=# create sequence seq_9166 start 1 increment 1;
> CREATE SEQUENCE
> postgres=# create or replace view v3_9166 as select * from seq_9166;
> CREATE VIEW
>
> v10
>
> run pg_upgrade , going to fail with this error
>
>
> command: "./pg_restore" --host
> /home/centos/pg10_14july/postgresql/edbpsql/bin --port 50432 --username edb
> --exit-on-error --verbose --dbname 'dbname=postgres'
> "pg_upgrade_dump_13269.custom" >> "pg_upgrade_dump_13269.log" 2>&1
> pg_restore: connecting to database for restore
> pg_restore: creating pg_largeobject "pg_largeobject"
> pg_restore: creating pg_largeobject_metadata "pg_largeobject_metadata"
> pg_restore: creating COMMENT "postgres"
> pg_restore: creating SCHEMA "public"
> pg_restore: creating COMMENT "SCHEMA "public""
> pg_restore: creating TABLE "public.fb17136_tab1"
> pg_restore: creating SEQUENCE "public.seq_9166"
> pg_restore: creating VIEW "public.v3_9166"
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 187; 1259 16392 VIEW
> v3_9166 edb
> pg_restore: [archiver (db)] could not execute query: ERROR:  column
> seq_9166.sequence_name does not exist
> LINE 14:  SELECT "seq_9166"."sequence_name",
>  ^
> Command was:
> -- For binary upgrade, must preserve pg_type oid
> SELECT
> pg_catalog.binary_upgrade_set_next_pg_type_oid('16394'::pg_catalog.oid);
>
>
> -- For binary upgrade, must preserve pg_type array oid
> SELECT
> pg_catalog.binary_upgrade_set_next_array_pg_type_oid('16393'::pg_catalog.oid);
>
>
> -- For binary upgrade, must preserve pg_class oids
> SELECT
> pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('16392'::pg_catalog.oid);
>
> CREATE VIEW "v3_9166" AS
>  SELECT "seq_9166"."sequence_name",
> "seq_9166"."last_value",
> "seq_9166"."start_value",
> "seq_9166"."increment_by",
> "seq_9166"."max_value",
> "seq_9166"."min_value",
> "seq_9166"."cache_value",
> "seq_9166"."log_cnt",
> "seq_9166"."is_cycled",
> "seq_9166"."is_called"
>FROM "seq_9166";

This is because sequence_name, start_value, increment_by, max_value,
min_value, cache_value and is_cycled are no longer output when
selecting from sequences.  Commit
1753b1b027035029c2a2a1649065762fafbf63f3 didn't take into account
upgrading sequences to 10.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade failed if view contain natural left join condition

2017-07-20 Thread Thom Brown
On 20 July 2017 at 13:09, tushar  wrote:
> Steps to reproduce -
>
> v9.6
>
> postgres=# create table t(n int);
> CREATE TABLE
> postgres=# create table t1(a int);
> CREATE TABLE
> postgres=# create view ttt1 as SELECT e.n FROM t e NATURAL LEFT JOIN t1 d;
> CREATE VIEW
>
> v10 -
>
> run pg_upgrade -
>
> going to fail ,with this error -
>
> "
> pg_restore: creating TABLE "public.t"
> pg_restore: creating TABLE "public.t1"
> pg_restore: creating VIEW "public.ttt1"
> pg_restore: [archiver (db)] Error while PROCESSING TOC:
> pg_restore: [archiver (db)] Error from TOC entry 187; 1259 16390 VIEW ttt1
> edb
> pg_restore: [archiver (db)] could not execute query: ERROR:  syntax error at
> or near ")"
> LINE 16:  LEFT JOIN "t1" "d");
> ^
> Command was:
> -- For binary upgrade, must preserve pg_type oid
> SELECT
> pg_catalog.binary_upgrade_set_next_pg_type_oid('16392'::pg_catalog.oid);
>
>
> -- For binary upgrade, must preserve pg_type array oid
> SELECT
> pg_catalog.binary_upgrade_set_next_array_pg_type_oid('16391'::pg_catalog.oid);
>
>
> -- For binary upgrade, must preserve pg_class oids
> SELECT
> pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('16390'::pg_catalog.oid);
>
> CREATE VIEW "ttt1" AS
>  SELECT "e"."n"
>FROM ("t" "e"
>  LEFT JOIN "t1" "d");
>
> "
> I think -this issue should be there in the older branches as well but not
> checked that.

I get the same result on 9.2 and 10 in pg_dump output.

Thom


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] huge RAM use in multi-command ALTER of table heirarchy

2017-07-20 Thread Greg Stark
On 19 July 2017 at 00:26, Tom Lane  wrote:

> It's probably a bit late in the v10 cycle to be taking any risks in
> this area, but I'd vote for ripping out RememberToFreeTupleDescAtEOX
> as soon as the v11 cycle opens, unless someone can show an example
> of non-broken coding that requires it.  (And if so, there ought to
> be a regression test incorporating that.)

Would it be useful to keep in one of the memory checking assertion builds?

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] A hook for session start

2017-07-20 Thread Fabrízio de Royes Mello
On Thu, Jul 20, 2017 at 8:47 AM, Yugo Nagata  wrote:
>
> Hi,
>
> Currently, PostgreSQL doesn't have a hook triggered at session
> start. Although we already have ClientAuthentication_hook,
> this is triggered during authentication, so we can not
> access the database.
>
> If we have a hook triggered only once at session start, we may
> do something useful on the session for a certain database or user.
>
> For example, one of our clients wanted such a feature. He wanted
> to handle encryption for specific users, though I don't know
> the details.
>
> The attached patch (session_start_hook.patch) implements such
> hook.
>
> Another patch, session_start_sample.patch, is a very simple
> example of this hook that changes work_mem values for sessions
> of a specific database.
>
> I would appreciate hearing your opinion on this hook.
>

I'm not sure about your real needs, but isn't this material for
improving Event Triggers?

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


[HACKERS] pg_upgrade failed if view is based on sequence

2017-07-20 Thread tushar

Steps to reproduce -

v9.6

postgres=# create sequence seq_9166 start 1 increment 1;
CREATE SEQUENCE
postgres=# create or replace view v3_9166 as select * from seq_9166;
CREATE VIEW

v10

run pg_upgrade , going to fail with this error


command: "./pg_restore" --host 
/home/centos/pg10_14july/postgresql/edbpsql/bin --port 50432 --username 
edb --exit-on-error --verbose --dbname 'dbname=postgres' 
"pg_upgrade_dump_13269.custom" >> "pg_upgrade_dump_13269.log" 2>&1

pg_restore: connecting to database for restore
pg_restore: creating pg_largeobject "pg_largeobject"
pg_restore: creating pg_largeobject_metadata "pg_largeobject_metadata"
pg_restore: creating COMMENT "postgres"
pg_restore: creating SCHEMA "public"
pg_restore: creating COMMENT "SCHEMA "public""
pg_restore: creating TABLE "public.fb17136_tab1"
pg_restore: creating SEQUENCE "public.seq_9166"
pg_restore: creating VIEW "public.v3_9166"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 187; 1259 16392 VIEW 
v3_9166 edb
pg_restore: [archiver (db)] could not execute query: ERROR:  column 
seq_9166.sequence_name does not exist

LINE 14:  SELECT "seq_9166"."sequence_name",
 ^
Command was:
-- For binary upgrade, must preserve pg_type oid
SELECT 
pg_catalog.binary_upgrade_set_next_pg_type_oid('16394'::pg_catalog.oid);



-- For binary upgrade, must preserve pg_type array oid
SELECT 
pg_catalog.binary_upgrade_set_next_array_pg_type_oid('16393'::pg_catalog.oid);



-- For binary upgrade, must preserve pg_class oids
SELECT 
pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('16392'::pg_catalog.oid);


CREATE VIEW "v3_9166" AS
 SELECT "seq_9166"."sequence_name",
"seq_9166"."last_value",
"seq_9166"."start_value",
"seq_9166"."increment_by",
"seq_9166"."max_value",
"seq_9166"."min_value",
"seq_9166"."cache_value",
"seq_9166"."log_cnt",
"seq_9166"."is_cycled",
"seq_9166"."is_called"
   FROM "seq_9166";

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] <> join selectivity estimate question

2017-07-20 Thread Ashutosh Bapat
On Thu, Jul 20, 2017 at 5:30 PM, Thomas Munro
 wrote:
> On Thu, Jul 20, 2017 at 11:47 PM, Ashutosh Bapat
>  wrote:
>> On Thu, Jul 20, 2017 at 11:04 AM, Thomas Munro
>>  wrote:
>>> On Fri, Jun 2, 2017 at 4:16 AM, Tom Lane  wrote:
 I don't think it does really.  The thing about a <> semijoin is that it
 will succeed unless *every* join key value from the inner query is equal
 to the outer key value (or is null).  That's something we should consider
 to be of very low probability typically, so that the <> selectivity should
 be estimated as nearly 1.0.  If the regular equality selectivity
 approaches 1.0, or when there are expected to be very few rows out of the
 inner query, then maybe the <> estimate should start to drop off from 1.0,
 but it surely doesn't move linearly with the equality selectivity.
>>>
>>> Ok, here I go like a bull in a china shop: please find attached a
>>> draft patch.  Is this getting warmer?
>>>
>>> In the comment for JOIN_SEMI I mentioned a couple of refinements I
>>> thought of but my intuition was that we don't go for such sensitive
>>> and discontinuous treatment of stats; so I made the simplifying
>>> assumption that RHS always has more than 1 distinct value in it.
>>>
>>> Anti-join <> returns all the nulls from the LHS, and then it only
>>> returns other LHS rows if there is exactly one distinct non-null value
>>> in RHS and it happens to be that one.  But if we make the same
>>> assumption I described above, namely that there are always at least 2
>>> distinct values on the RHS, then the join selectivity is just
>>> nullfrac.
>>>
>>
>> The patch looks good to me.
>>
>> +   /*
>> +* For semi-joins, if there is more than one distinct key in the RHS
>> +* relation then every non-null LHS row must find a match since it 
>> can
>> +* only be equal to one of them.
>> The word "match" is confusing. Google's dictionary entry gives "be equal
>> to (something) in quality or strength." as its meaning. Maybe we want
>> to reword it as "... LHS row must find a joining row in RHS ..."?
>
> Thanks!  Yeah, here's a version with better comments.

Thanks. Your version is better than mine.

>
> Does anyone know how to test a situation where the join is reversed according 
> to
> get_join_variables, or "complicated cases where we can't tell for sure"?
>

explain select * from pg_class c right join pg_type t on (c.reltype =
t.oid); would end up with  *join_is_reversed = true; Is that what you
want? For a semi-join however I don't know how to induce that. AFAIU,
in a semi-join there is only one direction in which join can be
specified.

I didn't get the part about "complicated cases where we can't tell for sure".
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_upgrade failed if view contain natural left join condition

2017-07-20 Thread tushar

Steps to reproduce -

v9.6

postgres=# create table t(n int);
CREATE TABLE
postgres=# create table t1(a int);
CREATE TABLE
postgres=# create view ttt1 as SELECT e.n FROM t e NATURAL LEFT JOIN t1 d;
CREATE VIEW

v10 -

run pg_upgrade -

going to fail ,with this error -

"
pg_restore: creating TABLE "public.t"
pg_restore: creating TABLE "public.t1"
pg_restore: creating VIEW "public.ttt1"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 187; 1259 16390 VIEW 
ttt1 edb
pg_restore: [archiver (db)] could not execute query: ERROR:  syntax 
error at or near ")"

LINE 16:  LEFT JOIN "t1" "d");
^
Command was:
-- For binary upgrade, must preserve pg_type oid
SELECT 
pg_catalog.binary_upgrade_set_next_pg_type_oid('16392'::pg_catalog.oid);



-- For binary upgrade, must preserve pg_type array oid
SELECT 
pg_catalog.binary_upgrade_set_next_array_pg_type_oid('16391'::pg_catalog.oid);



-- For binary upgrade, must preserve pg_class oids
SELECT 
pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('16390'::pg_catalog.oid);


CREATE VIEW "ttt1" AS
 SELECT "e"."n"
   FROM ("t" "e"
 LEFT JOIN "t1" "d");

"
I think -this issue should be there in the older branches as well but 
not checked that.


--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] <> join selectivity estimate question

2017-07-20 Thread Thomas Munro
On Thu, Jul 20, 2017 at 11:47 PM, Ashutosh Bapat
 wrote:
> On Thu, Jul 20, 2017 at 11:04 AM, Thomas Munro
>  wrote:
>> On Fri, Jun 2, 2017 at 4:16 AM, Tom Lane  wrote:
>>> I don't think it does really.  The thing about a <> semijoin is that it
>>> will succeed unless *every* join key value from the inner query is equal
>>> to the outer key value (or is null).  That's something we should consider
>>> to be of very low probability typically, so that the <> selectivity should
>>> be estimated as nearly 1.0.  If the regular equality selectivity
>>> approaches 1.0, or when there are expected to be very few rows out of the
>>> inner query, then maybe the <> estimate should start to drop off from 1.0,
>>> but it surely doesn't move linearly with the equality selectivity.
>>
>> Ok, here I go like a bull in a china shop: please find attached a
>> draft patch.  Is this getting warmer?
>>
>> In the comment for JOIN_SEMI I mentioned a couple of refinements I
>> thought of but my intuition was that we don't go for such sensitive
>> and discontinuous treatment of stats; so I made the simplifying
>> assumption that RHS always has more than 1 distinct value in it.
>>
>> Anti-join <> returns all the nulls from the LHS, and then it only
>> returns other LHS rows if there is exactly one distinct non-null value
>> in RHS and it happens to be that one.  But if we make the same
>> assumption I described above, namely that there are always at least 2
>> distinct values on the RHS, then the join selectivity is just
>> nullfrac.
>>
>
> The patch looks good to me.
>
> +   /*
> +* For semi-joins, if there is more than one distinct key in the RHS
> +* relation then every non-null LHS row must find a match since it can
> +* only be equal to one of them.
> The word "match" is confusing. Google's dictionary entry gives "be equal
> to (something) in quality or strength." as its meaning. Maybe we want
> to reword it as "... LHS row must find a joining row in RHS ..."?

Thanks!  Yeah, here's a version with better comments.

Does anyone know how to test a situation where the join is reversed according to
get_join_variables, or "complicated cases where we can't tell for sure"?

-- 
Thomas Munro
http://www.enterprisedb.com


neqjoinsel-fix-v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-07-20 Thread Craig Ringer
On 20 Jul. 2017 19:09, "Ashutosh Sharma"  wrote:

I had a quick look into this patch and also tested it and following
are my observations.


Thanks very much.

I'll expand the tests to cover various normal and nonsensical masks and
combinations and fix the identified issues.

This was a quick morning's work in amongst other things so not surprised I
missed a few details. The check is appreciated.


[HACKERS] [PATCH] A hook for session start

2017-07-20 Thread Yugo Nagata
Hi,

Currently, PostgreSQL doesn't have a hook triggered at session
start. Although we already have ClientAuthentication_hook,
this is triggered during authentication, so we can not
access the database.

If we have a hook triggered only once at session start, we may
do something useful on the session for a certain database or user.

For example, one of our clients wanted such a feature. He wanted
to handle encryption for specific users, though I don't know
the details.

The attached patch (session_start_hook.patch) implements such
hook. 

Another patch, session_start_sample.patch, is a very simple
example of this hook that changes work_mem values for sessions
of a specific database. 

I would appreciate hearing your opinion on this hook.

Regards,

-- 
Yugo Nagata 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b8d860e..7a1fa3b 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -160,6 +160,9 @@ static bool RecoveryConflictPending = false;
 static bool RecoveryConflictRetryable = true;
 static ProcSignalReason RecoveryConflictReason;
 
+/* Hook for plugins to get control at start of session */
+session_start_hook_type session_start_hook = NULL;
+
 /* 
  *		decls for routines only used in this file
  * 
@@ -3808,6 +3811,9 @@ PostgresMain(int argc, char *argv[],
 	if (!IsUnderPostmaster)
 		PgStartTime = GetCurrentTimestamp();
 
+	if (session_start_hook)
+		(*session_start_hook) (dbname, username);
+
 	/*
 	 * POSTGRES main processing loop begins here
 	 *
diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h
index f8c535c..d349592 100644
--- a/src/include/tcop/tcopprot.h
+++ b/src/include/tcop/tcopprot.h
@@ -35,6 +35,11 @@ extern PGDLLIMPORT const char *debug_query_string;
 extern int	max_stack_depth;
 extern int	PostAuthDelay;
 
+/* Hook for plugins to get control at start of session */
+typedef void (*session_start_hook_type) (const char *dbname,
+		 const char *username);
+extern PGDLLIMPORT session_start_hook_type session_start_hook;
+
 /* GUC-configurable parameters */
 
 typedef enum
diff --git a/contrib/session_start/Makefile b/contrib/session_start/Makefile
new file mode 100644
index 000..f94355b
--- /dev/null
+++ b/contrib/session_start/Makefile
@@ -0,0 +1,15 @@
+# contrib/session_start/Makefile
+
+MODULES = session_start
+PGFILEDESC = "session_start - sample for session start hook"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/session_start
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/session_start/session_start.c b/contrib/session_start/session_start.c
new file mode 100644
index 000..1792879
--- /dev/null
+++ b/contrib/session_start/session_start.c
@@ -0,0 +1,53 @@
+/* -
+ *
+ * session_start.c
+ *
+ * Copyright (c) 2010-2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/session_start/session_start.c
+ *
+ * -
+ */
+#include "postgres.h"
+
+#include "executor/spi.h"
+#include "tcop/tcopprot.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/* Original Hook */
+static session_start_hook_type original_session_start_hook = NULL;
+
+/* sample hook function */
+static void
+sample_session_start_hook(const char *dbname, const char *username)
+{
+	CommandDest back;
+
+	if (original_session_start_hook)
+		original_session_start_hook(dbname, username);
+
+	if (!strcmp(dbname, "test"))
+	{
+		StartTransactionCommand();
+		SPI_connect();
+		SPI_exec("set work_mem to 10240", 1);
+		SPI_finish();
+		CommitTransactionCommand();
+	}
+}
+
+/*
+ * Module Load Callback
+ */
+void
+_PG_init(void)
+{
+	/* Install Hooks */
+
+	original_session_start_hook = session_start_hook;
+	session_start_hook = sample_session_start_hook;
+}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] <> join selectivity estimate question

2017-07-20 Thread Ashutosh Bapat
On Thu, Jul 20, 2017 at 11:04 AM, Thomas Munro
 wrote:
> On Fri, Jun 2, 2017 at 4:16 AM, Tom Lane  wrote:
>> I don't think it does really.  The thing about a <> semijoin is that it
>> will succeed unless *every* join key value from the inner query is equal
>> to the outer key value (or is null).  That's something we should consider
>> to be of very low probability typically, so that the <> selectivity should
>> be estimated as nearly 1.0.  If the regular equality selectivity
>> approaches 1.0, or when there are expected to be very few rows out of the
>> inner query, then maybe the <> estimate should start to drop off from 1.0,
>> but it surely doesn't move linearly with the equality selectivity.
>
> Ok, here I go like a bull in a china shop: please find attached a
> draft patch.  Is this getting warmer?
>
> In the comment for JOIN_SEMI I mentioned a couple of refinements I
> thought of but my intuition was that we don't go for such sensitive
> and discontinuous treatment of stats; so I made the simplifying
> assumption that RHS always has more than 1 distinct value in it.
>
> Anti-join <> returns all the nulls from the LHS, and then it only
> returns other LHS rows if there is exactly one distinct non-null value
> in RHS and it happens to be that one.  But if we make the same
> assumption I described above, namely that there are always at least 2
> distinct values on the RHS, then the join selectivity is just
> nullfrac.
>

The patch looks good to me.

+   /*
+* For semi-joins, if there is more than one distinct key in the RHS
+* relation then every non-null LHS row must find a match since it can
+* only be equal to one of them.
The word "match" confusing. Google's dictionary entry gives "be equal
to (something) in quality or strength." as its meaning. May be we want
to reword it as "... LHS row must find a joining row in RHS ..."?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Mishandling of WCO constraints in direct foreign table modification

2017-07-20 Thread Etsuro Fujita

Here is an example for $subject using postgres_fdw:

postgres=# create foreign table foreign_tbl (a int, b int) server 
loopback options (table_name 'base_tbl');

CREATE FOREIGN TABLE
postgres=# create view rw_view as select * from foreign_tbl where a < b 
with check option;

CREATE VIEW
postgres=# insert into rw_view values (0, 10);
INSERT 0 1
postgres=# explain verbose update rw_view set a = 20 where b = 10;
  QUERY PLAN
--
 Update on public.foreign_tbl  (cost=100.00..146.21 rows=4 width=14)
   ->  Foreign Update on public.foreign_tbl  (cost=100.00..146.21 
rows=4 width=14)
 Remote SQL: UPDATE public.base_tbl SET a = 20 WHERE ((a < b)) 
AND ((b = 10))

(3 rows)

postgres=# update rw_view set a = 20 where b = 10;
UPDATE 1

This is wrong!  This should fail.  The reason is that direct modify
overlooks the WITH CHECK OPTION constraints coming from parent views.
I think we could still do direct modify even if there are WITH CHECK
OPTIONs, in some way or other, but that would be a new feature.  So, I'd
like to propose fixing this by just giving up direct modify if there are
any WITH CHECK OPTIONs.  Attached is a patch for that.  I'll add it to
the next commitfest.


Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 5856,5861  INSERT INTO ft1(c1, c2) VALUES(, 2);
--- 5856,5921 
  UPDATE ft1 SET c2 = c2 + 1 WHERE c1 = 1;
  ALTER FOREIGN TABLE ft1 DROP CONSTRAINT ft1_c2negative;
  -- ===
+ -- test WITH CHECK OPTION constraints
+ -- ===
+ CREATE TABLE base_tbl (a int, b int);
+ CREATE FOREIGN TABLE foreign_tbl (a int, b int)
+   SERVER loopback OPTIONS(table_name 'base_tbl');
+ CREATE VIEW rw_view AS SELECT * FROM foreign_tbl
+   WHERE a < b WITH CHECK OPTION;
+ \d+ rw_view
+View "public.rw_view"
+  Column |  Type   | Collation | Nullable | Default | Storage | Description 
+ +-+---+--+-+-+-
+  a  | integer |   |  | | plain   | 
+  b  | integer |   |  | | plain   | 
+ View definition:
+  SELECT foreign_tbl.a,
+ foreign_tbl.b
+FROM foreign_tbl
+   WHERE foreign_tbl.a < foreign_tbl.b;
+ Options: check_option=cascaded
+ 
+ INSERT INTO rw_view VALUES (0, 10); -- ok
+ INSERT INTO rw_view VALUES (10, 0); -- should fail
+ ERROR:  new row violates check option for view "rw_view"
+ DETAIL:  Failing row contains (10, 0).
+ EXPLAIN (VERBOSE, COSTS OFF)
+ UPDATE rw_view SET b = 20 WHERE a = 0; -- not pushed down
+ QUERY PLAN

+ 
--
+  Update on public.foreign_tbl
+Remote SQL: UPDATE public.base_tbl SET b = $2 WHERE ctid = $1
+->  Foreign Scan on public.foreign_tbl
+  Output: foreign_tbl.a, 20, foreign_tbl.ctid
+  Remote SQL: SELECT a, ctid FROM public.base_tbl WHERE ((a < b)) AND 
((a = 0)) FOR UPDATE
+ (5 rows)
+ 
+ UPDATE rw_view SET b = 20 WHERE a = 0; -- ok
+ EXPLAIN (VERBOSE, COSTS OFF)
+ UPDATE rw_view SET b = -20 WHERE a = 0; -- not pushed down
+ QUERY PLAN

+ 
--
+  Update on public.foreign_tbl
+Remote SQL: UPDATE public.base_tbl SET b = $2 WHERE ctid = $1
+->  Foreign Scan on public.foreign_tbl
+  Output: foreign_tbl.a, '-20'::integer, foreign_tbl.ctid
+  Remote SQL: SELECT a, ctid FROM public.base_tbl WHERE ((a < b)) AND 
((a = 0)) FOR UPDATE
+ (5 rows)
+ 
+ UPDATE rw_view SET b = -20 WHERE a = 0; -- should fail
+ ERROR:  new row violates check option for view "rw_view"
+ DETAIL:  Failing row contains (0, -20).
+ SELECT * FROM foreign_tbl;
+  a | b  
+ ---+
+  0 | 20
+ (1 row)
+ 
+ DROP FOREIGN TABLE foreign_tbl CASCADE;
+ NOTICE:  drop cascades to view rw_view
+ DROP TABLE base_tbl;
+ -- ===
  -- test serial columns (ie, sequence-based defaults)
  -- ===
  create table loc1 (f1 serial, f2 text);
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
***
*** 1158,1163  UPDATE ft1 SET c2 = c2 + 1 WHERE c1 = 1;
--- 1158,1187 
  ALTER FOREIGN TABLE ft1 DROP CONSTRAINT ft1_c2negative;
  
  -- ===
+ -- test WITH CHECK OPTION 

Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-07-20 Thread Ashutosh Sharma
I had a quick look into this patch and also tested it and following
are my observations.

1) I am seeing a server crash when passing any non meaningful value
for t_infomask2 to heap_infomask_flags().

postgres=# SELECT heap_infomask_flags(2816, 3);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> \q

Following is the backtrace,

(gdb) bt
#0  0x00d9c55b in pg_detoast_datum (datum=0x0) at fmgr.c:1833
#1  0x00b87374 in construct_md_array (elems=0x2ad74c0,
nulls=0x0, ndims=1, dims=0x7ffc0b0bbcd0, lbs=0x7ffc0b0bbcc0,
elmtype=25, elmlen=-1,
elmbyval=0 '\000', elmalign=105 'i') at arrayfuncs.c:3382
#2  0x00b8709f in construct_array (elems=0x2ad74c0, nelems=10,
elmtype=25, elmlen=-1, elmbyval=0 '\000', elmalign=105 'i') at
arrayfuncs.c:3316
#3  0x7fb8001603a5 in heap_infomask_flags (fcinfo=0x2ad3b88) at
heapfuncs.c:597
#4  0x0082f4cd in ExecInterpExpr (state=0x2ad3aa0,
econtext=0x2ad3750, isnull=0x7ffc0b0bbf67 "") at execExprInterp.c:672
#5  0x0088b832 in ExecEvalExprSwitchContext (state=0x2ad3aa0,
econtext=0x2ad3750, isNull=0x7ffc0b0bbf67 "")
at ../../../src/include/executor/executor.h:290
#6  0x0088b8e3 in ExecProject (projInfo=0x2ad3a98) at
../../../src/include/executor/executor.h:324
#7  0x0088bb89 in ExecResult (node=0x2ad36b8) at nodeResult.c:132
#8  0x008494fe in ExecProcNode (node=0x2ad36b8) at execProcnode.c:416
#9  0x0084125d in ExecutePlan (estate=0x2ad34a0,
planstate=0x2ad36b8, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001',
numberTuples=0, direction=ForwardScanDirection, dest=0x2ac0ae0,
execute_once=1 '\001') at execMain.c:1693
#10 0x0083d54b in standard_ExecutorRun (queryDesc=0x2a42880,
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at
execMain.c:362
#11 0x0083d253 in ExecutorRun (queryDesc=0x2a42880,
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at
execMain.c:305
#12 0x00b3dd8f in PortalRunSelect (portal=0x2ad1490, forward=1
'\001', count=0, dest=0x2ac0ae0) at pquery.c:932
#13 0x00b3d7e7 in PortalRun (portal=0x2ad1490,
count=9223372036854775807, isTopLevel=1 '\001', run_once=1 '\001',
dest=0x2ac0ae0, altdest=0x2ac0ae0,
completionTag=0x7ffc0b0bc2c0 "") at pquery.c:773
#14 0x00b31fe4 in exec_simple_query (query_string=0x2a9d9a0
"SELECT heap_infomask_flags(11008, , true);") at postgres.c:1099
#15 0x00b3a727 in PostgresMain (argc=1, argv=0x2a49eb0,
dbname=0x2a1d480 "postgres", username=0x2a49d18 "ashu") at
postgres.c:4090
#16 0x00a2cb3f in BackendRun (port=0x2a3e700) at postmaster.c:4357
#17 0x00a2bc63 in BackendStartup (port=0x2a3e700) at postmaster.c:4029
#18 0x00a248ab in ServerLoop () at postmaster.c:1753
#19 0x00a236a9 in PostmasterMain (argc=3, argv=0x2a1b2b0) at
postmaster.c:1361
#20 0x008d8054 in main (argc=3, argv=0x2a1b2b0) at main.c:228

2) I can see the documentation for heap_infomask(). But, I do not see
it being defined or used anywhere in the patch.

+ 
+  The heap_infomask function can be used to unpack the
+  recognised bits of the infomasks of heap tuples.
+ 

3) If the show_combined flag is set to its default value and a tuple is
frozen, then may I know the reason for not showing it as a frozen tuple
when t_infomask2 is passed as zero.

postgres=# SELECT heap_infomask_flags(2816, 0);
heap_infomask_flags
---
 {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID}
(1 row)

postgres=# SELECT heap_infomask_flags(2816, 1);
heap_infomask_flags

 {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID,HEAP_XMIN_FROZEN}
(1 row)


4) I think it would be better to use the same argument name for the
newly added function, i.e. heap_infomask_flags(), in both the documentation
and the sql file. I am basically referring to the 'include_combined'
argument. If you see the function definition, the argument name used is
'include_combined' whereas in the documentation you have mentioned
'show_combined'.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Jul 20, 2017 at 11:56 AM, Masahiko Sawada  wrote:
> On Thu, Jul 20, 2017 at 3:13 PM, Julien Rouhaud  wrote:
>> On Thu, Jul 20, 2017 at 5:44 AM, Peter Geoghegan  wrote:
>>> On Wed, Jul 19, 2017 at 8:33 PM, Craig Ringer  wrote:
 That's silly, so here's a patch to teach pageinspect how to decode 
 infomasks
 to a human readable array of flag names.

 Example:

 SELECT t_infomask, t_infomask2, flags
 FROM heap_page_items(get_raw_page('test1', 0)),

Re: [HACKERS] Causal reads take II

2017-07-20 Thread Thomas Munro
On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs  wrote:
> On 3 January 2017 at 01:43, Thomas Munro  
> wrote:
>
>> Here is a new version of my "causal reads" patch (see the earlier
>> thread from the 9.6 development cycle[1]), which provides a way to
>> avoid stale reads when load balancing with streaming replication.
>
> I'm very happy that you are addressing this topic.
>
> I noticed you didn't put in links my earlier doubts about this
> specific scheme, though I can see doubts from myself and Heikki at
> least in the URLs. I maintain those doubts as to whether this is the
> right way forwards.
>
> This patch presumes we will load balance writes to a master and reads
> to a pool of standbys. How will we achieve that?
>
> 1. We decorate the application with additional info to indicate
> routing/write concerns.
> 2. We get middleware to do routing for us, e.g. pgpool style read/write 
> routing
>
> The explicit premise of the patch is that neither of the above options
> are practical, so I'm unclear how this makes sense. Is there some use
> case that you have in mind that has not been fully described? If so,
> lets get it on the table.
>
> What I think we need is a joined up plan for load balancing, so that
> we can understand how it will work. i.e. explain the whole use case
> and how the solution works.

Simon,

Here's a simple proof-of-concept Java web service using Spring Boot
that demonstrates how load balancing could be done with this patch.
It shows two different techniques for routing: an "adaptive" one that
learns which transactional methods need to run on the primary server
by intercepting errors, and a "declarative" one that respects Spring's
@Transactional(readOnly=true) annotations (inspired by the way people
use MySQL Connector/J with Spring to do load balancing).  Whole
transactions are automatically retried at the service request level on
transient failures using existing techniques (Spring Retry, as used
for handling deadlocks and serialisation failures etc), and the
"TransactionRouter" avoids servers that have recently raised the
"synchronous_replay not available" error.  Aside from the optional
annotations, the application code in KeyValueController.java is
unaware of any of this.

https://github.com/macdice/syncreplay-spring-demo

I suspect you could find ways to do similar things with basically any
application development stack that supports some kind of container
managed transactions.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Proposal about a "deep" versions for some jsonb functions

2017-07-20 Thread Dmitry Dolgov
Hi

As far as I know, since 9.5 we're missing some convenient features, namely
a deep version of `jsonb_concat` and `jsonb_minus`. There are already a few
feature requests about `jsonb_minus` (see [1], [2]) and a lot of confusion
and requests about a deep version of `jsonb_concat`. From my point of view
they're pretty much related, so I want to propose the following description
for this functionality and eventually implement it.

# jsonb_minus

```
jsonb_minus(jsonb, jsonb, deep=False)
```

Looks like we have to abandon "-" operator for that purpose (see a concern
about that in this thread [2]).

In general this functionality is something like the relative complement
for two jsonb objects. Basically we're taking all the paths inside both
jsonb objects and removing duplicated paths from the left one. Of course
an actual implementation may be different, but I think it's a nice way of
thinking about this logic.

Here are a few examples, where "->" is an operation to get an actual value,
".->" an operation to get the next key, and "#->" an operation to get a
value from an array (the "-" operator is just for the sake of readability):

--

{"a": 1} - {"a": 1}
=> null

paths:
 a -> 1

 a -> 1

--

{"a": 1} - {"a": 2}
=> {"a": 1}

paths:
 a -> 1

 a -> 2

--

{"a": 1} - {"a": {"b": 1}}
=> {"a": 1}

paths:
 a ->

 a -> .b -> 1

--

{"a": 1, "b": {"c": 2}} - {"b": 1, "b": {"c": 3}}
=> {"b": {"c": 2}}

paths:
 a -> 1
 b .-> c -> 2

 b -> 1
 b .-> c -> 3

--

{"a": {"b": 1}} - {"a": {"b": 1}}
=> null

paths:
 a .-> b -> 1

 a .-> b -> 1

--

{"a": {"b": 1, "c": 2}} - {"a": {"b": 1}}
=> {"a": {"b": 1}}

paths:
 a .-> b -> 1
 a .-> c -> 2

 a .-> b -> 1

--

{"a": {
"b": {"b1": 1},
"c": {"c2": 2}
}}

-

{"a": {
"b": {"b1": 1},
"c": {"c2": 3}
}}
=> {"a": {"c": {"c2": 2}}

paths:
 a .-> b .-> b1 -> 1
 a .-> c .-> c2 -> 2

 a .-> b .-> b1 -> 1
 a .-> c .-> c2 -> 3

--

{"a": [1, 2, 3]} - {"a": [1, 2]}
=> {"a": [3]}

paths:
 a #-> 1
 a #-> 2
 a #-> 3

 a #-> 1
 a #-> 2

--

{"a": [{"b": 1}, {"c": 2}]} - {"a": [{"b": 1}, {"c": 3}]}
=> {"a": [{"c": 3}]}

paths:
 a #-> b -> 1
 a #-> c -> 2

 a #-> b -> 1
 a #-> c -> 3


But judging from the previous discussions, there is a demand for a slightly
different behavior, where `jsonb_minus` operates only on the top level of
the jsonb objects. For that purpose I suggest introducing a flag `deep`,
false by default (as for `jsonb_concat`), that will allow enabling the
"deep logic" (a.k.a. relative complement) I described above. With
`deep=False` this function will behave similarly to `hstore`:

{"a": 1, "b": {"c": 2}} - {"a": 1, "b": {"c": 3}}
=> {"a": 1}

# jsonb_concat

We already have this function implemented, but a "deep" mode is missing.

```
jsonb_concat(jsonb, jsonb, deep=False)
```

Basically we're taking all the paths inside both jsonb objects and overriding
duplicated paths in the left one, then adding all unique paths from the right
one to the result.

Here are a few examples for deep mode (the "||" operator is just for the sake of
readability):

--

{"a": 1, "b": {"c": 2}} || {"a": 1, "b": {"d": 3}}
=> {"a": 1, "b": {"c": 2, "d": 3}}

paths:
 a -> 1
 b .-> c -> 2

 a -> 1
 b .-> d -> 3

--

{"a": 1, "b": {"c": 2}} || {"a": 1, "b": {"c": 3}}
=> {"a": 1, "b": {"c": 3}}

paths:
 a -> 1
 b .-> c -> 2

 a -> 1
 b .-> c -> 3

--

{"a": [1, 2, 3]} || {"a": [3, 4]}
=> {"a": [1, 2, 3, 4]}

paths:
 a #-> 1
 a #-> 2
 a #-> 3

 a #-> 3
 a #-> 4
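
To make the proposed interface concrete, the calls could look roughly like
this (purely hypothetical, since neither function exists in this form yet;
the names and the deep parameter follow the signatures above):

    SELECT jsonb_minus('{"a": 1, "b": {"c": 2}}'::jsonb,
                       '{"a": 1, "b": {"c": 3}}'::jsonb,
                       deep => true);   -- would give {"b": {"c": 2}}

    SELECT jsonb_concat('{"a": 1, "b": {"c": 2}}'::jsonb,
                        '{"a": 1, "b": {"d": 3}}'::jsonb,
                        deep => true);  -- would give {"a": 1, "b": {"c": 2, "d": 3}}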


What do you think about that?

[1]:
https://www.postgresql.org/message-id/flat/CAHyXU0wtJ%2Bi-4MC5FPVc_oFu%2B3-tQVC8u04GmMNwYdPEAX1XSA%40mail.gmail.com#cahyxu0wtj+i-4mc5fpvc_ofu+3-tqvc8u04gmmnwydpeax1...@mail.gmail.com
[2]:
https://www.postgresql.org/message-id/flat/CAHyXU0wm0pkX0Gvzb5BH2jUAA_%3DswMJmyYuhBWzgOjfKxdrKfw%40mail.gmail.com#CAHyXU0wm0pkX0Gvzb5BH2jUAA_=swmjmyyuhbwzgojfkxdr...@mail.gmail.com


Re: [HACKERS] Adding -E switch to pg_dumpall

2017-07-20 Thread Fabien COELHO


Hello Michaël-san,


Attached is a patch to add support for this switch. I am parking that
in the next CF.


I'm in favor of this feature for consistency with pg_dump, and as the 
environment variable workaround is not specially elegant and can induce 
issues of its own.


Patch applies and compiles.

No special regression tests are provided, but this seems ok to me as it 
just forwards the option to pg_dump, which I have no doubt is heavily
tested. Or not. Anyway it is consistent...


Manually tested with UTF8, latin1 (including conversion errors), non 
existing.


Code is straightforward. Doc and help are both fine.

Ok for me. I switched the status to "Ready for committers".

--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-20 Thread Neha Sharma
Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 1:28 PM, Craig Ringer  wrote:

> On 20 July 2017 at 15:00, Neha Sharma 
> wrote:
>
>> Hi Craig,
>>
>> I had done a fresh initdb,the default parameter configuration was used. I
>> was setting few set of parameters while startup by the below command.
>>
>> ./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
>> min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
>> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
>>
>> Now I have modified the script a bit with Robert's suggestion as below.
>> Instead of starting it with postgres binary i have set it in conf file and
>> starting the server with pg_ctl. I am waiting for the results,once the core
>> dump is generated will share the details.
>>
>
> Thanks.
>
> To verify that you do get a coredump, you might want to consider sending a
> kill -SEGV to a backend and make sure that it actually dumps core and you
> can find the core.
>
> Ideally you'd actually set the coredumps to include shmem (see
> coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html), but
> with 8GB shared_buffers that may not be practical. It'd be very useful if
> possible.
>
> If this is wraparound-related, as it appears to be, you might get faster
> results by using a custom pgbench script for one or more workers that just
> runs txid_current() a whole lot. Or jump the server's xid space forward.
>
Thanks. Will put together suggestions to get the result.

>
> I've got a few other things on right now but I'll keep an eye out and hope
> for a core dump.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-07-20 Thread Ashutosh Bapat
On Thu, Jul 20, 2017 at 11:46 AM, Amit Langote
 wrote:
> On 2017/07/20 15:05, Ashutosh Bapat wrote:
>> On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih
>>  wrote:
>>>
>>> Partition information:
>>> Type of partitioning - single column range partition
>>> Tables partitioned - Lineitem and orders
>>>
>>> Lineitem -
>>> Partition key = l_orderkey
>>> No of partitions = 18
>>>
>>> Orders -
>>> Partition key = o_orderkey
>>> No of partitions = 11
>>>
>>
>> The patch set upto 0015 would refuse to join two partitioned relations
>> using a partition-wise join if they have different number of
>> partitions. Next patches implement a more advanced partition matching
>> algorithm only for list partitions. Those next patches would refuse to
>> apply partition-wise join for range partitioned tables. So, I am
>> confused as to how come partition-wise join is being chosen even when
>> the number of partitions differ.
>
> In 21_part_patched.out, I see that lineitem is partitionwise-joined with
> itself.
>
>  >  Append
>
>->  Hash Semi Join
>Hash Cond: (l1.l_orderkey = l2.l_orderkey)
>Join Filter: (l2.l_suppkey <> l1.l_suppkey)
>Rows Removed by Join Filter: 395116
>
>->  Parallel Seq Scan on lineitem_001 l1
>Filter: (l_receiptdate > l_commitdate)
>Rows Removed by Filter: 919654
>
>->  Hash
>Buckets: 8388608  Batches: 1  Memory Usage: 358464kB
>->  Seq Scan on lineitem_001 l2
>
Ah, I see now.

We need the same number of partitions in all partitioned tables, for
joins to pick up partition-wise join.
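
For example, a minimal pair of tables that the patch set (up to 0015) could
join partition-wise would need matching bounds on both sides, something like
this (illustrative sketch only):

    CREATE TABLE orders   (o_orderkey int) PARTITION BY RANGE (o_orderkey);
    CREATE TABLE lineitem (l_orderkey int) PARTITION BY RANGE (l_orderkey);
    -- one partition shown; both sides need the same number of partitions
    -- with identical bounds
    CREATE TABLE orders_p1   PARTITION OF orders
        FOR VALUES FROM (1) TO (1000000);
    CREATE TABLE lineitem_p1 PARTITION OF lineitem
        FOR VALUES FROM (1) TO (1000000);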

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] xlogfilename

2017-07-20 Thread Michael Paquier
On Thu, Jul 20, 2017 at 10:58 AM, DEV_OPS  wrote:
> I think you may be referring to the function pg_xlogfile_name in
> src/backend/access/transam/xlogfuncs.c; it uses XLogFileName, defined in
> src/include/access/xlog_internal.h
>
> #define XLogFileName(fname, tli, logSegNo)  \
> snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli,   \
>  (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
>  (uint32) ((logSegNo) % XLogSegmentsPerXLogId))
>
>
> hope it's helpful for you

The first 8 characters are the timeline number in hexadecimal format.
The next 8 characters indicate a number, in hexadecimal format, that gets
incremented every 256 segments. The last 8 characters indicate the
current segment number within that group, also in hexadecimal format.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] xlogfilename

2017-07-20 Thread DEV_OPS

I think you may be referring to the function pg_xlogfile_name in
src/backend/access/transam/xlogfuncs.c; it uses XLogFileName, defined in
src/include/access/xlog_internal.h

#define XLogFileName(fname, tli, logSegNo)  \
snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli,   \
 (uint32) ((logSegNo) / XLogSegmentsPerXLogId), \
 (uint32) ((logSegNo) % XLogSegmentsPerXLogId))


hope it's helpful for you

--Tony

On 20/07/2017 16:42, 王刚 wrote:
> I am studying the WAL source code and have a question about the xlog file name.
> What does 00010001 mean? Someone says that it means tli, logid,
> and segno. I don't understand.





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] xlogfilename

2017-07-20 Thread 王刚
I am studying the WAL source code and have a question about the xlog file name. What
does 00010001 mean? Someone says that it means tli, logid, and segno.
I don't understand.

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-20 Thread Craig Ringer
On 20 July 2017 at 15:00, Neha Sharma  wrote:

> Hi Craig,
>
> I had done a fresh initdb,the default parameter configuration was used. I
> was setting few set of parameters while startup by the below command.
>
> ./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
> min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
>
> Now I have modified the script a bit with Robert's suggestion as below.
> Instead of starting it with postgres binary i have set it in conf file and
> starting the server with pg_ctl. I am waiting for the results,once the core
> dump is generated will share the details.
>

Thanks.

To verify that you do get a coredump, you might want to consider sending a
kill -SEGV to a backend and make sure that it actually dumps core and you
can find the core.

Ideally you'd actually set the coredumps to include shmem (see
coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html), but
with 8GB shared_buffers that may not be practical. It'd be very useful if
possible.
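
Something along these lines, as a rough sketch (the PID placeholder and the
mask value are illustrative; see core(5) for the exact bit meanings):

# inspect the current filter for one backend
cat /proc/<backend_pid>/coredump_filter
# widen it so shared mappings are dumped as well
echo 0x7f > /proc/<backend_pid>/coredump_filter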

If this is wraparound-related, as it appears to be, you might get faster
results by using a custom pgbench script for one or more workers that just
runs txid_current() a whole lot. Or jump the server's xid space forward.
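
A minimal sketch of such a script (file name and numbers are arbitrary):

-- burn_xids.sql: each pgbench transaction just consumes one xid
SELECT txid_current();
-- run with something like: pgbench -n -f burn_xids.sql -c 8 -j 8 -T 600 postgres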

I've got a few other things on right now but I'll keep an eye out and hope
for a core dump.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Dealing with logical replication

2017-07-20 Thread Tatsuo Ishii
> On 19 July 2017 at 16:34, Tatsuo Ishii  wrote:
> 
>> Now that we are going to have logical replication in PostgreSQL 10, I
>> have started thinking how Pgpool-II can deal with it. For example, the
>> logical replication does not replicate DDLs. Isn't it convenient for
>> users to do it automatically in Pgpool-II? Or even doing it for
>> TRUNCATE?
>>
>> Or are they against the design philosophy of the logical replication?
>>
> 
> (Disclaimer - Petr Jelinek and Peter Eisentraut were the main ones working
> on in in-core logical rep, and I haven't had time to play with it much).
> 
> TL;DR: a pooler can only solve a limited subset of the problem, fairly
> unreliably, and this is really something that needs further work in core.
> 
> Not replicating TRUNCATE and schema changes is more an implementation
> limitation than anything else. It's desirable to get to the point where
> logical replication can transparently replicate DDL. There are some hurdles
> for doing so, but it'll be possible to get something in place that's good
> enough when time/release progress permits.
> 
> Similarly, with TRUNCATE, AFAIK support just didn't make the cut for pg10.
> 
> A pooler could well help in the mean time, but you have to consider
> ordering with care. For example, given "U" upstream, "D" downstream:
> 
> U: CREATE TABLE x (a integer, b integer);
> D: CREATE TABLE x (a integer, b integer); -- user script syncs
> U: INSERT INTO x (a,b) VALUES (1,0);
> D: [applies INSERT 1,0]
> U: INSERT INTO x (a,b) VALUES (2,0);
> U: ALTER TABLE x DROP COLUMN b;
> D: ALTER TABLE x DROP COLUMN b; -- user script syncs
> U: INSERT INTO x (a) VALUES (3);
> D: [ERROR on INSERT of 2,0: no column 'b']
> 
> Because the DDL here is transported out of band vs the row data, you can
> easily create situations where the schema change is applied before the last
> committed-but-not-yet-replicated data from the upstream that was based on
> the old schema.

What I am thinking for now is to not allow issuing DDLs on the
downstream.

> To achieve correct ordering, the simplest approach is to record DDL in a
> table when you perform it on the upstream, and replay it when you see rows
> in that table appear on the downstream.  You know it's safe to replay it
> now. This is the essence of what BDR and pglogical do with their DDL
> replication, but they handle DDL in the middle of transactions that also
> make row data changes by intercepting writes to the queue table and
> performing the DDL at the exact point in the transaction where it happened
> on the upstream. I don't think that's possible with the in-core logical
> replication yet, and certainly not something a pooler can do.
> 
> To do it externally, you have to take note of when a schema change happened
> on the upstream and apply it on the downstream at or after the point where
> the downstream has replayed and confirmed up to the upstream lsn where the
> schema change happened. Then apply the schema change.
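
For illustration only, a minimal sketch of that queue-table idea (names are
invented, and the DDL is issued in its own transaction):

-- upstream: the queue table itself is added to the publication
CREATE TABLE ddl_queue (id bigserial PRIMARY KEY, ddl text NOT NULL);

BEGIN;
ALTER TABLE x DROP COLUMN b;
INSERT INTO ddl_queue (ddl) VALUES ('ALTER TABLE x DROP COLUMN b;');
COMMIT;

-- downstream: an external applier polls ddl_queue; once the row shows up,
-- everything committed before it has been applied, so the recorded DDL can
-- be executed. (As noted above, without a hook in the apply worker, changes
-- committed after the DDL may still error out until the applier catches up.)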
> 
> A pooler trying to help here must also be very aware of the impact of
> multi-statements. If you send a single simple query message with a mixture
> of schema change commands and normal DML, you probably don't want to repeat
> the DML on downstream nodes or you'll get duplicate rows etc. But ... by
> unless it embeds transaction control commands, a simple query message
> executes in a single implicit transaction, so if you extract just the DDL
> you'll again have ordering issues of upstream vs downstream.

Pgpool-II will avoid the problem by rejecting such multi-statements.

> There are even a variety of difficulties to overcome with doing it in core:
> event triggers don't capture ddl command text and have no facility to turn
> the internal command representation back into SQL command text, nor do we
> have any way to turn the internal representation back into a parsenode tree
> for execution on a downstream's standard_ProcessUtility. However, we can
> capture raw command text with ProcessUtility_hook now that we have
> byte-ranges for the query-parts of a multi-part query (yay!), and that
> works well enough if you also capture the active search_path and apply with
> the same search_path. It can match the wrong object if extra objects with
> the same name are present earlier on the search_path on the downstream than
> on the upstream, so it's not ideal, but that's a weird corner case.
> 
> If we had a hook in the logical apply worker's insert or wal-message
> routines it'd be possible to write an extension to do this for pg10, but
> AFAICS we don't.
> 
> So schema changes in logical replication currently require more care than
> in physical replication.

Thank you for the explanation regarding those difficulties.

I now understand why logical replication does not support DDLs and
TRUNCATE: not because of a design decision, but because of the
difficulties of implementation.

Probably that means those limitations will be solved in the
future. Then the motivation for 

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

2017-07-20 Thread Neha Sharma
Hi Craig,

I had done a fresh initdb; the default parameter configuration was used. I
was setting a few parameters at startup with the command below.

./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Now I have modified the script a bit per Robert's suggestion, as below.
Instead of starting it with the postgres binary, I have set the parameters
in the conf file and am starting the server with pg_ctl. I am waiting for
the results; once the core dump is generated I will share the details.

Test code Snippet :

#Pre condition:
#Create and initialize the database and set the export PGDATA
export PGDATA=/home/centos/PG_sources/postgresql/inst/bin/data
export PGPORT=5432
export
LD_LIBRARY_PATH=/home/centos/PG_sources/postgresql/inst/lib:$LD_LIBRARY_PATH
#for i in "scale_factor shared_buffers time_for_readings no_of_readings
orig_or_patch"
for i in "300 8GB 1800 3"
do
scale_factor=`echo $i | cut -d" " -f1`
shared_bufs=`echo $i | cut -d" " -f2`
time_for_reading=`echo $i | cut -d" " -f3`
no_of_readings=`echo $i | cut -d" " -f4`

# ---

echo "Start of script for $scale_factor $shared_bufs " >>
/home/centos/test_results.txt


echo "== $run_bin =" >>
/home/centos/test_results.txt

for threads in 1 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128
#for threads in 8 16
do
#Start taking reading
for ((readcnt = 1 ; readcnt <= $no_of_readings ; readcnt++))
do
echo ""  >>
/home/centos/test_results.txt
echo $scale_factor, $shared_bufs, $threads, $threads,
$time_for_reading Reading - ${readcnt}  >> /home/centos/test_results.txt
#start server
./pg_ctl -D data -c -l logfile start
#./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c
min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
sleep 5
#drop and recreate database
./dropdb test
./createdb test
#initialize database
./pgbench -i -s $scale_factor test
sleep 5
# Run pgbench
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared
test  >> /home/centos/test_results.txt
sleep 10
./psql -d test -c "checkpoint" >> /home/centos/test_results.txt
./pg_ctl stop
done;
done;

sleep 1


mv /home/centos/test_results.txt
/home/centos/test_results_list_${scale_factor}_${shared_bufs}_rw.txt
done;



Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 11:23 AM, Craig Ringer 
wrote:

> On 19 July 2017 at 20:26, Neha Sharma 
>> wrote:
>>
>>> Hi,
>>>
>>> I am getting a FailedAssertion while executing the attached script.
>>> However, I am not able to produce a core dump for it; the script runs
>>> in the background and takes around a day to produce the mentioned error.
>>>
>>> "TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact,
>>> ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
>>> 2017-07-19 01:16:51.973 GMT [27873] LOG:  server process (PID 28084) was
>>> terminated by signal 6: Aborted
>>> 2017-07-19 01:16:51.973 GMT [27873] DETAIL:  Failed process was running:
>>> autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"
>>>
>>
>>
> What are the starting conditions of your postgres instance? Does your
> script assume a newly initdb'd instance with no custom configuration? If
> not, what setup steps/configuration precede your script run?
>
>
>
>
>
>> well short of the 2-million mark.
>>
>
> Er, billion.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-07-20 Thread Masahiko Sawada
On Thu, Jul 20, 2017 at 3:13 PM, Julien Rouhaud  wrote:
> On Thu, Jul 20, 2017 at 5:44 AM, Peter Geoghegan  wrote:
>> On Wed, Jul 19, 2017 at 8:33 PM, Craig Ringer  wrote:
>>> That's silly, so here's a patch to teach pageinspect how to decode infomasks
>>> to a human readable array of flag names.
>>>
>>> Example:
>>>
>>> SELECT t_infomask, t_infomask2, flags
>>> FROM heap_page_items(get_raw_page('test1', 0)),
>>>  LATERAL heap_infomask_flags(t_infomask, t_infomask2, true) m(flags);
>>>  t_infomask | t_infomask2 |   flags
>>> +-+
>>>2816 |   2 |
>>> {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID,HEAP_XMIN_FROZEN}
>>> (1 row)
>>
>> Seems like a good idea to me.
>>
>
> +1, it'll be really helpful.
>

+1.
When I investigated a data corruption incident I also wrote a plpgsql
function for the same purpose, and it was very useful. I think we could
have a similar thing for lp_flags as well.
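
For instance, a rough SQL sketch of an lp_flags decoder (the numeric values
are the line-pointer states from itemid.h: LP_UNUSED=0, LP_NORMAL=1,
LP_REDIRECT=2, LP_DEAD=3):

CREATE FUNCTION lp_flags_name(flags int) RETURNS text
LANGUAGE sql IMMUTABLE AS $$
  SELECT CASE flags
           WHEN 0 THEN 'LP_UNUSED'
           WHEN 1 THEN 'LP_NORMAL'
           WHEN 2 THEN 'LP_REDIRECT'
           WHEN 3 THEN 'LP_DEAD'
         END
$$;

SELECT lp, lp_flags, lp_flags_name(lp_flags)
FROM heap_page_items(get_raw_page('test1', 0));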

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] PgFDW connection invalidation by ALTER SERVER/ALTER USER MAPPING

2017-07-20 Thread Kyotaro HORIGUCHI
Finally, I added a new TAP test library, PsqlSession.

At Tue, 18 Jul 2017 18:12:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170718.181213.206979369.horiguchi.kyot...@lab.ntt.co.jp>
> > * How about some regression test cases?  You couldn't really exercise
> > cross-session invalidation easily, but I don't think you need to.
> 
> Ha Ha. You got me. I will add some test cases for this in the
> next version. Thanks.

Here it is. First I tried this using the ordinary regression test
framework, but the effect of this patch shows up only in the log, which
contains variable parts, so I gave up on it rather than trying a more
complex approach.

Next I tried the existing TAP test framework, but this test needs
persistent sessions to alternate operations between two sessions, and
PostgresNode::psql doesn't offer such functionality.

Finally, I added a new TAP test library PsqlSession. It offers
interactive psql sessions. Then added a simple test to
postgres_fdw using it.

The first patch is the PsqlSession.pm and the second is the new
test for postgres_fdw.

- The current PsqlSession is quite fragile, but it seems to work well
  enough for this usage at present.

- I'm afraid this might not work on Windows, according to the IPC::Run
  manpage, but I haven't confirmed that yet.

  http://search.cpan.org/~toddr/IPC-Run-0.96/lib/IPC/Run.pm#Win32_LIMITATIONS


Any comments or suggestions are welcome.
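
As a rough illustration (not the actual patch contents), a test using it for
the ALTER SERVER case might look something like this; the server name
"loopback" and foreign table "ft1" follow the existing postgres_fdw
regression tests, and $node is assumed to be the PostgresNode running the
server:

# sketch only, using the methods from the SYNOPSIS below
my $session = get_new_session('session1', $node);
$session->open();

# establish the remote connection inside the long-lived session
$session->execsql('SELECT 1 FROM ft1 LIMIT 1;');

# change the server options from a separate session
$node->psql('postgres',
	"ALTER SERVER loopback OPTIONS (SET dbname 'postgres');");

# the next query in the long-lived session should reconnect rather than
# reuse the now-invalidated connection
my ($out, $err) = $session->execsql('SELECT 1 FROM ft1 LIMIT 1;');

$session->close();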

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From fdb5cbab3375d9d2e4da078cf6ee7eaf7de5c8fd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 20 Jul 2017 14:56:51 +0900
Subject: [PATCH 1/2] Add new TAP test library PsqlSession.pm

PostgresNode::psql creates a temporary session to run commands, so it is
not usable when more interactive operation on a persistent session is
needed. This library offers persistent sessions that can execute multiple
SQL commands separately.
---
 src/test/perl/PsqlSession.pm | 341 +++
 1 file changed, 341 insertions(+)
 create mode 100644 src/test/perl/PsqlSession.pm

diff --git a/src/test/perl/PsqlSession.pm b/src/test/perl/PsqlSession.pm
new file mode 100644
index 000..d69fd14
--- /dev/null
+++ b/src/test/perl/PsqlSession.pm
@@ -0,0 +1,341 @@
+
+=pod
+
+=head1 NAME
+
+PsqlSession - class representing PostgreSQL psql instance
+
+=head1 SYNOPSIS
+
+  use PsqlSession;
+
+  my $session = get_new_session('session1', $server);
+
+  to connect to a PostgresNode as $server, or
+
+  my $session = get_new_session('session1', 'localhost', '5432', 'postgres');
+
+  to specify each options explicitly.
+
+  # Connect to the server
+  $session->open();
+
+  # Execute an SQL query
+  $ret = $session->execsql('SELECT now();');
+
+  Returns a pair of stdout and stderr output in array context.
+
+  ($out, $err) = $session->execsql('SELECT now();');
+
+  $session->execsql_multi('SELECT 1;', 'SELECT now();');
+
+  is just a shortcut for writing multiple execsql calls.
+
+  # End the session
+  $session->close();
+
+=head1 DESCRIPTION
+
+PsqlSession contains a set of routines for working with a psql session,
+allowing one to connect, send a command, receive the result, and close.
+
+The IPC::Run module is required.
+
+=cut
+
+package PsqlSession;
+
+use strict;
+use warnings;
+
+use Exporter 'import';
+use Test::More;
+use TestLib ();
+use Scalar::Util qw(blessed);
+
+our @EXPORT = qw(
+  get_new_session
+);
+
+
+=pod
+
+=head1 METHODS
+
+=over
+
+=item PsqlSession::new($class, $name, $pghost, $pgport, $dbname)
+
+Create a new PsqlSession instance. Does not connect.
+
+You should generally prefer to use get_new_session() instead since it
+takes care of finding host name, port number or database name.
+
+=cut
+
+sub new
+{
+	my ($class, $name, $pghost, $pgport, $dbname) = @_;
+
+	my $self = {
+		_name => $name,
+		_host => $pghost,
+		_port => $pgport,
+		_dbname => $dbname };
+
+	bless $self, $class;
+
+#	$self->dump_info;
+
+	return $self;
+}
+
+=pod
+
+=item $session->name()
+
+The name assigned to the session at creation time.
+
+=cut
+
+sub name
+{
+	return $_[0]->{_name};
+}
+
+=pod
+
+=item $session->host()
+
+Return the host (like PGHOST) for this instance. May be a UNIX socket path.
+
+=cut
+
+sub host
+{
+	return $_[0]->{_host};
+}
+
+=pod
+
+=item $session->port()
+
+Get the port number this session connects to. This won't necessarily be a
+TCP port open on the local host since we prefer to use unix sockets if
+possible.
+
+=cut
+
+sub port
+{
+	return $_[0]->{_port};
+}
+
+=pod
+
+=item $session->dbname()
+
+Get the database name this session connects to.
+
+=cut
+
+sub dbname
+{
+	return $_[0]->{_dbname};
+}
+
+=pod
+
+=item $session->errstate()
+
+Get the error state of this session. 0 means no error and 1 means
+error. This value is reset at the start of every SQL query execution.
+
+=cut
+
+sub errstate
+{
+	return $_[0]->{_errstate};
+}
+
+=pod
+
+=item $session->open()
+
+Open this session.
+
+=cut
+
+sub open
+{

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-07-20 Thread Ashutosh Bapat
On Thu, Jul 20, 2017 at 12:30 AM, Robert Haas  wrote:
>
>> This suggests that partitioning is not a suitable strategy for this query,
>> but then may be partition wise should not be picked for such a case to
>> aggravate the performance issue.
>
> In the unpartitioned case, and in the partitioned case on head, the
> join order is l1-(nation-supplier)-l2-orders-l3.  In the patched case,
> the join order changes to l1-l2-supplier-orders-nation-l3.  If the
> planner used the former join order, it wouldn't be able to do a
> partition-wise join at all, so it must think that the l1-l2 join gets
> much cheaper when done partitionwise, thus justifying a change in the
> overall join order to be able to use partition-wise join.  But it
> doesn't work out.
>
> I think the problem is that the row count estimates for the child
> joins seem to be totally bogus:
>
> ->  Hash Semi Join  (cost=309300.53..491665.60 rows=1 width=12)
> (actual time=10484.422..15945.851 rows=1523493 loops=3)
>   Hash Cond: (l1.l_orderkey = l2.l_orderkey)
>   Join Filter: (l2.l_suppkey <> l1.l_suppkey)
>   Rows Removed by Join Filter: 395116
>
> That's clearly wrong.  In the un-partitioned plan, the join to l2
> produces about as many rows of output as the number of rows that were
> input (998433 vs. 962909); but here, a child join with a million rows
> as input is estimated to produce only 1 row of output.  I bet the
> problem is that the child-join's row count estimate isn't getting
> initialized at all, but then something is clamping it to 1 row instead
> of 0.
>
> So this looks like a bug in Ashutosh's patch.

The patch does not have any changes to the selectivity estimation. It
might happen that some correction in selectivity estimation for
child-joins is required, but I have not spotted any code in
selectivity estimation that differentiates explicitly between child
and parent Vars and estimates. So, I am more inclined to believe
Thomas's theory. I will try Tom's suggested approach.
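
(For reference, the clamping Robert mentions comes from clamp_row_est() in
costsize.c, which, paraphrased from memory rather than quoted exactly, forces
any estimate, including an uninitialized zero, up to at least one row:

double
clamp_row_est(double nrows)
{
	/* force the estimate to be a positive integer, at least 1 */
	if (nrows <= 1.0)
		nrows = 1.0;
	else
		nrows = rint(nrows);

	return nrows;
}

so a child join whose row count was never set would indeed surface as
"rows=1" in EXPLAIN.)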

I am investigating this case with the setup that Rafia provided.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Error while copying a large file in pg_rewind

2017-07-20 Thread Michael Paquier
On Fri, Jul 7, 2017 at 9:33 AM, Michael Paquier
 wrote:
> On Fri, Jul 7, 2017 at 4:31 PM, Kuntal Ghosh  
> wrote:
>> On Fri, Jul 7, 2017 at 7:49 AM, Michael Paquier
>>  wrote:
>> I don't have any more inputs on this patch and it looks good to me.
>> So, I'm moving the status to ready for committer.
>
> Thanks!
>
>>> At some point it would really make sense to group all things under the
>>> same banner (64-b LO, pg_basebackup, and now pg_rewind).
>>>
>> +1. Implementation-wise, I prefer pg_recvint64 to fe_recvint64.
>
> So do I. That's a matter of taste I guess.

Heikki, this bug is rather bad for anybody using pg_rewind with
relation file sizes larger than 2GB, as it corrupts the data of the
target instance. I think you would be the best fit as a committer to
look at this patch, as you implemented the tool originally, and it would
be a bad idea to let this sit for too long. Could you possibly spare a
bit of your time at some point to look at it?
-- 
Michael




Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-07-20 Thread Amit Langote
On 2017/07/20 15:05, Ashutosh Bapat wrote:
> On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih
>  wrote:
>>
>> Partition information:
>> Type of partitioning - single column range partition
>> Tables partitioned - Lineitem and orders
>>
>> Lineitem -
>> Partition key = l_orderkey
>> No of partitions = 18
>>
>> Orders -
>> Partition key = o_orderkey
>> No of partitions = 11
>>
> 
> The patch set up to 0015 would refuse to join two partitioned relations
> using a partition-wise join if they have a different number of
> partitions. The next patches implement a more advanced partition matching
> algorithm, but only for list partitions; those patches would refuse to
> apply a partition-wise join to range-partitioned tables. So I am
> confused as to how a partition-wise join is being chosen even though
> the number of partitions differs.

In 21_part_patched.out, I see that lineitem is partitionwise-joined with
itself.

 >  Append

   ->  Hash Semi Join
   Hash Cond: (l1.l_orderkey = l2.l_orderkey)
   Join Filter: (l2.l_suppkey <> l1.l_suppkey)
   Rows Removed by Join Filter: 395116

   ->  Parallel Seq Scan on lineitem_001 l1
   Filter: (l_receiptdate > l_commitdate)
   Rows Removed by Filter: 919654

   ->  Hash
   Buckets: 8388608  Batches: 1  Memory Usage: 358464kB
   ->  Seq Scan on lineitem_001 l2


Thanks,
Amit





Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-07-20 Thread Julien Rouhaud
On Thu, Jul 20, 2017 at 5:44 AM, Peter Geoghegan  wrote:
> On Wed, Jul 19, 2017 at 8:33 PM, Craig Ringer  wrote:
>> That's silly, so here's a patch to teach pageinspect how to decode infomasks
>> to a human readable array of flag names.
>>
>> Example:
>>
>> SELECT t_infomask, t_infomask2, flags
>> FROM heap_page_items(get_raw_page('test1', 0)),
>>  LATERAL heap_infomask_flags(t_infomask, t_infomask2, true) m(flags);
>>  t_infomask | t_infomask2 |   flags
>> +-+
>>2816 |   2 |
>> {HEAP_XMIN_COMMITTED,HEAP_XMIN_INVALID,HEAP_XMAX_INVALID,HEAP_XMIN_FROZEN}
>> (1 row)
>
> Seems like a good idea to me.
>

+1, it'll be really helpful.




Re: [HACKERS] autovacuum can't keep up, bloat just continues to rise

2017-07-20 Thread Sokolov Yura

On 2017-07-20 05:52, Masahiko Sawada wrote:
> On Thu, Jul 20, 2017 at 8:24 AM, Stephen Frost  wrote:
>> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>>> "Joshua D. Drake"  writes:
>>>> At PGConf US Philly last week I was talking with Jim and Jan about
>>>> performance. One of the items that came up is that PostgreSQL can't run
>>>> full throttle for long periods of time. The long and short is that no
>>>> matter what, autovacuum can't keep up. This is what I have done:
>>>
>>> Try reducing autovacuum_vacuum_cost_delay more, and/or increasing
>>> autovacuum_vacuum_cost_limit.
>>
>> Or get rid of the cost delay entirely and let autovacuum actually go as
>> fast as it can when it's run.  The assertion that it can't keep up is
>> still plausible, but configuring autovacuum to sleep regularly and then
>> complaining that it's not able to keep up doesn't make sense.
>>
>> Reducing the nap time might also be helpful if autovacuum is going as
>> fast as it can and it's able to clear a table in less than a minute.
>>
>> There have been discussions on this list about parallel vacuum of a
>> particular table as well; to address this issue I'd encourage reviewing
>> those discussions and looking at writing a patch to implement that
>> feature as that would address the case where the table is large enough
>> that autovacuum simply can't get through all of it before the other
>> backends have used all space available and then substantially increased
>> the size of the relation (leading to vacuum on the table running for
>> longer).
>
> Yeah, the parallel vacuum of a particular table might help this issue
> unless disk I/O is bottle-neck. I'm planning work on that.
>
> Regards,
>
> --
> Masahiko Sawada
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center


I wrote two days ago about the vacuum ring buffer:
https://www.postgresql.org/message-id/8737e9bddb82501da1134f021bf4929a%40postgrespro.ru

Increasing vacuum's ring buffer to the size of the bulk writer's ring
buffer reduces autovacuum time by a factor of 3-10.
(For both the patched and unpatched versions I used a single non-default
setting, 'autovacuum_cost_delay=2ms'.)

This is a single-line change, and it improves things a lot.
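
(For reference, the single-line change presumably amounts to something like
this in GetAccessStrategy() in src/backend/storage/buffer/freelist.c,
paraphrased rather than quoted from the actual patch:

	case BAS_VACUUM:
		ring_size = 16 * 1024 * 1024 / BLCKSZ;	/* was 256 * 1024 / BLCKSZ */
		break;

i.e. giving vacuum the same 16MB ring that BAS_BULKWRITE uses.)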

With regards,
--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company




Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-07-20 Thread Ashutosh Bapat
On Wed, Jul 19, 2017 at 9:54 AM, Rafia Sabih
 wrote:
>
> Partition information:
> Type of partitioning - single column range partition
> Tables partitioned - Lineitem and orders
>
> Lineitem -
> Partition key = l_orderkey
> No of partitions = 18
>
> Orders -
> Partition key = o_orderkey
> No of partitions = 11
>

The patch set up to 0015 would refuse to join two partitioned relations
using a partition-wise join if they have a different number of
partitions. The next patches implement a more advanced partition matching
algorithm, but only for list partitions; those patches would refuse to
apply a partition-wise join to range-partitioned tables. So I am
confused as to how a partition-wise join is being chosen even though
the number of partitions differs.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

