Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Kohei KaiGai
2015-08-19 20:12 GMT+09:00 Simon Riggs si...@2ndquadrant.com:
 On 12 June 2015 at 00:29, Tomas Vondra tomas.von...@2ndquadrant.com wrote:


 I see two ways to fix this:

 (1) enforce the 1GB limit (probably better for back-patching, if that's
 necessary)

 (2) make it work with hash tables over 1GB

 I'm in favor of (2) if there's a good way to do that. It seems a bit
 stupid not to be able to use fast hash table because there's some artificial
 limit. Are there any fundamental reasons not to use the
 MemoryContextAllocHuge fix, proposed by KaiGai-san?


 If there are no objections, I will apply the patch for 2) to HEAD and
 backpatch to 9.5.

Please don't rush. :-)

It is not difficult to replace palloc() with palloc_huge(); however, it may lead
to another problem once the planner gives us a crazy estimation.
Below is my comment from the other thread.

==
Also, we may need to pay attention to the reliability of the planner's
scale estimation.
Even though the plan below says that the Join generates 60521928028 rows,
it actually generates 776157676 rows (1.28%).


tpcds100=# EXPLAIN ANALYZE select
ws1.ws_order_number,ws1.ws_warehouse_sk wh1,ws2.ws_warehouse_sk wh2
 from web_sales ws1,web_sales ws2
 where ws1.ws_order_number = ws2.ws_order_number
   and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk;
 QUERY PLAN


 Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028
width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
   Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
   Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
   Rows Removed by Join Filter: 127853313
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=73252.300..79017.420 rows=72001237 loops=1)
 Sort Key: ws1.ws_order_number
 Sort Method: quicksort  Memory: 7083296kB
 ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237
loops=1)
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=65095.655..128885.811 rows=904010978 loops=1)
 Sort Key: ws2.ws_order_number
 Sort Method: quicksort  Memory: 7083296kB
 ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237
loops=1)
 Planning time: 0.232 ms
 Execution time: 530176.521 ms
(14 rows)


So, even if we allow nodeHash.c to allocate hash buckets larger than
1GB, its initial size should be determined carefully.
Probably, 1GB is a good starting point even if the table is expanded later.
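
As a rough illustration of that idea (a sketch only, with made-up names,
not the nodeHash.c code), the initial allocation could be clamped like this:

#include <stddef.h>

#define MAX_INITIAL_BUCKET_BYTES ((size_t) 1024 * 1024 * 1024)   /* 1GB cap */

/* Clamp the initial bucket count to the 1GB cap, whatever the estimate says. */
static int
clamp_initial_nbuckets(double estimated_rows, size_t bucket_ptr_size)
{
    size_t  wanted = (size_t) estimated_rows;               /* ~one bucket per row */
    size_t  cap = MAX_INITIAL_BUCKET_BYTES / bucket_ptr_size;
    size_t  n = (wanted < cap) ? wanted : cap;

    /* round down to a power of two, which hash tables typically require */
    while (n & (n - 1))
        n &= n - 1;

    return (n > 0) ? (int) n : 1;
}

The table would then be grown later, once the actual tuple count justifies it,
rather than trusting a possibly crazy estimate up front.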

-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Oleg Bartunov
On Wed, Aug 19, 2015 at 4:46 PM, Andres Freund and...@anarazel.de wrote:

 On 2015-08-19 09:41:32 -0400, Tom Lane wrote:
  In fact, they'd still need to use DNS balancing for Postgres,
  because not everything connects with libpq (think JDBC for instance).

 It already does support this though.

 https://jdbc.postgresql.org/documentation/head/connect.html :

  Connection Fail-over
 
  To support simple connection fail-over it is possible to define multiple
  endpoints (host and port pairs) in the connection url separated by
  commas. The driver will try to once connect to each of them in order
  until the connection succeeds. If none succeed, a normal connection
  exception is thrown.
 
  The syntax for the connection url is:
 
  jdbc:postgresql://host1:port1,host2:port2/database


Yes, I also wanted to show this, but you were quicker.




  So I think we ought to reject this proposal, full stop.  I see no
  reason to re-invent this wheel, and there are good reasons not to.

 I don't really buy this argument. Allowing connections to several
 endpoints isn't exactly new tech either. A lot of database connectors
 do support something very close to the above pgjdbc feature.



mysql, for example.



 Greetings,

 Andres Freund


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Simon Riggs
On 19 August 2015 at 12:55, Kohei KaiGai kai...@kaigai.gr.jp wrote:

 2015-08-19 20:12 GMT+09:00 Simon Riggs si...@2ndquadrant.com:
  On 12 June 2015 at 00:29, Tomas Vondra tomas.von...@2ndquadrant.com
 wrote:
 
 
  I see two ways to fix this:
 
  (1) enforce the 1GB limit (probably better for back-patching, if that's
  necessary)
 
  (2) make it work with hash tables over 1GB
 
  I'm in favor of (2) if there's a good way to do that. It seems a bit
  stupid not to be able to use fast hash table because there's some
 artificial
  limit. Are there any fundamental reasons not to use the
  MemoryContextAllocHuge fix, proposed by KaiGai-san?
 
 
  If there are no objections, I will apply the patch for 2) to HEAD and
  backpatch to 9.5.
 
 Please don't rush. :-)


Please explain what rush you see?


 It is not difficult to replace palloc() with palloc_huge(); however, it may
 lead to another problem once the planner gives us a crazy estimation.
 Below is my comment from the other thread.


 Yes, I can read both threads and would apply patches for each problem.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] allowing wal_level change at run time

2015-08-19 Thread Andres Freund
On 2015-08-19 10:49:46 +0200, Magnus Hagander wrote:
 What happens the first time? Meaning I'm on wal_level=minimal and take a
 base backup. Then when the replica first connects 10 minutes later, it
 needs WAL back in time, which was logged at wal_level=minimal.

 So you'd need to bump it up whenever a base backup is done -- but then you
 can't drop it back down again or your base backup will be useless.

 Or am I missing something?

Nope. Requiring pg_basebackup to automatically create such a
'non-reserving' slot doesn't seem to be too bad to me.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] allowing wal_level change at run time

2015-08-19 Thread Andres Freund
On 2015-08-18 21:47:51 -0400, Peter Eisentraut wrote:
 On 8/18/15 1:46 PM, Andres Freund wrote:
  I don't think not requiring restarts is sufficient, having to twiddle a
  bunch of parameters manually still is a lot more effort than people see
  as necessary.
 
 I agree that we want both.  But requiring a restart is a hard stop,
 whereas making configuration easier is a soft feature.

I don't think it makes that much of a difference for people new to
postgres.

  To deal with streaming replication, we automatically create slots for
  replicas, but, by default, *without* having them reserve WAL. The slot
  name would be determined in some automatic fashion (uuid or something)
  and stored in a new file in the data directory.  That allows us to
  increase the internal wal_level to hot_standby/archive whenever a
  replica has connected (and thus a physical slot exists), and to logical
  whenever a logical slot exists.
 
 That seems kind of weird.  So every time a replication client connects,
 we create a new replication slot?  Won't they accumulate?

No, that's not what I was thinking of. The name of the slot would be
stored somewhere in the data directory, and then be reused henceforth.


 Do we need the replication slot, or could we just trigger this on the
 existence of a replication client?

I don't think that can work, because a replication connection obviously
can go away temporarily. If we'd then fall back to wal_level minimal the
standby would be cut off.
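
A tiny sketch of the decision rule described in the quoted proposal (all
names here are illustrative stand-ins, not the actual PostgreSQL code):

/* Derive the effective wal_level from which kinds of slots exist. */
typedef enum
{
    WAL_LEVEL_MINIMAL,
    WAL_LEVEL_HOT_STANDBY,
    WAL_LEVEL_LOGICAL
} SketchWalLevel;

static SketchWalLevel
effective_wal_level(int n_physical_slots, int n_logical_slots)
{
    if (n_logical_slots > 0)
        return WAL_LEVEL_LOGICAL;       /* a logical slot exists */
    if (n_physical_slots > 0)
        return WAL_LEVEL_HOT_STANDBY;   /* a replica has connected at least once */
    return WAL_LEVEL_MINIMAL;           /* nothing needs the extra WAL */
}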


 Also note that this scheme and any similar one requires merging the
 archive and hot_standby levels, which was the core of your original
 proposal [1]

It doesn't really; we could continue to keep them separate. I'm
proposing to maintain wal_level automatically, so it'd no longer be
configurable, and then it wouldn't matter much, as long as we're not less
efficient.


 , which was then objected to, which subsequently lead to
 Robert's proposal to make wal_level SIGHUP, which lead to my message,
 which lead to your message, which is similar to your initial one.
 Somehow we have to find a way break out of this circle. ;-)

My reading of the objections was that they were primarily about increasing
the amount of WAL in the default configuration, and this proposal wouldn't
do that anymore. An out-of-the-box configuration wouldn't emit more WAL
than today, until it's been used with streaming rep.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 19 August 2015 at 12:55, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 Please don't rush. :-)

 Please explain what rush you see?

Yours.  You appear to be in a hurry to apply patches that there's no
consensus on.

 It is not difficult to replace palloc() with palloc_huge(); however, it
 may lead to another problem once the planner gives us a crazy estimation.
 Below is my comment from the other thread.

  Yes, I can read both threads and would apply patches for each problem.

I don't see anything very wrong with constraining the initial allocation
to 1GB, or even less.  That will prevent consuming insane amounts of
work_mem when the planner's rows estimate is too high rather than too low.
And we do have the ability to increase the hash table size on the fly.

The real problem here is the potential integer overflow in
ExecChooseHashTableSize.  Now that we know there is one, that should be
fixed (and not only in HEAD/9.5).  But I believe Kaigai-san withdrew his
initial proposed patch, and we don't have a replacement as far as I saw.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2015-08-19 Thread Ashutosh Bapat
On Tue, Aug 4, 2015 at 2:20 PM, Shigeru Hanada shigeru.han...@gmail.com
wrote:

 Hi Ashutosh,

 Sorry for leaving the thread.

 2015-07-20 16:09 GMT+09:00 Ashutosh Bapat ashutosh.ba...@enterprisedb.com
 :
  In find_user_mapping(), if the first cache search returns a valid tuple, it
  is checked twice for validity, unnecessarily. Instead, if the first search
  returns a valid tuple, it should be returned immediately. I see that this
  code was copied from GetUserMapping(), which had the same problem as well.

 Oops.  I changed find_user_mapping to exit immediately when a valid
 cached tuple is found.


Thanks.



  In build_join_rel(), we check whether the user mappings of the two joining
  relations are the same. If the user mappings are the same, doesn't that
  imply that the foreign server and local user are the same too?

 Yes, the validity of umid is identical to serverid.  We can remove the
 check for serverid to save some cycles.  One idea is to put an Assert on
 serverid inside the if-statement block.


  Rest of the patch looks fine.

 Thanks


I started reviewing the other patches.

In patch foreign_join_v16.patch, the user mapping structure being passed to
GetConnection() is the one obtained from GetUserMappingById().
GetUserMappingById() constructs the user mapping structure from the user
mapping catalog. For public user mappings, catalog entries have InvalidOid
as userid. Thus, with this patch there is a chance that the userid in the
UserMapping being passed to GetConnection() contains InvalidOid. This is
not the case today: the UserMapping structure constructed using
GetUserMapping(Oid userid, Oid serverid) (which ultimately gets passed to
GetConnection()) has the passed-in userid and not the one in the catalog.
Is this change intentional?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Victor Wagner
On 2015.08.19 at 09:21:50 +, Albe Laurenz wrote:

   Yes, but that will only work reliably if the (read-only) standby does not
   allow connections before it is promoted.
  
  It would just take a bit more time for client and a bit more load for
  server - to make sure that this connection is read-write by
  issuing
  
  show transaction_read_only
  
  statement before considering connection useful.
 
 That's not very comfortable, and a lot of middleware software won't easily
 learn the trick.

It shouldn't be left to middleware. It should be hidden in
PQconnectPoll. This function already handles a very complicated state
transition diagram, including authentication, SSL negotiation and so on.
If we just add a couple of new states, such as CONNECTION_ASK_RW_STATUS
and CONNECTION_RW_STATUS_OK, it should be fine.

The application would just call PQconnectPoll repeatedly (either via
PQconnectdb, or explicitly when a readable/writable condition is detected on
the socket integrated into the app's event loop) until success or an
unrecoverable error is reached. How many interactions with the server
that takes is not the middleware's problem.
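
A self-contained mock of how the proposed states could slot into a
PQconnectPoll-style loop; CONNECTION_ASK_RW_STATUS and
CONNECTION_RW_STATUS_OK are the names proposed above, while everything else
is an illustrative stand-in rather than libpq code:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Only the two *proposed* state names come from the proposal above. */
typedef enum
{
    CONN_AUTH_OK,                 /* authentication finished */
    CONNECTION_ASK_RW_STATUS,     /* proposed: "show transaction_read_only" sent */
    CONNECTION_RW_STATUS_OK,      /* proposed: server answered "off", keep it */
    CONN_TRY_NEXT_HOST            /* server is a standby, move on */
} MockConnState;

typedef struct
{
    MockConnState state;
    const char   *transaction_read_only;  /* stub for the server's reply */
} MockConn;

/* One polling step; the application keeps calling this until it returns true. */
static bool
mock_connect_poll(MockConn *conn)
{
    switch (conn->state)
    {
        case CONN_AUTH_OK:
            /* authenticated: ask for the RW status before declaring success */
            conn->state = CONNECTION_ASK_RW_STATUS;
            return false;

        case CONNECTION_ASK_RW_STATUS:
            if (strcmp(conn->transaction_read_only, "off") == 0)
                conn->state = CONNECTION_RW_STATUS_OK;   /* read-write: usable */
            else
                conn->state = CONN_TRY_NEXT_HOST;        /* standby: next host */
            return false;

        default:
            return true;                                 /* terminal state */
    }
}

int
main(void)
{
    MockConn    conn = {CONN_AUTH_OK, "off"};

    while (!mock_connect_poll(&conn))
        ;
    printf("result: %s\n",
           conn.state == CONNECTION_RW_STATUS_OK ? "read-write" : "try next host");
    return 0;
}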


  It seems to me that in most cases the last host in the connect string would
  be the only host which accepts connections, so it wins anyway
 
 I'm not saying that it is particularly wide-spread and useful; it could
 happen through careless editing of connection strings or by using a
 connection service file entry
 (http://www.postgresql.org/docs/current/static/libpq-pgservice.html)
 and overriding the host parameter on the command line.


I don't think that host definitions from all possible sources
(service file, environment, command line) should be collected together,
although it is essential to be clear about when the host list is appended to
and when it is replaced.

If we implement sequential trials of all hosts, then we can start with the
last one, to provide compatibility with the existing behavior. In that case,
if the last host is online, no change occurs.

Another idea is to enable multiple-host connections only if a special
option (loadbalance or failover) is present in the connect string.



  Another idea: allow specifying a host-port pair as the argument of the host
  parameter.

host=db1.myorg.com:5432

  It is consistent with URL syntax, and system administrators are used to
  it. And with a long list of hosts there is less chance of making an error,
  as each host and its corresponding port come together.
 
 I don't think that is very attractive as it confuses the distinction between
 host and port.  What would you do with
 
host=db1.myorg.com:2345 port=1234
 
I don't think that it adds any more confusion than the simultaneous
existence of host and hostaddr, or the ability to specify host and port both
in the host part of the URL and in the query parameters:

postgresql://user@host:5432/dbname?host=otherhost&port=2345

But really, you've convinced me that the syntax with two or three (host, port
and hostaddr) parallel lists is the best, although we'll need to allow
empty entries in the lists, such as

host=master.db.com,standby1.db.com,standby2.db.com port=,2345,
hostaddr=,192.168.0.4,192.160.1.8

with the evident semantics:
take the port, host and hostaddr elements with the same index, and if some of
them are empty, behave as if only the host was given and the corresponding
parameter was not specified at all.
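
A small sketch of that pairing rule, splitting on commas and preserving
empty entries; nth_element() is a hypothetical helper, not a libpq function:

#include <stdio.h>
#include <string.h>

/* Copy the idx-th comma-separated element of list into out; empty means
 * "behave as if that parameter was not given for this host". */
static void
nth_element(const char *list, int idx, char *out, size_t outlen)
{
    const char *p = list;

    while (idx-- > 0)
    {
        p = strchr(p, ',');
        if (p == NULL)
        {
            out[0] = '\0';          /* list shorter than requested: empty */
            return;
        }
        p++;                        /* skip the comma */
    }

    {
        const char *end = strchr(p, ',');
        size_t      len = end ? (size_t) (end - p) : strlen(p);

        if (len >= outlen)
            len = outlen - 1;
        memcpy(out, p, len);
        out[len] = '\0';
    }
}

int
main(void)
{
    const char *host = "master.db.com,standby1.db.com,standby2.db.com";
    const char *port = ",2345,";
    char        h[64], pt[16];

    for (int i = 0; i < 3; i++)
    {
        nth_element(host, i, h, sizeof(h));
        nth_element(port, i, pt, sizeof(pt));
        printf("host=%s port=%s\n", h, pt[0] ? pt : "(default)");
    }
    return 0;
}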

-- 
Victor Wagner vi...@wagner.pp.ru


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning

2015-08-19 Thread David Fetter
On Wed, Aug 19, 2015 at 04:30:39PM +0900, Amit Langote wrote:
 On 2015-08-18 PM 10:43, David Fetter wrote:
 
  After the first command is done, the second command would take exclusive
  lock on table_name, scan the table to check if it contains any values
  outside the boundaries defined by FOR VALUES clause defined previously,
  throw error if so, mark as valid partition of parent if not.
  
  One small change to make this part more efficient:
  
  1. Take the access exclusive lock on table_name.
  2. Check for a matching constraint on it.
  3. If it's there, mark it as a valid partition.
  4. If not, check for values outside the boundaries as above.
  
 
 That's an interesting idea. Thanks!

I hope I'm advancing this feature rather than bogging it down...

 By a matching constraint, I guess you mean a 'valid' constraint from
 which the declared partition constraint can be proven to follow. For
  (a simple) example, from a CHECK (a >= 100 AND a < 150) on
 table_name, the partition constraint implied by FOR VALUES START
 (100) END (200) can be assumed to hold.

Well, I was assuming an exact match, but a stricter match seems like a
nice-to-have...possibly later.

  Should that be a *valid* constraint?  Perhaps that should be
  parameterized, as I'm not yet seeing a compelling argument either
  direction.  I'm picturing something like:
  
  ALTER TABLE table_name SET VALID PARTITION OF parent [TRUST]
  
  where TRUST would mean that an existing constraint need not be VALID.
 
 Hmm, I'd think this step must be able to assert the partition
 constraint beyond any doubt.  If the DBA added the constraint and
 marked it invalid, she should first VALIDATE the constraint to make
 it valid by performing whatever steps necessary before. IOW, a full
 heap scan at least once is inevitable (the reason why we might want
 to make this a two step process at all). Am I missing something?

There are use cases where we need to warn people that their assertions
need to be true, and if those assertions are not true, this will
explode, leaving them to pick the resulting shrapnel out of their
faces.  There are other parts of the system where this is true, as
when people write UDFs in C.

As I understand it, NOT VALID means: I assert that the tuples already
here fit the constraint.  Any changes will be checked against the
constraint.

I've seen cases where a gigantic amount of data is coming out of some
distributed system which holds the constraint as an invariant.  This
let a DBA decide to add a NOT VALID constraint in order not to take
the hit of a second full scan of the data, which might have made the
import, and possibly the entire project, untenable.

See above.

  5. Detach partition
 
  ALTER TABLE partitioned_table
  DETACH PARTITION partition_name [USING table_name]
 
  This removes partition_name as partition of partitioned_table.
  The table continues to exist with the same name or 'table_name',
  if specified.  pg_class.relispartition is set to false for the
  table, so it behaves like a normal table.
  
  Could this take anything short of an access exclusive lock on the
  parent?
 
 Yes, both the step 1 of ATTACH command and DETACH command take
 access exclusive lock on the parent. They are rather quick metadata
 changes, so should not stall others significantly, I think.

So no.  Weakening required locks has been something of an ongoing
project, project-wide, and need not be part of the first cut of this
long-needed feature.

Thanks so much for working on this!

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Andres Freund
On 2015-08-19 09:41:32 -0400, Tom Lane wrote:
 In fact, they'd still need to use DNS balancing for Postgres,
 because not everything connects with libpq (think JDBC for instance).

It already does support this though.

https://jdbc.postgresql.org/documentation/head/connect.html :

 Connection Fail-over
 
 To support simple connection fail-over it is possible to define multiple
 endpoints (host and port pairs) in the connection url separated by
 commas. The driver will try to once connect to each of them in order
 until the connection succeeds. If none succeed, a normal connection
 exception is thrown.
 
 The syntax for the connection url is:
 
 jdbc:postgresql://host1:port1,host2:port2/database


 So I think we ought to reject this proposal, full stop.  I see no
 reason to re-invent this wheel, and there are good reasons not to.

I don't really buy this argument. Allowing connections to several
endpoints isn't exactly new tech either. A lot of database connectors
do support something very close to the above pgjdbc feature.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Simon Riggs
On 12 June 2015 at 00:29, Tomas Vondra tomas.von...@2ndquadrant.com wrote:


 I see two ways to fix this:

 (1) enforce the 1GB limit (probably better for back-patching, if that's
 necessary)

 (2) make it work with hash tables over 1GB

 I'm in favor of (2) if there's a good way to do that. It seems a bit
 stupid not to be able to use fast hash table because there's some
 artificial limit. Are there any fundamental reasons not to use the
 MemoryContextAllocHuge fix, proposed by KaiGai-san?


If there are no objections, I will apply the patch for 2) to HEAD and
backpatch to 9.5.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] More WITH

2015-08-19 Thread David Fetter
On Tue, Aug 18, 2015 at 11:23:32PM -0400, Tom Lane wrote:
 Craig Ringer cr...@2ndquadrant.com writes:
  On 18 August 2015 at 01:18, David Fetter da...@fetter.org wrote:
  FETCH [in WITH]
 
  I'd be a huge fan of this one. I'd love to see FETCH in
  subqueries, too. Currently doing anything like this requires an
  ugly PL/PgSQL wrapper.
 
  The cursor would have to be known at plan-time so it could be
  interrogated for its types.
 
 That's barely the tip of the iceberg of the problems with this idea.
 
 How many rows would be fetched from the cursor?  What row would it
 be left on?  Whatever answer you give will be wrong from some
 perspective, but particularly that of giving the planner any
 freedom-of-action to optimize such a query.
 
 More generally, what would you hope to accomplish with such a
 construct that wouldn't be better done by writing the cursor's
 underlying query directly in the WITH clause?

So FETCH is not a good candidate for inclusion in WITH, at least until
someone comes up with some meaningful definition of what this would
mean.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [patch] psql tab completion for grant execute

2015-08-19 Thread Daniel Verite
Robert Haas wrote:

  A trivial patch is attached. It adds the condition that if EXECUTE is
  preceded by GRANT, itself preceded by nothing, then the completion
  with PROCEDURE is skipped.
 
 Thanks, I committed this.  I don't think we usually back-patch tab
 completion fixes, but I back-patched this one to 9.5 anyway, so that
 it gets out into the wild a little sooner.

Thanks! I had added it to the CF earlier; I've taken the liberty of marking
it as committed by you:

https://commitfest.postgresql.org/6/329/


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] allowing wal_level change at run time

2015-08-19 Thread Simon Riggs
On 18 August 2015 at 18:46, Andres Freund and...@anarazel.de wrote:


 ISTM that it's not too hard to
 a) make archive_mode PGC_SIGHUP
 b) make wal_level PGC_SIGHUP


+1


 c) automatically increase wal_level to logical whenever a logical
replication slot is defined


-1

It would be easier to just have wal_level = logical always, but allow it to
be set lower if desired.

Increasing wal_level dynamically might otherwise happen too late.


 it seems considerably harder to

 d) make max_wal_senders PGC_SIGHUP
 e) make max_replication_slots PGC_SIGHUP

 because they need shmem, locks, and everything.


 Therefore I propose something slightly different:

 We change the default of max_wal_senders, max_replication_slots, to some
 reasonably high setting


Agreed, I suggest 8 as the default for each.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Declarative partitioning

2015-08-19 Thread Simon Riggs
On 18 August 2015 at 18:31, Josh Berkus j...@agliodbs.com wrote:


  2. Creating a partition of a partitioned table
 
  CREATE TABLE table_name
  PARTITION OF partitioned_table_name
  FOR VALUES values_spec;
 
  Where values_spec is:
 
  listvalues: [IN] (val1, ...)
 
  rangevalues: START (col1min, ... ) END (col1max, ... )
 | START (col1min, ... )
 | END (col1max, ... )

 So, one thing I missed in here is anything about automated partitioning
 of tables; that is, creating new partitions based on incoming data or a
 simple statement which doesn't require knowledge of the partitioning
 scheme.  It's possible (and entirely acceptable) that you're
 considering automated partition creation outside of the scope of this
 patch.


I would like to make automatic partitioning outside the scope of this first
patch.


However, for range partitions, it would be *really* useful to
 have this syntax:

 CREATE NEXT PARTITION ON parent_table;

 Which would just create the next partition based on whatever the range
 partitioning scheme is, instead of requiring the user to calculate start
 and end values which might or might not match the parent partitioning
 scheme, and might leave gaps.  Also this would be useful for range
 partitions:

 CREATE PARTITION ON parent_table USING ( start_value );

 ... where start_value is the start range of the new partition.  Again,
 easier for users to get correct.

 Both of these require the idea of regular intervals for range
 partitions, that is, on a table partitioned by month on a timestamptz
 column, each partition will have the range [ month:1, nextmonth:1 ).
 This is the most common use-case for range partitions (like, 95% of all
 partitioning cases I've seen), so a new partitioning scheme ought to
 address it.

 While there are certainly users who desire the ability to define
 arbitrary ranges for each range partition, these are by far the minority
 and could be accommodated by a different path with more complex syntax.
 Further, I'd wager that most users who want to define arbitrary ranges
 for range partitions aren't going to be satisfied with the other
 restrictions on declarative partitioning (e.g. same constraints, columns
 for all partitions) and are going to use inheritance partitioning anyway.


I like the idea of a regular partitioning step because it is how you design
such tables - let's use monthly partitions.

This gives sanely terse syntax, rather than specifying pages and pages of
exact values in DDL:

   PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' ) START
WITH value;

borrowing the same concepts from sequence syntax.

  Creating indexes on the parent is not allowed; they should be defined on
  (leaf) partitions. Because of this limitation, primary keys are not allowed
  on a partitioned table. Perhaps, we should be able to just create a dummy
  entry somewhere to represent an index on the parent (which every partition
  then copies.)

 This would be preferable, yes.  Making users remember to manually create
 indexes on each partition is undesirable.


I think it is useful to allow additional indexes on partitions, if desired,
but we should always automatically build the indexes that are defined on
the master when we create a new partition.

Presumably unique indexes will be allowed on partitions. So if the
partition key is unique, we can say the whole partitioned table is unique
and call that a Primary Key.


I would want individual partitions to be placed on separate tablespaces,
but not by default.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Kohei KaiGai
2015-08-19 21:29 GMT+09:00 Simon Riggs si...@2ndquadrant.com:
 On 19 August 2015 at 12:55, Kohei KaiGai kai...@kaigai.gr.jp wrote:

 2015-08-19 20:12 GMT+09:00 Simon Riggs si...@2ndquadrant.com:
  On 12 June 2015 at 00:29, Tomas Vondra tomas.von...@2ndquadrant.com
  wrote:
 
 
  I see two ways to fix this:
 
  (1) enforce the 1GB limit (probably better for back-patching, if that's
  necessary)
 
  (2) make it work with hash tables over 1GB
 
  I'm in favor of (2) if there's a good way to do that. It seems a bit
  stupid not to be able to use fast hash table because there's some
  artificial
  limit. Are there any fundamental reasons not to use the
  MemoryContextAllocHuge fix, proposed by KaiGai-san?
 
 
  If there are no objections, I will apply the patch for 2) to HEAD and
  backpatch to 9.5.
 
  Please don't rush. :-)


 Please explain what rush you see?

Unless we have a fail-safe mechanism for the case where the planner
estimates a far larger number of tuples than are actually needed, a strange
estimation will consume a massive amount of RAM. That's a bad side effect.
My previous patch didn't pay attention to that scenario, so the patch needs
to be revised.

Thanks,

 It is not difficult to replace palloc() with palloc_huge(); however, it may
 lead to another problem once the planner gives us a crazy estimation.
 Below is my comment from the other thread.


  Yes, I can read both threads and would apply patches for each problem.

 --
 Simon Riggs                http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent

2015-08-19 Thread Andres Freund
On 2015-08-18 20:36:13 -0400, Tom Lane wrote:
 I wrote:
  Just thinking about this ... I wonder why we need to call
  TransactionIdIsInProgress() at all rather than believing the answer from
  the snapshot?  Under what circumstances could TransactionIdIsInProgress()
  return true where XidInMVCCSnapshot() had not?
 
 I experimented with the attached patch, which replaces
 HeapTupleSatisfiesMVCC's calls of TransactionIdIsInProgress with
 XidInMVCCSnapshot, and then as a cross-check has all the return false
 exits from XidInMVCCSnapshot assert !TransactionIdIsInProgress().

I'm not sure about it, but it might be worthwhile to add a
TransactionIdIsKnownCompleted() check before the more expensive parts of
XidInMVCCSnapshot(). Neither the array search nor, much more so, the
subtrans lookups are free.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning

2015-08-19 Thread Simon Riggs
On 18 August 2015 at 11:30, Amit Langote langote_amit...@lab.ntt.co.jp
wrote:


 There is no need to define tuple routing triggers. CopyFrom() and
 ExecInsert() determine target partition just before performing
 heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and
 constraints (on parent) are executed for tuple before being routed to a
 partition. If no partition can be found, it's an error.

 Because row-level AFTER triggers need to save ItemPointers in trigger
 event data and defining triggers on partitions (which is where tuples
 really go) is not allowed, I could not find a straightforward way to
 implement them. So, perhaps we should allow (only) row-level AFTER
 triggers on partitions or think of modifying trigger.c to know about this
 twist explicitly.


I think tables will eventually need FK support; it's not sustainable as a
long-term restriction, though perhaps it's something we can do in a later
patch.

You haven't specified what would happen if an UPDATE would change a row's
partition. I'm happy to add this to the list of restrictions by saying that
the partition key cannot be updated.

We'll need regression tests that cover each restriction and docs that
match. This is not something we should leave until last. People read the
docs to understand the feature, helping them to reach consensus, so it falls
to you to provide the docs up front, not wait until later. I will begin a
code review once you tell me docs and tests are present. We all want the
feature, so it's all about the details now.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

2015-08-19 Thread Kevin Grittner
Kouhei Kaigai kai...@ak.jp.nec.com wrote:

 we may need a couple of overhauls around HashJoin to support large
 amounts of data, not only nbuckets around 0x8000.

Perhaps, but this is a clear bug, introduced to the 9.5 code, with
an obvious fix; so I've pushed the change from 1 to 1L on that left
shift.  There was clearly an attempt in surrounding code to deal
with size overruns by using a long, applying a min(), and casting
to int; this one statement just missed a trick.  If we need to
further constrain sizes to keep within allocation limits, that
seems like an entirely separate patch.
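
For illustration only (none of this is PostgreSQL source; the shift count
and the clamp value are made-up stand-ins), the pattern being restored is:
do the shift in a 64-bit type, clamp, then cast:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    int     shift = 33;     /* plausible with a crazy row estimate */

    /*
     * Shifting the int literal 1 by 31 or more overflows a 32-bit int
     * (formally undefined behaviour), which is the kind of overflow behind
     * the assertion failure discussed here.  The committed fix uses 1L;
     * int64_t is used in this demo to keep it portable.
     */
    int64_t lbuckets = (int64_t) 1 << shift;
    int64_t max_pointers = 64 * 1024;           /* illustrative clamp */
    int     nbuckets;

    if (lbuckets > max_pointers)
        lbuckets = max_pointers;                /* clamp before narrowing */
    nbuckets = (int) lbuckets;

    printf("nbuckets = %d\n", nbuckets);
    return 0;
}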

Thanks for finding and analyzing this and providing a patch!


--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Tom Lane
Albe Laurenz laurenz.a...@wien.gv.at writes:
 Victor Wagner wrote:
 It would just take a bit more time for client and a bit more load for
 server - to make sure that this connection is read-write by
 issuing
 show transaction_read_only
 statement before considering connection useful.

 That's not very comfortable, and a lot of middleware software won't easily
 learn the trick.

That sort-of ties into what seems to me the main objection to this
proposal, namely that there is already a way to do this sort of thing:
DNS-based load balancing.  All the clients think they connect to
db.mycompany.com, but which server they actually get is determined by
what IP address the DNS server tells them to use.

This is a technology that is very well established, known to every
large-site admin, and usable for every Internet-based service.  Even if
libpq had its own nonstandard way of doing something similar, the site
admins would probably still need to use DNS load balancing for other
services.  In fact, they'd still need to use DNS balancing for Postgres,
because not everything connects with libpq (think JDBC for instance).

So I think we ought to reject this proposal, full stop.  I see no
reason to re-invent this wheel, and there are good reasons not to.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent

2015-08-19 Thread Tom Lane
Andres Freund and...@anarazel.de writes:
 On 2015-08-18 20:36:13 -0400, Tom Lane wrote:
 I experimented with the attached patch, which replaces
 HeapTupleSatisfiesMVCC's calls of TransactionIdIsInProgress with
 XidInMVCCSnapshot, and then as a cross-check has all the return false
 exits from XidInMVCCSnapshot assert !TransactionIdIsInProgress().

 I'm not sure about it, but it might be worthwhile to add a
 TransactionIdIsKnownCompleted() check before the more expensive parts of
 XidInMVCCSnapshot(). Neither the array search nor, much more so, the
 subtrans lookups are free.

Hmmm... the comment for TransactionIdIsKnownCompleted says it's to
short-circuit calls of TransactionIdIsInProgress, which we wouldn't be
doing anymore.  Maybe it's useful anyway but I'm not convinced.

In any case, the big picture here is that XidInMVCCSnapshot should on
average be looking at just about the same number of xids and subtrans ids
as TransactionIdIsInProgress does --- only the latter is looking at them
in the PGPROC array, so it needs a lock, and iterating over that data
structure is more complex than scanning an array too.

My own thought about reducing the cost of XidInMVCCSnapshot, if that
proves necessary, is that maybe it would be worth the trouble to sort the
arrays so we could use binary search.  That would increase the cost of
snapshot acquisition noticeably though.
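
A minimal sketch of that idea, assuming the snapshot's xid array were kept
sorted and ignoring xid wraparound ordering (which real code would have to
handle with TransactionIdPrecedes); the type and function names here are
illustrative, not the actual SnapshotData layout:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t SketchXid;

/* Illustrative stand-in for a snapshot whose xid array is kept sorted. */
typedef struct
{
    SketchXid  *xip;        /* sorted ascending */
    int         xcnt;
} SortedSnapshot;

/* Binary search instead of the linear scan used today. */
static bool
xid_in_sorted_snapshot(SketchXid xid, const SortedSnapshot *snap)
{
    int     lo = 0;
    int     hi = snap->xcnt - 1;

    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;
        SketchXid   probe = snap->xip[mid];

        if (probe == xid)
            return true;
        if (probe < xid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}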

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] allowing wal_level change at run time

2015-08-19 Thread Andres Freund
On 2015-08-19 17:51:47 +0200, Magnus Hagander wrote:
 That's doable - but what about manual base backups? And if they don't go
 away, what about the ones that are generated by the
 nightly/weekly/hourly/whatever pg_basebackup -x ones?

Good questions. I guess we could just make do_pg_start_backup() elevate
the WAL level (using a fixed slot name or some other mechanism) the first
time it's run, until there's explicit admin action.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent

2015-08-19 Thread Tom Lane
I wrote:
 Andres Freund and...@anarazel.de writes:
 I'm not sure about it, but it might be worthwhile to add a
 TransactionIdIsKnownCompleted() check before the more expensive parts of
 XidInMVCCSnapshot(). Neither the array search nor, much more so, the
 subtrans lookups are free.

 Hmmm... the comment for TransactionIdIsKnownCompleted says it's to
 short-circuit calls of TransactionIdIsInProgress, which we wouldn't be
 doing anymore.  Maybe it's useful anyway but I'm not convinced.

After further thought, the right way to implement the equivalent
optimization would be to add a couple of fields to struct Snapshot that
would cache the xid last checked against that snapshot and the outcome of
that check; this would be independent of TransactionIdIsKnownCompleted.
There would be no need to use that cache for xids below xmin or above
xmax, which would improve its chance of being applicable to in-doubt xids.

Not entirely sure it's worth doing, but if someone wants to do the
legwork, this would be an independent optimization possibility.
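
A rough sketch of such a cache, with illustrative names rather than the real
SnapshotData fields, and ignoring xid wraparound ordering:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t SketchXid;

/* Illustrative snapshot with the two proposed cache fields bolted on. */
typedef struct
{
    SketchXid   xmin;
    SketchXid   xmax;
    SketchXid   cachedXid;          /* last in-doubt xid we looked up */
    bool        cachedInSnapshot;   /* ... and what the answer was */
} CachingSnapshot;

/* Placeholder for the expensive path (xip array search, subtrans lookups). */
static bool
slow_xid_in_snapshot(SketchXid xid, const CachingSnapshot *snap)
{
    (void) xid;
    (void) snap;
    return true;                    /* stub: pretend the xid is still running */
}

static bool
xid_in_snapshot_cached(SketchXid xid, CachingSnapshot *snap)
{
    /* cheap range checks first; the cache is only for in-doubt xids */
    if (xid < snap->xmin)
        return false;               /* completed before the snapshot */
    if (xid >= snap->xmax)
        return true;                /* started after the snapshot */

    if (xid != snap->cachedXid)
    {
        snap->cachedXid = xid;
        snap->cachedInSnapshot = slow_xid_in_snapshot(xid, snap);
    }
    return snap->cachedInSnapshot;
}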

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] how to write/setup a C trigger function in a background worker

2015-08-19 Thread jacques klein
I would like to execute a trigger function (written in C) in one of my
background workers.

I couldn't figure out how to do that, or even whether it's possible.

Currently my trigger function is called, but in the server process
connected to the client that does the insert into ... .

Any hint/help is welcome.

Jacques K.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Simon Riggs
On 19 August 2015 at 14:46, Andres Freund and...@anarazel.de wrote:

 On 2015-08-19 09:41:32 -0400, Tom Lane wrote:
  In fact, they'd still need to use DNS balancing for Postgres,
  because not everything connects with libpq (think JDBC for instance).

 It already does support this though.

 https://jdbc.postgresql.org/documentation/head/connect.html :

  Connection Fail-over
 
  To support simple connection fail-over it is possible to define multiple
  endpoints (host and port pairs) in the connection url separated by
  commas. The driver will try to once connect to each of them in order
  until the connection succeeds. If none succeed, a normal connection
  exception is thrown.
 
  The syntax for the connection url is:
 
  jdbc:postgresql://host1:port1,host2:port2/database


When we discussed this feature at the Dev Meeting in 2014, I thought we
agreed that allowing multiple hosts in the connection string would be OK.

+1 for bringing the jdbc driver URI syntax into libpq, so that all
interfaces can optionally be specified this way. This doesn't preclude the
use of ipfailover; in fact they might work well together. If you don't
like it, don't use it.

I think we do need some way of saying that a readonly connection is OK. So
the default would be to connect to each in turn until we find the master.
It should keep retrying for a period of time since for a short period it is
possible there is no master. If you specify readonly, then a connection to
a standby is acceptable and it will stop there.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

2015-08-19 Thread Tom Lane
Kevin Grittner kgri...@ymail.com writes:
 Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 we may need a couple of overhaul around HashJoin to support large
 size of data, not only nbuckets around 0x8000.

 Perhaps, but this is a clear bug, introduced to the 9.5 code, with
 an obvious fix; so I've pushed the change from 1 to 1L on that left
 shift.

I don't think it's anywhere near as clear as you think.  The preceding
lines should have limited nbuckets to be no more than INT_MAX/2, so how
could an overflow occur there?  (The result of 1<<mylog() should be
at most 0x4000 AFAICS.)  If overflow is possible, how will s/1/1L/
make it better on machines where int and long are the same size?

And on machines where long is wider than int, you've still got a bug
if the result of the left shift somehow overflowed an int, because
it's going to be assigned to nbuckets which is an int.

So I think the 1 coding was just fine in context.  If there is an
overflow in ExecChooseHashTableSize, it's somewhere else.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Victor Wagner
On 2015.08.19 at 15:35:17 +0100, Simon Riggs wrote:

 
 I think we do need some way of saying that a readonly connection is OK. So

I had such a thing in my proposal (a boolean parameter, readonly),
but haven't yet checked whether it is compatible with the jdbc syntax.

 the default would be to connect to each in turn until we find the master.
 It should keep retrying for a period of time since for a short period it is
 possible there is no master. If you specify readonly, then a connection to

It is a very important addition: to specify that if no host is able to
establish a read-write session, we should retry and give the server
administrators a chance to promote one of the standbys to master. Probably
there should be an additional timeout parameter (we have
connection_timeout, and this would be failover_timeout) with some
reasonable default.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

2015-08-19 Thread Tom Lane
I wrote:
 I don't think it's anywhere near as clear as you think.

Ah, scratch that --- I was looking at the wrong my_log2() call.
-ENOCAFFEINE.

I'm still doubtful that this is the only overflow risk in that new
ExecChooseHashTableSize code, though.  For instance, the only reason the
line immediately above this one isn't broken is that NTUP_PER_BUCKET is 1
... but surely we shouldn't be coding on that assumption.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent

2015-08-19 Thread Simon Riggs
On 19 August 2015 at 16:21, Tom Lane t...@sss.pgh.pa.us wrote:

 I wrote:
  Andres Freund and...@anarazel.de writes:
  I'm not sure about it, but it might be worthwhile to add a
  TransactionIdIsKnownCompleted() check before the more expensive parts of
  XidInMVCCSnapshot(). Neither the array search nor, much more so, the
  subtrans lookups are free.

  Hmmm... the comment for TransactionIdIsKnownCompleted says it's to
  short-circuit calls of TransactionIdIsInProgress, which we wouldn't be
  doing anymore.  Maybe it's useful anyway but I'm not convinced.

 After further thought, the right way to implement the equivalent
 optimization would be to add a couple of fields to struct Snapshot that
 would cache the xid last checked against that snapshot and the outcome of
 that check; this would be independent of TransactionIdIsKnownCompleted.
 There would be no need to use that cache for xids below xmin or above
 xmax, which would improve its chance of being applicable to in-doubt xids.

 Not entirely sure it's worth doing, but if someone wants to do the
 legwork, this would be an independent optimization possibility.


This is what I suggested upthread.  It's not a massive gain, but it's cheap
and a straightforward patch.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] allowing wal_level change at run time

2015-08-19 Thread Magnus Hagander
On Wed, Aug 19, 2015 at 3:34 PM, Andres Freund and...@anarazel.de wrote:

 On 2015-08-19 10:49:46 +0200, Magnus Hagander wrote:
  What happens the first time? Meaning I'm on wal_level=minimal and take
 a
  base backup. Then when the replica first connects 10 minutes later, it
  needs WAL back in time, which was logged at wal_level=minimal.

  So you'd need to bump it up whenever a base backup is done -- but then
 you
  can't drop it back down again or your base backup will be useless.

  Or am I missing something?

 Nope. Requiring pg_basebackup to automatically create such a
 'non-reserving' slot doesn't seem to be too bad to me.


That's doable - but what about manual base backups? And if they don't go
away, what about the ones that are generated by the
nightly/weekly/hourly/whatever pg_basebackup -x ones?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Simon Riggs
On 19 August 2015 at 14:53, Tom Lane t...@sss.pgh.pa.us wrote:

 Simon Riggs si...@2ndquadrant.com writes:
  On 19 August 2015 at 12:55, Kohei KaiGai kai...@kaigai.gr.jp wrote:
  Please don't rush. :-)

  Please explain what rush you see?

 Yours.  You appear to be in a hurry to apply patches that there's no
 consensus on.


I think that comment is unreasonable.

The problem was reported 2 months ago; following independent confirmation
of the proposed patch, I suggested committing it, with these words:

If there are no objections, I will apply the patch for 2) to HEAD and
backpatch to 9.5.

I was clearly waiting for objections before acting, to test whether there
was consensus or not.

Please explain what procedure you would like committers to follow?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Tatsuo Ishii
 Here we are discussing load-balancing on the client level, not on the
 statement level.

I see.

 Suppose that we have 100 readonly clients and 3 standby servers + master.
 If all clients specify all four servers in their connect strings,
 and connect randomly to them, each server would have approximately 25
 clients.
 
 But once connection is established, each client works with one
 server (at least until communication failure occurs and it would call
 PQreset. In this case it has to reprepare statements anyway).

One downside of this is that if one of the standby servers is not
responding, clients will be blocked by that server every time before
giving up the connection attempt. This could last for hours (for
example, if the network cable is unplugged). I think round-robin DNS is
better because the DNS server will drop the entry corresponding to the
broken server (or any solution which has a similar capability). After all,
this type of client-side solution is not very stable in a real-world
environment IMO (I heard the same opinion regarding HAProxy).

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

2015-08-19 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:

 I'm still doubtful that this is the only overflow risk in that
 new ExecChooseHashTableSize code, though.

KaiGai already pointed that out on this thread and I completely
agree; but I figured that I might as well fix, with an obvious patch, the
clear bug that was causing an assertion failure during 9.5 testing, pending
analysis and fixes for the other hazards.  Getting past the assertion
failure might help identify some other bugs in the area.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Shulgin, Oleksandr
On Wed, Aug 19, 2015 at 4:45 PM, Victor Wagner vi...@wagner.pp.ru wrote:

 On 2015.08.19 at 15:35:17 +0100, Simon Riggs wrote:

 
  I think we do need some way of saying that a readonly connection is OK.
 So

 I had such a thing in my proposal (a boolean parameter, readonly),
 but haven't yet checked whether it is compatible with the jdbc syntax.

  the default would be to connect to each in turn until we find the master.
  It should keep retrying for a period of time since for a short period it
 is
  possible there is no master. If you specify readonly, then a connection
 to

 It is a very important addition: to specify that if no host is able to
 establish a read-write session, we should retry and give the server
 administrators a chance to promote one of the standbys to master. Probably
 there should be an additional timeout parameter (we have
 connection_timeout, and this would be failover_timeout) with some
 reasonable default.


Are we going to put support for every existing and new jdbc feature into
libpq?  One day they might want to add another parameter, e.g. the number
of retries before failing ultimately (hm, and probably, delay between
retries).  Should we already prepare for that?

I believe a good library should provide all the building blocks instead of
trying to envision every possible use case and incorporate them as
convenience functions.  Everything described above can be implemented in
terms of existing libpq features rather easily.  Not to mention that the
proposed approach doesn't scale really well, IMO: once you have incorporated
all your database hosts in the client's connection string, you need
additional steps to maintain this list on the app configuration side.
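
For example, roughly, using only existing libpq calls (the helper name and
the idea of passing an array of connection strings are made up here, and
error handling is minimal):

#include <stdbool.h>
#include <string.h>
#include <libpq-fe.h>

/*
 * Try each conninfo in turn and keep the first connection that reports
 * transaction_read_only = off, i.e. a read-write server.  Only existing
 * libpq calls are used; connect_rw() itself is just an illustrative helper.
 */
static PGconn *
connect_rw(const char *const conninfos[], int n)
{
    for (int i = 0; i < n; i++)
    {
        PGconn *conn = PQconnectdb(conninfos[i]);

        if (PQstatus(conn) == CONNECTION_OK)
        {
            PGresult   *res = PQexec(conn, "show transaction_read_only");
            bool        rw = (PQresultStatus(res) == PGRES_TUPLES_OK &&
                              strcmp(PQgetvalue(res, 0, 0), "off") == 0);

            PQclear(res);
            if (rw)
                return conn;        /* found a read-write server */
        }
        PQfinish(conn);             /* unreachable or a standby: try the next */
    }
    return NULL;
}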

And the fact that a lot of other db connector libraries do this in one way
or another isn't actually an argument in favor of the feature, at least
not for me.

--
Alex


Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-19 Thread Andres Freund
Hi,

 On Wed, Aug 5, 2015 at 10:59 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 OK, committed.

I spent some time today reviewing the committed patch. So far my only
major complaint is that I think the comments insufficiently document
the approach taken:
Stuff like avoiding ABA-type problems by clearing the list entirely, and
it being impossible for entries to end up on the list too early, absolutely
needs to be documented explicitly.

I think I found a few minor correctness issues. Mostly around the fact
that we, so far, tried to use semaphores in a way that copes with
unrelated unlocks arriving. I actually think I removed all of the
locations that caused that to happen, but I don't think we should rely
on that fact. Looking at the following two pieces of code:
/* If the list was not empty, the leader will clear our XID. */
if (nextidx != INVALID_PGPROCNO)
{
/* Sleep until the leader clears our XID. */
while (pg_atomic_read_u32(&proc->nextClearXidElem) !=
INVALID_PGPROCNO)
{
extraWaits++;
PGSemaphoreLock(&proc->sem);
}

/* Fix semaphore count for any absorbed wakeups */
while (extraWaits-- > 0)
PGSemaphoreUnlock(&proc->sem);
return;
}
...
/*
 * Now that we've released the lock, go back and wake everybody up.  We
 * don't do this under the lock so as to keep lock hold times to a
 * minimum.  The system calls we need to perform to wake other processes
 * up are probably much slower than the simple memory writes we did 
while
 * holding the lock.
 */
while (wakeidx != INVALID_PGPROCNO)
{
PGPROC  *proc = allProcs[wakeidx];

wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);

if (proc != MyProc)
PGSemaphoreUnlock(&proc->sem);
}

There's a bunch of issues with those two blocks afaics:

1) The first block (in one backend) might read INVALID_PGPROCNO before
   ever locking the semaphore if a second backend quickly enough writes
   INVALID_PGPROCNO. That way the semaphore will be permanently out of
   balance.

2) There's no memory barriers around dealing with nextClearXidElem in
   the first block. Afaics there needs to be a read barrier before
   returning, otherwise it's e.g. not guaranteed that the woken up
   backend sees its own xid set to InvalidTransactionId.

3) If a follower returns before the leader has actually finished waking
   that respective backend up we can get into trouble:

   Consider what happens if such a follower enqueues in another
   transaction. It is not, as far as I could find out, guaranteed on all
   types of cpus that a third backend can actually see nextClearXidElem
   as INVALID_PGPROCNO. That'd likely require SMT/HT cores and multiple
   sockets. If the write to nextClearXidElem is entered into the local
   store buffer (leader #1) a hyper-threaded process (leader #2) can
   possibly see it (store forwarding) while another backend doesn't
   yet.

   I think this is very unlikely to be an actual problem due to
   independent barriers until enqueued again, but I don't want to rely
   on it undocumentedly. It seems safer to replace
   +wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
   +pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);
   with a pg_atomic_exchange_u32().


I think to fix these ProcArrayGroupClearXid() should use a protocol
similar to lwlock.c. E.g. make the two blocks somethign like:
while (wakeidx != INVALID_PGPROCNO)
{
PGPROC  *proc = allProcs[wakeidx];

wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);

/* ensure that all previous writes are visible before follower 
continues */
pg_write_barrier();

proc-lwWaiting = false;

if (proc != MyProc)
PGSemaphoreUnlock(proc-sem);
}
and
    if (nextidx != INVALID_PGPROCNO)
    {
        Assert(!MyProc->lwWaiting);

        for (;;)
        {
            /* acts as a read barrier */
            PGSemaphoreLock(&MyProc->sem);
            if (!MyProc->lwWaiting)
                break;
            extraWaits++;
        }

        Assert(pg_atomic_read_u32(&proc->nextClearXidElem) == INVALID_PGPROCNO);

        /* Fix semaphore count for any absorbed wakeups */
        while (extraWaits-- > 0)
            PGSemaphoreUnlock(&proc->sem);
        return;
    }
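
For what it's worth, here is a minimal sketch of the exchange-based variant
suggested above for the wakeup loop -- a fragment in the style of the quoted
blocks, assuming the ProcArrayGroupClearXid() context and the
pg_atomic_exchange_u32() API from port/atomics.h; illustration only, not the
committed code:

    while (wakeidx != INVALID_PGPROCNO)
    {
        PGPROC     *proc = &allProcs[wakeidx];

        /*
         * Read the next link and reset it to INVALID_PGPROCNO in a single
         * atomic step, so a follower that immediately re-enqueues in another
         * transaction cannot observe a stale value via store forwarding.
         */
        wakeidx = pg_atomic_exchange_u32(&proc->nextClearXidElem,
                                         INVALID_PGPROCNO);

        if (proc != MyProc)
            PGSemaphoreUnlock(&proc->sem);
    }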

Going through the patch:

Re: [HACKERS] Bug? ExecChooseHashTableSize() got assertion failed with crazy number of rows

2015-08-19 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kgri...@ymail.com writes:
 Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 we may need a couple of overhaul around HashJoin to support large
 size of data, not only nbuckets around 0x8000.

 Perhaps, but this is a clear bug, introduced to the 9.5 code, with
 an obvious fix; so I've pushed the change from 1 to 1L on that left
 shift.

 I don't think it's anywhere near as clear as you think.  The preceding
 lines should have limited nbuckets to be no more than INT_MAX/2, so how
 could an overflow occur there?

There is a 1 shifted for nbuckets that I didn't touch because it
could not overflow. The limits on nbuckets are not applied at the point
of the bug.

 (The result of 1<<mylog() should be
 at most 0x4000 AFAICS.)  If overflow is possible, how will s/1/1L/
 make it better on machines where int and long are the same size?

In such cases there is not a bug -- at least not with my_log
returning less than 63.  In those cases the change will make no
difference.

 And on machines where long is wider than int, you've still got a bug
 if the result of the left shift somehow overflowed an int, because
 it's going to be assigned to nbuckets which is an int.

You seem to have missed reading the line in between:

| lbuckets = 1L << my_log2(hash_table_bytes / bucket_size);
| lbuckets = Min(lbuckets, max_pointers);
| nbuckets = (int) lbuckets;

There is a Min() and a cast to sort that out.
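
To make the distinction concrete, here is a small standalone sketch (not
PostgreSQL code; the shift count is simply hard-coded to a value my_log2()
could plausibly return for a huge hash_table_bytes/bucket_size) showing why
the type of the shifted constant matters on LP64 platforms, where long is
64 bits but int is 32 bits:

    #include <stdio.h>

    int
    main(void)
    {
        int     k = 31;         /* plausible my_log2() result for a huge input */

        /*
         * With a plain int constant the shift lands in the sign bit (formally
         * undefined behavior; in practice a large negative value) ...
         */
        int     nbuckets_bad = 1 << k;

        /*
         * ... while a long constant keeps the full value on LP64, and the
         * following Min()/cast can clamp it to something sane.
         */
        long    lbuckets = 1L << k;

        printf("1  << %d as int  = %d\n", k, nbuckets_bad);
        printf("1L << %d as long = %ld\n", k, lbuckets);
        return 0;
    }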

 So I think the 1 coding was just fine in context.

I disagree.

 If there is an overflow in ExecChooseHashTableSize, it's
 somewhere else.

I do believe that there are other places we need to place limits,
to avoid attempts to allocate more than 1GB, but I think patches to
address that should be separate from fixing this particular bug.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread David Fetter
On Wed, Aug 19, 2015 at 07:15:30AM +, Laurenz Albe wrote:
 Victor Wagner wrote:
  I wonder how useful this is at the present time.
 
 Maybe a better idea would be:
   host=db1.myorg.com,db2.myorg.com port=5432,2345

I think if we're going to provide multiple sets of connection info, we
should just do that rather than trying to piece them together from
constituent parts, where the former looks like:

host=service=foo 
sslmode=require,postgresql://bar.baz/mydb?sslmode=require,host=quux.corge 
user=grault port=6433

As far as I can tell, the only way a comma could sneak into these
strings is if it were in a database name or similarly bizarre spot, in
which case the usual quoting needed to handle it in general should
handle it here.

It's not clear to me that libpq is the correct place to add this
feature, as we have fairly large user bases--the JDBC world, for
example--that don't use it.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent

2015-08-19 Thread Andres Freund
On 2015-08-19 10:55:00 -0400, Tom Lane wrote:
 My own thought about reducing the cost of XidInMVCCSnapshot, if that
 proves necessary, is that maybe it would be worth the trouble to sort
 the arrays so we could use binary search.  That would increase the
 cost of snapshot acquisition noticeably though.

It's also considerably more expensive for search in many cases because
it's very hard to predict (branching and prefetching). I doubt it'd be
worthwhile.
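
To make the trade-off concrete, here is a minimal standalone sketch (not the
actual snapshot code) of the two lookup strategies being weighed: the linear
scan over the snapshot's xip array versus a binary search that would require
keeping the array sorted.  A real version would also have to deal with XID
comparison semantics (wraparound), which is glossed over here:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;     /* stand-in for PostgreSQL's typedef */

    /* today's approach: linear scan of the xip array */
    static bool
    xid_in_snapshot_linear(TransactionId xid, const TransactionId *xip, int xcnt)
    {
        for (int i = 0; i < xcnt; i++)
            if (xip[i] == xid)
                return true;
        return false;
    }

    /* hypothetical alternative: binary search over a sorted xip array */
    static bool
    xid_in_snapshot_binary(TransactionId xid, const TransactionId *xip, int xcnt)
    {
        int     lo = 0;
        int     hi = xcnt - 1;

        while (lo <= hi)
        {
            int     mid = lo + (hi - lo) / 2;

            /* this branch is what's hard to predict and prefetch around */
            if (xip[mid] == xid)
                return true;
            else if (xip[mid] < xid)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return false;
    }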


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] how to write/setup a C trigger function in a background worker

2015-08-19 Thread David Fetter
On Wed, Aug 19, 2015 at 05:37:31PM +0200, jacques klein wrote:
 I would like to execute a trigger function (written in C) in one of my
 background workers.
 
 Didn't figure out how to do that not even if it's possible.

You can write your trigger function in such a way as not to do the
usual check for trigger context, but it might be better to write two
functions, one with the trigger stuff in it, the other, which it
calls, for whatever action you actually want to trigger, and call that
second in your background worker.
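
A minimal sketch of that two-function split (function names here are
hypothetical, and the real work is elided):

    #include "postgres.h"
    #include "fmgr.h"
    #include "commands/trigger.h"

    PG_MODULE_MAGIC;

    /* the real work, callable from the trigger *and* from a background worker */
    static void
    do_the_work(void)
    {
        /* ... whatever action you actually want to trigger ... */
    }

    PG_FUNCTION_INFO_V1(my_trigger);

    /* thin wrapper: only this part cares about the trigger calling convention */
    Datum
    my_trigger(PG_FUNCTION_ARGS)
    {
        TriggerData *trigdata;

        if (!CALLED_AS_TRIGGER(fcinfo))
            elog(ERROR, "my_trigger: not called by trigger manager");
        trigdata = (TriggerData *) fcinfo->context;

        do_the_work();

        /* return the unmodified tuple so BEFORE ROW triggers keep working */
        return PointerGetDatum(trigdata->tg_trigtuple);
    }

The background worker then just calls do_the_work() directly, without going
through the trigger manager at all.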

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Victor Wagner
On 2015.08.20 at 00:17:35 +0900, Tatsuo Ishii wrote:

  But once connection is established, each client works with one
  server (at least until communication failure occurs and it would call
  PQreset. In this case it has to reprepare statements anyway).
 
 One downside of this is, if one of the standby servers is not
 responding, every time clients will be blocked by the server before
 giving up the connection trial. This could last for hours (for

This shouldn't happen. My proposal was to connect to all servers
simultaneously, and then use the connection which gets established
first, closing the other ones.

Even if we don't do so (to properly randomize server load or to be
compatible with JDBC), there is the connect_timeout parameter, so the
client wouldn't sit and wait for hours while the system TCP/IP stack
tries to connect to a nonexistent server.


 example, the network cable is plugged out). I think round robin DNS is
 better because the DNS server will drop the entry corresponding broken

The DNS server wouldn't drop anything unless explicitly told to (by an
administrator or by some watchdog software which is able to talk the
nsupdate protocol).

And not every database owner has control over his own domain.

Moreover, DNS is a distributed system with aggressive caching. If our
system is not local, DNS records for a non-existing server would be
cached by the DNS servers of the client's internet provider.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Tom Lane
Victor Wagner vi...@wagner.pp.ru writes:
 On 2015.08.20 at 00:17:35 +0900, Tatsuo Ishii wrote:
 One downside of this is, if one of the standby servers is not
 responding, every time clients will be blocked by the server before
 giving up the connection trial. This could last for hours (for

 This shouldn't happen. My proposal was to connect all servers
 simultaneously, and then use that connection which would be established
 first closing other ones

That seems like seriously unfriendly behavior.  It will trigger dropped
connection bleats in the server logs, not to mention wasted process
forks.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Badly broken logic in plpython composite-type handling

2015-08-19 Thread Tom Lane
I looked into the crash reported in bug #13579.  The proximate cause
of the crash is that PLyString_ToComposite does this:

   PLy_output_datum_func2(&info->out.d, typeTup,
                          exec_ctx->curr_proc->langid,
                          exec_ctx->curr_proc->trftypes);

without any thought for the possibility that info->is_rowtype is 1,
in which case it's stomping on the info->out.r data that will be
needed later.  I have serious doubts that making PLyTypeOutput a
union was a wise idea, as it means that C offers you exactly zero
protection against this type of data clobber.

I tried changing that to

   PLy_output_datum_func(info, typeTup,
                         exec_ctx->curr_proc->langid,
                         exec_ctx->curr_proc->trftypes);

which unsurprisingly results in ERROR:  PLyTypeInfo struct is initialized
for a Tuple in the case reported in the bug ... and also results in quite
a few of the plpython regression tests failing that way, which makes it a
bit astonishing that we've not tripped over this bug already.

I'm inclined to think that maybe PLyString_ToComposite needs to be doing
something more like what PLyObject_ToComposite does, ie doing its own
lookup in a private descriptor; but I'm not sure if that's right, and
anyway it would just be doubling down on a bad design.  Not being able
to cache these I/O function lookups is really horrid.

I wonder if that could be fixed by changing PLyTypeInput and PLyTypeOutput
from unions to structs and getting rid of is_rowtype in favor of allowing
the d and r fields to be valid or not independently; then you could treat
the object as either scalar or rowtype at need.
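
A rough sketch of what that could look like -- field names here are purely
illustrative, and PLyObToDatum/PLyObToTuple are assumed to be the existing
member types from plpy_typeio.h:

    typedef struct PLyTypeOutput
    {
        PLyObToDatum d;         /* scalar conversion, valid if d_valid */
        PLyObToTuple r;         /* rowtype conversion, valid if r_valid */
        bool         d_valid;
        bool         r_valid;
    } PLyTypeOutput;

(and similarly for PLyTypeInput), so a lookup for one representation no
longer clobbers the other.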

Does whoever designed this code in the first place want to fix it?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan

2015-08-19 Thread Qingqing Zhou
On Tue, Aug 18, 2015 at 5:59 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 BTW, did you register the patch on the upcoming commit-fest?

Not yet, it is in WIP status.

 I think it may be a helpful feature, if we can add alternative
 subquery-path towards cte-scan on set_cte_pathlist() and choose
 them according to the cost estimation.

Are you suggesting that we keep both the subquery-path (whenever possible)
and the cte-path so that the optimizer can choose among them?

I could imagine that if we supported materializing a CTE once and using it
multiple times during execution, then we would be able to benefit from
keeping the cte-path. But it seems we don't support this execution
mode yet.

Regards,
Qingqing


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] how to write/setup a C trigger function in a background worker

2015-08-19 Thread jacques klein
Well, sorry David, I don't understand what you mean,

let me explain what I want to do: in short, IPC between background
workers.

I am trying to transform my app. from a multi-threaded C SQL client into
some background workers, execution speed being the goal (avoiding
network I/O).
Worker start/stopping is nicely solved by server start/stop, but I also have
to do some messaging/notifying between my worker processes, and
would like to use a Postgres-based solution instead of the usual unix or
network IPC, of course while avoiding polling (tables acting as message
queues).


On Wed, 2015-08-19 at 10:01 -0700, David Fetter wrote:
 On Wed, Aug 19, 2015 at 05:37:31PM +0200, jacques klein wrote:
  I would like to execute a trigger function (written in C) in one of my
  background workers.
  
  Didn't figure out how to do that not even if it's possible.
 
 You can write your trigger function in such a way as not to do the
 usual check for trigger context, but it might be better to write two
 functions, one with the trigger stuff in it, the other, which it
 calls, for whatever action you actually want to trigger, and call that
 second in your background worker.
 
 Cheers,
 David.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: function parse_ident

2015-08-19 Thread Pavel Stehule
Hi

2015-08-19 21:33 GMT+02:00 Tom Lane t...@sss.pgh.pa.us:

 Pavel Stehule pavel.steh...@gmail.com writes:
  I miss a functionality that helps with parsing any identifier to basic
  three parts - database, schema, objectname. We have this function
  internally, but it is not available for SQL layer.

  FUNCTION parse_ident(IN ident text, OUT dbname text, OUT schemaname text,
  OUT objectname text)

 What exactly would you do with this that would not be better done with,
 for example, regclass?

 Don't say parse names for things other than tables.  Only a minority
 of the types of objects used in the database have names that meet this
 specification.


I see one important reason and one minor reason:

Important - a cast to regclass is possible only for existing objects;
parse_ident doesn't check the validity of the parsed ident.
Minor - a cast to regclass depends on search_path, but parse_ident does not;
with this function I am able to detect whether an ident depends (or not) on
search_path.

Regards

Pavel




 regards, tom lane



Re: [HACKERS] Declarative partitioning

2015-08-19 Thread Marc Mamin

On 2015-08-19 AM 02:57, Marc Mamin wrote:
 2. Creating a partition of a partitioned table

 CREATE TABLE table_name
 PARTITION OF partitioned_table_name
 FOR VALUES values_spec;

 Where values_spec is:

 listvalues: [IN] (val1, ...)

 
 Would it make sense to allow one complementary partition to the listvalues?  
 
 listvalues: [[NOT] IN] (val1, ...)
 
 I've thought a few times about moving data with some most common values to 
 dedicated partitions
 and keeping the rest in a separate one...
 

Thanks, that's definitely something to consider.

I have been thinking of a sort of default list partition for the rest of
values. Would you rather declare that with something like the below than
having to enumerate all the values in a NOT IN list? Or the NOT IN way is
more intuitive/friendly?

CREATE TABLE _rest PARTITION OF table_name FOR VALUES [ IN ] DEFAULT

Of course, at most one such partition would be allowed.


On the one hand I guess it will be easier to check for partition overlap if
their definitions all contain the exact allowed values.
But this could be generalized to range partitions too:

CREATE TABLE _rest FALLBACK PARTITION OF table_name

The need for this for range partitions seems very narrow at first glance,
but I remember boring administrative work to ensure that there always
was a partition available for incoming data (from a very old time when I was
still working with Oracle).

To make this comfortable and nevertheless allow defining new partitions, this
would require always checking/moving data from the default partition to new
partitions at create time.

And two other thoughts:
- In your proposal, the parent table is not materialized at all. Could it be
used for the fallback partition?
- What about always having a fallback partition? This would reduce the risk of
unexpected failures and somewhat help Postgres stand out from the crowd :)


regards,
Marc Mamin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: function parse_ident

2015-08-19 Thread Tom Lane
Pavel Stehule pavel.steh...@gmail.com writes:
 I miss a functionality that helps with parsing any identifier to basic
 three parts - database, schema, objectname. We have this function
 internally, but it is not available for SQL layer.

 FUNCTION parse_ident(IN ident text, OUT dbname text, OUT schemaname text,
 OUT objectname text)

What exactly would you do with this that would not be better done with,
for example, regclass?

Don't say parse names for things other than tables.  Only a minority
of the types of objects used in the database have names that meet this
specification.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning

2015-08-19 Thread Josh Berkus
On 08/19/2015 04:59 AM, Simon Riggs wrote:
 I like the idea of a regular partitioning step because it is how you
 design such tables - lets use monthly partitions.
 
 This gives sanely terse syntax, rather than specifying pages and pages
 of exact values in DDL
 
PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' )
 START WITH value;

Oh, I like that syntax!

How would it work if there were multiple columns?  Maybe we don't want
to allow that for this form?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning

2015-08-19 Thread Thom Brown
On 19 August 2015 at 21:10, Josh Berkus j...@agliodbs.com wrote:

 On 08/19/2015 04:59 AM, Simon Riggs wrote:
  I like the idea of a regular partitioning step because it is how you
  design such tables - lets use monthly partitions.
 
  This gives sanely terse syntax, rather than specifying pages and pages
  of exact values in DDL
 
 PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' )
  START WITH value;

 Oh, I like that syntax!

 How would it work if there were multiple columns?  Maybe we don't want
 to allow that for this form?


If we went with that, and had:

CREATE TABLE orders (order_id serial, order_date date, item text)
  PARTITION BY RANGE ON (order_date) INCREMENT BY (INTERVAL '1 month')
  START WITH '2015-01-01';

Where would the following go?

INSERT INTO orders (order_date, item) VALUES ('2014-11-12', 'Old item');

Would there automatically be an others partition?  Or would it produce an
error and act like a constraint?

Thom


Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent

2015-08-19 Thread Jeff Janes
On Tue, Aug 18, 2015 at 5:36 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 I wrote:
  Just thinking about this ... I wonder why we need to call
  TransactionIdIsInProgress() at all rather than believing the answer from
  the snapshot?  Under what circumstances could TransactionIdIsInProgress()
  return true where XidInMVCCSnapshot() had not?

 I experimented with the attached patch, which replaces
 HeapTupleSatisfiesMVCC's calls of TransactionIdIsInProgress with
 XidInMVCCSnapshot, and then as a cross-check has all the return false
 exits from XidInMVCCSnapshot assert !TransactionIdIsInProgress().
 The asserts did not fire in the standard regression tests nor in a
 pgbench run, which is surely not proof of anything but it suggests
 that I'm not totally nuts.

 I wouldn't commit the changes in XidInMVCCSnapshot for real, but
 otherwise this is a possibly committable patch.


This gives the same performance improvement as the original patch (when
compiled without cassert).

It has survived testing in a hostile environment (rapid fire stream of
contentious updates intermixed with selects, with or without frequent
crashes) without signs of missed or duplicated rows.

So, it looks good to me.

Cheers,

Jeff


Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan

2015-08-19 Thread Josh Berkus
On 08/18/2015 04:40 PM, Qingqing Zhou wrote:
 Attached please find the WIP patch and also the ANALYZE results.
 Notes: the patch may not directly apply to head as some network issue
 here so my Linux box can't talk to git server.

So, one of the things we previously mentioned is that currently many
users deliberately use CTEs as an optimization barrier in order to force
the planner.  Given that, we need some kind of option to force the old
behavior; either SQL syntax or a GUC option.  Otherwise this will cause
a bunch of backwards-compatibility breakage.

Ideas?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan

2015-08-19 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 On 08/18/2015 04:40 PM, Qingqing Zhou wrote:
 Attached please find the WIP patch and also the ANALYZE results.
 Notes: the patch may not directly apply to head as some network issue
 here so my Linux box can't talk to git server.

 So, one of the things we previously mentioned is that currently many
 users deliberately use CTEs as an optimization barrier in order to force
 the planner.  Given that, we need some kind of option to force the old
 behavior; either SQL syntax or a GUC option.

I think we already agreed what the syntax would be: ye good olde OFFSET 0
in the subquery.

We could have a GUC option too if people are sufficiently worried about
it, but I think that the need for one hasn't really been proven.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Mention column name in error messages

2015-08-19 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 The hints you add end in a new line, which then gives two new lines once
 they are emitted.  This is contrary to how other HINTs are formatted.
 Other HINTs are complete sentences (start with a capital letter, end with a
 period).
 But I think these belong as CONTEXT or as DETAIL, not as HINT.  The
 messages are giving me details about where (which column) the error
 occurred, they aren't giving suggestions to me about what to do about it.

We have specific style guidelines about this:
http://www.postgresql.org/docs/devel/static/error-message-reporting.html
http://www.postgresql.org/docs/devel/static/error-style-guide.html

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Freeze avoidance of very large table.

2015-08-19 Thread Jim Nasby

On 8/19/15 2:56 AM, Masahiko Sawada wrote:

The current regression test for the VM just compares the total number
of all-visible and all-frozen bits in the VM before and after
VACUUM, and doesn't check any particular bit in the VM.
We could replace it with the ANALYZE command with a large enough sampling
number, checking pg_class.relallvisible and pg_class.relallfrozen.


I think this is another indication that we need more than just pg_regress...


So another way is to put a diagnostic function for the VM into
a contrib module (pg_freespacemap or pageinspect), and if we want to
use such a function in production, we can install that extension as in
the past.


pg_buffercache is very useful as a performance monitoring tool, and I 
view being able to pull statistics about the VM and FM the same way. I'd 
like to see us providing more performance information by default, not less.


I think things like pageinspect are very different; I really can't see 
any use for those beyond debugging (and debugging by an expert at that).

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: function parse_ident

2015-08-19 Thread Jim Nasby

On 8/19/15 2:44 PM, Pavel Stehule wrote:

Don't say parse names for things other than tables.  Only a minority
of the types of objects used in the database have names that meet this
specification.


Really? My impression is that almost everything that's not a shared 
object allows for a schema...



I see one important reason and one minor reason:

Important - cast to regclass is possible only for existing objects -
parse_ident doesn't check validity of parsed ident.
minor - cast to regclass depends on search_path - but parse_ident not -
with this function I am able to detect if ident depends (or not) on
search_path.


I've been forced to write this several times. I'd really like to expose 
this functionality.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] how to write/setup a C trigger function in a background worker

2015-08-19 Thread Alvaro Herrera
Bill Moran wrote:
 On Wed, 19 Aug 2015 19:45:47 +0200
 jacques klein jacques.k...@googlemail.com wrote:
 
  Well, sorry David, I don't understand what you mean,
  
  let me explain what I want to do: in short, IPC between background
  workers.
  
  I am trying to transform my app. from a multi-threaded C SQL-client into
  some background workers, execution speed beeing the goal (avoid
  network io).
  Worker start/stopping is nicely solved by server start/stop, but I have
  also to do some messaging/notifying between my worker processes, and
  would like to use a Postgres based solution instead of the usual unix or
  network ipc, or course by avoiding polling (tables acting as message
  queues).
 
 I think what David is saying, and what I would suggest, is the
 following:
 
 * It's not possible to have a trigger execute in the background
 * Create a background job that runs perpetually and listens for
   notification that it has work to do (i.e. it sleeps until notified)
 * Notify the job from the trigger

You could use shm_mq for IPC between backend (trigger) and bgworker, but
what should the backend do when the bgworker is not running for whatever
reason?  One option is to curl up and die of course (i.e. abort the
transaction that fires the action).  Also consider what happens if your
backend sends the notify and commits, and before the bgworker does its
stuff the system crashes.  You will be happy to have things queued in a
database table ...
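
For reference, a rough sender-side sketch of the shm_mq route (hypothetical
and heavily abridged: it assumes a dsm_segment *seg that both processes have
mapped, e.g. via a handle advertised in shared memory, that the bgworker sets
itself up as receiver on the same queue, and that error handling and header
includes such as storage/shm_mq.h are taken care of elsewhere):

    /* backend/trigger side: create a queue inside the shared segment, send */
    shm_mq         *mq = shm_mq_create(dsm_segment_address(seg), 16384);
    shm_mq_handle  *mqh;
    const char     *msg = "table foo changed";      /* hypothetical payload */

    shm_mq_set_sender(mq, MyProc);
    mqh = shm_mq_attach(mq, seg, NULL);

    if (shm_mq_send(mqh, strlen(msg) + 1, msg, false) != SHM_MQ_SUCCESS)
        ereport(ERROR,
                (errmsg("background worker detached from the queue")));

But as noted above, anything that must survive a crash still wants a real
table-based queue.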

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] how to write/setup a C trigger function in a background worker

2015-08-19 Thread Bill Moran
On Wed, 19 Aug 2015 19:45:47 +0200
jacques klein jacques.k...@googlemail.com wrote:

 Well, sorry David, I don't understand what you mean,
 
 let me explain what I want to do: in short, IPC between background
 workers.
 
 I am trying to transform my app. from a multi-threaded C SQL-client into
 some background workers, execution speed beeing the goal (avoid
 network io).
 Worker start/stopping is nicely solved by server start/stop, but I have
 also to do some messaging/notifying between my worker processes, and
 would like to use a Postgres based solution instead of the usual unix or
 network ipc, or course by avoiding polling (tables acting as message
 queues).

I think what David is saying, and what I would suggest, is the
following:

* It's not possible to have a trigger execute in the background
* Create a background job that runs perpetually and listens for
  notification that it has work to do (i.e. it sleeps until notified)
* Notify the job from the trigger

 On Wed, 2015-08-19 at 10:01 -0700, David Fetter wrote:
  On Wed, Aug 19, 2015 at 05:37:31PM +0200, jacques klein wrote:
   I would like to execute a trigger function (written in C) in one of my
   background workers.
   
   Didn't figure out how to do that not even if it's possible.
  
  You can write your trigger function in such a way as not to do the
  usual check for trigger context, but it might be better to write two
  functions, one with the trigger stuff in it, the other, which it
  calls, for whatever action you actually want to trigger, and call that
  second in your background worker.
  
  Cheers,
  David.
 
 
 
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Bill Moran wmo...@potentialtech.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Mention column name in error messages

2015-08-19 Thread Jeff Janes
On Sun, Aug 9, 2015 at 8:44 AM, Franck Verrot fra...@verrot.fr wrote:

 On Wed, Jul 1, 2015 at 12:30 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 What seems more likely to lead to a usable patch is to arrange for the
 extra information you want to be emitted as error context, via an error
 context callback that gets installed at the right times.  ...
 ...
 with no need for int8in to be directly aware of the context.  You should
 try adapting that methodology for the cases you're worried about.


 Hi Tom (and others),

 Sorry it took so long for me to follow up on this; hopefully I found a
 couple of hours today to try writing another patch.

 In any case, thanks for reviewing my first attempt and taking time to write
 such a detailed critique... I've learned a lot!

 I am now using the error context callback stack. The current column name
 and column type are passed to the callback packed inside a new structure
 of type TransformExprState.
 That information is then passed to `errhint` and will be presented to the
 user later on (in case of coercion failure).


 Please find the WIP patch attached.
 (I've pushed the patch on my GH fork[1] too).



Hi Franck,

I like the idea of having the column name.

I took this for a test drive, and had some comments on the user-visible
parts.

The hints you add end in a new line, which then gives two new lines once
they are emitted.  This is contrary to how other HINTs are formatted.

Other HINTs are complete sentences (start with a capital letter, end with a
period).

But I think these belong as CONTEXT or as DETAIL, not as HINT.  The
messages are giving me details about where (which column) the error
occurred, they aren't giving suggestions to me about what to do about it.

Cheers,

Jeff


[HACKERS] Missing equivalent xpath_table in xml2

2015-08-19 Thread Michal Šalko
Hello,

I miss an equivalent of xml2's xpath_table among the newer XML features. Are
you planning to replace it with something?

 

Thanks,

Michal


Ing. Michal Šalko
SQL specialist, developer
+420 549 494 572
sa...@iba.muni.cz
Masarykova univerzita, Institut biostatistiky a analýz
Kamenice 126/3, 625 00 Brno
http://www.iba.muni.cz/




Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan

2015-08-19 Thread Josh Berkus
On 08/19/2015 01:32 PM, Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
 On 08/18/2015 04:40 PM, Qingqing Zhou wrote:
 Attached please find the WIP patch and also the ANALYZE results.
 Notes: the patch may not directly apply to head as some network issue
 here so my Linux box can't talk to git server.
 
 So, one of the things we previously mentioned is that currently many
 users deliberately use CTEs as an optimization barrier in order to force
 the planner.  Given that, we need some kind of option to force the old
 behavior; either SQL syntax or a GUC option.
 
 I think we already agreed what the syntax would be: ye good olde OFFSET 0
 in the subquery.
 
 We could have a GUC option too if people are sufficiently worried about
 it, but I think that the need for one hasn't really been proven.

Asking users to refactor their applications to add OFFSET 0 is a bit
painful if we could instead take care of it via a backwards-compatibility
GUC.  We have many users who are specifically using the CTE optimization
barrier to work around planner failures.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning

2015-08-19 Thread Josh Berkus
On 08/19/2015 01:18 PM, Thom Brown wrote:
 On 19 August 2015 at 21:10, Josh Berkus j...@agliodbs.com
 mailto:j...@agliodbs.com wrote:
 
 On 08/19/2015 04:59 AM, Simon Riggs wrote:
  I like the idea of a regular partitioning step because it is how you
  design such tables - lets use monthly partitions.
 
  This gives sanely terse syntax, rather than specifying pages and pages
  of exact values in DDL
 
 PARTITION BY RANGE ON (columns) INCREMENT BY (INTERVAL '1 month' )
  START WITH value;
 
 Oh, I like that syntax!
 
 How would it work if there were multiple columns?  Maybe we don't want
 to allow that for this form?
 
 
 If we went with that, and had:
 
 CREATE TABLE orders (order_id serial, order_date date, item text)
   PARTITION BY RANGE ON (order_date) INCREMENT BY (INTERVAL '1 month')
   START WITH '2015-01-01';
 
 Where would the following go?
 
 INSERT INTO orders (order_date, item) VALUES ('2014-11-12', 'Old item');
 
 Would there automatically be an others partition?  Or would it produce
 an error and act like a constraint?

The others partition was brought up upthread, as an addition to the
original proposal.  I really think that an others partition needs to
be up to the DBA; I've seen apps where they'd want to capture it, and
apps where they'd want such an insert to error.

I, for one, would be OK with a new partitioning which didn't address the
others partition issue until 9.7; I see it as a wholly separable
improvement.

Plus, you can always *manually* add high/low catchall partitions.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] how to write/setup a C trigger function in a background worker

2015-08-19 Thread jacques klein
Ok, I think I got it,

I can use LISTEN and NOTIFY to do my IPC stuff; I will just have to see if
it's possible to listen in a worker, with a permanent server connection,
I guess.
(As I remember, I already did a listener 10 years ago in a C client
app.)

Thanks,
Jacques K.


On Wed, 2015-08-19 at 16:54 -0400, Bill Moran wrote:
 On Wed, 19 Aug 2015 19:45:47 +0200
 jacques klein jacques.k...@googlemail.com wrote:
 
  Well, sorry David, I don't understand what you mean,
  
  let me explain what I want to do: in short, IPC between background
  workers.
  
  I am trying to transform my app. from a multi-threaded C SQL-client into
  some background workers, execution speed beeing the goal (avoid
  network io).
  Worker start/stopping is nicely solved by server start/stop, but I have
  also to do some messaging/notifying between my worker processes, and
  would like to use a Postgres based solution instead of the usual unix or
  network ipc, or course by avoiding polling (tables acting as message
  queues).
 
 I think what David is saying, and what I would suggest, is the
 following:
 
 * It's not possible to have a trigger execute in the background
 * Create a background job that runs perpetually and listens for
   notification that it has work to do (i.e. it sleeps until notified)
 * Notify the job from the trigger
 
  On Wed, 2015-08-19 at 10:01 -0700, David Fetter wrote:
   On Wed, Aug 19, 2015 at 05:37:31PM +0200, jacques klein wrote:
I would like to execute a trigger function (written in C) in one of my
background workers.

Didn't figure out how to do that not even if it's possible.
   
   You can write your trigger function in such a way as not to do the
   usual check for trigger context, but it might be better to write two
   functions, one with the trigger stuff in it, the other, which it
   calls, for whatever action you actually want to trigger, and call that
   second in your background worker.
   
   Cheers,
   David.
  
  
  
  
  -- 
  Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-hackers
 
 




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Supporting fallback RADIUS server(s)

2015-08-19 Thread Tom Lane
Marko Tiikkaja ma...@joh.to writes:
 So I'm developing a patch to fix this issue, but I'm not 
 exactly sure what the configuration should look like.  I see multiple 
 options, but the one I like the best is the following:

 Add two new HBA configuration options: radiusfallbackservers and 
 radiusfallbackports; both lists parsed with SplitIdentifierString (à la 
 listen_addresses).

Why add new GUCs for that?  Can't we just redefine radiusserver as a list
of servers to try in sequence, and similarly split radiusport into a list?

Or maybe better, rename it radius_servers.  But what you have here seems
weird, and it certainly doesn't follow the precedent of what we did when
we replaced listen_address with listen_addresses.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan

2015-08-19 Thread Kouhei Kaigai
 On Mon, Aug 17, 2015 at 9:40 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
  I think SortSupport logic provides a reasonable way to solve this
  kind of problem. For example, btint4sortsupport() informs a function
  pointer of the fast version of comparator (btint4fastcmp) which takes
  two Datum argument without indirect memory reference.
  This mechanism will also make sense for HashAggregate logic, to reduce
  the cost of function invocations.
 
  Please comment on the idea I noticed here.
 
 It's possible that this can work, but it might be a good idea to run
 'perf' on this query and find out where the CPU time is actually
 going.  That might give you a clearer picture of why the HashAggregate
 is slow.

I tried to run one of the CTE portions with perf enabled.

HashAggregate still takes 490sec, compared with 70sec for the underlying Join.


tpcds100=# explain analyze select c_customer_id customer_id
   ,c_first_name customer_first_name
   ,c_last_name customer_last_name
   ,c_preferred_cust_flag customer_preferred_cust_flag
   ,c_birth_country customer_birth_country
   ,c_login customer_login
   ,c_email_address customer_email_address
   ,d_year dyear
   
,sum(((ss_ext_list_price-ss_ext_wholesale_cost-ss_ext_discount_amt)+ss_ext_sales_price)/2)
 year_total
   ,'s' sale_type
 from customer
 ,store_sales
 ,date_dim
 where c_customer_sk = ss_customer_sk
   and ss_sold_date_sk = d_date_sk
 group by c_customer_id
 ,c_first_name
 ,c_last_name
 ,c_preferred_cust_flag
 ,c_birth_country
 ,c_login
 ,c_email_address
 ,d_year
;
                                  QUERY PLAN
--------------------------------------------------------------------------------
 HashAggregate  (cost=18194948.40..21477516.00 rows=262605408 width=178)
                (actual time=483480.161..490763.640 rows=9142442 loops=1)
   Group Key: customer.c_customer_id, customer.c_first_name, customer.c_last_name,
              customer.c_preferred_cust_flag, customer.c_birth_country,
              customer.c_login, customer.c_email_address, date_dim.d_year
   ->  Custom Scan (GpuJoin)  (cost=101342.54..9660272.64 rows=262605408 width=178)
                              (actual time=2430.787..73116.553 rows=268562375 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: Logic: GpuHashJoin, HashKeys: (ss_sold_date_sk),
                  JoinQual: (ss_sold_date_sk = d_date_sk),
                  nrows (287997024 -> 275041999, 95.50% expected 95.47%)
         Depth 2: Logic: GpuHashJoin, HashKeys: (ss_customer_sk),
                  JoinQual: (ss_customer_sk = c_customer_sk),
                  nrows (275041999 -> 268562375, 93.25% expected 91.18%)
         ->  Custom Scan (BulkScan) on store_sales
               (cost=0.00..9649559.60 rows=287996960 width=38)
               (actual time=17.141..52757.354 rows=287997024 loops=1)
         ->  Seq Scan on date_dim  (cost=0.00..2705.49 rows=73049 width=16)
                                   (actual time=0.030..20.597 rows=73049 loops=1)
         ->  Seq Scan on customer  (cost=0.00..87141.74 rows=274 width=156)
                                   (actual time=0.010..585.861 rows=200 loops=1)
 Planning time: 1.558 ms
 Execution time: 492113.558 ms
(11 rows)


Perf output is below. Contrary to my expectation, the largest portion was consumed
by bpchareq (6.76%) + bcTruelen (8.23%). One other big cluster is, probably,
TupleHashTableHash (1.11%) -> slot_getattr (4.29%) -> slot_deform_tuple (4.92%).


# 
# captured on: Thu Aug 20 09:52:24 2015
# hostname : ayu.kaigai.gr.jp
# os release : 2.6.32-504.23.4.el6.x86_64
# perf version : 2.6.32-504.23.4.el6.x86_64.debug
# arch : x86_64
# nrcpus online : 48
# nrcpus avail : 48
# cpudesc : Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
# cpuid : GenuineIntel,6,63,2
# total memory : 396795400 kB
# cmdline : /usr/bin/perf record -a -e cycles
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, 
excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, 
attr_mmap2 = 0, attr_mmap  = 1, attr_mmap_data = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, tracepoint = 2, software = 1
# 
#
# Samples: 2M of event 'cycles'
# Event count (approx.): 1558291468259
#
# Overhead  Command   Shared Object 
Symbol
#   ...  ..  
.
#
 8.23% postgres  postgres[.] bcTruelen
 6.76% postgres  postgres[.] bpchareq
 4.92% postgres  postgres[.] pg_detoast_datum
 4.29% postgres  postgres[.] slot_getattr
 4.07% postgres  postgres  

Re: [HACKERS] Our trial to TPC-DS but optimizer made unreasonable plan

2015-08-19 Thread Peter Geoghegan
On Wed, Aug 19, 2015 at 6:08 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 Indeed, 6 of 8 grouping keys in this query uses bpchar() data type, so it is
 natural comparison function consumed larger portion of CPU cycles.
 Do we have any idea to assist these queries by the backend?

With abbreviated keys, char(n) is very significantly slower than
varchar(n) (or text). I've been meaning to revisit my docpatch to warn
users of this (we already specifically advise against using char(n),
more or less). Abbreviation and a few other tricks could easily make
an enormous difference.

There is no very good reason why the same optimizations that I applied
to text could not also be applied to bpchar(), I think. I think that
abbreviation could remove much of the special char(n) effort, as well;
someone simply needs to do the work. I'm actually more concerned about
the problems that this causes for third-party benchmarks than I am
about the problems for real users.
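
For context, the hot spot in the profile above, bcTruelen(), has to walk
backwards over the blank padding of a char(n) datum on every comparison;
roughly (a sketch of the logic, close to what varchar.c does, not a verbatim
copy):

    static int
    bcTruelen(BpChar *arg)
    {
        char   *s = VARDATA_ANY(arg);
        int     len = VARSIZE_ANY_EXHDR(arg);
        int     i;

        /* scan back over the trailing spaces added by bpchar padding */
        for (i = len - 1; i >= 0; i--)
        {
            if (s[i] != ' ')
                break;
        }
        return i + 1;
    }

which is pure per-comparison overhead that varchar(n)/text simply don't pay.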

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-08-19 Thread Tomas Vondra

Hi,

attached is a significantly reworked patch for using the foreign keys in 
selectivity estimation. The previous patch only really worked if the 
clauses matched the foreign key perfectly (i.e. no additional join 
clauses) - this patch attempts to relax those restrictions a bit.


This patch also significantly improves the comments - the best place to 
start reading is clauselist_join_selectivity().


In general, the patch works by looking for foreign keys between the 
inner and outer side of the join, but only for keys that:


  (a) have more than 2 keys (as this only really helps with multi-
  column foreign keys)

  (b) are 'fully matched' by the join clauses, i.e. there's a clause
  exactly matching each part of the foreign key

Once we have matched foreign keys (for each join), we can estimate each 
of them using 1/cardinality of the referenced table and estimate the 
remaining clauses (not used to match any foreign key) the old way.
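
To illustrate why this matters, a small standalone sketch of the arithmetic
(hypothetical round table sizes, and a deliberately simplified model of the
planner's independence assumption -- this is not the planner's code):

    #include <stdio.h>

    int
    main(void)
    {
        double  f_rows = 100000;    /* referencing table, e.g. f */
        double  a_rows = 10000;     /* referenced table, e.g. a (unique key) */

        /*
         * Treating (f1 = a1) and (f2 = a2) as independent clauses multiplies
         * two selectivities of roughly 1/10000 each ...
         */
        double  sel_independent = (1.0 / a_rows) * (1.0 / a_rows);

        /*
         * ... while recognizing that the pair matches the whole foreign key
         * lets us use 1/|referenced table| for the pair as a whole.
         */
        double  sel_fk = 1.0 / a_rows;

        printf("independent clauses: %.0f rows\n",
               f_rows * a_rows * sel_independent);      /* 10     */
        printf("foreign-key aware:   %.0f rows\n",
               f_rows * a_rows * sel_fk);               /* 100000 */
        return 0;
    }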


example with 3 tables
-
create table a (a1 int, a2 int, primary key (a1, a2));

create table b (b1 int, b2 int, primary key (b1, b2));

create table f (f1 int, f2 int, f3 int, f4 int,
   foreign key (f1,f2) references a(a1,a2),
   foreign key (f3,f4) references b(b1,b2));

insert into a select i, i from generate_series(1,1) s(i);
insert into b select i, i from generate_series(1,1) s(i);

-- 10x
insert into f select i, i, i, i from generate_series(1,1) s(i);

analyze;

Then on current master, I get these estimates (showing just rows,
because that's what matters):



while with the patch I get this:

select * from f join a on (f1 = a1 and f2 = a2);

      QUERY PLAN

 Hash Join  (rows=10) (actual rows=10)
   Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
   ->  Seq Scan on f  (rows=10) (actual rows=10)
   ->  Hash  (rows=1) (actual rows=1)
         Buckets: 16384  Batches: 1  Memory Usage: 519kB
         ->  Seq Scan on a  (rows=1) (actual rows=1)

select * from f join a on (f1 = a1 and f2 = a2)
join b on (f3 = b1 and f4 = b2);

      QUERY PLAN

 Hash Join  (rows=10) (actual rows=10)
   Hash Cond: ((f.f3 = b.b1) AND (f.f4 = b.b2))
   ->  Hash Join  (rows=10) (actual rows=10)
         Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
         ->  Seq Scan on f  (rows=10) (actual rows=10)
         ->  Hash  (rows=1) (actual rows=1)
               Buckets: 16384  Batches: 1  Memory Usage: 519kB
               ->  Seq Scan on a  (rows=1) (actual rows=1)
   ->  Hash  (rows=1) (actual rows=1)
         Buckets: 16384  Batches: 1  Memory Usage: 519kB
         ->  Seq Scan on b  (rows=1) (actual rows=1)

So, that's pretty good.

I believe it might be possible to use even foreign keys matched only 
partially (e.g. foreign key on 3 columns, but only 2 of those matched by 
clauses), but I think that's a bit too much for now.


The examples above are rather trivial, and sadly it's not difficult to
break them. For example, adding a single additional join clause to the
first query does this:


select * from f join a on (f1 = a1 and f2 = a2 and f1 = a2);

      QUERY PLAN

 Hash Join  (rows=2) (actual rows=10)
   Hash Cond: (f.f1 = a.a1)
   ->  Seq Scan on f  (rows=500) (actual rows=10)
         Filter: (f1 = f2)
   ->  Hash  (rows=50) (rows=1)
         Buckets: 16384 (originally 1024)  Batches: 1 (originally 1) ...
         ->  Seq Scan on a  (rows=50) (actual rows=1)
               Filter: (a1 = a2)

This of course happens thanks to equality classes because the planner 
is smart enough to realize that (f1=f2=a1=a2) thanks to the new clause. 
I'm not sure how to fix this, and maybe this particular case would be 
easier to fix using the multivariate stats (once it can estimate clauses 
like a1=a2).


Similarly, the equality classes may break other examples by deriving 
completely new clauses - for example assume the f table is defined 
like this (again, with 100k rows):


create table f (f1 int, f2 int,
   foreign key (f1,f2) references a(a1,a2),
   foreign key (f1,f2) references b(b1,b2));

then this query

select * from f join a on (f1 = a1 and f2 = a2)
join b on (f1 = b1 and f2 = b2);

may get planned like this:

      QUERY PLAN

 Hash Join  (rows=10) (actual rows=10)
   Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
   ->  Seq Scan on f  (rows=10) (actual rows=10)
   ->  Hash  (rows=1) (actual rows=1)
         ->  Hash Join  (rows=1) (actual rows=1)
               Hash Cond: 

Re: [HACKERS] PATCH: use foreign keys to improve join estimates v1

2015-08-19 Thread Tomas Vondra


On 08/20/2015 03:49 AM, Tomas Vondra wrote:

Then on current master, I get these estimates (showing just rows,
because that's what matters):


while with the patch I get this:


And of course I forgot to include the plans from master, so here we go:

select * from f join a on (f1 = a1 and f2 = a2);

      QUERY PLAN

 Hash Join  (rows=10) (actual rows=10)
   Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
   ->  Seq Scan on f  (rows=10) (actual rows=10)
   ->  Hash  (rows=1) (actual rows=1)
         Buckets: 16384  Batches: 1  Memory Usage: 519kB
         ->  Seq Scan on a  (rows=1) (actual rows=1)


select * from f join a on (f1 = a1 and f2 = a2)
join b on (f3 = b1 and f4 = b2);

      QUERY PLAN

 Nested Loop  (rows=1) (actual rows=10)
   ->  Hash Join  (rows=10) (actual rows=10)
         Hash Cond: ((f.f1 = a.a1) AND (f.f2 = a.a2))
         ->  Seq Scan on f  (rows=10) (actual rows=10)
         ->  Hash  (rows=1) (actual rows=1)
               Buckets: 16384  Batches: 1  Memory Usage: 519kB
               ->  Seq Scan on a  (rows=1) (actual rows=1)
   ->  Index Only Scan using b_pkey on b  (rows=1) (actual rows=1)
         Index Cond: ((b1 = f.f3) AND (b2 = f.f4))

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Badly broken logic in plpython composite-type handling

2015-08-19 Thread Tom Lane
I wrote:
 I looked into the crash reported in bug #13579.  The proximate cause
 of the crash is that PLyString_ToComposite does this:
 ...
 I'm inclined to think that maybe PLyString_ToComposite needs to be doing
 something more like what PLyObject_ToComposite does, ie doing its own
 lookup in a private descriptor; but I'm not sure if that's right, and
 anyway it would just be doubling down on a bad design.  Not being able
 to cache these I/O function lookups is really horrid.

I wrote a draft patch that does it that way.  It indeed stops the crash,
and even better, makes PLyString_ToComposite actually work for RECORD,
which AFAICT it's never done before.  We'd conveniently omitted to test
the case in the plpython regression tests, thus masking the fact that
it would crash horribly if you invoked such a case more than once.

To make it work, I had to modify record_in to allow inputting a value
of type RECORD as long as it's given a correct typmod.  While that would
not happen in ordinary SQL, it's possible in this and probably other
usages, and I can't really see any downside.  The emitted record will
be properly marked with type RECORDOID and typmod whatever the internal
anonymous-type identifier is, and it's no more or less type-safe than
any other such value.

This version of PLyString_ToComposite is no better than
PLyObject_ToComposite as far as leaking memory goes.  We could probably
fix that in a crude fashion by using plpython's scratch context for the
transient type-lookup data, but I'd rather see a proper fix wherein the
lookup results stay cached.  In any case, that's a separate and less
critical matter.

Barring objections, I propose to commit and back-patch this.  The crash
can be demonstrated back to 9.1.

regards, tom lane

diff --git a/src/backend/utils/adt/rowtypes.c b/src/backend/utils/adt/rowtypes.c
index a65e18d..ec4f71e 100644
*** a/src/backend/utils/adt/rowtypes.c
--- b/src/backend/utils/adt/rowtypes.c
*** record_in(PG_FUNCTION_ARGS)
*** 73,84 
  {
  	char	   *string = PG_GETARG_CSTRING(0);
  	Oid			tupType = PG_GETARG_OID(1);
! 
! #ifdef NOT_USED
! 	int32		typmod = PG_GETARG_INT32(2);
! #endif
  	HeapTupleHeader result;
- 	int32		tupTypmod;
  	TupleDesc	tupdesc;
  	HeapTuple	tuple;
  	RecordIOData *my_extra;
--- 73,80 
  {
  	char	   *string = PG_GETARG_CSTRING(0);
  	Oid			tupType = PG_GETARG_OID(1);
! 	int32		tupTypmod = PG_GETARG_INT32(2);
  	HeapTupleHeader result;
  	TupleDesc	tupdesc;
  	HeapTuple	tuple;
  	RecordIOData *my_extra;
*** record_in(PG_FUNCTION_ARGS)
*** 91,106 
  	StringInfoData buf;
  
  	/*
! 	 * Use the passed type unless it's RECORD; we can't support input of
! 	 * anonymous types, mainly because there's no good way to figure out which
! 	 * anonymous type is wanted.  Note that for RECORD, what we'll probably
! 	 * actually get is RECORD's typelem, ie, zero.
  	 */
! 	if (tupType == InvalidOid || tupType == RECORDOID)
  		ereport(ERROR,
  (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  		   errmsg(input of anonymous composite types is not implemented)));
- 	tupTypmod = -1;/* for all non-anonymous types */
  
  	/*
  	 * This comes from the composite type's pg_type.oid and stores system oids
--- 87,103 
  	StringInfoData buf;
  
  	/*
! 	 * Give a friendly error message if we did not get enough info to identify
! 	 * the target record type.  (lookup_rowtype_tupdesc would fail anyway, but
! 	 * with a non-user-facing message.)  Note that for RECORD, what we'll
! 	 * usually actually get is RECORD's typelem, ie, zero.  However there are
! 	 * cases where we do get a valid typmod and can do something useful.
  	 */
! 	if (tupType == InvalidOid ||
! 		(tupType == RECORDOID && tupTypmod < 0))
  		ereport(ERROR,
  (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  		   errmsg(input of anonymous composite types is not implemented)));
  
  	/*
  	 * This comes from the composite type's pg_type.oid and stores system oids
*** record_recv(PG_FUNCTION_ARGS)
*** 449,460 
  {
  	StringInfo	buf = (StringInfo) PG_GETARG_POINTER(0);
  	Oid			tupType = PG_GETARG_OID(1);
! 
! #ifdef NOT_USED
! 	int32		typmod = PG_GETARG_INT32(2);
! #endif
  	HeapTupleHeader result;
- 	int32		tupTypmod;
  	TupleDesc	tupdesc;
  	HeapTuple	tuple;
  	RecordIOData *my_extra;
--- 446,453 
  {
  	StringInfo	buf = (StringInfo) PG_GETARG_POINTER(0);
  	Oid			tupType = PG_GETARG_OID(1);
! 	int32		tupTypmod = PG_GETARG_INT32(2);
  	HeapTupleHeader result;
  	TupleDesc	tupdesc;
  	HeapTuple	tuple;
  	RecordIOData *my_extra;
*** record_recv(PG_FUNCTION_ARGS)
*** 466,481 
  	bool	   *nulls;
  
  	/*
! 	 * Use the passed type unless it's RECORD; we can't support input of
! 	 * anonymous types, mainly because there's no good way to figure out which
! 	 * anonymous type is wanted.  Note that for RECORD, what we'll probably
! 	 * actually get is RECORD's typelem, ie, zero.
  	 */
! 	if (tupType == InvalidOid || tupType 

[HACKERS] Supporting fallback RADIUS server(s)

2015-08-19 Thread Marko Tiikkaja

Hi,

We use RADIUS authentication at $WORK, and it has one major flaw (well, 
two, but I already fixed the other one this week): it only supports 
specifying a single server, which as we might know, is bad for high 
availability.  So I'm developing a patch to fix this issue, but I'm not 
exactly sure what the configuration should look like.  I see multiple 
options, but the one I like the best is the following:


Add two new HBA configuration options: radiusfallbackservers and 
radiusfallbackports; both lists parsed with SplitIdentifierString (à la 
listen_addresses).  If radiusfallbackservers is set, it's the list of 
servers to try after radiusserver has timed out (but ONLY if it timed 
out, obviously).  If radiusfallbackports is set, it must be of equal 
length with radiusfallbackservers, and the corresponding port entry is 
used for each fallback server.  If it's not set, radiusport is used as 
the port for all fallback servers, or finally the default port (1812) if 
no port options have been specified in pg_hba.conf.


This interface is both flexible, simple to use in the common case (YMMV) 
and fully backwards compatible.


I'm planning two submit two patches: one to refactor CheckRADIUSAuth() a 
bit to make it easier to try the authentication against more than one 
server, and then another one to add the two new configuration options.


Does anyone have a problem with this plan?  Want to comment on the 
interface or get anything else off their chest?



.m


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Supporting fallback RADIUS server(s)

2015-08-19 Thread Marko Tiikkaja

On 2015-08-20 02:29, Tom Lane wrote:

Marko Tiikkaja ma...@joh.to writes:

So I'm developing a patch to fix this issue, but I'm not
exactly sure what the configuration should look like.  I see multiple
options, but the one I like the best is the following:



Add two new HBA configuration options: radiusfallbackservers and
radiusfallbackports; both lists parsed with SplitIdentifierString (à la
listen_addresses).


Why add new GUCs for that?  Can't we just redefine radiusserver as a list
of servers to try in sequence, and similarly split radiusport into a list?

Or maybe better, rename it radius_servers.  But what you have here seems
weird, and it certainly doesn't follow the precedent of what we did when
we replaced listen_address with listen_addresses.


If we were adding RADIUS support today, this would be the best option. 
But seeing that we still only support one server today, this seems like 
a backwards incompatibility which practically 100% of users won't 
benefit from.  But I don't care much either way.  If we think breaking 
compatibility here is acceptable, I'd say we go for radius_servers and 
radius_ports.



.m


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Tomas Vondra

Hi,

On 08/19/2015 01:55 PM, Kohei KaiGai wrote:

  Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028
width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
Rows Removed by Join Filter: 127853313
->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=73252.300..79017.420 rows=72001237 loops=1)
  Sort Key: ws1.ws_order_number
  Sort Method: quicksort  Memory: 7083296kB
  ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237
loops=1)
->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=65095.655..128885.811 rows=904010978 loops=1)
  Sort Key: ws2.ws_order_number
  Sort Method: quicksort  Memory: 7083296kB
  ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237
loops=1)
  Planning time: 0.232 ms
  Execution time: 530176.521 ms
(14 rows)


So, even if we allows nodeHash.c to allocate hash buckets larger than
1GB, its initial size may be determined carefully.
Probably, 1GB is a good starting point even if expanded later.


I'm not sure I understand what the problem is here. Could you elaborate?

The initial size of the hash table is determined using the estimate, and 
if we overestimate it will create more buckets (i.e. consuming more 
memory) and/or start batching (which might be unnecessary).


But I don't really see any more careful way to do this, without 
penalizing the cases where the estimate is actually correct - e.g. by 
starting with much smaller buckets (and then resizing the hash table, 
which is not free). Or by starting without batching, betting that we 
won't actually need it.


I think it'll be very difficult to get those working without causing 
real trouble to cases where we actually do have good estimates (and 
those are the vast majority of queries).


But both of those are features, and we're dealing with a bug fix here.


kind regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Tomas Vondra

Hello KaiGai-san,

On 08/19/2015 03:19 PM, Kohei KaiGai wrote:

Unless we have no fail-safe mechanism when planner estimated too
large number of tuples than actual needs, a strange estimation will
consume massive amount of RAMs. It's a bad side effect.
My previous patch didn't pay attention to the scenario, so needs to
revise the patch.


I agree we need to put a few more safeguards there (e.g. make sure we 
don't overflow INT when counting the buckets, which may happen with the 
amounts of work_mem we'll see in the wild soon).


But I think we should not do any extensive changes to how we size the 
hashtable - that's not something we should do in a bugfix I think.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning

2015-08-19 Thread Amit Langote
On 2015-08-19 PM 09:23, Simon Riggs wrote:
 On 18 August 2015 at 11:30, Amit Langote langote_amit...@lab.ntt.co.jp
 wrote:
 

 There is no need to define tuple routing triggers. CopyFrom() and
 ExecInsert() determine target partition just before performing
 heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and
 constraints (on parent) are executed for tuple before being routed to a
 partition. If no partition can be found, it's an error.

 Because row-level AFTER triggers need to save ItemPointers in trigger
 event data and defining triggers on partitions (which is where tuples
 really go) is not allowed, I could not find a straightforward way to
 implement them. So, perhaps we should allow (only) row-level AFTER
 triggers on partitions or think of modifying trigger.c to know about this
 twist explicitly.

 
 I think tables will eventually need FK support; its not sustainable as a
 long term restriction, though perhaps its something we can do in a later
 patch.
 

Sure. Solving the row-level AFTER trigger problem should hopefully open up
the possibility of partitioned-table-as-FK-rel implementation.

 You haven't specified what would happen if an UPDATE would change a row's
 partition. I'm happy to add this to the list of restrictions by saying that
 the partition key cannot be updated.
 

UPDATEs that change a row's partition would cause error. I haven't
implemented that yet but will do it that way in the next patch.

By the last sentence, do you mean only UPDATEs to the partition key that
cause rows to jump partitions or simply any UPDATEs to the partition key?

 We'll need regression tests that cover each restriction and docs that
 match. This is not something we should leave until last. People read the
 docs to understand the feature, helping them to reach consensus. So it is
 for you to provide the docs before, not wait until later. I will begin a
 code review once you tell me docs and tests are present. We all want the
 feature, so its all about the details now.
 

Sorry, should have added tests and docs already. I will add them in the
next version of the patch. Thanks for being willing to review.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Using quicksort for every external sort run

2015-08-19 Thread Peter Geoghegan
I'll start a new thread for this, since my external sorting patch has
now evolved well past the original quicksort with spillover
idea...although not quite how I anticipated it would. It seems like
I've reached a good point to get some feedback.

I attach a patch series featuring a new, more comprehensive approach
to quicksorting runs during external sorts. What I have now still
includes quicksort with spillover, but it's just a part of a larger
project. I am quite happy with the improvements in performance shown
by my testing, which I go into below.

Controversy
=

A few weeks ago, I did not anticipate that I'd propose that
replacement selection sort be used far less (only somewhat less, since
I was only somewhat doubtful about the algorithm at the time). I had
originally planned on continuing to *always* use it for the first run,
to make quicksort with spillover possible (thereby sometimes
avoiding significant I/O by not spilling most tuples), but also to
make cases always considered sympathetic to replacement selection
continue to happen. I thought that second or subsequent runs could
still be quicksorted, but that I must still care about this latter
category, the traditional sympathetic cases. This latter category is
mostly just one important property of replacement selection: even
without a strong logical/physical correlation, the algorithm tends to
produce runs that are about twice the size of work_mem. (It's also
notable that replacement selection only produces one run with mostly
presorted input, even where input far exceeds work_mem, which is a
neat trick.)
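
To make that run-length property concrete, here is a toy Python model of
replacement selection (purely illustrative, nothing like tuplesort.c's
actual implementation): with random input the runs come out at roughly
twice the heap size, and fully presorted input collapses into a single run.

import heapq, itertools, random

def replacement_selection(items, memory):
    """Split 'items' into sorted runs using a heap of 'memory' entries."""
    it = iter(items)
    heap = [(0, v) for v in itertools.islice(it, memory)]
    heapq.heapify(heap)
    runs, current, run = [], [], 0

    def emit(r, v):
        nonlocal current, run
        if r != run:              # the heap has moved on to the next run
            runs.append(current)
            current, run = [], r
        current.append(v)

    for v in it:
        r, smallest = heapq.heappop(heap)
        emit(r, smallest)
        # v can join the current run only if it sorts after the value just
        # written out; otherwise it is tagged for the following run.
        heapq.heappush(heap, (run if v >= smallest else run + 1, v))
    while heap:
        r, smallest = heapq.heappop(heap)
        emit(r, smallest)
    if current:
        runs.append(current)
    return runs

random.seed(42)
mem = 1000
runs = replacement_selection([random.random() for _ in range(100000)], mem)
print([len(r) for r in runs[:4]])                        # run lengths around 2 * mem
print(len(replacement_selection(range(100000), mem)))    # presorted input: 1 run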

I wanted to avoid controversy, but the case for the controversy is too
strong for me to ignore: despite these upsides, replacement selection
is obsolete, and should usually be avoided.

Replacement selection sort still has a role to play in making
quicksort with spillover possible (when a sympathetic case is
*anticipated*), but other than that it seems generally inferior to a
simple hybrid sort-merge strategy on modern hardware. By modern
hardware, I mean anything manufactured in at least the last 20 years.
We've already seen that the algorithm's use of a heap works badly with
modern CPU caches, but that is just one factor contributing to its
obsolescence.

The big selling point of replacement selection sort in the 20th
century was that it sometimes avoided multi-pass sorts as compared to
a simple sort-merge strategy (remember when tuplesort.c always used 7
tapes? When you need to use 7 actual magnetic tapes, rewinding is
expensive and in general this matters a lot!). We all know that memory
capacity has grown enormously since then, but we must also consider
another factor: At the same time, a simple hybrid sort-merge
strategy's capacity to more or less get the important detail here
right -- to avoid a multi-pass sort -- has increased quadratically
(relative to work_mem/memory capacity). As an example, testing shows
that for a datum tuplesort that requires about 2300MB of work_mem to
be completed as a simple internal sort this patch only needs 30MB to
just do one pass (see benchmark query below). I've mostly regressed
that particular property of tuplesort (it used to be less than 30MB),
but that's clearly the wrong thing to worry about for all kinds of
reasons, probably even in the unimportant cases now forced to do
multiple passes.

Multi-pass sorts
-

I believe, in general, that we should consider a multi-pass sort to be
a kind of inherently suspect thing these days, in the same way that
checkpoints occurring 5 seconds apart are: not actually abnormal, but
something that we should regard suspiciously. Can you really not
afford enough work_mem to only do one pass? Does it really make sense
to add far more I/O and CPU costs to avoid that other tiny memory
capacity cost?

In theory, the answer could be yes, but it seems highly unlikely.
Not only is very little memory required to avoid a multi-pass merge
step, but as described above the amount required grows very slowly
relative to linear growth in input. I propose to add a
checkpoint_warning style warning (with a checkpoint_warning style GUC
to control it). ISTM that these days, multi-pass merges are like
saving $2 on replacing a stairwell light bulb, at the expense of
regularly stumbling down the stairs in the dark. It shouldn't matter
if you have a 50 terabyte decision support database or if you're
paying Heroku a small monthly fee to run a database backing your web
app: simply avoiding multi-pass merges is probably always the most
economical solution, and by a wide margin.

Note that I am not skeptical of polyphase merging itself, even though
it is generally considered to be a complementary technique to
replacement selection (some less formal writing on external sorting
seemingly fails to draw a sharp distinction). Nothing has changed
there.

Patch, performance
===

Let's focus on a multi-run sort, that does not use quicksort with
spillover, since 

Re: [HACKERS] Badly broken logic in plpython composite-type handling

2015-08-19 Thread Joshua D. Drake


On 08/19/2015 05:05 PM, Tom Lane wrote:


Barring objections, I propose to commit and back-patch this.  The crash
can be demonstrated back to 9.1.

regards, tom lane


+1


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing I'm offended is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: function parse_ident

2015-08-19 Thread Tom Lane
Jim Nasby jim.na...@bluetreble.com writes:
 Don't say parse names for things other than tables.  Only a minority
 of the types of objects used in the database have names that meet this
 specification.

 Really? My impression is that almost everything that's not a shared 
 object allows for a schema...

Tables meet this naming spec.  Columns, functions, operators, operator
classes/families, collations, constraints, and conversions do not (you
need more data to name them).  Schemas, databases, languages, extensions,
and some other things also do not, because you need *less* data to name
them.  Types also don't really meet this naming spec, because you need to
contend with special cases like int[] or timestamp with time zone.
So this proposal doesn't seem very carefully thought-through to me,
or at least the use case is much narrower than it could be.

Also, if "object does not exist" isn't supposed to be an error case,
what of "name is not correctly formatted"?  It seems a bit arbitrary
to me to throw an error in one case but not the other.
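
(As a side note, a rough Python sketch of the splitting behaviour under
discussion, hypothetical code rather than the proposed patch: unquoted
parts fold to lower case, double quotes are honoured, and badly formed
input raises the "name is not correctly formatted" error mentioned above.)

def parse_ident(name):
    """Split a dotted, possibly double-quoted name into its parts (rough sketch)."""
    parts, i, n = [], 0, len(name)
    while i < n:
        if name[i] == '"':                       # quoted identifier
            i += 1
            buf = []
            while i < n and not (name[i] == '"' and
                                 (i + 1 >= n or name[i + 1] != '"')):
                if name[i] == '"':               # doubled quote inside quotes
                    i += 1
                buf.append(name[i])
                i += 1
            if i >= n:
                raise ValueError("name is not correctly formatted")
            i += 1                               # skip the closing quote
            parts.append(''.join(buf))
        else:                                    # unquoted: read up to the next dot
            j = name.find('.', i)
            j = n if j < 0 else j
            part = name[i:j].strip()
            if not part:
                raise ValueError("name is not correctly formatted")
            parts.append(part.lower())           # unquoted names fold to lower case
            i = j
        if i < n:
            if name[i] != '.':
                raise ValueError("name is not correctly formatted")
            i += 1
    return parts

print(parse_ident('MySchema."My.Table"'))        # ['myschema', 'My.Table']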

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread David Rowley
On 12 June 2015 at 02:40, Kohei KaiGai kai...@kaigai.gr.jp wrote:

 2015-06-11 23:28 GMT+09:00 Robert Haas robertmh...@gmail.com:
  On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai kai...@ak.jp.nec.com
 wrote:
  The attached patch replaces this palloc0() by MemoryContextAllocHuge()
 + memset().
  Indeed, this hash table is constructed towards the relation with
 nrows=119994544,
  so, it is not strange even if hash-slot itself is larger than 1GB.
 
  You forgot to attach the patch, I think.
 
 Oops, I forgot to attach indeed.

   It looks to me like the size
  of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
  That's a lot of buckets, but maybe not unreasonably many if you've got
  enough memory.
 
 EXPLAIN says, this Hash node takes underlying SeqScan with
 119994544 (~119 million) rows, but it is much smaller than my
 work_mem setting.


I've just run into this problem while running a TPC-H benchmark of 100GB,
on a machine with 64GB of RAM.
When attempting to run Q21 with a work_mem of 10GB I'm getting:
 ERROR:  invalid memory alloc request size 1073741824

Setting work_mem to 1GB or less gets the query running.

I've patched the code with your patch Kohei, and it's now working.

Thought I'd better post this just in case this gets forgotten about.
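
For what it's worth, the arithmetic behind that error size is easy to
reproduce; a back-of-the-envelope Python sketch, assuming 8-byte pointers
and the 9.5 default of NTUP_PER_BUCKET = 1, using the row count from the
plan reported earlier in this thread:

MaxAllocSize = 0x3fffffff        # palloc()'s limit: 1GB - 1 byte

ntuples = 119994544              # rows under the Hash node in the earlier report
NTUP_PER_BUCKET = 1              # 9.5 default
nbuckets = 1
while nbuckets < ntuples / NTUP_PER_BUCKET:
    nbuckets *= 2                # nbuckets is inflated to the next power of two

request = nbuckets * 8           # one 8-byte bucket pointer per bucket
print(nbuckets, request)         # 134217728 1073741824
print(request > MaxAllocSize)    # True -> "invalid memory alloc request size 1073741824"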

Thanks

David

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent

2015-08-19 Thread Amit Kapila
On Wed, Aug 19, 2015 at 6:06 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 I wrote:
  Just thinking about this ... I wonder why we need to call
  TransactionIdIsInProgress() at all rather than believing the answer from
  the snapshot?  Under what circumstances could
TransactionIdIsInProgress()
  return true where XidInMVCCSnapshot() had not?

 I experimented with the attached patch, which replaces
 HeapTupleSatisfiesMVCC's calls of TransactionIdIsInProgress with
 XidInMVCCSnapshot, and then as a cross-check has all the "return false"
 exits from XidInMVCCSnapshot assert !TransactionIdIsInProgress().
 The asserts did not fire in the standard regression tests nor in a
 pgbench run, which is surely not proof of anything but it suggests
 that I'm not totally nuts.

 I wouldn't commit the changes in XidInMVCCSnapshot for real, but
 otherwise this is a possibly committable patch.


I think one case where the patch can impact is for aborted transactions.
In TransactionIdIsInProgress(), we check for aborted transactions before
consulting pg_subtrans whereas with patch it will consult pg_subtrans
without aborted transaction check.  Now it could be better to first check
pg_subtrans if we found that the corresponding top transaction is
in-progress as that will save extra pg_clog lookup, but I just mentioned
because that is one difference that I could see with this patch.

Another minor point is, I think we should modify function level comment
for XidInMVCCSnapshot() where it says that this applies to known-
committed XIDs which will no longer be true after this patch.

Apart from above, the patch looks good.
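
(To spell out the ordering being discussed, a toy Python model follows; it
is purely illustrative and not the backend's real data structures. The
point is that the snapshot test uses only backend-local state, so doing it
first lets most calls skip the shared-state lookup entirely.)

def xid_in_mvcc_snapshot(xid, snap):
    """snap = (xmin, xmax, running): xids at or beyond xmax, or still listed
       as running, are invisible to this snapshot."""
    xmin, xmax, running = snap
    if xid >= xmax:
        return True
    if xid < xmin:
        return False
    return xid in running

def inserting_xid_visible(xid, snap, committed):
    # Cheap, backend-local check first: if the xid is "in" our snapshot we
    # treat it as in progress and never touch shared state.
    if xid_in_mvcc_snapshot(xid, snap):
        return False
    # Only then consult commit status (the pg_clog lookup in the backend).
    return committed.get(xid, False)

snap = (100, 110, {103, 105})
print(inserting_xid_visible(103, snap, {103: True}))   # False: in our snapshot
print(inserting_xid_visible(101, snap, {101: True}))   # True: committed before it
print(inserting_xid_visible(120, snap, {120: True}))   # False: started after xmax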


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread 'Victor Wagner *EXTERN*'
On 2015.08.18 at 08:32:28 +, Albe Laurenz wrote:

 I wonder how useful this is at the present time.
 
 If the primary goes down and the client gets connected to the standby,
 it would have read-only access there.  Most applications wouldn't cope
 well with that.

It is supposed that somebody (either system administrator or some
cluster management software) have noticed failure of master and promoted
one of the standbys to master.

So, clients have only to find out which cluster node serves as master
just now.

Idea is that we don't need any extra administration actions such as IP
migration to do it. Clients have list of alternate servers and discover
which one to work with by trial and error.

I consider in my proposal following situations:

1. Warm standby - doesn't accept client connection at all unless
promoted to master.

2. Hot standby - we have two types of clients - one for which readonly
access is sufficient, and other that need to connect only to master.
In this case intention to write is explicitly stated in the connect
string (readonly=false) and connect procedure would check if node it
tries to connect allowed write.

It seems that most people discussing in this thread think in millisecond
time intervals (failure and immediate reconnect).

I was thinking about much longer time intervals - it would probably take
seconds for cluster management software to notice server failure and
promote backup server to master, it might be possible for application to
spend minute or so trying to reconnect, but it would take
hours to change connect string on clients - it would require visit of
support enginer to each client terminal, if we are thinking of
distributed OLTP system such as point-of-sale network with thick
clients.


 
  host=main-server host=standby1 host=standby2 port=5432 dbname=database
 
 It seems a bit arbitrary to require that all servers use the same port.

 Maybe parameters like host2, port2, host3, port3 etc. might be better.

I've thought about this approach. But PostgreSQL administration guide
insists that all servers in the cluster should have as identical
configuration as possible to simplify administration. 

Moreover I've seldom seen configurations where postgresql is
accepting connections on a non-default port.

Using host1, host2 etc would have unintended connotations, such as this
is the first, main server. I think that client should treat all given
servers as equal and let cluster administration choose which one
would accept connection.

-- 
Victor Wagner vi...@wagner.pp.ru



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Amit Kapila
On Wed, Aug 19, 2015 at 12:21 PM, Victor Wagner *EXTERN* vi...@wagner.pp.ru
wrote:

 On 2015.08.18 at 08:32:28 +, Albe Laurenz wrote:

  I wonder how useful this is at the present time.
 
  If the primary goes down and the client gets connected to the standby,
  it would have read-only access there.  Most applications wouldn't cope
  well with that.

 It is supposed that somebody (either system administrator or some
 cluster management software) have noticed failure of master and promoted
 one of the standbys to master.

 So, clients have only to find out which cluster node serves as master
 just now.

 Idea is that we don't need any extra administration actions such as IP
 migration to do it. Clients have list of alternate servers and discover
 which one to work with by trial and error.

 I consider in my proposal following situations:

 1. Warm standby - doesn't accept client connection at all unless
 promoted to master.

 2. Hot standby - we have two types of clients - one for which readonly
 access is sufficient, and other that need to connect only to master.
 In this case intention to write is explicitly stated in the connect
 string (readonly=false) and connect procedure would check if node it
 tries to connect allowed write.

 It seems that most people discussing in this thread think in millisecond
 time intervals (failure and immediate reconnect).

Why not have this as a separate parameter (*_timeout or something like
that)?



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Victor Wagner
On 2015.08.19 at 08:28:32 +0530, Amit Kapila wrote:

 On Tue, Aug 18, 2015 at 9:48 AM, Victor Wagner vi...@wagner.pp.ru wrote:
 
 
  Behavior
  
 
  If PQconnectdb encounters connect string with multiple hosts specified,
  it attempts to establish connection with all these hosts simultaneously,
  and begins to work with server which responds first, unless
  loadbalancing parameter is true.
 
 
 I think here you are mixing the behaviour for load balancing solution and
 failover solution.  It seems to me that for client-side failover solution
 (which is also known as Transparent Application Failover), the connection
 attempt to second server should be done after the first connection is
 broken as that provide more flexibility.

I think that failover procedure should begin before first connection is
ever established.

When client application starts, it has no way of knowing current state
of the server cluster - which of servers is working as master now.

Application uses connect string, placed into its configuration file
long time ago, and changing this configuration might require special
permissions which the user of the application doesn't have. But the user typically knows
how to restart application or reboot his terminal. So, for the
spatially distributed networks with  thick clients we can handle only 
initial connections, not connection resets. At least application author
always can implement restoration of connection as closing old
connection and establishing new.

So, when application first establishes connection it have to be prepared
to connect any of alternate hosts.

I don't think that making connections in sequential order provides much
flexibility. But it can greatly increase startup time, because connect
to host which is physically down fails after significant timeout. While
application waits for first connect to fail, it might complete session
initialization with working server several times.

Of course, connecting to servers in sequential order is simpler to
implement, and allows even more mixing of load balancing with failover,
because code would be same.


 Although both ideas (load balancing and failover) seems worth discussing,
 they are separate features and can be worked on separately.  It will be
 easier to sort out the details as well that way.

Really load balancing comes almost for free if we implement connect to
alternate server for failover purposes. I'm not sure that in case of hot
standby, where only readonly transactions can be loadbalanced,
loadbalancing is very useful. And I included it in the proposal only
because it is very cheap to implement in this form.

 
 With Regards,
 Amit Kapila.
 EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Victor Wagner *EXTERN*
On 2015.08.19 at 12:29:51 +0530, Amit Kapila wrote:

  It seems that most people discussing in this thread think in millisecond
  time intervals (failure and immediate reconnect).
 
 Why not have this as a separate parameter (*_timeout or something like
 that)?

Because it is not in the software configuration. It is in the people
heads. Or may be in the organizational configuration of the environments
we are talking about.

Each of us imagining some use-case for discussed feature. And these
cases are completely different, and have different typical time
interval.

I haven't explicitely stated my use cases in the proposal. So people
thinking in terms of their use cases, and this is very significant
feedback for me.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Amit Kapila
On Wed, Aug 19, 2015 at 12:35 PM, Victor Wagner vi...@wagner.pp.ru wrote:

 On 2015.08.19 at 08:28:32 +0530, Amit Kapila wrote:

  On Tue, Aug 18, 2015 at 9:48 AM, Victor Wagner vi...@wagner.pp.ru
 wrote:
  
  
   Behavior
   
  
   If PQconnectdb encounters connect string with multiple hosts specified,
   it attempts to establish connection with all these hosts
 simultaneously,
   and begins to work with server which responds first, unless
   loadbalancing parameter is true.
  
  
  I think here you are mixing the behaviour for load balancing solution and
  failover solution.  It seems to me that for client-side failover solution
  (which is also known as Transparent Application Failover), the connection
  attempt to second server should be done after the first connection is
  broken as that provide more flexibility.

 I think that failover procedure should begin before first connection is
 ever established.


As far as I understand, failover gets initiated once the master server goes
down or is not accessible due to some reason, so for such cases if you
have the connection to both the servers then it might not work.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] checkpointer continuous flushing

2015-08-19 Thread Fabien COELHO



Sure, I think what can help here is a testcase/'s (in form of script file
or some other form, to test this behaviour of patch) which you can write
and post here, so that others can use that to get the data and share it.


Sure... note that I already did that on this thread, without any echo... 
but I can do it again...


Tests should be run on a dedicated host. If it has n cores, I suggest to 
share them between postgres checkpointer  workers and pgbench threads so 
as to avoid thread competition to use cores. With 8 cores I used up to 2 
threads  4 clients, so that there is 2 core left for the checkpointer and 
other stuff (i.e. I also run iotop  htop in parallel...). Although it may 
seem conservative to do so, I think that the point of the test is to 
exercise checkpoints and not to test the process scheduler of the OS.


Here are the latest version of my test scripts:

 (1) cp_test.sh name test

Run test with setup name. Currently it runs 4000 seconds pgbench with 
the 4 possible on/off combinations for sorting & flushing, after some 
warmup. The 4000 second is chosen so that there are a few checkpoint 
cycles. For larger checkpoint times, I suggest to extend the run time to 
see at least 3 checkpoints during the run.


More test settings can be added to the 2 cases. Postgres settings,
especially shared_buffers, should be set to a pertinent value wrt the 
memory of the test host.


The test run with postgres version found in the PATH, so ensure that the 
right version is found!


 (2) cp_test_count.py one-test-output.log

For rate limited runs, look at the final figures and compute the number of 
late & skipped transactions. This can also be done by hand.


 (3) avg.py

For full speed runs, compute stats about per second tps:

  sh> grep 'progress:' one-test-output.log | cut -d' ' -f4 | \
./avg.py --limit=10 --length=4000
  warning: 633 missing data, extending with zeros
  avg over 4000: 199.290575 ± 512.114070 [0.00, 0.00, 4.00, 
5.00, 2280.90]
  percent of values below 10.0: 82.5%

The figures I reported are the 199 (average tps), 512 (standard deviation 
on per second figures), 82.5% (percent of time below 10 tps, aka postgres 
is basically unresponsive). In brackets, the min q1 median q3 and max tps 
seen in the run.
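
(If it helps to read those figures, a small Python sketch of how such a
summary can be derived from the per-second tps values; illustrative only,
not the avg.py attached above:)

import statistics

def summarize(tps, limit=10.0):
    s = sorted(tps)
    n = len(s)
    pct = lambda p: s[min(n - 1, int(p * n))]          # crude percentile
    below = 100.0 * sum(v < limit for v in s) / n      # % of seconds under 'limit' tps
    return (statistics.mean(s), statistics.pstdev(s), below,
            s[0], pct(0.25), pct(0.50), pct(0.75), s[-1])

print(summarize([0.0, 0.0, 2.0, 5.0, 300.0, 800.0, 1500.0, 2280.9]))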



Of course, that is not mandatory to proceed with this patch, but still can
help you to prove your point as you might not have access to different
kind of systems to run the tests.


I agree that more tests would be useful to decide which default value for 
the flushing option is the better. For Linux, all tests so far suggest 
on is the best choice, but for other systems that use posix_fadvise, it 
is really an open question.


Another option would be to give me a temporary access for some available 
host, I'm used to running these tests...


--
Fabien.

cp_test.sh
Description: Bourne shell script
#! /usr/bin/env python
#
# $Id: cp_test_counts.py 316 2015-05-31 20:29:44Z fabien $
#
# show the % of skipped and over-the-limit transactions from pgbench output.
#

import re
import fileinput
d = {}

for line in fileinput.input():
	if line.find("number of transactions ") == 0:
		for kw in ['processed', 'skipped', 'limit']:
			if line.find(kw + ": ") != -1:
				d[kw] = int(re.search(kw + ": (\d+)", line).group(1))
		if len(d) == 3:
			print("%f" % (100.0 * (d['skipped'] + d['limit']) / (d['processed'] + d['skipped'])))
			d = {}
#! /usr/bin/env python
# -*- coding: utf-8 -*-
#
# $Id: avg.py 1226 2015-08-15 19:14:57Z coelho $
#

import argparse
ap = argparse.ArgumentParser(description='show stats about data: count average stddev [min q1 median q3 max]...')
ap.add_argument('--median', default=True, action='store_true',
help='compute median and quartile values')
ap.add_argument('--no-median', dest='median', default=True,
action='store_false',
help='do not compute median and quartile values')
ap.add_argument('--more', default=False, action='store_true',
help='show some more stats')
ap.add_argument('--limit', type=float, default=None,
help='set limit for counting below limit values')
ap.add_argument('--length', type=int, default=None,
help='set expected length, assume 0 if beyond')
ap.add_argument('file', nargs='*', help='list of files to process')
opt = ap.parse_args()

# option consistency
if opt.limit:
	opt.more = True
if opt.more:
	opt.median = True

# reset arguments for fileinput
import sys
sys.argv[1:] = opt.file

import fileinput

n, skipped, vals = 0, 0, []
k, vmin, vmax = None, None, None
sum1, sum2 = 0.0, 0.0

for line in fileinput.input():
	try:
		v = float(line)
		if opt.median: # keep track only if needed
			vals.append(v)
		if k is None: # first time
			k, vmin, vmax = v, v, v
		else: # next time
			vmin = min(vmin, v)
			vmax = max(vmax, v)
		n += 1
		vmk = v - k
		sum1 += vmk
		sum2 += vmk * vmk
	except ValueError: # float conversion failed
		skipped += 1

if opt.length:
	assert n > 0, "some data seen"
	missing = int(opt.length) - len(vals)

Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Victor Wagner
On 2015.08.19 at 12:42:45 +0900, Tatsuo Ishii wrote:

 I wonder how extended protocol is handled by this proposal. Suppose
 load balacing mode is enabled. PQprepare is executed on standby1. Then
 PQexecPrepared gets called. This may be executed on standby2, which
 will fail because there's no prepared statement created by the former
 PQprepare call.

Here we are discussing load-balancing on the client level, not on the
statement level.

Suppose that we have 100 readonly clients and 3 standby servers + master.
If all clients specify all four servers in the their connect strings,
and connect randomly to them, each server would have approximately 25
clients.

But once connection is established, each client works with one
server (at least until communication failure occurs and it would call
PQreset. In this case it has to reprepare statements anyway).



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Albe Laurenz
Victor Wagner wrote:
 I wonder how useful this is at the present time.

 If the primary goes down and the client gets connected to the standby,
 it would have read-only access there.  Most applications wouldn't cope
 well with that.

 It is supposed that somebody (either system administrator or some
 cluster management software) have noticed failure of master and promoted
 one of the standbys to master.
 
 So, clients have only to find out which cluster node serves as master
 just now.
 
 Idea is that we don't need any extra administration actions such as IP
 migration to do it. Clients have list of alternate servers and discover
 which one to work with by trial and error.

Yes, but that will only work reliably if the (read-only) standby does not
allow connections before it is promoted.

 I consider in my proposal following situations:
 
 1. Warm standby - doesn't accept client connection at all unless
 promoted to master.
 
 2. Hot standby - we have two types of clients - one for which readonly
 access is sufficient, and other that need to connect only to master.
 In this case intention to write is explicitly stated in the connect
 string (readonly=false) and connect procedure would check if node it
 tries to connect allowed write.

I think that these are both valid use cases.

And as Robert said, there are people out using BDR or other proprietary
multi-master solutions, so there might well be an audience for this feature.

 host=main-server host=standby1 host=standby2 port=5432 dbname=database

 It seems a bit arbitrary to require that all servers use the same port.
 Maybe parameters like host2, port2, host3, port3 etc. might be better.

 I've thought about this approach. But PostgreSQL administration guide
 insists that all servers in the cluster should have as identical
 configuration as possible to simplify administration.

 Moreover I've seldom seen configurations where postgresql is
 accepting connections on a non-default port.

We do it all the time.

 Using host1, host2 etc would have unintended connotations, such as this
 is the first, main server. I think that client should treat all given
 servers as equal and let cluster administration choose which one
 would accept connection.

I don't think that my idea of host, host3 is very appealing myself,
but I still don't like your original proposal of having multiple host
parameters.

One problem with that is that this syntax is already allowed, but
your proposal would silently change the semantics.
Today, if you have multiple host parameters, the last one wins.
So with your modification in place, some connect strings that work today
would start behaving in unexpected ways.

Maybe a better idea would be:
  host=db1.myorg.com,db2.myorg.com port=5432,2345

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning

2015-08-19 Thread Amit Langote
On 2015-08-18 PM 10:43, David Fetter wrote:

 After the first command is done, the second command would take exclusive
 lock on table_name, scan the table to check if it contains any values
 outside the boundaries defined by FOR VALUES clause defined previously,
 throw error if so, mark as valid partition of parent if not.
 
 One small change to make this part more efficient:
 
 1. Take the access exclusive lock on table_name.
 2. Check for a matching constraint on it.
 3. If it's there, mark it as a valid partition.
 4. If not, check for values outside the boundaries as above.
 

That's an interesting idea. Thanks!

By a matching constraint, I guess you mean a 'valid' constraint from which
the declared partition constraint can be proven to follow. For (a simple)
example, from a CHECK (a >= 100 AND a < 150) on table_name, the partition
constraint implied by FOR VALUES START (100) END (200) can be assumed to hold.
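
(A trivial sketch of that implication test, hypothetical code rather than
anything in the patch: the existing CHECK range has to be contained in the
declared partition bound.)

def check_implies_bound(check, bound):
    """check = (lo, hi) from an existing CHECK (lo <= a AND a < hi);
       bound = (start, end) from FOR VALUES START .. END."""
    (clo, chi), (blo, bhi) = check, bound
    return blo <= clo and chi <= bhi

print(check_implies_bound((100, 150), (100, 200)))   # True: heap scan can be skipped
print(check_implies_bound((50, 150), (100, 200)))    # False: must scan and validate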

 Should this be a *valid* constraint?  Perhaps that should be
 parameterized, as I'm not yet seeing a compelling argument either
 direction.  I'm picturing something like:
 
 ALTER TABLE table_name SET VALID PARTITION OF parent [TRUST]
 
 where TRUST would mean that an existing constraint need not be VALID.
 

Hmm, I'd think this step must be able to assert the partition constraint
beyond any doubt. If the DBA added the constraint and marked it invalid,
she should first VALIDATE the constraint to make it valid by performing
whatever steps necessary before. IOW, a full heap scan at least once is
inevitable (the reason why we might want to make this a two step process
at all). Am I missing something?

 
 5. Detach partition

 ALTER TABLE partitioned_table
 DETACH PARTITION partition_name [USING table_name]

 This removes partition_name as partition of partitioned_table. The table
 continues to exist with the same name or 'table_name', if specified.
 pg_class.relispartition is set to false for the table, so it behaves like
 a normal table.
 
 Could this take anything short of an access exclusive lock on the
 parent?
 

Yes, both the step 1 of ATTACH command and DETACH command take access
exclusive lock on the parent. They are rather quick metadata changes, so
should not stall others significantly, I think.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread ''Victor Wagner *EXTERN*' *EXTERN*'
On 2015.08.19 at 07:15:30 +, Albe Laurenz wrote:

  Idea is that we don't need any extra administration actions such as IP
  migration to do it. Clients have list of alternate servers and discover
  which one to work with by trial and error.
 
 Yes, but that will only work reliably if the (read-only) standby does not
 allow connections before it is promoted.

It would just take a bit more time for client and a bit more load for
server - to make sure that this connection is read-write by
issuing

show transaction_read_only 

statement before considering connection useful.
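
(For comparison, roughly what an application can already do on the client
side today; a psycopg2 sketch with hypothetical host names, approximating
the behaviour proposed for libpq:)

import psycopg2

def connect_rw(hosts, dbname, user, port=5432, timeout=5):
    """Try each host in turn; keep the first connection that is read-write."""
    for host in hosts:
        try:
            conn = psycopg2.connect(host=host, port=port, dbname=dbname,
                                    user=user, connect_timeout=timeout)
        except psycopg2.OperationalError:
            continue                      # node down or not accepting connections
        cur = conn.cursor()
        cur.execute("show transaction_read_only")
        if cur.fetchone()[0] == "off":    # this node accepts writes
            return conn
        conn.close()                      # hot standby: keep looking for the master
    raise RuntimeError("no read-write server available")

# conn = connect_rw(["main-server", "standby1", "standby2"], "mydb", "app")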

 
 And as Robert said, there are people out using BDR or other proprietary
 multi-master solutions, so there might well be an audience for this feature.
 
Unfortunately I have no experience with such solutions, so I'd greatly
appreciate feedback from those people.

I've modelled my proposal after another proprietary solution  - Oracle
RAC.


 One problem with that is that this syntax is already allowed, but
 your proposal would silently change the semantics.
 Today, if you have multiple host parameters, the last one wins.
 So with your modification in place, some connect strings that work today
 would start behaving in unexpected ways.

This is serious argument. But what the use case of these connect strings
now? 

It seems to me that in most cases last host in the connect string would
be only host which accepts connections, so it wins anyway

 
 Maybe a better idea would be:
   host=db1.myorg.com,db2.myorg.com port=5432,2345

I've tried not to introduce new delimiters. But this syntax definitely
have some advantages. At least it allows to specify host-port pairs as
two parallel lists.

Other idea - allow to specify host-port pair as argument of host
parameter. 

  host=db1.myorg.com:5432

It is consistent with URL syntax and system administrators are used to
it. And with a long list of hosts there is less chance of making an error,
as host and corresponding port come together.

But your variant allows to handle hostaddr parameter same way as host
and port.
-- 
Victor Wagner vi...@wagner.pp.ru


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Make HeapTupleSatisfiesMVCC more concurrent

2015-08-19 Thread Simon Riggs
On 19 August 2015 at 00:49, Tom Lane t...@sss.pgh.pa.us wrote:

 Jeff Janes jeff.ja...@gmail.com writes:
  When we check a tuple for MVCC, it has to pass checks that the inserting
  transaction has committed, and that it committed before our snapshot
  began.  And similarly that the deleting transaction hasn't committed, or
  did so after our snapshot.

  XidInMVCCSnapshot is (or can be) very much cheaper
  than TransactionIdIsInProgress, because the former touches only local
  memory while the latter takes a highly contended lock and inspects shared
  memory.  We do the slow one first, but we could do the fast one first and
  sometimes short-circuit the slow one.  If the transaction is in our
  snapshot, it doesn't matter if it is still in progress or not.

  This was discussed back in 2013 (
 
 http://www.postgresql.org/message-id/CAMkU=1yy-YEQVvqj2xJitT1EFkyuFk7uTV_hrOMGyGMxpU=n...@mail.gmail.com
 ),
  and I wanted to revive it. The recent lwlock atomic changes haven't made
  the problem irrelevant.

  This patch swaps the order of the checks under some conditions.

 Just thinking about this ... I wonder why we need to call
 TransactionIdIsInProgress() at all rather than believing the answer from
 the snapshot?  Under what circumstances could TransactionIdIsInProgress()
 return true where XidInMVCCSnapshot() had not?

 I'm thinking maybe TransactionIdIsInProgress is only needed for non-MVCC
 snapshot types.


+1

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Freeze avoidance of very large table.

2015-08-19 Thread Masahiko Sawada
On Wed, Aug 19, 2015 at 1:28 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Aug 18, 2015 at 7:27 AM, Masahiko Sawada sawada.m...@gmail.com 
 wrote:
 I have encountered the much cases where pg_stat_statement,
 pgstattuples are required in production, so I basically agree with
 moving such extension into core.
 But IMO, the diagnostic tools for visibility map, heap (pageinspect)
 and so on, are a kind of debugging tool.

 Just because something might be required in production isn't a
 sufficient reason to put it in core.  Debugging tools, or anything
 else, can be required in production, too.

 Attached latest v11 patches, which is separated into 2 patches: frozen
 bit patch and diagnostic function patch.
 Moving diagnostic function into core is still under the discussion,
 but this patch puts such function into core because the diagnostic
 function for visibility map needs to be in core to execute regression
 test at least.

 As has been discussed recently, there are other ways to handle that.

The current regression test for VM just compares the total number of
all-visible and all-frozen bits in VM before and after
VACUUM, and doesn't check any particular bit in VM.
We could substitute it with an ANALYZE command using a large enough
sampling number and checking pg_class.relallvisible and pg_class.relallfrozen.

So another way is to put the diagnostic function for VM into a
contrib module (pg_freespacemap or pageinspect), and if we want to
use such a function in production, we can install that extension as in
the past.

Regards,

--
Masahiko Sawada


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Declarative partitioning

2015-08-19 Thread Amit Langote
On 2015-08-19 AM 02:57, Marc Mamin wrote:
 2. Creating a partition of a partitioned table

 CREATE TABLE table_name
 PARTITION OF partitioned_table_name
 FOR VALUES values_spec;

 Where values_spec is:

 listvalues: [IN] (val1, ...)

 
 Would it make sense to allow one complementary partition to the listvalues?  
 
 listvalues: [[NOT] IN] (val1, ...)
 
 I've thought a few times about moving data with some most common values to 
 dedicated partitions
 and keeping the rest in a separate one...
 

Thanks, that's definitely something to consider.

I have been thinking of a sort of default list partition for the rest of
values. Would you rather declare that with something like the below than
having to enumerate all the values in a NOT IN list? Or the NOT IN way is
more intuitive/friendly?

CREATE TABLE _rest PARTITION OF table_name FOR VALUES [ IN ] DEFAULT

Of course, at most one such partition would be allowed.

Thanks,
Amit



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Albe Laurenz
Victor Wagner wrote:
   Idea is that we don't need any extra administration actions such as IP
   migration to do it. Clients have list of alternate servers and discover
   which one to work with by trial and error.
 
  Yes, but that will only work reliably if the (read-only) standby does not
  allow connections before it is promoted.
 
 It would just take a bit more time for client and a bit more load for
 server - to make sure that this connection is read-write by
 issuing
 
 show transaction_read_only
 
 statement before considering connection useful.

That's not very comfortable, and a lot of middleware software won't easily
learn the trick.

But even without that use case I think that the feature is probably
worth the effort.

[about having multiple host parameters in the connection string]

  One problem with that is that this syntax is already allowed, but
  your proposal would silently change the semantics.
  Today, if you have multiple host parameters, the last one wins.
  So with your modification in place, some connect strings that work today
  would start behaving in unexpected ways.
 
 This is serious argument. But what the use case of these connect strings
 now?

 It seems to me that in most cases last host in the connect string would
 be only host which accepts connections, so it wins anyway

I'm not saying that it is particularly wide-spread and useful; it could
happen through careless editing of connection strings or by using a
connection service file entry
(http://www.postgresql.org/docs/current/static/libpq-pgservice.html)
and overriding the host parameter on the command line.

  Maybe a better idea would be:
host=db1.myorg.com,db2.myorg.com port=5432,2345
 
 I've tried not to introduce new delimiters. But this syntax definitely
 have some advantages. At least it allows to specify host-port pairs as
 two parallel lists.
 
 Other idea - allow to specify host-port pair as argument of host
 parameter.
 
   host=db1.myorg.com:5432
 
 It is consistent with URL syntax and system administrators are used to
 it. And with a long list of hosts there is less chance of making an error,
 as host and corresponding port come together.

I don't think that is very attractive as it confuses the distinction between
host and port.  What would you do with

   host=db1.myorg.com:2345 port=1234

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Victor Wagner
On 2015.08.19 at 12:55:15 +0530, Amit Kapila wrote:

  I think that failover procedure should begin before first connection is
  ever established.
 
 
 As far as I understand, failover gets initiated once the master server goes
 down or is not accessible due to some reason, so for such cases if you
 have the connection to both the servers then it might not work.

Master server might go down when client is not started yet. 
And when client starts up, it has to find out which server to connect
now.

Consider point-of-sale terminals, bank offices or anything else, which
do not work round the clock. Clerk comes to his workplace in the
morning, switches on terminal and inserts her smartcard to authorize
with server. She doesn't need to know what server name is and where it
is located. Either application finds the server automatically, or
support engineer has to be called to fix things.

Moreover, in some situations restart of application (or even client
terminal) is acceptable price for failover, as long as there is no need
to manually fix the configuration.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] allowing wal_level change at run time

2015-08-19 Thread Magnus Hagander
On Tue, Aug 18, 2015 at 7:46 PM, Andres Freund and...@anarazel.de wrote:

 On 2015-08-18 13:24:54 -0400, Peter Eisentraut wrote:
  On 8/18/15 12:35 PM, Robert Haas wrote:
   If archive_mode=on or max_wal_senders > 0, then we need at least
   wal_level=archive.  Otherwise wal_level=minimal is enough.
 
  Totally forgot about max_wal_senders.
 
  However, the thread I linked to earlier aimed for a different master
  plan (or if not, I'm aiming for it now).  There is camp 1, which wants
  to keep all the defaults the same, for performance or something like
  that.  And there is camp 2, which wants to have a replication-friendly
  setup by default.  Instead of fighting over this, your idea was to be
  able to switch between 1 and 2 easily (which means in particular without
  restarts).

 I don't think not requiring restarts is sufficient, having to twiddle a
 bunch of parameters manually still is a lot more effort than people see
 as necessary.

 The only reason I think changing the default is a good approach is that
 it's doable within a relatively short amount of time.

  But if we tie the effective wal_level to archive_mode or
  max_wal_senders, both of which are restart-only, then we haven't gained
  anything.  (We would have removed half a GUC parameter, effectively.
  Not bad, but not very exciting.)

 ISTM that it's not too hard to
 a) make archive_mode PGC_SIGHUP
 b) make wal_level PGC_SIGHUP
 c) automatically increase wal_level to logical whenever a logical
replication slot is defined

 it seems considerably harder to

 d) make max_wal_senders PGC_SIGHUP
 e) make max_replication_slots PGC_SIGHUP

 because they need shmem, locks, and everything.


 Therefore I propose something slightly different:

 We change the default of max_wal_senders, max_replication_slots, to some
 reasonably high setting and make wal_level an internal automagically
 determined variable. archive_mode = on automatically increases the level
 to what's now hot-standby.

 To deal with streaming replication, we automatically create slots for
 replicas, but, by default, *without* having them reserve WAL. The slot
 name would be determined in some automatic fashion (uuid or something)
 and stored in a new file in the data directory.  That allows us to
 increase the internal wal_level to hot_standby/archive whenever a
 replica has connected (and thus a physical slot exists), and to logical
 whenever a logical slot exists.



What happens the first time? Meaning I'm on wal_level=minimal and take a
base backup. Then when the replica first connects 10 minutes later, it
needs WAL back in time, which was logged at wal_level=minimal.

So you'd need to bump it up whenever a base backup is done -- but then you
can't drop it back down again or your base backup will be useless.

Or am I missing something?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Proposal: Implement failover on libpq connect level.

2015-08-19 Thread Amit Kapila
On Wed, Aug 19, 2015 at 1:23 PM, Victor Wagner vi...@wagner.pp.ru wrote:

 On 2015.08.19 at 12:55:15 +0530, Amit Kapila wrote:

   I think that failover procedure should begin before first connection
is
   ever established.
  
 
  As far as I understand, failover gets initiated once the master server
goes
  down or is not accessible due to some reason, so for such cases if you
  have the connection to both the servers then it might not work.

 Master server might go down when client is not started yet.
 And when client starts up, it has to find out which server to connect
 now.


Always try with the first server specified in connection string and if that
is not available try with second and so on.  I think for the case of
failover,
the design shouldn't be much complicated and it is a standard thing provided
by most of the client-side drivers in other databases.  Considering what
currently PostgreSQL offers in terms of high-availability functionality, for
load-balancing, we need to be careful of many more things like redirecting
read-queries to standbys, while write statements should be executed via
connection
to master.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Tomas Vondra

Hi,

On 08/20/2015 04:15 AM, Tomas Vondra wrote:

Hello KaiGai-san,

On 08/19/2015 03:19 PM, Kohei KaiGai wrote:

Unless we have no fail-safe mechanism when planner estimated too
large number of tuples than actual needs, a strange estimation will
consume massive amount of RAMs. It's a bad side effect.
My previous patch didn't pay attention to the scenario, so needs to
revise the patch.


I agree we need to put a few more safeguards there (e.g. make sure we
don't overflow INT when counting the buckets, which may happen with the
amounts of work_mem we'll see in the wild soon).

But I think we should not do any extensive changes to how we size the
hashtable - that's not something we should do in a bugfix I think.


Attached are two alternative version of a backpatch. Both limit the 
nbuckets so that it never exceeds MaxAllocSize - effectively 512MB due 
to the doubling (so ~67M buckets on 64-bit architectures).


The only difference is that the 'alternative' patch limits max_pointers

+   /* ensure we don't exceed the maximum allocation size */
+   max_pointers = Min(max_pointers, MaxAllocSize / sizeof(void*));

so it affects both nbuckets and nbatches. That seems a bit more correct, 
but I guess whoever gets this many batches would be grateful even for 
the quick crash.
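
(Just to double-check those numbers: a quick Python calculation, assuming
8-byte pointers; not part of either patch.)

MaxAllocSize = 0x3fffffff                 # 1GB - 1

nbuckets = 1
while nbuckets * 2 * 8 <= MaxAllocSize:   # largest power of two whose pointer
    nbuckets *= 2                         # array still fits under MaxAllocSize

print(nbuckets)          # 67108864  (~67M buckets)
print(nbuckets * 8)      # 536870912 (512MB)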



For master, I think the right patch is what KaiGai-san posted in June. I 
don't think we should really try making it smarter about handling 
overestimates at this point - that's 9.6 stuff IMNSHO.


I find it a bit awkward that we only have MemoryContextAllocHuge and 
repalloc_huge, especially as nodeHash.c needs MemoryContextAllocHuge + 
memset to zero the chunk.


So I think we should extend the memutils API by adding palloc0_huge (and 
possibly palloc_huge, although that's not needed by nodeHash.c).



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6917d3f..e72ac8b 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -475,6 +475,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
 		lbuckets = Min(lbuckets, max_pointers);
+		lbuckets = Min(lbuckets, MaxAllocSize / sizeof(void*));
 		nbuckets = (int) lbuckets;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
@@ -491,6 +492,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 		dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
 		dbuckets = Min(dbuckets, max_pointers);
+		dbuckets = Min(dbuckets, MaxAllocSize / sizeof(void*));
 		nbuckets = (int) dbuckets;
 
 		nbatch = 1;
@@ -500,7 +502,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * Both nbuckets and nbatch must be powers of 2 to make
 	 * ExecHashGetBucketAndBatch fast.  We already fixed nbatch; now inflate
 	 * nbuckets to the next larger power of 2.  We also force nbuckets to not
-	 * be real small, by starting the search at 2^10.  (Note: above we made
+	 * be real small, by starting the search at 2^10, or too large - we allocate
+	 * them as a single chunk, so must fit in MaxAllocSize. (Note: above we made
 	 * sure that nbuckets is not more than INT_MAX / 2, so this loop cannot
 	 * overflow, nor can the final shift to recalculate nbuckets.)
 	 */
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6917d3f..ca90aed 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -465,6 +465,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	max_pointers = (work_mem * 1024L) / sizeof(void *);
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
+	/* ensure we don't exceed the maximum allocation size */
+	max_pointers = Min(max_pointers, MaxAllocSize / sizeof(void*));
 
 	if (inner_rel_bytes > hash_table_bytes)
 	{
@@ -500,7 +502,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * Both nbuckets and nbatch must be powers of 2 to make
 	 * ExecHashGetBucketAndBatch fast.  We already fixed nbatch; now inflate
 	 * nbuckets to the next larger power of 2.  We also force nbuckets to not
-	 * be real small, by starting the search at 2^10.  (Note: above we made
+	 * be real small, by starting the search at 2^10, or too large - we allocate
+	 * them as a single chunk, so must fit in MaxAllocSize. (Note: above we made
 	 * sure that nbuckets is not more than INT_MAX / 2, so this loop cannot
 	 * overflow, nor can the final shift to recalculate nbuckets.)
 	 */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers