Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-10 Thread Abhijit Menon-Sen
At 2015-06-10 13:22:27 -0400, robertmh...@gmail.com wrote:
>
> I'm not clear on which of these options you are voting for:
> 
> (1) include pg_log in pg_basebackup as we do currently
> (2) exclude it
> (3) add a switch controlling whether or not it gets excluded
> 
> I can live with (3), but I bet most people want (2).

Thanks for spelling out the options.

I strongly prefer (2), but I could live with (3) if it were done as a
GUC setting. (And if that's what we decide to do, I'm willing to write
up the patch.)

Whether or not it's a good idea to let one's logfiles grow to >8GB, the
fact that doing so breaks base backups means that being able to exclude
pg_log *somehow* is more of a necessity than a personal preference.

On the other hand, I don't like the idea of doing (3) by adding
command-line arguments to pg_basebackup and a new option to the
underlying command. I don't think that level of "flexibility" is
justified; it would also make it easier to end up with a broken base
backup (by inadvertently excluding more than you meant to).
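
For illustration, a minimal sketch of what skipping pg_log could look
like in basebackup.c's directory walk. The placement, the variable
names, and the _tarWriteHeader() call here are assumptions for the
sketch, not actual code:

    /*
     * Hypothetical: skip the contents of pg_log, but still send the
     * (empty) directory entry itself so the restored cluster starts.
     */
    if (S_ISDIR(statbuf.st_mode) && strcmp(de->d_name, "pg_log") == 0)
    {
        _tarWriteHeader(pathbuf, NULL, &statbuf);
        continue;               /* do not recurse into the directory */
    }

Whether that check is hard-wired (2) or guarded by a GUC (3) is then a
small matter on top.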

-- Abhijit




Re: [HACKERS] 9.5 release notes

2015-06-10 Thread David Rowley
On 11 June 2015 at 16:15, Bruce Momjian  wrote:

> I have committed the first draft of the 9.5 release notes.  You can view
> the output here:
>
> http://momjian.us/pgsql_docs/release-9-5.html
>
>
Thanks Bruce.

Would you also be able to mention something about f15821e and d222585 ?

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] pg_rewind failure by file deletion in source server

2015-06-10 Thread Michael Paquier
On Thu, Jun 11, 2015 at 1:51 AM, Fujii Masao  wrote:
> Shouldn't pg_rewind ignore that failure of operation? If the file is not
> found in source server, the file doesn't need to be copied to destination
> server obviously. So ISTM that pg_rewind can safely skip copying that file.
> Thought?

I think that you should fail. Let's imagine that the master to be
rewound removed a relation file (one present at the last checkpoint
before the fork) before being stopped cleanly after its standby had
been promoted, and that the standby still has that relation file after
promotion. You need to be able to copy it to be able to replay WAL on
it. If the standby has removed a file listed in the file map after the
file map was taken, I guess that the best thing to do is fail, because
a file that should be there for the rewound node cannot be fetched.
The documentation should be made clearer about that, with a better
error message...
-- 
Michael




[HACKERS] Typo in comment in setrefs.c

2015-06-10 Thread Etsuro Fujita
I ran into a typo in a comment in setrefs.c.  Patch attached.

Best regards,
Etsuro Fujita
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index a7f65dd..162a52e 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1452,7 +1452,7 @@ fix_scan_expr_walker(Node *node, fix_scan_expr_context *context)
  *	  subplans, by setting the varnos to OUTER_VAR or INNER_VAR and setting
  *	  attno values to the result domain number of either the corresponding
  *	  outer or inner join tuple item.  Also perform opcode lookup for these
- *	  expressions. and add regclass OIDs to root->glob->relationOids.
+ *	  expressions, and add regclass OIDs to root->glob->relationOids.
  */
 static void
 set_join_references(PlannerInfo *root, Join *join, int rtoffset)



Re: [HACKERS] 9.5 release notes

2015-06-10 Thread Amit Langote
On 2015-06-11 PM 01:15, Bruce Momjian wrote:
> I have committed the first draft of the 9.5 release notes.  You can view
> the output here:
> 
>   http://momjian.us/pgsql_docs/release-9-5.html
>   
> and it will eventually appear here:
> 
>   http://www.postgresql.org/docs/devel/static/release.html
> 
> I am ready to make suggested adjustments, though I am traveling for
> conferences for the next ten days so there might be a delay in my replies.
> 

In the last section E.1.3.11.1. pgbench, there is:

+  
+   
+Add information about buffer pins to pg_buffercache
+display (Andres Freund)
+   
+  

Should it be moved to its own section?

Thanks for working on this!

Regards,
Amit





Re: [HACKERS] Minor improvement to func.sgml

2015-06-10 Thread Etsuro Fujita

On 2015/06/05 6:51, Robert Haas wrote:

On Mon, Jun 1, 2015 at 10:44 PM, Etsuro Fujita
 wrote:

Here is a doc patch to add materialized views and foreign tables to
database objects that pg_table_is_visible() can be used with.


Good catch, as usual.  Committed.


Thanks for picking this up!

Best regards,
Etsuro Fujita




Re: [HACKERS] comment doesn't match code

2015-06-10 Thread Etsuro Fujita

On 2015/06/10 20:18, Robert Haas wrote:

/*
  * ALTER TABLE INHERIT
  *
  * Add a parent to the child's parents. This verifies that all the columns and
  * check constraints of the parent appear in the child and that they have the
  * same data types and expressions.
  */
static void
ATPrepAddInherit(Relation child_rel)
{
 if (child_rel->rd_rel->reloftype)
 ereport(ERROR,
 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
  errmsg("cannot change inheritance of typed table")));
}


I agree with you.

Besides that, ISTM that the error message is a little bit confusing
because I think typed tables cannot inherit.  Maybe I'm missing
something though.  Proposed patch attached.


Best regards,
Etsuro Fujita
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***
*** 9853,9862  ATExecEnableDisableRule(Relation rel, char *trigname,
  
  /*
   * ALTER TABLE INHERIT
-  *
-  * Add a parent to the child's parents. This verifies that all the columns and
-  * check constraints of the parent appear in the child and that they have the
-  * same data types and expressions.
   */
  static void
  ATPrepAddInherit(Relation child_rel)
--- 9853,9858 
***
*** 9864,9873  ATPrepAddInherit(Relation child_rel)
  	if (child_rel->rd_rel->reloftype)
  		ereport(ERROR,
  (errcode(ERRCODE_WRONG_OBJECT_TYPE),
!  errmsg("cannot change inheritance of typed table")));
  }
  
  /*
   * Return the address of the new parent relation.
   */
  static ObjectAddress
--- 9860,9873 
  	if (child_rel->rd_rel->reloftype)
  		ereport(ERROR,
  (errcode(ERRCODE_WRONG_OBJECT_TYPE),
!  errmsg("typed tables cannot inherit")));
  }
  
  /*
+  * Add a parent to the child's parents. This verifies that all the columns and
+  * check constraints of the parent appear in the child and that they have the
+  * same data types and expressions.
+  *
   * Return the address of the new parent relation.
   */
  static ObjectAddress



Re: [HACKERS] 9.5 release notes

2015-06-10 Thread Michael Paquier
On Thu, Jun 11, 2015 at 1:15 PM, Bruce Momjian  wrote:
> I have committed the first draft of the 9.5 release notes.  You can view
> the output here:
>
> http://momjian.us/pgsql_docs/release-9-5.html
>
> and it will eventually appear here:
>
> http://www.postgresql.org/docs/devel/static/release.html
>
> I am ready to make suggested adjustments, though I am traveling for
> conferences for the next ten days so there might be a delay in my replies.

Here are some review comments:
+   
+RETURN WHERE
+   
What is that?

+   
+WHAT IS A STATISTICS SNAPSHOT?
+   
+  
It reports the last time the pgstat global statistics file was written.
Perhaps the documentation should be made clearer.

+   
+The remote snapshot must have been exported by
+pg_export_snapshot() or been defined as a
+logical replication slot.  This can be used by parallel
+pg_dump to use a consistent snapshot across
+pg_dump processes.
+   
Perhaps "or been defined when creating a logical replication slot
through a replication connection".

+  
+   
+Simplify WAL record format (Heikki Linnakangas)
+   
+
+   
+This allows external tools to more easily process WAL
+files.
+   
+  
More precision could be brought here. What the new format actually
allows is to track more easily which blocks are modified for which
relations, something now possible without direct knowledge of the
record type.

+   
+This is particularly helpful for warm standbys.
+   
"for warm standbys to control the timing at which WAL segment files
are retrieved from a WAL archive."

I think that the following items should be added as well:
- Improvement of file version information for Windows builds (Noah
Misch, Michael Paquier), commit ee9569e. The file version information
was missing for a set of contrib modules as well as a handful of
libraries and binaries (like conversion_procs, zic, pg_regress, etc.).
This item should mention that all the binaries and libraries produced
by a Windows build now contain file version information. It could also
be merged with the following item, as both are related:
+   
+Add icons to all MSVC-built binaries (Noah Misch)
+   
- Move pg_lzcompress and pg_lzdecompress to libpgcommon, commit
40bede54. This was some legwork for wal_compression, but external
binary tools can now take advantage of it more freely. Those APIs have
been reworked as well to be more generic, somewhat similarly to the
interface lz4 exposes to the user.
- Addition of palloc_extended() (8c8a886) to give module developers a
fallback plan instead of the OOM ERROR that palloc() throws
unconditionally. MemoryContextAllocExtended() (bd4e2fd9) can similarly
be used on a memory context other than the current one. Feel free to
discard this one if it is not appropriate for the release notes.
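
For readers who haven't seen these APIs, a minimal usage sketch; the
surrounding logic is illustrative, but palloc_extended() and the
MCXT_ALLOC_NO_OOM flag are the real interface:

    /* Ask for memory without the automatic ERROR-on-OOM behavior. */
    char   *buf = palloc_extended(request_size, MCXT_ALLOC_NO_OOM);

    if (buf == NULL)
    {
        /* Allocation failed: degrade gracefully (smaller buffer,
         * spill to disk, etc.) instead of aborting the transaction. */
    }
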
Regards,
-- 
Michael




Re: [HACKERS] 9.5 release notes

2015-06-10 Thread Amit Kapila
On Thu, Jun 11, 2015 at 9:45 AM, Bruce Momjian  wrote:
>
> I have committed the first draft of the 9.5 release notes.  You can view
> the output here:
>
> http://momjian.us/pgsql_docs/release-9-5.html
>

Thanks for writing the Release notes.

Some comments:

Have pg_basebackup use a tablespace mapping file, to allow for file paths
of 100+ characters in length

I think this is not completely correct. This was mainly done to allow
usage of the tar format on Windows when tablespaces are present in the
database, although we eventually did it the same way for both Windows
and Linux.  So how about:

Have pg_basebackup use a tablespace mapping file, to allow usage of tar
format
consistently across all platforms

Also, shall we mention the following in the "Migration to 9.5" section?

"pg_basebackup will not not work in tar mode against 9.4 and older servers,
 as we have introduced a new protocol option in that mode."


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-06-10 Thread Amit Kapila
On Wed, Jun 10, 2015 at 12:09 PM, Fujii Masao  wrote:
>
> On Tue, Jun 9, 2015 at 3:29 PM, Amit Kapila  wrote:
> > On Tue, Jun 9, 2015 at 10:56 AM, Fujii Masao  wrote:
> >> Or what about removing tablespace_map file at the beginning of recovery
> >> whenever backup_label doesn't exist?
> >
> > Yes, that's another way, but is it safe to assume that the user won't
> > need that file?
>
> Is there really a case where tablespace_map is necessary even though
> backup_label doesn't exist? If not, it seems safe to get rid of the file
> when backup_label is not present.
>
> > I mean in the valid scenario (where both backup_label and
> > tablespace_map are present and usable) also, we rename them to
> > *.old rather than deleting it.
>
> Yep, I'm OK to make the recovery rename the file to *.old rather than
> delete it.
>

This sounds safe to me. Unless anybody else has a different opinion,
I will write a patch to fix it this way.
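
A minimal sketch of the behaviour being discussed, at the point where
recovery starts. BACKUP_LABEL_FILE and TABLESPACE_MAP are the file-name
constants from xlog.h; everything else here is illustrative, not the
eventual patch:

    struct stat st;

    /*
     * No backup_label: any leftover tablespace_map is stale, so rename
     * it out of the way (mirroring the *.old treatment in the valid
     * scenario) rather than deleting it.
     */
    if (stat(BACKUP_LABEL_FILE, &st) != 0 && errno == ENOENT &&
        stat(TABLESPACE_MAP, &st) == 0)
    {
        if (rename(TABLESPACE_MAP, TABLESPACE_MAP ".old") != 0)
            ereport(LOG,
                    (errmsg("could not rename file \"%s\" to \"%s.old\": %m",
                            TABLESPACE_MAP, TABLESPACE_MAP)));
    }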


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] 9.5 release notes

2015-06-10 Thread Bruce Momjian
I have committed the first draft of the 9.5 release notes.  You can view
the output here:

http://momjian.us/pgsql_docs/release-9-5.html

and it will eventually appear here:

http://www.postgresql.org/docs/devel/static/release.html

I am ready to make suggested adjustments, though I am traveling for
conferences for the next ten days so there might be a delay in my replies.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




[HACKERS] DBT-3 with SF=20 got failed

2015-06-10 Thread Kouhei Kaigai
Hello,

I got the following error during DBT-3 benchmark with SF=20.

  psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824
  psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824

It looks to me like the Hash node tries to allocate a 1GB area using
palloc0(), but this exceeds the limitation of the non-huge interface.

(gdb) bt
#0  0x7f669d29a989 in raise () from /lib64/libc.so.6
#1  0x7f669d29c098 in abort () from /lib64/libc.so.6
#2  0x0090ccfd in ExceptionalCondition (conditionName=0xb18130 
"!(((Size) (size) <= ((Size) 0x3fff)))",
errorType=0xb17efd "FailedAssertion", fileName=0xb17e40 "mcxt.c", 
lineNumber=856) at assert.c:54
#3  0x0093ad53 in palloc0 (size=1073741824) at mcxt.c:856
#4  0x00673045 in ExecHashTableCreate (node=0x7f669de951f0, 
hashOperators=0x24dbf90, keepNulls=0 '\000') at nodeHash.c:391
#5  0x006752e1 in ExecHashJoin (node=0x24d74e0) at nodeHashjoin.c:169
#6  0x0065abf4 in ExecProcNode (node=0x24d74e0) at execProcnode.c:477
#7  0x00681026 in ExecNestLoop (node=0x24d6668) at nodeNestloop.c:123
#8  0x0065abca in ExecProcNode (node=0x24d6668) at execProcnode.c:469
#9  0x00681026 in ExecNestLoop (node=0x24d61f8) at nodeNestloop.c:123
#10 0x0065abca in ExecProcNode (node=0x24d61f8) at execProcnode.c:469
#11 0x00681026 in ExecNestLoop (node=0x24d5478) at nodeNestloop.c:123
#12 0x0065abca in ExecProcNode (node=0x24d5478) at execProcnode.c:469
#13 0x00681026 in ExecNestLoop (node=0x24d51d0) at nodeNestloop.c:123
#14 0x0065abca in ExecProcNode (node=0x24d51d0) at execProcnode.c:469

The attached patch replaces this palloc0() with MemoryContextAllocHuge()
+ memset(). Indeed, this hash table is built for a relation with
nrows=119994544, so it is not strange for the hash slots themselves to
be larger than 1GB.
Another allocation request, which potentially resets or expands the
hash slots, may also need the "Huge" version of memory allocation, I
think.

Thanks,

Below is the query itself and EXPLAIN result.

dbt3c=# EXPLAIN VERBOSE
dbt3c-# select
dbt3c-# s_name,
dbt3c-# count(*) as numwait
dbt3c-# from
dbt3c-# supplier,
dbt3c-# lineitem l1,
dbt3c-# orders,
dbt3c-# nation
dbt3c-# where
dbt3c-# s_suppkey = l1.l_suppkey
dbt3c-# and o_orderkey = l1.l_orderkey
dbt3c-# and o_orderstatus = 'F'
dbt3c-# and l1.l_receiptdate > l1.l_commitdate
dbt3c-# and exists (
dbt3c(# select
dbt3c(# *
dbt3c(# from
dbt3c(# lineitem l2
dbt3c(# where
dbt3c(# l2.l_orderkey = l1.l_orderkey
dbt3c(# and l2.l_suppkey <> l1.l_suppkey
dbt3c(# )
dbt3c-# and not exists (
dbt3c(# select
dbt3c(# *
dbt3c(# from
dbt3c(# lineitem l3
dbt3c(# where
dbt3c(# l3.l_orderkey = l1.l_orderkey
dbt3c(# and l3.l_suppkey <> l1.l_suppkey
dbt3c(# and l3.l_receiptdate > l3.l_commitdate
dbt3c(# )
dbt3c-# and s_nationkey = n_nationkey
dbt3c-# and n_name = 'UNITED KINGDOM'
dbt3c-# group by
dbt3c-# s_name
dbt3c-# order by
dbt3c-# numwait desc,
dbt3c-# s_name
dbt3c-# LIMIT 100;

                                        QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit  (cost=6792765.24..6792765.24 rows=1 width=26)
   Output: supplier.s_name, (count(*))
   ->  Sort  (cost=6792765.24..6792765.24 rows=1 width=26)
         Output: supplier.s_name, (count(*))
         Sort Key: (count(*)) DESC, supplier.s_name
         ->  HashAggregate  (cost=6792765.22..6792765.23 rows=1 width=26)
               Output: supplier.s_name, count(*)
               Group Key: supplier.s_name
               ->  Nested Loop Anti Join  (cost=4831094.94..6792765.21 rows=1 width=26)
                     Output: supplier.s_name
                     ->  Nested Loop  (cost=4831094.37..6792737.52 rows=1 width=34)
                           Output: supplier.s_name, l1.l_suppkey, l1.l_orderkey
                           Join Filter: (supplier.s_nationkey = nation.n_nationkey)
                           ->  Nested Loop  (cost=4831094.37..6792736.19 rows=1 width=38)
                                 Output: supplier.s_name, supplier.s_nationkey, l1.l_suppkey, l1.l_orderkey
                                 ->  Nested Loop  (cost=4831093.81..6792728.20 rows=1 width=42)

Re: [HACKERS] Draft release notes for 9.4.4 et al

2015-06-10 Thread Josh Berkus
On 06/09/2015 05:17 PM, Michael Paquier wrote:
> On Wed, Jun 10, 2015 at 8:41 AM, Josh Berkus  wrote:
>> On 06/09/2015 04:38 PM, Michael Paquier wrote:
>>> On Wed, Jun 10, 2015 at 8:31 AM, Josh Berkus  wrote:
>>>> Tom, all:
>>>>
>>>> First draft of the release announcement.
>>>>
>>>> Please improve/edit/correct.  Thanks!

So, anyone else?  Particularly, have I summed up the fixed multixact
issue and how it affects 9.3 users correctly?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-10 Thread Noah Misch
On Mon, Jun 08, 2015 at 03:15:04PM +0200, Andres Freund wrote:
> One more thing:
> Our testing infrastructure sucks. Without writing C code it's basically
> impossible to test wraparounds and such. Even if not particularly useful
> for non-devs, I really think we should have functions for
> creating/burning xids/multixacts in core. Or at least in some extension.

+1.  This keeps coming up, so it's worth maintaining a verified and speedy
implementation.




Re: [HACKERS] minor issues in pg_rewind

2015-06-10 Thread Michael Paquier
On Thu, Jun 11, 2015 at 2:38 AM, Fujii Masao  wrote:
> * Remove invalid option character "N" from the third argument (valid option
> string) of getopt_long().
> * Use pg_free() or pfree() to free the memory allocated by pg_malloc() or
> palloc() instead of always using free().
> * Assume problem is no disk space if write() fails but doesn't set errno.
> * Fix several typos.

This looks good to me, especially the ENOSPC enforcement for
pg_control. We may want to rename datapagemap_iterate to
datapagemap_init as well; I can't really see what "iterate" means in
this context.
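
For context, the usage pattern in question, as it appears with the
datapagemap API in the patch (a sketch only; the fetch call stands in
for whatever the caller does per block):

    datapagemap_iterator_t *iter = datapagemap_iterate(&pagemap);
    BlockNumber blkno;

    /* return each recorded block until datapagemap_next() is false */
    while (datapagemap_next(iter, &blkno))
        fetch_file_range(path, (off_t) blkno * BLCKSZ,
                         (off_t) (blkno + 1) * BLCKSZ);
    pg_free(iter);
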
Regards,
-- 
Michael




Re: [HACKERS] [idea] table partition + hash join

2015-06-10 Thread Amit Langote
On Wed, Jun 10, 2015 at 8:48 PM, Rosiński Krzysztof 2 - Detal
 wrote:
> How to use this optimization ?
>
>
>
> select *
>
> from table join partitioned_table on (
>
>   table.part_id = partitioned_table.id
>
>   and hash_func_mod(table.part_id) = hash_func_mod(partitioned_table.id)
>
> )
>

If I read the proposed idea correctly, you wouldn't need to add that
second condition.

Thanks,
Amit




Re: [HACKERS] error message diff with Perl 5.22.0

2015-06-10 Thread Peter Eisentraut
On 6/6/15 10:32 PM, Alvaro Herrera wrote:
> Peter Eisentraut wrote:
>> With the recently released Perl 5.22.0, the tests fail thus:
>>
>> -ERROR:  Global symbol "$global" requires explicit package name at line 3.
>> -Global symbol "$other_global" requires explicit package name at line 4.
>> +ERROR:  Global symbol "$global" requires explicit package name (did you 
>> forget to declare "my $global"?) at line 3.
>> +Global symbol "$other_global" requires explicit package name (did you 
>> forget to declare "my $other_global"?) at line 4.
>>  CONTEXT:  compilation of PL/Perl function "uses_global"
>>
>>
>> With PL/Python, this happens for just about every other release, and we 
>> usually add another expected file.  I don't see anything like that for 
>> PL/Perl yet.  Should we add a new expected file, or is there a different 
>> preferred solution?
> 
> How many .sql files does this affect?  Alternate expected output is
> bothersome; if more than one test file is affected, I think the best is
> to isolate the cases where this appears to a single .sql file, as short
> as possible, so that we don't have to touch it for anything else, and so
> that we don't have to touch the isolated file except for similar
> changes.

It's only one file.





Re: [HACKERS] jsonb - path

2015-06-10 Thread Andrew Dunstan


On 06/10/2015 06:08 PM, Josh Berkus wrote:

WFM.  So the idea is that if json_pointer is implemented as a type, then
we'll have an operator for "jsonb - json_pointer"?




Right.

cheers

andrew




[HACKERS] Refactoring speculative insertion with unique indexes a little

2015-06-10 Thread Peter Geoghegan
Currently, speculative insertion (the INSERT ... ON CONFLICT DO UPDATE
executor/storage infrastructure) uses checkUnique ==
UNIQUE_CHECK_PARTIAL for unique indexes, which is a constant
originally only used by deferred unique constraints. It occurred to me
that this has a number of disadvantages:

* It confuses separation of concerns. Pushing down this information to
the nbtree AM makes it clear why it's slightly special from a
speculative insertion point of view. For example, the nbtree AM does
not actually require "livelock insurance" (for unique indexes,
although in principle not for nbtree-based exclusion constraints,
which are possible).

* UNIQUE_CHECK_PARTIAL is not only not the same thing as
UNIQUE_CHECK_SPECULATIVE (a new constant for the enum; see the sketch
after this list). It's also naturally mutually exclusive with it (since
we do not and cannot support deferred unique constraints as arbiters).
Let's represent this directly.

* It makes a conflict not detected by the pre-check always insert an
index tuple, even though that occurs after a point where it's already
been established that the pointed-to TID is doomed -- it must go on to
be super deleted. Why bother bloating the index?
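
As a sketch, the enum would end up looking something like this. The
existing constants are from access/genam.h; the placement and comment
wording for the new one are assumptions based on the description above:

    typedef enum IndexUniqueCheck
    {
        UNIQUE_CHECK_NO,          /* don't do any uniqueness checking */
        UNIQUE_CHECK_YES,         /* enforce uniqueness at insertion time */
        UNIQUE_CHECK_PARTIAL,     /* test uniqueness, but no error
                                   * (deferrable unique constraints) */
        UNIQUE_CHECK_SPECULATIVE, /* as PARTIAL, but skip the physical
                                   * index tuple insertion when a
                                   * conflict is detected */
        UNIQUE_CHECK_EXISTING     /* check if existing tuple is unique */
    } IndexUniqueCheck;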


I'm actually not really motivated by wanting to reduce bloat here
(that was always something that I thought was a non-issue with *any*
implemented speculative insertion prototype [1]). Rather, by actually
physically inserting an index tuple unnecessarily, the implication is
that it makes sense to do so (perhaps for roughly the same reason it
makes sense with deferred unique constraints, or some other
complicated and subtle reason). AFAICT that implication is incorrect,
though; I see no reason why inserting that index tuple serves any
purpose, and it clearly *can* be avoided with little effort.

Attached patch updates speculative insertion along these lines.

In passing, I have made ExecInsertIndexTuples() give up when a
speculative insertion conflict is detected. Again, this is not about
bloat prevention; it's about making the code easier to understand by
not doing something that is unnecessary, and in avoiding that also
avoiding the implication that it is necessary. There are already
enough complicated interactions that *are* necessary (e.g. "livelock
avoidance" for exclusion constraints). Let us make our intent clearer.

The patch also updates the AM interface documentation (the part
covering unique indexes). It seems quite natural to me to document the
theory of operation for speculative insertion there.

Thoughts?

[1] 
https://wiki.postgresql.org/wiki/Value_locking#.232._.22Promise.22_heap_tuples_.28Heikki_Linnakangas.29
-- 
Peter Geoghegan
From 693dc7b53d311b6a739fea0eea9c2f7147673caf Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 2 Jun 2015 17:34:16 -0700
Subject: [PATCH] Refactor speculative insertion with unique indexes

Add a dedicated IndexUniqueCheck constant that relates to the
speculative insertion case:  UNIQUE_CHECK_SPECULATIVE.  This allows
btinsert() (and, in principle, any amcanunique aminsert function) to
avoid physically inserting an IndexTuple in the event of detecting a
conflict during speculative insertion's second phase.  With btree, this
avoidance can occur at the critical point in _bt_doinsert() immediately
after conclusively establishing that there is a conflict, but
immediately before actually calling _bt_insertonpg() to proceed with
physical IndexTuple insertion.

It only makes sense to soldier on in the hope that the conflicts can
later be resolved for deferrable unique constraints -- speculative
inserters have no chance of working out a way to proceed without first
deleting the would-be-pointed-to heap tuple already physically inserted.

This isn't really about bloat, which seems unlikely to be a significant
problem with speculative insertion in practice.  Rather, it's about
bringing clarity to how btree (or future amcanunique AMs) handle
speculative insertion.  It seems quite natural to avoid a patently
unnecessary unique index physical IndexTuple insertion in the rare case
where that can currently happen.  Structuring the code so that the btree
access method has some knowledge of speculative insertion callers feels
like a good way of representing its contract with the executor for
speculative insertion.  Besides, it was inaccurate to assert that
UNIQUE_CHECK_PARTIAL only concerned deferrable unique indexes, and never
speculative insertion; the AM interface documentation and various
comments now recognize speculative insertion callers.  Allowing the
access method documentation to falsely claim that UNIQUE_CHECK_PARTIAL
only concerned deferrable constraints could conceivably cause bugs in
third party access method code (back when UNIQUE_CHECK_PARTIAL *did*
include speculative insertion, which, of course, is no longer the case
with this commit).  The access method documentation has been changed to
clarify matters, including describing the new and slightly distinct
UNIQUE_CHECK_SPECULATIVE case.
---
 doc/

Re: [HACKERS] pg_xlog -> pg_xjournal?

2015-06-10 Thread Jeff Janes
On Tue, Jun 2, 2015 at 2:45 AM, Mark Kirkwood  wrote:

> On 01/06/15 05:29, Joel Jacobson wrote:
>
>> While anyone who is familiar with postgres would never do something as
>> stupid as to delete pg_xlog,
>> according to Google, there appears to be a fair amount of end-users out
>> there having made the irrevocable mistake of deleting the precious
>> directory,
>> a decision made on the assumption that since "it has *log* in the name
>> so it must be unimportant"
>> (
>> http://stackoverflow.com/questions/12897429/what-does-pg-resetxlog-do-and-how-does-it-work
>> ).
>>
>> If we could turn back time, would we have picked "pg_xlog" as the most
>> optimal name for this important directory, or would we have come up with
>> a more user-friendly name?
>>
>> Personally, I have never had any problems with pg_xlog, but I realize
>> there are quite a few unlucky new users who end up in trouble.
>>
>> My suggestion is to use "pg_xjournal" instead of "pg_xlog" when new
>> users create a new data directory using initdb, and allow for both
>> directories to exist (exclusive or, i.e. either one or the other, but
>> not both). That way we don't complicate the life for any existing users,
>> all their tools will continue to work who rely on pg_xlog to be named
>> pg_xlog, but only force new users to do a bit of googling when they
>> can't use whatever tool that can't find pg_xlog. When they find out it's
>> an important directory, they can simply create a symlink and their old
>> not yet updated tool will work again.
>>
>> Thoughts?
>>
>>
> +1
>
> Strongly agree - I have had people on my dba course ask about deleting
> these pesky 'log' directories (obvious confusion/conflation with pg_log
> ...)! A change of name would help reduce the temptation!
>
>
Why is it named pg_log by default anyway, while base and global are not
named pg_base (or pg_default) and pg_global?

If we are going to break things in some release, maybe we should rename
them all to have a little more rhyme or reason to them.  Or is there
already a rhyme or reason I am overlooking?

I would think all the config and human-readable log files/directories
should have one prefix (or absence of prefix), and all the internal
files/directories with no user serviceable parts should have a different
one.

Cheers,

Jeff


Re: [HACKERS] jsonb - path

2015-06-10 Thread Josh Berkus
On 06/10/2015 12:00 PM, Andrew Dunstan wrote:
> We need to remove the ambiguity with jsonb_delete() by renaming the
> variant that takes a text[] (meaning a path) as the second argument to
> jsonb_delete_path. That seems uncontroversial.

Speaking as a user ... works for me.

> We need to rename the corresponding operator from'-' to, say, '-#', or
> alternatively remove it. The question is which.

Rename, I think.

> Future plans that might affect this issue: possible implementations of
> Json Pointer (rfc 6901), Json Patch (rfc 6902) and Json Merge Patch (rfc
> 7396). The last one is on this list for completeness - it seems to me a
> lot less useful than the others, but I included it because Peter felt
> strongly about the lack of recursive merge. Json Patch could probably
> stand on its own once we have Json Pointer, so that's really the thing
> we need to talk about. Underneath the hood, I think we could make
> json_pointer be simply an array of text. If we did that, we could make
> an implicit cast from text[] to it,  and we could also have the input
> routine recognize an input string beginning with '{' and parse it
> directly as an array of text, since a standard json pointer expression
> has to begin with '/' unless it's completely empty. Given all of that, I
> think, fingers crossed, it should be fairly safe to change the signature
> of all the functions and operators that currently take text[] as their
> path parameter to take a json_pointer instead without causing too much
> grief.
> 
> Proceeding from that, I'm rather inclined to say that the answer is to
> rename the operator rather than remove it, and that's what I'm going to
> do unless there's a groundswell that says no.

WFM.  So the idea is that if json_pointer is implemented as a type, then
we'll have an operator for "jsonb - json_pointer"?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Is it possible to have a "fast-write" Index?

2015-06-10 Thread Claudio Freire
On Wed, Jun 10, 2015 at 6:01 PM, deavid  wrote:
> By now, my results were a bit disappointing (comparing gin_btree against
> a regular btree for a column with very low cardinality):
> - create index and updates: about 10-20% faster (I had a primary key, so
> btree unique checks may be blurring the results here)

That could be the effect of GIN's buffering (let's call it LSM? it's
similar), so doing the same with pure btrees could get a similar speedup.

On Wed, Jun 10, 2015 at 6:01 PM, deavid  wrote:
> What I've found is that I was wrong about fillfactor. (Maybe something has
> changed here since PostgreSQL 8.1.) I believed a fillfactor lower than 80
> would do more harm than good; at least that was the case 5 years ago. Now I
> could get a noticeable speedup with fillfactor=50 in the case of updating
> the whole table.

8.1 didn't have HOT. I'd bet it's that.




Re: [HACKERS] Is it possible to have a "fast-write" Index?

2015-06-10 Thread deavid
Hi again. I tried to do some tests on my office computer, but after
spending 2-3 hours I gave up. I'm going to need a real SSD disk to try
these things. 100k rows of my "delivery notes" table use 100MB of disk,
and 2GB of RAM may not be enough to emulate fast IO. (I was disabling
fsync, activating write caches, etc.)

I downloaded PostgreSQL 9.5-dev from git sources, compiled everything,
and restored two client databases there (10GB each one). My goal was to
test gin_btree and brin indexes as well, but I gave up before doing a
complete test of gin.

Now I have a better plan: I'm going to use my laptop (Intel i5, 4GB of
RAM) and I will put in a spare SSD I wasn't using (OCZ Agility 2 120GB).
Hopefully this time I can get some figures closer to production.

By now, my results were a bit disappointing (comparing gin_btree against
a regular btree for a column with very low cardinality):
- create index and updates: about 10-20% faster (I had a primary key, so
btree unique checks may be blurring the results here)
- selects: about 2-5 times slower
- index size: about 2 times smaller

What I've found is that I was wrong about fillfactor. (Maybe something
has changed here since PostgreSQL 8.1.) I believed a fillfactor lower
than 80 would do more harm than good; at least that was the case 5 years
ago. Now I could get a noticeable speedup with fillfactor=50 in the case
of updating the whole table.

Hope I can set up this laptop soon and get those tests done.

On Sat, Jun 6, 2015 at 13:07, k...@rice.edu () wrote:

> On Fri, Jun 05, 2015 at 11:54:01PM +, deavid wrote:
> > Thanks to everybody for answering. I wasn't expecting this attention;
> > this is a great community :-)
> >
> > Jim asked me about something real. Well, the problem is this showed up
> > more than five years ago, and keeps popping up from time to time since,
> > in different circumstances. I solved it in different ways each time,
> > depending on the exact use-case. I wanted to generalize, because it
> > seems a good feature for several situations; and I don't expect a
> > solution for me, as each time I hit this I found some way to sort it
> > out.
> > As Jim said, what we need here are figures for real examples, and I
> > don't have any yet. I'll do my "homework" and email back with exact
> > problems and exact timings. Give me a week or two.
> >
> > Also, some of you are talking about IO. Well, it's hard to say without
> > the figures here, but I'm pretty sure I'm hitting CPU time only. We use
> > SSDs on those big databases, and also in my tests I tried setting
> > fsync=off.
> >
> > So the problem is: I see low iowait, and CPU time for one core is at
> > 80-90% most of the time. I can buy more RAM, better disks, or CPUs with
> > more cores. But one CPU core would have more-or-less the same speed no
> > matter how much money you invest.
> >
> > When someone wants a delayed-write index, it is similar to setting
> > "synchronous_commit = off". We want to give an OK to the backend as
> > soon as possible and do this work in the background. But we also want
> > some reliability against crashes.
> >
> > Also, if the task is done in the background it may be done from another
> > backend, so probably several indexes could be packed at once using
> > different backend processes. We could use the entire CPU if our index
> > writes aren't tied to the session that wrote the row.
> >
> > PS: I'm very interested in existing approaches like GIN or BRIN (this
> > one is new to me). Thanks a lot; I'll try them in my tests.
>
> Hi David,
>
> Here is an interesting read comparing LSM and Fractal Tree indexing:
>
>
> http://highscalability.com/blog/2014/8/6/tokutek-white-paper-a-comparison-of-log-structured-merge-lsm.html
>
> Regards,
> Ken
>


Re: [HACKERS] replication slot restart_lsn initialization

2015-06-10 Thread Gurjeet Singh
On Wed, Jun 10, 2015 at 9:10 AM, Gurjeet Singh  wrote:

>
> I am in the process of writing up a doc patch, and will submit that as
> well in a short while.
>

Please find attached the patch with the doc update.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


physical_repl_slot_activate_restart_lsn.v3.patch
Description: Binary data



Re: [HACKERS] Further issues with jsonb semantics, documentation

2015-06-10 Thread Peter Geoghegan
On Wed, Jun 10, 2015 at 11:48 AM, Andrew Dunstan  wrote:
> Sorry for the delay on this. I've been mostly off the grid, having an all
> too rare visit from Tom "Mr Enum" Dunstan, and I misunderstood what you were
> suggesting,

Thank you for working with me to address this. I've been busy with
other things the past few days too.

> Please submit a patch to adjust the treatment of negative integers in the
> old functions to be consistent with their treatment in the new functions.
> i.e. in the range [-n,-1] they should refer to the corresponding element
> counting from the right.

I've almost finished that patch. I'm currently blocked on deciding
what to do about the old path-orientated operators (#> and #>> for
json and jsonb types). It's rather painful to support pushing down
negative subscripting there -- maybe we should just not do so for
those variants, especially given that they're already notationally
inconsistent with the other operators that I'll be updating. What do
you think?

Maybe I'll come up with a simpler way of making that work by taking a
fresh look at it, but haven't done that yet.

My current draft approach to making subscripting work with the json
variants (not the jsonb variants) is to use a second get_worker() call
in the event of a negative subscript, while making the first such call
(the existing get_worker() call) establish the number of top-level
array elements. That isn't beautiful, and involves some redundant
work, but it's probably better than messing with get_worker() in a
more invasive way. Besides, this second get_worker() call only needs
to happen in the event of a negative subscript, and I'm only really
updating this (that is, updating routines like json_array_element())
to preserve consistency with jsonb. What do you think of that idea?
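
For reference, the negative-subscript semantics being implemented are
just the usual count-from-the-right normalization; a sketch with
illustrative variable names, not the draft patch itself:

    /* Map a subscript in [-nelements, -1] onto [0, nelements - 1];
     * anything still out of range yields NULL, just like a too-large
     * positive subscript. */
    if (element < 0)
        element += nelements;
    if (element < 0 || element >= nelements)
        PG_RETURN_NULL();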

-- 
Peter Geoghegan




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Robert Haas
On Wed, Jun 10, 2015 at 1:58 PM, Andres Freund  wrote:
>> Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
>> try your patch there, bound to varying numbers of sockets.
>
> It'd be a significant amount of work to rebase it ontop current HEAD. I
> guess the easiest thing would be to try an older version of the patch
> with the xadd in place, and use a tree from back then.

We could do that, I guess.  But we've made other significant
improvements in the meantime, so...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] jsonb - path

2015-06-10 Thread Tom Lane
Andrew Dunstan  writes:
> Future plans that might affect this issue: possible implementations of 
> Json Pointer (rfc 6901), Json Patch (rfc 6902) and Json Merge Patch (rfc 
> 7396). The last one is on this list for completeness - it seems to me a 
> lot less useful than the others, but I included it because Peter felt 
> strongly about the lack of recursive merge. Json Patch could probably 
> stand on its owm once we have Json Pointer, so that's really the thing 
> we need to talk about. Undeneath the hood, I think we could make 
> json_pointer be simply an array of text. If we did that, we could make 
> an implicit cast from text[] to it,

Implicit cross-type-category casts are almost always a bad idea.
Usually you don't find out how bad until after you've shipped it.

It's possible this would be okay, given the probable limited usage
of json_pointer.  But I think it's darn risky to make decisions now
that are contingent on the assumption that such a cast won't have
untenable side-effects.

In particular, such a cast would create substantial hazards for any
operator that's named similarly to one of the json(b) operators and could
take a text array on the right.  While there don't seem to be any of those
at the moment (at least not in core), this is certainly another good
reason not to use "-" as the jsonb_delete operator.

regards, tom lane




Re: [HACKERS] jsonb - path

2015-06-10 Thread Petr Jelinek
On Wed, Jun 10, 2015 at 9:00 , Andrew Dunstan  
wrote:
This is an attempt to summarize What I think is now the lone 
outstanding jsonb issue.


We need to remove the ambiguity with jsonb_delete() by renaming the 
variant that takes a text[] (meaning a path) as the second argument 
to jsonb_delete_path. That seems uncontroversial.


We need to rename the corresponding operator from'-' to, say, '-#', 
or alternatively remove it. The question is which.


Future plans that might affect this issue: possible implementations 
of Json Pointer (rfc 6901), Json Patch (rfc 6902) and Json Merge 
Patch (rfc 7396). The last one is on this list for completeness - it 
seems to me a lot less useful than the others, but I included it 
because Peter felt strongly about the lack of recursive merge. Json 
Patch could probably stand on its own once we have Json Pointer, so 
that's really the thing we need to talk about. Underneath the hood, I 
think we could make json_pointer be simply an array of text. If we 
did that, we could make an implicit cast from text[] to it,  and we 
could also have the input routine recognize an input string beginning 
with '{' and parse it directly as an array of text, since a standard 
json pointer expression has to begin with '/' unless it's completely 
empty. Given all of that, I think, fingers crossed, it should be 
fairly safe to change the signature of all the functions and 
operators that currently take text[] as their path parameter to take 
a json_pointer instead without causing too much grief.


Hmm, so our implementation of json pointer would be slightly
non-standard, as it would support an alternative input syntax. This
does not make me thrilled, but since we can't really make it work any
other way, I guess it's a pragmatic solution...



Proceeding from that, I'm rather inclined to say that the answer is 
to rename the operator rather than remove it, and that's what I'm 
going to do unless there's a groundswell that says no.


+1 for renaming


--
Petr Jelinek  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services





[HACKERS] jsonb - path

2015-06-10 Thread Andrew Dunstan
This is an attempt to summarize What I think is now the lone outstanding 
jsonb issue.


We need to remove the ambiguity with jsonb_delete() by renaming the 
variant that takes a text[] (meaning a path) as the second argument to 
jsonb_delete_path. That seems uncontroversial.


We need to rename the corresponding operator from'-' to, say, '-#', or 
alternatively remove it. The question is which.


Future plans that might affect this issue: possible implementations of 
Json Pointer (rfc 6901), Json Patch (rfc 6902) and Json Merge Patch (rfc 
7396). The last one is on this list for completeness - it seems to me a 
lot less useful than the others, but I included it because Peter felt 
strongly about the lack of recursive merge. Json Patch could probably 
stand on its owm once we have Json Pointer, so that's really the thing 
we need to talk about. Undeneath the hood, I think we could make 
json_pointer be simply an array of text. If we did that, we could make 
an implicit cast from text[] to it,  and we could also have the input 
routine recognize an input string beginning with '{' and parse it 
directly as an array of text, since a standard json pointer expression 
has to being with '/' unless it's completely empty. Given all of that, I 
think, fingers crossed, it should be fairly safe to change the signature 
of all the functions and operators that currently take text[] as their 
path parameter to take a json_pointer instead without causing too much 
grief.


Proceeding from that, I'm rather inclined to say that the answer is to 
rename the operator rather than remove it, and that's what I'm going to 
do unless there's a groundswell that says no.


We should also, in 9.6, provide an operator or function that removes all
the named top-level keys. That operator's right-hand argument would not
be a json pointer, but an array of key names.


cheers

andrew





Re: [HACKERS] Further issues with jsonb semantics, documentation

2015-06-10 Thread Andrew Dunstan


On 06/05/2015 01:51 PM, Andrew Dunstan wrote:
>
> On 06/05/2015 01:39 PM, Peter Geoghegan wrote:
>> On Thu, Jun 4, 2015 at 12:10 PM, Peter Geoghegan  wrote:
>>>> But I agree that it's not a great contribution to science, especially
>>>> since the index will be applied to the list of elements in the somewhat
>>>> counter-intuitive storage order we use, and we could just raise an
>>>> error if we try to apply integer delete to an object instead of an
>>>> array.
>>> Cool. Do you want to write a patch, or should I?
>>>
>>> Also, what about negative array subscripting (making the 9.4-era
>>> "operator jsonb -> integer" operator support that for consistency with
>>> the new "operator jsonb - integer" operator)? Should I write the
>>> patch? Will you commit it if I do?
>> Please let me know if you want me to write these two patches.
>
> Send the first one, I'm still thinking about the second one.




Sorry for the delay on this. I've been mostly off the grid, having an 
all too rare visit from Tom "Mr Enum" Dunstan, and I misunderstood what 
you were suggesting,


Please submit a patch to adjust the treatment of negative integers in 
the old functions to be consistent with their treatment in the new 
functions. i.e. in the range [-n,-1] they should refer to the 
corresponding element counting from the right.


cheers

andrew




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Andres Freund
On 2015-06-10 13:52:14 -0400, Robert Haas wrote:
> On Wed, Jun 10, 2015 at 1:39 PM, Andres Freund  wrote:
> > Well, not necessarily. If you can write your algorithm in a way that
> > xadd etc are used, instead of a lock cmpxchg, you're actually never
> > spinning on x86 as it's guaranteed to succeed.  I think that's possible
> > for most of the places that currently lock buffer headers.
> 
> Well, it will succeed by looping inside the instruction, I suppose.  But OK.

On x86 atomic ops hold the bus lock for a short while - that's why
they're that expensive - and in that case you can directly do useful
work (xadd) or just return after reading the current value if there was
a concurrent change (cmpxchg). Afaik there's a more fundamental
difference than one variant just doing the retry in microcode. It's hard
to get definitive answers to that.

> > (I had a version of the lwlock changes that used xadd for shared lock
> > acquisition - but the code needed to back out in error cases made things
> > more complicated, and the benefit on a four socket machine wasn't that
> > large)
> 
> Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
> try your patch there, bound to varying numbers of sockets.

It'd be a significant amount of work to rebase it ontop current HEAD. I
guess the easiest thing would be to try an older version of the patch
with the xadd in place, and use a tree from back then.

Greetings,

Andres Freund




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Robert Haas
On Wed, Jun 10, 2015 at 1:39 PM, Andres Freund  wrote:
> In the uncontended case lwlocks are just as fast as spinlocks now, with
> the exception of the local tracking array. They're faster if there's
> differences with read/write lockers.

If nothing else, the spinlock calls are inline, while the lwlock calls
involve a function call.  So in the uncontended case I think the
spinlock stuff is like 2 instructions, an atomic and a branch.
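
Roughly like the following sketch of an x86 test-and-set, similar in
spirit to what s_lock.h generates (illustrative, not the exact source):

    /* Atomically swap 1 into the lock byte; the old value tells us
     * whether somebody else already held it. */
    static __inline__ int
    tas(volatile char *lock)
    {
        char        res = 1;

        __asm__ __volatile__(
            "   lock            \n"
            "   xchgb   %0,%1   \n"
            : "+q"(res), "+m"(*lock)
            :
            : "memory", "cc");
        return (int) res;       /* nonzero => lock was already taken */
    }

The uncontended fast path is then just the xchg plus a branch on the
result.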

>> I'm not altogether convinced that a simple replacement of spinlocks
>> with atomics is going to solve this problem.  If it does, great.  But
>> that doesn't eliminate spinning; it just moves it from the spinlock
>> loop to a CAS loop.
>
> Well, not necessarily. If you can write your algorithm in a way that
> xadd etc are used, instead of a lock cmpxchg, you're actually never
> spinning on x86 as it's guaranteed to succeed.  I think that's possible
> for most of the places that currently lock buffer headers.

Well, it will succeed by looping inside the instruction, I suppose.  But OK.

> (I had a version of the lwlock changes that used xadd for shared lock
> acquisition - but the code needed to back out in error cases made things
> more complicated, and the benefit on a four socket machine wasn't that
> large)

Now that we (EnterpriseDB) have this 8-socket machine, maybe we could
try your patch there, bound to varying numbers of sockets.

>> This is clearly better: when the CAS works,
>> you're done, as opposed to having to then manipulate the cache line
>> further and release the spinlock, during any of which the cache line
>> may get stolen from you.
>
> It's not just that, it's also that there's no chance of sleeping while
> holding a lock. Which doesn't happen that infrequently in contended, cpu
> bound workloads.

That's a really good point.

>> However, I'm worried that it will still be possible to create the same
>> kinds of performance collapses that we see with spinlocks with the CAS
>> loops with which we place them - or even worse problems, because
>> clearly the spin_delay stuff *works*, and we're unlikely to
>> incorporate similarly sophisticated logic into every CAS loop.
>
> I doubt it's really neccessary, but It shouldn't be too hard to
> generalize the sleeping logic so it could be used with little code in
> the individual callsites.

We should probably experiment with it to see whether it is needed.  I
am fine with leaving it out if it isn't, but it's worth testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] minor issues in pg_rewind

2015-06-10 Thread Fujii Masao
Hi,

Attached patch fixes the minor issues in pg_rewind. The fixes are

* Remove invalid option character "N" from the third argument (valid option
string) of getopt_long().

* Use pg_free() or pfree() to free the memory allocated by pg_malloc() or
palloc() instead of always using free().

* Assume problem is no disk space if write() fails but doesn't set errno.

* Fix several typos.

Regards,

-- 
Fujii Masao
diff --git a/src/bin/pg_rewind/copy_fetch.c b/src/bin/pg_rewind/copy_fetch.c
index 1ca00d1..991e348 100644
--- a/src/bin/pg_rewind/copy_fetch.c
+++ b/src/bin/pg_rewind/copy_fetch.c
@@ -43,7 +43,7 @@ traverse_datadir(const char *datadir, process_file_callback_t callback)
 /*
  * recursive part of traverse_datadir
  *
- * parent_path is the current subdirectory's path relative to datadir,
+ * parentpath is the current subdirectory's path relative to datadir,
  * or NULL at the top level.
  */
 static void
@@ -262,5 +262,5 @@ execute_pagemap(datapagemap_t *pagemap, const char *path)
 		copy_file_range(path, offset, offset + BLCKSZ, false);
 		/* Ok, this block has now been copied from new data dir to old */
 	}
-	free(iter);
+	pg_free(iter);
 }
diff --git a/src/bin/pg_rewind/datapagemap.c b/src/bin/pg_rewind/datapagemap.c
index 3477366..ff10bc9 100644
--- a/src/bin/pg_rewind/datapagemap.c
+++ b/src/bin/pg_rewind/datapagemap.c
@@ -67,7 +67,7 @@ datapagemap_add(datapagemap_t *map, BlockNumber blkno)
  * Start iterating through all entries in the page map.
  *
  * After datapagemap_iterate, call datapagemap_next to return the entries,
- * until it returns NULL. After you're done, use free() to destroy the
+ * until it returns false. After you're done, use pg_free() to destroy the
  * iterator.
  */
 datapagemap_iterator_t *
@@ -122,5 +122,5 @@ datapagemap_print(datapagemap_t *map)
 	while (datapagemap_next(iter, &blocknum))
 		printf("  blk %u\n", blocknum);
 
-	free(iter);
+	pg_free(iter);
 }
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 589a01a..d6a743f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -105,10 +105,16 @@ write_target_range(char *buf, off_t begin, size_t size)
 	{
 		int			writelen;
 
+		errno = 0;
 		writelen = write(dstfd, p, writeleft);
 		if (writelen < 0)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+errno = ENOSPC;
 			pg_fatal("could not write file \"%s\": %s\n",
 	 dstpath, strerror(errno));
+		}
 
 		p += writelen;
 		writeleft -= writelen;
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index cb2bf4d..3821e9c 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -114,7 +114,7 @@ process_source_file(const char *path, file_type_t type, size_t newsize,
 		case FILE_TYPE_DIRECTORY:
 			if (exists && !S_ISDIR(statbuf.st_mode))
 			{
-/* it's a directory in target, but not in source. Strange.. */
+/* it's a directory in source, but not in target. Strange.. */
 pg_fatal("\"%s\" is not a directory\n", localpath);
 			}
 
@@ -135,7 +135,7 @@ process_source_file(const char *path, file_type_t type, size_t newsize,
 )
 			{
 /*
- * It's a symbolic link in target, but not in source.
+ * It's a symbolic link in source, but not in target.
  * Strange..
  */
 pg_fatal("\"%s\" is not a symbolic link\n", localpath);
@@ -354,7 +354,7 @@ process_block_change(ForkNumber forknum, RelFileNode rnode, BlockNumber blkno)
 		entry = *e;
 	else
 		entry = NULL;
-	free(path);
+	pfree(path);
 
 	if (entry)
 	{
@@ -530,7 +530,7 @@ print_filemap(void)
  * Does it look like a relation data file?
  *
  * For our purposes, only files belonging to the main fork are considered
- * relation files. Other forks are alwayes copied in toto, because we cannot
+ * relation files. Other forks are always copied in toto, because we cannot
  * reliably track changes to them, because WAL only contains block references
  * for the main fork.
  */
diff --git a/src/bin/pg_rewind/libpq_fetch.c b/src/bin/pg_rewind/libpq_fetch.c
index 14a8610..6ffd24e 100644
--- a/src/bin/pg_rewind/libpq_fetch.c
+++ b/src/bin/pg_rewind/libpq_fetch.c
@@ -69,7 +69,7 @@ libpqConnect(const char *connstr)
 	pg_free(str);
 
 	/*
-	 * Also check that full_page-writes are enabled. We can get torn pages if
+	 * Also check that full_page_writes is enabled. We can get torn pages if
 	 * a page is modified while we read it with pg_read_binary_file(), and we
 	 * rely on full page images to fix them.
 	 */
@@ -465,5 +465,5 @@ execute_pagemap(datapagemap_t *pagemap, const char *path)
 
 		fetch_file_range(path, offset, offset + BLCKSZ);
 	}
-	free(iter);
+	pg_free(iter);
 }
diff --git a/src/bin/pg_rewind/logging.h b/src/bin/pg_rewind/logging.h
index 0272a22..efbc9b7 100644
--- a/src/bin/pg_rewind/logging.h
+++ b/src/bin/pg_rewind/logging.h
@@ -32,4 +32,4 @@ extern void pg_fatal(const char *fmt,...) pg_attribute_printf(1, 2) pg_attribute
 
 extern void progress

Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Andres Freund
On 2015-06-10 13:19:14 -0400, Robert Haas wrote:
> On Wed, Jun 10, 2015 at 11:58 AM, Andres Freund  wrote:
> > I think we should just gank spinlocks asap. The hard part is removing
> > them from lwlock.c's slow path and the buffer headers imo. After that we
> > should imo be fine replacing them with lwlocks.
> 
> Mmmph.  I'm not convinced there's any point in replacing
> lightly-contended spinlocks with slower, more-complex lwlocks.  But
> maybe it's best to stay away from that argument and focus on getting
> rid of the spinlocks that are hosing us right now.

In the uncontended case lwlocks are just as fast as spinlocks now, with
the exception of the local tracking array. They're faster if there's
differences with read/write lockers.

> I'm not altogether convinced that a simple replacement of spinlocks
> with atomics is going to solve this problem.  If it does, great.  But
> that doesn't eliminate spinning; it just moves it from the spinlock
> loop to a CAS loop.

Well, not necessarily. If you can write your algorithm in a way that
xadd etc are used, instead of a lock cmpxchg, you're actually never
spinning on x86 as it's guaranteed to succeed.  I think that's possible
for most of the places that currently lock buffer headers.

(I had a version of the lwlock changes that used xadd for shared lock
acquisition - but the code needed to back out in error cases made things
more complicated, and the benefit on a four socket machine wasn't that
large)
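
For illustration, a rough, untested sketch of that shared-lock fast path
written with the 9.5 atomics API. LW_VAL_SHARED and LW_VAL_EXCLUSIVE are
the flag names lwlock.c uses; the function name and everything else here
is made up for the example:

    static bool
    LWLockAttemptLockSharedXadd(LWLock *lock)
    {
        uint32      old_state;

        /* xadd: unconditionally add a shared reference, no CAS retry loop */
        old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);

        if ((old_state & LW_VAL_EXCLUSIVE) == 0)
            return true;        /* acquired in a single atomic op */

        /* an exclusive holder beat us: back the reference out, then queue */
        pg_atomic_fetch_sub_u32(&lock->state, LW_VAL_SHARED);
        return false;
    }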

> This is clearly better: when the CAS works,
> you're done, as opposed to having to then manipulate the cache line
> further and release the spinlock, during any of which the cache line
> may get stolen from you.

It's not just that, it's also that there's no chance of sleeping while
holding a lock. Which doesn't happen that infrequently in contended, cpu
bound workloads.

> However, I'm worried that it will still be possible to create the same
> kinds of performance collapses that we see with spinlocks with the CAS
> loops with which we replace them - or even worse problems, because
> clearly the spin_delay stuff *works*, and we're unlikely to
> incorporate similarly sophisticated logic into every CAS loop.

I doubt it's really necessary, but it shouldn't be too hard to
generalize the sleeping logic so it could be used with little code in
the individual callsites.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-10 Thread Joshua D. Drake


On 06/10/2015 10:22 AM, Robert Haas wrote:


On Wed, Jun 10, 2015 at 1:12 PM, Joshua D. Drake  wrote:

On 06/10/2015 10:01 AM, Andres Freund wrote:

On 2015-06-10 09:57:17 -0700, Jeff Janes wrote:

Mine isn't that.  My goal is to have a consistent backup without
having to shut down the server to take a cold one, or having to manually
juggle the pg_start_backup, etc. commands.


A basebackup won't necessarily give you a consistent log though...


I am -1 on this idea. It just doesn't seem to make sense. There are too many
variables where it won't work or won't be relevant.


I'm not clear on which of these options you are voting for:

(1) include pg_log in pg_basebackup as we do currently
(2) exclude it
(3) add a switch controlling whether or not it gets excluded

I can live with (3), but I bet most people want (2).



Sorry I wasn't clear. #2


Sincerely,

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-10 Thread Robert Haas
On Wed, Jun 10, 2015 at 1:12 PM, Joshua D. Drake  wrote:
> On 06/10/2015 10:01 AM, Andres Freund wrote:
>> On 2015-06-10 09:57:17 -0700, Jeff Janes wrote:
>>> Mine isn't that.  My goal is to have a consistent backup without
>>> having to shut down the server to take a cold one, or having to manually
>>> juggle the pg_start_backup, etc. commands.
>>
>> A basebackup won't necessarily give you a consistent log though...
>
> I am -1 on this idea. It just doesn't seem to make sense. There are too many
> variables where it won't work or won't be relevant.

I'm not clear on which of these options you are voting for:

(1) include pg_log in pg_basebackup as we do currently
(2) exclude it
(3) add a switch controlling whether or not it gets excluded

I can live with (3), but I bet most people want (2).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Robert Haas
On Wed, Jun 10, 2015 at 11:58 AM, Andres Freund  wrote:
> I think we should just gank spinlocks asap. The hard part is removing
> them from lwlock.c's slow path and the buffer headers imo. After that we
> should imo be fine replacing them with lwlocks.

Mmmph.  I'm not convinced there's any point in replacing
lightly-contended spinlocks with slower, more-complex lwlocks.  But
maybe it's best to stay away from that argument and focus on getting
rid of the spinlocks that are hosing us right now.

I'm not altogether convinced that a simple replacement of spinlocks
with atomics is going to solve this problem.  If it does, great.  But
that doesn't eliminate spinning; it just moves it from the spinlock
loop to a CAS loop.  This is clearly better: when the CAS works,
you're done, as opposed to having to then manipulate the cache line
further and release the spinlock, during any of which the cache line
may get stolen from you.  However, I'm worried that it will still be
possible to create the same kinds of performance collapses that we see
with spinlocks with the CAS loops with which we replace them - or even
worse problems, because clearly the spin_delay stuff *works*, and
we're unlikely to incorporate similarly sophisticated logic into every
CAS loop.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-10 Thread Joshua D. Drake


On 06/10/2015 10:01 AM, Andres Freund wrote:


On 2015-06-10 09:57:17 -0700, Jeff Janes wrote:

Mine isn't that.  My goal is to have a consistent backup without
having to shut down the server to take a cold one, or having to manually
juggle the pg_start_backup, etc. commands.


A basebackup won't necessarily give you a consistent log though...


I am -1 on this idea. It just doesn't seem to make sense. There are too 
many variables where it won't work or won't be relevant.


JD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Postgres GSSAPI Encryption

2015-06-10 Thread Robbie Harwood
Robbie Harwood  writes:

> Stephen Frost  writes:
>
>> Robbie,
>>
>> * Robbie Harwood (rharw...@redhat.com) wrote:
>>
>>> We'd I think also want a new kind of HBA entry (probably something along
>>> the lines of `hostgss` to contrast with `hostssl`), but I'm not sure
>>> what we'd want to do for the counterpart of `hostnossl` (`hostnogss`?
>>> But then do we need `hostnogssnossl`?  Is this even a useful setting to
>>> have?), so that will probably require broader discussion.
>>
>> Are you intending to support GSSAPI encryption *without* using the
>> GSSAPI authentication mechanism?  If not, maybe we can come up with a
>> way to have an option to the GSSAPI auth mechanism that behaves the way
>> we want for GSSAPI encryption.  Perhaps something like:
>>
>> encryption = (none | request | require)
>>
>> If you need an option for it to still be able to fall-thru to the next
>> pg_hba line, that'd be more difficult, but is that really a sensible
>> use-case?
>
> That's a good point.  I don't see GSSAPI encryption without GSSAPI
> authentication as a particularly compelling use case, so a setting like
> that (with default presumably to request for backward compatibility)
> seems perfect.
>
> As for fall-through on failure, I don't really know what's reasonable
> here.  My understanding of the way it works currently suggests that it
> would be *expected* to fall-through based on the way other things
> behave, though it's certainly less effort on my part if it does not.
>
>>> Finally, I think there are a couple different ways the protocol could
>>> look.  I'd ideally like to opportunistically encrypt all
>>> GSSAPI-authenticated connections and fallback to non-encrypted when the
>>> other end doesn't support it (possibly dropping the connection if this
>>> causes it to not match HBA), but I'm not sure if this will work with the
>>> existing code.  A (less-nice) option could be to add a new
>>> authentication method for gss->encryption, which feels aesthetically
>>> misplaced.  The approach used for SSL today could probably be adopted as
>>> a third approach, but I don't really see a gain from doing it this way
>>> since encryption happens after authentication (opposite of SSL) in
>>> GSSAPI.
>>
>> I'd definitely like to see us opportunistically encrypt all GSSAPI
>> authenticated connections also.  The main case to consider is how to
>> support migrating users who are currently using GSSAPI + SSL to GSSAPI
>> auth+encryption, and how to support them if they want to continue to use
>> GSSAPI just for user auth and use SSL for encryption.
>
> So if we go with the "encryption" option, we might not need a specific
> entry for GSS hosts.  For the SSL encryption/GSS auth case, we'd want it
> to work the way it does now where you specify "hostssl" and then "gss"
> as the method.  Then for the GSS for auth and encryption, one would use
> a normal "host" entry, then specify gss as the method, and then set
> encryption=require.
>
> One consequence of this would be that you can do "double encryption" by
> specifying "hostssl", then "gss" with "encrypt = require".  I don't
> think this is something more desirable than just one or the other, but
> unless it's actually a problem (and I don't think it is) to have around,
> I think the setup would work nicely.  We could also default "encrypt" to
> "none" when hostssl is specified if it is a problem.

I've coded up the GSSAPI encryption and it is on my github[0].  It's
missing a number of things before merge, including proper error
handling, correct ACL behavior (by and large, it currently doesn't
respect hba.conf), and exposing configuration handles in hba.conf and
the client for the settings we've talked about above, as well as
documentation of all that.

What is here, importantly, is the fallback for old clients to new
servers and new clients to old servers.  A parameter is sent at the
start of normal connection (i.e., after SSL querying), and if the server
sees it, we encrypt; otherwise, the server will generate a failure
similar to the application_name case (and we will retry without it in
the same manner).
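
For the hba side, I'm picturing entries along these lines (hypothetical
syntax, nothing of this is implemented yet):

    host    all    all    192.0.2.0/24    gss    encryption=require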

Thanks!
--Robbie

[0]: https://github.com/frozencemetery/postgres/


signature.asc
Description: PGP signature


Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-10 Thread Andres Freund
On 2015-06-10 09:57:17 -0700, Jeff Janes wrote:
> Mine isn't that.  My goal is to have a consistent backup without
> having to shut down the server to take a cold one, or having to manually
> juggle the pg_start_backup, etc. commands.

A basebackup won't necessarily give you a consistent log though...

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-10 Thread Jeff Janes
On Wed, Jun 10, 2015 at 8:29 AM, Robert Haas  wrote:

> On Mon, Jun 8, 2015 at 12:09 AM, Michael Paquier
>  wrote:
> >> Recently, one of our customers has had a basebackup fail because pg_log
> >> contained files that were >8GB:
> >> FATAL: archive member "pg_log/postgresql-20150119.log" too large for
> tar format
> >>
> >> I think pg_basebackup should also skip pg_log entries, as it does for
> >> pg_stats_temp and pg_replslot, etc. I've attached a patch along those
> >> lines for discussion.
> >
> > And a recent discussion about that is this one:
> >
> http://www.postgresql.org/message-id/82897A1301080E4B8E461DDAA0FFCF142A1B2660@SYD1216
> > Bringing the point: some users may want to keep log files in a base
> > backup, and some users may want to skip some of them, and not only
> > pg_log. Hence we may want more flexibility than what is proposed here.
>
> That seems pretty thin.  If you're taking a base backup, your goal is
> to create a standby.


Mine isn't that.  My goal is to have a consistent backup without
having to shut down the server to take a cold one, or having to manually
juggle the pg_start_backup, etc. commands.  I do occasionally use it to start
up a standby for training/testing purposes, but mostly it is for D-R (in
which I would rather have the logs) and for cloning test/dev/QA
environments (in which case I go delete the logs if I don't want them).


> Copying logs is in no way an integral part of
> that, and we would not copy them if they were stored outside the data
> directory.  If we accept the proposal that this needs to be more
> complicated, will we also accept a proposal to make pg_basebackup
> include relevant files from /var/log when the PostgreSQL logs are
> stored there?
>

I think it is pretty intuitive that if you have your logs go to pg_log,
they get backed up with the other pg_ stuff, and if you change it go
elsewhere, then you need to handle it yourself.

Cheers,

Jeff


[HACKERS] pg_rewind failure by file deletion in source server

2015-06-10 Thread Fujii Masao
Hi,

While testing pg_rewind, I got the following error and pg_rewind failed.

$ pg_rewind -D ... --source-server="..." -P
ERROR:  could not open file "base/13243/16384" for reading: No
such file or directory
STATEMENT:  SELECT path, begin,
  pg_read_binary_file(path, begin, len) AS chunk
FROM fetchchunks

As far as I read the pg_rewind code, ISTM that the file deletion in
source server while pg_rewind is running can cause pg_rewind to fail.
That is, at first pg_rewind picks up the files to copy (or do some actions)
and creates the file map. Then it performs the actual operation (e.g.,
file copy from source to dest) according to the file map. The problem
can happen if the source server deletes the file listed in the file map
before pg_rewind performs the actual operations. The copy of the file
must fail because it's not found in source server, and then pg_rewind
exits with an error.

Shouldn't pg_rewind ignore that failure of operation? If the file is not
found in source server, the file doesn't need to be copied to destination
server obviously. So ISTM that pg_rewind safely can skip copying that file.
Thought?
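
For illustration, the change could be roughly this small (untested sketch;
the file_missing_on_source() helper is made up, pg_rewind has no such
function today):

    /* inside the loop that executes the file map, after a failed fetch */
    if (file_missing_on_source(path))
    {
        /* removed on source after the file map was built; nothing to copy */
        pg_log(PG_PROGRESS, "\"%s\" vanished from source, skipping\n", path);
        continue;
    }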

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [idea] table partition + hash join

2015-06-10 Thread Rosiński Krzysztof 2 - Detal
How does one use this optimization?

select *
from table join partitioned_table on (
  table.part_id = partitioned_table.id
  and hash_func_mod(table.part_id) = hash_func_mod(partitioned_table.id)
)




Re: [HACKERS] reaper should restart archiver even on standby

2015-06-10 Thread Alvaro Herrera
Fujii Masao wrote:

> Agreed. The attached patch defines the macro to check whether archiver is
> allowed to start up or not, and uses it everywhere except sigusr1_handler.
> I made sigusr1_handler use a different condition because only it tries to
> start archiver in PM_STARTUP postmaster state and it looks a bit messy
> to add the check of that state into the centralized check condition.

WFM, but do these macros in xlog.h need a one-line comment to state
their purpose?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication slot restart_lsn initialization

2015-06-10 Thread Gurjeet Singh
On Wed, Jun 10, 2015 at 8:36 AM, Andres Freund  wrote:

> On 2015-06-10 08:24:23 -0700, Gurjeet Singh wrote:
> > On Wed, Jun 10, 2015 at 8:07 AM, Andres Freund 
> wrote:
> > > That doesn't look right to me. Why is this code logging a standby
> > > snapshot for physical slots?
> > >
> >
> > This is the new function I referred to above. The logging of the snapshot
> > is in 'not RecoveryInProgress()' case, meaning it's running in primary
> and
> > not in a standby.
>
> It's still done uselessly for physical slots?
>

I missed the comments on LogStandbySnapshot(). Attached is the new patch
which does the snapshot logging only for a logical replication slot.

I am in the process of writing up a doc patch, and will submit that as well
in a short while.

I have removed a few #include statements which seemed unnecessary. These
changes are not relevant to the patch, so I can readily revert those parts
if deemed necessary.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


physical_repl_slot_activate_restart_lsn.v2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fix logical decoding sendtime update

2015-06-10 Thread Andres Freund
On 2015-06-10 17:57:42 +0200, Shulgin, Oleksandr wrote:
> it turns out, that the code in WalSndWriteData is setting the timestamp of
> the replication message just *after* it has been sent out to the client,
> thus the sendtime field always reads as zero.

Ugh, what a stupid bug. Thanks!

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Fix logical decoding sendtime update

2015-06-10 Thread Shulgin, Oleksandr
Hi Hackers,

it turns out, that the code in WalSndWriteData is setting the timestamp of
the replication message just *after* it has been sent out to the client,
thus the sendtime field always reads as zero.

Attached is a trivial patch to fix this.  The physical replication path
already does the correct thing apparently.

Cheers!
--
Alex
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
new file mode 100644
index eb1b89b..c9444ca
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
*************** static void
*** 1048,1056 ****
  WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
  				bool last_write)
  {
- 	/* output previously gathered data in a CopyData packet */
- 	pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);
- 
  	/*
  	 * Fill the send timestamp last, so that it is taken as late as possible.
  	 * This is somewhat ugly, but the protocol's set as it's already used for
--- 1048,1053 ----
*************** WalSndWriteData(LogicalDecodingContext *
*** 1061,1066 ****
--- 1058,1066 ----
  	memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
  		   tmpbuf.data, sizeof(int64));
  
+ 	/* output previously gathered data in a CopyData packet */
+ 	pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);
+ 
  	/* fast path */
  	/* Try to flush pending output to the client */
  	if (pq_flush_if_writable() != 0)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Andres Freund
On 2015-06-10 11:51:06 -0400, Jan Wieck wrote:
> >ret = pg_atomic_fetch_sub_u32(&buf->state, 1);
> >
> >if (ret & BM_PIN_COUNT_WAITER)
> >{
> >pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER);
> >/* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER 
> > */
> >}
> 
> There are atomic AND and OR functions too, at least in the GCC built in
> parts. We might be able to come up with pg_atomic_...() versions of them and
> avoid the race part.

The race part isn't actually about that. It's that BM_PIN_COUNT_WAITER
might have been set after the fetch_sub above.

fetch_sub() itself would actually be race-free to unset a flag, if it's
a proper power of two.

> While some locks may be avoidable, some may be replaced by atomic
> operations, I believe that we will still be left with some of them.

Besides the two xlog.c ones and lwlock.c, which are hot otherwise? I
think we pretty much removed the rest?

> Optimizing spinlocks if we can do it in a generic fashion that does not hurt
> other platforms will still give us something.

Sure, I'm just doubtful that's easy.

I think we should just gank spinlocks asap. The hard part is removing
them from lwlock.c's slow path and the buffer headers imo. After that we
should imo be fine replacing them with lwlocks.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Nils Goroll
>>> As in 200%+ slower.
>> Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ?
> Yes.

Ok, if this can be validated, we might have a new case now for which my
suggestion would not be helpful. Reviewed, optimized code with short critical
sections and no hotspots by design could indeed be an exception where
slocks should be kept as they are.

> Hm, ok. Any chance you have profiles from back then?

IIUC I had shared all relevant data on the list. Does this help?
http://www.postgresql.org/message-id/4fe9eb27.9020...@schokola.de

Thanks, Nils


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Jan Wieck

On 06/10/2015 11:34 AM, Andres Freund wrote:

If you check the section where the spinlock is held there's nontrivial
code executed. Under contention you'll have the problem that if a backend
tries to acquire the spinlock while another backend holds the lock,
it'll "steal" the cacheline on which the lock and the rest of the buffer
descriptor resides. Which means that operations like buf->refcount--, the
if (), and the &= will now have to wait till the cacheline is
transferred back (as the cacheline will directly be marked as modified
on the attempted locker). All the while other backends will also try to
acquire the pin, causing the same kind of trouble.


Yes, that is the possibility of "cache line stealing while holding the 
lock" that I was talking about. It looks like we are in wild agreement 
(as Kevin usually puts it).




If this'd instead be written as

ret = pg_atomic_fetch_sub_u32(&buf->state, 1);

if (ret & BM_PIN_COUNT_WAITER)
{
pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER);
/* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER */
}


There are atomic AND and OR functions too, at least in the GCC built in 
parts. We might be able to come up with pg_atomic_...() versions of them 
and avoid the race part.
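
Something along these lines, i.e. thin wrappers over the GCC builtins
(names and placement here are assumed, just to make the idea concrete):

    static inline uint32
    pg_atomic_fetch_and_u32(volatile uint32 *ptr, uint32 and_)
    {
        return __sync_fetch_and_and(ptr, and_);
    }

    /* clearing the waiter flag then needs no CAS loop at all: */
    pg_atomic_fetch_and_u32(&buf->state, ~BM_PIN_COUNT_WAITER);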




the whole thing looks entirely different. While there's obviously still
cacheline bouncing, every operation is guaranteed to make progress (on
x86 at least, but even elsewhere we'll never block another locker).

Now unlock is the simpler case, but...


While some locks may be avoidable, some may be replaced by atomic 
operations, I believe that we will still be left with some of them. 
Optimizing spinlocks if we can do it in a generic fashion that does not 
hurt other platforms will still give us something.



Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Andres Freund
On 2015-06-10 17:30:33 +0200, Nils Goroll wrote:
> On 10/06/15 17:17, Andres Freund wrote:
> > On 2015-06-10 16:07:50 +0200, Nils Goroll wrote:
> > Interesting. I've been able to reproduce quite massive slowdowns doing
> > this on a 4 socket linux machine (after applying the lwlock patch that's
> > now in 9.5) 
> 
> Sorry, I cannot comment on this; 9.4.1 is the latest we are running in
> production and I haven't even tested the patch with 9.5.

Ok.

> > As in 200%+ slower.
> 
> Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ?

Yes.

> >> Ref: http://www.postgresql.org/message-id/4fede0bf.7080...@schokola.de
> > 
> > Do you have any details about the workloads that scaled badly back then?
> > It'd be interesting to find out which spinlocks they bottlenecked
> > on.
> 
> OLTP. But really the root cause from back then should be eliminated; this was
> with 9.1.3.

Hm, ok. Any chance you have profiles from back then? It'd be very
interesting to know which spinlock you were contending on. If we convert
spinlocks into something more heavyweight we'll want to benchmark the
actually contending cases to avoid regressions.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication slot restart_lsn initialization

2015-06-10 Thread Andres Freund
On 2015-06-10 08:24:23 -0700, Gurjeet Singh wrote:
> On Wed, Jun 10, 2015 at 8:07 AM, Andres Freund  wrote:
> > That doesn't look right to me. Why is this code logging a standby
> > snapshot for physical slots?
> >
> 
> This is the new function I referred to above. The logging of the snapshot
> is in the 'not RecoveryInProgress()' case, meaning it's running on the primary and
> not in a standby.

It's still done uselessly for physical slots?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Andres Freund
On 2015-06-10 11:12:46 -0400, Jan Wieck wrote:
> The test case is that 200 threads are running in a tight loop like this:
> 
> for (...)
> {
> s_lock();
> // do something with a global variable
> s_unlock();
> }
> 
> That is the most contended case I can think of, yet the short and
> predictable code while holding the lock is the intended use case for a
> spinlock.

But lots of our code isn't that predictable. Check e.g. the buffer
spinlock case:

static void
UnpinBuffer(volatile BufferDesc *buf, bool fixOwner)
{
PrivateRefCountEntry *ref;

/* not moving as we're likely deleting it soon anyway */
ref = GetPrivateRefCountEntry(buf->buf_id + 1, false);
Assert(ref != NULL);

if (fixOwner)
		ResourceOwnerForgetBuffer(CurrentResourceOwner,
								  BufferDescriptorGetBuffer(buf));

Assert(ref->refcount > 0);
ref->refcount--;
if (ref->refcount == 0)
{
/* I'd better not still hold any locks on the buffer */
Assert(!LWLockHeldByMe(buf->content_lock));
Assert(!LWLockHeldByMe(buf->io_in_progress_lock));

LockBufHdr(buf);

/* Decrement the shared reference count */
Assert(buf->refcount > 0);
buf->refcount--;

/* Support LockBufferForCleanup() */
if ((buf->flags & BM_PIN_COUNT_WAITER) &&
buf->refcount == 1)
{
			/* we just released the last pin other than the waiter's */
			int			wait_backend_pid = buf->wait_backend_pid;

buf->flags &= ~BM_PIN_COUNT_WAITER;
UnlockBufHdr(buf);
ProcSendSignal(wait_backend_pid);
}
else
UnlockBufHdr(buf);

ForgetPrivateRefCountEntry(ref);
}
}


If you check the section where the spinlock is held there's nontrivial
code executed. Under contention you'll have the problem that if a backend
tries to acquire the spinlock while another backend holds the lock,
it'll "steal" the cacheline on which the lock and the rest of the buffer
descriptor resides. Which means that operations like buf->refcount--, the
if (), and the &= will now have to wait till the cacheline is
transferred back (as the cacheline will directly be marked as modified
on the attempted locker). All the while other backends will also try to
acquire the pin, causing the same kind of trouble.

If this'd instead be written as

ret = pg_atomic_fetch_sub_u32(&buf->state, 1);

if (ret & BM_PIN_COUNT_WAITER)
{
   pg_atomic_fetch_sub_u32(&buf->state, BM_PIN_COUNT_WAITER);
   /* XXX: deal with race that another backend has set BM_PIN_COUNT_WAITER */
}

the whole thing looks entirely different. While there's obviously still
cacheline bouncing, every operation is guaranteed to make progress (on
x86 at least, but even elsewhere we'll never block another locker).

Now unlock is the simpler case, but...


- Andres



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Nils Goroll


On 10/06/15 17:17, Andres Freund wrote:
> On 2015-06-10 16:07:50 +0200, Nils Goroll wrote:
>> On larger Linux machines, we have been running with spin locks replaced by
> >> generic posix mutexes for years now. I personally haven't looked at the code
> >> for ages, but we maintain a patch which pretty much does the same thing still:
> 
> Interesting. I've been able to reproduce quite massive slowdowns doing
> this on a 4 socket linux machine (after applying the lwlock patch that's
> now in 9.5) 

Sorry, I cannot comment on this; 9.4.1 is the latest we are running in
production and I haven't even tested the patch with 9.5.

> As in 200%+ slower.

Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ?

>> Ref: http://www.postgresql.org/message-id/4fede0bf.7080...@schokola.de
> 
> Do you have any details about the workloads that scaled badly back then?
> It'd be interesting to find out which spinlocks they bottlenecked
> on.

OLTP. But really the root cause from back then should be eliminated; this was
with 9.1.3.

I only got woken up by s_lock() in email subjects.

Nils


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-10 Thread Robert Haas
On Mon, Jun 8, 2015 at 12:09 AM, Michael Paquier
 wrote:
>> Recently, one of our customers has had a basebackup fail because pg_log
>> contained files that were >8GB:
>> FATAL: archive member "pg_log/postgresql-20150119.log" too large for tar 
>> format
>>
>> I think pg_basebackup should also skip pg_log entries, as it does for
>> pg_stats_temp and pg_replslot, etc. I've attached a patch along those
>> lines for discussion.
>
> And a recent discussion about that is this one:
> http://www.postgresql.org/message-id/82897A1301080E4B8E461DDAA0FFCF142A1B2660@SYD1216
> Bringing the point: some users may want to keep log files in a base
> backup, and some users may want to skip some of them, and not only
> pg_log. Hence we may want more flexibility than what is proposed here.

That seems pretty thin.  If you're taking a base backup, your goal is
to create a standby.  Copying logs is in no way an integral part of
that, and we would not copy them if they were stored outside the data
directory.  If we accept the proposal that this needs to be more
complicated, will we also accept a proposal to make pg_basebackup
include relevant files from /var/log when the PostgreSQL logs are
stored there?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication slot restart_lsn initialization

2015-06-10 Thread Gurjeet Singh
On Wed, Jun 10, 2015 at 8:07 AM, Andres Freund  wrote:

> On 2015-06-10 08:00:28 -0700, Gurjeet Singh wrote:
>
> > pg_create_logical_replication_slot() prevents the LSN from being
> > recycled by looping (worst case 2 times) until there's no
> > conflict with the checkpointer recycling the segment. So I have used
> > the same approach.
>
> There's no need to change anything for logical slots? Or do you just
> mean that you moved the existing code?
>

Yes, I turned the code from logical replication into a function and used it
from logical and physical replication.


> >
> >  /*
> > + * Grab and save an LSN value to prevent WAL recycling past that point.
> > + */
> > +void
> > +ReplicationSlotRegisterRestartLSN()
> > +{

...

> > + /*
> > +  * Let's start with enough information if we can, so log a standby
> > +  * snapshot and start decoding at exactly that position.
> > +  */
> > + if (!RecoveryInProgress())
> > + {
> > + XLogRecPtr  flushptr;
> > +
> > + /* start at current insert position */
> > + slot->data.restart_lsn = GetXLogInsertRecPtr();
> > +
> > + /* make sure we have enough information to start */
> > + flushptr = LogStandbySnapshot();
> > +
> > + /* and make sure it's fsynced to disk */
> > + XLogFlush(flushptr);
> > + }
> > + else
> > + slot->data.restart_lsn = GetRedoRecPtr();
> > +
>


> That doesn't look right to me. Why is this code logging a standby
> snapshot for physical slots?
>

This is the new function I referred to above. The logging of the snapshot
is in the 'not RecoveryInProgress()' case, meaning it's running on the primary and
not in a standby.

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Andres Freund
On 2015-06-10 16:07:50 +0200, Nils Goroll wrote:
> On larger Linux machines, we have been running with spin locks replaced by
> generic posix mutexes for years now. I personally haven't looked at the code for
> ages, but we maintain a patch which pretty much does the same thing still:

Interesting. I've been able to reproduce quite massive slowdowns doing
this on a 4 socket linux machine (after applying the lwlock patch that's
now in 9.5) in readonly workloads. As in 200%+ slower. And that was with
a new kernel/glibc.  That was primarily due to buffer header
spinlocks. For write workloads the difference was smaller, but still
noticeably. There xlog.c's spinlocks where noticeable which are usually
held very shortly.

> Ref: http://www.postgresql.org/message-id/4fede0bf.7080...@schokola.de

Do you have any details about the workloads that scaled badly back then?
It'd be interesting to find out which spinlocks they bottlenecked
on.

> I understand that there are systems out there which have less efficient posix
> mutex implementations than Linux (which uses futexes), but I think it would
> still be worth considering to do away with the roll-your-own spinlocks on
> systems whose posix mutexes are known to behave.

If we get rid of the 'hot' spinlocks, something very roughly like this
could become realistic (we can't rely on pthreads, but ...). I don't
think it'll happen before that.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Nils Goroll
On 10/06/15 17:12, Jan Wieck wrote:
> for (...)
> {
> s_lock();
> // do something with a global variable
> s_unlock();
> }

OK, I understand now, thank you. I am not sure if this test case is appropriate
for the critical sections in postgres (if it were, we'd not have the problem we
are discussing).

Nils


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Jan Wieck

On 06/10/2015 11:06 AM, Nils Goroll wrote:

On 10/06/15 16:18, Jan Wieck wrote:


I have played with test code that isolates a stripped down version of s_lock()
and uses it with multiple threads. I then implemented multiple different
versions of that s_lock(). The results with 200 concurrent threads are that
using a __sync_val_compare_and_swap() to acquire the lock and then falling back
to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and
then doing a usleep(1000) (one millisecond) gives me 25 million locks/second.

Note that the __sync_val_compare_and_swap() GCC built in seems identical in
performance with the assembler xchgb operation used by PostgreSQL today on 
x86_64.


These numbers don't work for me. Do IUC that you are not holding the lock for
any reasonable time? If yes, the test case is invalid (the uncontended case is
never relevant). If no, the numbers don't match up - if you held one lock for
1ms, you'd not get more than 1000 locks/s anyway. If you had 200 locks, you'd
get 200,000 locks/s.

Can you please explain what the message is you are trying to get across?


The test case is that 200 threads are running in a tight loop like this:

for (...)
{
s_lock();
// do something with a global variable
s_unlock();
}

That is the most contended case I can think of, yet the short and 
predictable code while holding the lock is the intended use case for a 
spinlock.


The code in s_lock() is what does the multiple CAS attempts, then sleeps.
The code never holds the lock for 1ms. Sorry if that wasn't clear.
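
In other words, the measured variant looks roughly like this (simplified
from the test code, not verbatim):

    static void
    test_s_lock(volatile char *lock)
    {
        for (;;)
        {
            int     i;

            for (i = 0; i < 10; i++)        /* short spin phase */
            {
                if (__sync_val_compare_and_swap(lock, 0, 1) == 0)
                    return;                 /* got the lock */
            }
            usleep(1000);                   /* back off ~1ms, then retry */
        }
    }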



Regards, Jan


--
Jan Wieck
Senior Software Engineer
http://slony.info


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Nils Goroll


On 10/06/15 17:01, Andres Freund wrote:
>> > - The fact that well behaved mutexes have a higher initial cost could even
>> >   motivate good use of them rather than optimize misuse.
> Well. There's many locks in a RDBMS that can't realistically be
> avoided. So optimizing for no and moderate contention isn't something
> you can simply forgo.

Let's get back to my initial suggestion:

On 10/06/15 16:07, Nils Goroll wrote:
> I think it would
> still be worth considering to do away with the roll-your-own spinlocks on
> systems whose posix mutexes are known to behave.

Where we use the mutex patch we have not seen any relevant negative impact -
neither in benchmarks nor in production.

So, yes, postgres should still work fine on a 2-core laptop and I don't see any
reason why using posix mutexes *where they are known to behave* would do any 
harm.

And, to be honest, Linux is quite dominant, so solving the issue for this
platform would be a start at least.
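
For reference, the core of such a replacement is tiny (sketch, assuming
glibc; the attribute setup is the interesting part):

    #include <pthread.h>

    pthread_mutexattr_t attr;
    pthread_mutex_t     slock;      /* lives in shared memory */

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&slock, &attr);

    /* s_lock()/s_unlock() then become pthread_mutex_lock()/unlock() */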

Nils


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication slot restart_lsn initialization

2015-06-10 Thread Andres Freund
On 2015-06-10 08:00:28 -0700, Gurjeet Singh wrote:
> Attached is the patch that takes the former approach (initialize
> restart_lsn when the slot is created).

If it's an option that's imo a sane approach.

> pg_create_logical_replication_slot() prevents the LSN from being
> recycled by looping (worst case 2 times) until there's no
> conflict with the checkpointer recycling the segment. So I have used
> the same approach.

There's no need to change anything for logical slots? Or do you just
mean that you moved the existing code?
>  
>  /*
> + * Grab and save an LSN value to prevent WAL recycling past that point.
> + */
> +void
> +ReplicationSlotRegisterRestartLSN()
> +{
> + ReplicationSlot *slot = MyReplicationSlot;
> +
> + Assert(slot != NULL);
> + Assert(slot->data.restart_lsn == InvalidXLogRecPtr);
> +
> + /*
> +  * The replication slot mechanism is used to prevent removal of required
> +  * WAL. As there is no interlock between this and checkpoints required, WAL
> +  * segment could be removed before ReplicationSlotsComputeRequiredLSN() has
> +  * been called to prevent that. In the very unlikely case that this happens
> +  * we'll just retry.
> +  */
> + while (true)
> + {
> + XLogSegNo   segno;
> +
> + /*
> +  * Let's start with enough information if we can, so log a standby
> +  * snapshot and start decoding at exactly that position.
> +  */
> + if (!RecoveryInProgress())
> + {
> + XLogRecPtr  flushptr;
> +
> + /* start at current insert position */
> + slot->data.restart_lsn = GetXLogInsertRecPtr();
> +
> + /* make sure we have enough information to start */
> + flushptr = LogStandbySnapshot();
> +
> + /* and make sure it's fsynced to disk */
> + XLogFlush(flushptr);
> + }
> + else
> + slot->data.restart_lsn = GetRedoRecPtr();
> +
> + /* prevent WAL removal as fast as possible */
> + ReplicationSlotsComputeRequiredLSN();
> +
> + /*
> +  * If all required WAL is still there, great, otherwise retry. The
> +  * slot should prevent further removal of WAL, unless there's a
> +  * concurrent ReplicationSlotsComputeRequiredLSN() after we've written
> +  * the new restart_lsn above, so normally we should never need to loop
> +  * more than twice.
> +  */
> + XLByteToSeg(slot->data.restart_lsn, segno);
> + if (XLogGetLastRemovedSegno() < segno)
> + break;
> + }
> +}

That doesn't look right to me. Why is this code logging a standby
snapshot for physical slots?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Nils Goroll
On 10/06/15 16:18, Jan Wieck wrote:
> 
> I have played with test code that isolates a stripped down version of s_lock()
> and uses it with multiple threads. I then implemented multiple different
> versions of that s_lock(). The results with 200 concurrent threads are that
> using a __sync_val_compare_and_swap() to acquire the lock and then falling back
> to a futex() is limited to about 500,000 locks/second. Spinning for 10 times and
> then doing a usleep(1000) (one millisecond) gives me 25 million locks/second.
> 
> Note that the __sync_val_compare_and_swap() GCC built in seems identical in
> performance with the assembler xchgb operation used by PostgreSQL today on 
> x86_64.

These numbers don't work for me. Do IUC that you are not holding the lock for
any reasonable time? If yes, the test case is invalid (the uncontended case is
never relevant). If no, the numbers don't match up - if you held one lock for
1ms, you'd not get more than 1000 locks/s anyway. If you had 200 locks, you'd
get 200,000 locks/s.

Can you please explain what the message is you are trying to get across?

Thanks, Nils


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Jan Wieck

On 06/10/2015 10:59 AM, Robert Haas wrote:

On Wed, Jun 10, 2015 at 10:20 AM, Tom Lane  wrote:

Jan Wieck  writes:

The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".


Hm.  One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help.  The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no?  Maybe we just need to alter the feedback loop a bit.

(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)


The code seems to have been written with the idea that we should
converge to MAX_SPINS_PER_DELAY if spinning *ever* works.  The way
that's implemented is that, if we get a spinlock without having to
delay, we add 100 to spins_per_delay, but if we have to delay at least
once (potentially hundreds of times), then we subtract 1.
spins_per_delay will be >900 most of the time even if only 1% of the
lock acquisitions manage to get the lock without delaying.


And note that spins_per_delay is global. Getting ANY lock without delay
adds 100, regardless of whether it is a highly or a lowly contended one.
Your process only needs to hit one low-contention slock every 100 calls to
securely peg this value >=900.



Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Andres Freund
On 2015-06-10 16:55:31 +0200, Nils Goroll wrote:
> But still I am convinced that on today's massively parallel NUMAs, spinlocks 
> are
> plain wrong:

Sure. But a large number of installations are not using massive NUMA
systems, so we can't focus on optimizing for NUMA.

We definitely have quite some catching up to do there. Unfortunately most of
the problems are only reproducible on 4 or 8 socket machines, and it's
hard to get hands on those for prolonged amounts of time.

> - Even if critical sections are kept minimal, they can still become hot spots

That's why we started to remove several of them...

> - The fact that well behaved mutexes have a higher initial cost could even
>   motivate good use of them rather than optimize misuse.

Well. There's many locks in a RDBMS that can't realistically be
avoided. So optimizing for no and moderate contention isn't something
you can simply forgo.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] replication slot restart_lsn initialization

2015-06-10 Thread Gurjeet Singh
On Tue, May 5, 2015 at 5:53 PM, Andres Freund  wrote:

>
> > Was there any consideration for initializing restart_lsn to the latest
> > WAL write pointer when a slot is created? Or for allowing an optional
> > parameter in pg_create_(physical|logical)_replication_slot() for
> > specifying the restart_lsn at slot creation?
>
> I've been wondering about allowing for the latter alternative. I could
> have used it a couple times. The former doesn't make much sense to me,
> it could be too far *ahead* in many cases actually.  A patch for this
> would be fairly trivial.
>

Attached is the patch that takes the former approach (initialize
restart_lsn when the slot is created). I think it's better than the latter
approach (depending on the user to specify an LSN) because the LSN the user
specifies may have already been recycled. pg_create_logical_replication_slot()
prevents the LSN from being recycled by looping (worst case 2 times) until
there's no conflict with the checkpointer recycling the segment. So I have
used the same approach.

The function pg_create_physical_replication_slot() now has an additional
boolean parameter 'activate' which user can use to allocate restart_lsn as
part of the creation process.
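
With the patch applied, usage would look like this (the second argument is
the proposed 'activate' flag, so this only works with the patch):

    -- reserve WAL immediately at slot creation
    SELECT * FROM pg_create_physical_replication_slot('standby_1', true);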

Best regards,
-- 
Gurjeet Singh http://gurjeet.singh.im/


physical_repl_slot_activate_restart_lsn.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Robert Haas
On Wed, Jun 10, 2015 at 10:20 AM, Tom Lane  wrote:
> Jan Wieck  writes:
>> The attached patch demonstrates that less aggressive spinning and (much)
>> more often delaying improves the performance "on this type of machine".
>
> Hm.  One thing worth asking is why the code didn't converge to a good
> value of spins_per_delay without help.  The value should drop every time
> we had to delay, so under heavy contention it ought to end up small
> anyhow, no?  Maybe we just need to alter the feedback loop a bit.
>
> (The comment about uniprocessors vs multiprocessors seems pretty wacko in
> this context, but at least the sign of the feedback term seems correct.)

The code seems to have been written with the idea that we should
converge to MAX_SPINS_PER_DELAY if spinning *ever* works.  The way
that's implemented is that, if we get a spinlock without having to
delay, we add 100 to spins_per_delay, but if we have to delay at least
once (potentially hundreds of times), then we subtract 1.
spins_per_delay will be >900 most of the time even if only 1% of the
lock acquisitions manage to get the lock without delaying.
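
In pseudo-C, the current feedback amounts to this (paraphrased from
s_lock.c, not a verbatim quote):

    if (cur_delay == 0)     /* never had to delay: big reward */
        spins_per_delay = Min(spins_per_delay + 100, MAX_SPINS_PER_DELAY);
    else                    /* delayed at least once: tiny penalty */
        spins_per_delay = Max(spins_per_delay - 1, MIN_SPINS_PER_DELAY);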

It is possible that, as you say, all we need to do is alter the
feedback loop so that, say, we subtract 1 every time we delay (rather
than every time we have at least 1 delay) and add 1 (rather than 100)
every time we don't end up needing to delay.  I'm a bit concerned,
though, that this would tend to make spins_per_delay unstable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Andres Freund
On 2015-06-10 10:25:32 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Unfortunately there's no portable futex support. That's what stopped us
> > from adopting them so far.  And even futexes can be significantly more
> > heavyweight under moderate contention than our spinlocks - It's rather
> > easy to reproduce scenarios where futexes cause significant slowdown in
> > comparison to spinning in userspace (just reproduce contention on a
> > spinlock where the protected area will be *very* short - i.e. no cache
> > misses, no branches or such).
> 
> Which, you'll note, is the ONLY case that's allowed by our coding rules
> for spinlock use.  If there are any locking sections that are not very
> short straight-line code, or at least code with easily predicted branches,
> we need to fix those before we worry about the spinlock mechanism per
> se.

We haven't followed that all that strictly imo. While lwlocks are a bit
less problematic in 9.5 (as they take far fewer spinlocks), they're
still far from perfect as we manipulate linked lists while holding a
lock.  We also do lots of hard-to-predict stuff while the buffer header
spinlock is held...

> Optimizing for misuse of the mechanism is not the way.

Agreed. I'm not particularly interested in optimizing spinlocks. We
should get rid of most.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Nils Goroll
On 10/06/15 16:20, Andres Freund wrote:
> That's precisely what I referred to in the bit you cut away...

I apologize, yes.

On 10/06/15 16:25, Tom Lane wrote:
> Optimizing for misuse of the mechanism is not the way.

I absolutely agree and I really appreciate all efforts towards lockless data
structures or at least better concurrency using classical mutual exclusion.

But still I am convinced that on today's massively parallel NUMAs, spinlocks are
plain wrong:

- Even if critical sections are kept minimal, they can still become hot spots

- When they do, we get potentially massive negative scalability; it will be
  hard to exclude the possibility of a system "tilting" under (potentially yet
  unknown) load patterns as long as userland slocks exist.

  Briefly: When slocks fail, they fail big time

- slocks optimize for the best case, but I think on today's systems we should
  optimize for the worst case.

- The fact that well behaved mutexes have a higher initial cost could even
  motivate good use of them rather than optimize misuse.

Cheers,

Nils


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Jan Wieck

On 06/10/2015 10:20 AM, Tom Lane wrote:

Jan Wieck  writes:

The attached patch demonstrates that less aggressive spinning and (much)
more often delaying improves the performance "on this type of machine".


Hm.  One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help.  The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no?  Maybe we just need to alter the feedback loop a bit.

(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)


The feedback loop looks a bit heavily weighted towards increasing the spin 
count vs. decreasing it (100 up vs. 1 down). I have test time booked on 
the machine for tomorrow and will test that as well.


However, to me it seems that with the usual minimum sleep() interval of 
1ms, once we have to delay at all we are already losing. That spinning 
10x still outperforms the same code with 1,000 spins per delay by a factor 
of 5 tells me that "on this particular box" something is going horribly 
wrong once we get over the tipping point in concurrency. As said, I am 
not sure what exactly that is yet. At a minimum the probability that 
another CPU package is stealing the cache line from the one, holding the 
spinlock, is going up. Which cannot possibly be good for performance. 
But I would expect that to show a more gradual drop in throughput than 
what I see in the pgbench -S example. Kevin had speculated to me that it 
may be possible that at that tipping point the kernel starts feeling the 
need to relocate the memory page in question to whichever cpu package 
makes the most failing requests and thus ends up with a huge round robin 
page relocation project. Unfortunately I don't know how to confirm or 
disprove that theory.


This is done on CentOS 7 with kernel 3.10 BTW. And no, I am not at 
liberty to install a different distribution or switch to another kernel.



Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Auto-vacuum is not running in 9.1.12

2015-06-10 Thread Alvaro Herrera
Prakash Itnal wrote:
> Hello,
> 
> Recently we encountered an issue where the disk space was continuously
> increasing towards 100%. Then a manual vacuum freed the disk space. But
> it is increasing again. When we dug in further, we found that auto-vacuuming
> was not running, or that it is stuck/hung.

Hm, we have seen this on Windows, I think.

Is the "stats collector" process running?  Is it stuck?

If you attach to process 6504 (autovac launcher), what's the backtrace?
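
For example:

    gdb -p 6504
    (gdb) bt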

> 4) Last run auto-vacuum:
> SELECT now(), schemaname, relname, last_vacuum, last_autovacuum, 
> vacuum_count, autovacuum_count FROM pg_stat_user_tables;
> 
>               now              | schemaname | relname | last_vacuum |        last_autovacuum        | vacuum_count | autovacuum_count 
> -------------------------------+------------+---------+-------------+-------------------------------+--------------+------------------
>  2015-06-10 01:03:03.574212+02 | public     | abcd    |             | 2015-04-18 00:52:35.008874+02 |            0 |                2
>  2015-06-10 01:03:03.574212+02 | public     | xyz     |             | 2015-05-02 06:01:35.220651+02 |            0 |               20
> 
> NOTE: I changed the relname for the above two tables due to confidentiality.

Are there dead tuples in tables?  Maybe vacuums are getting executed and
these values are not updated, for instance?
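
Something like this would show whether dead tuples are accumulating:

    SELECT schemaname, relname, n_live_tup, n_dead_tup
      FROM pg_stat_user_tables
     ORDER BY n_dead_tup DESC;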

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] reaper should restart archiver even on standby

2015-06-10 Thread Fujii Masao
On Wed, Jun 10, 2015 at 11:12 PM, Alvaro Herrera
 wrote:
> Fujii Masao wrote:
>> On Tue, Jun 9, 2015 at 5:21 AM, Alvaro Herrera  
>> wrote:
>> > Fujii Masao wrote:
>
>> > Can't we create
>> > some common function that would be called both here and on ServerLoop?
>>
>> Agreed. So, what about the attached patch?
>
> No attachment ...

Ooops... Attached.

>> OTOH, in the other places where archiver is started up,
>> we can reach there during not only recovery but also normal processing.
>> So the conditions that we need to check are different.
>
> I think it would be simpler to centralize knowledge in a single
> function, and have that function take an argument indicating whether
> we're in recovery or normal processing, instead of spreading it to every
> place that can possibly start the archiver.

Agreed. The attached patch defines a macro to check whether the archiver
is allowed to start up or not, and uses it everywhere except
sigusr1_handler. I made sigusr1_handler use a different condition because
it alone tries to start the archiver in the PM_STARTUP postmaster state,
and it looks a bit messy to add a check of that state to the centralized
condition.

Regards,

-- 
Fujii Masao
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***
*** 406,411  static pid_t StartChildProcess(AuxProcType type);
--- 406,422 
  static void StartAutovacuumWorker(void);
  static void InitPostmasterDeathWatchHandle(void);
  
+ /*
+  * Is the archiver allowed to start up in the current postmaster state?
+  *
+  * If WAL archiving is enabled always, we are allowed to start archiver
+  * even during recovery.
+  */
+ #define PgArchStartupAllowed()	\
+ 	((XLogArchivingActive() && pmState == PM_RUN) ||	\
+ 	 (XLogArchivingStandby() &&	\
+ 	  (pmState == PM_RECOVERY || pmState == PM_HOT_STANDBY)))
+ 
  #ifdef EXEC_BACKEND
  
  #ifdef WIN32
***
*** 1649,1669  ServerLoop(void)
  		if (PgStatPID == 0 && pmState == PM_RUN)
  			PgStatPID = pgstat_start();
  
! 		/*
! 		 * If we have lost the archiver, try to start a new one.
! 		 *
! 		 * If WAL archiving is enabled always, we try to start a new archiver
! 		 * even during recovery.
! 		 */
! 		if (PgArchPID == 0 && wal_level >= WAL_LEVEL_ARCHIVE)
! 		{
! 			if ((pmState == PM_RUN && XLogArchiveMode > ARCHIVE_MODE_OFF) ||
! ((pmState == PM_RECOVERY || pmState == PM_HOT_STANDBY) &&
!  XLogArchiveMode == ARCHIVE_MODE_ALWAYS))
! 			{
  PgArchPID = pgarch_start();
- 			}
- 		}
  
  		/* If we need to signal the autovacuum launcher, do so now */
  		if (avlauncher_needs_signal)
--- 1660,1668 
  		if (PgStatPID == 0 && pmState == PM_RUN)
  			PgStatPID = pgstat_start();
  
! 		/* If we have lost the archiver, try to start a new one. */
! 		if (PgArchPID == 0 && PgArchStartupAllowed())
  PgArchPID = pgarch_start();
  
  		/* If we need to signal the autovacuum launcher, do so now */
  		if (avlauncher_needs_signal)
***
*** 2669,2675  reaper(SIGNAL_ARGS)
  			 */
  			if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
  AutoVacPID = StartAutoVacLauncher();
! 			if (XLogArchivingActive() && PgArchPID == 0)
  PgArchPID = pgarch_start();
  			if (PgStatPID == 0)
  PgStatPID = pgstat_start();
--- 2668,2674 
  			 */
  			if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
  AutoVacPID = StartAutoVacLauncher();
! 			if (PgArchStartupAllowed() && PgArchPID == 0)
  PgArchPID = pgarch_start();
  			if (PgStatPID == 0)
  PgStatPID = pgstat_start();
***
*** 2810,2816  reaper(SIGNAL_ARGS)
  			if (!EXIT_STATUS_0(exitstatus))
  LogChildExit(LOG, _("archiver process"),
  			 pid, exitstatus);
! 			if (XLogArchivingActive() && pmState == PM_RUN)
  PgArchPID = pgarch_start();
  			continue;
  		}
--- 2809,2815 
  			if (!EXIT_STATUS_0(exitstatus))
  LogChildExit(LOG, _("archiver process"),
  			 pid, exitstatus);
! 			if (PgArchStartupAllowed())
  PgArchPID = pgarch_start();
  			continue;
  		}
***
*** 4833,4843  sigusr1_handler(SIGNAL_ARGS)
  		 * files.
  		 */
  		Assert(PgArchPID == 0);
! 		if (wal_level >= WAL_LEVEL_ARCHIVE &&
! 			XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
! 		{
  			PgArchPID = pgarch_start();
- 		}
  
  		pmState = PM_RECOVERY;
  	}
--- 4832,4839 
  		 * files.
  		 */
  		Assert(PgArchPID == 0);
! 		if (XLogArchivingStandby())
  			PgArchPID = pgarch_start();
  
  		pmState = PM_RECOVERY;
  	}
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***
*** 128,133  extern int	wal_level;
--- 128,135 
  
  #define XLogArchivingActive() \
  	(XLogArchiveMode > ARCHIVE_MODE_OFF && wal_level >= WAL_LEVEL_ARCHIVE)
+ #define XLogArchivingStandby() \
+ 	(XLogArchiveMode == ARCHIVE_MODE_ALWAYS && wal_level >= WAL_LEVEL_ARCHIVE)
  #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
  
  /*


Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Tom Lane
Andres Freund  writes:
> Unfortunately there's no portable futex support. That's what stopped us
> from adopting them so far.  And even futexes can be significantly more
> heavyweight under moderate contention than our spinlocks - It's rather
> easy to reproduce scenarios where futexes cause significant slowdown in
> comparison to spinning in userspace (just reproduce contention on a
> spinlock where the protected area will be *very* short - i.e. no cache
> misses, no branches or such).

Which, you'll note, is the ONLY case that's allowed by our coding rules
for spinlock use.  If there are any locking sections that are not very
short straight-line code, or at least code with easily predicted branches,
we need to fix those before we worry about the spinlock mechanism per se.
Optimizing for misuse of the mechanism is not the way.

regards, tom lane




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Andres Freund
On 2015-06-10 16:12:05 +0200, Nils Goroll wrote:
>
> On 10/06/15 16:05, Andres Freund wrote:
> > it'll nearly always be beneficial to spin
>
> Trouble is that postgres cannot know if the process holding the lock is
> actually running, so if it isn't, all we're doing is burning cycles and
> making the problem worse.

That's precisely what I referred to in the bit you cut away...

> Contrary to that, the kernel does know, so for a (f|m)utex which fails to
> acquire immediately and thus needs to syscall, the kernel has the option to 
> spin
> only if the lock holder is running (the "adaptive" mutex).

Unfortunately there's no portable futex support. That's what stopped us
from adopting them so far.  And even futexes can be significantly more
heavyweight under moderate contention than our spinlocks - It's rather
easy to reproduce scenarios where futexes cause significant slowdown in
comparison to spinning in userspace (just reproduce contention on a
spinlock where the protected area will be *very* short - i.e. no cache
misses, no branches or such).

I think we should eventually work on converting most of the code that
currently uses spinlocks to either use lwlocks (which will enter the
kernel under contention, but not otherwise) or use lockless programming
techniques. I think there are relatively few relevant places left. Most
prominently the buffer header spinlocks...
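
For anyone not familiar with the futex behaviour being discussed, here is
a deliberately naive Linux-only sketch. A production-quality version needs
a third "contended" state so unlock can skip the wake syscall when nobody
is waiting (see Drepper's "Futexes Are Tricky"):

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdatomic.h>

    static void
    futex_lock(atomic_int *lock)
    {
        int expected = 0;

        /* fast path: an uncontended acquire never enters the kernel */
        while (!atomic_compare_exchange_strong(lock, &expected, 1))
        {
            /* slow path: sleep in the kernel while *lock is still 1 */
            syscall(SYS_futex, lock, FUTEX_WAIT, 1, NULL, NULL, 0);
            expected = 0;
        }
    }

    static void
    futex_unlock(atomic_int *lock)
    {
        atomic_store(lock, 0);
        /* the unconditional wake syscall here is part of why futexes can
         * lose to userspace spinning for very short critical sections */
        syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
    }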




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Tom Lane
Jan Wieck  writes:
> The attached patch demonstrates that less aggressive spinning and (much) 
> more often delaying improves the performance "on this type of machine". 

Hm.  One thing worth asking is why the code didn't converge to a good
value of spins_per_delay without help.  The value should drop every time
we had to delay, so under heavy contention it ought to end up small
anyhow, no?  Maybe we just need to alter the feedback loop a bit.

(The comment about uniprocessors vs multiprocessors seems pretty wacko in
this context, but at least the sign of the feedback term seems correct.)

regards, tom lane




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Jan Wieck

On 06/10/2015 10:07 AM, Nils Goroll wrote:

On larger Linux machines, we have been running with spin locks replaced by
generic posix mutexes for years now. I personally haven't looked at the code for
ages, but we maintain a patch which pretty much does the same thing still:

Ref: http://www.postgresql.org/message-id/4fede0bf.7080...@schokola.de

I understand that there are systems out there which have less efficient posix
mutex implementations than Linux (which uses futexes), but I think it would
still be worth considering to do away with the roll-your-own spinlocks on
systems whose posix mutexes are known to behave.


I have played with test code that isolates a stripped down version of 
s_lock() and uses it with multiple threads. I then implemented multiple 
different versions of that s_lock(). The results with 200 concurrent 
threads are that using a __sync_val_compare_and_swap() to acquire the 
lock and then falling back to a futex() is limited to about 500,000 
locks/second. Spinning for 10 times and then doing a usleep(1000) (one 
millisecond) gives me 25 million locks/second.
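
The fast variant is essentially this (an illustrative reconstruction, not
the actual test harness):

    #include <unistd.h>

    static void
    spin_then_sleep_lock(volatile int *lock)
    {
        int spins = 0;

        /* __sync_val_compare_and_swap() returns the previous value,
         * so 0 means we acquired the lock */
        while (__sync_val_compare_and_swap(lock, 0, 1) != 0)
        {
            if (++spins >= 10)
            {
                usleep(1000);   /* back off 1ms instead of hammering
                                 * the cache line across sockets */
                spins = 0;
            }
        }
    }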


Note that the __sync_val_compare_and_swap() GCC built-in seems identical 
in performance to the assembler xchgb operation used by PostgreSQL 
today on x86_64.



Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info




Re: [HACKERS] reaper should restart archiver even on standby

2015-06-10 Thread Alvaro Herrera
Fujii Masao wrote:
> On Tue, Jun 9, 2015 at 5:21 AM, Alvaro Herrera  
> wrote:
> > Fujii Masao wrote:

> > Can't we create
> > some common function that would be called both here and on ServerLoop?
> 
> Agreed. So, what about the attached patch?

No attachment ...

> > We also have sigusr1_handler that starts an archiver -- why does that
> > one use different conditions?
> 
> Because that code path can be reached only during recovery.
> So XLogArchivingActive() which indicates whether archiver is
> allowed to start during normal processing doesn't need to be
> checked there.

Makes sense.

> OTOH, in the other places where archiver is started up,
> we can reach there during not only recovery but also normal processing.
> So the conditions that we need to check are different.

I think it would be simpler to centralize knowledge in a single
function, and have that function take an argument indicating whether
we're in recovery or normal processing, instead of spreading it to every
place that can possibly start the archiver.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Nils Goroll

On 10/06/15 16:05, Andres Freund wrote:
> it'll nearly always be beneficial to spin

Trouble is that postgres cannot know if the process holding the lock is
actually running, so if it isn't, all we're doing is burning cycles and
making the problem worse.

Contrary to that, the kernel does know, so for a (f|m)utex which fails to
acquire immediately and thus needs to syscall, the kernel has the option to spin
only if the lock holder is running (the "adaptive" mutex).
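
With glibc that is opt-in, along these lines (PTHREAD_MUTEX_ADAPTIVE_NP
is a non-portable glibc extension; sketch only):

    #define _GNU_SOURCE
    #include <pthread.h>

    static void
    init_adaptive_mutex(pthread_mutex_t *lock)
    {
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /* spin briefly in userspace before sleeping in the kernel */
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(lock, &attr);
        pthread_mutexattr_destroy(&attr);
    }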

Nils




Re: [HACKERS] "could not adopt C locale" failure at startup on Windows

2015-06-10 Thread Tom Lane
Noah Misch  writes:
> On Tue, Jun 09, 2015 at 12:24:02PM -0400, Tom Lane wrote:
>> Yeah, my first instinct was to blame ca325941 as well, but I don't think
>> any of that code executes during init_locale().  Also,
>> http://www.postgresql.org/message-id/20150326040321.2492.24...@wrigleys.postgresql.org
>> says it's been seen in 9.4.1.

> The return value check and error message were new in 9.4.1.  I suspect the
> underlying problem has been present far longer, undetected.

Oooh ... I'd forgotten that 6fdba8ceb was so recent.  Agreed, what we are
seeing is probably a situation that's been there for a long time, but we
were ignoring the failure up to now (and, evidently, it wasn't really
creating any problems).

> I can reproduce this with "initdb --locale=C",
> postgresql-9.4.3-1-windows-binaries.zip (32-bit), Windows 7 x64, and the
> Windows ANSI code page set to CP936.  (Choose "Chinese (Simplified, PRC)" in
> Control Panel -> Region and Language -> Administrative -> Language for
> non-Unicode programs.)  It is neither necessary nor sufficient to change
> Control Panel -> Region and Language -> Formats -> Format.  Binaries from
> postgresql-9.4.3-1-windows-x64-binaries.zip do not exhibit the problem.  Note
> that CP936 is a PG_ENCODING_IS_CLIENT_ONLY() encoding.

Hm.  I could understand getting encoding difficulties in that environment,
but it's hard to see why they'd manifest like this.  Can you trace through
pg_perm_setlocale and figure out why it's reporting failure?

regards, tom lane




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Nils Goroll
On larger Linux machines, we have been running with spin locks replaced by
generic posix mutexes for years now. I personally haven't looked at the code for
ages, but we maintain a patch which pretty much does the same thing still:

Ref: http://www.postgresql.org/message-id/4fede0bf.7080...@schokola.de

I understand that there are systems out there which have less efficient posix
mutex implementations than Linux (which uses futexes), but I think it would
still be worth considering to do away with the roll-your-own spinlocks on
systems whose posix mutexes are known to behave.

Nils





Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Andres Freund
Hi,

On 2015-06-10 09:54:00 -0400, Jan Wieck wrote:
> model name  : Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz

> numactl --hardware shows the distance to the attached memory as 10, the
> distance to every other node as 21. I interpret that as the machine having
> one NUMA bus with all cpu packages attached to that, rather than individual
> connections from cpu to cpu or something different.

Generally that doesn't say very much - IIRC the distances are defined by
the BIOS.

> What led me into that spinlock area was the fact that a wall clock based
> systemtap FlameGraph showed a large portion of the time spent in
> BufferPin() and BufferUnpin().

I've seen that as a bottleneck in the past as well. My plan to fix that
is to "simply" make buffer pinning lockless for the majority of cases. I
don't have access to hardware to test that at higher node counts atm
though.

My guess is that the pins are on the btree root pages. But it'd be good
to confirm that.

> >Maybe we need to adjust the amount of spinning, but to me such drastic
> >differences are a hint that we should tackle the actual contention
> >point. Often a spinlock for something regularly heavily contended can be
> >worse than a queued lock.
> 
> I have the impression that the code assumes that there is little penalty for
> accessing the shared byte in a tight loop from any number of cores in
> parallel. That apparently is true for some architectures and core counts,
> but no longer holds for these machines with many sockets.

It's just generally a tradeoff. It's beneficial to spin longer if
there's only mild amounts of contention. If the likelihood of getting
the spinlock soon is high (i.e. existing, but low contention), it'll
nearly always be beneficial to spin. If the likelihood is low, it'll be
mostly beneficial to sleep.  The latter is especially true if a machine
is sufficiently overcommitted that it's likely that it'll sleep while
holding a spinlock. The danger of sleeping while holding a spinlock,
without targeted wakeup, is why spinlocks in userspace aren't a really
good idea.

My bet is that if you'd measure using different number iterations for
different spinlocks you'd find some where the higher number of
iterations is rather beneficial as well.

Greetings,

Andres Freund




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Jan Wieck

On 06/10/2015 09:28 AM, Andres Freund wrote:

On 2015-06-10 09:18:56 -0400, Jan Wieck wrote:

On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a
pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then
takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It
never recovers from there.


85k? Phew, that's pretty bad. What exact type of CPU is this? Which
pgbench scale? Did you use -M prepared?


model name  : Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz

numactl --hardware shows the distance to the attached memory as 10, the 
distance to every other node as 21. I interpret that as the machine 
having one NUMA bus with all cpu packages attached to that, rather than 
individual connections from cpu to cpu or something different.


pgbench scale=300, -Msimple.



Could you share a call graph perf profile?


I do not have them handy at the moment and the machine is in use for 
something else until tomorrow. I will forward perf and systemtap based 
graphs ASAP.


What led me into that spinlock area was the fact that a wall clock based 
systemtap FlameGraph showed a large portion of the time spent in

BufferPin() and BufferUnpin().




The attached patch demonstrates that less aggressive spinning and
(much) more often delaying improves the performance "on this type of
machine". The 8 socket machine in question scales to over 350,000 TPS.


Even that seems quite low. I've gotten over 500k TPS on a four socket
x86 machine, and about 700k on a 8 socket x86 machine.


There is more wrong with the machine in question than just that. But for 
the moment I am satisfied with having a machine where I can reproduce 
this phenomenon in what appears to be a worst case.




Maybe we need to adjust the amount of spinning, but to me such drastic
differences are a hint that we should tackle the actual contention
point. Often a spinlock for something regularly heavily contended can be
worse than a queued lock.


I have the impression that the code assumes that there is little penalty 
for accessing the shared byte in a tight loop from any number of cores 
in parallel. That apparently is true for some architectures and core 
counts, but no longer holds for these machines with many sockets.



Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info




Re: [HACKERS] [COMMITTERS] pgsql: Add pg_audit, an auditing extension

2015-06-10 Thread Noah Misch
On Tue, Jun 09, 2015 at 03:54:59PM -0400, David Steele wrote:
> I've certainly had quite the experience as a first-time contributor
> working on this patch.  Perhaps I bit off more than I should have and I
> definitely managed to ruffle a few feathers along the way.  I learned a
> lot about how the community works, both the good and the bad.  Fear not,
> though, I'm not so easily discouraged and you'll definitely be hearing
> more from me.

Glad to hear it.

> The stated purpose of contrib is: "include porting tools, analysis
> utilities, and plug-in features that are not part of the core PostgreSQL
> system, mainly because they address a limited audience or are too
> experimental to be part of the main source tree. This does not preclude
> their usefulness."
> 
> Perhaps we should consider modifying that language, because from my
> perspective pg_audit fit the description perfectly.

"What is contrib?" attracts enduring controversy; see recent thread "RFC:
Remove contrib entirely" for the latest episode.  However that discussion
concludes, that documentation passage is not too helpful as a guide to
predicting contrib patch reception.  (Most recent contrib additions had an
obvious analogy to an existing module, sidestepping the question.)




Re: [HACKERS] The Future of Aggregation

2015-06-10 Thread Kevin Grittner
David Rowley  wrote:
> On 10 June 2015 at 02:52, Kevin Grittner  wrote:
>> David Rowley  wrote:
>>> The idea I discussed in the link in item 5 above gets around this
>>> problem, but it's a perhaps more surprise filled implementation
>>> as it will mean "select avg(x),sum(x),count(x) from t" is
>>> actually faster than "select sum(x),count(x) from t" as the agg
>>> state for avg() will satisfy sum and count too.
>>
>> I'm skeptical that it will be noticeably faster.  It's easy to see
>> why this optimization will make a query *with all three* faster,
>> but I would not expect the process of accumulating the sum and
>> count to be about the same speed whether performed by one
>> transition function or two.  Of course I could be persuaded by a
>> benchmark showing otherwise.

Of course, after reading Tom's post and digging into what
aggregates share a transition function, I was already prepared to
eat my words above.  Since the sum() aggregate is using the
xxx_avg_accum transition function, it is clearly doing the work of
accumulating the count already, so it's clear that the above can
be a win.
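
It's easy to confirm from the catalogs which aggregates share a
transition function; e.g. for numeric, on 9.4 or later something like
this should show avg() and sum() both using numeric_avg_accum:

    SELECT a.aggfnoid::regprocedure AS aggregate,
           a.aggtransfn, a.aggfinalfn
      FROM pg_aggregate a
     WHERE a.aggtransfn = 'numeric_avg_accum'::regproc;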

> Assuming that if we reuse the avg(x) state for count(x) and
> sum(x) then it will perform almost exactly like a query
> containing just avg(x), the only additional overhead is the call
> to the final functions per group, so in the following case that's
> likely immeasurable:
>
> /* setup */ create table millionrowtable as select
> generate_series(1,100)::numeric as x;
> /* test 1 */ SELECT sum(x) / count(x)  from millionrowtable;
> /* test 2 */ SELECT avg(x) from millionrowtable;
>
> Test 1:
> 274.979 ms
> 272.104 ms
> 269.915 ms
>
> Test 2:
> 229.619 ms
> 220.703 ms
> 234.743 ms
>

> (About 19% slower)

Of course, with Tom's approach you would see the benefit; the two
statements should run at about the same speed.

I am a little curious what sort of machine you're running on,
because my i7 is much slower.  I ran a few other tests with your
table for perspective.

To get the raw time to just pass the tuples:

SELECT from millionrowtable where xmin = '0';
Time: 125.340 ms
Time: 124.443 ms
Time: 115.629 ms

Just the count(*) of those rows didn't boost the time much:

SELECT count(*) from millionrowtable;
Time: 132.128 ms
Time: 128.036 ms
Time: 125.400 ms

The NULL check added by specifying count(x) boosted it more:

SELECT count(x) from millionrowtable;
Time: 165.858 ms
Time: 163.872 ms
Time: 165.448 ms

A NULL check plus numeric addition gets expensive:

SELECT sum(x) from millionrowtable;
Time: 366.879 ms
Time: 364.503 ms
Time: 365.418 ms

Since sum() and avg() use the same transition function, I was
surprised to see a difference here:

SELECT avg(x) from millionrowtable;
Time: 374.339 ms
Time: 372.294 ms
Time: 366.933 ms

Here's the statement you are talking about optimizing:

SELECT sum(x), count(x)  from millionrowtable;
Time: 441.331 ms
Time: 442.501 ms
Time: 436.930 ms

To confirm that projecting the extra column compared to avg() was
not significant:

SELECT sum(x) / count(x)  from millionrowtable;
Time: 442.404 ms
Time: 436.241 ms
Time: 442.381 ms

So this can reasonably be compared to the avg(x) time above.

On my machine this optimization could be expected to shave off
about 16% of current run time.

One question that arose in my mind running this was whether we might
be able to combine sum(x) with count(*) if x is NOT NULL, even
though the arguments don't match.  It might not be worth the
gymnastics of recognizing the special case, and I certainly
wouldn't recommend looking at that optimization in a first pass;
but it might be worth jotting down on a list somewhere.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Bruce Momjian
On Wed, Jun 10, 2015 at 09:18:56AM -0400, Jan Wieck wrote:
> The attached patch demonstrates that less aggressive spinning and
> (much) more often delaying improves the performance "on this type of
> machine". The 8 socket machine in question scales to over 350,000
> TPS.
> 
> The patch is meant to demonstrate this effect only. It has a
> negative performance impact on smaller machines and client counts <
> #cores, so the real solution will probably look much different. But
> I thought it would be good to share this and start the discussion
> about reevaluating the spinlock code before PGCon.

Wow, you are in that code!  We kind of guessed on some of those
constants many years ago, and never revisited it.  It would be nice to
get a better heuristic for those, but I am concerned it would require
the user to specify the number of CPU cores.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Andres Freund
On 2015-06-10 09:18:56 -0400, Jan Wieck wrote:
> On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, a
> pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput then
> takes a very sharp dive and reaches around 20,000 TPS at 120 clients. It
> never recovers from there.

85k? Phew, that's pretty bad. What exact type of CPU is this? Which
pgbench scale? Did you use -M prepared?

Could you share a call graph perf profile?

> The attached patch demonstrates that less aggressive spinning and
> (much) more often delaying improves the performance "on this type of
> machine". The 8 socket machine in question scales to over 350,000 TPS.

Even that seems quite low. I've gotten over 500k TPS on a four socket
x86 machine, and about 700k on a 8 socket x86 machine.

Maybe we need to adjust the amount of spinning, but to me such drastic
differences are a hint that we should tackle the actual contention
point. Often a spinlock for something regularly heavily contended can be
worse than a queued lock.

Greetings,

Andres Freund




Re: [HACKERS] Why no jsonb_exists_path()?

2015-06-10 Thread Alexander Korotkov
Josh,

On Tue, Jun 9, 2015 at 9:16 PM, Josh Berkus  wrote:

> Dmitry, Alexander:
>
> I'm noticing a feature gap for JSONB operators; we have no way to do this:
>
> jsonb_col ? ARRAY['key1','key2','key3']
>

What documents do you expect this operator to match?
Such syntax can be interpreted with very different semantics: checking
keys only at the top level, at any level, as a sequence starting from the
top level, etc.

'{"key1": "value1", "key2": "value2", "key3": "value3"}'
'{"key1":{ "key2": {"key3: "value"}}}'
'{"key1": "value1"}
'[{"key1": "value1"}, {"key2": "value2"}, {"key3": "value3"}]'

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Jan Wieck

Hi,

I think I may have found one of the problems PostgreSQL has on machines 
with many NUMA nodes. I am not yet sure what exactly happens on the NUMA 
bus, but there seems to be a tipping point at which the spinlock 
concurrency wreaks havoc and the performance of the database collapses.


On a machine with 8 sockets, 64 cores, Hyperthreaded 128 threads total, 
a pgbench -S peaks with 50-60 clients around 85,000 TPS. The throughput 
then takes a very sharp dive and reaches around 20,000 TPS at 120 
clients. It never recovers from there.


The attached patch demonstrates that less aggressive spinning and (much) 
more often delaying improves the performance "on this type of machine". 
The 8 socket machine in question scales to over 350,000 TPS.


The patch is meant to demonstrate this effect only. It has a negative 
performance impact on smaller machines and client counts < #cores, so 
the real solution will probably look much different. But I thought it 
would be good to share this and start the discussion about reevaluating 
the spinlock code before PGCon.



Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 0afcba1..edaa8fd 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -83,8 +83,8 @@ s_lock(volatile slock_t *lock, const char *file, int line)
 	 * the probability of unintended failure) than to fix the total time
 	 * spent.
 	 */
-#define MIN_SPINS_PER_DELAY 10
-#define MAX_SPINS_PER_DELAY 1000
+#define MIN_SPINS_PER_DELAY 4
+#define MAX_SPINS_PER_DELAY 10
 #define NUM_DELAYS			1000
 #define MIN_DELAY_USEC		1000L
 #define MAX_DELAY_USEC		100L
@@ -144,7 +144,7 @@ s_lock(volatile slock_t *lock, const char *file, int line)
 	{
 		/* we never had to delay */
 		if (spins_per_delay < MAX_SPINS_PER_DELAY)
-			spins_per_delay = Min(spins_per_delay + 100, MAX_SPINS_PER_DELAY);
+			spins_per_delay = Min(spins_per_delay + 2, MAX_SPINS_PER_DELAY);
 	}
 	else
 	{
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index c63cf54..42eb353 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -973,7 +973,7 @@ extern slock_t dummy_spinlock;
 extern int s_lock(volatile slock_t *lock, const char *file, int line);
 
 /* Support for dynamic adjustment of spins_per_delay */
-#define DEFAULT_SPINS_PER_DELAY  100
+#define DEFAULT_SPINS_PER_DELAY  10
 
 extern void set_spins_per_delay(int shared_spins_per_delay);
 extern int	update_spins_per_delay(int shared_spins_per_delay);



Re: [HACKERS] pg_archivecleanup bug (invalid filename input)

2015-06-10 Thread Michael Paquier
On Wed, Jun 10, 2015 at 12:29 PM, Joshua D. Drake  
wrote:
>
> On 06/09/2015 05:54 PM, Michael Paquier wrote:
>
>> Looking at the documentation what is expected is not a path to a
>> segment file, but only a segment file name:
>> http://www.postgresql.org/docs/devel/static/pgarchivecleanup.html
>> So the current behavior is correct, it is actually what
>> SetWALFileNameForCleanup@pg_archivecleanup.c checks in the code.
>
>
> O.k so I would say: Should we adjust this behavior? It seems to me that
> it should accept a path.

Perhaps others have a different opinion, but I don't see much point.
First, archive_cleanup_command uses %r with only a segment file name.
Then, in the case where pg_archivecleanup is called within a script and
the segment name is retrieved from a given path, the script is going to
scan that directory to fetch the segment file name anyway, which defeats
the purpose of accepting a path.
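
For reference, the documented usage points pg_archivecleanup at the
archive directory while %r expands to a bare segment file name, e.g. in
recovery.conf:

    archive_cleanup_command = 'pg_archivecleanup /mnt/standby/archive %r'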
-- 
Michael




Re: [HACKERS] [idea] table partition + hash join

2015-06-10 Thread Amit Langote
On Wed, Jun 10, 2015 at 8:33 PM, Kouhei Kaigai  wrote:
>> On 2015-06-10 PM 01:42, Kouhei Kaigai wrote:
>> >
>> > Let's assume a table which is partitioned to four portions,
>> > and individual child relations have constraint by hash-value
>> > of its ID field.
>> >
>> >   tbl_parent
>> >+ tbl_child_0 ... CHECK(hash_func(id) % 4 = 0)
>> >+ tbl_child_1 ... CHECK(hash_func(id) % 4 = 1)
>> >+ tbl_child_2 ... CHECK(hash_func(id) % 4 = 2)
>> >+ tbl_child_3 ... CHECK(hash_func(id) % 4 = 3)
>> >
>> > If someone tried to join another relation with tbl_parent
>> > using an equivalence condition, like X = tbl_parent.ID, we
>> > know that inner tuples that do not satisfy the condition
>> >   hash_func(X) % 4 = 0
>> > can never be joined to the tuples in tbl_child_0.
>> > So, we can skip loading those tuples into the inner hash table
>> > in the first place, which potentially allows us to split the
>> > inner hash-table.
>> >
>>
>> Unless I am missing something (of your idea or how hash join works), it seems
>> that there is no way to apply the filter qual (hash_func(X) % 4 = 0, etc.) at
>> the Hash node. So, that qual would not be able to limit what gets into the
>> inner hash table, right? Perhaps the qual needs to be pushed all the way down
>> to the Hash's underlying scan if that makes sense.
>>
> Really? It seems to me the spot just below ExecProcNode() in MultiExecHash()
> is the right place to filter out obviously unmatched tuples.
> As long as we can construct a qualifier based on the CHECK() constraint
> of the other side, ExecQual() can decide whether a fetched tuple
> should be loaded into the inner hash table or not.
>

Ah, that's an idea. I was thinking of an unmodified MultiExecHash().

Thanks,
Amit




[HACKERS] Auto-vacuum is not running in 9.1.12

2015-06-10 Thread Prakash Itnal
Hello,

Recently we encountered an issue where the disk space was continuously
increasing towards 100%. A manual vacuum then freed the disk space, but
it is increasing again. When we dug deeper, we found that auto-vacuuming
was not running, or that it was stuck or hung.

Version: 9.1.12
Auto-vacuum is enabled: see the configuration details in the attached file.
The auto-vacuum daemon is running.
The stats show that auto-vacuum last ran more than a month ago.
There are no error logs from the database.

The attached file has all these details. If any other details are needed,
please let me know and I will try to collect and share them. Please help
us analyze why auto-vacuum suddenly stopped.


-- 
Cheers,
Prakash
1) Auto-vacuum configuration:
=
 autovacuum                      | on   | Starts the autovacuum subprocess.
 autovacuum_analyze_scale_factor | 0.1  | Number of tuple inserts, updates or deletes prior to analyze as a fraction of reltuples.
 autovacuum_analyze_threshold    | 50   | Minimum number of tuple inserts, updates or deletes prior to analyze.
 autovacuum_freeze_max_age       | 2    | Age at which to autovacuum a table to prevent transaction ID wraparound.
 autovacuum_max_workers          | 3    | Sets the maximum number of simultaneously running autovacuum worker processes.
 autovacuum_naptime              | 1min | Time to sleep between autovacuum runs.
 autovacuum_vacuum_cost_delay    | 20ms | Vacuum cost delay in milliseconds, for autovacuum.
 autovacuum_vacuum_cost_limit    | -1   | Vacuum cost amount available before napping, for autovacuum.
 autovacuum_vacuum_scale_factor  | 0.2  | Number of tuple updates or deletes prior to vacuum as a fraction of reltuples.
 autovacuum_vacuum_threshold     | 50   | Minimum number of tuple updates or deletes prior to vacuum.
 log_autovacuum_min_duration     | -1   | Sets the minimum execution time above which autovacuum actions will be logged.
=

2) Auto-vacuum daemon is running:
ps -eg | grep autovacuum
5432  6504  6001  0 Apr14 ?  00:02:45 postgres: autovacuum launcher process

3) Table sizes:
SELECT nspname || '.' || relname AS "relation",
pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size",
pg_size_pretty(pg_total_relation_size(C.oid) - pg_relation_size(C.oid)) as 
"External Size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
AND C.relkind <> 'i'
AND nspname !~ '^pg_toast'
  ORDER BY pg_total_relation_size(C.oid) DESC
  LIMIT 20;

   relation   | total_size | External Size 
 -------------+------------+---------------
  public.abcd | 1108 MB    | 592 kB
  public.xyz  | 5904 kB    | 56 kB

4) Last run auto-vacuum:
SELECT now(), schemaname, relname, last_vacuum, last_autovacuum, vacuum_count, 
autovacuum_count FROM pg_stat_user_tables;

              now              | schemaname | relname | last_vacuum |        last_autovacuum        | vacuum_count | autovacuum_count 
-------------------------------+------------+---------+-------------+-------------------------------+--------------+------------------
 2015-06-10 01:03:03.574212+02 | public     | abcd    |             | 2015-04-18 00:52:35.008874+02 |            0 |                2
 2015-06-10 01:03:03.574212+02 | public     | xyz     |             | 2015-05-02 06:01:35.220651+02 |            0 |               20

NOTE: I changed the relname for the above two tables due to confidentiality.
 


Re: [HACKERS] [idea] table partition + hash join

2015-06-10 Thread Kouhei Kaigai
> On 2015-06-10 PM 01:42, Kouhei Kaigai wrote:
> >
> > Let's assume a table which is partitioned to four portions,
> > and individual child relations have constraint by hash-value
> > of its ID field.
> >
> >   tbl_parent
> >+ tbl_child_0 ... CHECK(hash_func(id) % 4 = 0)
> >+ tbl_child_1 ... CHECK(hash_func(id) % 4 = 1)
> >+ tbl_child_2 ... CHECK(hash_func(id) % 4 = 2)
> >+ tbl_child_3 ... CHECK(hash_func(id) % 4 = 3)
> >
> > If someone tried to join another relation with tbl_parent
> > using an equivalence condition, like X = tbl_parent.ID, we
> > know that inner tuples that do not satisfy the condition
> >   hash_func(X) % 4 = 0
> > can never be joined to the tuples in tbl_child_0.
> > So, we can skip loading those tuples into the inner hash table
> > in the first place, which potentially allows us to split the
> > inner hash-table.
> >
> 
> Unless I am missing something (of your idea or how hash join works), it seems
> that there is no way to apply the filter qual (hash_func(X) % 4 = 0, etc.) at
> the Hash node. So, that qual would not be able to limit what gets into the
> inner hash table, right? Perhaps the qual needs to be pushed all the way down
> to the Hash's underlying scan if that makes sense.
>
Really? It seems to me the spot just below ExecProcNode() in MultiExecHash()
is the right place to filter out obviously unmatched tuples.
As long as we can construct a qualifier based on the CHECK() constraint
of the other side, ExecQual() can decide whether a fetched tuple
should be loaded into the inner hash table or not.
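
Just as a sketch (hashfilterqual is a hypothetical new field; the
surrounding loop is as in today's MultiExecHash()):

    for (;;)
    {
        slot = ExecProcNode(outerNode);
        if (TupIsNull(slot))
            break;
        econtext->ecxt_innertuple = slot;

        /* skip tuples the partition's CHECK() proves can never match */
        if (node->hashfilterqual &&
            !ExecQual(node->hashfilterqual, econtext, false))
            continue;

        /* ... existing hashing and ExecHashTableInsert() ... */
    }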

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 





[HACKERS] comment doesn't match code

2015-06-10 Thread Robert Haas
/*
 * ALTER TABLE INHERIT
 *
 * Add a parent to the child's parents. This verifies that all the columns and
 * check constraints of the parent appear in the child and that they have the
 * same data types and expressions.
 */
static void
ATPrepAddInherit(Relation child_rel)
{
if (child_rel->rd_rel->reloftype)
ereport(ERROR,
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 errmsg("cannot change inheritance of typed table")));
}

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Missing XLOG_DEBUG check in AdvanceXLInsertBuffer()?

2015-06-10 Thread Michael Paquier
On Wed, Jun 10, 2015 at 8:02 PM, Andres Freund  wrote:
> Does somebody mind me backpatching the missing XLOG_DEBUG &&?

ISTM that it is a good idea to have it in REL9_4_STABLE as well.
Regards,
-- 
Michael




[HACKERS] Missing XLOG_DEBUG check in AdvanceXLInsertBuffer()?

2015-06-10 Thread Andres Freund
Hi,

When compiling with WAL_DEBUG defined, but wal_debug set to off, there's
a lot of DEBUG1 spew like
DEBUG:  initialized 1 pages, upto 40/3977E000
DEBUG:  initialized 9 pages, upto 40/3979
DEBUG:  initialized 1 pages, upto 40/39792000
DEBUG:  initialized 1 pages, upto 40/39794000
DEBUG:  initialized 1 pages, upto 40/39796000
DEBUG:  initialized 1 pages, upto 40/39798000
I find that quite annoying. That specific elog() has been there since
9a20a9b21 in 9.4.

Does somebody mind me backpatching the missing XLOG_DEBUG &&?

#ifdef WAL_DEBUG
if (XLOG_DEBUG && npages > 0)
{
elog(DEBUG1, "initialized %d pages, upto %X/%X",
 npages, (uint32) (NewPageEndPtr >> 32), (uint32) 
NewPageEndPtr);
}
#endif

Greetings,

Andres Freund




Re: [HACKERS] [idea] table partition + hash join

2015-06-10 Thread Amit Langote
On 2015-06-10 PM 05:53, Atri Sharma wrote:
> On Wed, Jun 10, 2015 at 2:16 PM, Amit Langote  wrote:
> 
>>
>> Perhaps the qual needs to be pushed all the way down
>> to the Hash's underlying scan if that makes sense.
>>
> 
> And that is a Pandora's box of troubles IMHO unless done in a very careful
> manner.
> 

More appropriately, that's a predicate (I should not have called it a qual)
derived from partitioning-specific optimization knowledge.

Thanks,
Amit





Re: [HACKERS] [idea] table partition + hash join

2015-06-10 Thread Atri Sharma
On Wed, Jun 10, 2015 at 2:16 PM, Amit Langote  wrote:

>
> Perhaps the qual needs to be pushed all the way down
> to the Hash's underlying scan if that makes sense.
>


And that is a Pandora's box of troubles IMHO unless done in a very careful
manner.


Re: [HACKERS] [idea] table partition + hash join

2015-06-10 Thread Amit Langote
On 2015-06-10 PM 01:42, Kouhei Kaigai wrote:
> 
> Let's assume a table which is partitioned to four portions,
> and individual child relations have constraint by hash-value
> of its ID field.
> 
>   tbl_parent
>+ tbl_child_0 ... CHECK(hash_func(id) % 4 = 0)
>+ tbl_child_1 ... CHECK(hash_func(id) % 4 = 1)
>+ tbl_child_2 ... CHECK(hash_func(id) % 4 = 2)
>+ tbl_child_3 ... CHECK(hash_func(id) % 4 = 3)
> 
> If someone tried to join another relation with tbl_parent
> using an equivalence condition, like X = tbl_parent.ID, we
> know that inner tuples that do not satisfy the condition
>   hash_func(X) % 4 = 0
> can never be joined to the tuples in tbl_child_0.
> So, we can skip loading those tuples into the inner hash table
> in the first place, which potentially allows us to split the
> inner hash-table.
> 

Unless I am missing something (of your idea or how hash join works), it seems
that there is no way to apply the filter qual (hash_func(X) % 4 = 0, etc.) at
the Hash node. So, that qual would not be able to limit what gets into the
inner hash table, right? Perhaps the qual needs to be pushed all the way down
to the Hash's underlying scan if that makes sense.
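
For concreteness, the partitioning scheme being discussed can be built
today with inheritance; hashint4() is used below as a stand-in for the
hypothetical hash_func(), with & 3 instead of % 4 so negative hash values
cannot produce negative remainders:

    CREATE TABLE tbl_parent (id int);
    CREATE TABLE tbl_child_0 (CHECK ((hashint4(id) & 3) = 0)) INHERITS (tbl_parent);
    CREATE TABLE tbl_child_1 (CHECK ((hashint4(id) & 3) = 1)) INHERITS (tbl_parent);
    CREATE TABLE tbl_child_2 (CHECK ((hashint4(id) & 3) = 2)) INHERITS (tbl_parent);
    CREATE TABLE tbl_child_3 (CHECK ((hashint4(id) & 3) = 3)) INHERITS (tbl_parent);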

Thanks,
Amit





Re: [HACKERS] Restore-reliability mode

2015-06-10 Thread Andres Freund
On 2015-06-10 01:57:22 -0400, Noah Misch wrote:
> I think I agree with everything after your first sentence.  I liked your
> specific proposal to split StartupXLOG(), but making broad-appeal
> restructuring proposals is hard.  I doubt we would get good results by casting
> a wide net for restructuring ideas.

I don't mean that we should actively strive to find as many things to
refactor as possible (yes, I over-emphasized a bit), but that we
shouldn't skip refactoring when we notice something structurally bad just
because it's been that way and we don't want to touch something
"working". That argument has e.g. been made repeatedly about the contents
of xlog.c.

My feeling is that we're reaching the stage where a significant number
of bugs are added because code is structured in a "needlessly"
complicated and/or repetitive way. And better testing can only catch so
much - often enough somebody has to think of all the possible corner
cases.

> Automated testing has a lower barrier to
> entry and is far less liable to make things worse instead of better.  I can
> hope for good results from a TestSuiteFest, but not from a RestructureFest.
> That said, if folks initiate compelling restructure proposals, we should be
> willing to risk bugs from them like we risk bugs to acquire new
> features.

Sure, increasing testing and reviews are good independently. And
especially testing actually makes refactoring much more realistic.

Greetings,

Andres Freund




Re: [HACKERS] Checkpoints vs restartpoints

2015-06-10 Thread Andres Freund
On 2015-06-10 11:20:19 +1200, Thomas Munro wrote:
> I was wondering about this in the context of the recent multixact
> work, since such configurations could leave you with different SLRU
> files on disk which in some versions might change the behaviour in
> interesting ways.

Note that triggering a restartpoint every time a checkpoint is replayed
wouldn't realistically fix this. Restartpoints are performed in the
background (by the checkpointer), not in the startup process itself. Not
doing it that way would be prohibitive performance-wise, because each
checkpoint would stall replication progress for seconds to tens of
minutes.

- Andres




Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-06-10 Thread Amit Kapila
On Tue, Jun 9, 2015 at 8:37 PM, Andrew Dunstan  wrote:

>
> On 06/08/2015 11:19 PM, Amit Kapila wrote:
>
>>
>> I think Robert and Alvaro also seems to be inclined towards throwing
>> error for such a case, so let us do that way, but one small point is that
>> don't you think that similar code in destroy_tablespace_directories()
>> under
>> label "remove_symlink:" should use similar logic?
>>
>
>
> Yes, probably.
>

Okay, I have updated the destroy_tablespace_directories() code as well
in the attached patch. I tried to modify remove_tablespace_symlink() so
that it could be called from destroy_tablespace_directories(), but that
made the code more complex, especially because
destroy_tablespace_directories() treats ENOENT as a warning rather than
ignoring it.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


remove_only_symlinks_during_recovery_v4.patch
Description: Binary data


