Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps

2014-12-29 Thread Heikki Linnakangas

On 12/19/2014 11:30 AM, Petr Jelinek wrote:

as promised, I am sending the code-comment patch that explains the use of
the commit timestamp + node id C API for conflict resolution; comments
welcome.
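
For context, the SQL-visible part of the commit timestamp facility under
discussion looks roughly like this (a sketch assuming the functions added by
the committed patch; the node id piece is only reachable through the C API
discussed here, and some_table is a placeholder):

    -- postgresql.conf: track_commit_timestamp = on
    SELECT pg_xact_commit_timestamp(xmin) FROM some_table LIMIT 1;
    SELECT * FROM pg_last_committed_xact();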


That's a little bit better, but I have to say I'm still not impressed. 
There are so many implicit assumptions in the system. The first 
assumption is that a 32-bit node id is sufficient. I'm sure it is for 
many replication systems, but might not be for all. Then there's the 
assumption that the node id should be sticky, i.e. it's set for the 
whole session. Again, I'm sure that's useful for many systems, but I 
could just as easily imagine that you'd want it to reset after every commit.


To be honest, I think this patch should be reverted. Instead, we should 
design a system where extensions can define their own SLRUs to store 
additional per-transaction information. That way, each extension can 
have as much space per transaction as needed, and support functions that 
make most sense with the information. Commit timestamp tracking would be 
one such extension, and for this node ID stuff, you could have another 
one (or include it in the replication extension).


- Heikki




Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps

2014-12-29 Thread Andres Freund
On 2014-12-29 12:06:07 +0200, Heikki Linnakangas wrote:
 That's a little bit better, but I have to say I'm still not impressed. There
 are so many implicit assumptions in the system. The first assumption is that
 a 32-bit node id is sufficient.

Seriously? Are we going to build facilities for replication systems with
that many nodes? It seems absolutely unrealistic that a) somebody does
this b) that we'll blindly meet the demands of such a super hypothetical
scenario.

 Then there's the assumption that the node id should be sticky,
 i.e. it's set for the whole session. Again, I'm sure that's useful for
 many systems, but I could just as easily imagine that you'd want it to
 reset after every commit.

It's trivial to add that/reset it manually. So what?

 To be honest, I think this patch should be reverted. Instead, we should
 design a system where extensions can define their own SLRUs to store
 additional per-transaction information. That way, each extension can have as
 much space per transaction as needed, and support functions that make most
 sense with the information. Commit timestamp tracking would be one such
 extension, and for this node ID stuff, you could have another one (or
 include it in the replication extension).

If somebody wants that they should develop it. But given that we, based
on previous discussions, don't want to run user defined code in the
relevant phase during transaction commit *and* replay I don't think it'd
be all that easy to do it fast and flexible.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps

2014-12-29 Thread Petr Jelinek

On 29/12/14 11:16, Andres Freund wrote:

On 2014-12-29 12:06:07 +0200, Heikki Linnakangas wrote:

That's a little bit better, but I have to say I'm still not impressed. There
are so many implicit assumptions in the system. The first assumption is that
a 32-bit node id is sufficient.


Seriously? Are we going to build facilities for replication systems with
that many nodes? It seems absolutely unrealistic that a) somebody does
this b) that we'll blindly meet the demands of such a super hypothetical
scenario.



+1. Honestly, if somebody ever has a setup with more nodes than fit into 
32 bits, they will run into bigger problems than the node id being 
too small.



Then there's the assumption that the node id should be sticky,
i.e. it's set for the whole session. Again, I'm sure that's useful for
many systems, but I could just as easily imagine that you'd want it to
reset after every commit.


It's trivial to add that/reset it manually. So what?


Yes, you can reset it in the extension after commit, or you can actually 
override both the commit timestamp and the node id after commit if you so wish.





To be honest, I think this patch should be reverted. Instead, we should
design a system where extensions can define their own SLRUs to store
additional per-transaction information. That way, each extension can have as
much space per transaction as needed, and support functions that make most
sense with the information. Commit timestamp tracking would be one such
extension, and for this node ID stuff, you could have another one (or
include it in the replication extension).


If somebody wants that they should develop it. But given that we, based
on previous discussions, don't want to run user defined code in the
relevant phase during transaction commit *and* replay I don't think it'd
be all that easy to do it fast and flexible.



Right, I would love to have custom SLRUs but I don't see it happening 
given those two restrictions, otherwise I would write the CommitTs patch 
that way in the first place...



--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps

2014-12-29 Thread Heikki Linnakangas

On 12/29/2014 12:39 PM, Petr Jelinek wrote:

On 29/12/14 11:16, Andres Freund wrote:

On 2014-12-29 12:06:07 +0200, Heikki Linnakangas wrote:

To be honest, I think this patch should be reverted. Instead, we should
design a system where extensions can define their own SLRUs to store
additional per-transaction information. That way, each extension can have as
much space per transaction as needed, and support functions that make most
sense with the information. Commit timestamp tracking would be one such
extension, and for this node ID stuff, you could have another one (or
include it in the replication extension).


If somebody wants that they should develop it. But given that we, based
on previous discussions, don't want to run user defined code in the
relevant phase during transaction commit *and* replay I don't think it'd
be all that easy to do it fast and flexible.


Right, I would love to have custom SLRUs but I don't see it happening
given those two restrictions, otherwise I would write the CommitTs patch
that way in the first place...


Transaction commit and replay can treat the per-transaction information 
as an opaque blob. It just needs to be included in the commit record, 
and replay needs to write it to the SLRU. That way you don't need to run 
any user-defined code in those phases.


- Heikki




Re: [HACKERS] The return value of allocate_recordbuf()

2014-12-29 Thread Heikki Linnakangas

On 12/26/2014 09:31 AM, Fujii Masao wrote:

Hi,

While reviewing the FPW compression patch, I found that allocate_recordbuf()
always returns TRUE even though its source code comment says that FALSE is
returned if out of memory. Its return value is checked in two places, but
those checks are clearly useless.

allocate_recordbuf() was introduced by 7fcbf6a, and then changed by
2c03216 so that palloc() is used instead of malloc() and FALSE is never
returned even if out of memory. So this seems to be an oversight in 2c03216.
Maybe we should change it so that it checks whether we can enlarge the memory
by the requested size before actually allocating it?


Hmm. There is no way to check beforehand if a palloc() will fail because 
of OOM. We could check for MaxAllocSize, though.


Actually, before 2c03216, when we used malloc() here, the maximum record 
size was 4GB. Now it's only 1GB, because of MaxAllocSize. Are we OK with 
that, or should we use palloc_huge?


- Heikki





Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2014-12-29 Thread Abhijit Menon-Sen
Hi.

I've changed pgaudit to work as you suggested.

A quick note on the implementation: pgaudit was already installing an
ExecutorCheckPerms_hook anyway; I adapted code from ExecRTECheckPerms
to check if the audit role has been granted any of the permissions
required for the operation.

This means there are three ways to configure auditing:

1. GRANT … ON … TO audit, which logs any operations that correspond to
   the granted permissions.

2. Set pgaudit.roles = 'r1, r2, …', which logs everything done by r1,
   r2, and any of their descendants.

3. Set pgaudit.log = 'read, write, …', which logs any events in any of
   the listed classes.

(This is a small change from the earlier behaviour where, if a role was
listed in .roles, it was still subject to the .log setting. I find that
more useful in practice, but since we're discussing Stephen's proposal,
I implemented what he said.)
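
For illustration, a minimal sketch of the three configuration methods (the
role, table, and setting values here are made up; only the pgaudit.* names
come from the description above):

    -- 1. Grant-based: operations matching permissions granted to the audit role are logged
    CREATE ROLE audit NOLOGIN;
    GRANT SELECT, UPDATE ON accounts TO audit;

    -- 2. Role-based: everything done by the listed roles (and their descendants) is logged
    --    in postgresql.conf:  pgaudit.roles = 'r1, r2'

    -- 3. Class-based: all events in the listed classes are logged
    --    in postgresql.conf:  pgaudit.log = 'read, write'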

The new pgaudit.c is attached here for review. Nothing else has changed
from the earlier submission; and everything is in the github repository
(github.com/2ndQuadrant/pgaudit).

-- Abhijit
/*
 * pgaudit/pgaudit.c
 *
 * Copyright © 2014, PostgreSQL Global Development Group
 *
 * Permission to use, copy, modify, and distribute this software and
 * its documentation for any purpose, without fee, and without a
 * written agreement is hereby granted, provided that the above
 * copyright notice and this paragraph and the following two
 * paragraphs appear in all copies.
 *
 * IN NO EVENT SHALL THE AUTHOR BE LIABLE TO ANY PARTY FOR DIRECT,
 * INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
 * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
 * DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED
 * OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN AS
 * IS BASIS, AND THE AUTHOR HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE,
 * SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
 */

#include "postgres.h"

#include "access/htup_details.h"
#include "access/sysattr.h"
#include "access/xact.h"
#include "catalog/catalog.h"
#include "catalog/objectaccess.h"
#include "catalog/pg_class.h"
#include "commands/dbcommands.h"
#include "catalog/pg_proc.h"
#include "commands/event_trigger.h"
#include "executor/executor.h"
#include "executor/spi.h"
#include "miscadmin.h"
#include "libpq/auth.h"
#include "nodes/nodes.h"
#include "tcop/utility.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
#include "utils/syscache.h"
#include "utils/timestamp.h"

PG_MODULE_MAGIC;

void _PG_init(void);

Datum pgaudit_func_ddl_command_end(PG_FUNCTION_ARGS);
Datum pgaudit_func_sql_drop(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(pgaudit_func_ddl_command_end);
PG_FUNCTION_INFO_V1(pgaudit_func_sql_drop);

/*
 * pgaudit_roles_str is the string value of the pgaudit.roles
 * configuration variable, which is a list of role names.
 */

char *pgaudit_roles_str = NULL;

/*
 * pgaudit_log_str is the string value of the pgaudit.log configuration
 * variable, e.g. read, write, user. Each token corresponds to a flag
 * in enum LogClass below. We convert the list of tokens into a bitmap
 * in pgaudit_log for internal use.
 */

char *pgaudit_log_str = NULL;
static uint64 pgaudit_log = 0;

enum LogClass {
	LOG_NONE = 0,

	/* SELECT */
	LOG_READ = (1 << 0),

	/* INSERT, UPDATE, DELETE, TRUNCATE */
	LOG_WRITE = (1 << 1),

	/* GRANT, REVOKE, ALTER … */
	LOG_PRIVILEGE = (1 << 2),

	/* CREATE/DROP/ALTER ROLE */
	LOG_USER = (1 << 3),

	/* DDL: CREATE/DROP/ALTER */
	LOG_DEFINITION = (1 << 4),

	/* DDL: CREATE OPERATOR etc. */
	LOG_CONFIG = (1 << 5),

	/* VACUUM, REINDEX, ANALYZE */
	LOG_ADMIN = (1 << 6),

	/* Function execution */
	LOG_FUNCTION = (1 << 7),

	/* Absolutely everything; not available via pgaudit.log */
	LOG_ALL = ~(uint64)0
};

/*
 * This module collects AuditEvents from various sources (event
 * triggers, and executor/utility hooks) and passes them to the
 * log_audit_event() function.
 *
 * An AuditEvent represents an operation that potentially affects a
 * single object. If an underlying command affects multiple objects,
 * multiple AuditEvents must be created to represent it.
 */

typedef struct {
	NodeTag type;
	const char *object_id;
	const char *object_type;
	const char *command_tag;
	const char *command_text;
	bool granted;
} AuditEvent;

/*
 * Returns the oid of the hardcoded audit role.
 */

static Oid
audit_role_oid()
{
	HeapTuple roleTup;
	Oid oid = InvalidOid;

	roleTup = SearchSysCache1(AUTHNAME, PointerGetDatum("audit"));
	if (HeapTupleIsValid(roleTup)) {
		oid = HeapTupleGetOid(roleTup);
		ReleaseSysCache(roleTup);
	}

	return oid;
}

/* Returns true if either pgaudit.roles or pgaudit.log is set. */

static inline bool
pgaudit_configured()
{
	return (pgaudit_roles_str && *pgaudit_roles_str) 

Re: [HACKERS] [COMMITTERS] pgsql: Keep track of transaction commit timestamps

2014-12-29 Thread Andres Freund
On 2014-12-29 12:50:23 +0200, Heikki Linnakangas wrote:
 On 12/29/2014 12:39 PM, Petr Jelinek wrote:
 On 29/12/14 11:16, Andres Freund wrote:
 On 2014-12-29 12:06:07 +0200, Heikki Linnakangas wrote:
 To be honest, I think this patch should be reverted. Instead, we should
 design a system where extensions can define their own SLRUs to store
 additional per-transaction information. That way, each extension can have 
 as
 much space per transaction as needed, and support functions that make most
 sense with the information. Commit timestamp tracking would be one such
 extension, and for this node ID stuff, you could have another one (or
 include it in the replication extension).
 
 If somebody wants that they should develop it. But given that we, based
 on previous discussions, don't want to run user defined code in the
 relevant phase during transaction commit *and* replay I don't think it'd
 be all that easy to do it fast and flexible.
 
 Right, I would love to have custom SLRUs but I don't see it happening
 given those two restrictions, otherwise I would write the CommitTs patch
 that way in the first place...
 
 Transaction commit and replay can treat the per-transaction information as
 an opaque blob. It just needs to be included in the commit record, and
 replay needs to write it to the SLRU. That way you don't need to run any
 user-defined code in those phases.

Meh. Only if you want to duplicate the timestamps from the commits.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] What exactly is our CRC algorithm?

2014-12-29 Thread Andres Freund
Hi,

On 2014-12-25 11:57:29 +0530, Abhijit Menon-Sen wrote:
 -extern pg_crc32 pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len);
 +extern void pg_init_comp_crc32c(void);

How about pg_choose_crc_impl() or something?

 +extern pg_crc32 (*pg_comp_crc32c)(pg_crc32 crc, const void *data, size_t 
 len);
  
  /*
   * CRC calculation using the CRC-32C (Castagnoli) polynomial.
 
 diff --git a/src/port/pg_crc.c b/src/port/pg_crc.c
 index 2f9857b..6be17b0 100644
 --- a/src/port/pg_crc.c
 +++ b/src/port/pg_crc.c
 @@ -21,6 +21,13 @@
  #include "utils/pg_crc.h"
  #include "utils/pg_crc_tables.h"
  
 +#if defined(HAVE_CPUID_H)
 +#include <cpuid.h>
 +#elif defined(_MSC_VER)
 +#include <intrin.h>
 +#include <nmmintrin.h>
 +#endif
 +
  static inline uint32 bswap32(uint32 x)
  {
  #if defined(__GNUC__) || defined(__clang__)
 @@ -39,8 +46,8 @@ static inline uint32 bswap32(uint32 x)
  #define cpu_to_le32(x) x
  #endif
  
 -pg_crc32
 -pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len)
 +static pg_crc32
 +pg_comp_crc32c_sb8(pg_crc32 crc, const void *data, size_t len)

_sb8? Unless I miss something it's not slice by 8 but rather bytewise?

  {
   const unsigned char *p = data;
   const uint32 *p8;
 @@ -61,7 +68,6 @@ pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len)
*/
  
   p8 = (const uint32 *) p;
 -
   while (len >= 8)
   {
   uint32 a = *p8++ ^ cpu_to_le32(crc);
 @@ -101,8 +107,102 @@ pg_comp_crc32c(pg_crc32 crc, const void *data, size_t 
 len)
*/
  
   p = (const unsigned char *) p8;
 - while (len-- > 0)
 + while (len > 0)
 + {
   crc = pg_crc32c_table[0][(crc ^ *p++) & 0xFF] ^ (crc >> 8);
 + len--;
 + }
 +
 + return crc;
 +}
  
 +static pg_crc32
 +pg_asm_crc32b(pg_crc32 crc, unsigned char data)
 +{

Should be marked inline.

 +#ifdef __GNUC__
 + __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" 
 (data));

Have you checked which version of gcc introduced named references to
input/output parameters?

   return crc;
 +#elif defined(_MSC_VER)
 + return _mm_crc32_u8(crc, data);
 +#else
 +#error Don't know how to generate crc32b instruction
 +#endif
  }
 +
 +static pg_crc32
 +pg_asm_crc32q(uint64 crc, unsigned long long data)
 +{

inline.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] What exactly is our CRC algorithm?

2014-12-29 Thread Abhijit Menon-Sen
At 2014-12-29 13:22:28 +0100, and...@2ndquadrant.com wrote:

 How about pg_choose_crc_impl() or something?

Done.

 _sb8? Unless I miss something it's not slice by 8 but rather bytewise?

This is meant to apply on top of the earlier patches I posted to
implement slice-by-8. I'll attach both here.

 Should be marked inline.

Done (and the other one too).

  +#ifdef __GNUC__
  +   __asm__ ("crc32b %[data], %[crc]\n" : [crc] "+r" (crc) : [data] "rm" 
  (data));
 
 Have you checked which version of gcc introduced named references to
 input/output parameters?

No. The documentation calls them asmSymbolicNames, but I can't find
the term (or various likely alternatives, e.g. symbolic_operand may be
related) either in the release notes, the changelog, or the source (on
Github). Does anyone know, or know how to find out?

Meanwhile, I have attached the two patches with the other modifications.

-- Abhijit
diff --git a/src/include/utils/pg_crc.h b/src/include/utils/pg_crc.h
index f871cba..55934e5 100644
--- a/src/include/utils/pg_crc.h
+++ b/src/include/utils/pg_crc.h
@@ -41,6 +41,8 @@
 
 typedef uint32 pg_crc32;
 
+extern pg_crc32 pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len);
+
 /*
  * CRC calculation using the CRC-32C (Castagnoli) polynomial.
  *
@@ -51,7 +53,7 @@ typedef uint32 pg_crc32;
 #define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
 #define FIN_CRC32C(crc)	((crc) ^= 0xFFFFFFFF)
 #define COMP_CRC32C(crc, data, len)	\
-	COMP_CRC32_NORMAL_TABLE(crc, data, len, pg_crc32c_table)
+	((crc) = pg_comp_crc32c((crc), (char *) (data), (len)))
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
 
 /*
@@ -115,7 +117,7 @@ do {			  \
 } while (0)
 
 /* Constant tables for CRC-32C and CRC-32 polynomials */
-extern CRCDLLIMPORT const uint32 pg_crc32c_table[];
+extern CRCDLLIMPORT const uint32 pg_crc32c_table[8][256];
 extern CRCDLLIMPORT const uint32 pg_crc32_table[];
 
 #endif   /* PG_CRC_H */
diff --git a/src/include/utils/pg_crc_tables.h b/src/include/utils/pg_crc_tables.h
index cb6b470..f814c91 100644
--- a/src/include/utils/pg_crc_tables.h
+++ b/src/include/utils/pg_crc_tables.h
@@ -28,71 +28,519 @@
  * This table is based on the so-called Castagnoli polynomial (the same
  * that is used e.g. in iSCSI).
  */
-const uint32 pg_crc32c_table[256] = {
-	0x00000000, 0xF26B8303, 0xE13B70F7, 0x1350F3F4,
-	0xC79A971F, 0x35F1141C, 0x26A1E7E8, 0xD4CA64EB,
-	0x8AD958CF, 0x78B2DBCC, 0x6BE22838, 0x9989AB3B,
-	0x4D43CFD0, 0xBF284CD3, 0xAC78BF27, 0x5E133C24,
-	0x105EC76F, 0xE235446C, 0xF165B798, 0x030E349B,
-	0xD7C45070, 0x25AFD373, 0x36FF2087, 0xC494A384,
-	0x9A879FA0, 0x68EC1CA3, 0x7BBCEF57, 0x89D76C54,
-	0x5D1D08BF, 0xAF768BBC, 0xBC267848, 0x4E4DFB4B,
-	0x20BD8EDE, 0xD2D60DDD, 0xC186FE29, 0x33ED7D2A,
-	0xE72719C1, 0x154C9AC2, 0x061C6936, 0xF477EA35,
-	0xAA64D611, 0x580F5512, 0x4B5FA6E6, 0xB93425E5,
-	0x6DFE410E, 0x9F95C20D, 0x8CC531F9, 0x7EAEB2FA,
-	0x30E349B1, 0xC288CAB2, 0xD1D83946, 0x23B3BA45,
-	0xF779DEAE, 0x05125DAD, 0x1642AE59, 0xE4292D5A,
-	0xBA3A117E, 0x4851927D, 0x5B016189, 0xA96AE28A,
-	0x7DA08661, 0x8FCB0562, 0x9C9BF696, 0x6EF07595,
-	0x417B1DBC, 0xB3109EBF, 0xA0406D4B, 0x522BEE48,
-	0x86E18AA3, 0x748A09A0, 0x67DAFA54, 0x95B17957,
-	0xCBA24573, 0x39C9C670, 0x2A993584, 0xD8F2B687,
-	0x0C38D26C, 0xFE53516F, 0xED03A29B, 0x1F682198,
-	0x5125DAD3, 0xA34E59D0, 0xB01EAA24, 0x42752927,
-	0x96BF4DCC, 0x64D4CECF, 0x77843D3B, 0x85EFBE38,
-	0xDBFC821C, 0x2997011F, 0x3AC7F2EB, 0xC8AC71E8,
-	0x1C661503, 0xEE0D9600, 0xFD5D65F4, 0x0F36E6F7,
-	0x61C69362, 0x93AD1061, 0x80FDE395, 0x72966096,
-	0xA65C047D, 0x5437877E, 0x4767748A, 0xB50CF789,
-	0xEB1FCBAD, 0x197448AE, 0x0A24BB5A, 0xF84F3859,
-	0x2C855CB2, 0xDEEEDFB1, 0xCDBE2C45, 0x3FD5AF46,
-	0x7198540D, 0x83F3D70E, 0x90A324FA, 0x62C8A7F9,
-	0xB602C312, 0x44694011, 0x5739B3E5, 0xA55230E6,
-	0xFB410CC2, 0x092A8FC1, 0x1A7A7C35, 0xE811FF36,
-	0x3CDB9BDD, 0xCEB018DE, 0xDDE0EB2A, 0x2F8B6829,
-	0x82F63B78, 0x709DB87B, 0x63CD4B8F, 0x91A6C88C,
-	0x456CAC67, 0xB7072F64, 0xA457DC90, 0x563C5F93,
-	0x082F63B7, 0xFA44E0B4, 0xE9141340, 0x1B7F9043,
-	0xCFB5F4A8, 0x3DDE77AB, 0x2E8E845F, 0xDCE5075C,
-	0x92A8FC17, 0x60C37F14, 0x73938CE0, 0x81F80FE3,
-	0x55326B08, 0xA759E80B, 0xB4091BFF, 0x466298FC,
-	0x1871A4D8, 0xEA1A27DB, 0xF94AD42F, 0x0B21572C,
-	0xDFEB33C7, 0x2D80B0C4, 0x3ED04330, 0xCCBBC033,
-	0xA24BB5A6, 0x502036A5, 0x4370C551, 0xB11B4652,
-	0x65D122B9, 0x97BAA1BA, 0x84EA524E, 0x7681D14D,
-	0x2892ED69, 0xDAF96E6A, 0xC9A99D9E, 0x3BC21E9D,
-	0xEF087A76, 0x1D63F975, 0x0E330A81, 0xFC588982,
-	0xB21572C9, 0x407EF1CA, 0x532E023E, 0xA145813D,
-	0x758FE5D6, 0x87E466D5, 0x94B49521, 0x66DF1622,
-	0x38CC2A06, 0xCAA7A905, 0xD9F75AF1, 0x2B9CD9F2,
-	0xFF56BD19, 0x0D3D3E1A, 0x1E6DCDEE, 0xEC064EED,
-	0xC38D26C4, 0x31E6A5C7, 0x22B65633, 0xD0DDD530,
-	0x0417B1DB, 0xF67C32D8, 0xE52CC12C, 0x1747422F,
-	0x49547E0B, 0xBB3FFD08, 0xA86F0EFC, 0x5A048DFF,
-	0x8ECEE914, 0x7CA56A17, 0x6FF599E3, 0x9D9E1AE0,
-	0xD3D3E1AB, 0x21B862A8, 0x32E8915C, 0xC083125F,
-	0x144976B4, 0xE622F5B7, 0xF5720643, 0x07198540,
-	0x590AB964, 

Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Kevin Grittner
Merlin Moncure mmonc...@gmail.com wrote:
 On Fri, Dec 26, 2014 at 12:38 PM, Kevin Grittner kgri...@ymail.com wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:

 Just for starters, a 40XXX error report will fail to provide the
 duplicated key's value.  This will be a functional regression,

 Not if, as is normally the case, the transaction is retried from
 the beginning on a serialization failure.  Either the code will
 check for a duplicate (as in the case of the OP on this thread) and
 they won't see the error, *or* the transaction which created
 the duplicate key will have committed before the start of the retry
 and you will get the duplicate key error.

 I'm not buying that; that argument assumes duplicate key errors are
 always 'upsert' driven.  Although OP's code may have checked for
 duplicates it's perfectly reasonable (and in many cases preferable) to
 force the transaction to fail and report the error directly back to
 the application.  The application will then switch on the error code
 and decide what to do: retry for deadlock/serialization or abort for
 data integrity error.  IOW, the error handling semantics are
 fundamentally different and should not be mixed.

I think you might be agreeing with me without realizing it.  Right 
now you get duplicate key error even if the duplication is caused 
by a concurrent transaction -- it is not possible to check the 
error code (well, SQLSTATE, technically) to determine whether this 
is fundamentally a serialization problem.  What we're talking about 
is returning the serialization failure return code for the cases 
where it is a concurrent transaction causing the failure and 
continuing to return the duplicate key error for all other cases.

Either I'm not understanding what you wrote above, or you seem to
be arguing for being able to distinguish between errors caused by
concurrent transactions and those which aren't.


--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Merlin Moncure
On Mon, Dec 29, 2014 at 8:03 AM, Kevin Grittner kgri...@ymail.com wrote:
 Merlin Moncure mmonc...@gmail.com wrote:
 On Fri, Dec 26, 2014 at 12:38 PM, Kevin Grittner kgri...@ymail.com
 wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:

 Just for starters, a 40XXX error report will fail to provide the
 duplicated key's value.  This will be a functional regression,

 Not if, as is normally the case, the transaction is retried from
 the beginning on a serialization failure.  Either the code will
 check for a duplicate (as in the case of the OP on this thread) and
 they won't see the error, *or* the transaction which created
 the duplicate key will have committed before the start of the retry
 and you will get the duplicate key error.

 I'm not buying that; that argument assumes duplicate key errors are
 always 'upsert' driven.  Although OP's code may have checked for
 duplicates it's perfectly reasonable (and in many cases preferable) to
 force the transaction to fail and report the error directly back to
 the application.  The application will then switch on the error code
 and decide what to do: retry for deadlock/serialization or abort for
 data integrity error.  IOW, the error handling semantics are
 fundamentally different and should not be mixed.

 I think you might be agreeing with me without realizing it.  Right
 now you get duplicate key error even if the duplication is caused
 by a concurrent transaction -- it is not possible to check the
 error code (well, SQLSTATE, technically) to determine whether this
 is fundamentally a serialization problem.  What we're talking about
 is returning the serialization failure return code for the cases
 where it is a concurrent transaction causing the failure and
 continuing to return the duplicate key error for all other cases.

 Either I'm not understanding what you wrote above, or you seem to
 be arguing for being able to distinguish between errors caused by
 concurrent transactions and those which aren't.

Well, I'm arguing that duplicate key errors are not serialization
failures unless it's likely the insertion would succeed upon a retry;
a proper insert, not an upsert.  If that's the case with what you're
proposing, then it makes sense to me.  But that's not what it sounds
like...your language suggests AIUI that having the error simply be
caused by another transaction being concurrent would be sufficient to
switch to a serialization error (feel free to correct me if I'm
wrong!).

In other words, the current behavior is:
txn A,B begin
txn A inserts
txn B inserts over A, locks, waits
txn A commits.  B aborts with duplicate key error

Assuming that case is untouched, then we're good!  My long winded
point above is that case must fail with duplicate key error; a
serialization error is suggesting the transaction should be retried
and it shouldn't be...it would simply fail a second time.
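
To make that sequence concrete, a minimal two-session sketch (table t is
hypothetical; the error text is the usual unique-violation message, SQLSTATE
23505, which the proposal would turn into 40001 for serializable transactions):

    -- setup: CREATE TABLE t (id int PRIMARY KEY);
    -- session A:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    INSERT INTO t (id) VALUES (1);
    -- session B:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    INSERT INTO t (id) VALUES (1);   -- blocks, waiting for A
    -- session A:
    COMMIT;
    -- session B fails:
    -- ERROR:  duplicate key value violates unique constraint "t_pkey"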

merlin




Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2014-12-29 Thread Ali Akbar
2014-12-29 14:38 GMT+07:00 Jeff Davis pg...@j-davis.com:

 Just jumping into this patch now. Do we think this is worth changing the
 signature of functions in array.h, which might be used from a lot of
 third-party code? We might want to provide new functions to avoid a
 breaking change.


V6 patch from Tomas only change initArrayResult* functions. initArrayResult
is new API in 9.5 (commit bac2739), with old API still works as-is.

-- 
Ali Akbar


Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Kevin Grittner
Merlin Moncure mmonc...@gmail.com wrote:

 Well, I'm arguing that duplicate key errors are not serialization
 failures unless it's likely the insertion would succeed upon a retry;
 a proper insert, not an upsert.  If that's the case with what you're
 proposing, then it makes sense to me.  But that's not what it sounds
 like...your language suggests AIUI that having the error simply be
 caused by another transaction being concurrent would be sufficient to
 switch to a serialization error (feel free to correct me if I'm
 wrong!).

 In other words, the current behavior is:
 txn A,B begin
 txn A inserts
 txn B inserts over A, locks, waits
 txn A commits.  B aborts with duplicate key error

 Assuming that case is untouched, then we're good!  My long winded
 point above is that case must fail with duplicate key error; a
 serialization error is suggesting the transaction should be retried
 and it shouldn't be...it would simply fail a second time.

What I'm proposing is that for serializable transactions B would
get a serialization failure; otherwise B would get a duplicate key
error.  If the retry of B looks at something in the database to
determine what its primary key should be, it will get a new value
on the retry, since it will be starting after the commit of A.  If
it is using a literal key, not based on something changed by A, it
will get a duplicate key error on the retry, since it will be
starting after the commit of A.

It will either succeed on retry or get an error for a different
reason.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Serialization exception : Who else was involved?

2014-12-29 Thread Kevin Grittner
Craig Ringer cr...@2ndquadrant.com wrote:

 I don't see how that'd necessarily correctly identify the
 query/queries in the other tx that're involved, though.

 Perhaps I'm thinking in terms of more complicated serialization
 failures?

Yeah, it might be possible to provide useful information about
specific conflicting queries in some cases, but I'm skeptical that
it would be available in the majority of cases.  In most cases you
can probably identify one or two other transactions that are
involved.  In at least some cases you won't even have that.

For one fun case to think about, see this example where a read-only
transaction fails due to a conflict with two already-committed
transactions:

https://wiki.postgresql.org/wiki/SSI#Rollover

Also consider the case when there is a long-running transaction and SSI
falls back to SLRU summarization.

If we can find a way to provide some useful information in some
cases without harming performance, that's fine as long as nobody
expects that it will be available in all cases.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Merlin Moncure
On Mon, Dec 29, 2014 at 9:09 AM, Kevin Grittner kgri...@ymail.com wrote:
 Merlin Moncure mmonc...@gmail.com wrote:
 In other words, the current behavior is:
 txn A,B begin
 txn A inserts
 txn B inserts over A, locks, waits
 txn A commits.  B aborts with duplicate key error

 What I'm proposing is that for serializable transactions B would
 get a serialization failure; otherwise B would get a duplicate key
 error.  If the retry of B looks at something in the database to
  determine what its primary key should be, it will get a new value
 on the retry, since it will be starting after the commit of A.  If
 it is using a literal key, not based on something changed by A, it
 will get a duplicate key error on the retry, since it will be
 starting after the commit of A.

 It will either succeed on retry or get an error for a different
 reason.

In that case: we don't agree.  How come duplicate key errors would be
reported as serialization failures but not RI errors (for example,
inserting a record pointing to another record which a concurrent
transaction deleted)?

IMO, serialization errors are an implementation artifact and should
not mask well defined errors in SQL under any circumstances (or if
they must, those cases should absolutely be as narrow as possible).

merlin




Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Greg Stark
On Mon, Dec 29, 2014 at 3:31 PM, Merlin Moncure mmonc...@gmail.com wrote:
 In that case: we don't agree.  How come duplicate key errors would be
 reported as serialization failures but not RI errors (for example,
 inserting a record pointing to another record which a concurrent
 transaction deleted)?

The key question is not the type of error but whether the transaction
saw another state previously. That is, if you select to check for
duplicate keys, don't see any, and then try to insert and get a
duplicate key error that would be easy to argue is a serialization
error. The same could be true for an RI check -- if you select some
data and then insert a record that refers to the data you already
verified existed and get an RI failure then you could consider that a
serialization failure.


-- 
greg




[HACKERS] Publish autovacuum informations

2014-12-29 Thread Guillaume Lelarge
Hey,

There are times when I need more information on the autovacuum
processes.

I'd love to know what each worker is currently doing. I can get something
like this from the pg_stat_activity view but it doesn't give me as much
information as the WorkerInfoData struct.
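
For example, this is roughly all that pg_stat_activity exposes today (a
sketch; autovacuum workers show up with an "autovacuum:" prefix in the query
column):

    SELECT pid, query, xact_start
      FROM pg_stat_activity
     WHERE query LIKE 'autovacuum:%';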

I'd also love to have more information on the contents of the tables list
(how many tables still to process, which table next, what kind of
processing they'll get, etc... kinda what you have in the autovac_table
struct).

All in all, I want to get information that is typically stored in shared
memory, handled by the autovacuum launcher and autovacuum workers. I first
thought I could get that by writing some C functions embedded in an
extension. But it doesn't seem to me I can access this part of the shared
memory from a C function. If I'm wrong, I'd love to get a pointer on how to
do this.

Otherwise, I wonder what would be more welcome: making the shared memory
structs handles available outside of the autovacuum processes (and then
build an extension to decode the information I need), or adding functions
in core to get access to this information (in that case, no need for an
extension)?

Thanks.

Regards.


-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com


Re: [HACKERS] Publish autovacuum informations

2014-12-29 Thread Tom Lane
Guillaume Lelarge guilla...@lelarge.info writes:
 All in all, I want to get information that is typically stored in shared
 memory, handled by the autovacuum launcher and autovacuum workers. I first
 thought I could get that by writing some C functions embedded in an
 extension. But it doesn't seem to me I can access this part of the shared
 memory from a C function. If I'm wrong, I'd love to get a pointer on how to
 do this.

 Otherwise, I wonder what would be more welcome: making the shared memory
 structs handles available outside of the autovacuum processes (and then
 build an extension to decode the information I need), or adding functions
 in core to get access to this information (in that case, no need for an
 extension)?

Either one of those approaches would cripple our freedom to change those
data structures; which we've done repeatedly in the past and will surely
want to do again.  So I'm pretty much -1 on exposing them.

regards, tom lane




Re: [HACKERS] Publish autovacuum informations

2014-12-29 Thread Guillaume Lelarge
2014-12-29 17:03 GMT+01:00 Tom Lane t...@sss.pgh.pa.us:

 Guillaume Lelarge guilla...@lelarge.info writes:
  All in all, I want to get information that is typically stored in
 shared
  memory, handled by the autovacuum launcher and autovacuum workers. I
 first
  thought I could get that by writing some C functions embedded in an
  extension. But it doesn't seem to me I can access this part of the shared
  memory from a C function. If I'm wrong, I'd love to get a pointer on how
 to
  do this.

  Otherwise, I wonder what would be more welcome: making the shared memory
  structs handles available outside of the autovacuum processes (and then
  build an extension to decode the information I need), or adding
 functions
  in core to get access to this information (in that case, no need for an
  extension)?

 Either one of those approaches would cripple our freedom to change those
 data structures; which we've done repeatedly in the past and will surely
 want to do again.  So I'm pretty much -1 on exposing them.


I don't see how that's going to deny us the right to change any structs. If
they are in-core functions, we'll just have to update them.  If they are
extension functions, then the developer of those functions would simply
need to update his code.


-- 
Guillaume.
  http://blog.guillaume.lelarge.info
  http://www.dalibo.com


Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Merlin Moncure
On Mon, Dec 29, 2014 at 9:44 AM, Greg Stark st...@mit.edu wrote:
 On Mon, Dec 29, 2014 at 3:31 PM, Merlin Moncure mmonc...@gmail.com wrote:
 In that case: we don't agree.  How come duplicate key errors would be
 reported as serialization failures but not RI errors (for example,
 inserting a record pointing to another record which a concurrent
 transaction deleted)?

 The key question is not the type of error but whether the transaction
 saw another state previously.

[combining replies -- nikita, better not to top-post (FYI)]

How is that relevant?  Serialization errors only exist as a concession
to concurrency and performance.  Again, they should be returned as
sparsely as possible because they provide absolutely (as Tom pointed
out) zero detail to the application.   The precise definition of the
error is up to us, but I'd rather keep it to it's current rather
specific semantics.

On Mon, Dec 29, 2014 at 9:48 AM, Nikita Volkov nikita.y.vol...@mail.ru wrote:
 I believe, the objections expressed in this thread miss a very important
 point of all this: the isolation property (the I in ACID) is violated.

 Here’s a quote from the Wikipedia article on ACID:

 The isolation property ensures that the concurrent execution of transactions
 results in a system state that would be obtained if transactions were
 executed serially, i.e., one after the other.

 The original example proves that it's violated. Such behaviour can neither
 be expected by a user, nor is even mentioned anywhere. Instead in the first
 paragraph of the “About” section of the Postgres site it states:

 It is fully ACID compliant

 Which is basically just a lie, until issues like this one get dealt with.

That's simply untrue: inconvenience != acid violation

Transaction levels provide certain guarantees regarding the state of
the data in the presence of concurrent overlapping operations.   They
do not define the mechanism of failure or how/when those failures
should occur.  To prove your statement, you need to demonstrate how a
transaction left the database in a bad state given concurrent activity
without counting failures.  Postgres can, and does, for example,
return concurrency type errors more aggressively than it needs to for
the 'repeatable read' level.  Point being, this is completely ok as
database implementation is free to understand that, just as it's free
to define precisely how and when it fails given concurrency as long as
those guarantees are provided.

merlin




Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Kevin Grittner
Merlin Moncure mmonc...@gmail.com wrote:

 Serialization errors only exist as a concession to concurrency
 and performance.  Again, they should be returned as sparsely as
 possible because they provide absolutely (as Tom pointed
 out) zero detail to the application.

That is false.  They provide an *extremely* valuable piece of
information which is not currently available when you get a
duplicate key error -- whether the error occurred because of a race
condition and will not fail for the same cause if retried.

 The precise definition of the error is up to us, but I'd rather
 keep it to it's current rather specific semantics.

The semantics are so imprecise that Tom argued that we should
document that transactions should be retried from the start when
you get the duplicate key error, since it *might* have been caused
by a race condition.

I'm curious how heavily you use serializable transactions, because
I have trouble believing that those who rely on them as their 
primary (or only) strategy for dealing with race conditions under 
high concurrency would take that position.

As for the fact that RI violations also don't return a
serialization failure when caused by a race with concurrent
transactions, I view that as another weakness in PostgreSQL.  I
don't think there is a problem curing one without curing the other
at the same time.  I have known of people writing their own
triggers to enforce RI rather than defining FKs precisely so that
they can get a serialization failure return code and do automatic
retry if it is caused by a race condition.  That's less practical
to compensate for when it comes to unique indexes or constraints.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Nikita Volkov
 [combining replies -- nikita, better not to top-post (FYI)]

I'm sorry. I don't know what you mean. I just replied to an email.

 To prove your statement, you need to demonstrate how a transaction left
the database in a bad state given concurrent activity without counting
failures.

1. Transaction A looks up a row by ID 1 and gets an empty result.
2. Concurrent transaction B inserts a row with ID 1.
3. Transaction A goes on with the presumption that a row with ID 1 does not
exist, because a transaction is supposed to be isolated and because it has
made sure that the row does not exist. With this presumption it confidently
inserts a row with ID 1 only to get Postgres report a duplicate key. Wat?
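
In SQL, a sketch of that sequence (hypothetical table t with a primary key on
id; current behaviour):

    -- transaction A:
    BEGIN ISOLATION LEVEL SERIALIZABLE;
    SELECT * FROM t WHERE id = 1;        -- no rows
    -- transaction B:
    BEGIN; INSERT INTO t (id) VALUES (1); COMMIT;
    -- transaction A, assuming id 1 is free:
    INSERT INTO t (id) VALUES (1);
    -- ERROR:  duplicate key value violates unique constraint "t_pkey"
    -- (SQLSTATE 23505 today, rather than serialization_failure 40001)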


2014-12-29 19:17 GMT+03:00 Merlin Moncure mmonc...@gmail.com:

 On Mon, Dec 29, 2014 at 9:44 AM, Greg Stark st...@mit.edu wrote:
  On Mon, Dec 29, 2014 at 3:31 PM, Merlin Moncure mmonc...@gmail.com
 wrote:
  In that case: we don't agree.  How come duplicate key errors would be
  reported as serialization failures but not RI errors (for example,
  inserting a record pointing to another record which a concurrent
  transaction deleted)?
 
  The key question is not the type of error but whether the transaction
  saw another state previously.

 [combining replies -- nikita, better not to top-post (FYI)]

 How is that relevant?  Serialization errors only exist as a concession
 to concurrency and performance.  Again, they should be returned as
 sparsely as possible because they provide absolutely (as Tom pointed
 out) zero detail to the application.   The precise definition of the
 error is up to us, but I'd rather keep it to its current rather
 specific semantics.

 On Mon, Dec 29, 2014 at 9:48 AM, Nikita Volkov nikita.y.vol...@mail.ru
 wrote:
  I believe, the objections expressed in this thread miss a very important
  point of all this: the isolation property (the I in ACID) is violated.
 
  Here’s a quote from the Wikipedia article on ACID:
 
  The isolation property ensures that the concurrent execution of
 transactions
  results in a system state that would be obtained if transactions were
  executed serially, i.e., one after the other.
 
  The original example proves that it's violated. Such behaviour can
 neither
  be expected by a user, nor is even mentioned anywhere. Instead in the
 first
  paragraph of the “About” section of the Postgres site it states:
 
  It is fully ACID compliant
 
  Which is basically just a lie, until issues like this one get dealt with.

 That's simply untrue: inconvenience != acid violation

 Transaction levels provide certain guarantees regarding the state of
 the data in the presence of concurrent overlapping operations.   They
 do not define the mechanism of failure or how/when those failures
 should occur.  To prove your statement, you need to demonstrate how a
 transaction left the database in a bad state given concurrent activity
 without counting failures.  Postgres can, and does, for example,
 return concurrency type errors more aggressively than it needs to for
 the 'repeatable read' level.  Point being, this is completely ok as
 database implementation is free to understand that, just as it's free
 to define precisely how and when it fails given concurrency as long as
 those guarantees are provided.

 merlin



Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Nikita Volkov
I believe, the objections expressed in this thread miss a very important
point of all this: the isolation property (the I in ACID) is violated.

Here’s a quote from the Wikipedia article on ACID
http://en.wikipedia.org/wiki/ACID:

The isolation property ensures that the concurrent execution of
transactions results in a system state that would be obtained if
transactions were executed serially, i.e., one after the other.

The original example proves that it's violated. Such behaviour can neither
be expected by a user, nor is even mentioned anywhere. Instead in the first
paragraph of the “About” section of the Postgres site
http://www.postgresql.org/about/ it states:

It is fully ACID compliant

Which is basically just a lie, until issues like this one get dealt with.

2014-12-29 18:31 GMT+03:00 Merlin Moncure mmonc...@gmail.com:

 On Mon, Dec 29, 2014 at 9:09 AM, Kevin Grittner kgri...@ymail.com wrote:
  Merlin Moncure mmonc...@gmail.com wrote:
  In other words, the current behavior is:
  txn A,B begin
  txn A inserts
  txn B inserts over A, locks, waits
  txn A commits.  B aborts with duplicate key error
 
  What I'm proposing is that for serializable transactions B would
  get a serialization failure; otherwise B would get a duplicate key
  error.  If the retry of B looks at something in the database to
  determine what its primary key should be, it will get a new value
  on the retry, since it will be starting after the commit of A.  If
  it is using a literal key, not based on something changed by A, it
  will get a duplicate key error on the retry, since it will be
  starting after the commit of A.
 
  It will either succeed on retry or get an error for a different
  reason.

 In that case: we don't agree.  How come duplicate key errors would be
 reported as serialization failures but not RI errors (for example,
 inserting a record pointing to another record which a concurrent
 transaction deleted)?

 IMO, serialization errors are an implementation artifact and should
 not mask well defined errors in SQL under any circumstances (or if
 they must, those cases should absolutely be as narrow as possible).

 merlin



Re: [HACKERS] replicating DROP commands across servers

2014-12-29 Thread Alvaro Herrera
Alvaro Herrera wrote:

 Patch 0005 adds getObjectIdentityParts(), which returns the object
 identity in arrays that can be passed to pg_get_object_address.  This
 part needs slight revisions so I'm not sure I will be able to push
 tomorrow.

Here's a fresh version of this patch.  I chose to add a SQL-accessible
version, pg_identify_object_as_address, to make it easier to test.  In
doing so I noticed a couple of bugs, and most interestingly I noticed
that it was essentially impossible to cleanly address an array type;
doing a roundtrip through the new functions would get me the base type
when I used integer[] but the array type when I used _int4.  This
looked like a problem, so I traced through it and noticed that we're
using the type name *list* as a list, rather than as a TypeName, to
refer to OBJECT_TYPE and OBJECT_DOMAIN; I hadn't understood the
significance of this until I realized that domains would be represented
with arrayBounds set to a non-empty list for the integer[] syntax, but
the name list would have pg_catalog integer only; when the rest of
TypeName was discarded, the fact that we were talking about an array was
completely forgotten.  Before the dawn of time we had this:

-static void
-CommentType(List *typename, char *comment)
-{
-   TypeName   *tname;
-   Oid oid;
-
-   /* XXX a bit of a crock; should accept TypeName in COMMENT syntax */
-   tname = makeTypeNameFromNameList(typename);

where the XXX comment was removed by commit c10575ff005c330d047534562
without a corresponding comment in the new function.

I'm going to see about changing the grammar to get this fixed; this
patch is important because it will enable us to complete^Wcontinue
working on the DDL deparse testing framework.
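
As a usage sketch of the intended roundtrip (object names are made up, and
column names may differ from the final committed version):

    -- on the origin server: turn an object address into (type, names, args)
    SELECT type, object_names, object_args
      FROM pg_identify_object_as_address('pg_class'::regclass, 'some_table'::regclass, 0);
    -- on another server: resolve the same triple back into a local address
    SELECT * FROM pg_get_object_address('table', '{public,some_table}', '{}');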

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] replicating DROP commands across servers

2014-12-29 Thread Alvaro Herrera
(The changes in the regression test are bogus, BTW; I didn't care enough
to get them fixed before sorting out the rest of the mess.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 89c8cbed0072ad4d921128b834fcb4f9e2eb4c33
Author: Alvaro Herrera alvhe...@alvh.no-ip.org
Date:   Mon Dec 22 18:32:43 2014 -0300

array objname/objargs stuff

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 24c64b7..112b6a0 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17772,6 +17772,23 @@ FOR EACH ROW EXECUTE PROCEDURE suppress_redundant_updates_trigger();
  identifier present in the identity is quoted if necessary.
 </entry>
   </row>
+   <row>
+    <entry><literal>address_names</literal></entry>
+    <entry><type>text[]</type></entry>
+    <entry>
+     An array that, together with <literal>address_args</literal>,
+     can be used by the <function>pg_get_object_address()</function> to
+     recreate the object address in a remote server containing an
+     identically named object of the same kind.
+    </entry>
+   </row>
+   <row>
+    <entry><literal>address_args</literal></entry>
+    <entry><type>text[]</type></entry>
+    <entry>
+     Complement for <literal>address_names</literal> above.
+    </entry>
+   </row>
   </tbody>
  </tgroup>
 </informaltable>
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 85079d6..789af5f 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -74,6 +74,7 @@
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/lsyscache.h"
+#include "utils/memutils.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
 
@@ -556,8 +557,9 @@ static void getRelationTypeDescription(StringInfo buffer, Oid relid,
 		   int32 objectSubId);
 static void getProcedureTypeDescription(StringInfo buffer, Oid procid);
 static void getConstraintTypeDescription(StringInfo buffer, Oid constroid);
-static void getOpFamilyIdentity(StringInfo buffer, Oid opfid);
-static void getRelationIdentity(StringInfo buffer, Oid relid);
+static void getOpFamilyIdentity(StringInfo buffer, Oid opfid, List **objname,
+	List **objargs);
+static void getRelationIdentity(StringInfo buffer, Oid relid, List **objname);
 
 /*
  * Translate an object name and arguments (as passed by the parser) to an
@@ -2960,6 +2962,66 @@ pg_identify_object(PG_FUNCTION_ARGS)
 }
 
 /*
+ * SQL-level callable function to obtain object type + identity
+ */
+Datum
+pg_identify_object_as_address(PG_FUNCTION_ARGS)
+{
+	Oid			classid = PG_GETARG_OID(0);
+	Oid			objid = PG_GETARG_OID(1);
+	int32		subobjid = PG_GETARG_INT32(2);
+	ObjectAddress address;
+	char	   *identity;
+	List	   *names;
+	List	   *args;
+	Datum		values[3];
+	bool		nulls[3];
+	TupleDesc	tupdesc;
+	HeapTuple	htup;
+
+	address.classId = classid;
+	address.objectId = objid;
+	address.objectSubId = subobjid;
+
+	/*
+	 * Construct a tuple descriptor for the result row.  This must match this
+	 * function's pg_proc entry!
+	 */
+	tupdesc = CreateTemplateTupleDesc(3, false);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "type",
+	   TEXTOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "object_names",
+	   TEXTARRAYOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 3, "object_args",
+	   TEXTARRAYOID, -1, 0);
+
+	tupdesc = BlessTupleDesc(tupdesc);
+
+	/* object type */
+	values[0] = CStringGetTextDatum(getObjectTypeDescription(&address));
+	nulls[0] = false;
+
+	/* object identity */
+	identity = getObjectIdentityParts(&address, &names, &args);
+	pfree(identity);
+
+	/* object_names */
+	values[1] = PointerGetDatum(strlist_to_textarray(names));
+	nulls[1] = false;
+
+	/* object_args */
+	if (args)
+		values[2] = PointerGetDatum(strlist_to_textarray(args));
+	else
+		values[2] = PointerGetDatum(construct_empty_array(TEXTOID));
+	nulls[2] = false;
+
+	htup = heap_form_tuple(tupdesc, values, nulls);
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
+}
+
+/*
  * Return a palloc'ed string that describes the type of object that the
  * passed address is for.
  *
@@ -3215,7 +3277,7 @@ getProcedureTypeDescription(StringInfo buffer, Oid procid)
 }
 
 /*
- * Return a palloc'ed string that identifies an object.
+ * Obtain a given object's identity, as a palloc'ed string.
  *
  * This is for machine consumption, so it's not translated.  All elements are
  * schema-qualified when appropriate.
@@ -3223,14 +3285,42 @@ getProcedureTypeDescription(StringInfo buffer, Oid procid)
 char *
 getObjectIdentity(const ObjectAddress *object)
 {
+	return getObjectIdentityParts(object, NULL, NULL);
+}
+
+/*
+ * As above, but more detailed.
+ *
+ * There are two sets of return values: the identity itself as a palloc'd
+ * string is returned.  objname and objargs, if not NULL, are output parameters
+ * that receive lists of C-strings that are useful to give 

Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK

2014-12-29 Thread Tom Lane
Adrian Klaver adrian.kla...@aklaver.com writes:
 So it seems you can turn ON_ERROR_ROLLBACK on with either 1 or 'on', but you 
 can only turn it off with 'off'.
 With ON_ERROR_STOP 1/on and 0/off both seem to work. 

 Is this expected?

on_error_stop_hook() uses ParseVariableBool, while
on_error_rollback_hook() uses some hard-coded logic that checks for
"interactive" and "off" and otherwise assumes "on".  This is inconsistent
first in not accepting as many spellings as ParseVariableBool, and second
in not complaining about invalid input --- ParseVariableBool does, so
why not here?
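
A short psql illustration of the resulting asymmetry (hypothetical session,
matching the behaviour described above):

    \set ON_ERROR_ROLLBACK 1      -- accepted, behaves as "on"
    \set ON_ERROR_ROLLBACK 0      -- not recognized as false; silently treated as "on"
    \set ON_ERROR_ROLLBACK off    -- the only spelling that actually turns it off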

echo_hook, echo_hidden_hook, histcontrol_hook, and verbosity_hook are
equally cavalierly written.

For on_error_rollback_hook and echo_hidden_hook, where "on" and "off"
are documented values, I think it would make sense to allow all spellings
accepted by ParseVariableBool, say like this:

 if (newval == NULL)
 pset.on_error_rollback = PSQL_ERROR_ROLLBACK_OFF;
 else if (pg_strcasecmp(newval, "interactive") == 0)
 pset.on_error_rollback = PSQL_ERROR_ROLLBACK_INTERACTIVE;
-else if (pg_strcasecmp(newval, "off") == 0)
-pset.on_error_rollback = PSQL_ERROR_ROLLBACK_OFF;
-else
-pset.on_error_rollback = PSQL_ERROR_ROLLBACK_ON;
+else if (ParseVariableBool(newval))
+pset.on_error_rollback = PSQL_ERROR_ROLLBACK_ON;
+else
+pset.on_error_rollback = PSQL_ERROR_ROLLBACK_OFF;
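
(And presumably echo_hidden_hook would end up looking much the same --- an
untested sketch, with the enum spellings written from memory rather than
checked against settings.h:)

    static void
    echo_hidden_hook(const char *newval)
    {
        /* keep "noexec" special, otherwise accept any ParseVariableBool spelling */
        if (newval == NULL)
            pset.echo_hidden = PSQL_ECHO_HIDDEN_OFF;
        else if (pg_strcasecmp(newval, "noexec") == 0)
            pset.echo_hidden = PSQL_ECHO_HIDDEN_NOEXEC;
        else if (ParseVariableBool(newval))
            pset.echo_hidden = PSQL_ECHO_HIDDEN_ON;
        else
            pset.echo_hidden = PSQL_ECHO_HIDDEN_OFF;
    }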

The other 3 functions don't need to accept on/off I think, but they should
print a warning for unrecognized input like ParseVariableBool does.

And while we're at it, we should consistently allow case-insensitive
input; right now it looks like somebody rolled dice to decide whether
to use strcmp or pg_strcasecmp in these functions.  Per line, not
even per function.

Given the lack of previous complaints, this probably isn't backpatching
material, but it sure seems like a bit of attention to consistency
would be warranted here.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] recovery_min_apply_delay with a negative value

2014-12-29 Thread Fabrízio de Royes Mello
On Sun, Dec 28, 2014 at 12:31 PM, Michael Paquier michael.paqu...@gmail.com
wrote:

 Hi all,

 While reviewing another patch, I have noticed that
recovery_min_apply_delay can have a negative value. And the funny part is
that we actually attempt to apply a delay even in this case, per se this
condition recoveryApplyDelay@xlog.c:
 /* nothing to do if no delay configured */
 if (recovery_min_apply_delay == 0)
 return false;
 Shouldn't we simply leave if recovery_min_apply_delay is lower than 0, and not
only equal to 0?


Seems reasonable.
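
Presumably just something like this in recoveryApplyDelay(), i.e. treat any
non-positive setting as no delay (untested):

    	/* nothing to do if no delay configured */
    	if (recovery_min_apply_delay <= 0)
    		return false;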

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
 Timbira: http://www.timbira.com.br
 Blog: http://fabriziomello.github.io
 Linkedin: http://br.linkedin.com/in/fabriziomello
 Twitter: http://twitter.com/fabriziomello
 Github: http://github.com/fabriziomello


Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Merlin Moncure
On Mon, Dec 29, 2014 at 10:47 AM, Nikita Volkov nikita.y.vol...@mail.ru wrote:
 [combining replies -- nikita, better not to top-post (FYI)]

[combining replies again]

 I'm sorry. I don't know what you mean. I just replied to an email.

http://www.idallen.com/topposting.html

 To prove your statement, you need to demonstrate how a transaction left
 the database in a bad state given concurrent activity without counting
 failures.

 1. Transaction A looks up a row by ID 1 and gets an empty result.
 2. Concurrent transaction B inserts a row with ID 1.
 3. Transaction A goes on with the presumption that a row with ID 1 does not
 exist, because a transaction is supposed to be isolated and because it has
 made sure that the row does not exist. With this presumption it confidently
 inserts a row with ID 1 only to get Postgres report a duplicate key. Wat?

Your understanding of isolation is incorrect.   Transaction A does not
go on with anything -- it's guaranteed to fail in this case.  The only
debatable point here is how exactly it fails.  Again, isolation's job
is to protect the data.

On Mon, Dec 29, 2014 at 10:53 AM, Kevin Grittner kgri...@ymail.com wrote:
 The semantics are so imprecise that Tom argued that we should
 document that transactions should be retried from the start when
 you get the duplicate key error, since it *might* have been caused
 by a race condition.

That sounds off to me also.  In terms of a classic uniqueness
constraint (say, an identifying user name), every violation is
technically a race condition -- whether or not the transactions
overlap in time is completely irrelevant.  Whether the transactions
touching off the error happen to overlap or not is an accident of
timing and irrelevant; a serialization error suggests that the
transaction should be retried when in fact it shouldn't be,
particularly just to get the *actual* error.  What if the transaction
is non-trivial?  Why do we want to bother our users about those
details at all?

Consider the 'idiomatic upsert' as it exists in the documentation (!):

LOOP
-- first try to update the key
UPDATE db SET b = data WHERE a = key;
IF found THEN
RETURN;
END IF;
--   XXX merlin's note: if any dependent table throws a UV,
--   say, via a trigger, this code will loop endlessly
-- not there, so try to insert the key
-- if someone else inserts the same key concurrently,
-- we could get a unique-key failure
BEGIN
INSERT INTO db(a,b) VALUES (key, data);
RETURN;
EXCEPTION WHEN unique_violation THEN
-- do nothing, and loop to try the UPDATE again
END;
END LOOP;

By changing the error code, for decades worth of dealing with this
problem, you've just converted a server side loop to a full round
trip, and, if the user does not automatically retry serialization
failures, broken his/her code.  It's impossible to fix the round trip
issue, at least provably, because there is no way to know for sure
that the serialization failure is coming from this exact insertion, or
say, a dependent trigger (aside: the idiomatic example ought to be
checking the table name!) such that your loop (either here or from
application) would execute a bazillion times until some other
transaction clears.  OK, this is a mostly academic detail, but the
picture is not so clear as you're saying, I think; you're travelling
at high speed in uncertain waters.

The key point here is that OP issued a SELECT first, and he's chaining
DML decisions to the output of that select. He's expecting that SELECT
to be protected via ACID, but it isn't and can't be unless you're
prepared to predicate lock every row selected.  What he wants is for
the database to bounce his transaction because the select lied to him,
but that can't be done obviously.

 I'm curious how heavily you use serializable transactions, because
 I have trouble believing that those who rely on them as their
 primary (or only) strategy for dealing with race conditions under
 high concurrency would take that position.

I don't use them much, admittedly.  That said, I don't use them as
race condition guards.  I use locks or other techniques to manage the
problem.   I tend to build out applications on top of functions and
the inability to set isolation mode inside a function confounds me
from using anything but 'read committed'.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] ON_ERROR_ROLLBACK

2014-12-29 Thread Adrian Klaver

On 12/29/2014 09:55 AM, Tom Lane wrote:

Adrian Klaver adrian.kla...@aklaver.com writes:

So it seems you can turn ON_ERROR_ROLLBACK on with either 1 or 'on', but you 
can only turn it off with 'off'.
With ON_ERROR_STOP 1/on and 0/off both seem to work.



Is this expected?






Given the lack of previous complaints, this probably isn't backpatching
material, but it sure seems like a bit of attention to consistency
would be warranted here.


I would appreciate it, thanks.



regards, tom lane




--
Adrian Klaver
adrian.kla...@aklaver.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Additional role attributes superuser review

2014-12-29 Thread Adam Brightwell
Magnus,

Thanks for the feedback.


 BACKUP - allows role to perform pg_dump* backups of whole database.


 I'd suggest it's called DUMP if that's what it allows, to keep it separate
 from the backup parts.


Makes sense to me.

Those seem like really bad names, IMHO. Why? Because we use WAL and XLOG
 throughout documentation and parameters and code to mean *the same thing*.
 And here they'd suddenly mean different things. If we need them as separate
 privileges, I think we need much better names. (And a better description -
 what is xlog operations really?)


Fair enough, ultimately what I was trying to address is the following
concern raised by Alvaro:

To me, what this repeated discussion on this particular BACKUP point
says, is that the ability to run pg_start/stop_backend and the xlog
related functions should be a different privilege, i.e. something other
than BACKUP; because later we will want the ability to grant someone the
ability to run pg_dump on the whole database without being superuser,
and we will want to use the name BACKUP for that.  So I'm inclined to
propose something more specific for this like WAL_CONTROL or
XLOG_OPERATOR, say.

Upon re-reading it (and other discussions around it) I believe that I must
have misinterpreted.  Initially, I read it to mean that we needed the
following separate permissions.

1) ability to pg_dump
2) ability to start/stop backups
3) ability to execute xlog related functions

When indeed, what it meant was to have the following separate (effectively
merging #2 and #3):

1) ability to pg_dump
2) ability to start/stop backups *and* ability to execute xlog related
functions.

My apologies on that confusion.

Given this clarification:

I think #1 could certainly be answered by using DUMP.  I have no strong
opinion in either direction, though I do think that BACKUP does make the
most sense for #2.  Previously, Stephen had mentioned a READONLY capability
that could effectively work for pg_dump, though, Jim's suggestion of
keeping 'read-all' separate from 'ability to pg_dump' seems logical.  In
either case, I certainly wouldn't mind having a wider agreement/consensus
on this approach.

So, here is a revised list of proposed attributes:

* BACKUP - allows role to perform backup operations (as originally proposed)
* LOG - allows role to rotate log files - remains broad enough to consider
future log related operations
* DUMP -  allows role to perform pg_dump* backups of whole database
* MONITOR - allows role to view pg_stat_* details (as originally proposed)
* PROCSIGNAL - allows role to signal backend processes (as originally
proposed)

Thanks,
Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Kevin Grittner
Merlin Moncure mmonc...@gmail.com wrote:
 On Mon, Dec 29, 2014 at 10:53 AM, Kevin Grittner kgri...@ymail.com wrote:
 The semantics are so imprecise that Tom argued that we should
 document that transactions should be retried from the start when
 you get the duplicate key error, since it *might* have been caused
 by a race condition.

 That sounds off to me also. In terms of a classic uniqueness
 constraint (say, an identifying user name), every violation is
 technically a race condition -- whether or not the transactions
 overlap in time is completely irrelevant.

That is completely bogus. The point is that if you return a
serialization failure the transaction should be immediately retried
from the beginning (including, in many cases, acquiring key
values). If the error was caused by concurrent insertion of a
duplicate key where the transaction does *not* acquire the value
within the transaction, *then* you get the duplicate key error on
the retry.

 If the transactions
 touching off the error happen to overlap or not is an accident of
 timing and irrelevant; a serialization error suggests that the
 transaction should be retried when in fact it shouldn't be,
 particularly just to get the *actual* error. What if the transaction
 is non-trivial? Why do we want to bother our users about those
 details at all?

Where serializable transactions are used to manage race conditions
the users typically do not see them. The application code does not
see them. There is normally some sort of framework (possibly using
dependency injection, although not necessarily) which catches these
and retries the transaction from the start without notifying the
user or letting the application software know that it happened.
There is normally some server-side logging so that high volumes of
rollbacks can be identified and fixed. In a real-world situation
where this was used for 100 production databases running millions
of transactions per day, I saw about 10 or 20 serialization
failures per day across the whole set of database servers.
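
(For the archives: the heart of such a framework is nothing more exotic
than a loop like the one below.  This is a bare-bones libpq sketch --- the
function and callback names are made up, and a real framework would add
logging, back-off and a retry cap:)

    #include <stdbool.h>
    #include <string.h>
    #include <libpq-fe.h>

    /*
     * Run one application transaction, retrying automatically on
     * serialization (40001) and deadlock (40P01) failures.  The work()
     * callback issues the transaction body and returns its last PGresult.
     */
    static bool
    exec_with_retry(PGconn *conn, PGresult *(*work) (PGconn *))
    {
        for (;;)
        {
            PGresult   *res;
            bool        ok = false;
            bool        retry = false;

            PQclear(PQexec(conn, "BEGIN ISOLATION LEVEL SERIALIZABLE"));

            res = work(conn);
            if (res != NULL &&
                (PQresultStatus(res) == PGRES_COMMAND_OK ||
                 PQresultStatus(res) == PGRES_TUPLES_OK))
            {
                PQclear(res);
                res = PQexec(conn, "COMMIT");
                ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
            }

            if (!ok)
            {
                /* serialization_failure or deadlock_detected => retry */
                const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

                retry = (sqlstate != NULL &&
                         (strcmp(sqlstate, "40001") == 0 ||
                          strcmp(sqlstate, "40P01") == 0));
            }
            PQclear(res);

            if (ok)
                return true;

            PQclear(PQexec(conn, "ROLLBACK"));
            if (!retry)
                return false;
        }
    }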

While I have certainly heard of workloads where it didn't work out
that well, Dan Ports found that many common benchmarks perform
about that well. Quoting from the peer-reviewed paper presented in
Istanbul[1]:

| SSI outperforms S2PL for all transaction mixes, and does so by a
| significant margin when the fraction of read-only transactions is
| high. On these workloads, there are more rw-conflicts between
| concurrent transactions, so locking imposes a larger performance
| penalty. (The 100%-read-only workload is a special case; there are
| no lock conflicts under S2PL, and SSI has no overhead because all
| snapshots are safe.) The 150-warehouse configuration (Figure 5b )
| behaves similarly, but the differences are less pronounced: on this
| disk-bound benchmark, CPU overhead is not a factor, and improved
| concurrency has a limited benefit. Here, the performance of SSI
| is indistinguishable from that of SI. Transactions rarely need to
| be retried; in all cases, the serialization failure rate was under
| 0.25%.

 Consider the 'idiomatic upsert' as it exists in the documentation (!):

That documentation should probably be updated to indicate which
isolation levels need that code. If you are relying on
serializable transactions that dance is unnecessary and pointless.
The rule when relying on serializable transactions is that you
write the code to behave correctly if the transaction executing it
is running by itself. Period. No special handling for race
conditions. Detecting race conditions is the job of the database
engine and retrying affected transactions is the job of the
framework. Absolutely nobody who understands serializable
transactions would use that idiom inside of serializable
transactions.

 By changing the error code, for decades worth of dealing with this
 problem, you've just converted a server side loop to a full round
 trip, and, if the user does not automatically retry serialization
 failures, broken his/her code.

If they are not automatically retrying on serialization failures,
they should probably not be using serializable transactions. That
is rather the point. No need for table locks. No need for SELECT
FOR UPDATE. No need to special-case for concurrent transactions.
Just do it. Let the framework retry as needed.

I have no problem with there being people who choose not to use
this approach. What I'm asking is that they not insist on
PostgreSQL being needlessly crippled for those who *do* use it this
way.

 It's impossible to fix the round trip
 issue, at least provably, because there is no way to know for sure
 that the serialization failure is coming from this exact insertion, or
 say, a dependent trigger (aside: the idiomatic example ought to be
 checking the table name!) such that your loop (either here or from
 application) would execute a bazillion times until some other
 transaction clears.

Until the inserting transaction commits, it would block the other
insertion 

Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2014-12-29 Thread Bruce Momjian
On Sat, Dec 27, 2014 at 08:05:42PM -0500, Robert Haas wrote:
 On Wed, Dec 24, 2014 at 11:20 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  I just verified that I can still reproduce the problem:
 
  # aligned case (max_connections=401)
  afreund@axle:~$ pgbench -P 1 -h /tmp/ -p5440 postgres -n -M prepared -c 96 
  -j 96 -T 100 -S
  progress: 1.0 s, 405170.2 tps, lat 0.195 ms stddev 0.928
  progress: 2.0 s, 467011.1 tps, lat 0.204 ms stddev 0.140
  progress: 3.0 s, 462832.1 tps, lat 0.205 ms stddev 0.154
  progress: 4.0 s, 471035.5 tps, lat 0.202 ms stddev 0.154
  progress: 5.0 s, 500329.0 tps, lat 0.190 ms stddev 0.132
 
  BufferDescriptors is at 0x7f63610a6960 (which is 32byte aligned)
 
  # unaligned case (max_connections=400)
  afreund@axle:~$ pgbench -P 1 -h /tmp/ -p5440 postgres -n -M prepared -c 96 
  -j 96 -T 100 -S
  progress: 1.0 s, 202271.1 tps, lat 0.448 ms stddev 1.232
  progress: 2.0 s, 223823.4 tps, lat 0.427 ms stddev 3.007
  progress: 3.0 s, 227584.5 tps, lat 0.414 ms stddev 4.760
  progress: 4.0 s, 221095.6 tps, lat 0.410 ms stddev 4.390
  progress: 5.0 s, 217430.6 tps, lat 0.454 ms stddev 7.913
  progress: 6.0 s, 210275.9 tps, lat 0.411 ms stddev 0.606
  BufferDescriptors is at 0x7f1718aeb980 (which is 64byte aligned)
 
 So, should we increase ALIGNOF_BUFFER from 32 to 64?  Seems like
 that's what these results are telling us.

I am glad someone else considers this important.  Andres reported the
above 2x pgbench difference in February, but no action was taken because
everyone felt there needed to be more performance testing, and that never
happened:


http://www.postgresql.org/message-id/20140202151319.gd32...@awork2.anarazel.de

I have now performance tested this by developing the attached two
patches which both increase the Buffer Descriptors allocation by 64
bytes.  The first patch causes each 64-byte Buffer Descriptor struct to
align on a 32-byte boundary but not a 64-byte boundary, while the second
patch aligns it with a 64-byte boundary.

I tried many tests, including this:

$ pgbench --initialize --scale 1 pgbench
$ pgbench --protocol prepared --client 16 --jobs 16 --transactions 100000 --select-only pgbench

I cannot measure any difference on my dual-CPU-socket, 16-vcore server
(http://momjian.us/main/blogs/pgblog/2012.html#January_20_2012).   I
thought this test would cause the most Buffer Descriptor contention
between the two CPUs.  Can anyone else see a difference when testing
these two patches?  (The patch reports alignment in the server logs.)

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
new file mode 100644
index ff6c713..c4dce5b
*** a/src/backend/storage/buffer/buf_init.c
--- b/src/backend/storage/buffer/buf_init.c
*** InitBufferPool(void)
*** 67,72 
--- 67,73 
  	bool		foundBufs,
  foundDescs;
  
+ 	fprintf(stderr, "Buffer Descriptors size = %ld\n", sizeof(BufferDesc));
  	BufferDescriptors = (BufferDesc *)
  		ShmemInitStruct("Buffer Descriptors",
  		NBuffers * sizeof(BufferDesc), &foundDescs);
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
new file mode 100644
index 2ea2216..669c07f
*** a/src/backend/storage/ipc/shmem.c
--- b/src/backend/storage/ipc/shmem.c
*** ShmemInitStruct(const char *name, Size s
*** 327,332 
--- 327,335 
  	ShmemIndexEnt *result;
  	void	   *structPtr;
  
+ 	if (strcmp(name, "Buffer Descriptors") == 0)
+ 		size = BUFFERALIGN(size) + 64;
+ 
  	LWLockAcquire(ShmemIndexLock, LW_EXCLUSIVE);
  
  	if (!ShmemIndex)
*** ShmemInitStruct(const char *name, Size s
*** 413,418 
--- 416,432 
  			" \"%s\" (%zu bytes requested)",
  			name, size)));
  		}
+ 		if (strcmp(name, "Buffer Descriptors") == 0)
+ 		{
+ 			/* align on 32 */
+ 			if ((int64)structPtr % 32 != 0)
+ structPtr = (void *)((int64)structPtr + 32 - (int64)structPtr % 32);
+ 			/* align on 32 but not 64 */
+ 			if ((int64)structPtr % 64 == 0)
+ structPtr = (void *)((int64)structPtr + 32);
+ 		}
+ 		fprintf(stderr, "shared memory alignment of %s:  %ld-byte\n", name,
+ 			(int64)structPtr % 64 == 0 ? 64 : (int64)structPtr % 64);
  		result->size = size;
  		result->location = structPtr;
  	}
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
new file mode 100644
index ff6c713..c4dce5b
*** a/src/backend/storage/buffer/buf_init.c
--- b/src/backend/storage/buffer/buf_init.c
*** InitBufferPool(void)
*** 67,72 
--- 67,73 
  	bool		foundBufs,
  foundDescs;
  
+ 	fprintf(stderr, "Buffer Descriptors size = %ld\n", sizeof(BufferDesc));
  	BufferDescriptors = (BufferDesc *)
  		ShmemInitStruct("Buffer Descriptors",
  		NBuffers * sizeof(BufferDesc), &foundDescs);
diff --git a/src/backend/storage/ipc/shmem.c 

Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Kevin Grittner
[errata]


Kevin Grittner kgri...@ymail.com wrote:

 Quoting from the peer-reviewed paper presented in Istanbul[1]:

That should have been [3], not [1].


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Additional role attributes superuser review

2014-12-29 Thread Stephen Frost
Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
  I'd suggest it's called DUMP if that's what it allows, to keep it separate
  from the backup parts.
 
 Makes sense to me.

I'm fine calling it 'DUMP', but for different reasons.

We have no (verifiable) idea what client program is being used to
connect and therefore we shouldn't try to tie the name of the client
program to the permission.

That said, a 'DUMP' privilege which allows the user to dump the contents
of the entire database is entirely reasonable.  We need to be clear in
the documentation though- such a 'DUMP' privilege is essentially
granting USAGE on all schemas and table-level SELECT on all tables and
sequences (anything else..?).  To be clear, a user with this privilege
can utilize that access without using pg_dump.

One other point is that this shouldn't imply any other privileges, imv.
I'm specifically thinking of BYPASSRLS- that's independently grantable
and therefore should be explicitly set, if it's intended.  Things
should work 'sanely' with any combination of the two options.
Similairly, DUMP shouldn't imply BACKUP or visa-versa.  In particular,
I'd like to see roles which have only the BACKUP privilege be unable to
directly read any data (modulo things granted to PUBLIC).  This would
allow an unprivileged user to manage backups, kick off ad-hoc ones, etc,
without being able to actually access any of the data (this would
require the actual backup system to have a similar control, but that's
entirely possible with more advanced SANs and enterprise backup
solutions).

  Those seem like really bad names, IMHO. Why? Because we use WAL and XLOG
  throughout documentation and parameters and code to mean *the same thing*.
  And here they'd suddenly mean different things. If we need them as separate
  privileges, I think we need much better names. (And a better description -
  what is xlog operations really?)
 
 
 Fair enough, ultimately what I was trying to address is the following
 concern raised by Alvaro:
 
 To me, what this repeated discussion on this particular BACKUP point
 says, is that the ability to run pg_start/stop_backend and the xlog
 related functions should be a different privilege, i.e. something other
 than BACKUP; because later we will want the ability to grant someone the
 ability to run pg_dump on the whole database without being superuser,
 and we will want to use the name BACKUP for that.  So I'm inclined to
 propose something more specific for this like WAL_CONTROL or
 XLOG_OPERATOR, say.

Note that the BACKUP role attribute was never intended to cover the
pg_dump use-case.  Simply the name of it caused confusion though.  I'm
not sure if adding a DUMP role attribute is sufficient enough to address
that confusion, but I haven't got a better idea either.

 When indeed, what it meant was to have the following separate (effectively
 merging #2 and #3):
 
 1) ability to pg_dump
 2) ability to start/stop backups *and* ability to execute xlog related
 functions.

That sounds reasonable to me (and is what was initially proposed, though
I've come around to the thinking that this BACKUP role attribute should
also allow pg_xlog_replay_pause/resume(), as those can be useful on
replicas).

 Given this clarification:
 
 I think #1 could certainly be answered by using DUMP.  I have no strong
 opinion in either direction, though I do think that BACKUP does make the
 most sense for #2.  Previously, Stephen had mentioned a READONLY capability
 that could effectively work for pg_dump, though, Jim's suggestion of
 keeping 'read-all' separate from 'ability to pg_dump' seems logical.  In
 either case, I certainly wouldn't mind having a wider agreement/consensus
 on this approach.

The read-all vs. ability-to-pg_dump distinction doesn't really exist for
role attributes, as I see it (see my comments above).  That said, having
DUMP or read-all is different from read-*only*, which would probably be
good to have independently.  I can imagine a use-case for a read-only
account which only has read ability for those tables, schemas, etc,
explicitly granted to it.

There is one issue that occurs to me, however.  We're talking about
pg_dump, but what about pg_dumpall?  In particular, I don't think the
DUMP privilege should provide access to pg_authid, as that would allow
the user to bypass the privilege system in some environments by using
the hash to log in as a superuser.  Now, I don't encourage using
password based authentication, especially for superuser accounts, but
lots of people do.  The idea with these privileges is to allow certain
operations to be performed by a non-superuser while preventing trivial
access to superuser.  Perhaps it's pie-in-the-sky, but my hope is to
achieve that.

 So, here is a revised list of proposed attributes:
 
 * BACKUP - allows role to perform backup operations (as originally proposed)
 * LOG - allows role to rotate log files - remains broad enough to consider
 future log related operations
 

Re: [HACKERS] replicating DROP commands across servers

2014-12-29 Thread Alvaro Herrera
Here's a patch that tweaks the grammar to use TypeName in COMMENT,
SECURITY LABEL, and DROP for the type and domain cases.  The required
changes in the code are pretty minimal, thankfully.  Note the slight
changes to the new object_address test also.

I think I'm pretty much done with this now, so I intend to push this
first thing tomorrow and then the rebased getObjectIdentityParts patch
sometime later.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 701a2b7cc3cc6dff865db64ecaeb2fd8bb07b396
Author: Alvaro Herrera alvhe...@alvh.no-ip.org
Date:   Mon Dec 29 18:55:59 2014 -0300

Use TypeName to represent type names in certain commands

In COMMENT, DROP and SECURITY LABEL we were representing types as a list
of names, rather than the TypeName struct that's used everywhere; this
practice also recently found its way into the new pg_get_object_address.

Right now it doesn't affect anything (other than some code cleanliness),
but it is more problematic for future changes that operate with object
addresses from the SQL interface; type details such as array-ness were
being forgotten.

diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 7cb46e1..9ca609d 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -646,13 +646,11 @@ get_object_address(ObjectType objtype, List *objname, List *objargs,
 break;
 			case OBJECT_DOMCONSTRAINT:
 {
-	List		   *domname;
 	ObjectAddress	domaddr;
 	char		   *constrname;
 
-	domname = list_truncate(list_copy(objname), list_length(objname) - 1);
-	constrname = strVal(llast(objname));
-	domaddr = get_object_address_type(OBJECT_DOMAIN, domname, missing_ok);
+	domaddr = get_object_address_type(OBJECT_DOMAIN, objname, missing_ok);
+	constrname = strVal(linitial(objargs));
 
 	address.classId = ConstraintRelationId;
 	address.objectId = get_domain_constraint_oid(domaddr.objectId,
@@ -1291,14 +1289,13 @@ get_object_address_attrdef(ObjectType objtype, List *objname,
  * Find the ObjectAddress for a type or domain
  */
 static ObjectAddress
-get_object_address_type(ObjectType objtype,
-		List *objname, bool missing_ok)
+get_object_address_type(ObjectType objtype, List *objname, bool missing_ok)
 {
 	ObjectAddress address;
 	TypeName   *typename;
 	Type		tup;
 
-	typename = makeTypeNameFromNameList(objname);
+	typename = (TypeName *) linitial(objname);
 
 	address.classId = TypeRelationId;
 	address.objectId = InvalidOid;
@@ -1428,27 +1425,8 @@ pg_get_object_address(PG_FUNCTION_ARGS)
 	 * given object type.  Most use a simple string Values list, but there
 	 * are some exceptions.
 	 */
-	if (type == OBJECT_TYPE || type == OBJECT_DOMAIN)
-	{
-		Datum	*elems;
-		bool	*nulls;
-		int		nelems;
-		TypeName *typname;
-
-		deconstruct_array(namearr, TEXTOID, -1, false, 'i',
-		  &elems, &nulls, &nelems);
-		if (nelems != 1)
-			ereport(ERROR,
-	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-	 errmsg("name list length must be exactly %d", 1)));
-		if (nulls[0])
-			ereport(ERROR,
-	(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-	 errmsg("name or argument lists may not contain nulls")));
-		typname = typeStringToTypeName(TextDatumGetCString(elems[0]));
-		name = typname->names;
-	}
-	else if (type == OBJECT_CAST)
+	if (type == OBJECT_TYPE || type == OBJECT_DOMAIN || type == OBJECT_CAST ||
+		type == OBJECT_DOMCONSTRAINT)
 	{
 		Datum	*elems;
 		bool	*nulls;
@@ -1533,18 +1511,13 @@ pg_get_object_address(PG_FUNCTION_ARGS)
 	 */
 	switch (type)
 	{
-		case OBJECT_DOMCONSTRAINT:
-			if (list_length(name)  2)
-ereport(ERROR,
-		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-		 errmsg("name list length must be at least %d", 2)));
-			break;
 		case OBJECT_LARGEOBJECT:
 			if (list_length(name) != 1)
 ereport(ERROR,
 		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-		 errmsg("name list length must be %d", 1)));
+		 errmsg("name list length must be exactly %d", 1)));
 			break;
+		case OBJECT_DOMCONSTRAINT:
 		case OBJECT_OPCLASS:
 		case OBJECT_OPFAMILY:
 		case OBJECT_CAST:
diff --git a/src/backend/commands/dropcmds.c b/src/backend/commands/dropcmds.c
index 8583581..21a2ae4 100644
--- a/src/backend/commands/dropcmds.c
+++ b/src/backend/commands/dropcmds.c
@@ -264,10 +264,14 @@ does_not_exist_skipping(ObjectType objtype, List *objname, List *objargs)
 	{
 		case OBJECT_TYPE:
 		case OBJECT_DOMAIN:
-			if (!schema_does_not_exist_skipping(objname, &msg, &name))
 			{
-msg = gettext_noop("type \"%s\" does not exist, skipping");
-name = TypeNameToString(makeTypeNameFromNameList(objname));
+TypeName   *typ = linitial(objname);
+
+if (!schema_does_not_exist_skipping(typ->names, &msg, &name))
+{
+	msg = gettext_noop("type \"%s\" does not exist, skipping");
+	name = TypeNameToString(typ);
+}
 			}
 			break;
 		case 

Re: [HACKERS] nls and server log

2014-12-29 Thread Jim Nasby

On 12/28/14, 2:56 AM, Craig Ringer wrote:

On 12/25/2014 02:35 AM, Euler Taveira wrote:

Hi,

Currently the same message goes to server log and client app. Sometimes
it bothers me since I have to analyze server logs and discovered that
lc_messages is set to pt_BR and to worse things that stup^H^H^H
application parse some error messages in portuguese.


IMO logging is simply broken for platforms where the postmaster and all
DBs don't share an encoding. We mix different encodings in log messages
and provide no way to separate them out. Nor is there a way to log
different messages to different files.

It's not just an issue with translations. We mix and mangle encodings of
user-supplied text, like RAISE strings in procs, for example.

We really need to be treating encoding for logging and for the client
much more separately than we currently do. I think any consideration of
translations for logging should be done with the underlying encoding
issues in mind.


Agreed.


My personal opinion is that we should require the server log to be
capable of representing all chars in the encodings used by any DB. Which
in practice means that we always just log in utf-8 if the user wants to
permit DBs with different encodings. An alternative would be one file
per database, always in the encoding of that database.


How much of this issue is caused by trying to machine-parse log files? Is a 
better option to improve that case, possibly doing something like including a 
field in each line that tells you the encoding for that entry?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-29 Thread Jim Nasby

On 12/28/14, 2:45 PM, Peter Geoghegan wrote:

On Sun, Dec 28, 2014 at 12:37 PM, Jeff Davis pg...@j-davis.com wrote:

Do others have similar numbers? I'm quite surprised at how little
work_mem seems to matter for these plans (HashJoin might be a different
story though). I feel like I made a mistake -- can someone please do a
sanity check on my numbers?


I have seen external sorts that were quicker than internal sorts
before. With my abbreviated key patch, under certain circumstances
external sorts are faster, while presumably the same thing is true of
int4 attribute sorts today. Actually, I saw a 10MB work_mem setting
that was marginally faster than a multi-gigabyte one that fit the
entire sort in memory. It probably has something to do with caching
effects dominating over the expense of more comparisons, since higher
work_mem settings that still resulted in an external sort were slower
than the 10MB setting.

I was surprised by this too, but it has been independently reported by
Jeff Janes.


I haven't tested for external faster than internal in a while, but I've 
certainly seen this effect before. Generally, once you get beyond a certain 
size (maybe 100MB?) you run the risk of a tapesort being faster than an 
internal sort.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-12-29 Thread Peter Geoghegan
Hi Jeff,

On Mon, Dec 29, 2014 at 2:29 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Using the vallock2 version of V1.8, using the test I previously described, I
 get some all-null rows, which my code should never create.  Also, the index
 and table don't agree, in this example I find 3 all-null rows in the table,
 but only 2 in the index.  I've attached an example output of querying via
 index and via full table scan, and also the pageinspect output of the blocks
 which have the 3 rows, in case that is helpful.

Interesting. Thanks a lot for your help!

 This was just a straightforward issue of firing queries at the database,
 the crash-inducing part of my test harness was not active during this test.
 I also ran it with my crashing patch reversed out, in case I introduced the
 problem myself, and it still occurs.

 Using V1.7 of the vallock2 patch, I saw the same thing with some all-null
 rows.  I also saw some other issues where two rows with the same key value
 would be present twice in the table (violating the unique constraint) but
 only one of them would appear in the index.  I suspect it is caused by the
 same issue as the all-null rows, and maybe I just didn't run v1.8 enough
 times to find that particular manifestation under v1.8.

This is almost certainly a latent bug with approach #2 to value
locking, that has probably been there all along. Semantics and syntax
have been a recent focus, and so the probability that I introduced a
regression of this nature in any recent revision seems low. I am going
to investigate the problem, and hope to have a diagnosis soon.

Once again, thanks!
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Misaligned BufferDescriptors causing major performance problems on AMD

2014-12-29 Thread Andres Freund
On 2014-12-29 16:59:05 -0500, Bruce Momjian wrote:
 I am glad someone else considers this important.

I do consider it important. I just considered the lwlock scalability
more important...

 Andres reported the above 2x pgbench difference in February, but no
 action was taken as everyone felt there needed to be more performance
 testing, but it never happened:

FWIW, I have no idea what exactly should be tested there. Right now I
think what we should do is to remove the BUFFERALIGN from shmem.c and
instead add explicit alignment code in a couple callers
(BufferDescriptors/Blocks, proc.c stuff).
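
I.e. something along these lines in InitBufferPool() --- untested sketch,
and CACHELINE_SIZE/CACHELINEALIGN() are just placeholder names here for
illustration, built on the existing TYPEALIGN() macro:

    #define CACHELINE_SIZE	64
    #define CACHELINEALIGN(ptr) \
    	((void *) TYPEALIGN(CACHELINE_SIZE, (uintptr_t) (ptr)))

    	/* over-allocate by one cache line, then round the pointer up */
    	BufferDescriptors = (BufferDesc *)
    		CACHELINEALIGN(ShmemInitStruct("Buffer Descriptors",
    									   NBuffers * sizeof(BufferDesc)
    									   + CACHELINE_SIZE,
    									   &foundDescs));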

   $ pgbench --initialize --scale 1 pgbench

Scale 1 isn't particularly helpful in benchmarks, not even read only
ones.

   $ pgbench --protocol prepared --client 16 --jobs 16 --transactions 100000 --select-only pgbench

I'd suspect you're more likely to see differences with a higher client
count. Also, I seriously doubt 100k xacts is enough to get stable
results - even on my laptop I get 100k+ TPS. I'd suggest using something
like -P 1 -T 100 or so.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CATUPDATE confusion?

2014-12-29 Thread Adam Brightwell
All,

On Sat, Dec 27, 2014 at 6:31 PM, Noah Misch n...@leadboat.com wrote:

 On Sat, Dec 27, 2014 at 06:26:02PM -0500, Tom Lane wrote:
  Stephen Frost sfr...@snowman.net writes:
   * Tom Lane (t...@sss.pgh.pa.us) wrote:
   Plan C (remove CATUPDATE altogether) also has some merit.  But adding
 a
   superuser override there would be entirely pointless.
 
   This is be my recommendation.  I do not see the point of carrying the
   option around just to confuse new users of pg_authid and reviewers of
   the code.
 
  Yeah ... if no one's found it interesting in the 20 years since the
  code left Berkeley, it's unlikely that interest will emerge in the
  future.

 No objection here.


Given this discussion, I have attached a patch that removes CATUPDATE for
review/discussion.

One of the interesting behaviors (or perhaps not) is how 'pg_class_aclmask'
handles an invalid role id when checking permissions against 'rolsuper'
instead of 'rolcatupdate'.  This is demonstrated by the
'has_table_privilege' regression test expected results.  In summary,
'has_rolcatupdate' raises an error when an invalid OID is provided,
however, as documented in the source 'superuser_arg' does not, simply
returning false.  Therefore, when checking for superuser-ness of an
non-existent role, the result will be false and not an error.  Perhaps this
is OK, but my concern would be on the expected behavior around items like
'has_table_privilege'.  My natural thought would be that we would want to
preserve that specific functionality, though short of adding a
'has_rolsuper' function that will raise an appropriate error, I'm uncertain
of an approach.  Thoughts?
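
(For reference, such a function would presumably just mirror the old
has_rolcatupdate() --- hypothetical sketch, not part of the attached patch:)

    /* Like superuser_arg(), but error out if the role does not exist. */
    static bool
    has_rolsuper(Oid roleid)
    {
    	bool		rolsuper;
    	HeapTuple	tuple;

    	tuple = SearchSysCache1(AUTHOID, ObjectIdGetDatum(roleid));
    	if (!HeapTupleIsValid(tuple))
    		ereport(ERROR,
    				(errcode(ERRCODE_UNDEFINED_OBJECT),
    				 errmsg("role with OID %u does not exist", roleid)));

    	rolsuper = ((Form_pg_authid) GETSTRUCT(tuple))->rolsuper;

    	ReleaseSysCache(tuple);

    	return rolsuper;
    }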

Thanks,
Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
new file mode 100644
index 9ceb96b..635032d
*** a/doc/src/sgml/catalogs.sgml
--- b/doc/src/sgml/catalogs.sgml
***
*** 1416,1430 
   </row>
  
   <row>
-   <entry><structfield>rolcatupdate</structfield></entry>
-   <entry><type>bool</type></entry>
-   <entry>
-    Role can update system catalogs directly.  (Even a superuser cannot do
-    this unless this column is true)
-   </entry>
-  </row>
- 
-  <row>
    <entry><structfield>rolcanlogin</structfield></entry>
    <entry><type>bool</type></entry>
    <entry>
--- 1416,1421 
*** SELECT * FROM pg_locks pl LEFT JOIN pg_p
*** 8479,8494 
   </row>
  
   <row>
-   <entry><structfield>rolcatupdate</structfield></entry>
-   <entry><type>bool</type></entry>
-   <entry></entry>
-   <entry>
-    Role can update system catalogs directly.  (Even a superuser cannot do
-    this unless this column is true)
-   </entry>
-  </row>
- 
-  <row>
    <entry><structfield>rolcanlogin</structfield></entry>
    <entry><type>bool</type></entry>
    <entry></entry>
--- 8470,8475 
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
new file mode 100644
index d30612c..18623ef
*** a/src/backend/catalog/aclchk.c
--- b/src/backend/catalog/aclchk.c
*** aclcheck_error_type(AclResult aclerr, Oi
*** 3423,3448 
  }
  
  
- /* Check if given user has rolcatupdate privilege according to pg_authid */
- static bool
- has_rolcatupdate(Oid roleid)
- {
- 	bool		rolcatupdate;
- 	HeapTuple	tuple;
- 
- 	tuple = SearchSysCache1(AUTHOID, ObjectIdGetDatum(roleid));
- 	if (!HeapTupleIsValid(tuple))
- 		ereport(ERROR,
- (errcode(ERRCODE_UNDEFINED_OBJECT),
-  errmsg("role with OID %u does not exist", roleid)));
- 
- 	rolcatupdate = ((Form_pg_authid) GETSTRUCT(tuple))->rolcatupdate;
- 
- 	ReleaseSysCache(tuple);
- 
- 	return rolcatupdate;
- }
- 
  /*
   * Relay for the various pg_*_mask routines depending on object kind
   */
--- 3423,3428 
*** pg_class_aclmask(Oid table_oid, Oid role
*** 3620,3627 
  
  	/*
  	 * Deny anyone permission to update a system catalog unless
! 	 * pg_authid.rolcatupdate is set.   (This is to let superusers protect
! 	 * themselves from themselves.)  Also allow it if allowSystemTableMods.
  	 *
  	 * As of 7.4 we have some updatable system views; those shouldn't be
  	 * protected in this way.  Assume the view rules can take care of
--- 3600,3606 
  
  	/*
  	 * Deny anyone permission to update a system catalog unless
! 	 * pg_authid.rolsuper is set.  Also allow it if allowSystemTableMods.
  	 *
  	 * As of 7.4 we have some updatable system views; those shouldn't be
  	 * protected in this way.  Assume the view rules can take care of
*** pg_class_aclmask(Oid table_oid, Oid role
*** 3630,3636 
  	if ((mask & (ACL_INSERT | ACL_UPDATE | ACL_DELETE | ACL_TRUNCATE | ACL_USAGE)) &&
  		IsSystemClass(table_oid, classForm) &&
  		classForm->relkind != RELKIND_VIEW &&
! 		!has_rolcatupdate(roleid) &&
  		!allowSystemTableMods)
  	{
  #ifdef ACLDEBUG
--- 3609,3615 
  	if ((mask & (ACL_INSERT | ACL_UPDATE | ACL_DELETE | ACL_TRUNCATE | ACL_USAGE)) &&
  		

Re: [HACKERS] BUG #12330: ACID is broken for unique constraints

2014-12-29 Thread Jim Nasby

On 12/29/14, 10:53 AM, Kevin Grittner wrote:

Merlin Moncure mmonc...@gmail.com wrote:


Serialization errors only exist as a concession to concurrency
and performance.  Again, they should be returned as sparsely as
possible because they provide absolutely (as Tom pointed
out) zero detail to the application.


That is false.  They provide an *extremely* valuable piece of
information which is not currently available when you get a
duplicate key error -- whether the error occurred because of a race
condition and will not fail for the same cause if retried.


As for missing details like the duplicated key value, is there a reasonable way 
to add that as an errdetail() to a serialization failure error?

We do still have to be careful here though; you could still have code using our 
documented upsert methodology inside a serialized transaction. For example, via 
a 3rd party extension that can't assume you're using serialized transactions. 
Would it still be OK to treat that as a serialization failure and bubble that 
all the way back to the application?

As part of this, we should probably modify our upsert example so it takes 
transaction_isolation into account...


As for the fact that RI violations also don't return a
serialization failure when caused by a race with concurrent
transactions, I view that as another weakness in PostgreSQL.  I
don't think there is a problem curing one without curing the other
at the same time.  I have known of people writing their own
triggers to enforce RI rather than defining FKs precisely so that
they can get a serialization failure return code and do automatic
retry if it is caused by a race condition.  That's less practical
to compensate for when it comes to unique indexes or constraints.


Wow, that's horrible. :(
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Detecting backend failures via libpq / DBD::Pg

2014-12-29 Thread Greg Sabino Mullane

-BEGIN PGP SIGNED MESSAGE-
Hash: RIPEMD160

I am working on enhancing the ping() method of DBD::Pg. The goal of that 
is for a user to be able to determine if the connection to the database 
is still valid. The basic way we do this is to send a simple SELECT via 
PQexec and then check for a valid return value (and when in doubt, we check 
PQstatus). This works fine for most transaction statuses, including idle, 
active, and idle in transaction. It even works for copy in and copy out, 
although it obviously invalidates the current COPY (caveat emptor!). 
The problem comes when ping() is called and we are in a failed transaction. 
After some experimenting, the best solution I found is to send the PQexec, 
and then check if PQresultErrorField(result, 'C') is '25P02'. If it is, then 
all is well, in that the server is still alive. If it is not, then we 
can assume the backend is bad (for example, PQerrorMessage gives a 
could not receive data from server: Bad file descriptor). Being that we 
cannot do a rollback before calling the PQexec, is this a decent solution? 
Can we depend on really serious errors always trumping the expected 25P02?
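
(In libpq terms the test boils down to roughly the following --- a
simplified sketch, with backend_is_alive() just a placeholder name for
what ping() does internally:)

    #include <stdbool.h>
    #include <string.h>
    #include <libpq-fe.h>

    /* true if the backend still answers, even inside a failed transaction
     * (SQLSTATE 25P02 = in_failed_sql_transaction) */
    static bool
    backend_is_alive(PGconn *conn)
    {
        PGresult   *res = PQexec(conn, "SELECT 1");
        bool        alive = false;

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            alive = true;
        else
        {
            const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

            alive = (sqlstate != NULL && strcmp(sqlstate, "25P02") == 0);
        }
        PQclear(res);

        /* belt and braces: also consult the connection status */
        return alive && PQstatus(conn) == CONNECTION_OK;
    }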

- -- 
Greg Sabino Mullane g...@turnstep.com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201412291942
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-BEGIN PGP SIGNATURE-

iEYEAREDAAYFAlSh9QEACgkQvJuQZxSWSsjDMQCg3CO1eyrFXNUnfRbk/rRJmrCl
PEoAnRl+M67kTkuZDi+3zMyVyblLvl9I
=uW6Q
-END PGP SIGNATURE-




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Additional role attributes superuser review

2014-12-29 Thread Jim Nasby

On 12/29/14, 4:01 PM, Stephen Frost wrote:

Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:

I'd suggest it's called DUMP if that's what it allows, to keep it separate
from the backup parts.


Makes sense to me.


I'm fine calling it 'DUMP', but for different reasons.

We have no (verifiable) idea what client program is being used to
connect and therefore we shouldn't try to tie the name of the client
program to the permission.

That said, a 'DUMP' privilege which allows the user to dump the contents
of the entire database is entirely reasonable.  We need to be clear in
the documentation though- such a 'DUMP' privilege is essentially
granting USAGE on all schemas and table-level SELECT on all tables and
sequences (anything else..?).  To be clear, a user with this privilege
can utilize that access without using pg_dump.


Yeah... it may be better to call this something other than DUMP (see below).


One other point is that this shouldn't imply any other privileges, imv.
I'm specifically thinking of BYPASSRLS- that's independently grantable
and therefore should be explicitly set, if it's intended.  Things
should work 'sanely' with any combination of the two options.


That does violate POLA, but it's probably worth doing so...


Similarly, DUMP shouldn't imply BACKUP or vice versa.  In particular,
I'd like to see roles which have only the BACKUP privilege be unable to
directly read any data (modulo things granted to PUBLIC).  This would
allow an unprivileged user to manage backups, kick off ad-hoc ones, etc,
without being able to actually access any of the data (this would
require the actual backup system to have a similar control, but that's
entirely possible with more advanced SANs and enterprise backup
solutions).


Yes, since a binary backup doesn't need to actually read data, it shouldn't 
implicitly allow a user granted that role to either.


Given this clarification:

I think #1 could certainly be answered by using DUMP.  I have no strong
opinion in either direction, though I do think that BACKUP does make the
most sense for #2.  Previously, Stephen had mentioned a READONLY capability
that could effectively work for pg_dump, though, Jim's suggestion of
keeping 'read-all' separate from 'ability to pg_dump' seems logical.  In
either case, I certainly wouldn't mind having a wider agreement/consensus
on this approach.


The read-all vs. ability-to-pg_dump distinction doesn't really exist for
role attributes, as I see it (see my comments above).  That said, having
DUMP or read-all is different from read-*only*, which would probably be
good to have independently.  I can imagine a use-case for a read-only
account which only has read ability for those tables, schemas, etc,
explicitly granted to it.


My specific concern about DUMP vs read all/only is IIRC dump needs to do more 
extensive locking than regular selects do, which can cause production problems.


There is one issue that occurs to me, however.  We're talking about
pg_dump, but what about pg_dumpall?  In particular, I don't think the
DUMP privilege should provide access to pg_authid, as that would allow
the user to bypass the privilege system in some environments by using
the hash to log in as a superuser.  Now, I don't encourage using
password based authentication, especially for superuser accounts, but
lots of people do.  The idea with these privileges is to allow certain
operations to be performed by a non-superuser while preventing trivial
access to superuser.  Perhaps it's pie-in-the-sky, but my hope is to
achieve that.


Ugh, hadn't thought about that. :(


So, here is a revised list of proposed attributes:

* BACKUP - allows role to perform backup operations (as originally proposed)
* LOG - allows role to rotate log files - remains broad enough to consider
future log related operations
* DUMP -  allows role to perform pg_dump* backups of whole database
* MONITOR - allows role to view pg_stat_* details (as originally proposed)
* PROCSIGNAL - allows role to signal backend processes (as originally
 proposed)


Given the confusion that can exist with the data reading stuff, perhaps BINARY 
BACKUP or XLOG would be a better term, especially since this will probably have 
some overlap with streaming replication.

Likewise, if we segregate DUMP and BYPASSRLS then I think we need to call DUMP 
something else. Otherwise, it's a massive foot-gun; you get a successful backup only to 
find out it contains only a small part of the database.

My how this has become a can of worms...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detecting backend failures via libpq / DBD::Pg

2014-12-29 Thread Andrew Gierth
 Greg == Greg Sabino Mullane g...@turnstep.com writes:

 Greg I am working on enhancing the ping() method of DBD::Pg. The
 Greg goal of that is for a user to be able to determine if the
 Greg connection to the database is still valid. The basic way we do
 Greg this is to send a simple SELECT via PQexec

Why not PQexec(conn, "") ?
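
(An empty query string comes back as a PGRES_EMPTY_QUERY result rather than
an error, and IIRC it is accepted even inside an aborted transaction, which
sidesteps the 25P02 special case entirely.  Roughly:

    res = PQexec(conn, "");
    ok = (res != NULL && PQresultStatus(res) == PGRES_EMPTY_QUERY);
    PQclear(res);

It also does essentially no work on the server side.)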

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Additional role attributes superuser review

2014-12-29 Thread Stephen Frost
Jim,

* Jim Nasby (jim.na...@bluetreble.com) wrote:
 On 12/29/14, 4:01 PM, Stephen Frost wrote:
 That said, a 'DUMP' privilege which allows the user to dump the contents
 of the entire database is entirely reasonable.  We need to be clear in
 the documentation though- such a 'DUMP' privilege is essentially
 granting USAGE on all schemas and table-level SELECT on all tables and
 sequences (anything else..?).  To be clear, a user with this privilege
 can utilize that access without using pg_dump.
 
 Yeah... it may be better to call this something other than DUMP (see below).

Did you favor something else below..?  I don't see it.

 One other point is that this shouldn't imply any other privileges, imv.
 I'm specifically thinking of BYPASSRLS- that's independently grantable
 and therefore should be explicitly set, if it's intended.  Things
 should work 'sanely' with any combination of the two options.
 
 That does violate POLA, but it's probably worth doing so...

That really depends on your view of the privilege.  I'm inclined to view
it from the more-tightly-constrained point of view which matches up with
what I anticipate it actually providing.  That doesn't translate into a
short name very well though, unfortunately.

 The read-all vs. ability-to-pg_dump distinction doesn't really exist for
 role attributes, as I see it (see my comments above).  That said, having
 DUMP or read-all is different from read-*only*, which would probably be
 good to have independently.  I can imagine a use-case for a read-only
 account which only has read ability for those tables, schemas, etc,
 explicitly granted to it.
 
 My specific concern about DUMP vs read all/only is IIRC dump needs to do more 
 extensive locking than regular selects do, which can cause production 
 problems.

The locks aren't any more extensive than those of regular selects; it's just
that the entire pg_dump runs inside one large transaction which lasts a
good long time, and that alone can cause issues on high-rate update
systems.

 There is one issue that occurs to me, however.  We're talking about
 pg_dump, but what about pg_dumpall?  In particular, I don't think the
 DUMP privilege should provide access to pg_authid, as that would allow
 the user to bypass the privilege system in some environments by using
 the hash to log in as a superuser.  Now, I don't encourage using
 password based authentication, especially for superuser accounts, but
 lots of people do.  The idea with these privileges is to allow certain
 operations to be performed by a non-superuser while preventing trivial
 access to superuser.  Perhaps it's pie-in-the-sky, but my hope is to
 achieve that.
 
 Ugh, hadn't thought about that. :(

I don't think it kills the whole idea, but we do need to work out how to
address it.

 * BACKUP - allows role to perform backup operations (as originally proposed)
 * LOG - allows role to rotate log files - remains broad enough to consider
 future log related operations
 * DUMP -  allows role to perform pg_dump* backups of whole database
 * MONITOR - allows role to view pg_stat_* details (as originally proposed)
 * PROCSIGNAL - allows role to signal backend processes (as originally
 proposed)
 
 Given the confusion that can exist with the data reading stuff, perhaps 
 BINARY BACKUP or XLOG would be a better term, especially since this will 
 probably have some overlap with streaming replication.

I don't really like 'xlog' as a permission.  BINARY BACKUP is
interesting, but if we're going to go multi-word, I would think we would
do PITR BACKUP or something like FILESYSTEM BACKUP.  I'm not
really a big fan of the two-word options though.

 Likewise, if we segregate DUMP and BYPASSRLS then I think we need to call 
 DUMP something else. Otherwise, it's a massive foot-gun; you get a 
 successful backup only to find out it contains only a small part of the 
 database.

That's actually already dealt with.  pg_dump defaults to setting the
row_security GUC to 'off', in which case you'll get an error if you hit
a table that has RLS enabled and you don't have BYPASSRLS.  If you're
not checking for errors from pg_dump, well, there's a lot of ways you
could run into problems.

 My how this has become a can of worms...

I'm not sure it's as bad as all that, but it does require some
thinking..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] CATUPDATE confusion?

2014-12-29 Thread Stephen Frost
Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
 Given this discussion, I have attached a patch that removes CATUPDATE for
 review/discussion.

Thanks!

 One of the interesting behaviors (or perhaps not) is how 'pg_class_aclmask'
 handles an invalid role id when checking permissions against 'rolsuper'
 instead of 'rolcatupdate'.  This is demonstrated by the
 'has_table_privilege' regression test expected results.  In summary,
 'has_rolcatupdate' raises an error when an invalid OID is provided,
 however, as documented in the source 'superuser_arg' does not, simply
 returning false.  Therefore, when checking for superuser-ness of a
 non-existent role, the result will be false and not an error.  Perhaps this
 is OK, but my concern would be on the expected behavior around items like
 'has_table_privilege'.  My natural thought would be that we would want to
 preserve that specific functionality, though short of adding a
 'has_rolsuper' function that will raise an appropriate error, I'm uncertain
 of an approach.  Thoughts?

This role oid check is only getting called in this case because it's
pg_authid which is getting checked (because it's a system catalog table)
and it's then falling into this one case.  I don't think we need to
preserve that.

If you do the same query with that bogus OID but a normal table, you get
the same 'f' result that the regression test gives with this change.
We're not back-patching this and I seriously doubt anyone is relying on
this special response for system catalog tables.
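
For example (illustrative; any unused OID and any ordinary table with no
PUBLIC grants will do):

  -- A nonexistent role OID checked against an ordinary table already
  -- yields false rather than an error, which is the same 'f' the patched
  -- regression test now shows for the system-catalog case.
  SELECT has_table_privilege(999999999, 'some_ordinary_table', 'SELECT');
  -- returns f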

So, unless anyone wants to object, I'll continue to look over this patch
for eventual application against master.  If there are objections, it'd
be great if they could be voiced sooner rather than later.

Thanks!

Stephen




Re: [HACKERS] nls and server log

2014-12-29 Thread Craig Ringer
On 12/30/2014 06:39 AM, Jim Nasby wrote:

 
 How much of this issue is caused by trying to machine-parse log files?
 Is a better option to improve that case, possibly doing something like
 including a field in each line that tells you the encoding for that entry?

That'd be absolutely ghastly. You couldn't just view the logs with
'less' or a text editor if they had mixed encodings; you'd need
some kind of special PostgreSQL log viewer tool.

Why would we possibly do that when we could just emit utf-8 instead?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Detecting backend failures via libpq / DBD::Pg

2014-12-29 Thread Jim Nasby

On 12/29/14, 6:43 PM, Greg Sabino Mullane wrote:

I am working on enhancing the ping() method of DBD::Pg. The goal of that
is for a user to be able to determine if the connection to the database
is still valid.


This is actually a VERY common thing for monitoring frameworks to do. 
Currently, the general method seems to be something akin to

psql -c 'SELECT 1' ...

Perhaps, instead of going through all the effort you currently do, we could
add a FEBE ping command and expose it through a special psql option (or
maybe a dedicated pg_ping command).

This won't be noticeably faster than SELECT 1, but it would prevent a bunch of
pointless work on the backend side, and should greatly simplify DBD's ping().
The only thing I'm not sure of is whether this could be made safe within a
COPY... :(
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Additional role attributes superuser review

2014-12-29 Thread Jim Nasby

On 12/29/14, 7:22 PM, Stephen Frost wrote:

One other point is that this shouldn't imply any other privileges, imv.
 I'm specifically thinking of BYPASSRLS- that's independently grantable
 and therefore should be explicitly set, if it's intended.  Things
 should work 'sanely' with any combination of the two options.


That does violate POLA, but it's probably worth doing so...

That really depends on your view of the privilege.  I'm inclined to view
it from the more-tightly-constrained point of view which matches up with
what I anticipate it actually providing.  That doesn't translate into a
short name very well though, unfortunately.


READ MOST? ;)


 The read-all vs. ability-to-pg_dump distinction doesn't really exist for
 role attributes, as I see it (see my comments above).  That said, having
 DUMP or read-all is different from read-*only*, which would probably be
 good to have independently.  I can imagine a use-case for a read-only
 account which only has read ability for those tables, schemas, etc,
 explicitly granted to it.


My specific concern about DUMP vs read all/only is IIRC dump needs to do more 
extensive locking than regular selects do, which can cause production problems.

The locks aren't any more extensive than those taken by regular selects;
it's just that the entire pg_dump runs inside one large transaction which
lasts a good long time, and that alone can cause issues on high-rate
update systems.


Good, so really DUMP and READ ALL are the same.


 There is one issue that occurs to me, however.  We're talking about
 pg_dump, but what about pg_dumpall?  In particular, I don't think the
 DUMP privilege should provide access to pg_authid, as that would allow
 the user to bypass the privilege system in some environments by using
 the hash to log in as a superuser.  Now, I don't encourage using
 password based authentication, especially for superuser accounts, but
 lots of people do.  The idea with these privileges is to allow certain
 operations to be performed by a non-superuser while preventing trivial
 access to superuser.  Perhaps it's pie-in-the-sky, but my hope is to
 achieve that.


Ugh, hadn't thought about that. :(

I don't think it kills the whole idea, but we do need to work out how to
address it.


Yeah... it does indicate that we shouldn't call this thing DUMP, but perhaps as
long as we put appropriate HINTs in the error paths that pg_dumpall would hit,
it's OK to call it DUMP. (My specific concern is that it's natural to assume
DUMP would include pg_dumpall.)



Given the confusion that can exist with the data reading stuff, perhaps BINARY 
BACKUP or XLOG would be a better term, especially since this will probably have 
some overlap with streaming replication.

I don't really like 'xlog' as a permission.  BINARY BACKUP is
interesting, but if we're going to go multi-word, I would think we would
do PITR BACKUP or something like FILESYSTEM BACKUP.  I'm not
really a big fan of the two-word options though.


BINARY? Though that's pretty generic. STREAM might be better, even though it's 
not exactly accurate for a simple backup.

Perhaps this is just a documentation issue, but there are enough caveats floating
around here that I'm reluctant to rely on that.


Likewise, if we segregate DUMP and BYPASSRLS then I think we need to call DUMP 
something else. Otherwise, it's a massive foot-gun; you get a successful backup only to 
find out it contains only a small part of the database.

That's actually already dealt with.  pg_dump defaults to setting the
row_security GUC to 'off', in which case you'll get an error if you hit
a table that has RLS enabled and you don't have BYPASSRLS.  If you're
not checking for errors from pg_dump, well, there's a lot of ways you
could run into problems.


This also indicates DUMP is better than something like READ ALL, if we can 
handle the pg_dumpall case.

Based on all this, my inclination is DUMP with appropriate hints for 
pg_dumpall, and BINARY.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-12-29 Thread Peter Geoghegan
On Mon, Dec 29, 2014 at 2:29 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I've attached an example output of querying via index and via full table
 scan, and also the pageinspect output of the blocks which have the 3 rows,
 in case that is helpful.

You might have also included output from pageinspect's bt_page_items()
function. Take a look at the documentation patch I just posted if the
details are unclear.

Thanks
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What exactly is our CRC algorithm?

2014-12-29 Thread Abhijit Menon-Sen
Hi.

I'm re-attaching the two patches as produced by format-patch. I haven't
listed any reviewers. It's either just Andres, or maybe a lot of people.

Is anyone in a position to try out the patches on MSVC and see if they
build and work sensibly, please? (Otherwise it may be better to remove
those bits from the patch for now.)

Thanks.

-- Abhijit
From d5a90d31f9c35fea2636ec6baf6acc66e91fd655 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Tue, 30 Dec 2014 12:41:19 +0530
Subject: Implement slice-by-8 CRC calculation

The COMP_CRC32C macro now calls pg_comp_crc32c(), which processes eight
data bytes at a time. Its output is identical to the byte-at-a-time CRC
code. (This change does not apply to the LEGACY_CRC32 computation.)

Author: Abhijit Menon-Sen
---
 src/include/utils/pg_crc.h|   6 +-
 src/include/utils/pg_crc_tables.h | 578 +-
 src/port/pg_crc.c |  87 ++
 3 files changed, 604 insertions(+), 67 deletions(-)

diff --git a/src/include/utils/pg_crc.h b/src/include/utils/pg_crc.h
index f871cba..55934e5 100644
--- a/src/include/utils/pg_crc.h
+++ b/src/include/utils/pg_crc.h
@@ -41,6 +41,8 @@
 
 typedef uint32 pg_crc32;
 
+extern pg_crc32 pg_comp_crc32c(pg_crc32 crc, const void *data, size_t len);
+
 /*
  * CRC calculation using the CRC-32C (Castagnoli) polynomial.
  *
@@ -51,7 +53,7 @@ typedef uint32 pg_crc32;
#define INIT_CRC32C(crc) ((crc) = 0xFFFFFFFF)
#define FIN_CRC32C(crc)	((crc) ^= 0xFFFFFFFF)
 #define COMP_CRC32C(crc, data, len)	\
-	COMP_CRC32_NORMAL_TABLE(crc, data, len, pg_crc32c_table)
+	((crc) = pg_comp_crc32c((crc), (char *) (data), (len)))
 #define EQ_CRC32C(c1, c2) ((c1) == (c2))
 
 /*
@@ -115,7 +117,7 @@ do {			  \
 } while (0)
 
 /* Constant tables for CRC-32C and CRC-32 polynomials */
-extern CRCDLLIMPORT const uint32 pg_crc32c_table[];
+extern CRCDLLIMPORT const uint32 pg_crc32c_table[8][256];
 extern CRCDLLIMPORT const uint32 pg_crc32_table[];
 
 #endif   /* PG_CRC_H */
diff --git a/src/include/utils/pg_crc_tables.h b/src/include/utils/pg_crc_tables.h
index cb6b470..f814c91 100644
--- a/src/include/utils/pg_crc_tables.h
+++ b/src/include/utils/pg_crc_tables.h
@@ -28,71 +28,519 @@
  * This table is based on the so-called Castagnoli polynomial (the same
  * that is used e.g. in iSCSI).
  */
-const uint32 pg_crc32c_table[256] = {
-	0x, 0xF26B8303, 0xE13B70F7, 0x1350F3F4,
-	0xC79A971F, 0x35F1141C, 0x26A1E7E8, 0xD4CA64EB,
-	0x8AD958CF, 0x78B2DBCC, 0x6BE22838, 0x9989AB3B,
-	0x4D43CFD0, 0xBF284CD3, 0xAC78BF27, 0x5E133C24,
-	0x105EC76F, 0xE235446C, 0xF165B798, 0x030E349B,
-	0xD7C45070, 0x25AFD373, 0x36FF2087, 0xC494A384,
-	0x9A879FA0, 0x68EC1CA3, 0x7BBCEF57, 0x89D76C54,
-	0x5D1D08BF, 0xAF768BBC, 0xBC267848, 0x4E4DFB4B,
-	0x20BD8EDE, 0xD2D60DDD, 0xC186FE29, 0x33ED7D2A,
-	0xE72719C1, 0x154C9AC2, 0x061C6936, 0xF477EA35,
-	0xAA64D611, 0x580F5512, 0x4B5FA6E6, 0xB93425E5,
-	0x6DFE410E, 0x9F95C20D, 0x8CC531F9, 0x7EAEB2FA,
-	0x30E349B1, 0xC288CAB2, 0xD1D83946, 0x23B3BA45,
-	0xF779DEAE, 0x05125DAD, 0x1642AE59, 0xE4292D5A,
-	0xBA3A117E, 0x4851927D, 0x5B016189, 0xA96AE28A,
-	0x7DA08661, 0x8FCB0562, 0x9C9BF696, 0x6EF07595,
-	0x417B1DBC, 0xB3109EBF, 0xA0406D4B, 0x522BEE48,
-	0x86E18AA3, 0x748A09A0, 0x67DAFA54, 0x95B17957,
-	0xCBA24573, 0x39C9C670, 0x2A993584, 0xD8F2B687,
-	0x0C38D26C, 0xFE53516F, 0xED03A29B, 0x1F682198,
-	0x5125DAD3, 0xA34E59D0, 0xB01EAA24, 0x42752927,
-	0x96BF4DCC, 0x64D4CECF, 0x77843D3B, 0x85EFBE38,
-	0xDBFC821C, 0x2997011F, 0x3AC7F2EB, 0xC8AC71E8,
-	0x1C661503, 0xEE0D9600, 0xFD5D65F4, 0x0F36E6F7,
-	0x61C69362, 0x93AD1061, 0x80FDE395, 0x72966096,
-	0xA65C047D, 0x5437877E, 0x4767748A, 0xB50CF789,
-	0xEB1FCBAD, 0x197448AE, 0x0A24BB5A, 0xF84F3859,
-	0x2C855CB2, 0xDEEEDFB1, 0xCDBE2C45, 0x3FD5AF46,
-	0x7198540D, 0x83F3D70E, 0x90A324FA, 0x62C8A7F9,
-	0xB602C312, 0x44694011, 0x5739B3E5, 0xA55230E6,
-	0xFB410CC2, 0x092A8FC1, 0x1A7A7C35, 0xE811FF36,
-	0x3CDB9BDD, 0xCEB018DE, 0xDDE0EB2A, 0x2F8B6829,
-	0x82F63B78, 0x709DB87B, 0x63CD4B8F, 0x91A6C88C,
-	0x456CAC67, 0xB7072F64, 0xA457DC90, 0x563C5F93,
-	0x082F63B7, 0xFA44E0B4, 0xE9141340, 0x1B7F9043,
-	0xCFB5F4A8, 0x3DDE77AB, 0x2E8E845F, 0xDCE5075C,
-	0x92A8FC17, 0x60C37F14, 0x73938CE0, 0x81F80FE3,
-	0x55326B08, 0xA759E80B, 0xB4091BFF, 0x466298FC,
-	0x1871A4D8, 0xEA1A27DB, 0xF94AD42F, 0x0B21572C,
-	0xDFEB33C7, 0x2D80B0C4, 0x3ED04330, 0xCCBBC033,
-	0xA24BB5A6, 0x502036A5, 0x4370C551, 0xB11B4652,
-	0x65D122B9, 0x97BAA1BA, 0x84EA524E, 0x7681D14D,
-	0x2892ED69, 0xDAF96E6A, 0xC9A99D9E, 0x3BC21E9D,
-	0xEF087A76, 0x1D63F975, 0x0E330A81, 0xFC588982,
-	0xB21572C9, 0x407EF1CA, 0x532E023E, 0xA145813D,
-	0x758FE5D6, 0x87E466D5, 0x94B49521, 0x66DF1622,
-	0x38CC2A06, 0xCAA7A905, 0xD9F75AF1, 0x2B9CD9F2,
-	0xFF56BD19, 0x0D3D3E1A, 0x1E6DCDEE, 0xEC064EED,
-	0xC38D26C4, 0x31E6A5C7, 0x22B65633, 0xD0DDD530,
-	0x0417B1DB, 0xF67C32D8, 0xE52CC12C, 0x1747422F,
-	0x49547E0B, 0xBB3FFD08, 0xA86F0EFC, 0x5A048DFF,
-	0x8ECEE914, 0x7CA56A17, 0x6FF599E3, 0x9D9E1AE0,
-	

Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-12-29 Thread Jeff Janes
On Mon, Dec 29, 2014 at 9:12 PM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Dec 29, 2014 at 2:29 PM, Jeff Janes jeff.ja...@gmail.com wrote:
  Using the vallock2 version of V1.8, using the test I previously
 described, I
  get some all-null rows, which my code should never create.  Also, the
 index
  and table don't agree, in this example I find 3 all-null rows in the
 table,
  but only 2 in the index.

 Just to be clear: You haven't found any such issue with approach #1 to
 value locking, right?


Correct, I haven't seen any problems with approach #1



 I'm curious about how long it took you to see the issue with #2. Were
 there any special steps? What were the exact steps involved in turning
 off the hard crash mechanism you mention?


Generally the problem will occur early on in the process, and if not then
it will not occur at all.  I think that is because the table starts out
empty, and so a lot of insertions collide with each other.  Once the table
is more thoroughly populated, most queries take the CONFLICT branch and
therefore two insertion-branches are unlikely to collide.

At its simplest, I just use the count_upsert.pl script and your patch and
forget all the rest of the stuff from my test platform.

So:

pg_ctl stop -D /tmp/data2; rm /tmp/data2 -r;
../torn_bisect/bin/pg_ctl initdb -D /tmp/data2;
../torn_bisect/bin/pg_ctl start -D /tmp/data2 -o --fsync=off -w ;
createdb;
perl count_upsert.pl 8 10

A run of count_upsert.pl 8 10 takes about 30 seconds on my machine (8
core), and if it doesn't create a problem then I just destroy the database
and start over.

The fsync=off is not important; I've seen the problem once without it.  I
just include it because otherwise the run takes a lot longer.

I've attached another version of the count_upsert.pl script, with some more
logging targeted to this particular issue.

The problem shows up like this:

init done at count_upsert.pl line 97.
sum is 1036
count is 9720
seq scan doesn't match index scan  1535 == 1535 and 1 == 6 $VAR1 = [
  [
6535,
-21
  ],
.
(Thousands more lines follow, as it outputs the entire table twice: once
gathered by seq scan, once by bitmap index scan.)

The first three lines are normal; the problem starts with the "seq scan
doesn't match..." line.

In this case the first problem it ran into was that key 1535 was present
once with a count column of 1 (found by seq scan) and once with a count
column of 6 (found by index scan).  It was also in the seq scan with a
count of 6, but the way the comparison works is that it sorts each
representation of the table by the key column value and then stops at the
first difference; in this case the count columns (1 == 6) failed the assertion.
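
The cross-check itself is easy to reproduce by hand against a failed run
(a sketch of the idea; the assumed schema matches the statement in the log
further down: foo(index, count) with a unique index on index):

  -- Read the table once via a forced sequential scan and once via the
  -- index; any key present in one but not the other, or with different
  -- counts, means the index and heap have diverged.
  SET enable_indexscan = off; SET enable_bitmapscan = off;
  SELECT index, count FROM foo ORDER BY index;
  RESET enable_indexscan;     RESET enable_bitmapscan;
  SET enable_seqscan = off;
  SELECT index, count FROM foo ORDER BY index;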

If you get some all-NULL rows, then you will also see Perl warnings issued
when the RETURNING clause starts returning NULLs where none are expected.

The overall pattern seems to be pretty streaky.  It could go 20 iterations
with no problem, and then it will fail several times in a row.  I've seen
this pattern quite a bit with other race conditions as well; I think that
they may be sensitive to how memory gets laid out between CPUs, and that
might depend on some longer-term characteristic of the state of the machine
that survives an initdb.

By the way, I also got a new error message a few times that I think might
be a manifestation of the same thing:

ERROR:  duplicate key value violates unique constraint foo_index_idx
DETAIL:  Key (index)=(6106) already exists.
STATEMENT:  insert into foo (index, count) values ($2,$1) on conflict
(index)
  update set count=TARGET.count + EXCLUDED.count
returning foo.count

Cheers,

Jeff


count_upsert.pl
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers