[HACKERS] preproc.c compilation error

2014-01-09 Thread Rugal Bernstein
Hello all:
   This is my first mail to the list. Yesterday I tried to compile
postgres on my Linux machine, but an error keeps bothering me.

env:

Ubuntu 13.10
Linux rugal-TM8473 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux
gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu9)
bison (GNU Bison) 2.7.12-4996

The checkout should be up to date:

[rugal@rugal-TM8473 postgresql] git pull
Already up-to-date.


{
make -C preproc all
make[4]: Entering directory `/home/rugal/workspace/postgresql/src/interfaces/ecpg/preproc'
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard
-pthread  -D_REENTRANT -D_THREAD_SAFE -D_POSIX_PTHREAD_SEMANTICS
-DECPG_COMPILE -I../include -I../../../../src/interfaces/ecpg/include -I.
-I. -DMAJOR_VERSION=4 -DMINOR_VERSION=10 -DPATCHLEVEL=0
-I../../../../src/include -D_GNU_SOURCE   -c -o preproc.o preproc.c -MMD
-MP -MF .deps/preproc.Po
In file included from preproc.y:15004:0:
pgc.l: In function ‘base_yylex’:
pgc.l:403:24: error: ‘ET_FATAL’ undeclared (first use in this function)
   BEGIN(state_before);
^
pgc.l:403:24: note: each undeclared identifier is reported only once for
each function it appears in
In file included from preproc.y:15004:0:
pgc.l: In function ‘parse_include’:
pgc.l:1366:28: error: ‘ET_FATAL’ undeclared (first use in this function)
if (!yyin)
^
make[4]: *** [preproc.o] Error 1
}



This problem persists even after I use
[./configure --enable-depend],

and I found a solution:

--- a/src/interfaces/ecpg/preproc/type.h
+++ b/src/interfaces/ecpg/preproc/type.h
@@ -186,7 +186,7 @@ struct assignment

 enum errortype
 {
-   ET_WARNING, ET_ERROR, ET_FATAL
+   ET_WARNING, ET_ERROR
 };


After adding ET_FATAL back to this enum, compilation succeeds!
Are there any mistakes or typos in the latest version?

Regards!


Re: [HACKERS] preproc.c compilation error

2014-01-09 Thread Rugal Bernstein
And even after successfully compiling PG, I got a warning:

preproc.y: In function ‘vmmerror’:
preproc.y:76:2: warning: enumeration value ‘ET_FATAL’ not handled in switch
[-Wswitch]
  switch(type)
  ^





Re: [HACKERS] [bug fix] multibyte messages are displayed incorrectly on the client

2014-01-09 Thread Robert Haas
On Tue, Jan 7, 2014 at 8:56 AM, MauMau maumau...@gmail.com wrote:
 I suppose we know (or at least believe) those encodings during backend
 startup:

 * client encoding - the client_encoding parameter passed in the startup
 packet, or if that's not present, client_encoding GUC value.

 * server encoding - the encoding of strings gettext() returns.  That is what
 GetPlatformEncoding() returns.

Suppose the startup packet itself is malformed.  How will you report the error?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread MauMau

From: Andres Freund and...@2ndquadrant.com

On 2014-01-08 14:42:37 -0800, Joshua D. Drake wrote:

If we have the following:

db0 -> db1: down

Using the model (as I understand it) that is being discussed, we have
increased our failure rate, because the moment db1 is down we also lose
db0. The node db0 may be up, but if it isn't going to process
transactions it is useless. I can tell you that I have exactly 0
customers that would want that model, because a single node failure
would cause a double node failure.


That's why you should configure a second standby as another (candidate)
synchronous replica, also listed in synchronous_standby_names.


Let me ask a (probably) stupid question: how is sync rep different from
RAID-1?


When I first saw sync rep, I expected that it would provide the same 
guarantees as RAID-1 in terms of durability (data is always mirrored on two 
servers) and availability (if one server goes down, another server continues 
full service).


The cost is reasonable with RAID-1.  Sync rep requires a high cost ---
three servers --- to get both durability and availability.


Am I expecting too much?


Regards
MauMau





Re: [HACKERS] Failed assertion root->hasLateralRTEs on initsplan.c

2014-01-09 Thread Robert Haas
On Tue, Jan 7, 2014 at 1:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Emre Hasegeli e...@hasegeli.com writes:
 I get an assertion failure on initsplan.c line 1325 while executing the
 following query on HEAD (edc43458d797a5956f4bf39af18cf62abb0077db).
 It works fine without --enable-cassert.

 update subscriber set properties = hstore(a) from (select firstname,
 lastname from player where player.id = subscriber.id) as a;

 Hm, AFAICS this query should absolutely *not* work; the reference to
 subscriber.id inside the sub-select is illegal.  It might be legal with
 LATERAL, but not otherwise.  So I think this is a parser bug, and there's
 nothing wrong with the planner's Assert.  9.2 and earlier throw the
 error I'd expect, so probably something in the LATERAL patches broke
 this case; will look.

 The next question is if we should allow it with LATERAL.  That would
 essentially be treating subscriber as having implicitly appeared at the
 start of the FROM list, which I guess is all right ... but does anyone
 want to argue against it?  I seem to recall some old discussions about
 allowing the update target to be explicitly shown in FROM, in case you
 wanted say to left join it against something else.  Allowing this implicit
 appearance might limit our options if we ever get around to trying to do
 that.  On the other hand, those discussions were a long time back, so
 maybe it'll never happen anyway.

I still think that would be a good thing to do, but I don't see a
problem.  The way I imagine it would work is: if the alias used for
the update target also appears in the FROM clause, then we treat them
as the same thing (after checking that they refer to the same table in
both cases).  Otherwise, we add the update target as an additional
from-list item.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [bug fix] multibyte messages are displayed incorrectly on the client

2014-01-09 Thread MauMau

From: Robert Haas robertmh...@gmail.com
Suppose the startup packet itself is malformed.  How will you report the 
error?


I think we have no choice but to report the error in English, because we 
don't know what the client wants.


Regards
MauMau






Re: [HACKERS] preproc.c compilation error

2014-01-09 Thread Michael Meskes
You have to rebuild the auto-generated pgc.c, preproc.c and preproc.y or simply
remove them to force a rebuild.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at gmail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: [HACKERS] Recovery to backup point

2014-01-09 Thread Heikki Linnakangas

On 12/09/2013 03:05 PM, MauMau wrote:

From: Heikki Linnakangas hlinnakan...@vmware.com

Thanks. Looks sane, although I don't much like the proposed interface
to trigger this, setting recovery_target_time='backup_point'. What the
code actually does is to stop recovery as soon as you reach
consistency, which might not have anything to do with a backup. If you
set it on a warm standby server, for example, it will end recovery as
soon as it reaches consistency, but there was probably no backup taken
at that point.


Thank you for reviewing so rapidly.  I thought I would check the end of
backup in recoveryStopsHere(), by matching XLOG_BACKUP_END and
ControlFile->backupStartPoint for backups taken on the primary, and
comparing the current redo location with ControlFile->backupEndPoint for
backups taken on the standby.  However, that would duplicate much code
in XLOG_BACKUP_END redo processing and checkRecoveryConsistency().
Besides, the code works only when the user explicitly requests recovery
to the backup point, not when he starts the warm standby server.  (I
wonder if I'm answering correctly.)


I was thinking that you have a warm standby server, and you decide to
stop using it as a warm standby and promote it. You'd do that by
stopping it, modifying recovery.conf to remove standby_mode and set a
recovery target, and then restarting.


After some refactoring and fixing bugs in the existing code, I came up 
with the attached patch. I called the option simply recovery_target, 
with the only allowed value of immediate. IOW, if you want to stop 
recovery as early as possible, you add recovery_target='immediate' to 
recovery.conf. Now that we have four different options to set the 
recovery target with, I rearranged the docs slightly. How does this look 
to you?


- Heikki
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index a2361d7..854b5fd 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1124,7 +1124,7 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
    <para>
     If you want to recover to some previous point in time (say, right before
     the junior DBA dropped your main transaction table), just specify the
-    required stopping point in <filename>recovery.conf</>.  You can specify
+    required <link linkend="recovery-target-settings">stopping point</link> in <filename>recovery.conf</>.  You can specify
     the stop point, known as the <quote>recovery target</>, either by
     date/time, named restore point or by completion of a specific transaction
     ID.  As of this writing only the date/time and named restore point options
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 550cdce..a723338 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -199,8 +199,33 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
   <sect1 id="recovery-target-settings">

    <title>Recovery Target Settings</title>
+   <para>
+    By default, recovery will recover to the end of the WAL log. The
+    following parameters can be used to specify an earlier stopping point.
+    At most one of <varname>recovery_target</>,
+    <varname>recovery_target_name</>, <varname>recovery_target_time</>, or
+    <varname>recovery_target_xid</> can be specified.
+   </para>
    <variablelist>

+   <varlistentry id="recovery-target" xreflabel="recovery_target_name">
+    <term><varname>recovery_target</varname><literal> = 'immediate'</literal></term>
+    <indexterm>
+      <primary><varname>recovery_target</> recovery parameter</primary>
+    </indexterm>
+    <listitem>
+     <para>
+      This parameter specifies that recovery should end as soon as a
+      consistency is reached, ie. as early as possible. When restoring from an
+      online backup, this means the point where taking the backup ended.
+     </para>
+     <para>
+      Technically, this is a string parameter, but <literal>'immediate'</literal>
+      is currently the only allowed value.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry id="recovery-target-name" xreflabel="recovery_target_name">
     <term><varname>recovery_target_name</varname>
      (<type>string</type>)
@@ -212,10 +237,6 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
     <para>
      This parameter specifies the named restore point, created with
      <function>pg_create_restore_point()</> to which recovery will proceed.
-     At most one of <varname>recovery_target_name</>,
-     <xref linkend="recovery-target-time"> or
-     <xref linkend="recovery-target-xid"> can be specified.  The default is to
-     recover to the end of the WAL log.
     </para>
    </listitem>
   </varlistentry>
@@ -231,10 +252,6 @@ restore_command = 'copy C:\\server\\archivedir\\%f %p'  # Windows
     <para>
      This parameter specifies the time stamp up to which recovery
      will proceed.
-     At most one of <varname>recovery_target_time</>,
-     <xref linkend="recovery-target-name"> or
-

Re: [HACKERS] Planning time in explain/explain analyze

2014-01-09 Thread Stephen Frost
Andreas, Robert,

* Andreas Karlsson (andr...@proxel.se) wrote:
 A patch with updated documentation is attached.

Thanks for working on this!

 On 01/02/2014 04:08 AM, Robert Haas wrote:
 I'm wondering whether the time should be stored inside the PlannedStmt
 node instead of passing it around separately. One possible problem
 with the way you've done things here is that, in the case of a
 prepared statement, EXPLAIN ANALYZE will emit the time needed to call
 GetCachedPlan(), even if that function didn't do any replanning.  Now
 you could argue that this is the correct behavior, but I think there's
 a decent argument that what we ought to show there is the amount of
 time that was required to create the plan that we're displaying at the
 time it was created, rather than the amount of time that was required
 to figure out that we didn't need to replan.

Agreed, and really, it'd be nice to know *both*.  If we're worried about
the timing cost when going through a no-op GetCachedPlan(), then perhaps
we don't add that, but if we actually *do* re-plan, it'd be handy to
know that and to know the timing it took.

 A minor side benefit of this approach is that you wouldn't need to
 change the signature for ExplainOnePlan(), which would avoid breaking
 extensions that may call it.

Agreed.

 A possible argument against printing the time to create the plan is
 that unless it was created when running EXPLAIN we will not know it.

This is perhaps the biggest point against it, if we keep it this way...

 I do not think we want to always measure the time it took to
 generate a plan due to slow clocks on some architectures. Also I
 feel that such a patch would be more invasive.

The slow-clock argument is really quite frustrating for those of us who
are looking to add more and more metrics to PG.  We're nowhere near the
level that we need to be and it shows (particularly for users coming
from $OTHER-RDBMS).  Perhaps we should try and come up with a solution
to address those cases (turn off metrics ala turning off stats?) while
not preventing us from gathering metrics on more reasonable systems.

Thanks,

Stephen




Re: [HACKERS] array_length(anyarray)

2014-01-09 Thread Dean Rasheed
On 19 December 2013 08:05, Pavel Stehule pavel.steh...@gmail.com wrote:



 2013/12/19 David Fetter da...@fetter.org

 On Wed, Dec 18, 2013 at 09:27:54PM +0100, Marko Tiikkaja wrote:
  Hi,
 
  Attached is a patch to add support for array_length(anyarray), which
  only works for one-dimensional arrays, returns 0 for empty arrays
  and complains if the array's lower bound isn't 1.  In other words,
  does the right thing when used with the arrays people use 99% of the
  time.

 +1 for adding this.


 +1


I think that having 2 functions called array_length() that each behave
differently for empty arrays would just lead to confusion.

The SQL standard defines a function called cardinality() that returns
the number of elements of a collection (array or multiset), so why
don't we call it that?


 length should be irrelevant to the fact that the array starts from 1, 0
 or anything else

Yes, this should just return the number of elements, and 0 for an empty array.

How it should behave for multi-dimensional arrays is less clear, but
I'd argue that it should return the total number of elements, i.e.
cardinality('{{1,2},{3,4}}'::int[][]) = 4. That would make it
consistent with the choices we've already made for unnest() and
ordinality:
 - cardinality(foo) = (select count(*) from unnest(foo)).
 - unnest with ordinality would always result in ordinals in the range
[1, cardinality].
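
To make the proposal concrete, here is a small SQL illustration (the
cardinality() results are the *proposed* semantics, shown as comments,
since no such function exists yet; the other queries work today):

select array_length('{}'::int[], 1);                    -- NULL today
select count(*) from unnest('{{1,2},{3,4}}'::int[][]); -- 4
-- proposed: cardinality('{}'::int[]) = 0
-- proposed: cardinality('{{1,2},{3,4}}'::int[][]) = 4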

Regards,
Dean




Re: [HACKERS] nested hstore patch

2014-01-09 Thread Andrew Dunstan





On 01/08/2014 04:29 PM, Oleg Bartunov wrote:

Attached is a new version of the patch, which addresses most issues raised
by Andres.

It's the long holidays in Russia now, and it happens that Teodor is
traveling with his family, so he asked me to reply. Comments in the code
will be added ASAP.


Oleg,

Please merge in the jsonb work and resubmit. See
https://github.com/feodor/postgres/commits/jsonb_and_hstore - I note that
this repo apparently does not contain any of your latest changes.


cheers

andrew





Re: [HACKERS] preproc.c compilation error

2014-01-09 Thread Rugal Bernstein
Thanks!  It seems the problem was that pgc.c and the other files were
generated the first time, but [make distclean] did not remove them,
which led to this problem.
After [rm pgc.c preproc.h preproc.c preproc.y],
it now compiles successfully without any warning.  Thank you!

Java Developer; Mysql/Oracle DBA; C/Java/Python/Bash; Vim; Linux; Pianist;
Hadoop/Spark/Storm
blog http://rugal.ml
github https://github.com/Rugal
twitter https://twitter.com/ryujinwrath


On Thu, Jan 9, 2014 at 9:45 PM, Michael Meskes mes...@postgresql.org wrote:

 You have to rebuild the auto-generated pgc.c, preproc.c and preproc.y or
 simply
 remove them to force a rebuild.

 Michael
 --
 Michael Meskes
 Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
 Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
 Jabber: michael.meskes at gmail dot com
 VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL



Re: [HACKERS] preproc.c compilation error

2014-01-09 Thread Andres Freund
On 2014-01-09 22:12:53 +0800, Rugal Bernstein wrote:
 Thanks!  It seems the problem was that pgc.c and the other files were
 generated the first time, but [make distclean] did not remove them,
 which led to this problem.
 After [rm pgc.c preproc.h preproc.c preproc.y],
 it now compiles successfully without any warning.  Thank you!

maintainer-clean removes those; distclean doesn't, because they are
distributed so that people without flex/bison can compile postgres.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2014-01-09 Thread Amit Kapila
On Fri, Dec 6, 2013 at 6:41 PM, Amit Kapila amit.kapil...@gmail.com wrote:
Agreed. A summarization of the data for LZ/chunk-wise encoding, especially
for non-compressible (hundred tiny fields, all changed/half changed) or
less compressible data (hundred tiny fields, half nulled), w.r.t. CPU
 usage is as below:

a. For hard disk, there is an overhead of 7~16% with LZ delta encoding
and there is an overhead of 5~8% with Chunk wise encoding.

b. For Tempfs (which means operate on RAM as disk), there is an
 overhead of 19~26%
with LZ delta encoding and there is an overhead of 9~18% with
 Chunk wise encoding.

 There might be some variation of data (in your last mail the overhead
 for chunkwise method for Tempfs was  12%),
 but in general the data suggests that chunk wise encoding has less
 overhead than LZ encoding for non-compressible data
 and for others it is better or equal.

 Now, I think we have below options for this patch:
 a. If the performance overhead for worst case is acceptable (we can
 try to reduce some more, but don't think it will be something
 drastic),
 then this can be done without any flag/option.
 b. Have it with table level option to enable/disable WAL compression
 c. Drop this patch, as for worst cases there is some performance overhead.
 d. Go back and work more on it, if there is any further suggestions
 for improvement.

Based on the data posted previously for both approaches
(lz_delta, chunk_wise_encoding) and the above options, I have improved
the last version of the patch by keeping the chunk-wise approach and
providing a table-level option to the user.

Changes in this version of the patch:
--
1. Implemented decoding; it is quite similar to pglz_decompress, as
the format used to store encoded data has not changed much.

2. Provided a new reloption to specify WAL compression
for update operations on a table:
Create table tbl(c1 char(100)) With (compress_wal = true);

Alternative options:
a. compress_wal could take an operation as input, e.g. 'insert', 'update'
b. use alternate syntax:
Create table tbl(c1 char(100))  Compress Wal For Update;
c. anything better?

3. Fixed the below 2 defects in encoding:
a. In function pgrb_find_match(), if the last byte of a chunk matched,
it considered the whole chunk as a match.
b. If there was no match, it copied the chunk as-is to the encoded data,
but while copying, it ignored the last byte.
Due to these defect fixes, the data can vary, but I don't think there
can be any major change.

Points to consider
-

1. As the current algorithm stores the entry for same chunks at the head
   of the list, it will always find the last-but-one chunk (we don't
   store the last 4 bytes) for a long matching string during the match
   phase in encoding (pgrb_delta_encode).

   We can improve this either by storing same chunks at the end of the
   list instead of at the head, or by trying to find a good_match
   technique as used in the LZ algorithm. The good_match technique can
   have overhead in some cases when there is actually no match.

2. Another optimization we can do in pgrb_find_match() is that currently,
if the first chunk (the chunk found via the hash index) doesn't match, it
continues to look for a match in other chunks. I am not sure there is any
benefit in searching other chunks if the first one doesn't match.

3. We can move the code from pg_lzcompress.c to some new file
pg_rbcompress.c. If we move it, we need to either duplicate some common
macros like pglz_out_tag, or keep them common but perhaps rename them.

4. Decide on the min and max chunk sizes (currently kept as 2 and 4,
   respectively). The point to consider is that if we keep bigger chunk
   sizes, it can save us CPU cycles but yield less reduction in WAL; on
   the other side, if we keep them small, it can yield better WAL
   reduction but consume more CPU cycles.

5. Kept a GUC variable 'wal_update_compression_ratio' for test purposes;
   we can remove it before commit.

6. Docs need to be updated, and tab completion needs some work.

7. We can extend ALTER TABLE to set the compress option for a table.


Thoughts/Suggestions?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


pgrb_delta_encoding_v3.patch
Description: Binary data



Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Hannu Krosing
On 01/09/2014 05:09 AM, Robert Treat wrote:
 On Wed, Jan 8, 2014 at 6:15 PM, Josh Berkus j...@agliodbs.com wrote:
 Stephen,


 I'm aware, my point was simply that we should state, up-front in
 25.2.7.3 *and* where we document synchronous_standby_names, that it
 requires at least three servers to be involved to be a workable
 solution.
 It's a workable solution with 2 servers.  That's a low-availability,
 high-integrity solution; the user has chosen to double their risk of
 not accepting writes against never losing a write.  That's a perfectly
 valid configuration, and I believe that NTT runs several applications
 this way.

 In fact, that can already be looked at as a kind of auto-degrade mode:
 if there aren't two nodes, then the database goes read-only.

 Might I also point out that transactions are synchronous or not
 individually?  The sensible configuration is for only the important
 writes being synchronous -- in which case auto-degrade makes even less
 sense.

 I really think that demand for auto-degrade is coming from users who
 don't know what sync rep is for in the first place.  The fact that other
 vendors are offering auto-degrade as a feature instead of the ginormous
 foot-gun it is adds to the confusion, but we can't help that.

 I think the problem here is that we tend to have a limited view of
 the right way to use synch rep. If I have 5 nodes, and I set 1
 synchronous and the other 3 asynchronous, I've set up a known
 successor in the event that the leader fails. 
But there is no guarantee that the synchronous replica actually
is ahead of the async ones.

 In this scenario
 though, if the successor fails, you actually probably want to keep
 accepting writes; since you weren't using synchronous for durability
 but for operational simplicity. I suspect there are probably other
 scenarios where users are willing to trade latency for improved and/or
 directed durability but not at the extent of availability, don't you?

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Hannu Krosing
On 01/09/2014 12:05 AM, Stephen Frost wrote:
 * Andres Freund (and...@2ndquadrant.com) wrote:
 On 2014-01-08 17:56:37 -0500, Stephen Frost wrote:
 * Andres Freund (and...@2ndquadrant.com) wrote:
 That's why you should configure a second standby as another (candidate)
 synchronous replica, also listed in synchronous_standby_names.
 Perhaps we should stress in the docs that this is, in fact, the *only*
 reasonable mode in which to run with sync rep on?  Where there are
 multiple replicas, because otherwise Drake is correct that you'll just
 end up having both nodes go offline if the slave fails.
 Which, as it happens, is actually documented.
 I'm aware, my point was simply that we should state, up-front in
 25.2.7.3 *and* where we document synchronous_standby_names, that it
 requires at least three servers to be involved to be a workable
 solution.

 Perhaps we should even log a warning if only one value is found in
 synchronous_standby_names...
You can have only one name in synchronous_standby_names and
have multiple slaves connecting with that name.

Also, I can attest that I have had clients who want exactly that - a system
stop until admin intervention in case of a designated sync standby failing.

And they actually run more than one standby, they just want to make
sure that sync rep to 2nd data center always happens.


Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Hannu Krosing
On 01/08/2014 11:49 PM, Tom Lane wrote:
 Joshua D. Drake j...@commandprompt.com writes:
 On 01/08/2014 01:55 PM, Tom Lane wrote:
 Sync mode is about providing a guarantee that the data exists on more than
 one server *before* we tell the client it's committed.  If you don't need
 that guarantee, you shouldn't be using sync mode.  If you do need it,
 it's not clear to me why you'd suddenly not need it the moment the going
 actually gets tough.
 As I understand it what is being suggested is that if a subscriber or 
 target goes down, then the master will just sit there and wait. When I 
 read that, I read that the master will no longer process write 
 transactions. If I am wrong in that understanding then cool. If I am not 
 then that is a serious problem with a production scenario. There is an 
 expectation that a master will continue to function if the target is 
 down, synchronous or not.
 Then you don't understand the point of sync mode, and you shouldn't be
 using it.  The point is *exactly* to refuse to commit transactions unless
 we can guarantee the data's been replicated.
For a single-host scenario, this would be similar to asking for
a mode which turns fsync=off in case of disk failure :)


Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Hannu Krosing
On 01/09/2014 01:57 PM, MauMau wrote:
 From: Andres Freund and...@2ndquadrant.com
 On 2014-01-08 14:42:37 -0800, Joshua D. Drake wrote:
 If we have the following:

 db0 -> db1: down

 Using the model (as I understand it) that is being discussed, we have
 increased our failure rate, because the moment db1 is down we also lose
 db0. The node db0 may be up, but if it isn't going to process
 transactions it is useless. I can tell you that I have exactly 0
 customers that would want that model, because a single node failure
 would cause a double node failure.

 That's why you should configure a second standby as another (candidate)
 synchronous replica, also listed in synchronous_standby_names.

 Let me ask a (probably) stupid question: how is sync rep different
 from RAID-1?

 When I first saw sync rep, I expected that it would provide the same
 guarantees as RAID-1 in terms of durability (data is always mirrored
 on two servers) and availability (if one server goes down, another
 server continues full service).
What you describe is more like async rep.

Sync rep makes sure that data is always replicated before confirming to
the writer.


Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Hannu Krosing
On 01/09/2014 02:01 AM, Jim Nasby wrote:
 On 1/8/14, 6:05 PM, Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
 On 01/08/2014 03:27 PM, Tom Lane wrote:
 What we lack, and should work on, is a way for sync mode to have
 M larger
 than one.  AFAICS, right now we'll report commit as soon as
 there's one
 up-to-date replica, and some high-reliability cases are going to
 want
 more.
 Sync N times is really just a guarantee against data loss as long as
 you lose N-1 servers or fewer.  And it becomes an even
 lower-availability solution if you don't have at least N+1 replicas.
 For that reason, I'd like to see some realistic actual user demand
 before we take the idea seriously.
 Sure.  I wasn't volunteering to implement it, just saying that what
 we've got now is not designed to guarantee data survival across failure
 of more than one server.  Changing things around the margins isn't
 going to improve such scenarios very much.

 It struck me after re-reading your example scenario that the most
 likely way to figure out what you had left would be to see if some
 additional system (think Nagios monitor, or monitors) had records
 of when the various database servers went down.  This might be
 what you were getting at when you said logging, but the key point
 is it has to be logging done on an external server that could survive
 failure of the database server.  postmaster.log ain't gonna do it.

 Yeah, and I think that the logging command that was suggested allows
 for that *if configured correctly*.
*But* to rely on this, we would also need to make logging *synchronous*,
which would probably not go down well with many people, as it makes things
even more fragile from an availability viewpoint (and slower as well).

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] Recovery to backup point

2014-01-09 Thread MauMau

From: Heikki Linnakangas hlinnakan...@vmware.com

After some refactoring and fixing bugs in the existing code, I came up
with the attached patch. I called the option simply recovery_target,
with the only allowed value of immediate. IOW, if you want to stop
recovery as early as possible, you add recovery_target='immediate' to
recovery.conf. Now that we have four different options to set the
recovery target with, I rearranged the docs slightly. How does this look
to you?


I'm almost comfortable with your patch.  There are two comments:

C1. The following parts seem to have been mistakenly taken from my patch.
They are not necessary for your patch, are they?


@@ -6238,6 +6277,10 @@ StartupXLOG(void)
 			ereport(LOG,
 					(errmsg("starting point-in-time recovery to XID %u",
 							recoveryTargetXid)));
+		else if (recoveryTarget == RECOVERY_TARGET_TIME &&
+				 recoveryTargetTime == 0)
+			ereport(LOG,
+					(errmsg("starting point-in-time recovery to backup point")));
 		else if (recoveryTarget == RECOVERY_TARGET_TIME)
 			ereport(LOG,
 					(errmsg("starting point-in-time recovery to %s",
@@ -6971,6 +7017,22 @@ StartupXLOG(void)
 			if (switchedTLI && AllowCascadeReplication())
 				WalSndWakeup();

+			/*
+			 * If we have reached the end of base backup during recovery
+			 * to the backup point, exit redo loop.
+			 */
+			if (recoveryTarget == RECOVERY_TARGET_TIME &&
+				recoveryTargetTime == 0 && reachedConsistency)
+			{
+				if (recoveryPauseAtTarget)
+				{
+					SetRecoveryPause(true);
+					recoveryPausesHere();
+				}
+				reachedStopPoint = true;
+				break;
+			}
+
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(record))
 			{
@@ -7116,6 +7178,9 @@ StartupXLOG(void)
 					 "%s transaction %u",
 					 recoveryStopAfter ? "after" : "before",
 					 recoveryStopXid);
+		else if (recoveryTarget == RECOVERY_TARGET_TIME &&
+				 recoveryStopTime == 0)
+			snprintf(reason, sizeof(reason), "at backup point");
 		else if (recoveryTarget == RECOVERY_TARGET_TIME)
 			snprintf(reason, sizeof(reason),
 					 "%s %s\n",


C2. recovery_target = 'immediate' sounds less intuitive than my suggestion
recovery_target_time = 'backup_point', at least for those who want to
recover to the backup point.
Although I don't have a good naming sense in English, the value should be
a noun, not an adjective like 'immediate', because the value specifies
the target (point) of recovery.


Related to C2, I wonder if users would understand the following part of
the documentation.


+This parameter specifies that recovery should end as soon as a
+consistency is reached, ie. as early as possible.

The subsequent sentence clarifies the use case of recovery from an online
backup, but in what use cases would users specify this parameter?  For
example, when would they face the following situation?



I was thinking that you have a warm standby server, and you decide to
stop using it as a warm standby and promote it. You'd do that by
stopping it, modifying recovery.conf to remove standby_mode and set a
recovery target, and then restarting.



Regards
MauMau





Re: [HACKERS] [bug fix] ECPG app crashes due to SIGBUS on SPARC Solaris

2014-01-09 Thread Michael Meskes
On Sun, Jan 05, 2014 at 03:42:42PM +0900, MauMau wrote:
 I ran the ECPG regression test with the unpatched 64-bit PostgreSQL
 9.2.4 on SPARC Solaris, and it succeeded (all 54 tests passed).  For
 ...

Thanks a lot. Patch applied to HEAD and all the back branches. Will push shortly.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at gmail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-09 Thread Dean Rasheed
On 15 December 2013 01:57, Tom Lane t...@sss.pgh.pa.us wrote:
 Josh Berkus j...@agliodbs.com writes:
 I think even the FLOAT case deserves some consideration.  What's the
 worst-case drift?

 Complete loss of all significant digits.

 The case I was considering earlier of single-row windows could be made
 safe (I think) if we apply the negative transition function first, before
 incorporating the new row(s).  Then for example if you've got float8 1e20
 followed by 1, you compute (1e20 - 1e20) + 1 and get the right answer.
 It's not so good with two-row windows though:

 Table    correct sum of       negative-transition
          this + next value    result
 1e20     1e20                 1e20 + 1 = 1e20
 1        1                    1e20 - 1e20 + 0 = 0
 0

 In general, folks who do aggregate operations on
 FLOATs aren't expecting an exact answer, or one which is consistent
 beyond a certain number of significant digits.

 Au contraire.  People who know what they're doing expect the results
 to be what an IEEE float arithmetic unit would produce for the given
 calculation.  They know how the roundoff error ought to behave, and they
 will not thank us for doing a calculation that's not the one specified.
 I will grant you that there are plenty of clueless people out there
 who *don't* know this, but they shouldn't be using float arithmetic
 anyway.

 And Dave is right: how many bug reports would we get about NUMERIC is
 fast, but FLOAT is slow?

 I've said this before, but: we can make it arbitrarily fast if we don't
 have to get the right answer.  I'd rather get "it's slow" complaints
 than "this is the wrong answer" complaints.


Hi,

Reading over this, I realised that there is a problem with NaN
handling --- once the state becomes NaN, it can never recover. So the
results using the inverse transition function don't match HEAD in
cases like this:

create table t(a int, b numeric);
insert into t values(1,1),(2,2),(3,'NaN'),(4,3),(5,4);
select a, b,
   sum(b) over(order by a rows between 1 preceding and current row)
  from t;

which in HEAD produces:

 a |  b  | sum
---+-+-
 1 |   1 |   1
 2 |   2 |   3
 3 | NaN | NaN
 4 |   3 | NaN
 5 |   4 |   7
(5 rows)

but with this patch produces:

 a |  b  | sum
---+-+-
 1 |   1 |   1
 2 |   2 |   3
 3 | NaN | NaN
 4 |   3 | NaN
 5 |   4 | NaN
(5 rows)
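
The root of the stickiness is simply that numeric arithmetic cannot undo
a NaN, so the inverse transition can never bring the state back:

select 'NaN'::numeric - 3;  -- NaN, so the running sum stays NaN even
                            -- after the NaN row leaves the frame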

Regards,
Dean




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Tomonari Katsumata
Hi,

Is somebody reading this thread?

This problem still seems to remain on REL9_3_STABLE.
Many users could face this problem, so we should
resolve it in the next release.

I think his patch is a reasonable fix for this problem.

Please check this again.

regards,
--
Tomonari Katsumata



2013/12/12 Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp

 Hello, we happened to see a server crash on archive recovery under
 a certain condition.

 After the TLI is incremented, there can be a case where the WAL
 file for the older timeline has been archived, but the file with the
 same segment id for the newer timeline has not. Archive recovery
 fails in that case with a PANIC error like the following:

 | PANIC: record with zero length at 0/1820D40

 A replay script is attached. This issue occurred on 9.4dev and 9.3.2,
 but not on 9.2.6 and 9.1.11. The latter search pg_xlog for the
 current TLI before trying the archive for older TLIs.

 This occurs while fetching the checkpoint redo record in archive
 recovery.

  if (checkPoint.redo < RecPtr)
  {
    /* back up to find the record */
    record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

 And this is caused by the segment file for the older timeline in the
 archive directory being preferred to that for the newer timeline in
 pg_xlog.

 Looking into pg_xlog before trying the older TLIs in the archive, as
 9.2 and earlier do, fixes this issue. The attached patch is one
 possible solution for 9.4dev.

 Attached files are:

  - recvtest.sh: Replay script. Steps 1 and 2 create the condition
    and step 3 triggers the issue.

  - archrecvfix_20131212.patch: The patch that fixes the issue. With it,
    archive recovery reads pg_xlog before trying an older TLI in the
    archive, similarly to 9.1 and earlier.

 regards,

 --
 Kyotaro Horiguchi
 NTT Open Source Software Center






Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread MauMau

From: Hannu Krosing ha...@2ndquadrant.com

On 01/09/2014 01:57 PM, MauMau wrote:

Let me ask a (probably) stupid question: how is sync rep different
from RAID-1?

When I first saw sync rep, I expected that it would provide the same
guarantees as RAID-1 in terms of durability (data is always mirrored
on two servers) and availability (if one server goes down, another
server continues full service).

What you describe is more like async rep.

Sync rep makes sure that data is always replicated before confirming to
the writer.


Really?  RAID-1 is async?

Regards
MauMau






Re: [HACKERS] WIP patch (v2) for updatable security barrier views

2014-01-09 Thread Tom Lane
Dean Rasheed dean.a.rash...@gmail.com writes:
 My first thought was that it should just preprocess any security
 barrier quals in subquery_planner() in the same way as other quals are
 preprocessed. But thinking about it further, those quals are destined
 to become the quals of subqueries in the range table, so we don't
 actually want to preprocess them at that stage --- that will happen
 later when the new subquery is planned by recursion back into
 subquery_planner(). So I think the right answer is to make
 adjust_appendrel_attrs() handle recursion into sublink subqueries.

TBH, this sounds like doubling down on a wrong design choice.  I see
no good reason that updatable security views should require any
fundamental rearrangements of the order of operations in the planner;
and I doubt that this is the last bug you'll have if you insist on
doing that.

regards, tom lane




Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Amit Langote
On Thu, Dec 12, 2013 at 11:00 AM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Hello, we happened to see a server crash on archive recovery under
 a certain condition.

 After the TLI is incremented, there can be a case where the WAL
 file for the older timeline has been archived, but the file with the
 same segment id for the newer timeline has not. Archive recovery
 fails in that case with a PANIC error like the following:

 | PANIC: record with zero length at 0/1820D40

 A replay script is attached. This issue occurred on 9.4dev and 9.3.2,
 but not on 9.2.6 and 9.1.11. The latter search pg_xlog for the
 current TLI before trying the archive for older TLIs.

 This occurs while fetching the checkpoint redo record in archive
 recovery.

  if (checkPoint.redo < RecPtr)
  {
    /* back up to find the record */
    record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);

 And this is caused by the segment file for the older timeline in the
 archive directory being preferred to that for the newer timeline in
 pg_xlog.

 Looking into pg_xlog before trying the older TLIs in the archive, as
 9.2 and earlier do, fixes this issue. The attached patch is one
 possible solution for 9.4dev.

 Attached files are:

  - recvtest.sh: Replay script. Steps 1 and 2 create the condition
    and step 3 triggers the issue.

  - archrecvfix_20131212.patch: The patch that fixes the issue. With it,
    archive recovery reads pg_xlog before trying an older TLI in the
    archive, similarly to 9.1 and earlier.

 regards,


Horiguchi-san,

I wonder if the following commit in the 9.2 branch was to address a
similar/same problem:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=4aed94f1660fb55bc825bf7f3135379dab28eb55

--
Amit Langote




Re: [HACKERS] Failed assertion root->hasLateralRTEs on initsplan.c

2014-01-09 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 7, 2014 at 1:15 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The next question is if we should allow it with LATERAL.  That would
 essentially be treating subscriber as having implicitly appeared at the
 start of the FROM list, which I guess is all right ... but does anyone
 want to argue against it?  I seem to recall some old discussions about
 allowing the update target to be explicitly shown in FROM, in case you
 wanted say to left join it against something else.  Allowing this implicit
 appearance might limit our options if we ever get around to trying to do
 that.  On the other hand, those discussions were a long time back, so
 maybe it'll never happen anyway.

 I still think that would be a good thing to do, but I don't see a
 problem.  The way I imagine it would work is: if the alias used for
 the update target also appears in the FROM clause, then we treat them
 as the same thing (after checking that they refer to the same table in
 both cases).  Otherwise, we add the update target as an additional
 from-list item.

Um, well, no; this does make it harder.  Consider

update t1 ... from lateral (select...) ss join (t1 left join ...)

You propose that we identify t1 in the sub-JOIN clause with the target
table.  What if we have already resolved some outer references in
subselect ss as belonging to t1?  Now we have an illegal reference
structure in the FROM clause, which is likely to lead to all sorts
of grief.

I'm sure we could forbid this combination of features, with some klugy
parse-time check or other, but it feels like we started from wrong
premises somewhere.

It might be better if we simply didn't allow lateral references to the
target table for now.  We could introduce them in combination with the
other feature, in which case we could say that the lateral reference has
to be to an explicit reference to the target table in FROM, ie, if you
want a lateral reference to t1 in ss you must write

update t1 ... from t1 join lateral (select...) ss;

The fly in the ointment is that we've already shipped a couple of
9.3.x releases that allowed lateral references to the target table.
Even though this wasn't suggested or documented anywhere, somebody
might be relying on it already.

I'm inclined though to pull it back anyway, now that I've thought
about it some more.

regards, tom lane




Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Hannu Krosing
On 01/09/2014 04:15 PM, MauMau wrote:
 From: Hannu Krosing ha...@2ndquadrant.com
 On 01/09/2014 01:57 PM, MauMau wrote:
 Let me ask a (probably) stupid question: how is sync rep different
 from RAID-1?

 When I first saw sync rep, I expected that it would provide the same
 guarantees as RAID-1 in terms of durability (data is always mirrored
 on two servers) and availability (if one server goes down, another
 server continues full service).
 What you describe is more like async rep.

 Sync rep makes sure that data is always replicated before confirming to
 the writer.

 Really?  RAID-1 is async?
Not exactly, as there is no master, just a controller writing to two
equal disks.

But having a degraded mode makes it more like async - it continues even
with a single disk and syncs later, if and when the 2nd disk comes back.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-09 Thread Tom Lane
Dean Rasheed dean.a.rash...@gmail.com writes:
 Reading over this, I realised that there is a problem with NaN
 handling --- once the state becomes NaN, it can never recover. So the
 results using the inverse transition function don't match HEAD in
 cases like this:

Ouch!  That takes out numeric, float4, and float8 in one fell swoop.

Given the relative infrequency of NaNs in most data, it seems like
it might still be possible to get a speedup if we could use inverse
transitions until we hit a NaN, then do it the hard way until the
NaN is outside the window, then go back to inverse transitions.
I'm not sure though if this is at all practical from an implementation
standpoint.  We certainly don't want the core code knowing about
anything as datatype-specific as a NaN, but maybe the inverse transition
function could have an API that allows reporting "I can't do it here,
fall back to the hard way".
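
As a strawman, the API could be as simple as letting the inverse
transition function return NULL to mean it cannot invert this removal.
A hedged SQL sketch (the function name and the NULL-means-fallback
convention are hypothetical, not an existing API):

CREATE FUNCTION numeric_sum_inv(state numeric, x numeric)
RETURNS numeric LANGUAGE sql AS $$
    -- Returning NULL would tell the executor to recompute the
    -- aggregate from scratch (the hard way) instead of inverting.
    SELECT CASE WHEN x = 'NaN'::numeric THEN NULL
                ELSE state - x END;
$$;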

regards, tom lane




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-09 Thread Florian Pflug
On Jan 9, 2014, at 17:15, Tom Lane t...@sss.pgh.pa.us wrote:
 Dean Rasheed dean.a.rash...@gmail.com writes:
 Reading over this, I realised that there is a problem with NaN
 handling --- once the state becomes NaN, it can never recover. So the
 results using the inverse transition function don't match HEAD in
 cases like this:
 
 Ouch!  That takes out numeric, float4, and float8 in one fell swoop.

For numeric, it seems that this could be overcome by having the state
be a pair (s numeric, n numeric). s would track the sum of the non-NaN
summands and n would track the number of NaN summands. The final
function would return NaN if n > 0 and s otherwise. The pair could
be represented as a value of type numeric[] to avoid having to invent
a new type for this.
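
A minimal sketch of that state representation (the function names are
made up for illustration, and NULL-input handling is omitted for
brevity):

CREATE FUNCTION nan_sum_trans(state numeric[], x numeric)
RETURNS numeric[] LANGUAGE sql AS $$
    -- state[1]: sum of the non-NaN summands; state[2]: count of NaNs
    SELECT CASE WHEN x = 'NaN'::numeric
                THEN ARRAY[state[1], state[2] + 1]
                ELSE ARRAY[state[1] + x, state[2]] END;
$$;

CREATE FUNCTION nan_sum_inv(state numeric[], x numeric)
RETURNS numeric[] LANGUAGE sql AS $$
    SELECT CASE WHEN x = 'NaN'::numeric
                THEN ARRAY[state[1], state[2] - 1]
                ELSE ARRAY[state[1] - x, state[2]] END;
$$;

CREATE FUNCTION nan_sum_final(state numeric[])
RETURNS numeric LANGUAGE sql AS $$
    -- NaN wins as long as any NaN summand remains in the window
    SELECT CASE WHEN state[2] > 0 THEN 'NaN'::numeric ELSE state[1] END;
$$;

An aggregate built on these with initial state ARRAY[0,0] would then
recover once the NaN rows have left the window.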

For float4 and float8, wasn't the consensus that the potential
lossiness of addition makes this impossible anyway, even without the
NaN issue? But...

 Given the relative infrequency of NaNs in most data, it seems like
 it might still be possible to get a speedup if we could use inverse
 transitions until we hit a NaN, then do it the hard way until the
 NaN is outside the window, then go back to inverse transitions.
 I'm not sure though if this is at all practical from an implementation
 standpoint.  We certainly don't want the core code knowing about
 anything as datatype-specific as a NaN, but maybe the inverse transition
 function could have an API that allows reporting I can't do it here,
 fall back to the hard way.

that sounds like it might be possible to make things work for float4
and float8 after all, if we can determine whether a particular addition
was lossy or not.

best regards,
Florian Pflug








Re: [HACKERS] array_length(anyarray)

2014-01-09 Thread Florian Pflug
On Jan 9, 2014, at 14:57, Dean Rasheed dean.a.rash...@gmail.com wrote:
 On 19 December 2013 08:05, Pavel Stehule pavel.steh...@gmail.com wrote:
 length should be irrelevant to the fact that the array starts from 1, 0
 or anything else
 
 Yes, this should just return the number of elements, and 0 for an empty array.

+1. Anything that complains about arrays whose lower bound isn't 1 really
needs a *way* less generic name than array_length(). 

 
 How it should behave for multi-dimensional arrays is less clear, but
 I'd argue that it should return the total number of elements, i.e.
 cardinality('{{1,2},{3,4}}'::int[][]) = 4. That would make it
 consistent with the choices we've already made for unnest() and
 ordinality:
 - cardinality(foo) = (select count(*) from unnest(foo)).
 - unnest with ordinality would always result in ordinals in the range
 [1, cardinality].

+1
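
For concreteness, under that definition one would expect (cardinality()
here being the proposed function, not something that exists yet):

SELECT cardinality('{{1,2},{3,4}}'::int[]);  -- 4
SELECT cardinality('[0:2]={7,8,9}'::int[]);  -- 3; lower bound irrelevant
SELECT cardinality('{}'::int[]);             -- 0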

best regards,
Florian Pflug







Re: [HACKERS]

2014-01-09 Thread Robert Haas
On Tue, Jan 7, 2014 at 10:55 PM, Dilip kumar dilip.ku...@huawei.com wrote:
 The attached patch implements the following TODO item:

 machine-readable pg_controldata?

 http://www.postgresql.org/message-id/4b901d73.8030...@agliodbs.com

 Possible approaches:

 1.   Implement as a backend function and provide a view to the user.

I think this would be useful.

 -  But in this approach the user can only get this information when the
 server is running.

That is true, but we also have the command-line tool for when it isn't.

 2.   Extend the pg_controldata tool to provide the value of an individual
 attribute.

 A first draft version of the patch is attached in the mail, implemented
 using approach 2.

I think this is really ugly, and I don't see what it accomplishes.  If
the user wants only one setting from pg_controldata, they can just
grep for the line that contains that value.  You mention that the
patch skips printing the name of the field, avoiding parsing
difficulties for the user, but the parsing here is trivial: just skip
everything up to the first colon, plus any subsequent whitespace.
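
E.g. any of the standard text tools handles that (an illustration only):

pg_controldata "$PGDATA" | awk -F': *' '/^Catalog version/ { print $2 }'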

It's worth keeping in mind that the information in pg_controldata is,
by and large, not something we expect users to need all the time.  Are
there use cases for wanting to look at it?  Sure.  But not a ton of
them.  If we thought that everyone who works with PostgreSQL would
need to write a script to fetch, say, the catalog version number, then
it might be worth having pg_controldata --catalog-version to make that
easy so that everyone doesn't have to write pg_controldata | grep
'^Catalog version' | sed 's/.*: *//', but considering that it's only
something that comes up rarely, I don't really think it's worth the
extra code and documentation to add an option for it.  And similarly
for the other pg_controldata fields.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Robert Haas
On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 VACUUM cleans up blocks, which is nice because it happens offline in a
 lazy manner.

 We also make SELECT clean up blocks as it goes. That is useful in OLTP
 workloads, but it means that large SQL queries and pg_dump effectively
 do much the same work as VACUUM, generating huge amounts of I/O and
 WAL on the master, the cost and annoyance of which is experienced
 directly by the user. That is avoided on standbys.

On a pgbench workload, though, essentially all page cleanup happens as
a result of HOT cleanups, like 99.9%.  It might be OK to have that
happen for write operations, but it would be a performance disaster if
updates didn't try to HOT-prune.  Our usual argument for doing HOT
pruning even on SELECT cleanups is that not doing so pessimizes
repeated scans, but there are clearly cases that end up worse off as a
result of that decision.

I'm not entirely wild about adding a parameter in this area because it
seems that we're increasingly choosing to further expose what arguably
ought to be internal implementation details.  The recent wal_log_hints
parameter is another recent example of this that I'm not thrilled
with, but in that case, as in this one, I can see the value of it.
Still, I think it'd be loads better to restrict what you're talking
about here to the SELECT-only case; I have a strong feeling that this
will be a disaster on write workloads.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] array_length(anyarray)

2014-01-09 Thread Marko Tiikkaja

On 1/9/14 5:44 PM, Florian Pflug wrote:

On Jan 9, 2014, at 14:57, Dean Rasheed dean.a.rash...@gmail.com wrote:

On 19 December 2013 08:05, Pavel Stehule pavel.steh...@gmail.com wrote:

length should be irrelevant to the fact that the array starts from 1, 0
or anything else


Yes, this should just return the number of elements, and 0 for an empty array.


+1. Anything that complains about arrays whose lower bound isn't 1 really
needs a *way* less generic name than array_length().


Problem is, if you're operating on an array which could have a lower
bound that isn't 1, why would you look at the length in the first place?
You can't access any elements by index; you'd need to look at
array_lower().  You can't iterate over the array by index; you'd need to
do array_lower() .. array_lower() + array_length() - 1, which doesn't
make sense.  And then there's the myriad of stuff you can do with
unnest() without actually having to look at the length.  Same goes for
multi-dimensional arrays: you have even fewer things you can do there
with only a length.


So if we give up these constraints, we also make this function 
completely useless.



Regards,
Marko Tiikkaja




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-09 Thread Tom Lane
Florian Pflug f...@phlo.org writes:
 For float4 and float8, wasn't the consensus that the potential
 lossy-ness of addition makes this impossible anyway, even without the
 NaN issue? But...

Well, that was my opinion, I'm not sure if it was consensus ;-).
But NaN is an orthogonal problem I think.  I'm not sure whether it
has analogues in other data types.

regards, tom lane




Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Bruce Momjian
On Thu, Jan  9, 2014 at 04:55:22PM +0100, Hannu Krosing wrote:
 On 01/09/2014 04:15 PM, MauMau wrote:
  From: Hannu Krosing ha...@2ndquadrant.com
  On 01/09/2014 01:57 PM, MauMau wrote:
  Let me ask a (probably) stupid question.  How is the sync rep
  different from RAID-1?
 
  When I first saw sync rep, I expected that it would provide the same
  guarantees as RAID-1 in terms of durability (data is always mirrored
  on two servers) and availability (if one server goes down, another
  server continues full service).
  What you describe is more like async rep.
 
  Sync rep makes sure that data is always replicated before confirming to
  the writer.
 
  Really?  RAID-1 is async?
 Not exactly, as there is no master, just a controller writing to two
 equal disks.
 
 But having a degraded mode makes it more like async - it continues even
 with a single disk and syncs later, if and when the 2nd disk comes back.

I think RAID-1 is a very good comparison because it is successful
technology and has similar issues.

RAID-1 is like Postgres synchronous_standby_names mode in the sense that
the RAID-1 controller will not return success until writes have happened
on both mirrors, but it is unlike synchronous_standby_names in that it
will degrade and continue writes even when it can't write to both
mirrors.  What is being discussed is to allow the RAID-1 behavior in
Postgres.

One issue that came up in discussions is the insufficiency of writing a
degrade notice in a server log file, because the log file isn't durable
across server failures, meaning you don't know if a fail-over to the slave
lost commits.  The degrade message has to be stored durably against a
server failure, e.g. on a pager, probably using a command like we do for
archive_command, and has to return success before the server continues
in degrade mode.  I assume degraded RAID-1 controllers inform
administrators in the same way.

I think RAID-1 controllers operate successfully with this behavior
because they are seen as durable and authoritative in reporting the
status of mirrors, while with Postgres, there is no central authority
that can report that degrade status of master/slaves.

Another concern with degrade mode is that once Postgres enters degrade
mode, how does it get back to synchronous_standby_names mode?  We could
have each commit wait for the timeout before continuing, but that is
going to make degrade mode unusably slow.  Would there be an admin
command?  With a timeout to force degrade mode, a temporary network
outage could cause degrade mode, while our current behavior would
recover synchronous_standby_names mode once the network was repaired.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread Robert Haas
On Wed, Jan 8, 2014 at 2:39 PM, knizhnik knizh...@garret.ru wrote:
 I wonder what is the intended use case of dynamic shared memory?
 Is it primarily oriented toward PostgreSQL extensions, or will it also
 be used in PostgreSQL core?

My main motivation is that I want to use it to support parallel query.
There is unfortunately quite a bit of work left to be done before we
can make that a reality, but that's the goal.

 Maybe I am wrong, but I do not see a reason for the same extension to
 create multiple DSM segments.

Right.

 And the total number of DSM segments is expected to be not very large
 (< 10). The same is true for the synchronization primitives (LWLocks,
 for example) needed to synchronize access to these DSM segments. So I am
 not sure if the possibility to place locks in DSM is really so critical...
 We could just reserve some space for LWLocks to be used by extensions,
 so that LWLockAssign() can be used without RequestAddinLWLocks, or
 RequestAddinLWLocks can be used not only from preloaded extensions.

If you're doing all of this at postmaster startup time, that all works
fine.  If you want to be able to load up an extension on the fly, then
it doesn't.  You can only RequestAddinLWLocks() at postmaster start
time, not afterwards, so currently any extension that wants to use
lwlocks has to be loaded at postmaster startup time, or you're out of
luck.

Well.  Technically we reserve something like 3 extra lwlocks that
could be assigned later.  But relying on those to be available is not
very reliable, and also, 3 is not very many, considering that we have
something north of 32k core lwlocks in the default configuration.

 IMHO the main trouble with DSM is the lack of a guarantee that a segment
 is always mapped to the same virtual address.
 Without such a guarantee it is not possible to use direct (normal)
 pointers inside DSM.
 But there seems to be no reasonable solution.

Yeah, that basically sucks.  But it's very hard to do any better.  At
least on a 64-bit platform, there's an awful lot of address space
available, and in theory it ought to be possible to find a portion of
that address space that isn't in use by any Postgres process and have
all of the backends map the shared memory segment there.  But there's
no portable way to do that, and it seems like it would require an
awful lot of IPC to achieve consensus on where to put a new mapping.

On non-Windows platforms, Noah had the idea that we could reserve a large
chunk of address space mapped as PROT_NONE and then overwrite it with
mappings later as needed.  However, I'm not sure how portable that is
or whether it'll cause performance consequences (like page table
bloat) if the space doesn't end up getting used (or if it does).  And
unless you have an awful lot of space available, it's hard to be sure
that new mappings are going to fit.  And then there's Windows.

It would be nice to have better operating system support for this.
For example, IIUC, 64-bit Linux has 128TB of address space available
for user processes.  When you clone(), it can either share the entire
address space (i.e. it's a thread) or none of it (i.e. it's a
process).  There's no option to, say, share 64TB and not the other
64TB, which would be ideal for us.  We could then map dynamic shared
memory segments into the shared portion of the address space and do
backend-private allocations in the unshared part.  Of course, even if
we had that, it wouldn't be portable, so who knows how much good it
would do.  But it would be awfully nice to have the option.

I haven't given up hope that we'll some day find a way to make
same-address mappings work, at least on some platforms.  But I don't
expect it to happen soon.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 We also make SELECT clean up blocks as it goes. That is useful in OLTP
 workloads, but it means that large SQL queries and pg_dump effectively
 do much the same work as VACUUM, generating huge amounts of I/O and
 WAL on the master, the cost and annoyance of which is experienced
 directly by the user. That is avoided on standbys.

 On a pgbench workload, though, essentially all page cleanup happens as
 a result of HOT cleanups, like 99.9%.  It might be OK to have that
 happen for write operations, but it would be a performance disaster if
 updates didn't try to HOT-prune.  Our usual argument for doing HOT
 pruning even on SELECT cleanups is that not doing so pessimizes
 repeated scans, but there are clearly cases that end up worse off as a
 result of that decision.

My recollection of the discussion when HOT was developed is that it works
that way not because anyone thought it was beneficial, but simply because
we didn't see an easy way to know when first fetching a page whether we're
going to try to UPDATE some tuple on the page.  (And we can't postpone the
pruning, because the query will have tuple pointers into the page later.)
Maybe we should work a little harder on passing that information down.
It seems reasonable to me that SELECTs shouldn't be tasked with doing
HOT pruning.

 I'm not entirely wild about adding a parameter in this area because it
 seems that we're increasingly choosing to further expose what arguably
 ought to be internal implementation details.

I'm -1 for a parameter as well, but I think that just stopping SELECTs
from doing pruning at all might well be a win.  It's at least worthy
of some investigation.

regards, tom lane




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Robert Haas
On Thu, Jan 9, 2014 at 12:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 We also make SELECT clean up blocks as it goes. That is useful in OLTP
 workloads, but it means that large SQL queries and pg_dump effectively
 do much the same work as VACUUM, generating huge amounts of I/O and
 WAL on the master, the cost and annoyance of which is experienced
 directly by the user. That is avoided on standbys.

 On a pgbench workload, though, essentially all page cleanup happens as
 a result of HOT cleanups, like 99.9%.  It might be OK to have that
 happen for write operations, but it would be a performance disaster if
 updates didn't try to HOT-prune.  Our usual argument for doing HOT
 pruning even on SELECT cleanups is that not doing so pessimizes
 repeated scans, but there are clearly cases that end up worse off as a
 result of that decision.

 My recollection of the discussion when HOT was developed is that it works
 that way not because anyone thought it was beneficial, but simply because
 we didn't see an easy way to know when first fetching a page whether we're
 going to try to UPDATE some tuple on the page.  (And we can't postpone the
 pruning, because the query will have tuple pointers into the page later.)
 Maybe we should work a little harder on passing that information down.
 It seems reasonable to me that SELECTs shouldn't be tasked with doing
 HOT pruning.

 I'm not entirely wild about adding a parameter in this area because it
 seems that we're increasingly choosing to further expose what arguably
 ought to be internal implementation details.

 I'm -1 for a parameter as well, but I think that just stopping SELECTs
 from doing pruning at all might well be a win.  It's at least worthy
 of some investigation.

Unfortunately, there's no categorical answer.  You can come up with
workloads where HOT pruning on selects is a win; just create a bunch
of junk and then read the same pages lots of times in a row.  And you
can also come up with workloads where it's a loss; create a bunch of
junk and then read them just once.  I don't know how easy it's going
to be to set that parameter in a useful way for some particular
environment, and I think that's possibly an argument against having
it.  But the argument that we don't need a parameter because one
behavior is best for everyone is not going to fly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] newlines at end of generated SQL

2014-01-09 Thread Robert Haas
On Wed, Jan 8, 2014 at 10:17 PM, Peter Eisentraut pete...@gmx.net wrote:
 Is there a reason why the programs in src/bin/scripts all put newlines
 at the end of the SQL commands they generate?  This produces useless
 empty lines in the server log (and client output, if selected).

If you're asking whether you can go ahead and fix that, +1 from me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Jeff Janes
On Wed, Jan 8, 2014 at 3:00 PM, Josh Berkus j...@agliodbs.com wrote:

 On 01/08/2014 01:49 PM, Tom Lane wrote:
  Josh Berkus j...@agliodbs.com writes:
  If we really want auto-degrading sync rep, then we'd (at a minimum) need
  a way to determine *from the replica* whether or not it was in degraded
  mode when the master died.  What good do messages to the master log do
  you if the master no longer exists?
 
  How would it be possible for a replica to know whether the master had
  committed more transactions while communication was lost, if the master
  dies without ever restoring communication?  It sounds like pie in the
  sky from here ...

 Oh, right.  Because the main reason for a sync replica degrading is that
 it's down.  In which case it isn't going to record anything.  This would
 still be useful for sync rep candidates, though, and I'll document why
 below.  But first, lemme demolish the case for auto-degrade.

 So here's the case that we can't possibly solve for auto-degrade.
 Anyone who wants auto-degrade needs to come up with a solution for this
 case as a first requirement:


It seems like the only deterministically useful thing to do is to send a
NOTICE to the *client* that "the commit has succeeded, but in degraded
mode, so keep your receipts and have your lawyer's number handy".  Whether
anyone is willing to add code to the client to process that message is
doubtful, as well as whether the client will even ever receive it if we are
in the middle of a major disruption.
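
(In principle a libpq client could catch such a NOTICE with the existing
notice-receiver hook; the message text matched here is of course made up:)

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

static void
degrade_notice(void *arg, const PGresult *res)
{
    const char *msg = PQresultErrorField(res, PG_DIAG_MESSAGE_PRIMARY);

    /* "degraded mode" is a hypothetical marker in the NOTICE text */
    if (msg != NULL && strstr(msg, "degraded mode") != NULL)
        fprintf(stderr, "commit acknowledged without a sync standby\n");
}

/* ... after connecting: */
/* PQsetNoticeReceiver(conn, degrade_notice, NULL); */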

But I think there is a good probabilistic justification for an
auto-degrade mode.  (And really, what else is there?  There are never any
real guarantees of anything.  Maybe none of your replicas ever come back
up.  Maybe none of your customers do, either.)




 1. A data center network/power event starts.

 2. The sync replica goes down.

 3. A short time later, the master goes down.

 4. Data center power is restored.

 5. The master is fried and is a permanent loss.  The replica is ok, though.

 Question: how does the DBA know whether data has been lost or not?


What if he had a way of knowing that some data *has* been lost?  What can
he do about it?  What is the value in knowing it was lost after the fact,
but without the ability to do anything about it?

But let's say that instead of a permanent loss, the master can be brought
back up in a few days after replacing a few components, or in a few weeks
after sending the drives out to clean-room data recovery specialists.
Writing has already failed over to the replica, because you couldn't wait
that long to bring things back up.

Once you get your old master back, you can see if transactions have been
lost, and if they have been you can dump the tables out to a human-readable
format, use PITR to restore a copy of the replica to the point just before
the failover (although I'm not really sure exactly how to identify that
point) and dump that out, then use 'diff' tools to figure out what changes
to the database were lost, consult with the application specialists to
figure out what the application was doing that led to those changes (if
that is not obvious) and business operations people to figure out how to
apply the analogous changes to the top of the database, and a customer
service VP or someone to figure out how to retroactively fix transactions
that were done after the failover which would have been done differently
had the lost transactions not been lost.  Or instead of all that, you
could look at the recovered data and learn that in fact nothing had been
lost, so nothing further needs to be done.

If you were running in async replication mode on a busy server, there is a
virtual certainty that some transactions have been lost.  If you were
running in sync mode with the possibility of auto-degrade, it is far from
certain.  That depends on how long the power event lasted, compared to how
long you had the timeout set to.

Or rather than a data-center-wide power spike, what if your master just
fell over with no drama to the rest of the neighborhood?  Inspection
after the fail-over to the replica shows the RAID controller card failed.
There is no reason to think that a RAID controller, in the process of
failing, would have caused the replication to kick into degraded mode.  You
know from the surviving logs that the master spent 60 seconds total in
degraded mode over the last 3 months, so there is a 99.999% chance no
confirmed transactions were lost.  To be conservative, let's drop it to
99.99% because maybe some unknown mechanism did allow a failing RAID
controller to blip the network card without leaving any evidence behind.
That's a lot better than the chances of lost transactions while in async
replication mode, which could be 99.9% in the other direction.

Cheers,

Jeff


Re: [HACKERS] Add CREATE support to event triggers

2014-01-09 Thread Robert Haas
On Wed, Jan 8, 2014 at 10:27 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Craig Ringer escribió:
 Instead, can't we use your already proposed subclause structure?

 {"authorization": {"authorization_role": "some guy",
                    "output": "AUTHORIZATION %i{authorization_role}"},
  "if_not_exists": {"output": "IF NOT EXISTS"},
  "name": "some schema",
  "output": "CREATE SCHEMA %{if_not_exists} %i{name} %{authorization}"}

 i.e. if_not_exists becomes an object. All clauses are objects, all
 non-object values are user data. (right?). If the clause is absent, the
 output key is the empty string.

 The issue with that (and with your original proposal) is that you can't
 tell what these clauses are supposed to be if they're not present in the
 original query. You can't *enable* IF NOT EXISTS without pulling
 knowledge of that syntax from somewhere else.

 Depending on the problem you intend to solve there, that might be fine.

 Hmm.  This seems like a reasonable thing to do, except that I would like
 the output to always be the constant, and have some other way to
 enable the clause or disable it.  With a "present" boolean, so:

 "if_not_exists": {"output": "IF NOT EXISTS",
                   "present": true/false}

Why not:

"if_not_exists": true/false

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] integrate pg_upgrade analyze_new_cluster.sh into vacuumdb

2014-01-09 Thread Robert Haas
On Wed, Jan 8, 2014 at 10:32 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Eisentraut pete...@gmx.net writes:
 pg_upgrade creates a script analyze_new_cluster.{sh|bat} that runs
 vacuumdb --analyze-only in three stages with different statistics target
 settings to get a fresh cluster analyzed faster.  I think this behavior
 is also useful for clusters or databases freshly created by pg_restore
 or any other loading mechanism, so it's suboptimal to have this
 constrained to pg_upgrade.

 Therefore, I suggest to add this functionality into the vacuumdb
 program.

 Seems reasonable.

 There are some details to be considered about how pg_upgrade would call
 this.  For example, would we keep creating the script and just have the
 script call vacuumdb with the new option, or would we skip the script
 altogether and just print a message from pg_upgrade?  Also, pg_upgrade
 contains logic to run a vacuum (not only analyze) in the final run when
 upgrading from PostgreSQL 8.4 to deal with the freespace map.  Not sure
 how to adapt that; maybe just keep the script and run a non-analyze
 vacuum after the analyze.

 I don't think this vacuumdb feature should deal with any
 version-conversion issues.  So it sounds like the thing to do is keep the
 wrapper script, which will give us a place to put any such special actions
 without having to kluge up vacuumdb's behavior.  That'll avoid breaking
 scripts that users might've built for using pg_upgrade, too.

I guess I don't see what's wrong with kludging up vacuumdb.  It's not
like that's a very complicated utility; what will be hurt by a few
more options?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-01-09 Thread Florian Pflug
On Jan 9, 2014, at 18:09, Tom Lane t...@sss.pgh.pa.us wrote:
 Florian Pflug f...@phlo.org writes:
 For float4 and float8, wasn't the consensus that the potential
 lossy-ness of addition makes this impossible anyway, even without the
 NaN issue? But...
 
 Well, that was my opinion, I'm not sure if it was consensus ;-).

I'd say your example showing how it could produce completely bogus
results was pretty convincing...

 But NaN is an orthogonal problem I think.  I'm not sure whether it
 has analogues in other data types.

Transfer functions which are partially invertible are not that
uncommon, I'd say. Browsing through 9.3's list of aggregate functions,
the following come to mind

max()
  Values smaller than the maximum can be removed, removing the current
  maximum requires a rescan. By remembering the N largest values,
  the number of required rescans can be reduced, but never fully
  eliminated. Same works for min().

bool_or()
  FALSE can be removed, removing TRUE requires a rescan. Could be made
  fully invertible by counting the number of TRUE and FALSE values,
  similar to my suggestion for how to handle NaN for sum(numeric).
  Same works for bool_and().

bit_or()
  Like bool_or(), 0 can be removed, everything else requires a rescan.
  Same works for bit_and()

Plus, any aggregate with a strict transfer function would be in
exactly the same situation regarding NULL as sum(numeric) is regarding
NaN. AFAIK we don't have any such aggregate in core, though.

best regards,
Florian Pflug






Re: [HACKERS] integrate pg_upgrade analyze_new_cluster.sh into vacuumdb

2014-01-09 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Jan 8, 2014 at 10:32 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't think this vacuumdb feature should deal with any
 version-conversion issues.  So it sounds like the thing to do is keep the
 wrapper script, which will give us a place to put any such special actions
 without having to kluge up vacuumdb's behavior.  That'll avoid breaking
 scripts that users might've built for using pg_upgrade, too.

 I guess I don't see what's wrong with kludging up vacuumdb.  It's not
 like that's a very complicated utility; what will be hurt by a few
 more options?

Carrying kluges forever, and exposing them to users' view.  The particular
example Peter gave was only relevant to upgrades from 8.4; why would we be
putting code into vacuumdb now for that, and expecting to support it
forevermore?  What if the code to fix up something doesn't even *work*
unless we're updating from version M.N?  Putting such code into vacuumdb
means you have to make it bulletproof against other invocation
circumstances, and document what it does (since it's a user-visible
switch), and just in general greatly increases the development overhead.

regards, tom lane




Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Bruce Momjian
On Thu, Jan  9, 2014 at 09:36:47AM -0800, Jeff Janes wrote:
 Oh, right.  Because the main reason for a sync replica degrading is that
 it's down.  In which case it isn't going to record anything.  This would
 still be useful for sync rep candidates, though, and I'll document why
 below.  But first, lemme demolish the case for auto-degrade.
 
 So here's the case that we can't possibly solve for auto-degrade.
 Anyone who wants auto-degrade needs to come up with a solution for this
 case as a first requirement:
 
 
 It seems like the only deterministically useful thing to do is to send a
 NOTICE to the *client* that "the commit has succeeded, but in degraded
 mode, so keep your receipts and have your lawyer's number handy".  Whether
 anyone is willing to add code to the client to process that message is
 doubtful, as well as whether the client will even ever receive it if we
 are in the middle of a major disruption.

I don't think clients are the right place for notification.  Clients
running on a single server could have fsync=off set by the admin or
lying drives and never know it.  I can't imagine a client only willing to
run if synchronous_standby_names is set.

The synchronous slave is something the administrator has set up and is
responsible for, so the administrator should be notified.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +




Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread Claudio Freire
On Thu, Jan 9, 2014 at 2:22 PM, Robert Haas robertmh...@gmail.com wrote:
 It would be nice to have better operating system support for this.
 For example, IIUC, 64-bit Linux has 128TB of address space available
 for user processes.  When you clone(), it can either share the entire
 address space (i.e. it's a thread) or none of it (i.e. it's a
 process).  There's no option to, say, share 64TB and not the other
 64TB, which would be ideal for us.  We could then map dynamic shared
 memory segments into the shared portion of the address space and do
 backend-private allocations in the unshared part.  Of course, even if
 we had that, it wouldn't be portable, so who knows how much good it
 would do.  But it would be awfully nice to have the option.

You can map a segment at fork time, and unmap it after forking. That
doesn't really use RAM, since it's supposed to be lazily allocated (it
can be forced to be so, I believe, with PROT_NONE and MAP_NORESERVE,
but I don't think that's portable).

That guarantees it's free.

Next, you can map shared memory at explicit addresses (linux's mmap
has support for that, and I seem to recall Windows did too).

All you have to do, is some book-keeping in shared memory (so all
processes can coordinate new mappings).
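
A rough sketch of that scheme (Linux-flavored; error handling omitted and
all names hypothetical):

#include <sys/mman.h>

static void *reservation;   /* set up in the postmaster, inherited on fork */

void
reserve_address_space(size_t size)
{
    /* claim address space without actually using RAM */
    reservation = mmap(NULL, size, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

void *
map_segment_at(size_t offset, size_t size, int shm_fd)
{
    /* overwrite part of the inherited reservation with a shared segment,
     * at an offset agreed on via the shared-memory book-keeping */
    return mmap((char *) reservation + offset, size,
                PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_FIXED, shm_fd, 0);
}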




Re: [HACKERS] integrate pg_upgrade analyze_new_cluster.sh into vacuumdb

2014-01-09 Thread Robert Haas
On Thu, Jan 9, 2014 at 12:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jan 8, 2014 at 10:32 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I don't think this vacuumdb feature should deal with any
 version-conversion issues.  So it sounds like the thing to do is keep the
 wrapper script, which will give us a place to put any such special actions
 without having to kluge up vacuumdb's behavior.  That'll avoid breaking
 scripts that users might've built for using pg_upgrade, too.

 I guess I don't see what's wrong with kludging up vacuumdb.  It's not
 like that's a very complicated utility; what will be hurt by a few
 more options?

 Carrying kluges forever, and exposing them to users' view.  The particular
 example Peter gave was only relevant to upgrades from 8.4; why would we be
 putting code into vacuumdb now for that, and expecting to support it
 forevermore?  What if the code to fix up something doesn't even *work*
 unless we're updating from version M.N?  Putting such code into vacuumdb
 means you have to make it bulletproof against other invocation
 circumstances, and document what it does (since it's a user-visible
 switch), and just in general greatly increases the development overhead.

I was referring to the analyze-in-stages logic, which is not specific
to 8.4.  I don't see a reason not to put that into vacuumdb.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Add CREATE support to event triggers

2014-01-09 Thread Alvaro Herrera
Robert Haas wrote:
 On Wed, Jan 8, 2014 at 10:27 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:

  Hmm.  This seems like a reasonable thing to do, except that I would like
  the output to always be the constant, and have some other way to
  enable the clause or disable it.  With a "present" boolean, so:
 
  "if_not_exists": {"output": "IF NOT EXISTS",
                    "present": true/false}
 
 Why not:
 
 "if_not_exists": true/false

Yeah, that's another option.  If we do this, though, the expansion
function would have to know that an if_not_exists element expands to
"IF NOT EXISTS".  Maybe that's okay.  Right now, the expansion function
is pretty stupid, which is nice.
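
That is, the tradeoff is roughly:

{"output": "CREATE SCHEMA %{if_not_exists} %i{name}",
 "if_not_exists": {"output": "IF NOT EXISTS", "present": true},
 "name": "some schema"}

versus

{"output": "CREATE SCHEMA %{if_not_exists} %i{name}",
 "if_not_exists": true,
 "name": "some schema"}

where in the second form the knowledge that true renders as "IF NOT
EXISTS" has to live in the expansion function instead of in the document.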

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] integrate pg_upgrade analyze_new_cluster.sh into vacuumdb

2014-01-09 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I was referring to the analyze-in-stages logic, which is not specific
 to 8.4.  I don't see a reason not to put that into vacuumdb.

Right, that's Peter's core proposal, which I agreed with.  The issue
was what to do with some other steps that pg_upgrade sometimes sticks
into the analyze_new_cluster.sh script.

regards, tom lane




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Alvaro Herrera
Robert Haas wrote:

 Unfortunately, there's no categorical answer.  You can come up with
 workloads where HOT pruning on selects is a win; just create a bunch
 of junk and then read the same pages lots of times in a row.  And you
 can also come up with workloads where it's a loss; create a bunch of
 junk and then read them just once.  I don't know how easy it's going
 to be to set that parameter in a useful way for some particular
 environment, and I think that's possibly an argument against having
 it.  But the argument that we don't need a parameter because one
 behavior is best for everyone is not going to fly.

In the above, there's the underlying assumption that it doesn't matter
*what* we do with the page after doing or not doing pruning.  But this
is not necessarily the case: in the case of an UPDATE, having the space
be freed beforehand is beneficial because there's the option of putting
the new version of the tuple in the same page, potentially saving lots
of I/O (otherwise: bring up another destination page for the new tuple,
write the new tuple there, and end up dirtying two pages instead of one).
But in a SELECT, the effect is only that you will have to skip fewer dead
tuples, which is not as exciting.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Josh Berkus
Robert,

 I think the problem here is that we tend to have a limited view of
 the right way to use synch rep. If I have 5 nodes, and I set 1
 synchronous and the other 3 asynchronous, I've set up a known
 successor in the event that the leader fails. In this scenario
 though, if the successor fails, you actually probably want to keep
 accepting writes; since you weren't using synchronous for durability
 but for operational simplicity. I suspect there are probably other
 scenarios where users are willing to trade latency for improved and/or
 directed durability, but not at the expense of availability, don't you?

That's a workaround for a completely different limitation though; the
inability to designate a specific async replica as first.  That is, if
there were some way to do so, you would be using that rather than sync
rep.  Extending the capabilities of that workaround is not something I
would gladly do until I had exhausted other options.

The other problem is that *many* users think they can get improved
availability, consistency AND durability on two nodes somehow, and to
heck with the CAP theorem (certain companies are happy to foster this
illusion).  Having a simple, easily-accessible auto-degrade without
treating degrade as a major monitoring event will feed this
self-deception.  I know I already have to explain the difference between
"synchronous" and "simultaneous" to practically every one of my clients
for whom I set up replication.

Realistically, degrade shouldn't be something that happens inside a
single PostgreSQL node, either the master or the replica.  It should be
controlled by some external controller which is capable of deciding on
degrade or not based on a more complex set of circumstances (e.g. Is
the replica actually down or just slow?).  Certainly this is the case
with Cassandra, VoltDB, Riak, and the other serious multinode databases.

 This isn't to say there isn't a lot of confusion around the issue.
 Designing, implementing, and configuring different guarantees in the
 presence of node failures is a non-trivial problem. Still, I'd prefer
 to see Postgres head in the direction of providing more options in
 this area rather than drawing a firm line at being a CP-oriented
 system.

I'm not categorically opposed to having any form of auto-degrade at all;
what I'm opposed to is a patch which adds auto-degrade **without adding
any additional monitoring or management infrastructure at all**.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup

2014-01-09 Thread Andreas Karlsson

On 01/09/2014 06:58 PM, Steeve Lennmark wrote:
  This patch adds the ability to relocate tablespaces by adding the

command line argument --tablespace (-T) which takes a required argument
in the format oid:tablespacedir. After all tablespaces are fetched
this code updates the symlink to point to the new tablespace location.

I would have loved to be able to pass tablespacename:tablespacedir
though, but sadly I wasn't able to figure out how to retrieve that
information without creating another connection to the database.


This feature would be a nice addition to pg_basebackup, and I agree that
it would be preferable to use names instead of oids if possible.



This feature might be missing because of some other limitation I fail
to see, if so let me know. Please be gentle, this is my first patch ;-)


It seems like you have attached the wrong patch. The only attachment I 
see is 0001-SQL-assertions-prototype.patch.


Best regards,
Andreas

--
Andreas Karlsson




Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup

2014-01-09 Thread Magnus Hagander
On Thu, Jan 9, 2014 at 6:58 PM, Steeve Lennmark
stee...@handeldsbanken.se wrote:

 Currently pg_basebackup is pretty invasive when using tablespaces, at
 least using the plain format. This is because it requires the tablespace
 to be written to the same location as on the server being backed up. This
 both breaks backing up locally using -Fp (since the tablespace would
 be written to the same location) and requires the backup user to have
 write permissions in locations it shouldn't need to have access to.


Yeah, this has been sitting on my TODO for a long time :) Glad to see
someone is picking it up.


This patch adds the ability to relocate tablespaces by adding the
 command line argument --tablespace (-T) which takes a required argument
 in the format oid:tablespacedir. After all tablespaces are fetched
 this code updates the symlink to point to the new tablespace location.

 I would have loved to be able to pass tablespacename:tablespacedir
 though, but sadly I wasn't able to figure out how to retrieve that
 information without creating another connection to the database.


You could also use the format olddir:newdir, because you do know that.
It's not the name of the tablespace, but I think it's still more
user-friendly than using the oid.
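
I.e. presumably something like (paths made up):

pg_basebackup -h db1 -D /srv/backup -Fp \
  -T /var/lib/postgresql/ts1:/srv/backup/ts1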


This feature might be missing because of some other limitation I fail
 to see, if so let me know. Please be gentle, this is my first patch ;-)


Nope, I think it's just been a lack of time.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Heikki Linnakangas

On 12/12/2013 04:00 AM, Kyotaro HORIGUCHI wrote:

Hello, we happened to see a server crash during archive recovery under
certain conditions.

After the TLI is incremented, it can happen that the WAL file for the
older timeline has been archived, but the one for the same segment id
on the newer timeline has not. In that case archive recovery fails
with a PANIC error like the following:

| PANIC: record with zero length at 0/1820D40

A replay script is attached. This issue occurs on 9.4dev and 9.3.2,
but not on 9.2.6 and 9.1.11. The latter search pg_xlog for the newer
TLI before trying the archive for older TLIs.

This occurs while fetching the checkpoint redo record during archive
recovery.


if (checkPoint.redo < RecPtr)
{
    /* back up to find the record */
    record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);


Hmm. After running the script, pg_controldata says:

Latest checkpoint location:   0/290
Prior checkpoint location:0/290
Latest checkpoint's REDO location:0/290

The PANIC is actually coming from here:


/*
 * Re-fetch the last valid or last applied record, so we can identify the
 * exact endpoint of what we consider the valid portion of WAL.
 */
record = ReadRecord(xlogreader, LastRec, PANIC, false);


If recovery started from an online checkpoint, then I think you'd get a 
similar PANIC from the code you quoted.



And this is caused by the segment file for the older timeline in the
archive directory being preferred to the one for the newer timeline in
pg_xlog.


Yep.


Looking into pg_xlog before trying the older TLIs in the archive, as 9.2
and earlier do, fixes this issue. The attached patch is one possible
solution for 9.4dev.

The attached files are:

  - recvtest.sh: replay script. Steps 1 and 2 create the condition,
    and step 3 triggers the issue.

  - archrecvfix_20131212.patch: the fix. With this patch, archive
    recovery reads pg_xlog before trying an older TLI in the archive,
    similarly to 9.1 and earlier.


Hmm, that seems reasonable at a quick glance. I think it also needs a 
change to the state transition code earlier in the loop, to not move 
from XLOG_FROM_ARCHIVE to XLOG_FROM_PG_XLOG, if we've already tried 
reading pg_xlog.


Another way to look at this is that it's wrong that we immediately PANIC 
if we successfully restore a file from archive, but then fail to read 
the record we're looking for. Instead, we should advance to next state 
in the state machine, ie. try reading the same record from pg_xlog, and 
only give up if the read fails from all sources.
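
In pseudocode, roughly (a sketch of the suggested control flow, not the
actual xlog.c structure; next_source() and read_record_from() are
hypothetical):

for (source = XLOG_FROM_ARCHIVE;; source = next_source(source))
{
    if (read_record_from(source))
        break;                  /* got the record */
    if (source == XLOG_FROM_PG_XLOG)
        elog(PANIC, "could not read record from any source");
}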


Yet another way to look at this is that we shouldn't even try to read 
the file with TLI 1 from the archive, when we know the checkpoint record 
is on timeline 2. It can't possibly work.


I've been thinking for some time that we should stop doing the scan of 
all possible TLIs, and only try to read the exact file that contains the 
record we're reading. We have that information now, in the timeline 
history file, and we already do that during streaming replication. I was 
afraid of changing the behavior of archive recovery in 9.4, but maybe 
that's the way to go in the long term.


- Heikki




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Simon Riggs
On 9 January 2014 17:21, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs si...@2ndquadrant.com wrote:
 We also make SELECT clean up blocks as it goes. That is useful in OLTP
 workloads, but it means that large SQL queries and pg_dump effectively
 do much the same work as VACUUM, generating huge amounts of I/O and
 WAL on the master, the cost and annoyance of which is experienced
 directly by the user. That is avoided on standbys.

 On a pgbench workload, though, essentially all page cleanup happens as
 a result of HOT cleanups, like 99.9%.  It might be OK to have that
 happen for write operations, but it would be a performance disaster if
 updates didn't try to HOT-prune.  Our usual argument for doing HOT
 pruning even on SELECT cleanups is that not doing so pessimizes
 repeated scans, but there are clearly cases that end up worse off as a
 result of that decision.

 My recollection of the discussion when HOT was developed is that it works
 that way not because anyone thought it was beneficial, but simply because
 we didn't see an easy way to know when first fetching a page whether we're
 going to try to UPDATE some tuple on the page.  (And we can't postpone the
 pruning, because the query will have tuple pointers into the page later.)
 Maybe we should work a little harder on passing that information down.
 It seems reasonable to me that SELECTs shouldn't be tasked with doing
 HOT pruning.

 I'm not entirely wild about adding a parameter in this area because it
 seems that we're increasingly choosing to further expose what arguably
 ought to be internal implementation details.

 I'm -1 for a parameter as well, but I think that just stopping SELECTs
 from doing pruning at all might well be a win.  It's at least worthy
 of some investigation.

Turning HOT off completely would be an absolute disaster for OLTP on
high update use cases against medium-large tables. That scenario is
well represented by pgbench and TPC-C.  I am *not* suggesting we
recommend that and would look for very large caveats in the docs.
(That may not have been clear, I guess I just assumed people would
know I was heavily involved in the HOT project and understood its
benefits).

As stated, I am interested in turning off HOT in isolated, user
specified situations, perhaps just for isolated tables.

I'm not crazy about exposing magic parameters either but then I'm not
crazy about either automatic settings or deferring things because we
don't know how to set it. In general, I prefer the idea of having a
user settable parameter in one release then automating it in a later
release if clear settings emerge from usage. I'll submit a patch with
parameter, to allow experimentation, for possible removal at commit or
beta.

If I had to suggest a value for an internal parameter, I would say
that each SELECT statement should clean no more than 4 blocks. That
way current OLTP behaviour is mostly preserved while the big queries
and pg_dump don't suck in unpredictable ways.
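
In pseudocode, roughly (every name here is hypothetical, not actual
PostgreSQL source):

/* per-statement budget for opportunistic pruning */
#define SELECT_PRUNE_LIMIT 4

if (!is_plain_select || pages_pruned_this_stmt < SELECT_PRUNE_LIMIT)
{
    prune_page(relation, buffer);   /* today's unconditional HOT-prune */
    pages_pruned_this_stmt++;
}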

I'll submit the patch and we can talk some more.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Robert Haas wrote:
 But the argument that we don't need a parameter because one
 behavior is best for everyone is not going to fly.

 In the above, there's the underlying assumption that it doesn't matter
 *what* we do with the page after doing or not doing pruning.  But this
 is not necessarily the case: in the case of an UPDATE, having the space
 be freed beforehand is beneficial because there's the option of putting
 the new version of the tuple in the same page, potentially saving lots
 of I/O (otherwise: bring up another destination page for the new tuple,
 write the new tuple there, and end up dirtying two pages instead of one).
 But in a SELECT, the effect is only that you will have to skip fewer dead
 tuples, which is not as exciting.

Yeah.  Once they're hinted dead, it doesn't cost that much to skip over
them.  Not to mention that you might well never visit them at all, if
this is an indexscan that knows which TIDs it needs to look at.

It's possible that it can be shown that different use-cases have
sufficiently different behaviors that we really do need a user-visible
parameter.  I don't want to start from that position though.  If we
did have a simple GUC parameter, it'd likely end up in the same boat
as, say, enable_seqscan, which is way too blunt an instrument for real
world use --- so I'm afraid this would soon bloat into a request for
per-table settings, planner hints, or god knows what to try to confine
the effects to the queries where it's appropriate.  Let's not go there
without proof that we have to.  It's a much better thing if we can get
the system's native behavior to be tuned well enough by depending on
things it already knows.

regards, tom lane




Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Simon Riggs
On 8 January 2014 21:40, Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kgri...@ymail.com writes:
 I'm torn on whether we should cave to popular demand on this; but
 if we do, we sure need to be very clear in the documentation about
 what a successful return from a commit request means.  Sooner or
 later, Murphy's Law being what it is, if we do this someone will
 lose the primary and blame us because the synchronous replica is
 missing gobs of transactions that were successfully committed.

 I'm for not caving.  I think people who are asking for this don't
 actually understand what they'd be getting.

Agreed.


Just to be clear, I made this mistake initially. Now I realise Heikki
was right and if you think about it long enough, you will too. If you
still disagree, think hard, read the archives until you do.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Stephen Frost
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
 But in a
 SELECT, the effect is only that you will have to skip less dead tuples,
 which is not as exciting.

Agreed.  There's also the option to have it be done based on some
expectation of future work - that is, if we have to traverse X dead
tuples during a SELECT, then don't bother with HOT pruning, but if we
get up to X+Y dead tuples, then do HOT pruning.

That said, I'm not entirely convinced that traversing these dead tuples
is all *that* painful during a SELECT.  If there are that many levels then
hopefully it's not long until an UPDATE comes along and cleans them up.

Thanks,

Stephen




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Stephen Frost
* Simon Riggs (si...@2ndquadrant.com) wrote:
  I'm -1 for a parameter as well, but I think that just stopping SELECTs
  from doing pruning at all might well be a win.  It's at least worthy
  of some investigation.
 
 Turning HOT off completely would be an absolute disaster for OLTP on
 high update use cases against medium-large tables. That scenario is
 well represented by pgbench and TPC-C.  I am *not* suggesting we
 recommend that and would look for very large caveats in the docs.

This is true even if we're only talking about turning it off for the
SELECT case...?  That's what's under discussion here, after all.
Certainly, we wouldn't turn it off completely...

 (That may not have been clear, I guess I just assumed people would
 know I was heavily involved in the HOT project and understood its
 benefits).

I'm certainly aware that you were heavily involved in HOT but I don't
think anyone is arguing to turn it off for everything.

 As stated, I am interested in turning off HOT in isolated, user
 specified situations, perhaps just for isolated tables.

I tend to agree w/ Tom on this point - having this be per-table
configurable doesn't sound very appealing to me and it wouldn't address
the case you mentioned around pg_dump, but I'm sure that'd be the next
step for this, and a per-session GUC wouldn't be sufficient.

 I'm not crazy about exposing magic parameters either but then I'm not
 crazy about either automatic settings or deferring things because we
 don't know how to set it. In general, I prefer the idea of having a
 user settable parameter in one release then automating it in a later
 release if clear settings emerge from usage. I'll submit a patch with
 parameter, to allow experimentation, for possible removal at commit or
 beta.

Ugh, adding GUCs is bad *because* we end up never being able to remove
them.

 If I had to suggest a value for an internal parameter, I would say
 that each SELECT statement should clean no more than 4 blocks. That
 way current OLTP behaviour is mostly preserved while the big queries
 and pg_dump don't suck in unpredictable ways.

Right, this was one idea that I had also, as noted in the other
subthread.  I'm not convinced that it's a great idea and it'd probably
be good to do a bit of testing to see just what the cost is; perhaps
even just come up with a worst-case example to see the difference
between a clean table and one with HOT chains as deep as they can go...
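
To illustrate the shape of such a cap, here's a hypothetical sketch (invented
names, not actual PostgreSQL code): a per-statement budget that page pruning
consults before doing any work.

#include <stdbool.h>

#define PRUNE_BUDGET_PER_STMT 4		/* Simon's suggested default */

static int prune_budget;			/* reset at the start of each statement */

void
statement_start(void)
{
	prune_budget = PRUNE_BUDGET_PER_STMT;
}

bool
may_prune_page(void)
{
	if (prune_budget <= 0)
		return false;			/* budget spent: leave the page for VACUUM */
	prune_budget--;
	return true;
}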

 I'll submit the patch and we can talk some more.

Neat.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] [PATCH] pg_basebackup: progress report max once per second

2014-01-09 Thread Magnus Hagander
On Thu, Nov 14, 2013 at 10:27 AM, Mika Eloranta m...@ohmu.fi wrote:

 On 13 Nov 2013, at 20:51, Mika Eloranta m...@ohmu.fi wrote:

  Prevent excessive progress reporting that can grow to gigabytes
  of output with large databases.

 Same patch as an attachment.


Would it not make more sense to instead store the last number printed, and
only print it if the percentage has changed? AIUI with this patch we still
print the same thing on top of itself if it takes >1 second to get 1%
further.

(Except for verbose mode - but if you're asking for verbose mode, you are
*asking* to get lots of output)
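
Something along these lines is what I mean - an untested sketch, and the real
progress_report() in pg_basebackup is shaped differently, but the idea is just
to remember the last percentage printed:

#include <stdio.h>

static int last_percent = -1;		/* nothing printed yet */

static void
progress_report(long done_kb, long total_kb)
{
	int		percent = total_kb > 0 ? (int) ((done_kb * 100) / total_kb) : 0;

	if (percent == last_percent)
		return;					/* same number as last time: stay quiet */
	last_percent = percent;
	fprintf(stderr, "%ld/%ld kB (%d%%)\r", done_kb, total_kb, percent);
}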


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 That said, I'm not entirely convinced that traversing these dead tuples
 is all *that* painful during SELECT.  If there's that many levels then
 hopefully it's not long til an UPDATE comes along and cleans them up.

There's always VACUUM ;-)

If you take about ten steps back, what's happening here is that
maintenance work that we'd originally delegated to VACUUM, precisely so
that it wouldn't have to be done by foreground queries, is now being done
by foreground queries.  And oddly enough, people don't like that.

There is a reasonable argument for forcing UPDATE queries to do it anyway,
to improve the odds they can do same-page updates (whether HOT or
otherwise).  And probably an INSERT should do it on a page that it's
selected as an insertion target.  But I think the argument that the
original do-maintenance-in-background-whenever-possible design was wrong
is a lot harder to sustain for SELECT or even DELETE queries.  As I said
upthread, I think the current behavior was *not* chosen for performance
reasons but just to limit the scope of what we had to change for HOT.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] nested hstore patch

2014-01-09 Thread Josh Berkus
On 01/09/2014 06:12 AM, Andrew Dunstan wrote:
 Oleg,
 
 Please merge in the jsonb work and resubmit. See
 https://github.com/feodor/postgres/commits/jsonb_and_hstore I note that
 this repo does not apparently contain any of your latest changes.

I'll go further and say that if the Hstore2 patch doesn't support JSONB
for 9.4, we should postpone it to 9.5.  We really don't want to get into
a situation where we need an Hstore3 because we accepted an Hstore2
which needs to be rev'd for JSON.

Especially since there's no good reason for the JSON changes not to be
merged already.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread Amit Kapila
On Thu, Jan 9, 2014 at 12:21 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jan 7, 2014 at 10:20 PM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Tue, Jan 7, 2014 at 2:46 AM, Robert Haas robertmh...@gmail.com wrote:

 Well, right now we just reopen the same object from all of the
 processes, which seems to work fine and doesn't require any of this
 complexity.  The only problem I don't know how to solve is how to make
 a segment stick around for the whole postmaster lifetime.  If
 duplicating the handle into the postmaster without its knowledge gets
 us there, it may be worth considering, but that doesn't seem like a
 good reason to rework the rest of the existing mechanism.

 I think one has to try this to see if it works as per the need. If it's not
 urgent, I can try this early next week?

 Anything we want to get into 9.4 has to be submitted by next Tuesday,
 but I don't know that we're going to get this into 9.4.

Using DuplicateHandle(), we can make the segment stick around for the
postmaster's lifetime. I have used the test below (using the dsm_demo
module) to verify:
Session - 1
select dsm_demo_create('this message is from session-1');
 dsm_demo_create
-----------------
           82712
(1 row)

Session - 2
-----------
select dsm_demo_read(82712);
         dsm_demo_read
---------------------------------
 this message is from session-1
(1 row)

Session-1
\q

-- till here it will work without DuplicateHandle as well

Session -2
select dsm_demo_read(82712);
         dsm_demo_read
---------------------------------
 this message is from session-1
(1 row)

Session -2
\q

Session -3
select dsm_demo_read(82712);
         dsm_demo_read
---------------------------------
 this message is from session-1
(1 row)

-- above shows that handle stays around.

Note -
Currently I have to bypass the code below in dsm_attach(), as it assumes
the segment will not stay around once it's removed from the control file.

/*
* If we didn't find the handle we're looking for in the control
* segment, it probably means that everyone else who had it mapped,
* including the original creator, died before we got to this point.
* It's up to the caller to decide what to do about that.
*/
if (seg->control_slot == INVALID_CONTROL_SLOT)
{
dsm_detach(seg);
return NULL;
}


Could you let me know what exactly you are expecting in the patch -
just a call to DuplicateHandle() after CreateFileMapping(), or something
else as well?
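
What I tried is roughly the following (a simplified, Windows-only sketch, not
the actual test code; hMap is the handle returned by CreateFileMapping() and
postmaster_pid is assumed to be known):

#include <windows.h>

static HANDLE
pin_segment_in_postmaster(HANDLE hMap, DWORD postmaster_pid)
{
	HANDLE	hPostmaster;
	HANDLE	hDup = NULL;

	hPostmaster = OpenProcess(PROCESS_DUP_HANDLE, FALSE, postmaster_pid);
	if (hPostmaster == NULL)
		return NULL;

	/*
	 * The duplicated handle lives in the postmaster's handle table, so the
	 * mapping object stays alive until the postmaster closes it or exits.
	 */
	if (!DuplicateHandle(GetCurrentProcess(), hMap,
						 hPostmaster, &hDup,
						 0, FALSE, DUPLICATE_SAME_ACCESS))
		hDup = NULL;

	CloseHandle(hPostmaster);
	return hDup;				/* only meaningful inside the postmaster */
}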

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread knizhnik

On 01/09/2014 09:22 PM, Robert Haas wrote:

On Wed, Jan 8, 2014 at 2:39 PM, knizhnik knizh...@garret.ru wrote:

I wonder what the intended use case of dynamic shared memory is?
Is it primarily oriented toward PostgreSQL extensions, or will it also be
used in the PostgreSQL core?

My main motivation is that I want to use it to support parallel query.
  There is unfortunately quite a bit of work left to be done before we
can make that a reality, but that's the goal.


I do not want to waste your time, but this topic is very interesting to
me and I would be very pleased if you could drop a few words about how DSM
can help to implement parallel query processing.
It seems to me that the main complexity is in the optimizer - it needs to
split the query plan into several subplans which can be executed
concurrently and then merge their partial results.
As far as I understand, it is not possible to use multithreading for
parallel query execution because most of the PostgreSQL code is
non-reentrant. So we need to execute these subplans in several processes.
And unlike threads, the only efficient way of exchanging data between
processes is shared memory. So it is clear why we need shared memory
for parallel query execution. But why does it have to be dynamic? Why can
it not be preallocated at start time, as most other resources used by
PostgreSQL are?





Maybe I am wrong, but I do not see a reason for the same extension to
create multiple DSM segments.

Right.


And the total number of DSM segments is expected to be not very large (<10).
The same is true for the synchronization primitives (LWLocks, for example)
needed to synchronize access to these DSM segments. So I am not sure the
ability to place locks in DSM is really so critical...
We could just reserve some space for LWLocks that can be used by extensions,
so that LWLockAssign() can be used without RequestAddinLWLocks, or
RequestAddinLWLocks can be used not only from preloaded extensions.

If you're doing all of this at postmaster startup time, that all works
fine.  If you want to be able to load up an extension on the fly, then
it doesn't.  You can only RequestAddinLWLocks() at postmaster start
time, not afterwards, so currently any extension that wants to use
lwlocks has to be loaded at postmaster startup time, or you're out of
luck.

Well.  Technically we reserve something like 3 extra lwlocks that
could be assigned later.  But relying on those to be available is not
very reliable, and also, 3 is not very many, considering that we have
something north of 32k core lwlocks in the default configuration.


3 is definitely too small.
But you agreed with me that the number of DSM segments will not be very large.
And if we do not need fine-grained locking (and IMHO it is not needed for
most extensions), then we need just a few (most likely one) locks per DSM
segment.
It means that if, instead of 3, we reserve let's say 30 LWLocks, then it
will be enough for most extensions. And there will be almost no extra
resource overhead, because, as you wrote, PostgreSQL has 32k locks in the
default configuration.


Certainly, if we need an independent lock for each page of DSM memory, then
there will be no other choice except placing the locks in the DSM segment
itself. But once again - I do not think that most extensions needing
shared memory will use such fine-grained locking.






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread knizhnik

On 01/09/2014 09:46 PM, Claudio Freire wrote:

On Thu, Jan 9, 2014 at 2:22 PM, Robert Haas robertmh...@gmail.com wrote:

It would be nice to have better operating system support for this.
For example, IIUC, 64-bit Linux has 128TB of address space available
for user processes.  When you clone(), it can either share the entire
address space (i.e. it's a thread) or none of it (i.e. it's a
process).  There's no option to, say, share 64TB and not the other
64TB, which would be ideal for us.  We could then map dynamic shared
memory segments into the shared portion of the address space and do
backend-private allocations in the unshared part.  Of course, even if
we had that, it wouldn't be portable, so who knows how much good it
would do.  But it would be awfully nice to have the option.

You can map a segment at fork time, and unmap it after forking. That
doesn't really use RAM, since it's supposed to be lazily allocated (it
can be forced to be so, I believe, with PROT_NONE and MAP_NORESERVE,
but I don't think that's portable).

That guarantees it's free.

Next, you can map shared memory at explicit addresses (linux's mmap
has support for that, and I seem to recall Windows did too).

All you have to do, is some book-keeping in shared memory (so all
processes can coordinate new mappings).
As far as I understand, the main advantage of DSM is that a segment can be
allocated at any time - not only at fork time.
And it is not because of memory consumption: even without unmap,
reserving some memory region doesn't cause PostgreSQL to lose physical
memory. And there is usually no problem with exhaustion of virtual space
on a 64-bit architecture. Using some combination of flags (such as
MAP_NORESERVE), it is usually possible to completely eliminate the
overhead of reserving an address range in virtual space. But mapping a
dynamically created segment (not at fork time) to the same address
really seems to be a big challenge.





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Filter error log statements by sqlstate

2014-01-09 Thread Oskari Saarenmaa
Allow the default log_min_error_statement to be overridden per sqlstate 
to make it possible to filter out some error types while maintaining a 
low log_min_error_statement or enable logging for some error types when 
the default is to not log anything.


I've tried to do something like this using rsyslog filters, but that's
pretty awkward and doesn't work at all when the statement is split into
multiple syslog messages.


https://github.com/saaros/postgres/compare/log-by-sqlstate
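
The encoding is simple: each entry of the lookup array packs a sqlstate into
the high 32 bits of a uint64 and the log level into the low bits, so a sorted
array can be binary-searched. An illustrative sketch (MAKE_SQLSTATE is copied
from src/include/utils/elog.h; the rest is invented for illustration):

#include <stdint.h>

/* MAKE_SQLSTATE as defined in src/include/utils/elog.h */
#define MAKE_SQLSTATE(ch1,ch2,ch3,ch4,ch5) \
	(((ch1)-'0') + (((ch2)-'0') << 6) + (((ch3)-'0') << 12) + \
	 (((ch4)-'0') << 18) + (((ch5)-'0') << 24))

/* Pack one mapping entry: sqlstate in the high half, level in the low half */
static uint64_t
make_entry(int sqlstate, int level)
{
	return ((uint64_t) sqlstate << 32) | (uint32_t) level;
}

/*
 * For example, an override for unique_violation (23505) would be stored as
 * make_entry(MAKE_SQLSTATE('2','3','5','0','5'), level), where level is
 * whatever numeric elevel the GUC value maps to.
 */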

 src/backend/utils/error/elog.c | 183 -
 src/backend/utils/misc/guc.c   |  14 +++-
 src/include/utils/guc.h|   4 +
 src/include/utils/guc_tables.h |   1 +
 4 files changed, 199 insertions(+), 3 deletions(-)

/ Oskari
From 61fe332f35f49c59257e9dcd0b5e2ff80f1f4055 Mon Sep 17 00:00:00 2001
From: Oskari Saarenmaa o...@ohmu.fi
Date: Thu, 9 Jan 2014 20:49:28 +0200
Subject: [PATCH] Filter error log statements by sqlstate

Allow the default log_min_error_statement to be overridden per sqlstate to
make it possible to filter out some error types while maintaining a low
log_min_error_statement or enable logging for some error types when the
default is to not log anything.
---
 src/backend/utils/error/elog.c | 183 -
 src/backend/utils/misc/guc.c   |  14 +++-
 src/include/utils/guc.h|   4 +
 src/include/utils/guc_tables.h |   1 +
 4 files changed, 199 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 3de162b..c843e1a 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -74,7 +74,9 @@
 #include "storage/ipc.h"
 #include "storage/proc.h"
 #include "tcop/tcopprot.h"
+#include "utils/builtins.h"
 #include "utils/guc.h"
+#include "utils/guc_tables.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 
@@ -111,6 +113,11 @@ char	   *Log_line_prefix = NULL;		/* format for extra log line info */
 int			Log_destination = LOG_DESTINATION_STDERR;
 char	   *Log_destination_string = NULL;
 
+static uint64		*log_sqlstate_error_statement = NULL;
+static size_t		log_sqlstate_error_statement_len = 0;
+
+static int get_sqlstate_error_level(int sqlstate);
+
 #ifdef HAVE_SYSLOG
 
 /*
@@ -2475,6 +2482,7 @@ static void
 write_csvlog(ErrorData *edata)
 {
 	StringInfoData buf;
+	int		requested_log_level;
 	bool		print_stmt = false;
 
 	/* static counter for line numbers */
@@ -2618,7 +2626,10 @@ write_csvlog(ErrorData *edata)
 	appendStringInfoChar(&buf, ',');
 
 	/* user query --- only reported if not disabled by the caller */
-	if (is_log_level_output(edata->elevel, log_min_error_statement) &&
+	requested_log_level = get_sqlstate_error_level(edata->sqlerrcode);
+	if (requested_log_level < 0)
+		requested_log_level = log_min_error_statement;
+	if (is_log_level_output(edata->elevel, requested_log_level) &&
 		debug_query_string != NULL &&
 		!edata->hide_stmt)
 		print_stmt = true;
@@ -2691,6 +2702,7 @@ static void
 send_message_to_server_log(ErrorData *edata)
 {
 	StringInfoData buf;
+	int requested_log_level;
 
 	initStringInfo(&buf);
 
@@ -2775,7 +2787,10 @@ send_message_to_server_log(ErrorData *edata)
 	/*
 	 * If the user wants the query that generated this error logged, do it.
 	 */
-	if (is_log_level_output(edata->elevel, log_min_error_statement) &&
+	requested_log_level = get_sqlstate_error_level(edata->sqlerrcode);
+	if (requested_log_level < 0)
+		requested_log_level = log_min_error_statement;
+	if (is_log_level_output(edata->elevel, requested_log_level) &&
 		debug_query_string != NULL &&
 		!edata->hide_stmt)
 	{
@@ -3577,3 +3592,167 @@ trace_recovery(int trace_level)
 
 	return trace_level;
 }
+
+
+/* Look up the log level override for this sqlstate with a binary search
+ * over the sorted mapping array; returns -1 if no override is configured. */
+static int
+get_sqlstate_error_level(int sqlstate)
+{
+	uint64 left = 0, right = log_sqlstate_error_statement_len;
+	while (left < right)
+	{
+		uint64 middle = left + (right - left) / 2;
+		int m_sqlstate = log_sqlstate_error_statement[middle] >> 32;
+
+		if (m_sqlstate == sqlstate)
+			return log_sqlstate_error_statement[middle] & 0xFFFFFFFF;
+		else if (m_sqlstate < sqlstate)
+			left = middle + 1;
+		else
+			right = middle;
+	}
+	return -1;
+}
+
+bool
+check_log_sqlstate_error(char **newval, void **extra, GucSource source)
+{
+	const struct config_enum_entry *enum_entry;
+	char	   *rawstring, *new_newval, *rp;
+	List	   *elemlist;
+	ListCell   *l;
+	uint64 *new_array = NULL;
+	int i, new_array_len = 0;
+
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', &elemlist))
+	{
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
+		return false;
+	}
+
+	/* GUC wants malloced results, allocate room for as many elements on
+	 * the list plus one to hold the array size */
+	new_array = (uint64 *) malloc(sizeof(uint64) * (list_length(elemlist) + 1));
+	if (!new_array)
+	{
+		pfree(rawstring);
+		

Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread Claudio Freire
On Thu, Jan 9, 2014 at 4:24 PM, knizhnik knizh...@garret.ru wrote:
 On 01/09/2014 09:46 PM, Claudio Freire wrote:

 On Thu, Jan 9, 2014 at 2:22 PM, Robert Haas robertmh...@gmail.com wrote:

 It would be nice to have better operating system support for this.
 For example, IIUC, 64-bit Linux has 128TB of address space available
 for user processes.  When you clone(), it can either share the entire
 address space (i.e. it's a thread) or none of it (i.e. it's a
 process).  There's no option to, say, share 64TB and not the other
 64TB, which would be ideal for us.  We could then map dynamic shared
 memory segments into the shared portion of the address space and do
 backend-private allocations in the unshared part.  Of course, even if
 we had that, it wouldn't be portable, so who knows how much good it
 would do.  But it would be awfully nice to have the option.

 You can map a segment at fork time, and unmap it after forking. That
 doesn't really use RAM, since it's supposed to be lazily allocated (it
 can be forced to be so, I believe, with PROT_NONE and MAP_NORESERVE,
 but I don't think that's portable).

 That guarantees it's free.

 Next, you can map shared memory at explicit addresses (linux's mmap
 has support for that, and I seem to recall Windows did too).

 All you have to do, is some book-keeping in shared memory (so all
 processes can coordinate new mappings).

 As far as I understand, the main advantage of DSM is that a segment can be
 allocated at any time - not only at fork time.
 And it is not because of memory consumption: even without unmap, reserving
 some memory region doesn't cause PostgreSQL to lose physical memory. And there
 is usually no problem with exhaustion of virtual space on a 64-bit
 architecture. Using some combination of flags (such as MAP_NORESERVE), it is
 usually possible to completely eliminate the overhead of reserving an address
 range in virtual space. But mapping a dynamically created segment (not at fork
 time) to the same address really seems to be a big challenge.

At fork time I only wrote about reserving the address space. After
reserving it, all you have to do is implement an allocator that works
in shared memory (protected by a lwlock of course).

In essence, a hypothetical pg_dsm_alloc(region_name) would use regular
shared memory to coordinate returning an already mapped region (same
address which is guaranteed to work since we reserved that region), or
allocate one (within the reserved address space).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread knizhnik

On 01/09/2014 11:09 PM, Amit Kapila wrote:

On Thu, Jan 9, 2014 at 12:21 AM, Robert Haas robertmh...@gmail.com wrote:

On Tue, Jan 7, 2014 at 10:20 PM, Amit Kapila amit.kapil...@gmail.com wrote:

On Tue, Jan 7, 2014 at 2:46 AM, Robert Haas robertmh...@gmail.com wrote:

Well, right now we just reopen the same object from all of the
processes, which seems to work fine and doesn't require any of this
complexity.  The only problem I don't know how to solve is how to make
a segment stick around for the whole postmaster lifetime.  If
duplicating the handle into the postmaster without its knowledge gets
us there, it may be worth considering, but that doesn't seem like a
good reason to rework the rest of the existing mechanism.

I think one has to try this to see if it works as per the need. If it's not
urgent, I can try this early next week?

Anything we want to get into 9.4 has to be submitted by next Tuesday,
but I don't know that we're going to get this into 9.4.

Using DuplicateHandle(), we can make the segment stick around for the
postmaster's lifetime. I have used the test below (using the dsm_demo
module) to verify:
Session - 1
select dsm_demo_create('this message is from session-1');
 dsm_demo_create
-----------------
           82712
(1 row)

Session - 2
-----------
select dsm_demo_read(82712);
         dsm_demo_read
---------------------------------
 this message is from session-1
(1 row)

Session-1
\q

-- till here it will work without DuplicateHandle as well

Session -2
select dsm_demo_read(82712);
         dsm_demo_read
---------------------------------
 this message is from session-1
(1 row)

Session -2
\q

Session -3
select dsm_demo_read(82712);
         dsm_demo_read
---------------------------------
 this message is from session-1
(1 row)

-- above shows that handle stays around.

Note -
Currently I have to bypass the code below in dsm_attach(), as it assumes
the segment will not stay around once it's removed from the control file.

/*
* If we didn't find the handle we're looking for in the control
* segment, it probably means that everyone else who had it mapped,
* including the original creator, died before we got to this point.
* It's up to the caller to decide what to do about that.
*/
if (seg->control_slot == INVALID_CONTROL_SLOT)
{
dsm_detach(seg);
return NULL;
}


Could you let me know what exactly you are expecting in patch,
just a call to DuplicateHandle() after CreateFileMapping() or something
else as well?


As far as I understand, DuplicateHandle() should really do the trick:
protect the segment from deallocation.

But should the postmaster somehow be notified about this handle?
For example, if we really want to delete this segment (drop extension),
we should somehow make the postmaster close this handle.

How can it be done?



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread Amit Kapila
On Fri, Jan 10, 2014 at 1:00 AM, knizhnik knizh...@garret.ru wrote:
 On 01/09/2014 11:09 PM, Amit Kapila wrote:


 Using DuplicateHandle(), we can make the segment stick around for the
 postmaster's lifetime. I have used the test below (using the dsm_demo
 module) to verify:

 As far as I understand, DuplicateHandle() should really do the trick:
 protect the segment from deallocation.
 But should the postmaster somehow be notified about this handle?
 For example, if we really want to delete this segment (drop extension),
 we should somehow make the postmaster close this handle.
 How can it be done?

I think we need to use some form of IPC to communicate it to the postmaster.
I could not think of any other way atm.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread knizhnik

On 01/09/2014 11:30 PM, Claudio Freire wrote:

On Thu, Jan 9, 2014 at 4:24 PM, knizhnik knizh...@garret.ru wrote:

On 01/09/2014 09:46 PM, Claudio Freire wrote:

On Thu, Jan 9, 2014 at 2:22 PM, Robert Haas robertmh...@gmail.com wrote:

It would be nice to have better operating system support for this.
For example, IIUC, 64-bit Linux has 128TB of address space available
for user processes.  When you clone(), it can either share the entire
address space (i.e. it's a thread) or none of it (i.e. it's a
process).  There's no option to, say, share 64TB and not the other
64TB, which would be ideal for us.  We could then map dynamic shared
memory segments into the shared portion of the address space and do
backend-private allocations in the unshared part.  Of course, even if
we had that, it wouldn't be portable, so who knows how much good it
would do.  But it would be awfully nice to have the option.

You can map a segment at fork time, and unmap it after forking. That
doesn't really use RAM, since it's supposed to be lazily allocated (it
can be forced to be so, I believe, with PROT_NONE and MAP_NORESERVE,
but I don't think that's portable).

That guarantees it's free.

Next, you can map shared memory at explicit addresses (linux's mmap
has support for that, and I seem to recall Windows did too).

All you have to do, is some book-keeping in shared memory (so all
processes can coordinate new mappings).

As far as I understand, the main advantage of DSM is that a segment can be
allocated at any time - not only at fork time.
And it is not because of memory consumption: even without unmap, reserving
some memory region doesn't cause PostgreSQL to lose physical memory. And there
is usually no problem with exhaustion of virtual space on a 64-bit
architecture. Using some combination of flags (such as MAP_NORESERVE), it is
usually possible to completely eliminate the overhead of reserving an address
range in virtual space. But mapping a dynamically created segment (not at fork
time) to the same address really seems to be a big challenge.

At fork time I only wrote about reserving the address space. After
reserving it, all you have to do is implement an allocator that works
in shared memory (protected by a lwlock of course).

In essence, a hypothetical pg_dsm_alloc(region_name) would use regular
shared memory to coordinate returning an already mapped region (same
address which is guaranteed to work since we reserved that region), or
allocate one (within the reserved address space).
Why do we need named segments? There is the ShmemAlloc function in the
PostgreSQL API.
If RequestAddinShmemSpace could be used without the requirement to place
the module in the preload list, wouldn't that be enough for most extensions?

And ShmemInitHash can be used to maintain named regions if that is needed...

So if we have some reserved address space, do we actually need some
special allocator for this space to allocate new segments in it?

Why is the existing shared memory API not enough?



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread Claudio Freire
On Thu, Jan 9, 2014 at 4:39 PM, knizhnik knizh...@garret.ru wrote:
 At fork time I only wrote about reserving the address space. After
 reserving it, all you have to do is implement an allocator that works
 in shared memory (protected by a lwlock of course).

 In essence, a hypothetical pg_dsm_alloc(region_name) would use regular
 shared memory to coordinate returning an already mapped region (same
 address which is guaranteed to work since we reserved that region), or
 allocate one (within the reserved address space).

 Why do we need named segments? There is the ShmemAlloc function in the
 PostgreSQL API.
 If RequestAddinShmemSpace could be used without the requirement to place the
 module in the preload list, wouldn't that be enough for most extensions?
 And ShmemInitHash can be used to maintain named regions if that is needed...

If you want to dynamically create the segments, you need some way to
identify them. That is, the name. Otherwise, RequestWhateverShmemSpace
won't know when to return an already-mapped region or not.

Mind you, the name can be a number. No need to make it a string.

 So if we have some reserved address space, do we actually need some special
 allocator for this space to allocate new segments in it?
 Why is the existing shared memory API not enough?

I don't know this existing API you mention. But I think this is quite
a specific case very unlikely to be serviced from existing APIs. You
need a data structure that can map names to regions, any hash map will
do, or even an array since one wouldn't expect it to be too big, or
require it to be too fast, and then you need to unmap the reserve
mapping and put a shared region there instead, before returning the
pointer to this shared region.

So, the special thing is, the book-keeping region sits in regular
shared memory, whereas the allocated regions sit in newly-created
segments. And segments are referenced by pointers (since the address
space is fixed and shared). Is there something like that already?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread Claudio Freire
On Thu, Jan 9, 2014 at 4:48 PM, Claudio Freire klaussfre...@gmail.com wrote:
 On Thu, Jan 9, 2014 at 4:39 PM, knizhnik knizh...@garret.ru wrote:
 At fork time I only wrote about reserving the address space. After
 reserving it, all you have to do is implement an allocator that works
 in shared memory (protected by a lwlock of course).

 In essence, a hypothetical pg_dsm_alloc(region_name) would use regular
 shared memory to coordinate returning an already mapped region (same
 address which is guaranteed to work since we reserved that region), or
 allocate one (within the reserved address space).

 Why do we need named segments? There is the ShmemAlloc function in the
 PostgreSQL API.
 If RequestAddinShmemSpace could be used without the requirement to place the
 module in the preload list, wouldn't that be enough for most extensions?
 And ShmemInitHash can be used to maintain named regions if that is needed...

 If you want to dynamically create the segments, you need some way to
 identify them. That is, the name. Otherwise, RequestWhateverShmemSpace
 won't know when to return an already-mapped region or not.

 Mind you, the name can be a number. No need to make it a string.

 So if we have some reserved address space, do we actually need some special
 allocator for this space to allocate new segments in it?
 Why is the existing shared memory API not enough?


Oh, I see where the confusion is now.

The reserve mapping I was proposing was a MAP_NORESERVE mapping with PROT_NONE.

Ie: forbidden access. Which guarantees the OS won't try to allocate
physical RAM to it.

You'd have to re-map it before using, so it's not like a regular
shared memory region where you can simply allocate pointers and
intersperse bookkeeping data in-place.
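
In rough code, the scheme looks something like this (a Linux-only, untested
sketch with most error handling omitted; shm_fd is assumed to be an already
created shared memory file descriptor):

#include <sys/mman.h>
#include <stddef.h>

#define RESERVED_BYTES ((size_t) 1 << 40)	/* 1 TB of address space */

static void *reserved_base;

/* Called once before forking: pins address space, never physical RAM. */
int
reserve_address_space(void)
{
	reserved_base = mmap(NULL, RESERVED_BYTES, PROT_NONE,
						 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
						 -1, 0);
	return reserved_base == MAP_FAILED ? -1 : 0;
}

/* Replace part of the reservation with a real shared mapping. */
void *
map_segment_at(void *addr, size_t len, int shm_fd)
{
	/*
	 * MAP_FIXED overwrites the PROT_NONE reservation at 'addr', so every
	 * process that performs this call sees the segment at the same address.
	 */
	return mmap(addr, len, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_FIXED, shm_fd, 0);
}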


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Heikki Linnakangas

On 01/09/2014 08:18 PM, Heikki Linnakangas wrote:

On 12/12/2013 04:00 AM, Kyotaro HORIGUCHI wrote:

Hello, we happened to see a server crash on archive recovery under
certain conditions.

After the TLI was incremented, there can be a case where the WAL
file for the older timeline is archived, but the one with the same
segment id for the newer timeline is not. Archive recovery fails
in that case with a PANIC error like the following:

| PANIC: record with zero length at 0/1820D40

Replay script is attached. This issue occurred for 9.4dev and 9.3.2,
and not for 9.2.6 and 9.1.11. The latter two search pg_xlog for the
TLI before trying the archive for older TLIs.

This occurs while fetching the checkpoint redo record in archive
recovery.


if (checkPoint.redo < RecPtr)
{
/* back up to find the record */
record = ReadRecord(xlogreader, checkPoint.redo, PANIC, false);


Hmm. After running the script, pg_controldata says:

Latest checkpoint location:   0/290
Prior checkpoint location:0/290
Latest checkpoint's REDO location:0/290

The PANIC is actually coming from here:


/*
 * Re-fetch the last valid or last applied record, so we can
identify the
 * exact endpoint of what we consider the valid portion of WAL.
 */
record = ReadRecord(xlogreader, LastRec, PANIC, false);


If recovery started from an online checkpoint, then I think you'd get a
similar PANIC from the code you quoted.


And this is caused by the fact that the segment file for the older
timeline in the archive directory is preferred to the one for the
newer timeline in pg_xlog.


Yep.


Actually, why is the partially-filled 00010002 file 
archived in the first place? Looking at the code, it's been like that 
forever, but it seems like a bad idea. If the original server is still 
up and running, and writing more data to that file, what will happen is 
that when the original server later tries to archive it, it will fail 
because the partial version of the file is already in the archive. Or 
worse, the partial version overwrites a previously archived more 
complete version.


This is the code that does that:


/*
 * If we are establishing a new timeline, we have to copy data from the
 * last WAL segment of the old timeline to create a starting WAL segment
 * for the new timeline.
 *
 * Notify the archiver that the last WAL segment of the old timeline is
 * ready to copy to archival storage. Otherwise, it is not archived for a
 * while.
 */
if (endTLI != ThisTimeLineID)
{
XLogFileCopy(endLogSegNo, endTLI, endLogSegNo);

if (XLogArchivingActive())
{
XLogFileName(xlogpath, endTLI, endLogSegNo);
XLogArchiveNotify(xlogpath);
}
}


So, the rationale is that otherwise it would take a long time until that 
segment is archived. To be precise, I don't think the segment with the 
old TLI would ever be archived without the above, but the same segment 
on the new timeline would, after it fills up.


Wouldn't it be better to not archive the old segment, and instead switch 
to a new segment after writing the end-of-recovery checkpoint, so that 
the segment on the new timeline is archived sooner?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 Actually, why is the partially-filled 00010002 file 
 archived in the first place? ...

 So, the rationale is that otherwise it would take a long time until that 
 segment is archived. To be precise, I don't think the segment with the 
 old TLI would ever be archived without the above, but the same segment 
 on the new timeline would, after it fills up.

 Wouldn't it be better to not archive the old segment, and instead switch 
 to a new segment after writing the end-of-recovery checkpoint, so that 
 the segment on the new timeline is archived sooner?

Don't we want to archive both?  If you want to recover to the end of the
old timeline, you're going to need that file too, no?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread knizhnik

On 01/09/2014 11:48 PM, Claudio Freire wrote:

On Thu, Jan 9, 2014 at 4:39 PM, knizhnik knizh...@garret.ru wrote:

At fork time I only wrote about reserving the address space. After
reserving it, all you have to do is implement an allocator that works
in shared memory (protected by a lwlock of course).

In essence, a hypothetical pg_dsm_alloc(region_name) would use regular
shared memory to coordinate returning an already mapped region (same
address which is guaranteed to work since we reserved that region), or
allocate one (within the reserved address space).

Why do we need named segments? There is the ShmemAlloc function in the
PostgreSQL API.
If RequestAddinShmemSpace could be used without the requirement to place the
module in the preload list, wouldn't that be enough for most extensions?
And ShmemInitHash can be used to maintain named regions if that is needed...

If you want to dynamically create the segments, you need some way to
identify them. That is, the name. Otherwise, RequestWhateverShmemSpace
won't know when to return an already-mapped region or not.

Mind you, the name can be a number. No need to make it a string.


So if we have some reserved address space, do we actually need some special
allocator for this space to allocate new segments in it?
Why is the existing shared memory API not enough?

I don't know this existing API you mention. But I think this is quite
a specific case very unlikely to be serviced from existing APIs. You
need a data structure that can map names to regions, any hash map will
do, or even an array since one wouldn't expect it to be too big, or
require it to be too fast, and then you need to unmap the reserve
mapping and put a shared region there instead, before returning the
pointer to this shared region.

So, the special thing is, the book-keeping region sits in regular
shared memory, whereas the allocated regions sit in newly-created
segments. And segments are referenced by pointers (since the address
space is fixed and shared). Is there something like that already?

By the existing API I mostly mean 6 functions:

RequestAddinShmemSpace()
RequestAddinLWLocks()
ShmemInitStruct()
LWLockAssign()
ShmemAlloc()
ShmemInitHash()

If it were possible to use these functions without requiring the module
to be included in the shared_preload_libraries list, would we
really need DSM?

And it can be achieved by
1. Reserving address space (as you suggested)
2. Reserving some fixed number of free LWLocks (not very large, < 100).

I do not have anything against creating our own allocator of named
shared memory segments within the reserved address space.
I am just not sure it is actually needed. In some sense
RequestAddinShmemSpace() can be such an allocator.






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Heikki Linnakangas

On 01/09/2014 10:16 PM, Tom Lane wrote:

Heikki Linnakangas hlinnakan...@vmware.com writes:

Actually, why is the partially-filled 00010002 file
archived in the first place? ...



So, the rationale is that otherwise it would take a long time until that
segment is archived. To be precise, I don't think the segment with the
old TLI would ever be archived without the above, but the same segment
on the new timeline would, after it fills up.



Wouldn't it be better to not archive the old segment, and instead switch
to a new segment after writing the end-of-recovery checkpoint, so that
the segment on the new timeline is archived sooner?


Don't we want to archive both?  If you want to recover to the end of the
old timeline, you're going to need that file too, no?


Hmm. It should be the responsibility of the original server to archive 
the segment on the old timeline. Mind you, partial segments are never 
archived, except for this one case, so how did the old segment find its 
way to the new server? A few possibilities come to mind: the DBA 
manually copied it from the old server to pg_xlog, it was streamed by 
streaming replication, or it was included in a base backup. The OP's 
test script resembles the base backup case.


In all of those cases, I don't think it's the new server's 
responsibility to archive it. If it was copied to pg_xlog manually, the 
administrator may also copy it to the archive if he feels like it. If it 
was streamed from a live server, the original server should take care of 
it. If it was included in a backup, well, it's included in the backup so 
it doesn't necessarily need to be archived.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] pgcrypto: implement gen_random_uuid

2014-01-09 Thread Oskari Saarenmaa
The only useful feature of the uuid-ossp module in my opinion is the
uuid_generate_v4 function, and as uuid-ossp is more or less abandonware,
people have had trouble building and installing it.  This patch
implements an alternative UUID v4 generation function in pgcrypto which
could be moved to core once there's a core PRNG with a large enough
internal state.


On my test system it took 3796 msec to generate a million UUIDs with 
pgcrypto while uuid-ossp took 20375 msec.


https://github.com/saaros/postgres/compare/pgcrypto-uuid-v4
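
The core of v4 generation is trivial once a good random source is available:
take 16 random bytes and stamp the RFC 4122 version and variant bits. Sketched
here with OpenSSL's RAND_bytes standing in for pgcrypto's internal PRNG (the
actual patch uses pgcrypto's own random-bytes machinery):

#include <openssl/rand.h>
#include <stdint.h>

int
gen_uuid_v4(uint8_t out[16])
{
	if (RAND_bytes(out, 16) != 1)
		return -1;						/* PRNG failure */
	out[6] = (out[6] & 0x0F) | 0x40;	/* version 4 */
	out[8] = (out[8] & 0x3F) | 0x80;	/* RFC 4122 variant */
	return 0;
}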

 contrib/pgcrypto/Makefile |  2 +-
 contrib/pgcrypto/pgcrypto--1.0--1.1.sql   |  8 
 contrib/pgcrypto/{pgcrypto--1.0.sql => pgcrypto--1.1.sql} |  7 ++-
 contrib/pgcrypto/pgcrypto.c   | 22 ++
 contrib/pgcrypto/pgcrypto.control |  2 +-
 contrib/pgcrypto/pgcrypto.h   |  1 +
 doc/src/sgml/pgcrypto.sgml| 11 +++

/ Oskari
From 522fef9c3739d4c4f3c107e574e84db67a0c07a2 Mon Sep 17 00:00:00 2001
From: Oskari Saarenmaa o...@ohmu.fi
Date: Thu, 9 Jan 2014 22:24:36 +0200
Subject: [PATCH] pgcrypto: implement gen_random_uuid

---
 contrib/pgcrypto/Makefile   |   2 +-
 contrib/pgcrypto/pgcrypto--1.0--1.1.sql |   8 ++
 contrib/pgcrypto/pgcrypto--1.0.sql  | 202 ---
 contrib/pgcrypto/pgcrypto--1.1.sql  | 207 
 contrib/pgcrypto/pgcrypto.c |  22 
 contrib/pgcrypto/pgcrypto.control   |   2 +-
 contrib/pgcrypto/pgcrypto.h |   1 +
 doc/src/sgml/pgcrypto.sgml  |  11 ++
 8 files changed, 251 insertions(+), 204 deletions(-)
 create mode 100644 contrib/pgcrypto/pgcrypto--1.0--1.1.sql
 delete mode 100644 contrib/pgcrypto/pgcrypto--1.0.sql
 create mode 100644 contrib/pgcrypto/pgcrypto--1.1.sql

diff --git a/contrib/pgcrypto/Makefile b/contrib/pgcrypto/Makefile
index dadec95..1c85c98 100644
--- a/contrib/pgcrypto/Makefile
+++ b/contrib/pgcrypto/Makefile
@@ -26,7 +26,7 @@ MODULE_big	= pgcrypto
 OBJS		= $(SRCS:.c=.o)
 
 EXTENSION = pgcrypto
-DATA = pgcrypto--1.0.sql pgcrypto--unpackaged--1.0.sql
+DATA = pgcrypto--1.1.sql pgcrypto--1.0--1.1.sql pgcrypto--unpackaged--1.0.sql
 
 REGRESS = init md5 sha1 hmac-md5 hmac-sha1 blowfish rijndael \
 	$(CF_TESTS) \
diff --git a/contrib/pgcrypto/pgcrypto--1.0--1.1.sql b/contrib/pgcrypto/pgcrypto--1.0--1.1.sql
new file mode 100644
index 000..2601669
--- /dev/null
+++ b/contrib/pgcrypto/pgcrypto--1.0--1.1.sql
@@ -0,0 +1,8 @@
+/* contrib/pgcrypto/pgcrypto--1.0--1.1.sql */
+
+\echo Use "ALTER EXTENSION pgcrypto UPDATE" to load this file. \quit
+
+CREATE FUNCTION gen_random_uuid()
+RETURNS uuid
+AS 'MODULE_PATHNAME', 'pg_random_uuid'
+LANGUAGE C VOLATILE;
diff --git a/contrib/pgcrypto/pgcrypto--1.0.sql b/contrib/pgcrypto/pgcrypto--1.0.sql
deleted file mode 100644
index 347825e..000
--- a/contrib/pgcrypto/pgcrypto--1.0.sql
+++ /dev/null
@@ -1,202 +0,0 @@
-/* contrib/pgcrypto/pgcrypto--1.0.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION pgcrypto" to load this file. \quit
-
-CREATE FUNCTION digest(text, text)
-RETURNS bytea
-AS 'MODULE_PATHNAME', 'pg_digest'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION digest(bytea, text)
-RETURNS bytea
-AS 'MODULE_PATHNAME', 'pg_digest'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION hmac(text, text, text)
-RETURNS bytea
-AS 'MODULE_PATHNAME', 'pg_hmac'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION hmac(bytea, bytea, text)
-RETURNS bytea
-AS 'MODULE_PATHNAME', 'pg_hmac'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION crypt(text, text)
-RETURNS text
-AS 'MODULE_PATHNAME', 'pg_crypt'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION gen_salt(text)
-RETURNS text
-AS 'MODULE_PATHNAME', 'pg_gen_salt'
-LANGUAGE C VOLATILE STRICT;
-
-CREATE FUNCTION gen_salt(text, int4)
-RETURNS text
-AS 'MODULE_PATHNAME', 'pg_gen_salt_rounds'
-LANGUAGE C VOLATILE STRICT;
-
-CREATE FUNCTION encrypt(bytea, bytea, text)
-RETURNS bytea
-AS 'MODULE_PATHNAME', 'pg_encrypt'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION decrypt(bytea, bytea, text)
-RETURNS bytea
-AS 'MODULE_PATHNAME', 'pg_decrypt'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION encrypt_iv(bytea, bytea, bytea, text)
-RETURNS bytea
-AS 'MODULE_PATHNAME', 'pg_encrypt_iv'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION decrypt_iv(bytea, bytea, bytea, text)
-RETURNS bytea
-AS 'MODULE_PATHNAME', 'pg_decrypt_iv'
-LANGUAGE C IMMUTABLE STRICT;
-
-CREATE FUNCTION gen_random_bytes(int4)
-RETURNS bytea
-AS 'MODULE_PATHNAME', 'pg_random_bytes'
-LANGUAGE C VOLATILE STRICT;
-
---
--- pgp_sym_encrypt(data, key)
---
-CREATE FUNCTION pgp_sym_encrypt(text, text)
-RETURNS bytea
-AS 'MODULE_PATHNAME', 'pgp_sym_encrypt_text'
-LANGUAGE C STRICT;
-
-CREATE FUNCTION pgp_sym_encrypt_bytea(bytea, text)
-RETURNS bytea
-AS 

Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 01/09/2014 10:16 PM, Tom Lane wrote:
 Don't we want to archive both?  If you want to recover to the end of the
 old timeline, you're going to need that file too, no?

 Hmm. It should be the responsibility of the original server to archive 
 the segment on the old timeline.

Oh ... I was thinking about the case of a timeline switch within one
server, say as a result of PITR recovery.  If you're sure this same case
doesn't come up that way, then nevermind.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Heikki Linnakangas

On 01/09/2014 10:36 PM, Tom Lane wrote:

Heikki Linnakangas hlinnakan...@vmware.com writes:

On 01/09/2014 10:16 PM, Tom Lane wrote:

Don't we want to archive both?  If you want to recover to the end of the
old timeline, you're going to need that file too, no?



Hmm. It should be the responsibility of the original server to archive
the segment on the old timeline.


Oh ... I was thinking about the case of a timeline switch within one
server, say as a result of PITR recovery.  If you're sure this same case
doesn't come up that way, then nevermind.


I'm not sure what you mean, but it does apply to same-server PITR 
recovery too. In that case, again, if you copy the partial segment to 
pg_xlog when doing the PITR recovery, clearly you already somehow have 
access to the partial segment, and if you want to do another PITR to the 
same point, you should just copy it again.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUG] Archive recovery failure on 9.3+.

2014-01-09 Thread Josh Berkus
On 01/09/2014 12:05 PM, Heikki Linnakangas wrote:

 Actually, why is the partially-filled 00010002 file
 archived in the first place? Looking at the code, it's been like that
 forever, but it seems like a bad idea. If the original server is still
 up and running, and writing more data to that file, what will happen is
 that when the original server later tries to archive it, it will fail
 because the partial version of the file is already in the archive. Or
 worse, the partial version overwrites a previously archived more
 complete version.

Oh!  This explains some transient errors I've seen.

 Wouldn't it be better to not archive the old segment, and instead switch
 to a new segment after writing the end-of-recovery checkpoint, so that
 the segment on the new timeline is archived sooner?

It would be better to zero-fill and switch segments, yes.  We should
NEVER be in a position of archiving two different versions of the same
segment.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Planning time in explain/explain analyze

2014-01-09 Thread Greg Stark
On Thu, Jan 9, 2014 at 1:50 PM, Stephen Frost sfr...@snowman.net wrote:
 I do not think we want to always measure the time it took to
 generate a plan due to slow clocks on some architectures. Also I
 feel that such a patch would be more invasive.

 The slow-clock argument is really quite frustrating for those of us who
 are looking to add more and more metrics to PG.

Especially since implementing gettimeofday quickly is a solved problem
on most architectures.

However I don't see the issue here. Two gettimeofday calls per query
plan are not really going to hurt even on systems where it's slow. The
problem we run into is explain analyze, which runs gettimeofday twice
for every node for every tuple processed. For cpu-bound queries that
can become the dominant cost.

The only way two gettimeofday calls per query plan becomes an issue is
if you're executing non-cached queries repeatedly on data that's
entirely in ram. That means the query processing is entirely cpu-bound
and adding two syscalls could actually be noticeable. The mitigation
strategy would be to prepare and cache the query handle to execute it
again.
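
To be concrete, the instrumentation under discussion amounts to something like
this (an illustrative sketch using the existing instr_time.h macros; the
wrapper function itself is invented):

#include "postgres.h"
#include "nodes/params.h"
#include "nodes/plannodes.h"
#include "portability/instr_time.h"
#include "tcop/tcopprot.h"

PlannedStmt *
plan_query_timed(Query *query, int cursorOptions, ParamListInfo boundParams,
				 double *planning_ms)
{
	instr_time	start, duration;
	PlannedStmt *plan;

	INSTR_TIME_SET_CURRENT(start);			/* gettimeofday #1 */
	plan = pg_plan_query(query, cursorOptions, boundParams);
	INSTR_TIME_SET_CURRENT(duration);		/* gettimeofday #2 */
	INSTR_TIME_SUBTRACT(duration, start);

	*planning_ms = INSTR_TIME_GET_MILLISEC(duration);
	return plan;
}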


  We're nowhere near the
 level that we need to be and it shows (particularly for users coming
 from $OTHER-RDBMS).  Perhaps we should try and come up with a solution
 to address those cases (turn off metrics ala turning off stats?) while
 not preventing us from gathering metrics on more reasonable systems.




-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ANNOUNCE] IMCS: In Memory Columnar Store for PostgreSQL

2014-01-09 Thread Jim Nasby

On 1/9/14, 1:18 PM, knizhnik wrote:

So it is clear why we need shared memory for parallel query execution. But
why does it have to be dynamic? Why can it not be preallocated at start time,
as most other resources used by PostgreSQL are?


That would limit us to doing something like allocating a fixed maximum of 
parallel processes (which might be workable) and only allocating a very small 
amount of memory for IPC. Small as in can only handle a small number of tuples. 
That sounds like a really inefficient way to shuffle data to and from parallel 
processes, especially because one or both sides would probably have to actually 
copy the data if we're doing it that way.

With DSM if you want to do something like a parallel sort each process can put 
their results into memory that the parent process can directly access.

Of course the other enormous win for DSM is it's the foundation for finally 
being able to resize things without a restart. For large dollar sites that 
ability would be hugely beneficial.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Planning time in explain/explain analyze

2014-01-09 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 However I don't see the issue here. Two gettimeofday calls per query
 plan is not really going to hurt even on systems where it's slow.

I tend to agree with this, especially if the calls only occur when the
user asks for the information (ie, does an EXPLAIN rather than just
executing the query).

 The only way two gettimeofday calls per query plan becomes an issue is
 if you're executing non-cached queries repeatedly on data that's
 entirely in ram. That means the query processing is entirely cpu-bound
 and adding two syscalls could actually be noticeable. The mitigation
 strategy would be to prepare and cache the query handle to execute it
 again.

This point weighs against the proposal that we time the work to fetch
a previously-prepared query plan; if we do that then the mitigation
strategy doesn't mitigate anything.

In short then, I think we should just add this to EXPLAIN and be done.
-1 for sticking the info into PlannedStmt or anything like that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Jim Nasby

On 1/9/14, 12:54 PM, Tom Lane wrote:

Stephen Frost sfr...@snowman.net writes:

That said, I'm not entirely convinced that traversing these dead tuples
is all *that* painful during SELECT.  If there's that many levels then
hopefully it's not long til an UPDATE comes along and cleans them up.


There's always VACUUM ;-)

If you take about ten steps back, what's happening here is that
maintenance work that we'd originally delegated to VACUUM, precisely so
that it wouldn't have to be done by foreground queries, is now being done
by foreground queries.  And oddly enough, people don't like that.

There is a reasonable argument for forcing UPDATE queries to do it anyway,
to improve the odds they can do same-page updates (whether HOT or
otherwise).  And probably an INSERT should do it on a page that it's
selected as an insertion target.  But I think the argument that the
original do-maintenance-in-background-whenever-possible design was wrong
is a lot harder to sustain for SELECT or even DELETE queries.  As I said
upthread, I think the current behavior was *not* chosen for performance
reasons but just to limit the scope of what we had to change for HOT.


Instead of looking at how to avoid this work in SELECTs maybe it'd be more 
useful to look at how we can get it done more quickly in the background. The 
VM (visibility map) is already a step in the right direction, but it seems the big use case 
here is when some bulk operation comes through and touches a sizeable number of 
blocks (but perhaps not enough to hit autovac thresholds).

ISTM it wouldn't be too difficult for a backend to track how many blocks in a 
relation it's dirtied (keep in mind that count doesn't have to be perfect). If 
we tracked that info, it could be put into a maintenance queue (LISTEN/NOTIFY?) 
along with our XID. That gives us a list of relations to vacuum and exactly 
when to vacuum them. Thanks to the VM we wouldn't need to track individual 
pages (though it might be useful to track the minimum and maximum block IDs we 
hit, per relation).
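
A sketch of the backend-side bookkeeping (untested, names invented; it leans
on the existing dynahash API):

#include "postgres.h"
#include "utils/hsearch.h"

typedef struct RelDirtyStats
{
	Oid			relid;				/* hash key */
	int64		blocks_dirtied;		/* approximate is fine here */
} RelDirtyStats;

static HTAB *rel_dirty_stats = NULL;

static void
count_dirtied_block(Oid relid)
{
	RelDirtyStats *entry;
	bool		found;

	if (rel_dirty_stats == NULL)
	{
		HASHCTL		ctl;

		MemSet(&ctl, 0, sizeof(ctl));
		ctl.keysize = sizeof(Oid);
		ctl.entrysize = sizeof(RelDirtyStats);
		ctl.hash = oid_hash;
		rel_dirty_stats = hash_create("dirtied blocks per relation", 64,
									  &ctl, HASH_ELEM | HASH_FUNCTION);
	}

	entry = (RelDirtyStats *) hash_search(rel_dirty_stats, &relid,
										  HASH_ENTER, &found);
	if (!found)
		entry->blocks_dirtied = 0;
	entry->blocks_dirtied++;	/* flushed to the maintenance queue at
								 * transaction end, along with our XID */
}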
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Robert Haas
On Thu, Jan 9, 2014 at 1:54 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Stephen Frost sfr...@snowman.net writes:
 That said, I'm not entirely convinced that traversing these dead tuples
 is all *that* painful during SELECT.  If there's that many levels then
 hopefully it's not long til an UPDATE comes along and cleans them up.

 There's always VACUUM ;-)

 If you take about ten steps back, what's happening here is that
 maintenance work that we'd originally delegated to VACUUM, precisely so
 that it wouldn't have to be done by foreground queries, is now being done
 by foreground queries.  And oddly enough, people don't like that.

People *think* they don't like that, because that's the way it works
right now.  If it worked some other way, there's a good chance people
would be complaining about that behavior, too.  I submitted a patch a
few years back to limit the setting of hint bits by foreground
processes to approximately 5% of the buffers they touched in a large
scan, so that no single scan would incur all the cost of setting the
hint bits; instead, the cost would be amortized over the first 20 or
so scans.  However, nobody was very enthusiastic about that patch,
because while it greatly softened the blow for the first scan,
subsequent scans were slower, because now they had to carry part of
the burden, too.  And you know what?  People didn't like *that*
either.
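
The heart of that patch was just a scan-local budget - a very rough
reconstruction of the idea (not the actual code):

#include <stdbool.h>

static long buffers_touched;	/* reset at the start of each scan */
static long hint_bits_set;

bool
may_set_hint_bits(void)
{
	buffers_touched++;
	/* allow hint-bit setting on roughly 5% of the buffers we touch */
	if (hint_bits_set * 20 < buffers_touched)
	{
		hint_bits_set++;
		return true;
	}
	return false;
}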

The problem with saying that we should let VACUUM do this work is the
same as the problem with saying that if you're late for your Concorde
flight, you should go running across the tarmac and try to catch it.
The cost of dead tuples is related in a linear fashion to the rate at
which pages are accessed.  Not coincidentally, the number of
opportunities for HOT pruning is *also* related in a linear fashion to
the rate at which pages are accessed.  This is why it works so well.
The rate at which vacuuming happens does not ramp up in the same way;
it's limited by autovacuum cost settings (which people tend not to have
set correctly, and which don't adjust themselves on the fly) or by their
hardware capabilities.  If autovacuum can't keep up, foreground
activity doesn't slow down to compensate; instead, the system just
bloats out of control.  While people may not like having this
maintenance activity in the foreground, they like not having it at all
even less.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup

2014-01-09 Thread Gabriele Bartolini
Hi Steeve,
 Il 09/01/14 22:10, Steeve Lennmark ha scritto:

 That's a much better solution, I attached a patch with the updated code.

 # SELECT oid, pg_tablespace_location(oid) FROM pg_tablespace;
 [...]
  16388 | /tmp/tblspc1
  16389 | /tmp/tblspc2

I'd suggest a similar solution to the one we have adopted in Barman (if
you don't know it: www.pgbarman.org), that is:

--tablespace NAME:LOCATION [--tablespace NAME:LOCATION ...]

I prefer this over using the location on the master, as that might change over
time (at least more frequently than the tablespace name) and across servers.

 $ pg_basebackup -Xs -D backup/data -T /tmp/tblspc1:$(pwd)/backup/t1 -T
 /tmp/tblspc2:$(pwd)/backup/t2

With the above example, it would become:

$ pg_basebackup -Xs -D backup/data -T tblspc1:$(pwd)/backup/t1 -T
tblspc2:$(pwd)/backup/t2
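
Splitting the argument at the first colon is trivial; a sketch with a
hypothetical helper (not actual pg_basebackup code; a location containing
a colon, e.g. on Windows, would need escaping):

#include <string.h>
#include <stdbool.h>

/* Split "NAME:LOCATION" in place; returns false if either part is empty. */
static bool
split_tablespace_arg(char *arg, char **name, char **location)
{
	char *sep = strchr(arg, ':');

	if (sep == NULL || sep == arg || sep[1] == '\0')
		return false;
	*sep = '\0';
	*name = arg;
	*location = sep + 1;
	return true;
}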

Thanks,
Gabriele

-- 
 Gabriele Bartolini - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2014-01-09 Thread Jim Nasby

On 1/9/14, 9:01 AM, Hannu Krosing wrote:

Yeah, and I think that the logging command that was suggested allows
for that *if configured correctly*.

*But* for relying on this, we would also need to make logging
*synchronous*, which would probably not go down well with many people, as
it makes things even more fragile from an availability viewpoint (and
slower as well).


Not really... you only care about monitoring performance when the standby has 
gone AWOL *and* you haven't sent a notification yet. Once you've notified, 
you're done.

So in this case the master won't go down unless you have a double fault: 
standby goes down AND you can't get to your monitoring.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Andres Freund
On 2014-01-09 16:27:23 -0500, Robert Haas wrote:
 People *think* they don't like that, because that's the way it works
 right now.  If it worked some other way, there's a good chance people
 would be complaining about that behavior, too.

I think one of the primary reasons it's causing huge slowdowns is that
the ring buffer of scan strategies causes dirty buffer writes pretty
much immediately when a buffer is reused.

Not that delaying the writeout would work all that effectively right
now, with the current bgwriter...
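
Roughly, the effect is this (a sketch of the behaviour only, with made-up
names - not the real buffer-strategy code):

/* Sketch: a scan's ring strategy cycles over a handful of buffers,
 * so a buffer dirtied by pruning is written out almost immediately
 * when the ring wraps around to it. */
static Buffer
ring_next_buffer(ScanRing *ring)
{
	Buffer		buf = ring->buffers[ring->next++ % ring->nbuffers];

	if (buffer_is_dirty(buf))
		write_buffer(buf);		/* immediate write-out on reuse */
	return buf;
}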

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Relocation of tablespaces in pg_basebackup

2014-01-09 Thread Steeve Lennmark
On Thu, Jan 9, 2014 at 10:29 PM, Gabriele Bartolini 
gabriele.bartol...@2ndquadrant.it wrote:

 Hi Steeve,
  Il 09/01/14 22:10, Steeve Lennmark ha scritto:
 
  That's a much better solution, I attached a patch with the updated code.
 
  # SELECT oid, pg_tablespace_location(oid) FROM pg_tablespace;
  [...]
   16388 | /tmp/tblspc1
   16389 | /tmp/tblspc2

 I'd suggest a similar solution to the one we have adopted in Barman (if
 you don't know it: www.pgbarman.org), that is:

 --tablespace NAME:LOCATION [--tablespace NAME:LOCATION ...]

 I prefer this over using the location on the master, as that might change over
 time (at least more frequently than the tablespace name) and across servers.


I'm a Barman user myself, so that was actually my initial thought. Unless
there's some kind of hidden internal that I've missed, I don't see a way
to convert an OID (I only have the OID and path at this stage) to a
tablespace name. This solution, even though not optimal, is a lot better
than my initial one, where I used the OID directly.


  $ pg_basebackup -Xs -D backup/data -T /tmp/tblspc1:$(pwd)/backup/t1 -T
  /tmp/tblspc2:$(pwd)/backup/t2

 With the above example, it would become:

 $ pg_basebackup -Xs -D backup/data -T tblspc1:$(pwd)/backup/t1 -T
 tblspc2:$(pwd)/backup/t2


Yeah, that would be my favourite solution.

Regards,
Steeve
--
Steeve Lennmark


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2014-01-09 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 The problem with saying that we should let VACUUM do this work is the
 same as the problem with saying that if you're late for your Concorde
 flight, you should go running across the tarmac and try to catch it.
 The cost of dead tuples is related in a linear fashion to the rate at
 which pages are accessed.  Not coincidentally, the number of
 opportunities for HOT pruning is *also* related in a linear fashion to
 the rate at which pages are accessed.  This is why it works so well.

That seems like a large oversimplification.  Some (most?) of the costs of
dead tuples are proportional to the rate of dead tuple creation.  I grant
that there are also some costs proportional to the rate at which scans
visit dead tuples, but I really don't believe that the latter are
dominant.  So I think it's bogus to claim that the current behavior is
somehow optimal.

One more time: the sole reason it works the way it does now is that that
was the path of least resistance back in 2007, and we never yet got around
to trying to optimize that.  I'm glad to see someone wanting to revisit
the issue, but I don't think that we necessarily have to go as far as
creating user-visible knobs in order to make it better.

 The rate at which vacuuming happens does not ramp up in the same way;
 it's limited by autovacuum cost settings (which people tend not to have
 set correctly, and which don't adjust themselves on the fly)

True, but that seems like a pretty well-defined improvement project right
there (as well as an argument against user-visible knobs in general ;-)).
Nasby's speculations just upthread could be useful here, too.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Planning time in explain/explain analyze

2014-01-09 Thread Greg Stark
On Thu, Jan 9, 2014 at 9:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 In short then, I think we should just add this to EXPLAIN and be done.
 -1 for sticking the info into PlannedStmt or anything like that.

I'm confused. I thought I was arguing in support of your suggestion that
the initial planning should store the time in the cached plan, and that
EXPLAIN should output the time the original planning took.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add CREATE support to event triggers

2014-01-09 Thread Jim Nasby

On 1/9/14, 11:58 AM, Alvaro Herrera wrote:

Robert Haas escribió:

On Wed, Jan 8, 2014 at 10:27 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:



Hmm.  This seems like a reasonable thing to do, except that I would like
the output to always be the constant, and have some other way to
enable the clause or disable it.  With your present boolean:

if_not_exists: {output: IF NOT EXISTS,
                present: true/false}


Why not:

if_not_exists: true/false


Yeah, that's another option.  If we do this, though, the expansion
function would have to know that an if_not_exists element expands to IF
NOT EXISTS.  Maybe that's okay.  Right now, the expansion function is
pretty stupid, which is nice.


Yeah, the source side of this will always have to understand the nuances of 
every command; it'd be really nice to not burden the other side with that as 
well. The only downside I see is a larger JSON output, but meh.

Another advantage is that, if you really wanted to, you could modify the output 
formatting in the JSON doc to do something radically different...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] array_length(anyarray)

2014-01-09 Thread Jim Nasby

On 1/9/14, 11:08 AM, Marko Tiikkaja wrote:

On 1/9/14 5:44 PM, Florian Pflug wrote:

On Jan9, 2014, at 14:57 , Dean Rasheed dean.a.rash...@gmail.com wrote:

On 19 December 2013 08:05, Pavel Stehule pavel.steh...@gmail.com wrote:

length should be irrelevant to the fact that the array starts from 1, 0 or
anything else


Yes, this should just return the number of elements, and 0 for an empty array.


+1. Anything that complains about arrays whose lower bound isn't 1 really
needs a *way* less generic name than array_length().


Problem is, if you're operating on an array which could have a lower bound that 
isn't 1, why would you look at the length in the first place?  You can't access 
any elements by index, you'd need to look at array_lower().  You can't iterate 
over the array by index, you'd need to do array_lower() .. array_lower() + 
array_length(), which doesn't make sense.  And then there's the myriad of stuff 
you can do with unnest() without actually having to look at the length.  Same 
goes for multi-dimensional arrays: you have even fewer things you can do there 
with only a length.

So if we give up these constraints, we also make this function completely 
useless.


I'm generally opposed to creating code that doesn't support the full feature set of 
something (in this case, array_lower() <> 1). But in this case I hope we can all 
agree that allowing the user to set an arbitrary array lower bound was an enormous 
mistake. While we might not be able to ever completely remove that behavior, I find 
the idea of throwing an error to be highly enticing.

Plus, as Marko said, this function is pretty useless for non-1-based arrays.

I do agree that the name is probably too generic for this though.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Disallow arrays with non-standard lower bounds

2014-01-09 Thread Jim Nasby

ISTM that allowing users to pick arbitrary lower array bounds was a huge 
mistake. I've never seen anyone make use of it, can't think of any legitimate 
use cases for it, and hate the stupendous amount of extra code needed to deal 
with it.

Obviously we can't just drop support, but what about an initdb (or hell, even 
configure) option to disallow arrays with a lower bound <> 1? Unfortunately we 
can't do this with a GUC since you can store arrays in a table.
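
The check itself would be cheap wherever arrays get constructed; a sketch,
with a made-up initdb-time flag (none of these names exist today):

#include "postgres.h"

/* Sketch only: hypothetical cluster-wide flag set at initdb time. */
extern bool cluster_allows_nonstandard_array_lb;

static void
check_array_lower_bounds(int ndim, const int *lBound)
{
	int			i;

	if (cluster_allows_nonstandard_array_lb)
		return;
	for (i = 0; i < ndim; i++)
		if (lBound[i] != 1)
			ereport(ERROR,
					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
					 errmsg("arrays with a lower bound other than 1 are disabled in this cluster")));
}

The hard part is everything around it: pg_upgrade, dumps of existing data,
and so on.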
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] nested hstore patch

2014-01-09 Thread Oleg Bartunov
I moved the patch to the January commitfest
(https://commitfest.postgresql.org/action/patch_view?id=1289) .

Oleg

PS.

Kudos to Teodor and his mobile phone, which he used to synchronize
branches on github.


On Fri, Jan 10, 2014 at 2:08 AM, Andrew Dunstan and...@dunslane.net wrote:

 On 01/09/2014 02:11 PM, Josh Berkus wrote:

 On 01/09/2014 06:12 AM, Andrew Dunstan wrote:

 Oleg,

 Please merge in the jsonb work and resubmit. See
 https://github.com/feodor/postgres/commits/jsonb_and_hstore I note that
 this repo does not apparently contain any of your latest changes.

 I'll go further and say that if the Hstore2 patch doesn't support JSONB
 for 9.4, we should postpone it to 9.5.  We really don't want to get into
 a situation where we need an Hstore3 because we accepted an Hstore2
 which needs to be rev'd for JSON.

 Especially since there's no good reason for the JSON changes not to be
 merged already.



 After some work by Oleg, for which I'm grateful, and a little more by me,
 here is a combined patch for the jsonb and nested hstore work.

 Outstanding issues with the jsonb stuff:

  * I have replicated all the json processing functions for jsonb
    (although not the json generating functions, such as to_json). Most
    of these currently work by turning the jsonb back into json and then
    processing as before. I am sorting out some technical issues and
    hope to have all of these rewritten to use the native jsonb API in a
    few days time.
  * We still need to document jsonb. That too I hope will be done quite
    shortly.
  * The jsonb regression test currently contains U+ABCD - I guess we'd
    better use some hex encoding or whatever for that - unlike json, the
    jsonb de-serializer dissolves unicode escapes.


 cheers

 andrew




 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] proposal, patch: allow multiple plpgsql plugins

2014-01-09 Thread Pavel Stehule
Hello

We talked about enhancing the plpgsql plugin API to support multiple active
plugins.

I wrote a prototype based on a function plpgsql_register_plugin() instead of
the rendezvous variable.

There are two basic questions:

a) will we still support the rendezvous variable?

b) will we still support the same API - a reference to plugin_info in the
exec state is an issue - described in the patch.

Without a) and b) we will break current plugins a little more than is
usual - not terribly hard to fix.  But without a) and b) the
implementation can be significantly cleaner.

Comments, notes?

Regards

Pavel
commit 406ee9d32dbb09385ec38bb6d89e8531cac1cd5f
Author: Pavel Stehule pavel.steh...@gooddata.com
Date:   Thu Jan 9 23:32:30 2014 +0100

initial

diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile
index 852b0c7..37d17a8 100644
--- a/src/pl/plpgsql/src/Makefile
+++ b/src/pl/plpgsql/src/Makefile
@@ -19,7 +19,7 @@ rpath =
 
 OBJS = pl_gram.o pl_handler.o pl_comp.o pl_exec.o pl_funcs.o pl_scanner.o
 
-DATA = plpgsql.control plpgsql--1.0.sql plpgsql--unpackaged--1.0.sql
+DATA = plpgsql.control plpgsql--1.1.sql plpgsql--1.0--1.1.sql plpgsql--unpackaged--1.0.sql
 
 all: all-lib
 
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 3749fac..fc7158e 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -86,6 +86,21 @@ typedef struct SimpleEcontextStackEntry
 static EState *shared_simple_eval_estate = NULL;
 static SimpleEcontextStackEntry *simple_econtext_stack = NULL;
 
+/*
+ * List of pointers and info of registered plugins.
+ */
+typedef struct PluginPtrEntry
+{
+	PLpgSQL_plugin *plugin_ptr;
+	void *plugin_info;		/* reserved for use by optional plugin */
+	struct PluginPtrEntry *next;
+} PluginPtrEntry;
+
+/*
+ * Allocated in TopMemoryContext
+ */
+static PluginPtrEntry *plugins = NULL;
+
 /************************************************************
  * Local function forward declarations
  ************************************************************/
@@ -236,6 +251,11 @@ static char *format_expr_params(PLpgSQL_execstate *estate,
 static char *format_preparedparamsdata(PLpgSQL_execstate *estate,
 	   const PreparedParamsData *ppd);
 
+bool multi_plugin_func_setup = false;
+bool multi_plugin_func_beg = false;
+bool multi_plugin_func_end = false;
+bool multi_plugin_stmt_beg = false;
+bool multi_plugin_stmt_end = false;
 
 /* --
  * plpgsql_exec_function	Called by the call handler for
@@ -336,6 +356,39 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
 	if (*plugin_ptr && (*plugin_ptr)->func_beg)
 		((*plugin_ptr)->func_beg) (&estate, func);
 
+	if (multi_plugin_func_beg)
+	{
+		PluginPtrEntry *plugin_entry;
+
+		Assert(plugins != NULL);
+
+		for (plugin_entry = plugins; plugin_entry != NULL;
+			 plugin_entry = plugin_entry->next)
+		{
+			PLpgSQL_plugin *plugin_ptr = plugin_entry->plugin_ptr;
+
+			if (plugin_entry->plugin_ptr->func_beg)
+			{
+				void *plugin_info = estate.plugin_info;
+
+				/* save a plugin_info related single only plpgsql plugin */
+				PG_TRY();
+				{
+					estate.plugin_info = plugin_entry->plugin_info;
+					(plugin_ptr->func_beg) (&estate, func);
+					plugin_entry->plugin_info = estate.plugin_info;
+					estate.plugin_info = plugin_info;
+				}
+				PG_CATCH();
+				{
+					estate.plugin_info = plugin_info;
+					PG_RE_THROW();
+				}
+				PG_END_TRY();
+			}
+		}
+	}
+
 	/*
 	 * Now call the toplevel block of statements
 	 */
@@ -484,6 +537,39 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
 	if (*plugin_ptr && (*plugin_ptr)->func_end)
 		((*plugin_ptr)->func_end) (&estate, func);
 
+	if (multi_plugin_func_end)
+	{
+		PluginPtrEntry *plugin_entry;
+
+		Assert(plugins != NULL);
+
+		for (plugin_entry = plugins; plugin_entry != NULL;
+			 plugin_entry = plugin_entry->next)
+		{
+			PLpgSQL_plugin *plugin_ptr = plugin_entry->plugin_ptr;
+
+			if (plugin_entry->plugin_ptr->func_end)
+			{
+				void *plugin_info = estate.plugin_info;
+
+				/* save a plugin_info related single only plpgsql plugin */
+				PG_TRY();
+				{
+					estate.plugin_info = plugin_entry->plugin_info;
+					(plugin_ptr->func_end) (&estate, func);
+					plugin_entry->plugin_info = estate.plugin_info;
+					estate.plugin_info = plugin_info;
+				}
+				PG_CATCH();
+				{
+					estate.plugin_info = plugin_info;
+					PG_RE_THROW();
+				}
+				PG_END_TRY();
+			}
+		}
+	}
+
 	/* Clean up any leftover temporary memory */
 	plpgsql_destroy_econtext(&estate);
 	exec_eval_cleanup(&estate);
@@ -1393,6 +1479,39 @@ exec_stmt(PLpgSQL_execstate *estate, PLpgSQL_stmt *stmt)
 	if (*plugin_ptr && (*plugin_ptr)->stmt_beg)
 		((*plugin_ptr)->stmt_beg) (estate, stmt);
 
+	if (multi_plugin_stmt_beg)
+	{
+		PluginPtrEntry *plugin_entry;
+
+		Assert(plugins != NULL);
+
+		for (plugin_entry = plugins; plugin_entry != NULL;
+			 plugin_entry = plugin_entry->next)
+		{
+			PLpgSQL_plugin *plugin_ptr = plugin_entry->plugin_ptr;
+
+			if 
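
For illustration, a minimal sketch of how a plugin might attach itself
under the proposed API (plpgsql_register_plugin() here is the function the
patch proposes, not an existing symbol; everything else is made up):

#include "postgres.h"
#include "fmgr.h"
#include "plpgsql.h"

PG_MODULE_MAGIC;

/* Hypothetical plugin: just log function entry. */
static void
my_func_beg(PLpgSQL_execstate *estate, PLpgSQL_function *func)
{
	elog(LOG, "entering a plpgsql function");
}

static PLpgSQL_plugin my_plugin = { .func_beg = my_func_beg };

void
_PG_init(void)
{
	/* Proposed call; each registered plugin keeps its own plugin_info. */
	plpgsql_register_plugin(&my_plugin);
}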

Re: [HACKERS] Disallow arrays with non-standard lower bounds

2014-01-09 Thread Peter Geoghegan
On Thu, Jan 9, 2014 at 2:30 PM, Jim Nasby j...@nasby.net wrote:
 ISTM that allowing users to pick arbitrary lower array bounds was a huge
 mistake. I've never seen anyone make use of it, can't think of any
 legitimate use cases for it, and hate the stupendous amount of extra code
 needed to deal with it.

I agree with this, but I think it's too late. I don't think the answer
is any type of parameter.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] array_length(anyarray)

2014-01-09 Thread Florian Pflug
On Jan9, 2014, at 23:26 , Jim Nasby j...@nasby.net wrote:
 On 1/9/14, 11:08 AM, Marko Tiikkaja wrote:
 On 1/9/14 5:44 PM, Florian Pflug wrote:
 On Jan9, 2014, at 14:57 , Dean Rasheed dean.a.rash...@gmail.com wrote:
 On 19 December 2013 08:05, Pavel Stehule pavel.steh...@gmail.com wrote:
 length should be irrelevant to the fact that the array starts from 1, 0 or
 anything else
 
 Yes, this should just return the number of elements, and 0 for an empty 
 array.
 
 +1. Anything that complains about arrays whose lower bound isn't 1 really
 needs a *way* less generic name than array_length().
 
 Problem is, if you're operating on an array which could have a lower bound 
 that isn't 1, why would you look at the length in the first place?  You 
 can't access any elements by index, you'd need to look at array_lower().  
 You can't iterate over the array by index, you'd need to do array_lower() 
 .. array_lower() + array_length(), which doesn't make sense.  And then 
 there's the myriad of stuff you can do with unnest() without actually having 
 to look at the length.  Same goes for multi-dimensional arrays: you have 
 even fewer things you can do there with only a length.
 
 So if we give up these constraints, we also make this function completely 
 useless.
 
 I'm generally opposed to creating code that doesn't support the full 
 feature set of something (in this case, array_lower() <> 1). But in this case I 
 hope we can all agree that allowing the user to set an arbitrary array lower 
 bound was an enormous mistake.

No doubt. 

 While we might not be able to ever completely remove that behavior, I find 
 the idea of throwing an error to be highly enticing.
 
 Plus, as Marko said, this function is pretty useless for non-1-based arrays.

That I doubt, but...

 I do agree that the name is probably too generic for this though.

this one is actually my main complaint. The name needs to very clearly mark 
such a function as dealing only with a subset of all possible arrays. Otherwise 
we'll just add to the confusion, not avoid it.

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

