Re: [HACKERS] September 2015 Commitfest

2015-10-21 Thread Michael Paquier
On Thu, Sep 3, 2015 at 10:06 PM, Andres Freund  wrote:
> To review (copying Heikki's message):
>
> 1. Pick a patch from the list at:
>
> https://commitfest.postgresql.org/6/
>
> 2. Review it. Test it. Try to break it. Do we want the feature? Any
> weird interactions in corner-cases? Does it have the intended or an
> unintended performance impact?
>
> 3. Reply on the thread on pgsql-hackers with your findings.
>
> It is perfectly OK to review a patch that someone else has signed up for as
> reviewer - different people tend to pay attention to different things.

We are close to the end of October, and the numbers are a bit more
encouraging than at the beginning:
Needs review: 27
Waiting on Author: 12
Ready for Committer: 5
Committed: 30
Among the five patches marked as ready for committer, one is a bug fix
that should be back-patched (ahem). Shouldn't we move on with those
entries first? Also, to all reviewers and authors, please be sure that
the status of your patch is correctly updated.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Rework access method interface

2015-10-21 Thread Michael Paquier
On Thu, Oct 22, 2015 at 8:37 AM, Petr Jelinek wrote:
> BTW I think this is ready for committer, except for the need to check the
> docs by a native speaker.

If so, could you update the entry of this patch accordingly?
https://commitfest.postgresql.org/6/353/
-- 
Michael




Re: [HACKERS] On columnar storage (2)

2015-10-21 Thread Haribabu Kommi
On Tue, Sep 1, 2015 at 8:53 AM, Alvaro Herrera  wrote:
> As discussed in
> https://www.postgresql.org/message-id/20150611230316.gm133...@postgresql.org
> we've been working on implementing columnar storage for Postgres.
> Here's some initial code to show our general idea, and to gather
> comments related to what we're building.  This is not a complete patch,
> and we don't claim that it works!  This is in very early stages, and we
> have a lot of work to do to get this in working shape.
>
> This was proposed during the Developer's Unconference in Ottawa earlier
> this year.  While some questions were raised about some elements of our
> design, we don't think they were outright objections, so we have pressed
> forward on the expectation that any limitations can be fixed before this
> is final if they are critical, or in subsequent commits if not.
>
> The commit messages for each patch should explain what we've done in
> enough technical detail, and hopefully provide a high-level overview of
> what we're developing.
>
> The first few pieces are "ready for comment" -- feel free to speak up
> about the catalog additions, the new COLUMN STORE bits we added to the
> grammar, the way we handle column stores in the relcache, or the
> mechanics to create column store catalog entries.
>
> The later half of the patch series is much less well cooked yet; for
> example, the colstore_dummy module is just a simple experiment to let us
> verify that the API is working.  The planner and executor code are
> mostly stubs, and we are not yet sure of what are the executor nodes
> that we would like to have: while we have discussed this topic
> internally a lot, we haven't yet formed final opinions, and of course
> the stub implementations are not doing the proper things, and in many
> cases they are even not doing anything at all.


Fujitsu is also interested in implementing a columnar storage extension.
We first thought of implementing this extension using index access methods.
The following is the basic design idea of the columnar extension; it may
need to be redesigned to fit columnar access methods.

The user creates a vertical columnar index (VCI) on a table, specifying the
columns that need to be stored in columnar format. To provide a performance
benefit for both read and write operations, the data is stored in two formats:
1) write-optimized storage (WOS) and 2) read-optimized storage (ROS). This
is useful for workloads where newly added data has a great chance of being
modified.

Because of the two storage formats, we need to maintain two kinds of entries
in the pg_class table: one for the WOS, and others for the columns held in
columnar storage.

Insert:

The write-optimized storage holds the data of all columns that are part of
the VCI in a row-wise format. All newly added data is stored in the WOS
relation along with its xmin/xmax information, so if the user updates or
deletes newly added data, it doesn't affect performance much compared to
deleting the data from columnar storage.

Tuples that don't have multiple copies, or whose data is frozen, are moved
from WOS to ROS periodically by a background worker process or the
autovacuum process. Each column's data is stored separately in its own
relation file. No transaction information is present in the ROS; the data
in the ROS is referred to by tuple ID.

In this approach the column data is present in both the heap and the
columnar storage, whereas with columnar access methods the column data
would not be present in the heap.

Select:

Because of the two storage formats, during a select operation the data in
the WOS is converted into a local ROS for the statement being executed. The
conversion cost depends on the number of tuples present in the WOS file, so
this may add some performance overhead to select statements.

Delete:

During a delete operation, whenever data is deleted from the heap, the
corresponding data in the WOS file is marked as deleted, just as in the
heap. But if the data has already been migrated from WOS to ROS, we instead
maintain a delete vector storing the tuple ID, transaction information, and
so on. When data is read from the ROS file, it is checked against the delete
vector to confirm whether each record is visible. The accumulated delete
vector data is applied to the ROS periodically.
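To make the delete-vector idea concrete, here is a minimal sketch; the
structure layout and all names are our illustration only, not actual
extension code, and snapshot-based transaction checks are omitted:

```c
/* Hypothetical sketch of the delete-vector check described above; the
 * real extension's data structures and names will differ. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t RosTupleId;

/* One entry per tuple deleted after its migration from WOS to ROS. */
typedef struct DeleteVectorEntry
{
    RosTupleId  tid;            /* tuple ID within the ROS file */
    uint32_t    xmax;           /* deleting transaction, for visibility */
} DeleteVectorEntry;

typedef struct DeleteVector
{
    DeleteVectorEntry *entries; /* kept sorted by tid */
    size_t      nentries;
} DeleteVector;

/* Return true if the ROS tuple is still visible, i.e. it has no entry in
 * the delete vector (snapshot checks against xmax omitted for brevity). */
bool
ros_tuple_visible(const DeleteVector *dv, RosTupleId tid)
{
    size_t      lo = 0,
                hi = dv->nentries;

    while (lo < hi)
    {
        size_t      mid = lo + (hi - lo) / 2;

        if (dv->entries[mid].tid == tid)
            return false;       /* deleted: filter it out of the scan */
        if (dv->entries[mid].tid < tid)
            lo = mid + 1;
        else
            hi = mid;
    }
    return true;
}
```

A periodic "apply" pass would then merge these entries into the ROS files
and discard the vector, as described above.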

The concept of this columnar extension is from Fujitsu Labs, Japan.
Any comments on further evaluation of this approach with respect to
columnar access methods?

Regards,
Hari Babu
Fujitsu Australia




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-10-21 Thread Michael Paquier
On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera
 wrote:
> Heikki Linnakangas wrote:
>
>> Thanks. For comparison, I wrote a patch to implement what I had in mind.
>>
>> When a WAL-skipping COPY begins, we add an entry for that relation in a
>> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
>> would normally be WAL-logged, we check if the relation is in the hash table,
>> and skip WAL-logging if so.
>
> I think this wasn't applied, was it?

No, it was not applied.
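(For reference, the mechanism Heikki describes in the quoted text could be
sketched roughly like this. It is a toy stand-in: the real patch would use
the backend's dynahash tables, and every name below is invented.)

```c
/* Toy sketch of the "pending-fsyncs" idea: a small open-addressing hash
 * set of relation OIDs registered by a WAL-skipping COPY. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PENDING_SLOTS 64        /* power of two, toy-sized */

typedef uint32_t Oid;

typedef struct PendingFsyncSet
{
    Oid         slots[PENDING_SLOTS];   /* 0 = empty slot */
} PendingFsyncSet;

/* Register a relation when its WAL-skipping COPY begins. */
void
pending_fsync_add(PendingFsyncSet *set, Oid relid)
{
    uint32_t    i = (relid * 2654435761u) & (PENDING_SLOTS - 1);

    while (set->slots[i] != 0 && set->slots[i] != relid)
        i = (i + 1) & (PENDING_SLOTS - 1);
    set->slots[i] = relid;
}

/* Before any heap action that would normally be WAL-logged: skip the
 * WAL record iff the relation is in the pending-fsyncs set. */
bool
skip_wal_logging(const PendingFsyncSet *set, Oid relid)
{
    uint32_t    i = (relid * 2654435761u) & (PENDING_SLOTS - 1);

    while (set->slots[i] != 0)
    {
        if (set->slots[i] == relid)
            return true;
        i = (i + 1) & (PENDING_SLOTS - 1);
    }
    return false;
}
```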
-- 
Michael




Re: [HACKERS] Making tab-complete.c easier to maintain

2015-10-21 Thread Thomas Munro
Here is a new version merging the recent CREATE EXTENSION ... VERSION
patch from master.

(Apologies for sending so many versions.  tab-complete.c keeps moving
and I want to keep a version that applies on top of master out there,
for anyone interested in looking at this.  As long as no one objects
and there is interest in the patch, I'll keep doing that.)

-- 
Thomas Munro
http://www.enterprisedb.com


tab-complete-macrology-v6.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] WIP: Rework access method interface

2015-10-21 Thread Tom Lane
Petr Jelinek  writes:
> On 2015-10-12 14:32, Alexander Korotkov wrote:
>> Also, it was planned that documentation changes would be reviewed by
>> a native English speaker.

> BTW I think this is ready for committer, except for the need to check the
> docs by a native speaker.

I'm working at Salesforce this week, but will take a look after I get
home.

regards, tom lane




Re: [HACKERS] WIP: Rework access method interface

2015-10-21 Thread Petr Jelinek

On 2015-10-12 14:32, Alexander Korotkov wrote:

Also, it was planned that documentation changes would be reviewed by
a native English speaker.


BTW I think this is ready for committer, except for the need to check the
docs by a native speaker.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Duplicated assignment of slot_name in walsender.c

2015-10-21 Thread Tom Lane
Alvaro Herrera  writes:
> Andres Freund wrote:
>> for another every decent compiler can optimize those away. Note that
>> those duplicate strlen() calls are there in a lot of places in
>> walsender.c

> It can?  Tom has repeatedly argued the opposite, in the past.

I'm prepared to believe that *some* compilers do that, but I think it's
doubtful that they all do.  Even if they do, it would have to be a very
tightly constrained optimization, since the compiler would have to be able
to prove that there is no way for the referenced string to get changed
between the two call sites.  That would likely mean that some places where
you think this will happen will actually end up doing the strlen() twice.

I'm willing to buy the argument that performance doesn't matter in this
particular context; but if we think it does, I'd rather see us explicitly
save and re-use the strlen() result, so that the code behaves the same on
every platform.
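The explicit save-and-reuse pattern Tom describes would look roughly like
the sketch below; the pq_* calls appear only as comments, since this is an
illustration rather than walsender.c itself:

```c
/* Illustrative sketch of saving a strlen() result once and re-using it,
 * rather than relying on the compiler to merge duplicate calls. */
#include <stddef.h>
#include <string.h>

/* Returns the total wire size: a 4-byte length header plus the payload,
 * mirroring the pq_sendint()/pq_sendbytes() pairs in walsender.c. */
size_t
send_counted_string(const char *s)
{
    size_t      len = strlen(s);    /* computed exactly once */

    /* pq_sendint(&buf, len, 4);     -- length header uses len */
    /* pq_sendbytes(&buf, s, len);   -- payload re-uses the same len */
    return 4 + len;
}
```

With the length in a local variable, the code behaves identically whether
or not the compiler can prove the string is unchanged between uses.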

regards, tom lane




Re: [HACKERS] bugs and bug tracking

2015-10-21 Thread Thomas Munro
On Thu, Oct 22, 2015 at 2:04 AM, Nathan Wagner  wrote:
> https://granicus.if.org/pgbugs/ for anyone who hasn't and wants to take a
> look.

FYI There seems to be an encoding problem somewhere.  "Đặng Minh Dũng"
is showing up as "Đặng Minh Dũng" on this page:

https://granicus.if.org/pgbugs/13691

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Duplicated assignment of slot_name in walsender.c

2015-10-21 Thread Andres Freund
On 2015-10-21 19:36:16 -0300, Alvaro Herrera wrote:
> diff --git a/src/backend/replication/walsender.c 
> b/src/backend/replication/walsender.c
> index c6043cd..5487cc0 100644
> --- a/src/backend/replication/walsender.c
> +++ b/src/backend/replication/walsender.c
> @@ -762,10 +762,10 @@ logical_read_xlog_page(XLogReaderState *state, 
> XLogRecPtr targetPagePtr, int req
>  static void
>  CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
>  {
> - const char *slot_name;
>   const char *snapshot_name = NULL;
>   charxpos[MAXFNAMELEN];
>   StringInfoData buf;
> + int len;

If you want to do that, I'm unenthusiastically not objecting. But if so,
let's also do it in IdentifySystem(), SendTimeLineHistory(),
StartReplication(), SendBackupHeader(), SendXLogRecPtrResult() - they're
modeled after each other.

Do you want to commit the slot_name with the rest or do you want me to
do that?

Greetings,

Andres Freund




Re: [HACKERS] Duplicated assignment of slot_name in walsender.c

2015-10-21 Thread Alvaro Herrera
Andres Freund wrote:

> That seems fairly insignificant. For one this is a rather infrequent and
> expensive operation, for another every decent compiler can optimize
> those away. Note that those duplicate strlen() calls are there in a lot
> of places in walsender.c

diff --git a/src/backend/replication/walsender.c 
b/src/backend/replication/walsender.c
index c6043cd..5487cc0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -762,10 +762,10 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr 
targetPagePtr, int req
 static void
 CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 {
-   const char *slot_name;
const char *snapshot_name = NULL;
charxpos[MAXFNAMELEN];
StringInfoData buf;
+   int len;
 
Assert(!MyReplicationSlot);
 
@@ -791,14 +791,11 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 
initStringInfo(&output_message);
 
-   slot_name = NameStr(MyReplicationSlot->data.name);
-
if (cmd->kind == REPLICATION_KIND_LOGICAL)
{
LogicalDecodingContext *ctx;
 
-   ctx = CreateInitDecodingContext(
-   
cmd->plugin, NIL,
+   ctx = CreateInitDecodingContext(cmd->plugin, NIL,

logical_read_xlog_page,

WalSndPrepareWrite, WalSndWriteData);
 
@@ -834,7 +831,6 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
ReplicationSlotSave();
}
 
-   slot_name = NameStr(MyReplicationSlot->data.name);
snprintf(xpos, sizeof(xpos), "%X/%X",
 (uint32) (MyReplicationSlot->data.confirmed_flush >> 
32),
 (uint32) MyReplicationSlot->data.confirmed_flush);
@@ -885,18 +881,21 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
pq_sendint(&buf, 4, 2); /* # of columns */
 
/* slot_name */
-   pq_sendint(&buf, strlen(slot_name), 4); /* col1 len */
-   pq_sendbytes(&buf, slot_name, strlen(slot_name));
+   len = strlen(NameStr(MyReplicationSlot->data.name));
+   pq_sendint(&buf, len, 4);   /* col1 len */
+   pq_sendbytes(&buf, NameStr(MyReplicationSlot->data.name), len);
 
/* consistent wal location */
-   pq_sendint(&buf, strlen(xpos), 4);  /* col2 len */
-   pq_sendbytes(&buf, xpos, strlen(xpos));
+   len = strlen(xpos);
+   pq_sendint(&buf, len, 4);   /* col2 len */
+   pq_sendbytes(&buf, xpos, len);
 
/* snapshot name */
if (snapshot_name != NULL)
{
-   pq_sendint(&buf, strlen(snapshot_name), 4); /* col3 
len */
-   pq_sendbytes(&buf, snapshot_name, strlen(snapshot_name));
+   len = strlen(snapshot_name);
+   pq_sendint(&buf, len, 4);   /* col3 len */
+   pq_sendbytes(&buf, snapshot_name, len);
}
else
pq_sendint(&buf, -1, 4);/* col3 len, NULL */
@@ -904,8 +903,9 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
/* plugin */
if (cmd->plugin != NULL)
{
-   pq_sendint(&buf, strlen(cmd->plugin), 4);   /* col4 
len */
-   pq_sendbytes(&buf, cmd->plugin, strlen(cmd->plugin));
+   len = strlen(cmd->plugin);
+   pq_sendint(&buf, len, 4);   /* col4 len */
+   pq_sendbytes(&buf, cmd->plugin, len);
}
else
pq_sendint(&buf, -1, 4);/* col4 len, NULL */

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Duplicated assignment of slot_name in walsender.c

2015-10-21 Thread Alvaro Herrera
Andres Freund wrote:
> On 2015-10-21 17:21:25 -0300, Alvaro Herrera wrote:

> > I think the first assignment is also pointless -- I mean can't we just
> > use MyReplicationSlot->data.name in both places where slot_name is used?
> 
> Makes the lines a bit long. strlen(NameStr(MyReplicationSlot->data.name)) ...

Meh.  Assign the strlen somewhere?

> > In the same routine, it seems wasteful to be doing strlen() twice for
> > every string sent over the wire.
> 
> That seems fairly insignificant. For one this is a rather infrequent and
> expensive operation,

OK, I can buy that.

> for another every decent compiler can optimize those away. Note that
> those duplicate strlen() calls are there in a lot of places in
> walsender.c

It can?  Tom has repeatedly argued the opposite, in the past.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] pg_recvlogical fixes

2015-10-21 Thread Andres Freund
On 2015-10-21 11:52:31 -0300, Euler Taveira wrote:
> While testing wal2json, I faced some problems with pg_recvlogical. Attached
> is a series of patches that can improve pg_recvlogical. Patches #2 and #3 are
> bugfixes (and should be applied to 9.5 too). Patch #1 is not mandatory for
> 9.5.

> #1: add a bunch of checks to complain when using an option that is not
> available in the specified action;

I'm not a fan of doing that. Doing that for every option tends to be
more annoying than helpful. E.g. pgbench's checks require you to
pointlessly remove a lot of harmless options just to be able to pass -i.

> #2: there is a wrong check because startpos option can be specified with
> --create-slot;

> #3: doesn't ignore startpos in --create-slot because that action could be
> specified together with --start action (that uses that option);

It doesn't make sense to specify it with --create-slot though - you
can't even know what the start position would mean before the slot is
created. You can't get older records than the ones at slot creation
time, and you can't know what a future xlog position would mean.

Greetings,

Andres Freund




Re: [HACKERS] Duplicated assignment of slot_name in walsender.c

2015-10-21 Thread Andres Freund
On 2015-10-21 17:21:25 -0300, Alvaro Herrera wrote:
> > It seems that the 2nd assignment was an oversight. Spotted by Bernd
> > Helmle.

Yea, that's obviously redundant. Will remove.

> I think the first assignment is also pointless -- I mean can't we just
> use MyReplicationSlot->data.name in both places where slot_name is used?

Makes the lines a bit long. strlen(NameStr(MyReplicationSlot->data.name)) ...

> In the same routine, it seems wasteful to be doing strlen() twice for
> every string sent over the wire.

That seems fairly insignificant. For one this is a rather infrequent and
expensive operation, for another every decent compiler can optimize
those away. Note that those duplicate strlen() calls are there in a lot
of places in walsender.c

Greetings,

Andres Freund




Re: [HACKERS] Duplicated assignment of slot_name in walsender.c

2015-10-21 Thread Alvaro Herrera
Euler Taveira wrote:

> It seems that the 2nd assignment was an oversight. Spotted by Bernd
> Helmle.

I think the first assignment is also pointless -- I mean can't we just
use MyReplicationSlot->data.name in both places where slot_name is used?

In the same routine, it seems wasteful to be doing strlen() twice for
every string sent over the wire.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Freezing without cleanup lock

2015-10-21 Thread Alvaro Herrera
Jim Nasby wrote:
> While warning a client that just did a Slony-based version upgrade to make
> sure to freeze the new database, it occurred to me that it should be safe to
> freeze without the cleanup lock. This is interesting because it would allow
> a scan_all vacuum to do its job without blocking on the cleanup lock.
> 
> Does anyone have a feel for whether scan_all vacuums blocking on the cleanup
> lock is an actual problem?

Yeah, I remember we discussed this and some other possible improvements
related to freezing.  I think other ideas proposed were that (1) during
an emergency (uncancellable) autovacuum run, we process only the tables
that are past the age limit, and (2) we remove the cost-based sleep so
that it finishes as quickly as possible.  (Yours is (3): only freeze, and
do not do any actual pruning -- did I get that right?)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Re: [BUGS] BUG #13694: Row Level Security by-passed with CREATEUSER permission

2015-10-21 Thread Joe Conway
On 10/21/2015 12:46 PM, Tom Lane wrote:
> Attached patch rips out CREATEUSER and NOCREATEUSER options lock, stock,
> and barrel.

Looks good to me.

> Another possibility is to change them to actually mean CREATEROLE and
> NOCREATEROLE.  I think probably a clean break is better though.


I think that would be too confusing. I'd rather see them go away ala
your patch.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Duplicated assignment of slot_name in walsender.c

2015-10-21 Thread Euler Taveira

On 20-10-2015 08:28, Bernd Helmle wrote:

The 2nd assignment to slot_name looks unnecessary?


Yes, it is. Seems to be an oversight. Patch attached.


--
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
From 87570993d29f2c98121c3a0a75c85cdc4211f24f Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Wed, 21 Oct 2015 16:52:26 -0300
Subject: [PATCH] Fix a duplicated assignment in walsender code

It seems that the 2nd assignment was an oversight. Spotted by Bernd
Helmle.
---
 src/backend/replication/walsender.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c6043cd..ca1b4b9 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -834,7 +834,6 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
 		ReplicationSlotSave();
 	}
 
-	slot_name = NameStr(MyReplicationSlot->data.name);
 	snprintf(xpos, sizeof(xpos), "%X/%X",
 			 (uint32) (MyReplicationSlot->data.confirmed_flush >> 32),
 			 (uint32) MyReplicationSlot->data.confirmed_flush);
-- 
2.1.4




Re: [HACKERS] Freezing without cleanup lock

2015-10-21 Thread Tom Lane
Jim Nasby  writes:
> While warning a client that just did a Slony-based version upgrade to 
> make sure to freeze the new database, it occurred to me that it should 
> be safe to freeze without the cleanup lock.

What's your argument for that being safe?

regards, tom lane




Re: [HACKERS] Freezing without cleanup lock

2015-10-21 Thread Andres Freund
On October 21, 2015 9:47:45 PM GMT+02:00, Tom Lane wrote:
>Jim Nasby writes:
>> While warning a client that just did a Slony-based version upgrade to
>> make sure to freeze the new database, it occurred to me that it should
>> be safe to freeze without the cleanup lock.
>
>What's your argument for that being safe?

It doesn't affect tuple contents and thus backends having a pin can continue 
looking at tuple contents. The reason we need a cleanup lock is IIRC repairing 
page fragmentation / hot pruning, not freezing.

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.




Re: [HACKERS] [BUGS] BUG #13694: Row Level Security by-passed with CREATEUSER permission

2015-10-21 Thread Tom Lane
Joe Conway  writes:
> On 10/21/2015 11:26 AM, Andres Freund wrote:
>> On 2015-10-21 11:17:44 -0700, Tom Lane wrote:
>>> I wonder if it's time yet to remove those keywords.  We've had the
>>> SUPERUSER spelling since 8.1, and this report should remind us that
>>> people get confused by the old spellings.

>> +1 for doing that in 9.6.

> 1++

Attached patch rips out CREATEUSER and NOCREATEUSER options lock, stock,
and barrel.

Another possibility is to change them to actually mean CREATEROLE and
NOCREATEROLE.  I think probably a clean break is better though.

regards, tom lane

diff --git a/doc/src/sgml/ref/alter_role.sgml b/doc/src/sgml/ref/alter_role.sgml
index 7638817b830a1eb73ab59174591fe6eaf3378453..da36ad96967e7c842c28f5787b5ec0dc48ec5a60 100644
*** a/doc/src/sgml/ref/alter_role.sgml
--- b/doc/src/sgml/ref/alter_role.sgml
*** ALTER ROLE NOCREATEDB
CREATEROLE
NOCREATEROLE
-   CREATEUSER
-   NOCREATEUSER
INHERIT
NOINHERIT
LOGIN
--- 159,164 
diff --git a/doc/src/sgml/ref/alter_user.sgml b/doc/src/sgml/ref/alter_user.sgml
index 0ffaa16da2fc75fcd1121dcde627837d6a8e9166..84a0c52191fc6c82687ef49df99ac30fdbfd8f1d 100644
*** a/doc/src/sgml/ref/alter_user.sgml
--- b/doc/src/sgml/ref/alter_user.sgml
*** ALTER USER password'
--- 28,33 
diff --git a/doc/src/sgml/ref/create_role.sgml b/doc/src/sgml/ref/create_role.sgml
index 240c21ce85fcd1bd4cddd4bb6af109ef319f8175..38cd4c83696053f5b70fc514e9b3939cb842fb3a 100644
*** a/doc/src/sgml/ref/create_role.sgml
--- b/doc/src/sgml/ref/create_role.sgml
*** CREATE ROLE 
  
   
-   CREATEUSER
-   NOCREATEUSER
-   
-
- These clauses are an obsolete, but still accepted, spelling of
- SUPERUSER and NOSUPERUSER.
- Note that they are not equivalent to
- CREATEROLE as one might naively expect!
-
-   
-  
- 
-  
INHERIT
NOINHERIT

--- 124,129 
diff --git a/doc/src/sgml/ref/create_user.sgml b/doc/src/sgml/ref/create_user.sgml
index 065999c85a4df610908c1e762d7bc3435406986b..6c690b36df561573f234de073e4ddbb10e6e9ec7 100644
*** a/doc/src/sgml/ref/create_user.sgml
--- b/doc/src/sgml/ref/create_user.sgml
*** CREATE USER 


[HACKERS] Freezing without cleanup lock

2015-10-21 Thread Jim Nasby
While warning a client that just did a Slony-based version upgrade to 
make sure to freeze the new database, it occurred to me that it should 
be safe to freeze without the cleanup lock. This is interesting because 
it would allow a scan_all vacuum to do its job without blocking on the 
cleanup lock.


Does anyone have a feel for whether scan_all vacuums blocking on the 
cleanup lock is an actual problem?

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] Freeze avoidance of very large table.

2015-10-21 Thread Alvaro Herrera
Jim Nasby wrote:
> On 10/21/15 8:11 AM, Andres Freund wrote:
> >Meh. Adding complexity definitely needs to be weighed against the
> >benefits. As pointed out e.g. by all the multixact issues you mentioned
> >upthread. In this case your argument for changing the name doesn't seem
> >to hold much water.
> 
> ISTM VISIBILITY_MAP_FROZEN_BIT_CAT_VER should be defined in catversion.h
> instead of pg_upgrade.h, though, to ensure it's correctly updated when this
> gets committed.

That would be untidy and pointless.  pg_upgrade.h contains other
catversions.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Freeze avoidance of very large table.

2015-10-21 Thread Jim Nasby

On 10/21/15 8:11 AM, Andres Freund wrote:

Meh. Adding complexity definitely needs to be weighed against the
benefits. As pointed out e.g. by all the multixact issues you mentioned
upthread. In this case your argument for changing the name doesn't seem
to hold much water.


ISTM VISIBILITY_MAP_FROZEN_BIT_CAT_VER should be defined in catversion.h 
instead of pg_upgrade.h, though, to ensure it's correctly updated when 
this gets committed.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




[HACKERS] Change behavior of (m)xid_age

2015-10-21 Thread Jim Nasby
Currently, xid_age() returns INT_MAX for a permanent xid. The comment in 
the function that 'Permanent XIDs are always infinitely old' may be 
technically correct, but returning INT_MAX is a useless behavior because 
it actually makes it look like that XID is in immediate wraparound 
danger. I think we should change it to return either 0, -1, or INT_MIN. 
To me, 0 makes the most sense for monitoring relfrozenxid.
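A sketch of what the proposed behavior would look like; this is an
illustration using simplified stand-ins, not the actual PostgreSQL source,
and the wraparound handling is reduced to a plain modular difference:

```c
/* Hypothetical sketch of the proposed xid_age() change: permanent XIDs
 * (Invalid, Bootstrap, Frozen) report age 0 rather than INT_MAX. */
#include <stdint.h>

typedef uint32_t TransactionId;

#define FirstNormalTransactionId ((TransactionId) 3)
#define TransactionIdIsNormal(xid) ((xid) >= FirstNormalTransactionId)

int32_t
xid_age_proposed(TransactionId xid, TransactionId next_xid)
{
    if (!TransactionIdIsNormal(xid))
        return 0;               /* proposed: 0 instead of INT_MAX */

    /* Modular difference, so wraparound behaves sanely for normal XIDs. */
    return (int32_t) (next_xid - xid);
}
```

With this, a monitoring query like max(age(relfrozenxid)) would no longer
be dominated by permanent XIDs that only look like they are about to wrap.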

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] COPY FREEZE and PD_ALL_VISIBLE

2015-10-21 Thread Jeff Janes
On Tue, Oct 20, 2015 at 7:02 AM, Robert Haas  wrote:

> On Sun, Oct 18, 2015 at 5:23 PM, Jeff Janes  wrote:
> > I'm planning on adding a todo item to have COPY FREEZE set
> PD_ALL_VISIBLE.
> > Or is there some reason this can't be done?
> >
> > Since the whole point of COPY FREEZE is to avoid needing to rewrite the
> > entire table, it seems rather perverse that the first time the table is
> > vacuumed, it needs to rewrite the entire table.
>
> *facepalm*
>
> I don't know how hard that is to implement, but +1 for trying to
> figure out a way.
>


It turns out it was pretty easy to set PD_ALL_VISIBLE on the new pages,
since the code in hio that requests the relation to be extended already has
info on the tuple's intended freeze status.

Then you just need to refrain from clearing PD_ALL_VISIBLE when that tuple
is actually written into the page.  Not only because clearing would defeat
the purpose, but also because it will cause an error--apparently the
incipient page is not yet in a state where visibilitymap_clear is willing
to deal with it.

With this patch, you get a table which has PD_ALL_VISIBLE set for all
pages, but which doesn't have a _vm file. Index-only scans will visit the
heap for each tuple until the first VACUUM is done.

The first vacuum will read the entire table, but will no longer need to
write it.  And it will create the _vm file.

I think we really want to create _vm file as well as set PD_ALL_VISIBLE,
but I don't know the best way to do that.  Set a flag somewhere and then
create it in bulk at the end of the transaction?  Set it bit by bit as the
pages are extended and initialized?
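The flag handling described above can be sketched as follows. These are toy
structures, not backend code; the PD_ALL_VISIBLE bit value is taken from
bufpage.h, but the helper names are invented for illustration:

```c
/* Sketch of the COPY FREEZE behavior described above: set PD_ALL_VISIBLE
 * when a page is extended for frozen tuples, and refrain from clearing
 * it when the frozen tuple itself is written into the page. */
#include <stdbool.h>
#include <stdint.h>

#define PD_ALL_VISIBLE 0x0004   /* page-header flag bit, as in bufpage.h */

typedef struct PageHeaderSketch
{
    uint16_t    pd_flags;
} PageHeaderSketch;

/* Extension time: hio already knows the tuples' intended freeze status. */
void
page_init_for_copy(PageHeaderSketch *page, bool tuples_frozen)
{
    page->pd_flags = 0;
    if (tuples_frozen)
        page->pd_flags |= PD_ALL_VISIBLE;
}

/* Insertion time: normally clear PD_ALL_VISIBLE, but not for a frozen
 * COPY -- clearing would defeat the purpose, and the incipient page is
 * not yet in a state visibilitymap_clear() can deal with. */
void
place_tuple(PageHeaderSketch *page, bool tuple_frozen)
{
    if (!tuple_frozen)
        page->pd_flags &= ~PD_ALL_VISIBLE;
}
```

Setting the corresponding visibility-map bits (the open question above)
would be a separate step, whether done page by page or in bulk at end of
transaction.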

Cheers,

Jeff


copy_freeze_all_visible_v1.patch
Description: Binary data



Re: [HACKERS] [PATCH v3] GSSAPI encryption support

2015-10-21 Thread Robbie Harwood
Michael Paquier  writes:

> On Tue, Oct 20, 2015 at 3:01 AM, Robbie Harwood wrote:
>> Stephen Frost  writes:
>>> psql: lost synchronization with server: got message type "S", length 22
>>
>> which unfortunately could be a great many things.  I've said this a
>> couple times now, but I really do need more information - a traffic
>> dump, a list of commands that were run, etc.; unfortunately, the surface
>> here is pretty large, and while I totally am willing to believe there
>> are bugs in the code I've written, I do not yet see them.
>
> --- a/src/interfaces/libpq/fe-protocol3.c
> +++ b/src/interfaces/libpq/fe-protocol3.c
> @@ -129,6 +129,58 @@ pqParseInput3(PGconn *conn)
> return;
> }
>
> +#ifdef ENABLE_GSS
> +   /* We want to be ready in both IDLE and BUSY states
> for encryption */
> +   if (id == 'g' && !conn->gss_disable_enc && conn->gctx)
> +   {
> +   ssize_t encEnd, next;
> [...]
> +   }
> +   else if (!conn->gss_disable_enc && conn->gss_auth_done &&
> +!conn->gss_decrypted_cur && id != 'E')
> +   /* This could be a sync error, so let's handle
> it as such. */
> +   handleSyncLoss(conn, id, msgLength);
> +#endif
>
> Hm. The out-of-sync error I am seeing in my environment is caused by
> this block when parsing 'g' messages coming from the backend that are
> considered as being GSSAPI-encrypted messages. I am still looking at
> that...

If you're hitting the else-block, that suggests a GSSAPI context is not
present at the time a GSSAPI message was received, I think.




Re: [HACKERS] WAL logging problem in 9.4.3?

2015-10-21 Thread Alvaro Herrera
Heikki Linnakangas wrote:

> Thanks. For comparison, I wrote a patch to implement what I had in mind.
> 
> When a WAL-skipping COPY begins, we add an entry for that relation in a
> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
> would normally be WAL-logged, we check if the relation is in the hash table,
> and skip WAL-logging if so.

I think this wasn't applied, was it?

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Suporting multiple recursive table reads

2015-10-21 Thread Wesley Massuda
I am trying to patch Postgres to accept this query: a recursive query
referencing the recursive table twice. After removing some checks, it
accepts the query and generates a plan.

### Query

create table base(k1,k2,k3) as  (values
(1,null,null),(2,null,null),(3,1,2));

with recursive x(k1,k2,k3) as ( select k1,null::record,null ::record from
base  union all select b.k1,x1,x2 from base as b join  x as x1  on
x1.k1=b.k2  join x as x2 on  x2.k1=b.k3  where b.k2 is not null or b.k3 is
not null) select * from x;

drop table base;

### Plan


 CTE Scan on x  (cost=7745569.00..11990441.80 rows=212243640 width=68)
   CTE x
 ->  Recursive Union  (cost=0.00..7745569.00 rows=212243640 width=60)
   ->  Seq Scan on base  (cost=0.00..30.40 rows=2040 width=4)
   ->  Merge Join  (cost=31081.98..350066.58 rows=21224160 width=6
0)
 Merge Cond: (x2.k1 = b.k3)
 ->  Sort  (cost=1868.26..1919.26 rows=20400 width=32)
   Sort Key: x2.k1
   ->  WorkTable Scan on x x2  (cost=0.00..408.00 rows
=20400 width=32)
 ->  Materialize  (cost=29213.72..30254.12 rows=208080 wid
th=36)
   ->  Sort  (cost=29213.72..29733.92 rows=208080 widt
h=36)
 Sort Key: b.k3
 ->  Merge Join  (cost=2010.80..5142.20 rows=2
08080 width=36)
   Merge Cond: (b.k2 = x1.k1)
   ->  Sort  (cost=142.54..147.64 rows=204
0 width=12)
 Sort Key: b.k2
 ->  Seq Scan on base b  (cost=0.0
0..30.40 rows=2040 width=12)
   Filter: ((k2 IS NOT NULL) O
R (k3 IS NOT NULL))
   ->  Sort  (cost=1868.26..1919.26 rows=2
0400 width=32)
 Sort Key: x1.k1
 ->  WorkTable Scan on x x1  (cost
=0.00..408.00 rows=20400 width=32)


But when the query is executed, it looks like the working table can't be
read twice, so the second join returns empty. Reading the WorkTableScan
executor, I see some notes about a private read pointer and copying.

It's my first time in the Postgres codebase. Can someone point out
whether the working table really can't be read twice, and whether there
is similar code in another executor node that allows it?

-- 
Wesley S. Massuda


Re: [HACKERS] [PATCH v3] GSSAPI encryption support

2015-10-21 Thread Robbie Harwood
Michael Paquier  writes:

> Robbie,
>
> +#ifdef ENABLE_GSS
> +   if (pggss_encrypt(conn) < 0)
> +   return EOF;
> +#endif
>
> @@ -1528,10 +1541,20 @@ socket_putmessage(char msgtype, const char *s,
> size_t len)
> if (internal_putbytes(s, len))
> goto fail;
> PqCommBusy = false;
> +#ifdef ENABLE_GSS
> +   /* if we're GSSAPI encrypting, s was allocated in be_gss_encrypt */
> +   if (msgtype == 'g')
> +   pfree((char *)s);
> +#endif
>
> Looking at this patch in more details... Why is it necessary to wrap
> all the encrypted messages into a dedicated message type 'g'? Cannot
> we rely on the context on client and backend that should be set up
> after authentication to perform the encryption and decryption
> operations? This patch is enforcing the message type in
> socket_putmessage for backend and pggss_encrypt/pqPutMsgEnd for
> frontend, this just feels wrong and I think that the patch could be
> really simplified, this includes the crafting added in fe-protocol3.c
> that should be IMO reserved only for messages received from the
> backend and should not be made aware of any protocol-level logic.

Well, it's not strictly necessary in the general case, but it makes
debugging a *lot* easier to have it present, and it simplifies some of
the handling logic.  For instance, in the part quoted above, with
socket_putmessage() and socket_putmessage_noblock(), we need some way to
tell whether a message blob has already been encrypted.

Some enforcement of message type *will need to be carried out* anyway to
avoid security issues with tampering on the wire, whether that be by
sanity-checking the stated message type and then handling accordingly,
or trying to decrypt and detonating the connection if it fails.

GSSAPI does not define a wire protocol for the transport of messages the
way, e.g., TLS does, so there must be some handling of incomplete
messages, multiple messages at once, etc.  There is already adequate
handling of these present in postgres already, so I have structured the
code to take advantage of it.  That said, I could absolutely reimplement
them - but it would not be simpler, I'm reasonably sure.




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-10-21 Thread Masahiko Sawada
On Tue, Oct 20, 2015 at 8:10 PM, Beena Emerson  wrote:
>
> On Mon, Oct 19, 2015 at 8:47 PM, Masahiko Sawada 
> wrote:
>>
>>
>> Hi,
>>
>> Attached patch is a rough patch which supports multi sync replication
>> by another approach I sent before.
>>
>> The new GUC parameters are:
>> * synchronous_standby_num, which specifies the number of standby
>> servers using sync rep. (default is 0)
>> * synchronous_replication_method, which specifies replication method;
>> priority or quorum. (default is priority)
>>
>> The behaviour of 'priority' and 'quorum' is the same as what we've been
>> discussing, but I'll write an overview of them here again.
>>
>> [Priority Method]
>> Each standby server has a different priority, and the active standby
>> servers with the top N priorities become sync standby servers.
>> If synchronous_standby_names = '*', all active standby servers
>> would be sync standby servers.
>> If you want to set up a standby like in 9.5 or before, you can set
>> synchronous_standby_num = 1.
>>
>
>
> I used the following setting with 2 servers A and D connected:
>
> synchronous_standby_names = 'A,B,C,D'
> synchronous_standby_num = 2
> synchronous_replication_method = 'priority'
>
> Though s_r_m = 'quorum' worked fine, changing it to 'priority' caused
> segmentation fault.
>

Thank you for taking a look!
This patch is a tool for discussion, so I'm not going to fix this bug
until we get consensus.

We are still discussing in order to find a solution that can get consensus.
I felt that it's difficult to choose between the two approaches within
this development cycle, and there would not be time to implement such a
big feature even if we chose one.
But this feature is obviously needed by many users.
So I'm considering a simpler and more extensible solution; the
idea I posted is one of them.
Another approach worth considering is just specifying the number of
sync standbys, which also covers the main use cases.

Regards,

--
Masahiko Sawada




[HACKERS] pg_recvlogical fixes

2015-10-21 Thread Euler Taveira

Hi,

While testing wal2json, I faced some problems with pg_recvlogical. 
Attached is a series of patches that improve pg_recvlogical. Patches 
#2 and #3 are bugfixes (and should be applied to 9.5 too). Patch #1 is 
not mandatory for 9.5.


Short description:

#1: add a bunch of checks to complain when using an option that is not 
available in the specified action;
#2: there is a wrong check because startpos option can be specified with 
--create-slot;
#3: doesn't ignore startpos in --create-slot because that action could 
be specified together with --start action (that uses that option);



--
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
>From 9b3d63d1744e2b65a7f1a1d44ed166f4c9684771 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Tue, 1 Sep 2015 23:52:55 -0300
Subject: [PATCH 3/3] Fix a startpos override.

Since --create-slot and --start can be specified together, we can't
override startpos while creating a slot (it'll ignore the replication
start position, if specified).
---
 src/bin/pg_basebackup/pg_recvlogical.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index b59fe13..a2284d2 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -969,7 +969,6 @@ main(int argc, char **argv)
 		if (!CreateReplicationSlot(conn, replication_slot, plugin,
    false, slot_exists_ok))
 			disconnect_and_exit(1);
-		startpos = InvalidXLogRecPtr;
 	}
 
 	if (!do_start_slot)
-- 
2.1.4

>From 8393aa62cddeaff8cc66b19c6ba36a371b090db1 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Tue, 1 Sep 2015 23:46:33 -0300
Subject: [PATCH 2/3] Fix startpos parameter check.

startpos parameter can be specified together with --create-slot (--start
must be specified too). This check is wrong since --create-slot and
--start could be specified together.
---
 src/bin/pg_basebackup/pg_recvlogical.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 9ce237f..b59fe13 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -848,9 +848,9 @@ main(int argc, char **argv)
 		exit(1);
 	}
 
-	if (startpos != InvalidXLogRecPtr && (do_create_slot || do_drop_slot))
+	if (startpos != InvalidXLogRecPtr && !do_start_slot)
 	{
-		fprintf(stderr, _("%s: cannot use --create-slot or --drop-slot together with --startpos\n"), progname);
+		fprintf(stderr, _("%s: can only use --startpos together with --start\n"), progname);
 		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 progname);
 		exit(1);
-- 
2.1.4

>From da241d9e2529888297c0f68bd89706fb59ae5884 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Tue, 1 Sep 2015 23:34:51 -0300
Subject: [PATCH 1/3] pg_recvlogical: Tighten checks for action parameters.

Until now, the mistaken parameters are ignored (which is bad because
that parameter won't take the desired effect). Complaining about the
"wrong" parameter will help the user to fix the problem immediately.
---
 doc/src/sgml/ref/pg_recvlogical.sgml   |  2 +-
 src/bin/pg_basebackup/pg_recvlogical.c | 82 +++---
 2 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index 9d0b58b..a513047 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -149,7 +149,7 @@ PostgreSQL documentation
 In --start mode, start replication from the given
 LSN.  For details on the effect of this, see the documentation
 in 
-and . Ignored in other modes.
+and .

   
  
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 93f61c3..9ce237f 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -30,12 +30,17 @@
 /* Time to sleep between reconnection attempts */
 #define RECONNECT_SLEEP_TIME 5
 
+#define	DEFAULT_MESSAGE_TIMEOUT	(10 * 1000)		/* 10 secs = default */
+#define	DEFAULT_FSYNC_INTERVAL	(10 * 1000)		/* 10 secs = default */
+#define	DEFAULT_PLUGIN	"test_decoding"
+
 /* Global Options */
 static char *outfile = NULL;
+static char *plugin = NULL;
 static int	verbose = 0;
 static int	noloop = 0;
-static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
-static int	fsync_interval = 10 * 1000; /* 10 sec = default */
+static int	standby_message_timeout = -1;
+static int	fsync_interval = -1;
 static XLogRecPtr startpos = InvalidXLogRecPtr;
 static bool do_create_slot = false;
 static bool slot_exists_ok = false;
@@ -45,7 +50,6 @@ static bool do_drop_slot = false;
 /* filled pairwise with option, value. value may be NULL */
 static char **options;
static size_t	noptions;

Re: [HACKERS] checkpointer continuous flushing

2015-10-21 Thread Andres Freund
On 2015-10-21 07:49:23 +0200, Fabien COELHO wrote:
>
> Hello Andres,
>
> >>>In my performance testing it showed that calling PerformFileFlush() only
> >>>at segment boundaries and in CheckpointWriteDelay() can lead to rather
> >>>spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
> >>>problematic because it only is triggered while on schedule, and not when
> >>>behind.
> >>
> >>When behind, the PerformFileFlush should be called on segment
> >>boundaries.
> >
> >That means it's flushing up to a gigabyte of data at once. Far too
> >much.
>
> Hmmm. I do not get it. There would not be gigabytes,

I said 'up to a gigabyte', not gigabytes. But it actually can be more
than one if you're unlucky.

> there would be as much as was written since the last sleep, about 100
> ms ago, which is not likely to be gigabytes?

In many cases we don't sleep all that frequently - after one 100ms sleep
we're already behind a lot. And even so, it's pretty easy to get into
checkpoint scenarios with ~500 mbyte/s as a writeout rate. Only issuing
a sync_file_range() 10 times for that is obviously problematic.

> >The implementation pretty always will go behind schedule for some
> >time. Since sync_file_range() doesn't flush in the foreground I don't
> >think it's important to do the flushing in concert with sleeping.
>
> For me it is important to avoid accumulating too large flushes, and that is
> the point of the call before sleeping.

I don't follow this argument. It's important to avoid large flushes,
therefore we potentially allow large flushes to accumulate?

> >>>My testing seems to show that just adding a limit of 32 buffers to
> >>>FileAsynchronousFlush() leads to markedly better results.
> >>
> >>Hmmm. 32 buffers means 256 KB, which is quite small.
> >
> >Why?
>
> Because the point of sorting is to generate sequential writes so that the
> HDD has a lot of aligned stuff to write without moving the head, and 32 is
> rather small for that.

A sync_file_range(SYNC_FILE_RANGE_WRITE) doesn't synchronously write
data back. It just puts it into the write queue. You can have merging
between IOs from either side.  But more importantly you can't merge that
many requests together anyway.

> >The aim is to not overwhelm the request queue - which is where the
> >coalescing is done. And usually that's rather small.
>
> That is an argument. How small, though? It seems to be 128 by default, so
> I'd rather have 128? Also, it can be changed, so maybe it should really be a
> guc?

I couldn't see any benefits above (and below) 32 on a 20 drive system,
so I doubt it's worthwhile. It's actually good for interactivity to
allow other requests into the queue concurrently - otherwise other
reads/writes will obviously have a higher latency...

> >If you flush much more sync_file_range starts to do work in the
> >foreground.
>
> Argh, too bad. I would have hoped that the OS would just deal with it in
> an asynchronous way,

It's even in the man page:
"Note  that  even  this  may  block if you attempt to write more than
request queue size."

> this is not a "fsync" call, just a flush advise.

sync_file_range isn't fadvise().

> Because it should be in shared buffers where pg needs it?

Huh? I'm just suggesting p = mmap(fd, offset, bytes);msync(p, bytes);munmap(p);
instead of sync_file_range().

> >>Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element
> >>set in most case is an improvement.
> >
> >Yes, it'll not matter that much in many cases. But I rather disliked the
> >NextBufferToWrite() implementation, especially that it walkes the array
> >multiple times. And I did see setups with ~15 tablespaces.
>
> ISTM that it is rather an argument for taking the tablespace into the
> sorting, not necessarily for a binary heap.

I don't understand your problem with that. The heap specific code is
small, smaller than your NextBufferToWrite() implementation?

ts_heap = binaryheap_allocate(nb_spaces,
  ts_progress_cmp,
  NULL);

spcContext = (FileFlushContext *)
palloc(sizeof(FileFlushContext) * nb_spaces);

for (i = 0; i < nb_spaces; i++)
{
TableSpaceCheckpointStatus *spc = &spcStatus[i];

spc->progress_slice = ((float8) num_to_write) / (float8) 
spc->num_to_write;

ResetFileFlushContext(&spcContext[i]);
spc->flushContext = &spcContext[i];

binaryheap_add_unordered(ts_heap, PointerGetDatum(&spcStatus[i]));
}

binaryheap_build(ts_heap);

and then

while (!binaryheap_empty(ts_heap))
{
TableSpaceCheckpointStatus *ts = (TableSpaceCheckpointStatus *)
DatumGetPointer(binaryheap_first(ts_heap));

...
ts->progress += ts->progress_slice;
ts->num_written++;
...
if (ts->num_written == ts->num_to_write)
{
...
binaryheap_remove_first(ts_heap);
}
else
{
/* update heap with the new progress */
  

Re: [HACKERS] Getting sorted data from foreign server

2015-10-21 Thread Ashutosh Bapat
Here's patch with the regression fixed.

On Wed, Oct 21, 2015 at 2:53 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

>
>
> On Fri, Oct 16, 2015 at 11:33 PM, Robert Haas 
> wrote:
>
>> On Thu, Oct 15, 2015 at 6:28 AM, Ashutosh Bapat
>>  wrote:
>> > Attached is the patch which takes care of above comments.
>>
>> I spent some time on this patch today.  But it's still not right.
>>
>> I've attached a new version which fixes a serious problem with your
>> last version - having postgresGetForeignPaths do the costing of the
>> sorted path itself instead of delegating that to
>> estimate_path_cost_size is wrong.  In your version, 10% increment gets
>> applied to the network transmission costs as well as the cost of
>> generating the tupes - but only when use_remote_estimate == false.  I
>> fixed this and did some cosmetic cleanup.
>>
>
> That's better.
>
>
>>
>> But you'll notice if you try this some of postgres_fdw's regression
>> tests fail.  This is rather mysterious:
>>
>> ***
>> *** 697,715 
>>Sort
>>  Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
>>  Sort Key: t1.c1
>> !->  Nested Loop Semi Join
>>Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
>> !  Join Filter: (t1.c3 = t2.c3)
>>->  Foreign Scan on public.ft1 t1
>>  Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7,
>> t1.c8
>>  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8
>> FROM "S 1"."T 1" WHERE (("C 1" < 20))
>> !  ->  Materialize
>>  Output: t2.c3
>> !->  Foreign Scan on public.ft2 t2
>>Output: t2.c3
>> !  Filter: (date(t2.c4) = '01-17-1970'::date)
>> !  Remote SQL: SELECT c3, c4 FROM "S 1"."T 1"
>> WHERE (("C 1" > 10))
>> ! (15 rows)
>>
>>   EXECUTE st2(10, 20);
>>c1 | c2 |  c3   |  c4  |c5
>>   | c6 | c7 | c8
>> --- 697,718 
>>Sort
>>  Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
>>  Sort Key: t1.c1
>> !->  Hash Join
>>Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
>> !  Hash Cond: (t1.c3 = t2.c3)
>>->  Foreign Scan on public.ft1 t1
>>  Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7,
>> t1.c8
>>  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8
>> FROM "S 1"."T 1" WHERE (("C 1" < 20))
>> !  ->  Hash
>>  Output: t2.c3
>> !->  HashAggregate
>>Output: t2.c3
>> !  Group Key: t2.c3
>> !  ->  Foreign Scan on public.ft2 t2
>> !Output: t2.c3
>> !Filter: (date(t2.c4) = '01-17-1970'::date)
>> !Remote SQL: SELECT c3, c4 FROM "S 1"."T
>> 1" WHERE (("C 1" > 10))
>> ! (18 rows)
>>
>> What I think is happening here is that the planner notices that
>> instead of doing a parameterized nestloop, it could pull down the data
>> already sorted from the remote side, cheaply unique-ify it by using
>> the ordering provided by the remote side, and then do a standard hash
>> join. That might well be a sensible approach, but the ORDER BY that
>> would make it correct doesn't show up in the Remote SQL.  I don't know
>> why that's happening, but it's not good.
>>
>>
> The unique-ifying is happening through HashAggregate, so we do not need an
> ORDER BY clause in the Remote SQL. The plan is correct.
>
> Careful examination of paths created revealed that the change in plan is
> result of our fuzzy path selection logic. Let me explain it using the cost
> values I got on my machine. Please note that the costs are described as a
> tuple (startup cost, total cost, number of rows, presence of pathkeys).
>
> With your patch, base relation paths have following costs:
>
> ft1 path without pathkeys - (100, 123.9, 20, no)
> ft1 path with pathkeys (100, 126.25, 20, yes)
> ft2 path without pathkeys (100, 143.645, 4, no)
> ft2 path without pathkeys with params (100.01, 125.365, 1, no)
>
> Notice the sorted path is only marginally costly (2.5%) compared to the
> non-sorted path for ft1. During the path creation process several nested
> loop paths are created, but except for two, they are too costly. The two
> nested loop paths of interest have costs (200, 268.755, 10, no) and (200,
> 271.105, 10, yes). The path with pathkeys kicks out the one without
> pathkeys because the later's costs are "fuzzily" equal and pathkeys give
> extra advantage. The "fuzzy" equality is effect of the marginally extra
> sorting cost. The nested loop path with pathkeys remains in the path list.
> Several hash join paths and merge join paths are created but are costlier
> than one particular hash path which has costs (243.677, 267.6624, 10, no).
> This hash path is not beaten by the nested loop path sin

Re: [HACKERS] Freeze avoidance of very large table.

2015-10-21 Thread Andres Freund
On 2015-10-20 20:35:31 -0400, Simon Riggs wrote:
> On 9 October 2015 at 15:20, Robert Haas  wrote:
> 
> > On Thu, Oct 8, 2015 at 1:52 PM, Andres Freund  wrote:
> > > I don't see the problem? I mean catversion will reliably tell you which
> > format the vm is in?
> >
> > Totally agreed.
> >
> 
> This isn't an agreement competition, its a cool look at what might cause
> problems for all of us.

Uh, we form rough consensuses all the time.

> If we want to avoid bugs in future then we'd better start acting like that
> is actually true in practice.

> Why should we wave away this concern? Will we wave away a concern next time
> you personally raise one? Bruce would have me believe that we added months
> onto 9.5 to improve robustness. So lets actually do that. Starting at the
> first opportunity.

Meh. Adding complexity definitely needs to be weighed against the
benefits. As pointed out e.g. by all the multixact issues you mentioned
upthread. In this case your argument for changing the name doesn't seem
to hold much water.

Greetings,

Andres Freund




Re: [HACKERS] a raft of parallelism-related bug fixes

2015-10-21 Thread Amit Langote
On Wednesday, 21 October 2015, Amit Kapila  wrote:

> On Tue, Oct 20, 2015 at 8:16 PM, Robert Haas  > wrote:
> >
> > On Sat, Oct 17, 2015 at 6:17 PM, Robert Haas  > wrote:
> > > It's good to have your perspective on how this can be improved, and
> > > I'm definitely willing to write more documentation.  Any lack in that
> > > area is probably due to being too close to the subject area, having
> > > spent several years on parallelism in general, and 200+ emails on
> > > parallel sequential scan in particular.  Your point about the lack of
> > > a good header file comment for execParallel.c is a good one, and I'll
> > > rectify that next week.
> >
> > Here is a patch to add a hopefully-useful file header comment to
> > execParallel.c.  I included one for nodeGather.c as well, which seems
> > to be contrary to previous practice, but actually it seems like
> > previous practice is not the greatest: surely it's not self-evident
> > what all of the executor nodes do.
> >
>
> + * any ParamListInfo associated witih the query, buffer usage info, and
> + * the actual plan to be passed down to the worker.
>
> typo 'witih'.
>
> + * return the results.  Therefore, a plan used with a single-copy Gather
> + * node not be parallel-aware.
>
> "node not" seems to be incomplete.
>

... node *need* not be parallel aware?

Thanks,
Amit


Re: [HACKERS] bugs and bug tracking

2015-10-21 Thread Nathan Wagner
On Tue, Oct 20, 2015 at 10:39:55AM -0700, Joshua D. Drake wrote:
> So where are we at on this?

Well, I can't speak to where we are, but my system is up, running, and
seems to work well. It even attracts a few visitors.

I have been meaning to write a triage interface, but I have been stuck
doing postgis work for an anthropology paper I am working on for the
last week.  I hope to be able to take a break from that tomorrow and
through the weekend.  I have read the various comments, and will also do
some simple scanning for phrases in emails and commit messages
indicating status changes which I will reflect on the main web page.
Once I write that code, I will email the list with what I have done.  If
people want to use it, or criticize it, they would be welcome to do so.
If not, well, the interface is useful for me in any event, so I will
probably maintain it for the foreseeable future.

https://granicus.if.org/pgbugs/ for anyone who hasn't and wants to take a
look.

-- 
nw




Re: [HACKERS] a raft of parallelism-related bug fixes

2015-10-21 Thread Amit Kapila
On Tue, Oct 20, 2015 at 8:16 PM, Robert Haas  wrote:
>
> On Sat, Oct 17, 2015 at 6:17 PM, Robert Haas 
wrote:
> > It's good to have your perspective on how this can be improved, and
> > I'm definitely willing to write more documentation.  Any lack in that
> > area is probably due to being too close to the subject area, having
> > spent several years on parallelism in general, and 200+ emails on
> > parallel sequential scan in particular.  Your point about the lack of
> > a good header file comment for execParallel.c is a good one, and I'll
> > rectify that next week.
>
> Here is a patch to add a hopefully-useful file header comment to
> execParallel.c.  I included one for nodeGather.c as well, which seems
> to be contrary to previous practice, but actually it seems like
> previous practice is not the greatest: surely it's not self-evident
> what all of the executor nodes do.
>

+ * any ParamListInfo associated witih the query, buffer usage info, and
+ * the actual plan to be passed down to the worker.

typo 'witih'.

+ * return the results.  Therefore, a plan used with a single-copy Gather
+ * node not be parallel-aware.

"node not" seems to be incomplete.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [Proposal] Table partition + join pushdown

2015-10-21 Thread Taiki Kondo
Hello, KaiGai-san and Horiguchi-san.

I created v2 patch. Please find attached.
I believe this patch fixes most of the issues mentioned by
Horiguchi-san, except naming.

In this v2 patch, the scan node that is originally the inner relation of
the Join node must be a SeqScan (or SampleScan). This limitation is
due to the implementation of try_join_pushdown(), which copies Path nodes
to attach new filtering conditions converted from CHECK() constraints.
It uses copyObject() for this purpose, so I must implement copy functions
for scan Path nodes like IndexPath, BitmapHeapPath, TidPath and so on.


By the way, let me introduce the performance of this feature.
Here are the results I tested in my environment.
These results were obtained by running "pushdown_test.v1.large.sql"
in an environment with "work_mem" set to "1536kB".
(This file is also attached in this mail.)

[Normal]
 QUERY PLAN
-
 Hash Join  (cost=1851.02..14638.11 rows=34 width=20) (actual 
time=88.188..453.926 rows=22 loops=1)
   Hash Cond: (check_test_div.id = inner_t.id)
   ->  Append  (cost=0.00..4911.03 rows=34 width=20) (actual 
time=0.089..133.456 rows=33 loops=1)
 ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1 width=20) 
(actual time=0.003..0.003 rows=0 loops=1)
 ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=11 
width=20) (actual time=0.085..40.741 rows=11 loops=1)
 ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=11 
width=20) (actual time=0.023..29.213 rows=11 loops=1)
 ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=11 
width=20) (actual time=0.021..28.592 rows=11 loops=1)
   ->  Hash  (cost=866.01..866.01 rows=60001 width=8) (actual 
time=87.970..87.970 rows=60001 loops=1)
 Buckets: 32768  Batches: 2  Memory Usage: 1446kB
 ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8) 
(actual time=0.030..39.133 rows=60001 loops=1)
 Planning time: 0.867 ms
 Execution time: 470.269 ms
(12 rows)

[With this feature]
 QUERY PLAN
-
 Append  (cost=0.01..10651.37 rows=34 width=20) (actual 
time=55.548..377.615 rows=22 loops=1)
   ->  Hash Join  (cost=0.01..1091.04 rows=1 width=20) (actual 
time=0.017..0.017 rows=0 loops=1)
 Hash Cond: (inner_t.id = check_test_div.id)
 ->  Seq Scan on inner_t  (cost=0.00..866.01 rows=60001 width=8) (never 
executed)
 ->  Hash  (cost=0.00..0.00 rows=1 width=20) (actual time=0.003..0.003 
rows=0 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 8kB
   ->  Seq Scan on check_test_div  (cost=0.00..0.00 rows=1 
width=20) (actual time=0.002..0.002 rows=0 loops=1)
   ->  Hash Join  (cost=1169.76..3186.78 rows=11 width=20) (actual 
time=55.530..149.205 rows=11 loops=1)
 Hash Cond: (check_test_div_0.id = inner_t.id)
 ->  Seq Scan on check_test_div_0  (cost=0.00..1637.01 rows=11 
width=20) (actual time=0.058..34.268 rows=11 loops=1)
 ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual 
time=55.453..55.453 rows=20001 loops=1)
   Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  
Memory Usage: 1038kB
   ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) 
(actual time=0.031..43.590 rows=20001 loops=1)
 Filter: ((id % 3) = 0)
 Rows Removed by Filter: 4
   ->  Hash Join  (cost=1169.76..3186.78 rows=11 width=20) (actual 
time=27.942..97.582 rows=6 loops=1)
 Hash Cond: (check_test_div_1.id = inner_t.id)
 ->  Seq Scan on check_test_div_1  (cost=0.00..1637.01 rows=11 
width=20) (actual time=0.030..25.514 rows=11 loops=1)
 ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual 
time=27.890..27.890 rows=2 loops=1)
   Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)  
Memory Usage: 1038kB
   ->  Seq Scan on inner_t  (cost=0.00..1166.01 rows=300 width=8) 
(actual time=0.014..21.688 rows=2 loops=1)
 Filter: ((id % 3) = 1)
 Rows Removed by Filter: 40001
   ->  Hash Join  (cost=1169.76..3186.78 rows=11 width=20) (actual 
time=27.651..97.755 rows=5 loops=1)
 Hash Cond: (check_test_div_2.id = inner_t.id)
 ->  Seq Scan on check_test_div_2  (cost=0.00..1637.01 rows=11 
width=20) (actual time=0.026..25.620 rows=11 loops=1)
 ->  Hash  (cost=1166.01..1166.01 rows=300 width=8) (actual 
time=27.599..27.599 rows=2 loops=1)
   Buckets: 32768 (originally 1024)  Batches: 1 (originally 1)

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-10-21 Thread Artur Zakirov

21.10.2015 01:37, Jim Nasby wrote:

On 10/20/15 9:00 AM, Artur Zakirov wrote:

The internal representation of the dictionary in PostgreSQL doesn't
impose overly strict limits on the number of affix rules. There is a
flagval array, whose size must be increased from 256 to 65000.


Is that per dictionary entry, fixed at 64k? That seems pretty excessive,
if that's the case...


This is per dictionary only. The flagval array is shared by the whole 
dictionary; it is not stored for every dictionary word.


There is also the flag field of the AFFIX structure, whose size must be 
increased from 8 bits to 16 bits. This structure is used for every affix 
in the affix file.
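
To illustrate the memory question above, here is a small sketch (names are illustrative, not the actual identifiers in PostgreSQL's spell code): one flagval lookup array exists per *dictionary*, so growing it from 256 entries to 65000 is a one-time cost per loaded dictionary, independent of the number of words.

```python
# Sketch with hypothetical names: one flagval array per dictionary,
# mapping an affix flag value to its affix class.
FLAG_BYTE_MAX = 256    # old limit: a flag fits in one byte
FLAG_NUM_MAX = 65000   # Hunspell numeric ("FLAG num") values go up to 65000

def flagval_bytes(max_flag: int, entry_size: int = 1) -> int:
    """Memory used by the per-dictionary flagval array."""
    return max_flag * entry_size

# ~64 KB once per loaded dictionary, not per dictionary word
overhead = flagval_bytes(FLAG_NUM_MAX) - flagval_bytes(FLAG_BYTE_MAX)
```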


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-10-21 Thread Etsuro Fujita

On 2015/10/21 13:34, Kouhei Kaigai wrote:

On 2015/10/20 13:11, Etsuro Fujita wrote:

On 2015/10/20 5:34, Robert Haas wrote:

No.  You just need to populate fdw_recheck_quals correctly, same as
for the scan case.



As I said yesterday, that opinion of mine was completely wrong.  Sorry for
the incorrectness.  Let me explain a little bit more.  I still think
that even if ROW_MARK_COPY is in use, we would need to locally rejoin
the tuples populated from the whole-row images for the foreign tables
involved in a remote join, using a secondary plan.  Consider, e.g.,

SELECT localtab.*, ft2 from localtab, ft1, ft2
   WHERE ft1.x = ft2.x AND ft1.y = localtab.y FOR UPDATE

In this case, since the output of the foreign join would not include any
ft1 columns, I don't think we could do the same thing as for the scan
case, even with fdw_recheck_quals populated correctly.



As an aside, could you explain the reason why you think so? It is a
significant point in the discussion, if we want to reach a consensus.



On the other hand, the joined tuple we're talking about in this context
is a tuple prior to projection, formed according to the fdw_scan_tlist.
So it contains all the necessary information to run scan/join qualifiers
against the joined tuple. It is not affected by the target list of the
user query.


After research into the planner, I noticed that I was still wrong; IIUC, 
the planner requires that the output of the foreign join include the 
column ft1.y even in that case.  (I don't understand why the planner 
requires that.)  So, as Robert mentioned, the clause ft1.y = localtab.y 
could be rechecked during an EPQ recheck, provided fdw_recheck_quals is 
populated correctly.  Sorry again for the incorrectness.



Even though I think the approach with joined-tuple reconstruction is a
reasonable solution here, that is not a fair reason to discount
Robert's suggestion.


Agreed.


Also, please don't mix up "what we do" and "how we do".

It is a "what we do" question to discuss which format of tuples shall be
returned to the core backend from the extension, because it determines
the role of the interface. If our consensus is to return a joined tuple,
we need to design the interface according to that consensus.

On the other hand, it is a "how we do" question whether we should
require every FDW/CSP extension to have an alternative plan, or not.
Once we get a consensus in the "what we do" discussion, there are
various options to satisfy it; however, we cannot prioritize "how we
do" without "what we do".


Agreed.

Best regards,
Etsuro Fujita





Re: [HACKERS] Getting sorted data from foreign server

2015-10-21 Thread Ashutosh Bapat
On Fri, Oct 16, 2015 at 11:33 PM, Robert Haas  wrote:

> On Thu, Oct 15, 2015 at 6:28 AM, Ashutosh Bapat
>  wrote:
> > Attached is the patch which takes care of above comments.
>
> I spent some time on this patch today.  But it's still not right.
>
> I've attached a new version which fixes a serious problem with your
> last version - having postgresGetForeignPaths do the costing of the
> sorted path itself instead of delegating that to
> estimate_path_cost_size is wrong.  In your version, 10% increment gets
> applied to the network transmission costs as well as the cost of
> generating the tuples - but only when use_remote_estimate == false.  I
> fixed this and did some cosmetic cleanup.
>

That's better.


>
> But you'll notice if you try this some of postgres_fdw's regression
> tests fail.  This is rather mysterious:
>
> ***
> *** 697,715 
>Sort
>  Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
>  Sort Key: t1.c1
> !->  Nested Loop Semi Join
>Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
> !  Join Filter: (t1.c3 = t2.c3)
>->  Foreign Scan on public.ft1 t1
>  Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7,
> t1.c8
>  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8
> FROM "S 1"."T 1" WHERE (("C 1" < 20))
> !  ->  Materialize
>  Output: t2.c3
> !->  Foreign Scan on public.ft2 t2
>Output: t2.c3
> !  Filter: (date(t2.c4) = '01-17-1970'::date)
> !  Remote SQL: SELECT c3, c4 FROM "S 1"."T 1"
> WHERE (("C 1" > 10))
> ! (15 rows)
>
>   EXECUTE st2(10, 20);
>c1 | c2 |  c3   |  c4  |c5
>   | c6 | c7 | c8
> --- 697,718 
>Sort
>  Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
>  Sort Key: t1.c1
> !->  Hash Join
>Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
> !  Hash Cond: (t1.c3 = t2.c3)
>->  Foreign Scan on public.ft1 t1
>  Output: t1.c1, t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7,
> t1.c8
>  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8
> FROM "S 1"."T 1" WHERE (("C 1" < 20))
> !  ->  Hash
>  Output: t2.c3
> !->  HashAggregate
>Output: t2.c3
> !  Group Key: t2.c3
> !  ->  Foreign Scan on public.ft2 t2
> !Output: t2.c3
> !Filter: (date(t2.c4) = '01-17-1970'::date)
> !Remote SQL: SELECT c3, c4 FROM "S 1"."T
> 1" WHERE (("C 1" > 10))
> ! (18 rows)
>
> What I think is happening here is that the planner notices that
> instead of doing a parameterized nestloop, it could pull down the data
> already sorted from the remote side, cheaply unique-ify it by using
> the ordering provided by the remote side, and then do a standard hash
> join. That might well be a sensible approach, but the ORDER BY that
> would make it correct doesn't show up in the Remote SQL.  I don't know
> why that's happening, but it's not good.
>
>
The unique-ifying is happening through HashAggregate, so we do not need
an ORDER BY clause in the Remote SQL. So the plan is correct.

Careful examination of paths created revealed that the change in plan is
result of our fuzzy path selection logic. Let me explain it using the cost
values I got on my machine. Please note that the costs are described as a
tuple (startup cost, total cost, number of rows, presence of pathkeys).

With your patch, base relation paths have following costs:

ft1 path without pathkeys - (100, 123.9, 20, no)
ft1 path with pathkeys (100, 126.25, 20, yes)
ft2 path without pathkeys (100, 143.645, 4, no)
ft2 path without pathkeys with params (100.01, 125.365, 1, no)

Notice that the sorted path is only marginally costlier (2.5%) than the
non-sorted path for ft1. During path creation several nested
loop paths are created, but except for two, they are too costly. The two
nested loop paths of interest have costs (200, 268.755, 10, no) and (200,
271.105, 10, yes). The path with pathkeys kicks out the one without
pathkeys because the latter's costs are "fuzzily" equal and pathkeys give
an extra advantage. The "fuzzy" equality is an effect of the marginally
higher sorting cost. The nested loop path with pathkeys remains in the path list.
Several hash join paths and merge join paths are created but are costlier
than one particular hash path which has costs (243.677, 267.6624, 10, no).
This hash path is not beaten by the nested loop path because of its lower
total cost (which is beyond the fuzzy margin), and it ultimately gets
chosen by the planner to perform ft1 join ft2.

Interestingly, if the nested loop path with pathkeys had not kicked out
the one without pathkeys, the nested loop path would have emerged as the winner.
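
The fuzzy selection logic described above can be sketched as follows. This is a deliberate simplification of PostgreSQL's add_path()/compare_path_costs_fuzzily() — the real code also weighs startup cost, row counts, and parameterization — but it reproduces the decisions with the cost numbers quoted above:

```python
FUZZ = 1.01  # costs within ~1% of each other are treated as equal

def fuzzily_cheaper(cost_a: float, cost_b: float) -> bool:
    """True if cost_a beats cost_b by more than the fuzz margin."""
    return cost_a * FUZZ < cost_b

def keep_better(p1, p2):
    """Pick between two paths, each given as (startup, total, has_pathkeys)."""
    _, total1, keys1 = p1
    _, total2, keys2 = p2
    if fuzzily_cheaper(total1, total2):
        return p1
    if fuzzily_cheaper(total2, total1):
        return p2
    # Totals are fuzzily equal: pathkeys give the extra advantage.
    return p1 if keys1 and not keys2 else p2

# Sorted nested loop survives against the unsorted one: totals are
# fuzzily equal, but it provides pathkeys ...
nl_plain  = (200.0, 268.755, False)
nl_sorted = (200.0, 271.105, True)
assert keep_better(nl_plain, nl_sorted) == nl_sorted

# ... while the hash join's total cost is lower beyond the fuzzy margin.
hash_join = (243.677, 267.6624, False)
assert keep_better(hash_join, nl_sorted) == hash_join
```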

[HACKERS] Patch (2): Implement failover on libpq connect level.

2015-10-21 Thread Victor Wagner
On 2015.10.14 at 13:41:51 +0300, Victor Wagner wrote:

> 
> Attached patch which implements client library failover and
> loadbalancing as was described in the proposal
> <20150818041850.ga5...@wagner.pp.ru>.
> 

I'm sending an improved version of the patch. As Olexander Shulgin noted,
the previous version of the patch lacked support for service files.

Now support for service files is implemented and multiple host
statements in the service file are allowed.
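
The connection behavior the patch implements — sequential or random host order, skipping hosts in recovery unless readonly is set, and cycling through the list until a failover timeout expires — can be sketched as follows. The function names and connection stubs here are hypothetical, not libpq API:

```python
import random
import time

def host_order(hosts, hostorder="sequential"):
    """Return candidate hosts in the requested order."""
    if hostorder == "random":
        hosts = list(hosts)
        random.shuffle(hosts)
    return hosts

def connect_with_failover(hosts, try_connect, is_in_recovery,
                          hostorder="sequential", readonly=False,
                          failover_timeout=None):
    """Try each host; skip hosts in recovery unless readonly is set.
    With failover_timeout, keep cycling the list until it expires;
    otherwise make a single pass."""
    deadline = (None if failover_timeout is None
                else time.monotonic() + failover_timeout)
    while True:
        for host in host_order(hosts, hostorder):
            conn = try_connect(host)
            if conn is None:
                continue                      # connection failed
            if not readonly and is_in_recovery(conn):
                continue                      # standby: not acceptable
            return conn
        if deadline is None or time.monotonic() >= deadline:
            return None                       # every host failed
```

For example, with a stub `try_connect` that only succeeds for one node, the loop returns that node's connection and returns None when no host is usable.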
-- 
   Victor Wagner 
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 0ee018e..ec12fce 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -792,7 +792,7 @@ host=localhost port=5432 dbname=mydb connect_timeout=10

The general form for a connection URI is:
 
-postgresql://[user[:password]@][netloc][:port][/dbname][?param1=value1&...]
+postgresql://[user[:password]@][netloc][:port][,netloc[:port]...][/dbname][?param1=value1&...]
 

 
@@ -809,6 +809,7 @@ postgresql://localhost/mydb
 postgresql://user@localhost
 postgresql://user:secret@localhost
 postgresql://other@localhost/otherdb?connect_timeout=10&application_name=myapp
+postgresql://node1,node2:5433,node3:4432,node4/mydb?hostorder=random&readonly=1
 
 Components of the hierarchical part of the URI can also
 be given as parameters.  For example:
@@ -831,7 +832,9 @@ postgresql:///mydb?host=localhost&port=5433

 For improved compatibility with JDBC connection URIs,
 instances of parameter ssl=true are translated into
-sslmode=require.
+sslmode=require and
+loadBalanceHosts=true  into
+hostorder=random.

 

@@ -841,6 +844,10 @@ postgresql:///mydb?host=localhost&port=5433
 postgresql://[2001:db8::1234]/database
 

+   
+   There can be several host specifications, each optionally accompanied
+   by a port, separated by commas.
+   
 

 The host component is interpreted as described for the parameter PostgreSQL was built). On machines without
 Unix-domain sockets, the default is to connect to localhost.

+   
+   There can be more than one host parameter in
+   the connect string. In this case these hosts are considered
+   alternate entries into the same database, and if connecting to the
+   first one fails, the library tries the second one, and so on. This
+   can be used for a high-availability cluster or for load balancing. See
+parameter.
+   
+   
+   A network host name can be accompanied by a port number, separated
+   by a colon. If so, this port number is used only when connecting to
+   this host. If there is no port number, the port specified in the
+    parameter is used.
+   
   
  
 
@@ -942,8 +963,44 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname


   
-
+
+  
+  hostorder
+  
+  
+  Specifies how to choose a host from the list of alternate hosts
+  specified in the  parameter.
+  
+  
+  If the value of this parameter is sequential (the
+  default), the library connects to the hosts in the order they are
+  specified, and tries the second one only if the connection to the
+  first fails.
+  
+  
+  If the value is random, the host to connect to is randomly
+  picked from the list. This allows balancing the load between several
+  cluster nodes. However, currently PostgreSQL doesn't support
+  multimaster clusters. So, without use of third-party products,
+  only read-only connections can take advantage of the
+  load balancing. See 
+  
+  
+  
+  
+  readonly
+  
+  
+  If this parameter is 0 (the default), upon successful connection the
+  library checks whether the host is in recovery state, and if so,
+  tries the next host in the connect string. If this parameter is
+  non-zero, connections to warm standby nodes are considered
+  successful.
+  
+  
+  
   
+
port


@@ -985,7 +1042,6 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
   
  
-
  
   connect_timeout
   
@@ -996,7 +1052,27 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
   
  
-
+ 
+ failover_timeout
+ 
+ 
+ Maximum time to cyclically retry all the hosts in the connect string
+ (as a decimal integer number of seconds). If not specified,
+ the hosts are tried just once.
+ 
+ 
+ In a replicating cluster, if the master node fails, it might
+ take some time to promote one of the standby nodes to be the new
+ master. So clients which notice that connecting to the master fails
+ might give up reestablishing a connection before the new master
+ becomes available.
+ 
+ 
+ Setting this parameter to a reasonable time makes the library retry
+ all the hosts cyclically until the new master appears.
+ 
+ 
+ 
  
   client_encoding
   
@@ -7161,6 +7237,18 @@ user=admin
 

Re: [HACKERS] [PATCH v3] GSSAPI encryption support

2015-10-21 Thread Michael Paquier
Robbie,

On Wed, Oct 21, 2015 at 3:54 PM, Michael Paquier
 wrote:
> On Tue, Oct 20, 2015 at 3:01 AM, Robbie Harwood wrote:
>> Stephen Frost  writes:
>>> psql: lost synchronization with server: got message type "S", length 22
>>
>> which unfortunately could be a great many things.  I've said this a
>> couple times now, but I really do need more information - a traffic
>> dump, a list of commands that were run, etc.; unfortunately, the surface
>> here is pretty large, and while I totally am willing to believe there
>> are bugs in the code I've written, I do not yet see them.
>
> --- a/src/interfaces/libpq/fe-protocol3.c
> +++ b/src/interfaces/libpq/fe-protocol3.c
> @@ -129,6 +129,58 @@ pqParseInput3(PGconn *conn)
> return;
> }
>
> +#ifdef ENABLE_GSS
> +   /* We want to be ready in both IDLE and BUSY states
> for encryption */
> +   if (id == 'g' && !conn->gss_disable_enc && conn->gctx)
> +   {
> +   ssize_t encEnd, next;
> [...]
> +   }
> +   else if (!conn->gss_disable_enc && conn->gss_auth_done &&
> +!conn->gss_decrypted_cur && id != 'E')
> +   /* This could be a sync error, so let's handle
> it as such. */
> +   handleSyncLoss(conn, id, msgLength);
> +#endif
>
> Hm. The out-of-sync error I am seeing in my environment is caused by
> this block when parsing 'g' messages coming from the backend that are
> considered as being GSSAPI-encrypted messages. I am still looking at
> that...

@@ -604,6 +604,11 @@ pqPutMsgEnd(PGconn *conn)
memcpy(conn->outBuffer + conn->outMsgStart, &msgLen, 4);
}

+#ifdef ENABLE_GSS
+   if (pggss_encrypt(conn) < 0)
+   return EOF;
+#endif

@@ -1528,10 +1541,20 @@ socket_putmessage(char msgtype, const char *s,
size_t len)
if (internal_putbytes(s, len))
goto fail;
PqCommBusy = false;
+#ifdef ENABLE_GSS
+   /* if we're GSSAPI encrypting, s was allocated in be_gss_encrypt */
+   if (msgtype == 'g')
+   pfree((char *)s);
+#endif

Looking at this patch in more detail... Why is it necessary to wrap
all the encrypted messages in a dedicated message type 'g'? Can't
we rely on the context on the client and backend, which should be set up
after authentication, to perform the encryption and decryption
operations? This patch enforces the message type in
socket_putmessage for the backend and pggss_encrypt/pqPutMsgEnd for the
frontend; this just feels wrong, and I think the patch could be
greatly simplified. That includes the crafting added in fe-protocol3.c,
which should IMO be reserved for messages received from the
backend and should not be made aware of any protocol-level logic.
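
For reference, the 'g' wrapper under discussion frames each encrypted payload as an ordinary v3 protocol message: a one-byte type, a four-byte big-endian length that counts itself, then the ciphertext. A minimal sketch of that framing as I read the patch (the encryption itself, done by GSSAPI wrap/unwrap, is out of scope here):

```python
import struct

def frame_gss(ciphertext: bytes) -> bytes:
    """Wrap an already-encrypted payload in a v3 'g' message:
    type byte + int32 length (including itself) + ciphertext."""
    return b"g" + struct.pack("!I", 4 + len(ciphertext)) + ciphertext

def unframe_gss(msg: bytes) -> bytes:
    """Strip the 'g' envelope, returning the ciphertext for decryption."""
    if msg[0:1] != b"g":
        raise ValueError("not a GSSAPI-wrapped message")
    (length,) = struct.unpack("!I", msg[1:5])
    return msg[5:1 + length]  # length includes its own 4 bytes
```

The alternative Michael suggests would skip this envelope entirely and encrypt/decrypt the whole stream based on the established security context.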
-- 
Michael

