Re: [HACKERS] PATCH: Batch/pipelining support for libpq

2017-02-20 Thread Michael Paquier
On Fri, Feb 17, 2017 at 2:17 PM, Prabakaran, Vaishnavi
 wrote:
> On 22 November 2016 at 18:32, Craig Ringer wrote:
> I am interested in this patch and have addressed the various comments below 
> from reviewers. And, I have separated out the code and test module into 2 
> patches. So, if needed, the test patch can be enhanced further, while the code 
> patch can be committed.

Cool. Nice to see a new version of this patch.
(You may want to let your replies breathe a bit, for example by adding
an extra newline after the paragraph you are answering.)

>>Renaming and refactoring new APIs
>> +PQisInBatchMode   172
>>+PQqueriesInBatch  173
>>+PQbeginBatchMode  174
>>+PQendBatchMode175
>>+PQsendEndBatch176
>>+PQgetNextQuery177
>>+PQbatchIsAborted  178
>>This set of routines is a bit inconsistent. Why not just prefixing them with 
>>PQbatch? Like that for example:
>> PQbatchStatus(): consists of disabled/inactive/none, active, error. This 
>> covers both PQbatchIsAborted() and PQisInBatchMode().
>>PQbatchBegin()
>>PQbatchEnd()
>>PQbatchQueueSync() or PQbatchQueueFlush, same as PQsendEndBatch() to add and 
>>process a sync message into the queue.
>
> Renamed and modified batch status APIs as below
> PQisInBatchMode & PQbatchIsAborted ==> PQbatchStatus
> PQqueriesInBatch ==> PQbatchQueueCount
> PQbeginBatchMode ==> PQbatchBegin
> PQendBatchMode ==> PQbatchEnd
> PQsendEndBatch ==> PQbatchQueueSync
> PQgetNextQuery ==> PQbatchQueueProcess

Thanks. This seems like a much cleaner interface to me. Others may have a
different opinion, so let's see if there are any.

>>PQbatchQueueCount(): returns N>0 if there are N entries, 0 if empty,-1 on 
>>failure
>>PQbatchQueueProcess(): returns 1 if process can begin, 0 if not, -1 on 
>>failure (OOM)
>
> I think it is still OK to keep the current behaviour, like other routines 
> present in the same file, e.g. "PQsendPrepare" and "PQsendQueryGuts".
>
>>PQqueriesInBatch() (new name: PQbatchQueueCount) doesn't work as documented.
>>It says:
>>"Returns the number of queries still in the queue for this batch"
>>but in fact it's implemented as a Boolean.
>
> Modified the logic to count the number of entries in the pending queue and 
> return the count
>
>>The changes in src/test/examples/ are not necessary anymore. You moved all 
>>the tests to test_libpq (for the best actually).
> Removed these unnecessary changes from the src/test/examples folder and 
> corrected the path mentioned in the comments section of testlibpqbatch.c

>>But with the libpq batch API, maybe this could be modernized
>>with meta-commands like this:
>>\startbatch
>>...
>>\endbatch
>
> I think it is a separate patch candidate.

Definitely agreed on that. Let's not complicate things any further.

>> It may be a good idea to check for PG_PROTOCOL_MAJOR < 3 and issue an error 
>> for the new routines.
>
> All the APIs which support asynchronous command processing can be executed 
> only if PG_PROTOCOL_MAJOR >= 3. So, adding this check to the new routines is 
> not really needed.

OK. Let's not complicate the patch more than necessary.

After another look at the patch, here are some comments.

In the docs when listing any of the PQBATCH_MODE_* variables, you
should have a markup <literal> around them. It would be cleaner to
make a list of the potential values that can be returned by
PQbatchStatus() using <variablelist>.

+   while (queue != NULL)
+   {
+   PGcommandQueueEntry *prev = queue;
+
+   queue = queue->next;
+   if (prev->query)
+   free(prev->query);
+   free(prev);
+   }
+   conn->cmd_queue_recycle = NULL;
This could be useful as a separate routine.

+/* PQmakePipelinedCommand
+ * Get a new command queue entry, allocating it if required. Doesn't add it to
+ * the tail of the queue yet, use PQappendPipelinedCommand once the command has
+ * been written for that. If a command fails once it's called this, it should
+ * use PQrecyclePipelinedCommand to put it on the freelist or release it.
+ *
This comment block is not project-like. Please use an empty line as
first line with only "/*". The same comment applies to a bunch of the
routines introduced by this patch.

Not sure there is much point in having PQstartProcessingNewQuery. It
only does two things in two places, so that's not worth the loss in
readability.

+   if (conn->batch_status != PQBATCH_MODE_OFF)
+   {
+   pipeCmd = PQmakePipelinedCommand(conn);
+
+   if (pipeCmd == NULL)
+   return 0;   /* error msg already set */
+
+   last_query = &pipeCmd->query;
+   queryclass = &pipeCmd->queryclass;
+   }
+   else
+   {
+   last_query = &conn->last_query;
+   queryclass = &conn->queryclass;
+   }
This setup should happen further down.

conn->in_batch is undefined, causing the patch to fail to compile. And
actually you should not need it.

-   conn->asyncStatus = PGASYNC_READY;
+   conn->asyncStatus = PGASYNC_READY_MORE;
return;
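For readers following the API rename above, the proposed routines would presumably be used along these lines. This is an illustrative, non-compilable sketch — the PQbatch* functions exist only in the patch under review, and argument lists are elided:

```
/* Illustration only: the renamed batch API as proposed in this thread. */
PGconn *conn = PQconnectdb(conninfo);

PQbatchBegin(conn);                        /* enter batch mode */
PQsendQueryParams(conn, "SELECT 1", ...);  /* queue queries without waiting */
PQsendQueryParams(conn, "SELECT 2", ...);
PQbatchQueueSync(conn);                    /* append a Sync and flush the batch */

while (PQbatchQueueProcess(conn))          /* advance to each queued result */
{
    PGresult *res;

    while ((res = PQgetResult(conn)) != NULL)
        PQclear(res);
}
PQbatchEnd(conn);                          /* leave batch mode */
```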

Re: [HACKERS] dropping partitioned tables without CASCADE

2017-02-20 Thread Amit Langote
On 2017/02/21 15:35, Amit Langote wrote:
>>  drop table list_parted cascade;
>> -NOTICE:  drop cascades to 3 other objects
>> -DETAIL:  drop cascades to table part_ab_cd
>> probably we should remove cascade from there, unless you are testing CASCADE
>> functionality. Similarly for other blocks like
>>  drop table range_parted cascade;

Oops, failed to address this in the last email.  Updated patches attached.

Thanks,
Amit
>From 0dc550d501805bcea81da9b2a06a39430427927c Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 16 Feb 2017 15:56:44 +0900
Subject: [PATCH 1/2] Allow dropping partitioned table without CASCADE

Currently, a normal dependency is created between an inheritance
parent and child when creating the child.  That means one must
specify CASCADE to drop the parent table if a child table exists.
When creating partitions as inheritance children, create an auto
dependency instead, so that partitions are dropped automatically
when the parent is dropped, i.e., without specifying CASCADE.
---
 src/backend/commands/tablecmds.c   | 26 ++
 src/test/regress/expected/alter_table.out  | 10 --
 src/test/regress/expected/create_table.out |  9 ++---
 src/test/regress/expected/inherit.out  | 22 ++
 src/test/regress/expected/insert.out   |  7 ++-
 src/test/regress/expected/update.out   |  7 +--
 src/test/regress/sql/alter_table.sql   | 10 --
 src/test/regress/sql/create_table.sql  |  9 ++---
 src/test/regress/sql/inherit.sql   |  4 ++--
 src/test/regress/sql/insert.sql|  7 ++-
 src/test/regress/sql/update.sql|  2 +-
 11 files changed, 40 insertions(+), 73 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3cea220421..31b50ad77f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -289,9 +289,11 @@ static List *MergeAttributes(List *schema, List *supers, char relpersistence,
 static bool MergeCheckConstraint(List *constraints, char *name, Node *expr);
 static void MergeAttributesIntoExisting(Relation child_rel, Relation parent_rel);
 static void MergeConstraintsIntoExisting(Relation child_rel, Relation parent_rel);
-static void StoreCatalogInheritance(Oid relationId, List *supers);
+static void StoreCatalogInheritance(Oid relationId, List *supers,
+		bool child_is_partition);
 static void StoreCatalogInheritance1(Oid relationId, Oid parentOid,
-		 int16 seqNumber, Relation inhRelation);
+		 int16 seqNumber, Relation inhRelation,
+		 bool child_is_partition);
 static int	findAttrByName(const char *attributeName, List *schema);
 static void AlterIndexNamespaces(Relation classRel, Relation rel,
    Oid oldNspOid, Oid newNspOid, ObjectAddresses *objsMoved);
@@ -725,7 +727,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId,
 		  typaddress);
 
 	/* Store inheritance information for new rel. */
-	StoreCatalogInheritance(relationId, inheritOids);
+	StoreCatalogInheritance(relationId, inheritOids, stmt->partbound != NULL);
 
 	/*
 	 * We must bump the command counter to make the newly-created relation
@@ -2240,7 +2242,8 @@ MergeCheckConstraint(List *constraints, char *name, Node *expr)
  * supers is a list of the OIDs of the new relation's direct ancestors.
  */
 static void
-StoreCatalogInheritance(Oid relationId, List *supers)
+StoreCatalogInheritance(Oid relationId, List *supers,
+		bool child_is_partition)
 {
 	Relation	relation;
 	int16		seqNumber;
@@ -2270,7 +2273,8 @@ StoreCatalogInheritance(Oid relationId, List *supers)
 	{
 		Oid			parentOid = lfirst_oid(entry);
 
-		StoreCatalogInheritance1(relationId, parentOid, seqNumber, relation);
+		StoreCatalogInheritance1(relationId, parentOid, seqNumber, relation,
+ child_is_partition);
 		seqNumber++;
 	}
 
@@ -2283,7 +2287,8 @@ StoreCatalogInheritance(Oid relationId, List *supers)
  */
 static void
 StoreCatalogInheritance1(Oid relationId, Oid parentOid,
-		 int16 seqNumber, Relation inhRelation)
+		 int16 seqNumber, Relation inhRelation,
+		 bool child_is_partition)
 {
 	TupleDesc	desc = RelationGetDescr(inhRelation);
 	Datum		values[Natts_pg_inherits];
@@ -2317,7 +2322,10 @@ StoreCatalogInheritance1(Oid relationId, Oid parentOid,
 	childobject.objectId = relationId;
 	childobject.objectSubId = 0;
 
-	recordDependencyOn(&childobject, &parentobject, DEPENDENCY_NORMAL);
+	if (!child_is_partition)
+		recordDependencyOn(&childobject, &parentobject, DEPENDENCY_NORMAL);
+	else
+		recordDependencyOn(&childobject, &parentobject, DEPENDENCY_AUTO);
 
 	/*
 	 * Post creation hook of this inheritance. Since object_access_hook
@@ -10744,7 +10752,9 @@ CreateInheritance(Relation child_rel, Relation parent_rel)
 	StoreCatalogInheritance1(RelationGetRelid(child_rel),
 			 RelationGetRelid(parent_rel),
 			 inhseqno + 1,
-			 catalogRelation);
+			 catalogRelation,
+			 parent_rel->rd_rel->relkind ==
+			 RELKIND_PARTITIONED_TABLE);

Re: [HACKERS] Provide list of subscriptions and publications in psql's completion

2017-02-20 Thread Michael Paquier
On Tue, Feb 21, 2017 at 12:48 AM, Stephen Frost  wrote:
> * Fujii Masao (masao.fu...@gmail.com) wrote:
>> On Fri, Feb 17, 2017 at 11:17 PM, Peter Eisentraut
>> 1) Expose all the columns except subconninfo in pg_subscription to
>> non-superusers. In this idea, the tab-completion and psql meta-command
>> for subscription still sees pg_subscription. One good point of
>> idea is that even non-superusers can see all the useful information
>> about subscriptions other than sensitive information like conninfo.
>>
>> 2) Change pg_stat_subscription so that it also shows all the columns except
>> subconninfo in pg_subscription. Also change the tab-completion and
>> psql meta-command for subscription so that they see pg_stat_subscription
>> instead of pg_subscription. One good point is that pg_stat_subscription
>> exposes all the useful information about subscription to even
>> non-superusers. OTOH, pg_subscription and pg_stat_subscription have
>> to have several same columns. This would be redundant and a bit 
>> confusing.
>>
>> 3) Expose subdbid in pg_stat_subscription. Also change the tab-completion
>> and psql meta-command for subscription so that they see
>> pg_stat_subscription. This change is very simple. But non-superusers 
>> cannot
>> see useful information like subslotname because of privilege problem.
>>
>> I like #1, but any better approach?
>
> #1 seems alright to me, at least.  We could start using column-level
> privs in other places also, as I mentioned up-thread.

Among the three, this looks like a good one.

> I don't particularly like the suggestions to have psql use pg_stat_X
> views or to put things into pg_stat_X views which are object definitions
> for non-superusers.  If we really don't want to use column-level
> privileges then we should have an appropriate view create instead which
> provides non-superusers with the non-sensitive object information.

Neither do I; those are by definition tables for statistics. Having
tab completion depend on them is not right.

So what do you think about the patch attached? This does the following:
- complete subscription list for CREATE/ALTER SUBSCRIPTION
- complete publication list for CREATE/ALTER PUBLICATION
- complete to nothing for PUBLICATION in CREATE/ALTER SUBSCRIPTION as
this refers to remote objects defined in subconninfo.
- authorize read access to public for all columns of pg_subscription
except subconninfo
Thoughts?
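For illustration, with such a column-level GRANT in place, an unprivileged role would see something along these lines (a sketch; "some_user" is a hypothetical non-superuser):

```sql
SET ROLE some_user;

-- Allowed: every column except subconninfo is granted to public
SELECT subname, subenabled, subpublications FROM pg_subscription;

-- Rejected: subconninfo remains superuser-only
SELECT subconninfo FROM pg_subscription;
-- ERROR:  permission denied for relation pg_subscription
```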
-- 
Michael
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 38be9cf1a0..3e88a2a595 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -900,7 +900,10 @@ CREATE VIEW pg_replication_origin_status AS
 
 REVOKE ALL ON pg_replication_origin_status FROM public;
 
+-- All columns of pg_subscription, except subconninfo, are readable.
 REVOKE ALL ON pg_subscription FROM public;
+GRANT SELECT (subdbid, subname, subowner, subenabled, subslotname, subpublications)
+ON pg_subscription TO public;
 
 --
 -- We have a few function definitions in here, too.
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 94814c20d0..fa0ce5c184 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -830,6 +830,18 @@ static const SchemaQuery Query_for_list_of_matviews = {
 "   FROM pg_catalog.pg_am "\
 "  WHERE substring(pg_catalog.quote_ident(amname),1,%d)='%s'"
 
+#define Query_for_list_of_subscriptions \
+" SELECT pg_catalog.quote_ident(c1.subname) "\
+"   FROM pg_catalog.pg_subscription c1, pg_catalog.pg_database c2"\
+"  WHERE substring(pg_catalog.quote_ident(c1.subname),1,%d)='%s'"\
+"AND c2.datname = pg_catalog.current_database()"\
+"AND c1.subdbid = c2.oid"
+
+#define Query_for_list_of_publications \
+" SELECT pg_catalog.quote_ident(pubname) "\
+"   FROM pg_catalog.pg_publication "\
+"  WHERE substring(pg_catalog.quote_ident(pubname),1,%d)='%s'"
+
 /* the silly-looking length condition is just to eat up the current word */
 #define Query_for_list_of_arguments \
 "SELECT pg_catalog.oidvectortypes(proargtypes)||')' "\
@@ -960,13 +972,13 @@ static const pgsql_thing_t words_after_create[] = {
 	{"OWNED", NULL, NULL, THING_NO_CREATE},		/* for DROP OWNED BY ... */
 	{"PARSER", Query_for_list_of_ts_parsers, NULL, THING_NO_SHOW},
 	{"POLICY", NULL, NULL},
-	{"PUBLICATION", NULL, NULL},
+	{"PUBLICATION", Query_for_list_of_publications},
 	{"ROLE", Query_for_list_of_roles},
 	{"RULE", "SELECT pg_catalog.quote_ident(rulename) FROM pg_catalog.pg_rules WHERE substring(pg_catalog.quote_ident(rulename),1,%d)='%s'"},
 	{"SCHEMA", Query_for_list_of_schemas},
	{"SEQUENCE", NULL, &Query_for_list_of_sequences},
 	{"SERVER", Query_for_list_of_servers},
-	{"SUBSCRIPTION", NULL, NULL},
+	{"SUBSCRIPTION", Query_for_list_of_subscriptions},
	{"TABLE", NULL, &Query_for_list_of_tables},
 	{"TABLESPACE", Query_for_list_of_tablespaces},
 	{"TEMP", NULL, NULL, 

Re: [HACKERS] dropping partitioned tables without CASCADE

2017-02-20 Thread Amit Langote
Hi Ashutosh,

Thanks for taking a look at the patch.

On 2017/02/20 21:49, Ashutosh Bapat wrote:
> Thanks for working on all the follow on work for partitioning feature.
> 
> May be you should add all those patches in the next commitfest, so
> that we don't forget those.

I think adding these as one of the PostgreSQL 10 Open Items [0] might be
better.  I've done that.

> On Mon, Feb 20, 2017 at 7:46 AM, Amit Langote wrote:
>> So I count more than a few votes saying that we should be able to DROP
>> partitioned tables without specifying CASCADE.
>>
>> I tried to implement that using the attached patch by having
>> StoreCatalogInheritance1() create DEPENDENCY_AUTO dependency between
>> parent and child if the child is a partition, instead of DEPENDENCY_NORMAL
>> that would otherwise be created.  Now it seems that that is one way of
>> making sure that partitions are dropped when the root partitioned table is
>> dropped, not sure if the best; why create the pg_depend entries at all one
>> might ask.  I chose it for now because that's the one with fewest lines of
>> change.  Adjusted regression tests as well, since we recently tweaked
>> tests [1] to work around the irregularities of test output when using 
>> CASCADE.
> 
> The patch applies cleanly and regression does not show any failures.
> 
> Here are some comments
> 
> For the sake of readability you may want to reverse the if and else order.
> -    recordDependencyOn(&childobject, &parentobject, DEPENDENCY_NORMAL);
> +    if (!child_is_partition)
> +        recordDependencyOn(&childobject, &parentobject, DEPENDENCY_NORMAL);
> +    else
> +        recordDependencyOn(&childobject, &parentobject, DEPENDENCY_AUTO);
> like
> +    if (child_is_partition)
> +        recordDependencyOn(&childobject, &parentobject, DEPENDENCY_AUTO);
> +    else
> +        recordDependencyOn(&childobject, &parentobject, DEPENDENCY_NORMAL);

Sure, done.

> It's weird that somebody can perform DROP TABLE on the partition without
> referring to its parent. That may be a useful feature as it allows one to
> detach the partition as well as remove the table in one command. But it looks
> weird for someone familiar with the partitioning features of other DBMSes. But
> then our partition creation command is CREATE TABLE, so maybe this is an
> expected difference.

There is a line on the CREATE TABLE page in the description of PARTITION
OF clause:

"Note that dropping a partition with DROP TABLE requires taking an ACCESS
EXCLUSIVE lock on the parent table."

In earlier proposals I had included the ALTER TABLE parent ADD/DROP
PARTITION commands, but CREATE TABLE PARTITION OF / DROP TABLE prevailed.
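For reference, the intended behavior with the patch applied can be sketched as follows (table and partition names are illustrative):

```sql
CREATE TABLE list_parted (a int) PARTITION BY LIST (a);
CREATE TABLE part_1 PARTITION OF list_parted FOR VALUES IN (1);

-- A partition can still be dropped on its own, which both detaches
-- and removes it:
DROP TABLE part_1;

-- With the DEPENDENCY_AUTO change, dropping the parent no longer
-- requires CASCADE; any remaining partitions are dropped automatically:
DROP TABLE list_parted;
```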

> --- cleanup: avoid using CASCADE
> -DROP TABLE list_parted, part_1;
> -DROP TABLE list_parted2, part_2, part_5, part_5_a;
> -DROP TABLE range_parted, part1, part2;
> +-- cleanup
> +DROP TABLE list_parted, list_parted2, range_parted;
> Testcases usually drop one table at a time, I guess, to reduce the differences
> when we add or remove tables from testcases. All such blocks should probably
> follow same policy.

Hmm, I see this in src/test/regress/sql/inherit.sql:141

DROP TABLE firstparent, secondparent, jointchild, thirdparent, otherchild;

>  drop table list_parted cascade;
> -NOTICE:  drop cascades to 3 other objects
> -DETAIL:  drop cascades to table part_ab_cd
> probably we should remove cascade from there, unless you are testing CASCADE
> functionality. Similarly for other blocks like
>  drop table range_parted cascade;
> 
> BTW, I noticed that although we are allowing foreign tables to be
> partitions, there are no tests in foreign_data.sql for testing it. If
> there would have been we would tests DROP TABLE on a partitioned table
> with foreign partitions as well. That file has testcases for testing
> foreign table inheritance, and should have tests for foreign table
> partitions.

That makes sense.  Patch 0002 is for that (I'm afraid this should be
posted separately though).  I didn't add/repeat all the tests that were
added by the foreign table inheritance patch again for foreign partitions
(common inheritance rules apply to both cases), only added those for the
new partitioning commands and certain new rules.

Thanks,
Amit

[0] https://wiki.postgresql.org/wiki/PostgreSQL_10_Open_Items
>From c7cff17ab7861dbd0a4fb329115b3f99fd800325 Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 16 Feb 2017 15:56:44 +0900
Subject: [PATCH 1/2] Allow dropping partitioned table without CASCADE

Currently, a normal dependency is created between an inheritance
parent and child when creating the child.  That means one must
specify CASCADE to drop the parent table if a child table exists.
When creating partitions as inheritance children, create an auto
dependency instead, so that partitions are dropped automatically
when the parent is dropped, i.e., without specifying CASCADE.
---
 src/backend/commands/tablecmds.c   | 26 ++
 src/test/regress/expected/alter_table.out  | 10 --
 src/test/regress/expected/create_table.out |  9 ++---
 src/test/regress/expected/inherit.out 

[HACKERS] [doc fix] Really trivial fix for BRIN documentation

2017-02-20 Thread Tsunakawa, Takayuki
Hello,

This is just a correction from "index" to "table".  I was a bit confused when I 
first read this part.


Regards
Takayuki Tsunakawa



brin_trivial_doc_fix.patch
Description: brin_trivial_doc_fix.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel bitmap heap scan

2017-02-20 Thread Dilip Kumar
On Mon, Feb 20, 2017 at 11:18 PM, Robert Haas  wrote:
> On Mon, Feb 20, 2017 at 5:55 AM, Thomas Munro
>  wrote:
>> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar  wrote:
>>> So basically, what I want to propose is that only during
>>> ExecReScanBitmapHeapScan can we free all the DSA pointers, because at
>>> that time we can be sure that all the workers have completed their
>>> task and we are safe to free. (And we don't free any DSA memory at
>>> ExecEndBitmapHeapScan.)
>>
>> I think this works.
>
> OK.

In my latest version of the patch, I have fixed it as described above,
i.e., cleanup only in ExecReScanBitmapHeapScan.

To get there, some changes were needed in both patches.

0001:
1. I have added a new interface, "tbm_free_shared_area(dsa_area *dsa,
dsa_pointer dp)", which is responsible for freeing all the shared
members (pagetable, ptpage and ptchunk).  Actually, we cannot do this
in tbm_free itself because only the leader will have a local tbm with
references to all these pointers, and the parallel bitmap leader may
not necessarily be the actual backend.

If we want to get this done using tbm, then we need to create a local
tbm in each worker and get the shared member reference copied into tbm
using tbm_attach_shared_iterate like we were doing in the earlier
version.

2. Now tbm_free does not free any of the shared members that can be
accessed by other workers, so tbm_free is safe to call from
ExecEndBitmapHeapScan without any safety check or ref count.

0002:
1. We don't need ExecShutdownBitmapHeapScan anymore because now we are
not freeing any shared member in ExecEndBitmapHeapScan.

2. In ExecReScanBitmapHeapScan we will call tbm_free_shared_area to
free the shared members of the TBM.
3. After that, we will free the TBMSharedIteratorState that we
allocated using tbm_prepare_shared_iterate.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0001-tidbitmap-support-shared-v5.patch
Description: Binary data


0002-parallel-bitmap-heapscan-v5.patch
Description: Binary data



Re: [HACKERS] Documentation improvements for partitioning

2017-02-20 Thread Ashutosh Bapat
On Tue, Feb 21, 2017 at 9:57 AM, Amit Langote
 wrote:
> On 2017/02/21 12:10, Ashutosh Bapat wrote:
>> I think, that's a limitation till we implement global indexes. But
>> nothing in the current implementation stops us from implementing it.
>> In fact, I remember, a reply from Robert to another thread on
>> partitioning started in parallel to Amit's thread had explained some
>> of these design decisions. I am unable to find link to that exact
>> reply though.
>
> Are you perhaps thinking of the message titled "design for a partitioning
> feature (was: inheritance)" a few months back:
>
> https://www.postgresql.org/message-id/ca+tgmozev0nryvw9cnb81xdojg4xtpjmkdif0zto-gdlcoc...@mail.gmail.com

Thanks a lot, that's the one I was searching for. And now that I
confirm it, there's NO reply to Robert's mail.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Documentation improvements for partitioning

2017-02-20 Thread Amit Langote
On 2017/02/21 12:10, Ashutosh Bapat wrote:
> I think, that's a limitation till we implement global indexes. But
> nothing in the current implementation stops us from implementing it.
> In fact, I remember, a reply from Robert to another thread on
> partitioning started in parallel to Amit's thread had explained some
> of these design decisions. I am unable to find link to that exact
> reply though.

Are you perhaps thinking of the message titled "design for a partitioning
feature (was: inheritance)" a few months back:

https://www.postgresql.org/message-id/ca+tgmozev0nryvw9cnb81xdojg4xtpjmkdif0zto-gdlcoc...@mail.gmail.com

Thanks,
Amit






Re: [HACKERS] Passing query string to workers

2017-02-20 Thread Rafia Sabih
On Mon, Feb 20, 2017 at 8:35 PM, Kuntal Ghosh 
wrote:
>
> +   char   *query_data;
> +   query_data = estate->es_sourceText;
> estate->es_sourceText is a const char* variable. Assigning this const
> pointer to a non-const pointer violates the rules of
> const-correctness. So, either you should declare query_data as const
> char*, or as Robert suggested, you can directly use
> estate->es_sourceText.
>
Done.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


pass_queryText_to_workers_v7.patch
Description: Binary data



Re: [HACKERS] delta relations in AFTER triggers

2017-02-20 Thread Thomas Munro
On Tue, Feb 21, 2017 at 7:14 AM, Thomas Munro
 wrote:
> On Fri, Jan 20, 2017 at 2:49 AM, Kevin Grittner  wrote:
>> Attached is v9 which fixes bitrot from v8.  No other changes.
>>
>> Still needs review.

Based on a suggestion from Robert off-list I tried inserting into a
delta relation from a trigger function and discovered that it
segfaults:

  * frame #0: 0x0001057705a6
postgres`addRangeTableEntryForRelation(pstate=0x7fa58186a4d0,
rel=0x, alias=0x, inh='\0',
inFromCl='\0') + 70 at parse_relation.c:1280 [opt]
frame #1: 0x00010575bbda
postgres`setTargetTable(pstate=0x7fa58186a4d0,
relation=0x7fa58186a098, inh=, alsoSource='\0',
requiredPerms=1) + 90 at parse_clause.c:199 [opt]
frame #2: 0x000105738530 postgres`transformStmt [inlined]
transformInsertStmt(pstate=) + 69 at analyze.c:540 [opt]
frame #3: 0x0001057384eb
postgres`transformStmt(pstate=, parseTree=)
+ 2411 at analyze.c:279 [opt]
frame #4: 0x000105737a42
postgres`transformTopLevelStmt(pstate=,
parseTree=0x7fa58186a438) + 18 at analyze.c:192 [opt]
frame #5: 0x0001059408d0
postgres`pg_analyze_and_rewrite_params(parsetree=0x7fa58186a438,
query_string="insert into d values (100, 100, 'x')",
parserSetup=(plpgsql.so`plpgsql_parser_setup at pl_comp.c:1017),
parserSetupArg=0x7fa58185c2a0, queryEnv=0x7fa581857798) + 128
at postgres.c:706 [opt]

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Documentation improvements for partitioning

2017-02-20 Thread Ashutosh Bapat
On Tue, Feb 21, 2017 at 2:03 AM, Bruce Momjian  wrote:
> On Mon, Feb 20, 2017 at 02:37:44PM +0530, Robert Haas wrote:
>> On Mon, Feb 20, 2017 at 2:09 AM, Simon Riggs  wrote:
>> > On 15 February 2017 at 15:46, Robert Haas  wrote:
>> >>> It leaves me asking what else is missing.
>> >>
>> >> There is certainly a lot of room for improvement here but I don't
>> >> understand your persistent negativity about what's been done thus far.
>> >> I think it's pretty clearly a huge step forward, and I think Amit
>> >> deserves a ton of credit for making it happen.  The improvements in
>> >> bulk loading performance alone are stupendous.  You apparently have
>> >> the idea that somebody could have written an even larger patch that
>> >> solved even more problems at once, but this was already a really big
>> >> patch, and IMHO quite a good one.
>> >
>> > Please explain these personal comments against me.
>>
>> Several of your emails, including your first post to this thread,
>> seemed to me to be quite negative about the state of this feature.  I
>> don't think that's warranted, though perhaps I am misreading your
>> tone, as I have been known to do.  I also don't think that expressing
>> the opinion that the feature is better than you're giving it credit
>> for is a personal comment against you.  Where exactly do you see a
>> personal comment against you in what I wrote?
>
> I have to admit my reaction was similar to Simon's, meaning that the
> lack of docs is a problem, and that the limitations are kind of a
> surprise, and I wonder what other surprises there are.
>
> I am thinking this is a result of small teams, often from the same
> company, working on a features in isolation and then making them public.

I agree that this is a result of small teams. The partitioning feature
encompasses features like global indexes, which are large in
themselves. Usually, in a company, many teams would be working on
different sub-features in the same release schedule. But that's not
the case here. We have to add sub-features over multiple releases. That
might be the reason behind some of the current limitations.

> It is often not clear what decisions were made and why.

Amit Langote submitted the patch sometime in August 2015, and it
certainly didn't look like a well-thought-out design, certainly not
the product of 'long cooking' within his company. It was more
experimental. (Obviously he had the background of many earlier
discussions on partitioning, all of which happened on hackers.) Since
then all the discussion has been on hackers; all decisions were made
during the discussion. While what you are saying may be true of other
patches, I am not sure whether it's true of this work.

> The idea that
> unique indexes on a parent table can't guarantee uniqueness across child
> tables is both a surprise, and obvious once stated.

I think, that's a limitation till we implement global indexes. But
nothing in the current implementation stops us from implementing it.
In fact, I remember, a reply from Robert to another thread on
partitioning started in parallel to Amit's thread had explained some
of these design decisions. I am unable to find link to that exact
reply though.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Documentation improvements for partitioning

2017-02-20 Thread Amit Langote
On 2017/02/16 10:45, Amit Langote wrote:
> Also attaching 0002 (unchanged) for tab-completion support for the new
> partitioning syntax.

Robert already spawned a new thread titled "tab completion for
partitioning" for this [0].

> 0003 changes how ExecFindPartition() shows the row for which
> get_partition_for_tuple() failed to find a partition.  As Simon commented
> upthread, we should show just the partition key, not the whole row in the
> error DETAIL.  So the DETAIL now looks like what's shown by
> _bt_check_unique() upon uniqueness violation:
> 
> DETAIL: Partition key of the failing row contains (key1, key2, ...)=(val1,
> val2, ...)
> 
> The rules about which columns to show or whether to show the DETAIL at all
> are similar to those in BuildIndexValueDescription():
> 
>  - if user has SELECT privilege on the whole table, simply go ahead
> 
>  - if user doesn't have SELECT privilege on the table, check that they
>can see all the columns in the key (no point in showing partial key);
>however abort on finding an expression for which we don't try finding
>out privilege situation of whatever columns may be in the expression

I posted this patch in a new thread titled "error detail when partition
not found" [1].

Thanks,
Amit

[0]
https://www.postgresql.org/message-id/CA+TgmobYOj=a8gesies_v2wq46-_w0+7mowpinwc+iuzj-u...@mail.gmail.com
[1]
https://www.postgresql.org/message-id/9f9dc7ae-14f0-4a25-5485-964d9bfc19bd%40lab.ntt.co.jp






[HACKERS] error detail when partition not found

2017-02-20 Thread Amit Langote
Simon pointed out in a nearby thread [0] that the detail part of the
partition-not-found error should show just the partition keys.  I posted a
patch on that thread [1], but to avoid confusion caused by the multitude
of patches over there, I'm re-posting it here.

* What the patch does:

Currently we show the whole row in the detail part of the error.

CREATE TABLE measurement_year_month (
    logdate     date not null,
    peaktemp    int,
    unitsales   int
) PARTITION BY RANGE (EXTRACT(YEAR FROM logdate), EXTRACT(MONTH FROM
logdate));

# INSERT INTO measurement_year_month VALUES ('2016-12-02', 1, 1);
ERROR:  no partition of relation "measurement_year_month" found for row
DETAIL:  Failing row contains (2016-12-02, 1, 1).

The patch changes it to look like the following:

# INSERT INTO measurement_year_month VALUES ('2016-12-02', 1, 1);
ERROR:  no partition of relation "measurement_year_month" found for row
DETAIL:  Partition key of the failing row contains
(date_part('year'::text, logdate), date_part('month'::text,
logdate))=(2016, 12).

It's similar to error detail shown when btree unique violation occurs:

-- just to be clear, using LIKE won't make measurement partitioned too
CREATE TABLE measurement (LIKE measurement_year_month);
CREATE UNIQUE INDEX ON measurement (EXTRACT(YEAR FROM logdate),
EXTRACT(MONTH FROM logdate));

# INSERT INTO measurement VALUES ('2016-12-02', 1, 1);
INSERT 0 1

# INSERT INTO measurement VALUES ('2016-12-02', 1, 1);
ERROR:  duplicate key value violates unique constraint
"measurement_date_part_date_part1_idx"
DETAIL:  Key (date_part('year'::text, logdate), date_part('month'::text,
logdate))=(2016, 12) already exists.

* Some of the implementation details of the patch here:

The rules about which columns to show or whether to show the DETAIL at all
are similar to those in BuildIndexValueDescription():

 - if user has SELECT privilege on the whole table, simply go ahead

 - if user doesn't have SELECT privilege on the table, check that they
   can see all the columns in the key (no point in showing partial key);
   however abort on finding an expression for which we don't try finding
   out privilege situation of whatever columns may be in the expression

Thanks,
Amit

[0]
https://www.postgresql.org/message-id/CANP8%2BjJBpWocfKrbJcaf3iBt9E3U%3DWPE_NC8YE6rye%2BYJ1sYnQ%40mail.gmail.com
[1]
https://www.postgresql.org/message-id/2f8df068-9a49-d74a-30af-7cd17bdee181%40lab.ntt.co.jp
From b382d1ebb25ab3d577667d50e86430b34e57ab9a Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 13 Feb 2017 19:52:19 +0900
Subject: [PATCH] Show only the partition key upon failing to find a partition

Currently, the whole row is shown without column names.  Instead,
adopt a style similar to _bt_check_unique() in ExecFindPartition()
and show the failing key in format (key1, ...)=(val1, ...).

The key description shown in the error message is now built by the
new ExecBuildSlotPartitionKeyDescription(), which works along the
lines of BuildIndexValueDescription(), using similar rules about
what columns of the partition key to include depending on the user's
privileges to view the same.

A bunch of relevant tests are added.
---
 src/backend/catalog/partition.c  |  29 +++
 src/backend/executor/execMain.c  | 147 ++-
 src/backend/utils/adt/ruleutils.c|  37 ++---
 src/include/catalog/partition.h  |  13 +++-
 src/include/utils/ruleutils.h|   2 +
 src/test/regress/expected/insert.out |  38 -
 src/test/regress/sql/insert.sql  |  30 +++
 7 files changed, 246 insertions(+), 50 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 4bcef58763..710ce07a6f 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -140,13 +140,6 @@ static int partition_bound_bsearch(PartitionKey key,
 		PartitionBoundInfo boundinfo,
 		void *probe, bool probe_is_bound, bool *is_equal);
 
-/* Support get_partition_for_tuple() */
-static void FormPartitionKeyDatum(PartitionDispatch pd,
-	  TupleTableSlot *slot,
-	  EState *estate,
-	  Datum *values,
-	  bool *isnull);
-
 /*
  * RelationBuildPartitionDesc
  *		Form rel's partition descriptor
@@ -1608,7 +1601,7 @@ generate_partition_qual(Relation rel)
  * the heap tuple passed in.
  * 
  */
-static void
+void
 FormPartitionKeyDatum(PartitionDispatch pd,
 	  TupleTableSlot *slot,
 	  EState *estate,
@@ -1672,7 +1665,7 @@ int
 get_partition_for_tuple(PartitionDispatch *pd,
 		TupleTableSlot *slot,
 		EState *estate,
-		Oid *failed_at)
+		GetPartitionFailureData *gpfd)
 {
 	PartitionDispatch parent;
 	Datum		values[PARTITION_MAX_KEYS];
@@ -1693,13 +1686,6 @@ get_partition_for_tuple(PartitionDispatch *pd,
 		TupleTableSlot *myslot = parent->tupslot;
 		TupleConversionMap *map = parent->tupmap;
 
-		/* Quick exit */
-		if (partdesc->nparts == 0)
-		{
-			

Re: [HACKERS] delta relations in AFTER triggers

2017-02-20 Thread Thomas Munro
On Fri, Jan 20, 2017 at 2:49 AM, Kevin Grittner  wrote:
> Attached is v9 which fixes bitrot from v8.  No other changes.
>
> Still needs review.

This patch still applies, builds cleanly after a small modification,
includes regression tests, and the tests pass.  The modification I
needed to make was due to this compile error:

nodeNamedtuplestorescan.c:154:19: error: no member named
'ps_TupFromTlist' in 'struct PlanState'
scanstate->ss.ps.ps_TupFromTlist = false;

Commit ea15e18677fc2eff3135023e27f69ed8821554ed got rid of that member
of PlanState and I assume based on other changes in that commit that
the thing to do was simply to remove that line.  Having done that, it
builds cleanly.

+INSERT INTO level1_table(level1_no)
+  SELECT generate_series(1,200);
+INSERT INTO level2_table(level2_no, parent_no)
+  SELECT level2_no, level2_no / 50 + 1 AS parent_no
+FROM generate_series(1,) level2_no;
+INSERT INTO all_level_status(level, node_no, status)
+  SELECT 1, level1_no, 0 FROM level1_table;
+INSERT INTO all_level_status(level, node_no, status)
+  SELECT 2, level2_no, 0 FROM level2_table;
+INSERT INTO level1_table(level1_no)
+  SELECT generate_series(201,1000);
+DELETE FROM level1_table WHERE level1_no BETWEEN 201 AND 1000;
+DELETE FROM level1_table WHERE level1_no BETWEEN 1 AND 10010;
+SELECT count(*) FROM level1_table;
+ count
+---
+   200
+(1 row)

Was it intentional that this test doesn't include any statements that
reach the case where the trigger does RAISE EXCEPTION 'RI error'?
From the output generated there doesn't seem to be any evidence that
the triggers run at all, though I know from playing around with this
that they do:

  postgres=# delete from level1_table where level1_no = 42;
  ERROR:  RI error
  CONTEXT:  PL/pgSQL function level1_table_ri_parent_del_func() line 6 at RAISE

+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group

These copyright messages are missing 3 years' worth of bugfixes.

+SPI_get_caller_relation(const char *name)

Do we need this function?  It's not used by the implementation.  If it
does have a good use for end-users, then perhaps it should be called
something like SPI_get_registered_relation, to make it clear that it
will return whatever you registered with SPI_register_relation,
instead of introducing this 'caller' terminology?

+typedef struct NamedTuplestoreScan
+{
+ Scan scan;
+ char   *enrname;
+} NamedTuplestoreScan;

Nearly all plan node structs have a comment for the members below
'scan'; I think this one needs a comment too because 'enrname' is not
self-explanatory.

  /*
+ * Capture the NEW and OLD transition TABLE tuplestores (if specified for
+ * this trigger).
+ */
+ if (trigdata->tg_newtable || trigdata->tg_oldtable)
+ {
+ estate.queryEnv = create_queryEnv();
+ if (trigdata->tg_newtable)
+ {
+ Enr enr = palloc(sizeof(EnrData));
+
+ enr->md.name = trigdata->tg_trigger->tgnewtable;
+ enr->md.tupdesc = trigdata->tg_relation->rd_att;
+ enr->md.enrtuples = tuplestore_tuple_count(trigdata->tg_newtable);
+ enr->reldata = trigdata->tg_newtable;
+ register_enr(estate.queryEnv, enr);
+ SPI_register_relation(enr);
+ }

Why do we have to call register_enr and also SPI_register_relation here?

On Mon, Dec 19, 2016 at 4:35 AM, Thomas Munro
 wrote:
> On Sun, Dec 18, 2016 at 3:15 PM, Kevin Grittner  wrote:
>> There were a few changes Thomas included in the version he posted,
>> without really delving into an explanation for those changes.  Some
>> or all of them are likely to be worthwhile, but I would rather
>> incorporate them based on explicit discussion, so this version
>> doesn't do much other than generalize the interface a little,
>> change some names, and add more regression tests for the new
>> feature.  (The examples I worked up for the rough proof of concept
>> of enforcement of RI through set logic rather than row-at-a-time
>> navigation were the basis for the new tests, so the idea won't get
>> totally lost.)  Thomas, please discuss each suggested change (e.g.,
>> the inclusion of the query environment in the parameter list of a
>> few more functions).
>
> I was looking for omissions that would cause some kind of statements
> to miss out on ENRs arbitrarily.  It seemed to me that
> parse_analyze_varparams should take a QueryEnvironment, mirroring
> parse_analyze, so that PrepareQuery could pass it in.  Otherwise,
> PREPARE wouldn't see ENRs.  Is there any reason why SPI_prepare should
> see them but PREPARE not?

Any thoughts about that?

More soon.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] SCRAM authentication, take three

2017-02-20 Thread Michael Paquier
On Mon, Feb 20, 2017 at 9:41 PM, Aleksander Alekseev
 wrote:
>> Speaking about flaws, it looks like there is a memory leak in
>> array_to_utf procedure - result is allocated twice.

Pushed a fix for this one on my branch.

> And a few more things I've noticed after a closer look:
>
> * build_client_first_message does not free `state->client_nonce` if
>   second malloc (for `buf`) fails
> * same for `state->client_first_message_bare`
> * ... and most other procedures declared in fe-auth-scram.c file
>   (see malloc and strdup calls)

You are visibly missing pg_fe_scram_free().

> * scram_Normalize doesn't check malloc return value

Yes, I am aware of this one. Fixing it makes the interface rather ugly,
though, because an error message needs to be propagated across many
API layers. We could just assume that anything returning NULL is
equivalent to an OOM and nothing else.
-- 
Michael




Re: [HACKERS] Multivariate statistics and expression indexes

2017-02-20 Thread Bruce Momjian
On Tue, Feb 21, 2017 at 01:27:53AM +0100, Tomas Vondra wrote:
> On 02/21/2017 12:13 AM, Bruce Momjian wrote:
> >At the risk of asking a stupid question, we already have optimizer
> >statistics on expression indexes.  In what sense are we using this for
> >multi-variate statistics, and in what sense can't we.
> >
> 
> We're not using that at all, because those are really orthogonal features.
> Even with expression indexes, the statistics are per attribute, and the
> attributes are treated as independent.
> 
> There was a proposal to also allow creating statistics on expressions
> (without having to create an index), but that's not supported yet.

OK, thanks.  I had to ask.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] PATCH: two slab-like memory allocators

2017-02-20 Thread Tomas Vondra

Hi,

Attached is v9 of this patch series. This addresses most of the points 
raised in the review, namely:


1) change most 'debug' stuff to 'static inline' in memdebug.h

2) fixed and reworded a bunch of comments

3) get rid of the block-level bitmap tracking free chunks

Instead of the bitmap, I've used a simple singly-linked list, using 
int32 chunk indexes. Perhaps it could use the slist instead, but I'm not
quite sure MAXALIGN is guaranteed to be greater than the size of a pointer.


In any case, this seems to be working reasonably well - it saves a bit 
of code (but also made some code slightly more complex). Also, it seems 
to be a tad faster than v8 - after repeating the same benchmark as 
before, I get these numbers:


        master   slab-v8   slab-v9
 ---------------------------------
   1        50        28        25
   5     17500       180       160
  10        15       380       330
  20         ?       750       670

Although the results are quite noisy.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 7c70a7bef4029dd7f10c7dc9ff0dd92a7bd2f966 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 20 Feb 2017 20:16:16 +0100
Subject: [PATCH 1/3] move common bits to memdebug

---
 src/backend/utils/mmgr/Makefile   |   2 +-
 src/backend/utils/mmgr/aset.c | 115 +-
 src/backend/utils/mmgr/memdebug.c |  93 ++
 src/include/utils/memdebug.h  |  48 
 4 files changed, 144 insertions(+), 114 deletions(-)
 create mode 100644 src/backend/utils/mmgr/memdebug.c

diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index 1842bae..fc5f793 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/mmgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = aset.o dsa.o freepage.o mcxt.o portalmem.o
+OBJS = aset.o dsa.o freepage.o mcxt.o memdebug.o portalmem.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 4dfc3ec..33b4d01 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -41,46 +41,6 @@
  *	chunks as chunks.  Anything "large" is passed off to malloc().  Change
  *	the number of freelists to change the small/large boundary.
  *
- *
- *	About CLOBBER_FREED_MEMORY:
- *
- *	If this symbol is defined, all freed memory is overwritten with 0x7F's.
- *	This is useful for catching places that reference already-freed memory.
- *
- *	About MEMORY_CONTEXT_CHECKING:
- *
- *	Since we usually round request sizes up to the next power of 2, there
- *	is often some unused space immediately after a requested data area.
- *	Thus, if someone makes the common error of writing past what they've
- *	requested, the problem is likely to go unnoticed ... until the day when
- *	there *isn't* any wasted space, perhaps because of different memory
- *	alignment on a new platform, or some other effect.  To catch this sort
- *	of problem, the MEMORY_CONTEXT_CHECKING option stores 0x7E just beyond
- *	the requested space whenever the request is less than the actual chunk
- *	size, and verifies that the byte is undamaged when the chunk is freed.
- *
- *
- *	About USE_VALGRIND and Valgrind client requests:
- *
- *	Valgrind provides "client request" macros that exchange information with
- *	the host Valgrind (if any).  Under !USE_VALGRIND, memdebug.h stubs out
- *	currently-used macros.
- *
- *	When running under Valgrind, we want a NOACCESS memory region both before
- *	and after the allocation.  The chunk header is tempting as the preceding
- *	region, but mcxt.c expects to able to examine the standard chunk header
- *	fields.  Therefore, we use, when available, the requested_size field and
- *	any subsequent padding.  requested_size is made NOACCESS before returning
- *	a chunk pointer to a caller.  However, to reduce client request traffic,
- *	it is kept DEFINED in chunks on the free list.
- *
- *	The rounded-up capacity of the chunk usually acts as a post-allocation
- *	NOACCESS region.  If the request consumes precisely the entire chunk,
- *	there is no such region; another chunk header may immediately follow.  In
- *	that case, Valgrind will not detect access beyond the end of the chunk.
- *
- *	See also the cooperating Valgrind client requests in mcxt.c.
- *
  *-
  */
 
@@ -296,10 +256,10 @@ static const unsigned char LogTable256[256] =
  */
 #ifdef HAVE_ALLOCINFO
 #define AllocFreeInfo(_cxt, _chunk) \
-			fprintf(stderr, "AllocFree: %s: %p, %d\n", \
+			fprintf(stderr, "AllocFree: %s: %p, %lu\n", \
 (_cxt)->header.name, (_chunk), (_chunk)->size)
 #define 

Re: [HACKERS] Multivariate statistics and expression indexes

2017-02-20 Thread Tomas Vondra

On 02/21/2017 12:13 AM, Bruce Momjian wrote:

> At the risk of asking a stupid question, we already have optimizer
> statistics on expression indexes.  In what sense are we using this for
> multi-variate statistics, and in what sense can't we.



We're not using that at all, because those are really orthogonal 
features. Even with expression indexes, the statistics are per 
attribute, and the attributes are treated as independent.


There was a proposal to also allow creating statistics on expressions 
(without having to create an index), but that's not supported yet.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] ICU integration

2017-02-20 Thread Peter Geoghegan
On Mon, Feb 20, 2017 at 3:51 PM, Bruce Momjian  wrote:
> So we don't have any other cases where we warn about possible corruption
> except this?

I'm not sure that I understand the distinction you're making.

> Also, I will go back to my previous concern, that while I like the fact
> we can detect collation changes with ICU, we don't know if ICU
> collations change more often than OS collations.

We do know that ICU collations can never change behaviorally in a
given stable release. Bug fixes are allowed in point releases, but
these never change the user-visible behavior of collations. That's
very clear, because an upstream Unicode UCA version is used by a given
major release of ICU. This upstream data describes the behavior of a
collation using a high-level declarative language that non-programmer
experts in natural languages write.

ICU versions many different things, in fact. Importantly, it
explicitly decouples behavioral issues (user visible sort order -- UCA
version) from technical issues (collator implementation details). So,
my original point is that that could change, and if that happens we
ought to have a plan. But, it won't change unless it really has to.

-- 
Peter Geoghegan




Re: [HACKERS] ICU integration

2017-02-20 Thread Bruce Momjian
On Mon, Feb 20, 2017 at 03:29:07PM -0800, Peter Geoghegan wrote:
> Marking all indexes as invalid would be an enormous overkill. We don't
> even know for sure that the values the user has indexed happens to be
> affected. In order for there to have been a bug in ICU in the first
> place, the likelihood is that it only occurs in what are edge cases
> for the collator.
> 
> ICU is a very popular library, used in software that I personally
> interact with every day [1]. Any bugs like this should be exceptional.
> They very clearly appreciate how sensitive software like Postgres is
> to changes like this, which is why the versioning API exists.
> 
> [1] http://site.icu-project.org/#TOC-Who-Uses-ICU-

So we don't have any other cases where we warn about possible corruption
except this?

Also, I will go back to my previous concern, that while I like the fact
we can detect collation changes with ICU, we don't know if ICU
collations change more often than OS collations.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] ICU integration

2017-02-20 Thread Peter Geoghegan
On Mon, Feb 20, 2017 at 3:19 PM, Bruce Momjian  wrote:
> Well, the release notes are a clear-enough communication for a need to
> reindex.  I don't see a LOG message as similar.  Don't we have other
> cases where we have to warn administrators?  We can mark the indexes as
> invalid but how would we report that?

Marking all indexes as invalid would be an enormous overkill. We don't
even know for sure that the values the user has indexed happens to be
affected. In order for there to have been a bug in ICU in the first
place, the likelihood is that it only occurs in what are edge cases
for the collator.

ICU is a very popular library, used in software that I personally
interact with every day [1]. Any bugs like this should be exceptional.
They very clearly appreciate how sensitive software like Postgres is
to changes like this, which is why the versioning API exists.

[1] http://site.icu-project.org/#TOC-Who-Uses-ICU-
-- 
Peter Geoghegan




Re: [HACKERS] ICU integration

2017-02-20 Thread Bruce Momjian
On Mon, Feb 20, 2017 at 02:54:22PM -0800, Peter Geoghegan wrote:
> On Mon, Feb 20, 2017 at 2:38 PM, Bruce Momjian  wrote:
> > I can't think of any cases where we warn of possible corruption only in
> > the server logs.
> 
> It's just like any case where there was a bug in Postgres that could
> have subtly broken index builds. We don't make the next point release
> force users to REINDEX every possibly-affected index every time that
> happens. Presumably this is because they aren't particularly likely to
> be affected by any given bug, and because it's pretty infeasible for
> us to do so anyway. There aren't always easy ways to detect the
> problem, and users will never learn to deal with issues like this well
> when it is by definition something that should never happen.

Well, the release notes are a clear-enough communication for a need to
reindex.  I don't see a LOG message as similar.  Don't we have other
cases where we have to warn administrators?  We can mark the indexes as
invalid but how would we report that?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




[HACKERS] Multivariate statistics and expression indexes

2017-02-20 Thread Bruce Momjian
At the risk of asking a stupid question, we already have optimizer
statistics on expression indexes.  In what sense are we using this for
multi-variate statistics, and in what sense can't we.

FYI, I just wrote a blog post about expression index statistics:

http://momjian.us/main/blogs/pgblog/2017.html#February_20_2017

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] ICU integration

2017-02-20 Thread Peter Geoghegan
On Mon, Feb 20, 2017 at 2:38 PM, Bruce Momjian  wrote:
> I can't think of any cases where we warn of possible corruption only in
> the server logs.

It's just like any case where there was a bug in Postgres that could
have subtly broken index builds. We don't make the next point release
force users to REINDEX every possibly-affected index every time that
happens. Presumably this is because they aren't particularly likely to
be affected by any given bug, and because it's pretty infeasible for
us to do so anyway. There aren't always easy ways to detect the
problem, and users will never learn to deal with issues like this well
when it is by definition something that should never happen.

-- 
Peter Geoghegan




Re: [HACKERS] ICU integration

2017-02-20 Thread Bruce Momjian
On Wed, Feb 15, 2017 at 10:35:34PM -0800, Peter Geoghegan wrote:
> On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
>  wrote:
> > Stable operating systems shouldn't do major library upgrades, so I feel
> > pretty confident about this.
> 
> There is one case where it *is* appropriate for a bugfix release of
> ICU to bump collator version: When there was a bug in the collator
> itself, leading to broken binary blobs [1]. We should make the user
> REINDEX when this happens.
> 
> I think that ICU may well do this in point releases, which is actually
> a good thing. So, it seems like a good idea to test that there is no
> change, in a place like check_strxfrm_bug(). In my opinion, we should
> LOG a complaint in the event of a mismatch rather than refusing to
> start the server, since there probably isn't that much chance of the
> user being harmed by the bug. The cost of not starting the server
> properly until a REINDEX finishes or similar looks particularly
> unappealing when one considers that the app was probably affected by
> any corruption for weeks or months before the ICU update enabled its
> detection.

I can't think of any cases where we warn of possible corruption only in
the server logs.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




[HACKERS] SUBSCRIPTION command hangs up, segfault occurs in the server process and to cancel the execution.

2017-02-20 Thread nuko yokohama
Hi.
While verifying the logical replication of PostgreSQL 10-devel, I found the
following problem.

* When you run the SUBSCRIPTION command against a table in the same
PostgreSQL server, it hangs up.
* Canceling the hung SUBSCRIPTION command with CTRL+C causes a segfault
in a server process, and the PostgreSQL server restarts.


[Steps to Reproduce]

$ createdb db1 -U postgres
$ createdb db2 -U postgres
$ psql -U postgres db1 -c "CREATE TABLE test (id int primary key, data
text)"
$ psql -U postgres db2 -c "CREATE TABLE test (id int primary key, data
text)"
$ psql -U postgres db1 -c "CREATE PUBLICATION pub_db1_test FOR TABLE test"
$ psql -U postgres db2 -c "CREATE SUBSCRIPTION sub_db2_test CONNECTION
'dbname=db1 port=5432 user=postgres' PUBLICATION pub_db1_test"

The SUBSCRIPTION command does not return; it hangs.
At this time, the following is output to the server log:

2017-02-21 06:58:05.082 JST [22060] LOG:  logical decoding found initial
starting point at 0/1C06660
2017-02-21 06:58:05.082 JST [22060] DETAIL:  1 transaction needs to finish.

Canceling psql (while it is running SUBSCRIPTION) with CTRL+C causes a
segfault in the server process.
At this time, the following messages are output to the server log:

2017-02-21 07:01:00.246 JST [22059] ERROR:  canceling statement due to user
request
2017-02-21 07:01:00.246 JST [22059] STATEMENT:  CREATE SUBSCRIPTION
sub_db2_test CONNECTION 'dbname=db1 port=5432 user=postgres' PUBLICATION
pub_db1_test
2017-02-21 07:01:01.006 JST [21445] LOG:  server process (PID 22060) was
terminated by signal 11: Segmentation fault
2017-02-21 07:01:01.007 JST [21445] LOG:  terminating any other active
server processes


[Environment]

postgres=# SELECT version();
  version


 PostgreSQL 10devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.2
20140120 (Red Hat 4.8.2-16), 64-bit
(1 row)

postgres=# SHOW wal_level;
 wal_level
---
 logical
(1 row)

postgres=# SHOW max_wal_senders;;
 max_wal_senders
-
 10
(1 row)

postgres=# SHOW max_replication_slots;
 max_replication_slots
---
 10
(1 row)

postgres=# TABLE pg_hba_file_rules ;
 line_number | type  |   database    | user_name  |  address  |                 netmask                 | auth_method | options | error
-------------+-------+---------------+------------+-----------+-----------------------------------------+-------------+---------+-------
           2 | local | {all}         | {all}      |           |                                         | trust       |         |
           4 | host  | {all}         | {all}      | 127.0.0.1 | 255.255.255.255                         | trust       |         |
           6 | host  | {all}         | {all}      | ::1       | ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff | trust       |         |
           9 | local | {replication} | {postgres} |           |                                         | trust       |         |
(4 rows)


Regards.


Re: [HACKERS] Documentation improvements for partitioning

2017-02-20 Thread Bruce Momjian
On Mon, Feb 20, 2017 at 02:37:44PM +0530, Robert Haas wrote:
> On Mon, Feb 20, 2017 at 2:09 AM, Simon Riggs  wrote:
> > On 15 February 2017 at 15:46, Robert Haas  wrote:
> >>> It leaves me asking what else is missing.
> >>
> >> There is certainly a lot of room for improvement here but I don't
> >> understand your persistent negativity about what's been done thus far.
> >> I think it's pretty clearly a huge step forward, and I think Amit
> >> deserves a ton of credit for making it happen.  The improvements in
> >> bulk loading performance alone are stupendous.  You apparently have
> >> the idea that somebody could have written an even larger patch that
> >> solved even more problems at once, but this was already a really big
> >> patch, and IMHO quite a good one.
> >
> > Please explain these personal comments against me.
> 
> Several of your emails, including your first post to this thread,
> seemed to me to be quite negative about the state of this feature.  I
> don't think that's warranted, though perhaps I am misreading your
> tone, as I have been known to do.  I also don't think that expressing
> the opinion that the feature is better than you're giving it credit
> for is a personal comment against you.  Where exactly do you see a
> personal comment against you in what I wrote?

I have to admit my reaction was similar to Simon's, meaning that the
lack of docs is a problem, and that the limitations are kind of a
surprise, and I wonder what other surprises there are.

I am thinking this is a result of small teams, often from the same
company, working on features in isolation and then making them public.
It is often not clear what decisions were made and why.  The idea that
unique indexes on a parent table can't guarantee uniqueness across child
tables is both a surprise, and obvious once stated.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] Documentation improvements for partitioning

2017-02-20 Thread Bruce Momjian
On Wed, Feb 15, 2017 at 12:08:19PM -0500, Robert Haas wrote:
> On Wed, Feb 15, 2017 at 11:34 AM, Alvaro Herrera
>  wrote:
> > I think new-style partitioning is supposed to consider each partition as
> > an implementation detail of the table; the fact that you can manipulate
> > partitions separately does not really mean that they are their own
> > independent object.  You don't stop to think "do I really want to drop
> > the TOAST table attached to this main table?" and attach a CASCADE
> > clause if so.  You just drop the main table, and the toast one is
> > dropped automatically.  I think new-style partitions should behave
> > equivalently.
> 
> That's a reasonable point of view.  I'd like to get some more opinions
> on this topic.  I'm happy to have us do whatever most people want, but
> I'm worried that having table inheritance and table partitioning work
> differently will create confusion.  I'm also suspicious that there
> may be some implementation difficulties.  On the other hand, it does seem a
> little silly to say that DROP TABLE partitioned_table should always
> fail except in the degenerate case where there are no partitions, so
> maybe changing it is for the best.

I think we have a precedent for this, and that is the SERIAL data type,
which is really just a macro on top of CREATE SEQUENCE and DEFAULT
nextval() using the sequence.  You can manipulate the sequence and the
DEFAULT separately, but if you drop the table the sequence is dropped
automatically.

This seems like an instructive example of how we have bundled behavior
together in the past in a logical and easy-to-understand way.  Of
course, there might be some technical limitations that prevent us from
using this approach, and I would be interested in hearing those details.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-20 Thread Peter Geoghegan
On Thu, Feb 16, 2017 at 10:41 AM, Robert Haas  wrote:
>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>> half-dead pages. So skipping an index vacuum might mean that second
>> index scan never happens at all, which would be bad.
>
> Maybe.  If there are a tiny number of those half-dead pages in a huge
> index, it probably doesn't matter.  Also, I don't think it would never
> happen, unless the table just never gets any more updates or deletes -
> but that case could also happen today.  It's just a matter of
> happening less frequently.
>
> I guess the question is whether the accumulation of half-dead pages in
> the index could become a problem before the unsetting of all-visible
> bits in the heap becomes a problem.  If the second one always happens
> first, then we don't have an issue here, but if it's possible for the
> first one to become a big problem before the second one gets to be a
> serious issue, then we need something more sophisticated.

Not getting to a second VACUUM where you might have otherwise can only
be a problem to the extent that users are sensitive to not reclaiming
disk space from indexes at the level of the FSM. It's not accurate to
say that pages could be left "half dead" indefinitely by this patch,
since that is something that lasts only as long as the first phase of
B-Tree page deletion. In fact, the only possible problem is that pages
are recyclable in principle, but that doesn't happen due to this new
GUC.

That isn't analogous to heap bloat at all, because it's not as if
there are any downlinks or right links or left links pointing to the
recyclable (fully deleted) pages; the previous key space *has* in fact
been *fully* reclaimed. These pages are fully dead, and as such are
out of the critical path of index scans entirely once the second phase
finishes. (They only need to continue to physically exist because old
index scans might follow a stale pointer).

Note that there is an interlock against RecentGlobalXmin respected by
VACUUM, that prevents this sort of recycling. I suspect that the
restrictions on page deletion, as opposed to page recycling, are vastly
more likely to cause pain to users, and that's not made any worse by
this.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-20 Thread Simon Riggs
On 20 February 2017 at 10:27, Amit Kapila  wrote:
> On Mon, Feb 20, 2017 at 3:01 PM, Simon Riggs  wrote:
>> On 20 February 2017 at 09:15, Amit Kapila  wrote:
>>> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada  
>>> wrote:
 On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas  wrote:
> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs  
> wrote:
>> On 15 February 2017 at 08:07, Masahiko Sawada  
>> wrote:
>>> It's a bug. Attached latest version patch, which passed make check.
>>
>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>> half-dead pages. So skipping an index vacuum might mean that second
>> index scan never happens at all, which would be bad.
>
> Maybe.  If there are a tiny number of those half-dead pages in a huge
> index, it probably doesn't matter.  Also, I don't think it would never
> happen, unless the table just never gets any more updates or deletes -
> but that case could also happen today.  It's just a matter of
> happening less frequently.

>>>
>>> Yeah, that's right, and I am not sure it is worth performing a
>>> complete pass to reclaim dead/deleted pages unless we somehow know
>>> that there are many such pages.
>>
>> Agreed which is why
>> On 16 February 2017 at 11:17, Simon Riggs  wrote:
>>> I suggest that we store the number of half-dead pages in the metapage
>>> after each VACUUM, so we can decide whether to skip the scan or not.
>>
>>
>>> Also, I think we do reclaim the
>>> complete page while allocating a new page in btree.
>>
>> That's not how it works according to the README at least.
>>
>
> I am referring to code (_bt_getbuf()->if (_bt_page_recyclable(page))),
> won't that help us in reclaiming the space?

Not unless the README is incorrect, no.

That section of code is just a retest of pages retrieved from FSM;
they aren't even added there until two scans have occurred and even
then it may not be possible to recycle.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SERIALIZABLE with parallel query

2017-02-20 Thread Thomas Munro
On Thu, Feb 16, 2017 at 6:19 PM, Thomas Munro
 wrote:
> Specifically, DeleteChildTargetLocks() assumes it can walk
> MySerializableXact->predicateLocks and throw away locks that are
> covered by a new lock (ie throw away tuple locks because a covering
> page lock has been acquired) without let or hindrance until it needs
> to modify the locks themselves.  That assumption doesn't hold up with
> that last patch and will require a new kind of mutual exclusion.  I
> wonder if the solution is to introduce an LWLock into each
> SERIALIZABLEXACT object, so DeleteChildTargetLocks() can prevent
> workers from stepping on each others' toes during lock cleanup.  An
> alternative would be to start taking SerializablePredicateLockListLock
> in exclusive rather than shared mode, but that seems unnecessarily
> coarse.

Here is a patch to do that, for discussion.  It adds an LWLock to each
SERIALIZABLEXACT, and acquires it after SerializablePredicateLockListLock
and before any predicate lock partition lock.  It doesn't bother with
that if not in parallel mode, or in the cases where
SerializablePredicateLockListLock is held exclusively.  This prevents
parallel query workers and the leader from stepping on each others' toes
when manipulating the predicate list.

The case in CheckTargetForConflictsIn is theoretical for now since we
don't support writing in parallel query yet.  The case in
CreatePredicateLock is reachable by running a simple parallel
sequential scan.  The case in DeleteChildTargetLocks is for when we've
acquired a new predicate lock that covers finer grained locks which
can be dropped; that is reachable the same way again.  I don't think
it's required in ReleaseOneSerializableXact since it was already
called in several places with an sxact other than the caller's, and
deals with finished transactions.

-- 
Thomas Munro
http://www.enterprisedb.com


ssi-parallel-v3.patch
Description: Binary data



Re: [HACKERS] Replication vs. float timestamps is a disaster

2017-02-20 Thread Tom Lane
Robert Haas  writes:
> On Mon, Feb 20, 2017 at 7:37 PM, Tom Lane  wrote:
>> The question to be asked is whether there is still anybody out there
>> using float timestamps.  I'm starting to get dubious about it myself.

> I'm wondering if it has any effect that pg_config.h.win32 says

> /* Define to 1 if you want 64-bit integer timestamp and interval support.
>(--enable-integer-datetimes) */
> /* #undef USE_INTEGER_DATETIMES */

> Whereas pg_config.h.win32 says:

> /* Define to 1 if you want 64-bit integer timestamp and interval support.
>(--enable-integer-datetimes) */
> #define USE_INTEGER_DATETIMES 1

Er, what?  For me, grep finds

src/include/pg_config.h.in: 836: #undef USE_INTEGER_DATETIMES
src/include/pg_config.h.win32: 630: /* #undef USE_INTEGER_DATETIMES */

> It looks like it was commit 2169e42bef9db7e0bdd1bea00b81f44973ad83c8
> that enabled integer datetimes by default, but that commit seems to
> not to have touched the Windows build scripts.  Commit
> fcf053d7829f2d83829256153e856f9a36c83ffd changed MSVC over to use
> integer datetimes by default, but I'm not clear if there's any build
> environment where we rely on config.h.win32 but not Solution.pm?

Any such build would find itself without a defined value of BLCKSZ,
to mention just one of the settings that get appended by Solution.pm.

It does look like we expect pg_config.h.win32 to be usable standalone
for libpq-only Windows compiles, but it would never work for building
the backend.  (I dunno whether the libpq-only scripts still work at
all or have bitrotted, but it's irrelevant for the question at hand.)

> If not, what exactly is pg_config.h.win32 for and to what degree does it
> need to be in sync with pg_config.h.in?  The list of differences
> appears to be far more extensive than the header comment at the top of
> pg_config.h.win32 would lead one to believe.

Yeah, I think the maintenance of pg_config.h.win32 has been a bit more
haphazard than one could wish.

regards, tom lane




Re: [HACKERS] Parallel bitmap heap scan

2017-02-20 Thread Tom Lane
Robert Haas  writes:
> On Mon, Feb 20, 2017 at 6:19 AM, Tom Lane  wrote:
>> The thing that you really have to worry about for this kind of proposal
>> is "what if the query errors out and we never get to ExecEndNode"?
>> It's particularly nasty if you're talking about parallel queries where
>> maybe only one or some of the processes involved detect an error.

> I think that's not actually a problem, because we've already got code
> to make sure that all DSM resources associated with the query get
> blown away in that case.  Of course, that code might have bugs, but if
> it does, I think it's better to try to fix those bugs than to insert
> some belt-and-suspenders mechanism for reclaiming every possible chunk
> of memory in retail fashion, just like we blow up es_query_cxt rather
> than trying to pfree allocations individually.

Actually, I think we're saying the same thing: rely on the general DSM
cleanup mechanism, don't insert extra stuff that you expect will get
done by executor shutdown.

regards, tom lane




Re: [HACKERS] fd,c just Assert()s that lseek() succeeds

2017-02-20 Thread Tom Lane
I wrote:
> I noticed while researching bug #14555 that fd.c contains two separate
> cases like
>   vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
>   Assert(vfdP->seekPos != (off_t) -1);
> This seems, um, unwise.  It might somehow fail to fail in production
> builds, because elsewhere it's assumed that -1 means FileUnknownPos,
> but it seems like reporting the error would be a better plan.

I looked at things more closely and decided that this was really pretty
seriously broken.  Attached is a patch that attempts to spruce things up.

In LruDelete, it's more or less okay to not throw an error, mainly because
that's likely to get called during error abort and so throwing an error
would probably just lead to an infinite loop.  But by the same token,
throwing an error from the close() right after that is ill-advised, not to
mention that it would've left the LRU state corrupted since we'd already
unlinked the VFD.  I also noticed that really, most of the time, we should
know the current seek position and it shouldn't be necessary to do an
lseek here at all.  In the attached, if we don't have a seek position and
an lseek attempt doesn't give us one, we'll close the file but then
subsequent re-open attempts will fail (except in the somewhat-unlikely
case that a FileSeek(SEEK_SET) call comes between and allows us to
re-establish a known seek position).

Meanwhile, having an Assert instead of an honest test in LruInsert is
really dangerous: if that lseek failed, a subsequent read or write would
read or write from the start of the file, not where the caller expected,
leading to data corruption.

I also fixed a number of other places that were being sloppy about
behaving correctly when the seekPos is unknown.

Also, I changed FileSeek to return -1 with EINVAL for the cases where it
detects a bad offset, rather than throwing a hard elog(ERROR).  It seemed
pretty inconsistent that some bad-offset cases would get a failure return
while others got elog(ERROR).  It was missing an offset validity check for
the SEEK_CUR case on a closed file, too.

I'm inclined to treat this as a bug fix and back-patch it.  lseek()
failure on a valid FD is certainly unlikely, but if it did happen our
behavior would be pretty bad.  I'm also suspicious that some of these bugs
could be exposed even without an lseek() failure, because of the code in
FileRead and FileWrite that will reset seekPos to "unknown" after an
error.

regards, tom lane

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index ce4bd0f..1065bc1 100644
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
*** int			max_safe_fds = 32;	/* default if n
*** 160,166 
--- 160,173 
  
  #define FileIsNotOpen(file) (VfdCache[file].fd == VFD_CLOSED)
  
+ /*
+  * Note: a VFD's seekPos is normally always valid, but if for some reason
+  * an lseek() fails, it might become set to FileUnknownPos.  We can struggle
+  * along without knowing the seek position in many cases, but in some places
+  * we have to fail if we don't have it.
+  */
  #define FileUnknownPos ((off_t) -1)
+ #define FilePosIsUnknown(pos) ((pos) < 0)
  
  /* these are the assigned bits in fdstate below: */
  #define FD_TEMPORARY		(1 << 0)	/* T = delete when closed */
*** typedef struct vfd
*** 174,180 
  	File		nextFree;		/* link to next free VFD, if in freelist */
  	File		lruMoreRecently;	/* doubly linked recency-of-use list */
  	File		lruLessRecently;
! 	off_t		seekPos;		/* current logical file position */
  	off_t		fileSize;		/* current size of file (0 if not temporary) */
  	char	   *fileName;		/* name of file, or NULL for unused VFD */
  	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
--- 181,187 
  	File		nextFree;		/* link to next free VFD, if in freelist */
  	File		lruMoreRecently;	/* doubly linked recency-of-use list */
  	File		lruLessRecently;
! 	off_t		seekPos;		/* current logical file position, or -1 */
  	off_t		fileSize;		/* current size of file (0 if not temporary) */
  	char	   *fileName;		/* name of file, or NULL for unused VFD */
  	/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
*** LruDelete(File file)
*** 967,985 
  
  	vfdP = &VfdCache[file];
  
! 	/* delete the vfd record from the LRU ring */
! 	Delete(file);
! 
! 	/* save the seek position */
! 	vfdP->seekPos = lseek(vfdP->fd, (off_t) 0, SEEK_CUR);
! 	Assert(vfdP->seekPos != (off_t) -1);
  
  	/* close the file */
  	if (close(vfdP->fd))
! 		elog(ERROR, "could not close file \"%s\": %m", vfdP->fileName);
! 
! 	--nfile;
  	vfdP->fd = VFD_CLOSED;
  }
  
  static void
--- 974,1003 
  
  	vfdP = &VfdCache[file];
  
! 	/*
! 	 * Normally we should know the seek position, but if for some reason we
! 	 * have lost track of it, try again to get it.  If we still can't get it,
! 	 * we have a problem: we will be unable to restore the file seek position
! 	 * when and if the file is 

[HACKERS] possible encoding issues with libxml2 functions

2017-02-20 Thread Pavel Stehule
Hi

Today I played with xml_recv function and with xml processing functions.

The xml_recv function ensures correct conversion from the document encoding to
the server encoding. But the decl section still holds the original encoding
info, which is obsolete after the conversion. Sometimes we solve this issue by
removing the decl section - see the xml_out function.

Sometimes we don't do it - a lot of functions use direct conversion from
xmltype to xmlChar. A wrong encoding in the decl section can break the libxml2
parser with an error:

ERROR:  could not parse XML document
DETAIL:  input conversion failed due to input error, bytes 0x88 0x3C 0x2F
0x72
line 1: switching encoding: encoder error

This error is not frequent, but it is hard to find, because there is only a
small but important difference between the printed XML and the XML actually used.

There are two possible fixes:

a) clean the decl on input - the encoding info can be removed from the decl part

b) use xml_out_internal everywhere before the transformation to
xmlChar; pg_xmlCharStrndup could be a good candidate.

Comments?

Regards

Pavel


Re: [HACKERS] DROP SUBSCRIPTION and ROLLBACK

2017-02-20 Thread Fujii Masao
On Thu, Feb 16, 2017 at 11:40 AM, Masahiko Sawada  wrote:
> On Thu, Feb 16, 2017 at 7:52 AM, Petr Jelinek
>  wrote:
>> On 15/02/17 06:43, Masahiko Sawada wrote:
>>> On Tue, Feb 14, 2017 at 1:13 AM, Fujii Masao  wrote:
 On Mon, Feb 13, 2017 at 4:05 PM, Masahiko Sawada  
 wrote:
> On Sat, Feb 11, 2017 at 8:18 PM, Petr Jelinek
>  wrote:
>> On 10/02/17 19:55, Masahiko Sawada wrote:
>>> On Thu, Feb 9, 2017 at 12:44 AM, Petr Jelinek
>>>  wrote:
 On 08/02/17 07:40, Masahiko Sawada wrote:
> On Wed, Feb 8, 2017 at 9:01 AM, Michael Paquier
>  wrote:
>> On Wed, Feb 8, 2017 at 1:30 AM, Fujii Masao  
>> wrote:
>>> On Wed, Feb 8, 2017 at 12:26 AM, Petr Jelinek
>>>  wrote:
 For example what happens if apply crashes during the DROP
 SUBSCRIPTION/COMMIT and is not started because the delete from 
 catalog
 is now visible so the subscription is no longer there?
>>>
>>> Another idea is to treat DROP SUBSCRIPTION in the same way as 
>>> VACUUM, i.e.,
>>> make it emit an error if it's executed within user's transaction 
>>> block.
>>
>> It seems to me that this is exactly Petr's point: using
>> PreventTransactionChain() to prevent things to happen.
>
> Agreed. It's better to prevent it from being executed inside a user
> transaction block. And I understood there are too many failure scenarios
> we need to handle.
>
>>> Also DROP SUBSCRIPTION should call CommitTransactionCommand() just
>>> after removing the entry from pg_subscription, then connect to the 
>>> publisher
>>> and remove the replication slot.
>>
>> For consistency that may be important.
>
> Agreed.
>
> Attached patch, please give me feedback.
>

 This looks good (and similar to what initial patch had btw). Works fine
 for me as well.

 Remaining issue is, what to do about CREATE SUBSCRIPTION then, there 
 are
 similar failure scenarios there, should we prevent it from running
 inside transaction as well?

>>>
>>> Hmm, on further thought I have doubts about the approach being discussed.
>>> For example, please imagine the case where CREATE SUBSCRIPTION creates the
>>> subscription successfully but fails to create the replication slot for
>>> whatever reason, and then DROP SUBSCRIPTION drops the subscription but
>>> dropping the replication slot fails. In such a case, CREATE SUBSCRIPTION
>>> and DROP SUBSCRIPTION return ERROR but the subscription is created and
>>> dropped successfully. I think that this behaviour confuses the user.
>>>
>>> I think we should just prevent calling DROP SUBSCRIPTION in user's
>>> transaction block. Or I guess that it could be better to separate the
>>> starting/stopping logical replication from subscription management.
>>>
>>
>> We need to stop the replication worker(s) in order to be able to drop
>> the slot. There is no such issue with startup of the worker as that one
>> is launched by launcher after the transaction has committed.
>>
>> IMO best option is to just don't allow DROP/CREATE SUBSCRIPTION inside a
>> transaction block and don't do any commits inside of those (so that
>> there are no rollbacks, which solves your initial issue I believe). That
>> way failure to create/drop slot will result in subscription not being
>> created/dropped which is what we want.

 On second thought, +1.

> I basically agree this option, but why do we need to change CREATE
> SUBSCRIPTION as well?

 Because the window between the creation of the replication slot and the
 transaction commit of CREATE SUBSCRIPTION should be short. Otherwise, if any
 error happens during that window, the replication slot unexpectedly remains
 while there is no corresponding subscription. Of course, even if we prevent
 CREATE SUBSCRIPTION from being executed within a user's transaction block,
 there is still such a window. But we can reduce the possibility of that
 problem.
>>>
>>> Thank you for the explanation. I understood and agree.
>>>
>>> I think we should disallow to call ALTER SUBSCRIPTION inside a user's
>>> transaction block as well.
>>
>> Why? ALTER SUBSCRIPTION does not create/drop anything on remote server ever.
>>
>
> Hmm you're right. ALTER SUBSCRIPTION can support transaction. Attached
> fixed version patch.

We should disallow CREATE/DROP SUBSCRIPTION inside a user transaction
block only when CREATE/DROP SLOT option 

Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-20 Thread Masahiko Sawada
On Mon, Feb 20, 2017 at 6:15 PM, Amit Kapila  wrote:
> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada  
> wrote:
>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas  wrote:
>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs  wrote:
 On 15 February 2017 at 08:07, Masahiko Sawada  
 wrote:
> It's a bug. Attached latest version patch, which passed make check.

 In its current form, I'm not sure this is a good idea. Problems...

 1. I'm pretty sure the world doesn't need another VACUUM parameter

 I suggest that we use the existing vacuum scale factor/4 to reflect
 that indexes are more sensitive to bloat.
>>>
>>> I do not think it's a good idea to control multiple behaviors with a
>>> single GUC.  We don't really know that dividing by 4 will be right for
>>> everyone, or even for most people.  It's better to have another
>>> parameter with a sensible default than to hardcode a ratio that might
>>> work out poorly for some people.
>>>
 2. The current btree vacuum code requires 2 vacuums to fully reuse
 half-dead pages. So skipping an index vacuum might mean that second
 index scan never happens at all, which would be bad.
>>>
>>> Maybe.  If there are a tiny number of those half-dead pages in a huge
>>> index, it probably doesn't matter.  Also, I don't think it would never
>>> happen, unless the table just never gets any more updates or deletes -
>>> but that case could also happen today.  It's just a matter of
>>> happening less frequently.
>>
>
> Yeah, that's right, and I am not sure it is worth performing a
> complete pass to reclaim dead/deleted pages unless we somehow know
> that there are many such pages.  Also, I think we do reclaim the
> complete page while allocating a new page in btree.
>
>> The half-dead pages are never cleaned up if the ratio of pages
>> containing garbage is always lower than threshold.
>>
>
> Which threshold are you referring here?
>

I meant the new parameter in current patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-20 Thread Simon Riggs
On 20 February 2017 at 17:32, Robert Haas  wrote:

>>> Have you checked whether this
>>> patch makes any noticeable performance difference?
>>
>> No, but then we're reducing the number of calls to PgXact directly;
>> there is no heuristic involved; it's just a pure saving.
>
> Well, it's adding a branch where there wasn't one.

A branch that is avoided in almost all cases, so easy to predict.

> Maybe that costs
> essentially nothing and the saved write to shared memory saves
> something noticeable, but for all I know it's the reverse.  If I had
> to guess, it would be that neither the costs nor the savings from this
> are in the slightest way noticeable on a macrobenchmark, and therefore
> there's not much point in changing it, but that could be 100% wrong.

Given Andres' earlier measurements, it seems worth testing to me.



Hopefully someone can recheck. Thanks in advance.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Parallel bitmap heap scan

2017-02-20 Thread Robert Haas
On Mon, Feb 20, 2017 at 6:19 AM, Tom Lane  wrote:
> Thomas Munro  writes:
>> One practical problem that came up was the need for executor nodes to
>> get a chance to do that kind of cleanup before the DSM segment is
>> detached.  In my patch series I introduced a new node API
>> ExecNodeDetach to allow for that.  Andres objected that the need for
>> that is evidence that the existing protocol is broken and should be
>> fixed instead.  I'm looking into that.
>
> The thing that you really have to worry about for this kind of proposal
> is "what if the query errors out and we never get to ExecEndNode"?
> It's particularly nasty if you're talking about parallel queries where
> maybe only one or some of the processes involved detect an error.

I think that's not actually a problem, because we've already got code
to make sure that all DSM resources associated with the query get
blown away in that case.  Of course, that code might have bugs, but if
it does, I think it's better to try to fix those bugs than to insert
some belt-and-suspenders mechanism for reclaiming every possible chunk
of memory in retail fashion, just like we blow up es_query_cxt rather
than trying to pfree allocations individually.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel bitmap heap scan

2017-02-20 Thread Robert Haas
On Mon, Feb 20, 2017 at 5:55 AM, Thomas Munro
 wrote:
> On Sun, Feb 19, 2017 at 9:59 PM, Dilip Kumar  wrote:
>> So basically, what I want to propose is that Only during
>> ExecReScanBitmapHeapScan we can free all the DSA pointers because at
>> that time we can be sure that all the workers have completed there
>> task and we are safe to free. (And we don't free any DSA memory at
>> ExecEndBitmapHeapScan).
>
> I think this works.

OK.

> Some hand-wavy thoughts on this topic in the context of hash joins:
>
> The argument for cleaning up sooner rather than later would be that it
> could reduce the total peak memory usage of large execution plans.  Is
> that a reasonable goal and can we achieve it?  I suspect the answer is
> yes in theory but no in practice, and we don't even try to achieve it
> in non-parallel queries as far as I know.

We're pretty stupid about causing nodes to stop eating up resources as
early as we could; for example, when a Limit is filled, we don't make
any attempt to have scans underneath it release pins or memory or
anything else.  But we don't usually let the same node consume memory
multiple times. ExecReScanBitmapHeapScan frees all of the memory used
for the previous bitmap in the non-parallel case, so it should
probably do that in the parallel case also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-20 Thread Robert Haas
On Mon, Feb 20, 2017 at 10:49 PM, Simon Riggs  wrote:
>> Regarding reduce_pgxact_access_AtEOXact.v1.patch, it took me a few
>> minutes to figure out that the comment was referring to
>> ProcArrayEndTransaction(), so it might be good to be more explicit
>> about that if we go forward with this.
>
> Sure, attached.

Looks better, thanks.

>> Have you checked whether this
>> patch makes any noticeable performance difference?
>
> No, but then we're reducing the number of calls to PgXact directly;
> there is no heuristic involved; it's just a pure saving.

Well, it's adding a branch where there wasn't one.  Maybe that costs
essentially nothing and the saved write to shared memory saves
something noticeable, but for all I know it's the reverse.  If I had
to guess, it would be that neither the costs nor the savings from this
are in the slightest way noticeable on a macrobenchmark, and therefore
there's not much point in changing it, but that could be 100% wrong.

>> It's sure
>> surprising that we go to all of this trouble to clean things up in
>> AtEOXact_Snapshot() when we've already nuked MyPgXact->xmin from
>> orbit.  (Instead of changing AtEOXact_Snapshot, should we think about
>> removing the xid clear logic from ProcArrayEndTransaction and only
>> doing it here, or would that be wrong-headed?)
>
> If anything, I'd move the call to PgXact->xmin = InvalidTransactionId
> into a function inside procarray.c, so we only touch snapshots in
> snapmgr.c and all procarray stuff is isolated. (Not done here, yet).

I'm not convinced that really buys us anything except more
function-call overhead.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Add pg_disable_checksums() and supporting infrastructure

2017-02-20 Thread David Christensen

> On Feb 19, 2017, at 8:14 PM, Jim Nasby  wrote:
> 
> On 2/19/17 11:02 AM, David Christensen wrote:
>> My design notes for the patch were submitted to the list with little 
>> comment; see: 
>> https://www.postgresql.org/message-id/1E6E64E9-634B-43F4-8AA2-CD85AD92D2F8%40endpoint.com
>> 
>> I have since added the WAL logging of checksum states, however I’d be glad 
>> to take feedback on the other proposed approaches (particularly the system 
> catalog changes + the concept of a checksum cycle).
> 
> A couple notes:
> 
> - AFAIK unlogged tables get checksummed today if checksums are enabled; the 
> same should hold true if someone enables checksums on the whole cluster.

Agreed; AFAIK this should already be working if it’s using the PageIsVerified() 
API, since I just effectively modified the logic there, depending on state.

> - Shared relations should be handled as well; you don't mention them.

I agree that they should be handled as well; I thought I had mentioned it later 
in the design doc, though TBH I’m not sure if there is more involved than just 
visiting the global relations in pg_class.  In addition we need to visit all 
forks of each relation.  I’m not certain if loading those into shared_buffers 
would be sufficient; I think so, but I’d be glad to be informed otherwise.  (As 
long as they’re already using the PageIsVerified() API call they get this logic 
for free.)

> - If an entire cluster is going to be considered as checksummed, then even 
> databases that don't allow connections would need to get enabled.

Yeah, the workaround for now would be to require "datallowconn" to be set to
't' for all databases before proceeding, unless there's a way to connect to
those databases internally that bypasses that check.  Open to ideas, though for
a first pass the "datallowconn" approach seems like the least amount of work.

> I like the idea of revalidation, but I'd suggest leaving that off of the 
> first pass.

Yeah, agreed.

> It might be easier on a first pass to look at supporting per-database 
> checksums (in this case, essentially treating shared catalogs as their own 
> database). All normal backends do per-database stuff (such as setting 
> current_database) during startup anyway. That doesn't really help for things 
> like recovery and replication though. :/ And there's still the question of 
> SLRUs (or are those not checksum'd today??).

So you're suggesting that the data_checksums GUC be set in a per-database
context, so once it's fully enabled in a specific database that database is
treated as in an enforcing state, even if the rest of the cluster hasn't
completed?  Hmm, I might think on that a bit, but it seems pretty
straightforward.

What issues do you see affecting replication and recovery specifically (other
than the entire cluster not being complete)?  Since the checksum changes are
WAL-logged, it seems you'd be no worse for wear on a standby if you had to
change things.

> BTW, it occurs to me that this is related to the problem we have with trying 
> to make changes that break page binary compatibility. If we had a method for 
> handling that it would probably be useful for enabling checksums as well. 
> You'd essentially treat an un-checksum'd page as if it was an "old page 
> version". The biggest problem there is dealing with the potential that the 
> new page needs to be larger than the old one was, but maybe there's some 
> useful progress to be had in this area before tackling the "page too small" 
> problem.

I agree it’s very similar; my issue is I don’t want to have to postpone 
handling a specific case for some future infrastructure.  That being said, what 
I could see being a general case is the piece which basically walks all pages 
in the cluster; as long as the page checksum/format validation happens at Page 
read time, we could do page version checks/conversions at the same time, and 
the only special code we’d need is to keep track of which files still need to 
be visited and how to minimize the impact on the cluster via some sort of 
throttling/leveling.  (It’s also similar to vacuum in that way, however we have 
been going out of our way to make vacuum smart enough to *avoid* visiting every 
page, so I think it is a different enough use case that we can’t tie the two 
systems together, nor do I feel like taking that project on.)

We could call the checksum_cycle something else (page_walk_cycle?  bikeshed 
time!) and basically have the infrastructure to initiate online/gradual 
conversion/processing of all pages for free.

Thoughts?

David
--
David Christensen
End Point Corporation
da...@endpoint.com
785-727-1171







Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-20 Thread Simon Riggs
On 20 February 2017 at 16:53, Robert Haas  wrote:
> On Mon, Feb 20, 2017 at 6:02 PM, Simon Riggs  wrote:
>> On 15 February 2017 at 19:15, Andres Freund  wrote:
>>
>>> I think I previously
>>> mentioned, even just removing the MyPgXact->xmin assignment in
>>> SnapshotResetXmin() is measurable performance wise and cache-hit ratio
>>> wise.
>>
>> Currently, we issue SnapshotResetXmin() pointlessly at end of xact, so
>> patch attached to remove that call, plus some comments to explain
>> that. This reduces the cause.
>>
>> Also, another patch to reduce the calls to SnapshotResetXmin() using a
>> simple heuristic to reduce the effects.
>
> I think skip_SnapshotResetXmin_if_idle_timeout.v1.patch isn't a good
> idea, because it could have the surprising result that setting
> idle_in_transaction_timeout to a non-zero value makes bloat worse.  I
> don't think users will like that.
>
> Regarding reduce_pgxact_access_AtEOXact.v1.patch, it took me a few
> minutes to figure out that the comment was referring to
> ProcArrayEndTransaction(), so it might be good to be more explicit
> about that if we go forward with this.

Sure, attached.

> Have you checked whether this
> patch makes any noticeable performance difference?

No, but then we're reducing the number of calls to PgXact directly;
there is no heuristic involved, it's just a pure saving.

> It's sure
> surprising that we go to all of this trouble to clean things up in
> AtEOXact_Snapshot() when we've already nuked MyPgXact->xmin from
> orbit.  (Instead of changing AtEOXact_Snapshot, should we think about
> removing the xid clear logic from ProcArrayEndTransaction and only
> doing it here, or would that be wrong-headed?)

If anything, I'd move the call to PgXact->xmin = InvalidTransactionId
into a function inside procarray.c, so we only touch snapshots in
snapmgr.c and all procarray stuff is isolated. (Not done here, yet).

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


reduce_pgxact_access_AtEOXact.v2.patch
Description: Binary data



Re: [HACKERS] Replication vs. float timestamps is a disaster

2017-02-20 Thread Robert Haas
On Mon, Feb 20, 2017 at 7:37 PM, Tom Lane  wrote:
> The question to be asked is whether there is still anybody out there
> using float timestamps.  I'm starting to get dubious about it myself.
> Certainly, no packager that I'm aware of has shipped a float-timestamp
> build since we switched the default in 8.4.  Maybe there is somebody
> who's faithfully built a float-timestamp custom build every year so they
> can pg_upgrade in place from their 8.3 installation, but there have got
> to be darn few such people.

I'm wondering if it has any effect that pg_config.h.win32 says

/* Define to 1 if you want 64-bit integer timestamp and interval support.
   (--enable-integer-datetimes) */
/* #undef USE_INTEGER_DATETIMES */

Whereas pg_config.h says:

/* Define to 1 if you want 64-bit integer timestamp and interval support.
   (--enable-integer-datetimes) */
#define USE_INTEGER_DATETIMES 1

It looks like it was commit 2169e42bef9db7e0bdd1bea00b81f44973ad83c8
that enabled integer datetimes by default, but that commit seems to
not to have touched the Windows build scripts.  Commit
fcf053d7829f2d83829256153e856f9a36c83ffd changed MSVC over to use
integer datetimes by default, but I'm not clear if there's any build
environment where we rely on config.h.win32 but not Solution.pm?  If
not, what exactly is pg_config.h.win32 for and to what degree does it
need to be in sync with pg_config.h.in?  The list of differences
appears to be far more extensive than the header comment at the top of
pg_config.h.win32 would lead one to believe.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-20 Thread Robert Haas
On Mon, Feb 20, 2017 at 6:02 PM, Simon Riggs  wrote:
> On 15 February 2017 at 19:15, Andres Freund  wrote:
>
>> I think I previously
>> mentioned, even just removing the MyPgXact->xmin assignment in
>> SnapshotResetXmin() is measurable performance wise and cache-hit ratio
>> wise.
>
> Currently, we issue SnapshotResetXmin() pointlessly at end of xact, so
> patch attached to remove that call, plus some comments to explain
> that. This reduces the cause.
>
> Also, another patch to reduce the calls to SnapshotResetXmin() using a
> simple heuristic to reduce the effects.

I think skip_SnapshotResetXmin_if_idle_timeout.v1.patch isn't a good
idea, because it could have the surprising result that setting
idle_in_transaction_timeout to a non-zero value makes bloat worse.  I
don't think users will like that.

Regarding reduce_pgxact_access_AtEOXact.v1.patch, it took me a few
minutes to figure out that the comment was referring to
ProcArrayEndTransaction(), so it might be good to be more explicit
about that if we go forward with this.  Have you checked whether this
patch makes any noticeable performance difference?  It's sure
surprising that we go to all of this trouble to clean things up in
AtEOXact_Snapshot() when we've already nuked MyPgXact->xmin from
orbit.  (Instead of changing AtEOXact_Snapshot, should we think about
removing the xid clear logic from ProcArrayEndTransaction and only
doing it here, or would that be wrong-headed?)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Checksums by default?

2017-02-20 Thread Tomas Vondra

On 02/11/2017 01:38 AM, Tomas Vondra wrote:


Incidentally, I've been dealing with a checksum failure reported by a
customer last week, and based on the experience I tend to agree that we
don't have the tools needed to deal with checksum failures. I think such
tooling should be a 'must have' for enabling checksums by default.

In this particular case the checksum failure is particularly annoying
because it happens during recovery (on a standby, after a restart),
during startup, so FATAL means shutdown.

I've managed to inspect the page in different way (dd and pageinspect
from another instance), and it looks fine - no obvious data corruption,
the only thing that seems borked is the checksum itself, and only three
consecutive bits are flipped in the checksum. So this doesn't seem like
a "stale checksum" - hardware issue is a possibility (the machine has
ECC RAM though), but it might just as easily be a bug in PostgreSQL,
when something scribbles over the checksum due to a buffer overflow,
just before we write the buffer to the OS. So 'false failures' are not
an entirely impossible thing.



Not to leave this without any resolution, it seems the issue has been 
caused by a SAN. Some configuration changes or something was being done 
at the time of the issue, and the SAN somehow ended up writing a page 
into a different relfilenode, into a different block. The page was from 
a btree index and got written into a heap relfilenode, but otherwise it 
was correct - the only thing that changed seems to be the block number, 
which explains the minimal difference in the checksum.


I don't think we'll learn much more, but it seems the checksums did 
their work in detecting the issue.



So I think we're not ready to enable checksums by default for everyone,
not until we can provide tools to deal with failures like this (I don't
think users will be amused if we tell them to use 'dd' and inspect the
pages in a hex editor).

ISTM the way forward is to keep the current default (disabled), but to
allow enabling checksums on the fly. That will mostly fix the issue for
people who actually want checksums but don't realize they need to enable
them at initdb time (and starting from scratch is not an option for
them), are running on good hardware and are capable of dealing with
checksum errors if needed, even without more built-in tooling.

Being able to disable checksums on the fly is nice, but it only really
solves the issue of extra overhead - it doesn't really help with the
failures (particularly when you can't even start the database because
of a checksum failure in the startup phase).

So, shall we discuss what tooling would be useful / desirable?



Although the checksums did detect the issue (we might never notice 
without them, or maybe the instance would mysteriously crash), I still 
think better tooling is needed.


I've posted some minor pageinspect improvements I hacked together while 
investigating this, but I don't think pageinspect is a very good tool 
for investigating checksum / data corruption issues, for a number of 
reasons:


1) It does not work at all when the instance does not even start - you 
have to manually dump the pages and try inspecting them from another 
instance.


2) Even then it assumes the pages are not corrupted, and may easily 
cause segfaults or other issues if that's not the case.


3) Works on a manual page-by-page basis.

4) It does not even try to resolve the issue somehow.

For example I think it'd be great to have a tool that works even on 
instances that are not running. For example, something that recursively 
walks through all files in a data directory, verifies checksums on 
everything, and lists/dumps pages with broken checksums for further 
inspection. I have an alpha-alpha version of something along those 
lines, written before the root cause was identified.


It'd be nice to have something that could help with fixing the issues 
(e.g. by fetching the last FPI from the backup, or so). But that's 
clearly way more difficult.


There are probably some other tools that might be useful when dealing 
with data corruption (e.g. scrubbing to detect it).



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2017-02-20 Thread Peter Moser
On Wed, Feb 15, 2017 at 9:33 PM, David G. Johnston
 wrote:
> On Wed, Feb 15, 2017 at 12:24 PM, Robert Haas  wrote:
>>
>> So it seems like an ALIGN or NORMALIZE option is kind of like a JOIN,
>> except apparently there's no join type and the optimizer can never
>> reorder these operations with each other or with other joins.  Is that
>> right?  The optimizer changes in this patch seem fairly minimal, so
>> I'm guessing it can't be doing anything very complex here.
>>
>> + * INPUT:
>> + * (r ALIGN s ON q WITH (r.ts, r.te, s.ts, s.te)) c
>> + * where q can be any join qualifier, and r.ts, r.te, s.ts, and s.te
>> + * can be any column name.
>> + *
>> + * OUTPUT:
>> + * (
>> + * SELECT r.*, GREATEST(r.ts, s.ts) P1, LEAST(r.te, s.te) P2
>> + *  FROM
>> + *  (
>> + * SELECT *, row_id() OVER () rn FROM r
>> + *  ) r
>> + *  LEFT OUTER JOIN
>> + *  s
>> + *  ON q AND r.ts < s.te AND r.te > s.ts
>> + *  ORDER BY rn, P1, P2
>> + *  ) c
>>
>> It's hard to see what's going on here.  What's ts?  What's te?  If you
>> used longer names for these things, it might be a bit more
>> self-documenting.
>
>
> Just reasoning out loud here...
>
> ISTM ts and te are "temporal [range] start" and "temporal [range] end" (or
> probably just the common "timestamp start/end")
>
> From what I can see it is affecting an intersection of the two ranges and,
> furthermore, splitting the LEFT range into sub-ranges that match up with the
> sub-ranges found on the right side.  From the example above this seems like
> it should be acting on self-normalized ranges - but I may be missing
> something by evaluating this out of context.
>
> r1 [1, 6] [ts, te] [time period start, time period end]
> s1 [2, 3]
> s2 [3, 4]
> s3 [5, 7]
>
> r LEFT JOIN s ON (r.ts < s.te AND r.te > s.ts)
>
> r1[1, 6],s1[2, 3] => [max(r.ts, s.ts),min(r.te, s.te)] => r1[1, 6],d[2, 3]
> r1[1, 6],s2[3, 4] => [max(t.ts, s.ts),min(r.te, s.te)] => r1[1, 6],d[3, 4]
> r1[1, 6],s3[5, 7] => [max(t.ts, s.ts),min(r.te, s.te)] => r1[1, 6],d[5, 6]
>
> Thus the intersection is [2,6] but since s1 has three ranges that begin
> between 2 and 6 (i.e., 2, 3, and 5) there are three output records that
> correspond to those sub-ranges.

Yes, this is what the internal rewriting produces for r1.
Note that till now we only support half-open ranges, i.e., [), but for
readability I will continue this example using closed ranges [].
The executor function (ExecTemporalAdjustment) gets this (the output above) as
the input and will then produce:

r1[1, 1]
r1[2, 3]
r1[3, 4]
r1[5, 6]

This means that for ALIGN, too, the non-overlapping parts are retained.

>
> The description in the OP basically distinguishes between NORMALIZE and
> ALIGN in that ALIGN, as described above, affects an INTERSECTION on the two
> ranges - discarding the non-overlapping data - while NORMALIZE performs the
> alignment while also retaining the non-overlapping data.

Also for ALIGN we retain the non-overlapping part.
Intersections are symmetric/commutative, so a subsequent outer join can then use
equality on the ranges
to produce join matches (for overlapping) as well as null-extend the
produced non-overlapping parts.

The difference between ALIGN and NORMALIZE is how they split, while ALIGN
produces intersections between pairs of tuples (used for joins) and the
non-overlapping parts, NORMALIZE produces intersections between groups of tuples
(used for aggregation, so that all tuples with the same group have equal
ranges) and the non-overlapping parts.

For instance, the NORMALIZE between r1, s1, s2, and s3 in your example above
would give the following:

r1[1, 1]
r1[2, 2]
r1[3, 3]
r1[4, 4]
r1[5, 6]

>
> The rest of the syntax seems to deal with selecting subsets of range records
> based upon attribute data.

Yes, exactly!


Best regards,
Anton, Johann, Michael, Peter




Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-02-20 Thread Amit Kapila
On Fri, Feb 17, 2017 at 7:21 PM, Mithun Cy  wrote:
>
> To implement such an incremental addition of bucket blocks I have to
> increase the size of array hashm_spares in meta page by four times.
> Also, mapping methods which map a total number of buckets to a
> split-point position of hashm_spares array, need to be changed. These
> changes create backward compatibility issues.
>

How will the highmask and lowmask calculations work in this new strategy?
Till now they have always worked on the doubling strategy, and I don't see
that you have changed anything related to that code.  Check the places below.

_hash_metapinit()
{
..
/*
* We initialize the index with N buckets, 0 .. N-1, occupying physical
* blocks 1 to N.  The first freespace bitmap page is in block N+1. Since
* N is a power of 2, we can set the masks this way:
*/
metap->hashm_maxbucket = metap->hashm_lowmask = num_buckets - 1;
metap->hashm_highmask = (num_buckets << 1) - 1;
..
}

_hash_expandtable()
{
..
if (new_bucket > metap->hashm_highmask)
{
/* Starting a new doubling */
metap->hashm_lowmask = metap->hashm_highmask;
metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
}
..
}

Till now, we have worked hard for not changing the page format in a
backward incompatible way, so it will be better if we could find some
way of doing this without changing the meta page format in a backward
incompatible way.  Have you considered storing some information in
shared memory based on which we can decide what percentage of
buckets is allocated in the current table half?  I think we might be able
to construct this information after crash recovery as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] powerpc(32) point/polygon regression failures on Debian Jessie

2017-02-20 Thread Christoph Berg
Re: Tom Lane 2017-02-20 <13825.1487607...@sss.pgh.pa.us>
> Still, it'd be worth comparing the assembly code for your test program.

I was compiling the program on jessie and on sid, and running the
jessie binary on sid made it output the same as the sid binary, so the
difference isn't in the binary, but in some system library.

Christoph
-- 
Senior Berater, Tel.: +49 2166 9901 187
credativ GmbH, HRB Mönchengladbach 12080, USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
pgp fingerprint: 5C48 FE61 57F4 9179 5970  87C6 4C5A 6BAB 12D2 A7AE




[HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)

2017-02-20 Thread Tomas Vondra

Hi,

while investigating some checksum-related issues, I needed to perform 
some forensics on a copy of a btree page (taken from another instance 
using 'dd').


But I've run into two pageinspect limitations, hopefully addressed by 
this patch:


1) bt_page_items(bytea) not defined

We have heap_page_items(bytea) but not bt_page_items(bytea). I suspect 
this is partially a historical API inconsistency, and partially due to the 
need to handle the btree metapage explicitly.


The original function simply threw an error for blkno=0; the new 
function checks for the BTP_META flag instead.


I believe this is sufficient, assuming an instance without data 
corruption (which pageinspect assumes anyway). With data corruption all 
bets are off anyway - for example the metapage might be written to a 
different block (essentially what I saw in the investigated issue). 
Actually, the flag check is better in this case - it detects the 
metapage, while the blkno=0 check fails to do that (leading to a crash).


2) page_checksum()

When everything is fine, you can do page_header() which also includes 
the checksum. When the checksum gets broken, you may still dump the page 
using 'dd+pg_read_binary_file' to see the header, but clearly that 
checksum is wrong - and it's interesting to see the correct one and 
compare it to the checksum in the header.


This function makes it possible - it accepts the bytea image of the 
page, and blkno (so it's possible to see how the block would look if it 
was written somewhere else, for example).



BTW I've noticed the pageinspect version is 1.6, but we only have 
pageinspect--1.5.sql (and an upgrade script to 1.6). Not sure whether 
that's entirely intentional?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 499a7bd9aea3032f03d787833c0501d9fa703271 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 13 Feb 2017 21:20:12 +0100
Subject: [PATCH] pageinspect - page_checksum and bt_page_items(bytea)

Adds two functions to the pageinspect extension:

1) page_checksum(bytea,int4) allows computing checksum for
arbitrary page, even if data checksums are not enabled

2) bt_page_items(bytea) is similar to heap_page_items(bytea)
---
 contrib/pageinspect/btreefuncs.c  | 209 +++---
 contrib/pageinspect/expected/btree.out|  13 ++
 contrib/pageinspect/pageinspect--1.5--1.6.sql |  22 +++
 contrib/pageinspect/rawpage.c |  37 +
 contrib/pageinspect/sql/btree.sql |   4 +
 doc/src/sgml/pageinspect.sgml |  58 +++
 src/include/access/nbtree.h   |   1 +
 7 files changed, 320 insertions(+), 24 deletions(-)

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index d50ec3a..93da844 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -39,8 +39,14 @@
 #include "utils/varlena.h"
 
 
+extern Datum bt_metap(PG_FUNCTION_ARGS);
+extern Datum bt_page_items(PG_FUNCTION_ARGS);
+extern Datum bt_page_items_bytea(PG_FUNCTION_ARGS);
+extern Datum bt_page_stats(PG_FUNCTION_ARGS);
+
 PG_FUNCTION_INFO_V1(bt_metap);
 PG_FUNCTION_INFO_V1(bt_page_items);
+PG_FUNCTION_INFO_V1(bt_page_items_bytea);
 PG_FUNCTION_INFO_V1(bt_page_stats);
 
 #define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
@@ -215,17 +221,31 @@ bt_page_stats(PG_FUNCTION_ARGS)
 		elog(ERROR, "return type must be a row type");
 
 	j = 0;
-	values[j++] = psprintf("%d", stat.blkno);
-	values[j++] = psprintf("%c", stat.type);
-	values[j++] = psprintf("%d", stat.live_items);
-	values[j++] = psprintf("%d", stat.dead_items);
-	values[j++] = psprintf("%d", stat.avg_item_size);
-	values[j++] = psprintf("%d", stat.page_size);
-	values[j++] = psprintf("%d", stat.free_size);
-	values[j++] = psprintf("%d", stat.btpo_prev);
-	values[j++] = psprintf("%d", stat.btpo_next);
-	values[j++] = psprintf("%d", (stat.type == 'd') ? stat.btpo.xact : stat.btpo.level);
-	values[j++] = psprintf("%d", stat.btpo_flags);
+	values[j] = palloc(32);
+	snprintf(values[j++], 32, "%d", stat.blkno);
+	values[j] = palloc(32);
+	snprintf(values[j++], 32, "%c", stat.type);
+	values[j] = palloc(32);
+	snprintf(values[j++], 32, "%d", stat.live_items);
+	values[j] = palloc(32);
+	snprintf(values[j++], 32, "%d", stat.dead_items);
+	values[j] = palloc(32);
+	snprintf(values[j++], 32, "%d", stat.avg_item_size);
+	values[j] = palloc(32);
+	snprintf(values[j++], 32, "%d", stat.page_size);
+	values[j] = palloc(32);
+	snprintf(values[j++], 32, "%d", stat.free_size);
+	values[j] = palloc(32);
+	snprintf(values[j++], 32, "%d", stat.btpo_prev);
+	values[j] = palloc(32);
+	snprintf(values[j++], 32, "%d", stat.btpo_next);
+	values[j] = palloc(32);
+	if (stat.type == 'd')
+		snprintf(values[j++], 32, "%d", stat.btpo.xact);
+	else
+		snprintf(values[j++], 32, "%d", stat.btpo.level);
+	values[j] = palloc(32);
+	snprintf(values[j++], 32, 

Re: [HACKERS] powerpc(32) point/polygon regression failures on Debian Jessie

2017-02-20 Thread Tom Lane
Christoph Berg  writes:
> Re: Tom Lane 2017-02-20 <30737.1487598...@sss.pgh.pa.us>
>> Hmph.  We haven't touched that code in awhile, and certainly not in the
>> 9.4.x branch.  I'd have to agree that this must be a toolchain change.

> FYI, in the meantime we could indeed trace it back to a libc issue on
> Jessie:

I wonder whether it's a compiler change, maybe along the lines of
rearranging the computation so that it gives a slightly different result.
Although you'd think that 10.0/10.0 would give exactly 1.0 no matter what.
Still, it'd be worth comparing the assembly code for your test program.

regards, tom lane




Re: [HACKERS] powerpc(32) point/polygon regression failures on Debian Jessie

2017-02-20 Thread Christoph Berg
Re: Tom Lane 2017-02-20 <30737.1487598...@sss.pgh.pa.us>
> Hmph.  We haven't touched that code in awhile, and certainly not in the
> 9.4.x branch.  I'd have to agree that this must be a toolchain change.

FYI, in the meantime we could indeed trace it back to a libc issue on
Jessie:

$ cat sqrt.c 
#include <math.h>
#include <stdio.h>
#include <fenv.h>

double
pg_hypot(double x, double y)
{
double  yx;

/* Some PG-specific code deleted here */

/* Else, drop any minus signs */
x = fabs(x);
y = fabs(y);

/* Swap x and y if needed to make x the larger one */
if (x < y)
{
double  temp = x;

x = y;
y = temp;
}

/*
 * If y is zero, the hypotenuse is x.  This test saves a few cycles in
 * such cases, but more importantly it also protects against
 * divide-by-zero errors, since now x >= y.
 */
if (y == 0.0)
return x;

/* Determine the hypotenuse */
yx = y / x;
return x * sqrt(1.0 + (yx * yx));
}


int main ()
{
//fesetround(FE_TONEAREST);
printf("fegetround is %d\n", fegetround());
double r = pg_hypot(10.0, 10.0);
printf("14 %.14g\n", r);
printf("15 %.15g\n", r);
printf("16 %.16g\n", r);
printf("17 %.17g\n", r);
return 0;
}


Jessie output:
fegetround is 0
14 14.142135623731
15 14.1421356237309
16 14.14213562373095
17 14.142135623730949

Sid output:
fegetround is 0
14 14.142135623731
15 14.142135623731
16 14.14213562373095
17 14.142135623730951


The Sid output is what the point and polygon tests are expecting.

Possible culprit is this bug report from November:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=843904
(Though that doesn't explain why it affects 32bit powerpc only.)

Christoph
-- 
Senior Berater, Tel.: +49 2166 9901 187
credativ GmbH, HRB Mönchengladbach 12080, USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
pgp fingerprint: 5C48 FE61 57F4 9179 5970  87C6 4C5A 6BAB 12D2 A7AE




Re: [HACKERS] [PATCH] Suppress Clang 3.9 warnings

2017-02-20 Thread Tom Lane
Tomas Vondra  writes:
> On 02/20/2017 04:37 PM, Tom Lane wrote:
>> But that's using gcc.  Perhaps clang behaves differently?

> AFAIK it happens because clang treats missing declarations as warnings, 
> which confuses configure:
> https://bugs.llvm.org//show_bug.cgi?id=20820

Ah, right.  Looks like the autoconf people still haven't made a
release incorporating the fix Noah provided :-(

regards, tom lane




Re: [HACKERS] [PATCH] Suppress Clang 3.9 warnings

2017-02-20 Thread Tomas Vondra

On 02/20/2017 04:37 PM, Tom Lane wrote:

Aleksander Alekseev  writes:

In theory - could we just always use our internal strl* implementations?


Hmm, maybe configure's test to see if a declaration has been provided
is going wrong?  I notice that anchovy, which is supposedly current
Arch Linux, doesn't think the platform has it:

checking whether strlcat is declared... no
checking whether strlcpy is declared... no
...
checking for strlcat... no
checking for strlcpy... no

But that's using gcc.  Perhaps clang behaves differently?



AFAIK it happens because clang treats missing declarations as warnings, 
which confuses configure:


https://bugs.llvm.org//show_bug.cgi?id=20820

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Provide list of subscriptions and publications in psql's completion

2017-02-20 Thread Stephen Frost
Greetings,

* Fujii Masao (masao.fu...@gmail.com) wrote:
> On Fri, Feb 17, 2017 at 11:17 PM, Peter Eisentraut
>  wrote:
> > On 2/13/17 12:07, Fujii Masao wrote:
> >> Anyway IMO that we can expose all the
> >> columns except the sensitive information (i.e., subconninfo field)
> >> in pg_subscription to even non-superusers.
> >
> > You mean with column privileges?
> 
> Yes.
> 
> So there are several approaches...
> 
> 1) Expose all the columns except subconninfo in pg_subscription to
> non-superusers. In this idea, the tab-completion and psql meta-command
> for subscription still sees pg_subscription. One good point of this
> idea is that even non-superusers can see all the useful information
> about subscriptions other than sensitive information like conninfo.
> 
> 2) Change pg_stat_subscription so that it also shows all the columns except
> subconninfo in pg_subscription. Also change the tab-completion and
> psql meta-command for subscription so that they see pg_stat_subscription
> instead of pg_subscription. One good point is that pg_stat_subscription
> exposes all the useful information about subscription to even
> non-superusers. OTOH, pg_subscription and pg_stat_subscription have
> to have several same columns. This would be redundant and a bit confusing.
> 
> 3) Expose subdbid in pg_stat_subscription. Also change the tab-completion
> and psql meta-command for subscription so that they see
> pg_stat_subscription. This change is very simple. But non-superusers cannot
> see useful information like subslotname because of the privilege problem.
> 
> I like #1, but any better approach?

#1 seems alright to me, at least.  We could start using column-level
privs in other places also, as I mentioned up-thread.

I don't particularly like the suggestions to have psql use pg_stat_X
views or to put things into pg_stat_X views which are object definitions
for non-superusers.  If we really don't want to use column-level
privileges then we should have an appropriate view create instead which
provides non-superusers with the non-sensitive object information.

Thanks!

Stephen




Re: [HACKERS] Provide list of subscriptions and publications in psql's completion

2017-02-20 Thread Fujii Masao
On Fri, Feb 17, 2017 at 11:17 PM, Peter Eisentraut
 wrote:
> On 2/13/17 12:07, Fujii Masao wrote:
>> Anyway IMO that we can expose all the
>> columns except the sensitive information (i.e., subconninfo field)
>> in pg_subscription to even non-superusers.
>
> You mean with column privileges?

Yes.

So there are several approaches...

1) Expose all the columns except subconninfo in pg_subscription to
non-superusers. In this idea, the tab-completion and psql meta-command
for subscription still sees pg_subscription. One good point of this
idea is that even non-superusers can see all the useful information
about subscriptions other than sensitive information like conninfo.

2) Change pg_stat_subscription so that it also shows all the columns except
subconninfo in pg_subscription. Also change the tab-completion and
psql meta-command for subscription so that they see pg_stat_subscription
instead of pg_subscription. One good point is that pg_stat_subscription
exposes all the useful information about subscription to even
non-superusers. OTOH, pg_subscription and pg_stat_subscription have
to have several same columns. This would be redundant and a bit confusing.

3) Expose subdbid in pg_stat_subscription. Also change the tab-completion
and psql meta-command for subscription so that they see
pg_stat_subscription. This change is very simple. But non-superusers cannot
see useful information like subslotname because of the privilege problem.

I like #1, but any better approach?

Regards,

-- 
Fujii Masao




Re: [HACKERS] [PATCH] Suppress Clang 3.9 warnings

2017-02-20 Thread Tom Lane
Aleksander Alekseev  writes:
> In theory - could we just always use our internal strl* implementations? 

Hmm, maybe configure's test to see if a declaration has been provided
is going wrong?  I notice that anchovy, which is supposedly current
Arch Linux, doesn't think the platform has it:

checking whether strlcat is declared... no
checking whether strlcpy is declared... no
...
checking for strlcat... no
checking for strlcpy... no

But that's using gcc.  Perhaps clang behaves differently?

regards, tom lane




[HACKERS] mat views stats

2017-02-20 Thread Jim Mlodgenski
I've come across a number of times where the statistics on materialized
views become stale producing bad plans. It turns out that autovacuum only
touches a materialized view when it is first created and ignores it on a
refresh. When you have a materialized view like yesterdays_sales the data
in the materialized view turns over every day.

Attached is a patch to trigger autovacuum based on a matview refresh along
with a system view pg_stat_all_matviews to show information more meaningful
for materialized views.

-- Jim
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index fad5cb0..ec27e2c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -436,6 +436,21 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  
 
  
+  pg_stat_all_matviewspg_stat_all_matviews
+  
+   One row for each materialized view in the current database, showing statistics
+   about accesses to that specific materialized view.
+   See  for details.
+  
+ 
+
+ 
+  pg_stat_user_matviewspg_stat_user_matviews
+  Same as pg_stat_all_matviews, except that only
+  user materialized views are shown.
+ 
+
+ 
   pg_statio_all_tablespg_statio_all_tables
   
One row for each table in the current database, showing statistics
@@ -2277,6 +2292,97 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i

   
 
+  
+   pg_stat_all_matviews View
+   
+
+
+  Column
+  Type
+  Description
+ 
+
+
+   
+
+ relid
+ oid
+ OID of this materialized view
+
+
+ schemaname
+ name
+ Name of the schema that this materialized view is in
+
+
+ relname
+ name
+ Name of this materialized view
+
+
+ seq_scan
+ bigint
+ Number of sequential scans initiated on this materialized view
+
+
+ seq_tup_read
+ bigint
+ Number of live rows fetched by sequential scans
+
+
+ idx_scan
+ bigint
+ Number of index scans initiated on this materialized view
+
+
+ idx_tup_fetch
+ bigint
+ Number of live rows fetched by index scans
+
+
+ last_refresh
+ timestamp with time zone
+ Last time at which this materialized view was refreshed
+
+
+ refresh_count
+ bigint
+ Number of times this materialized view has been refreshed
+
+
+ last_analyze
+ timestamp with time zone
+ Last time at which this materialized view was manually analyzed
+
+
+ last_autoanalyze
+ timestamp with time zone
+ Last time at which this materialized view was analyzed by the
+  autovacuum daemon
+
+
+ analyze_count
+ bigint
+ Number of times this materialized view has been manually analyzed
+
+
+ autoanalyze_count
+ bigint
+ Number of times this materialized view has been analyzed by the
+  autovacuum daemon
+
+   
+   
+  
+
+  
+   The pg_stat_all_matviews view will contain
+   one row for each materialized view in the current database, showing 
+   statistics about accesses to that specific materialized view. The
   pg_stat_user_matviews view contains the same 
+   information, but filtered to only show user materialized views.
+  
+
   
pg_statio_all_tables View

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 816483e..8d60d48 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -164,6 +164,15 @@ static relopt_int intRelOpts[] =
 		},
 		-1, 0, INT_MAX
 	},
+{
+{
+"autovacuum_analyze_refresh_threshold",
+"Minimum number of materialized view refreshes prior to analyze",
+RELOPT_KIND_HEAP,
+ShareUpdateExclusiveLock
+},
+-1, 0, INT_MAX
+},
 	{
 		{
 			"autovacuum_vacuum_cost_delay",
@@ -1283,6 +1292,8 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		offsetof(StdRdOptions, autovacuum) +offsetof(AutoVacOpts, vacuum_threshold)},
 		{"autovacuum_analyze_threshold", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) +offsetof(AutoVacOpts, analyze_threshold)},
+{"autovacuum_analyze_refresh_threshold", RELOPT_TYPE_INT,
+offsetof(StdRdOptions, autovacuum) +offsetof(AutoVacOpts, analyze_ref_threshold)},
 		{"autovacuum_vacuum_cost_delay", RELOPT_TYPE_INT,
 		offsetof(StdRdOptions, autovacuum) +offsetof(AutoVacOpts, vacuum_cost_delay)},
 		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 38be9cf..07b9f1c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -535,6 +535,28 @@ CREATE VIEW 

Re: [HACKERS] Passing query string to workers

2017-02-20 Thread Kuntal Ghosh
On Mon, Feb 20, 2017 at 10:11 AM, Rafia Sabih
 wrote:
> On Sun, Feb 19, 2017 at 10:11 PM, Robert Haas  wrote:
>> +   query_data = (char *) palloc0(strlen(estate->es_queryString) + 1);
>> +   strcpy(query_data, estate->es_queryString);
>>
>> It's unnecessary to copy the query string here; you're going to use it
>> later on in the very same function.  That code can just refer to
>> estate->es_queryString directly.
>
>
> Done.
+   char   *query_data;
+   query_data = estate->es_sourceText;
estate->es_sourceText is a const char* variable. Assigning this const
pointer to a non-const pointer violates the rules of
const-correctness. So, you should either declare query_data as const
char*, or as Robert suggested, you can directly use
estate->es_sourceText.
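To illustrate, here is a minimal sketch of the const-correct version; the struct and accessor are hypothetical stand-ins, not the real executor code:

```c
#include <string.h>

/* Hypothetical stand-in for the executor state discussed above;
 * es_sourceText is declared const, mirroring PostgreSQL's EState. */
typedef struct EState
{
	const char *es_sourceText;
} EState;

/*
 * Const-correct accessor: the returned pointer keeps the const
 * qualifier.  Writing "char *query_data = estate->es_sourceText;"
 * instead would discard the qualifier and draw a compiler warning.
 */
static const char *
get_query_string(const EState *estate)
{
	return estate->es_sourceText;
}
```

Either keeping the qualifier like this or referring to the field directly avoids the warning without any cast.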


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] "may be unused" warnings for gcc

2017-02-20 Thread Andres Freund
Hi,

When building with a new-ish gcc (6.3.0 right now, but I've seen this
for a while) with optimization I get a number of warnings:

In file included from /home/andres/src/postgresql/src/include/postgres.h:48:0,
 from 
/home/andres/src/postgresql/src/backend/parser/parse_collate.c:41:
/home/andres/src/postgresql/src/backend/parser/parse_collate.c: In function 
‘select_common_collation’:
/home/andres/src/postgresql/src/include/utils/elog.h:107:4: warning: 
‘context.location2’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
errfinish rest; \
^
/home/andres/src/postgresql/src/backend/parser/parse_collate.c:210:28: note: 
‘context.location2’ was declared here
  assign_collations_context context;
^~~
In file included from /home/andres/src/postgresql/src/include/postgres.h:48:0,
 from 
/home/andres/src/postgresql/src/backend/parser/parse_collate.c:41:
/home/andres/src/postgresql/src/include/utils/elog.h:107:4: warning: 
‘context.collation2’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
errfinish rest; \
^
/home/andres/src/postgresql/src/backend/parser/parse_collate.c:210:28: note: 
‘context.collation2’ was declared here
  assign_collations_context context;
^~~

While I believe these are false positives, I am not surprised that the
compiler can't see that.  select_common_collation() initializes some
fields of assign_collations_context, but not others.  There's several
branches out of assign_collations_walker that return without setting
collation2/location2. I think that's currently harmless because
it looks like select_common_collation() won't enter the context.strength
== COLLATE_CONFLICT branch in that case - but it's certainly hard to
see.
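A stripped-down sketch of the pattern gcc trips over — the struct and field names here are illustrative, not the real assign_collations_context:

```c
#include <string.h>

/* Illustrative context struct: one field is only assigned on some
 * paths, which is what provokes -Wmaybe-uninitialized. */
typedef struct
{
	int			strength;
	int			collation2;		/* only assigned on the "conflict" path */
} collate_ctx;

/*
 * memset'ing (or fully initializing) the struct up front keeps
 * -Wmaybe-uninitialized quiet even when the compiler cannot follow the
 * cross-function invariant that guards the later read.
 */
static int
resolve_collation(int conflict)
{
	collate_ctx ctx;

	memset(&ctx, 0, sizeof(ctx));
	ctx.strength = conflict ? 2 : 1;
	if (conflict)
		ctx.collation2 = 42;	/* hypothetical collation OID */

	/* the read gcc warns about when ctx is left partially initialized */
	return (ctx.strength == 2) ? ctx.collation2 : 0;
}
```

The invariant (the field is only read on the path that also sets it) is exactly the kind of thing the compiler cannot see across function boundaries.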


In file included from 
/home/andres/src/postgresql/src/backend/commands/dbcommands.c:20:0:
/home/andres/src/postgresql/src/backend/commands/dbcommands.c: In function 
‘createdb’:
/home/andres/src/postgresql/src/include/postgres.h:529:35: warning: 
‘src_minmxid’ may be used uninitialized in this function [-Wmaybe-uninitialized]
 #define TransactionIdGetDatum(X) ((Datum) SET_4_BYTES((X)))
   ^
/home/andres/src/postgresql/src/backend/commands/dbcommands.c:113:14: note: 
‘src_minmxid’ was declared here
  MultiXactId src_minmxid;
  ^~~
(and the same for src_frozenxid, src_lastsysoid, ...)

It appears that the loop in get_db_info() is too complex for gcc.
Replacing the !HeapTupleIsValid(tuple) break; with a heap_close() and
direct return fixes those.


/home/andres/src/postgresql/src/backend/utils/misc/guc.c: In function 
‘RestoreGUCState’:
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:6619:21: warning: 
‘varsourceline’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
  record->sourceline = sourceline;
  ~~~^~~~
/home/andres/src/postgresql/src/backend/utils/misc/guc.c:9279:8: note: 
‘varsourceline’ was declared here
  int   varsourceline;
^

Not sure what the problem is here - even if the varsourcefile[0] test in
RestoreGUCState is stored in a local variable that's also checked before
the set_config_sourcefile() branch, it warns.  Initializing
varsourceline to 0 works and seems reasonable.


/home/andres/src/postgresql/src/backend/utils/adt/varlena.c: In function 
‘text_position’:
/home/andres/src/postgresql/src/backend/utils/adt/varlena.c:1358:36: warning: 
‘state.skiptablemask’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
 hptr += state->skiptable[*hptr & skiptablemask];
  ~~^~~
/home/andres/src/postgresql/src/backend/utils/adt/varlena.c:1099:20: note: 
‘state.skiptablemask’ was declared here
  TextPositionState state;
^
/home/andres/src/postgresql/src/backend/utils/adt/varlena.c:1344:9: warning: 
‘state.wstr2’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  if (nptr == needle)
 ^
/home/andres/src/postgresql/src/backend/utils/adt/varlena.c:1099:20: note: 
‘state.wstr2’ was declared here
  TextPositionState state;
^
/home/andres/src/postgresql/src/backend/utils/adt/varlena.c:1099:20: warning: 
‘state.wstr1’ may be used uninitialized in this function [-Wmaybe-uninitialized]
/home/andres/src/postgresql/src/backend/utils/adt/varlena.c:1288:9: warning: 
‘state.str2’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  if (nptr == needle)
 ^
/home/andres/src/postgresql/src/backend/utils/adt/varlena.c:1099:20: note: 
‘state.str2’ was declared here
  TextPositionState state;
^

No idea what exactly triggers this, but zero-initializing
TextPositionState helps. What confuses me is that doing so in
text_position() is sufficient - the other uses don't trigger a warning?



Re: [HACKERS] [PATCH] Suppress Clang 3.9 warnings

2017-02-20 Thread Aleksander Alekseev
In theory - could we just always use our internal strl* implementations? 
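For reference, such a fallback would look roughly like the BSD-semantics implementation PostgreSQL already carries in src/port/; this is a sketch, and pg_strlcpy is a stand-in name rather than the exact in-tree code:

```c
#include <stddef.h>
#include <string.h>

/*
 * Portable strlcpy with BSD semantics: copy up to siz-1 bytes of src
 * into dst, always NUL-terminate (when siz > 0), and return strlen(src)
 * so callers can detect truncation by comparing the result against siz.
 */
static size_t
pg_strlcpy(char *dst, const char *src, size_t siz)
{
	size_t		srclen = strlen(src);

	if (siz > 0)
	{
		size_t		copylen = (srclen >= siz) ? siz - 1 : srclen;

		memcpy(dst, src, copylen);
		dst[copylen] = '\0';
	}
	return srclen;
}
```

Always using the internal version would sidestep the configure detection entirely, at the cost of ignoring possibly-optimized libc implementations.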

On Mon, Feb 20, 2017 at 09:26:44AM -0500, Tom Lane wrote:
> Aleksander Alekseev  writes:
> > I've just tried to build PostgreSQL with Clang 3.9.1 (default version
> > currently available in Arch Linux) and noticed that it outputs lots of
> > warning messages. Most of them are result of a bug in Clang itself:
> > 
> > postinit.c:846:3: note: include the header  or explicitly
> > provide a declaration for 'strlcpy'
> 
> It might be an incompatibility with the platform-supplied string.h
> rather than an outright bug, but yeah, that's pretty annoying.
> 
> > The rest of warnings looks more like something we could easily deal with:
> 
> It's hard to get excited about these if there are going to be hundreds
> of the other ones ...
> 
>   regards, tom lane

-- 
Best regards,
Aleksander Alekseev




Re: [HACKERS] [PATCH] Suppress Clang 3.9 warnings

2017-02-20 Thread Tom Lane
Aleksander Alekseev  writes:
> I've just tried to build PostgreSQL with Clang 3.9.1 (default version
> currently available in Arch Linux) and noticed that it outputs lots of
> warning messages. Most of them are result of a bug in Clang itself:
> 
> postinit.c:846:3: note: include the header  or explicitly
> provide a declaration for 'strlcpy'

It might be an incompatibility with the platform-supplied string.h
rather than an outright bug, but yeah, that's pretty annoying.

> The rest of warnings looks more like something we could easily deal with:

It's hard to get excited about these if there are going to be hundreds
of the other ones ...

regards, tom lane




[HACKERS] [PATCH] Suppress Clang 3.9 warnings

2017-02-20 Thread Aleksander Alekseev
Hello.

I've just tried to build PostgreSQL with Clang 3.9.1 (default version
currently available in Arch Linux) and noticed that it outputs lots of
warning messages. Most of them are result of a bug in Clang itself:

```
postinit.c:846:3: note: include the header  or explicitly
provide a declaration for 'strlcpy'
```

I've reported it to Clang developers almost a year ago but apparently
no one cares. You can find all the details in a corresponding thread
[1]. Frankly I'm not sure what to do about it. 

The rest of the warnings look more like something we could easily deal with:

```
xloginsert.c:742:18: warning: implicit conversion from 'int' to 'char'
changes value from 253 to -3 [-Wconstant-conversion]
```

Patch that fixes these warnings is attached to this email.

[1] http://lists.llvm.org/pipermail/cfe-dev/2016-March/048126.html
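A minimal sketch of what clang is complaining about and why the cast is harmless (253 is the value of XLR_BLOCK_ID_ORIGIN in xlogrecord.h; the byte stored is 0xFD either way):

```c
/* stand-in for the XLR_BLOCK_ID_ORIGIN constant from xlogrecord.h */
#define BLOCK_ID_ORIGIN 253

/*
 * Without the explicit cast, clang warns that the implicit int-to-char
 * conversion changes 253 to -3 on platforms where char is signed.  The
 * stored byte is 0xFD either way, so the cast merely documents that the
 * wraparound is intentional.
 */
static int
stored_block_id(void)
{
	char		buf[1];

	buf[0] = (char) BLOCK_ID_ORIGIN;

	/* read the byte back as an unsigned value */
	return (unsigned char) buf[0];
}
```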

-- 
Best regards,
Aleksander Alekseev
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f23e108628..0007010b25 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5015,7 +5015,7 @@ BootStrapXLOG(void)
 	record->xl_rmid = RM_XLOG_ID;
 	recptr += SizeOfXLogRecord;
 	/* fill the XLogRecordDataHeaderShort struct */
-	*(recptr++) = XLR_BLOCK_ID_DATA_SHORT;
+	*(recptr++) = (char)XLR_BLOCK_ID_DATA_SHORT;
 	*(recptr++) = sizeof(checkPoint);
 	memcpy(recptr, &checkPoint, sizeof(checkPoint));
 	recptr += sizeof(checkPoint);
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 03b05f937f..ea8e915029 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -739,7 +739,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	if ((curinsert_flags & XLOG_INCLUDE_ORIGIN) &&
 		replorigin_session_origin != InvalidRepOriginId)
 	{
-		*(scratch++) = XLR_BLOCK_ID_ORIGIN;
+		*(scratch++) = (char)XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
 		scratch += sizeof(replorigin_session_origin);
 	}
@@ -749,13 +749,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	{
 		if (mainrdata_len > 255)
 		{
-			*(scratch++) = XLR_BLOCK_ID_DATA_LONG;
+			*(scratch++) = (char)XLR_BLOCK_ID_DATA_LONG;
 			memcpy(scratch, &mainrdata_len, sizeof(uint32));
 			scratch += sizeof(uint32);
 		}
 		else
 		{
-			*(scratch++) = XLR_BLOCK_ID_DATA_SHORT;
+			*(scratch++) = (char)XLR_BLOCK_ID_DATA_SHORT;
 			*(scratch++) = (uint8) mainrdata_len;
 		}
 		rdt_datas_last->next = mainrdata_head;
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 96b7097f8b..9165da1ee5 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -1100,7 +1100,7 @@ WriteEmptyXLOG(void)
 	record->xl_rmid = RM_XLOG_ID;
 
 	recptr += SizeOfXLogRecord;
-	*(recptr++) = XLR_BLOCK_ID_DATA_SHORT;
+	*(recptr++) = (char)XLR_BLOCK_ID_DATA_SHORT;
 	*(recptr++) = sizeof(CheckPoint);
 	memcpy(recptr, &ControlFile->checkPointCopy,
 		   sizeof(CheckPoint));




Re: [HACKERS] Replication vs. float timestamps is a disaster

2017-02-20 Thread Tom Lane
Petr Jelinek  writes:
> It's definitely not hard, we already have
> IntegerTimestampToTimestampTz() which does the opposite conversion anyway.

It's not the functions that are hard, it's making sure that you have used
them in the correct places, and declared relevant variables with the
appropriate types.  Evidence on the ground is that that is very hard;
I have little faith that we'll get it right without compiler support.

regards, tom lane




Re: [HACKERS] Replication vs. float timestamps is a disaster

2017-02-20 Thread Tom Lane
Andres Freund  writes:
> On 2017-02-20 11:58:12 +0100, Petr Jelinek wrote:
>> That being said, I did wonder myself if we should just deprecate float
>> timestamps as well.

> I think we need a proper deprecation period for that, given that the
> conversion away will be painful for pg_upgrade using people with big
> clusters.  So I think we should fix this regardless... :(

The question to be asked is whether there is still anybody out there
using float timestamps.  I'm starting to get dubious about it myself.
Certainly, no packager that I'm aware of has shipped a float-timestamp
build since we switched the default in 8.4.  Maybe there is somebody
who's faithfully built a float-timestamp custom build every year so they
can pg_upgrade in place from their 8.3 installation, but there have got
to be darn few such people.

As for "proper deprecation period", the documentation has called the
option deprecated since 8.4:

-disable-integer-datetimes
Disable support for 64-bit integer storage for timestamps and
intervals, and store datetime values as floating-point numbers
instead. Floating-point datetime storage was the default in
PostgreSQL releases prior to 8.4, but it is now deprecated,
because it does not support microsecond precision for the full
range of timestamp values.

I think the strongest reason why we didn't move to kill it sooner was
that we were not then assuming that every platform had 64-bit ints;
but we've required that since 9.0.

regards, tom lane




Re: [HACKERS] powerpc(32) point/polygon regression failures on Debian Jessie

2017-02-20 Thread Tom Lane
Christoph Berg  writes:
> The point/polygon regression tests have started to fail on 32-bit
> powerpc on Debian Jessie. So far I could reproduce the problem with
> PostgreSQL 9.4.10+11 and 9.6.1, on several different machines. Debian
> unstable is unaffected.

Hmph.  We haven't touched that code in awhile, and certainly not in the
9.4.x branch.  I'd have to agree that this must be a toolchain change.

regards, tom lane




[HACKERS] Error in XML recv function

2017-02-20 Thread Pavel Stehule
Hi

When I did tests of own XML import functions I found a strange error.

I successfully imported XML to PostgreSQL. This document is readable
without any visual defects. But when I tested this document against any
libxml2 function I found an error -

ERROR:  could not parse XML document
DETAIL:  input conversion failed due to input error, bytes 0x88 0x3C 0x2F
0x72
line 1: switching encoding: encoder error

When I debugged this document I found an inconsistency between the encoding
info and the data. The recv function correctly converts the data from cp1250
encoding to UTF8, but the encoding info stays unchanged and still shows
cp1250. Then libxml2 functions fail.

Regards

Pavel





Re: [HACKERS] How to read a value when it is VARATT EXTERNAL ONDISK from logical replication decoder

2017-02-20 Thread Andres Freund
Hi,

On 2017-02-20 11:44:58 +0100, Adam Dratwiński wrote:
> Hello everyone,
> 
> I am writing a custom logical replication decoder, and I took test decoder
> from Postgres sources as an example.
> 
> Could anyone tell me how to read "unchanged toast datum" in case it is
> VARATT_IS_EXTERNAL_ONDISK.

You can't by default. It's simply not available data - we'd have to log
a lot more data to make that fully available.  If you change the replica
identity settings to FULL, the full old data (including toasted data)
will be logged.

Regards,

Andres




[HACKERS] How to read a value when it is VARATT EXTERNAL ONDISK from logical replication decoder

2017-02-20 Thread Adam Dratwiński
Hello everyone,

I am writing a custom logical replication decoder, and I took test decoder
from Postgres sources as an example.

Could anyone tell me how to read "unchanged toast datum" in case it is
VARATT_IS_EXTERNAL_ONDISK.

In the test decoder it is ignored:

https://github.com/postgres/postgres/blob/master/contrib/test_decoding/test_decoding.c#L377


Cheers
Adam


[HACKERS] powerpc(32) point/polygon regression failures on Debian Jessie

2017-02-20 Thread Christoph Berg
The point/polygon regression tests have started to fail on 32-bit
powerpc on Debian Jessie. So far I could reproduce the problem with
PostgreSQL 9.4.10+11 and 9.6.1, on several different machines. Debian
unstable is unaffected.

The failure looks like this:

https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.6=powerpc=9.6.1-2~bpo8%2B1=1485184696=0

 build/src/test/regress/regression.diffs 
*** /«PKGBUILDDIR»/build/../src/test/regress/expected/point.out Mon Oct 24 
20:08:51 2016
--- /«PKGBUILDDIR»/build/src/test/regress/results/point.out Mon Jan 23 
15:17:51 2017
***
*** 125,131 
   | (-3,4) |5
   | (-10,0)|   10
   | (-5,-12)   |   13
!  | (10,10)|  14.142135623731
   | (5.1,34.5) | 34.8749193547455
  (6 rows)
  
--- 125,131 
   | (-3,4) |5
   | (-10,0)|   10
   | (-5,-12)   |   13
!  | (10,10)| 14.1421356237309
   | (5.1,34.5) | 34.8749193547455
  (6 rows)
  
***
*** 150,157 
 | (-5,-12)   | (-10,0)|   13
 | (-5,-12)   | (0,0)  |   13
 | (0,0)  | (-5,-12)   |   13
!| (0,0)  | (10,10)|  14.142135623731
!| (10,10)| (0,0)  |  14.142135623731
 | (-3,4) | (10,10)| 14.3178210632764
 | (10,10)| (-3,4) | 14.3178210632764
 | (-5,-12)   | (-3,4) | 16.1245154965971
--- 150,157 
 | (-5,-12)   | (-10,0)|   13
 | (-5,-12)   | (0,0)  |   13
 | (0,0)  | (-5,-12)   |   13
!| (0,0)  | (10,10)| 14.1421356237309
!| (10,10)| (0,0)  | 14.1421356237309
 | (-3,4) | (10,10)| 14.3178210632764
 | (10,10)| (-3,4) | 14.3178210632764
 | (-5,-12)   | (-3,4) | 16.1245154965971
***
*** 221,227 
   | (-10,0)| (0,0)  |   10
   | (-10,0)| (-5,-12)   |   13
   | (-5,-12)   | (0,0)  |   13
!  | (0,0)  | (10,10)|  14.142135623731
   | (-3,4) | (10,10)| 14.3178210632764
   | (-5,-12)   | (-3,4) | 16.1245154965971
   | (-10,0)| (10,10)| 22.3606797749979
--- 221,227 
   | (-10,0)| (0,0)  |   10
   | (-10,0)| (-5,-12)   |   13
   | (-5,-12)   | (0,0)  |   13
!  | (0,0)  | (10,10)| 14.1421356237309
   | (-3,4) | (10,10)| 14.3178210632764
   | (-5,-12)   | (-3,4) | 16.1245154965971
   | (-10,0)| (10,10)| 22.3606797749979

==

*** /«PKGBUILDDIR»/build/../src/test/regress/expected/polygon.out   Mon Oct 
24 20:08:51 2016
--- /«PKGBUILDDIR»/build/src/test/regress/results/polygon.out   Mon Jan 23 
15:17:51 2017
***
*** 222,229 
'(2,2)'::point <-> '((0,0),(1,4),(3,1))'::polygon as inside,
'(3,3)'::point <-> '((0,2),(2,0),(2,2))'::polygon as near_corner,
'(4,4)'::point <-> '((0,0),(0,3),(4,0))'::polygon as near_segment;
!  on_corner | on_segment | inside |   near_corner   | near_segment 
! ---+++-+--
!  0 |  0 |  0 | 1.4142135623731 |  3.2
  (1 row)
  
--- 222,229 
'(2,2)'::point <-> '((0,0),(1,4),(3,1))'::polygon as inside,
'(3,3)'::point <-> '((0,2),(2,0),(2,2))'::polygon as near_corner,
'(4,4)'::point <-> '((0,0),(0,3),(4,0))'::polygon as near_segment;
!  on_corner | on_segment | inside |   near_corner| near_segment 
! ---+++--+--
!  0 |  0 |  0 | 1.41421356237309 |  3.2
  (1 row)
  


The 9.4.11 log contains the same point.out diff, but not polygon.out:
https://buildd.debian.org/status/fetch.php?pkg=postgresql-9.4=powerpc=9.4.11-0%2Bdeb8u1=1487517299=0

Does that ring any bell? As Debian unstable is unaffected, it's likely
the toolchain to be blamed, but it worked for Debian Jessie before.

Christoph
-- 
Senior Berater, Tel.: +49 2166 9901 187
credativ GmbH, HRB Mönchengladbach 12080, USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
pgp fingerprint: 5C48 FE61 57F4 9179 5970  87C6 4C5A 6BAB 12D2 A7AE




Re: [HACKERS] dropping partitioned tables without CASCADE

2017-02-20 Thread Ashutosh Bapat
Thanks for working on all the follow on work for partitioning feature.

May be you should add all those patches in the next commitfest, so
that we don't forget those.

On Mon, Feb 20, 2017 at 7:46 AM, Amit Langote
 wrote:
> Re-posting the patch I posted in a nearby thread [0].
>
> On 2017/02/16 2:08, Robert Haas wrote:
>> On Wed, Feb 15, 2017 at 11:34 AM, Alvaro Herrera
>>  wrote:
>>> I think new-style partitioning is supposed to consider each partition as
>>> an implementation detail of the table; the fact that you can manipulate
>>> partitions separately does not really mean that they are their own
>>> independent object.  You don't stop to think "do I really want to drop
>>> the TOAST table attached to this main table?" and attach a CASCADE
>>> clause if so.  You just drop the main table, and the toast one is
>>> dropped automatically.  I think new-style partitions should behave
>>> equivalently.
>>
>> That's a reasonable point of view.  I'd like to get some more opinions
>> on this topic.  I'm happy to have us do whatever most people want, but
>> I'm worried that having table inheritance and table partitioning work
>> differently will be create confusion.  I'm also suspicious that there
>> may be some implementation difficulties.  On the hand, it does seem a
>> little silly to say that DROP TABLE partitioned_table should always
>> fail except in the degenerate case where there are no partitions, so
>> maybe changing it is for the best.
>
> So I count more than a few votes saying that we should be able to DROP
> partitioned tables without specifying CASCADE.
>
> I tried to implement that using the attached patch by having
> StoreCatalogInheritance1() create DEPENDENCY_AUTO dependency between
> parent and child if the child is a partition, instead of DEPENDENCY_NORMAL
> that would otherwise be created.  Now it seems that that is one way of
> making sure that partitions are dropped when the root partitioned table is
> dropped, not sure if the best; why create the pg_depend entries at all one
> might ask.  I chose it for now because that's the one with fewest lines of
> change.  Adjusted regression tests as well, since we recently tweaked
> tests [1] to work around the irregularities of test output when using CASCADE.

The patch applies cleanly and regression does not show any failures.

Here are some comments

For the sake of readability you may want reverse the if and else order.
-recordDependencyOn(&childobject, &parentobject, DEPENDENCY_NORMAL);
+if (!child_is_partition)
+recordDependencyOn(&childobject, &parentobject, DEPENDENCY_NORMAL);
+else
+recordDependencyOn(&childobject, &parentobject, DEPENDENCY_AUTO);
like
+if (child_is_partition)
+recordDependencyOn(&childobject, &parentobject, DEPENDENCY_AUTO);
+else
+recordDependencyOn(&childobject, &parentobject, DEPENDENCY_NORMAL);

It's weird that somebody can perform DROP TABLE on a partition without
referring to its parent. That may be a useful feature as it allows one to
detach the partition as well as remove the table in one command. But it looks
weird for someone familiar with the partitioning features of other DBMSes. But
then our partition creation command is CREATE TABLE  So maybe this is an
expected difference.

--- cleanup: avoid using CASCADE
-DROP TABLE list_parted, part_1;
-DROP TABLE list_parted2, part_2, part_5, part_5_a;
-DROP TABLE range_parted, part1, part2;
+-- cleanup
+DROP TABLE list_parted, list_parted2, range_parted;
Testcases usually drop one table at a time, I guess, to reduce the differences
when we add or remove tables from testcases. All such blocks should probably
follow the same policy.

 drop table list_parted cascade;
-NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table part_ab_cd
probably we should remove cascade from there, unless you are testing CASCADE
functionality. Similarly for other blocks like
 drop table range_parted cascade;

BTW, I noticed that although we are allowing foreign tables to be
partitions, there are no tests in foreign_data.sql for testing it. If
there had been, we would have tested DROP TABLE on a partitioned table
with foreign partitions as well. That file has testcases for testing
foreign table inheritance, and should have tests for foreign table
partitions.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SCRAM authentication, take three

2017-02-20 Thread Aleksander Alekseev
And a few more things I've noticed after a closer look:

* build_client_first_message does not free `state->client_nonce` if
  second malloc (for `buf`) fails
* same for `state->client_first_message_bare`
* ... and most other procedures declared in fe-auth-scram.c file
  (see malloc and strdup calls)
* scram_Normalize doesn't check malloc return value

Sorry for lots of emails.

On Mon, Feb 20, 2017 at 03:15:14PM +0300, Aleksander Alekseev wrote:
> Speaking about flaws, it looks like there is a memory leak in
> array_to_utf procedure - result is allocated twice.
> 
> On Mon, Feb 20, 2017 at 02:51:13PM +0300, Aleksander Alekseev wrote:
> > Hi!
> > 
> > Currently I don't see any significant flaws in these patches. However I
> > would like to verify that implemented algorithms are compatible with
> > algorithms implemented by third party.
> > 
> > For instance, for user 'eax' and password 'pass' I got the following
> > record in pg_authid:
> > 
> > ```
> > scram-sha-256:
> > xtznkRN/nc/1DQ==:
> > 4096:
> > 2387c124a3139a276b848c910f43ece05dd670d0977ace4f20d724af522312e4:
> > 5e3bdd6584880198b0375acabd8754c460af2389499f71a756660a10a8aaa843
> > ```
> > 
> > Let's say I would like to implement SCRAM in pure Python, for instance
> > add it to pg8000 driver. Firstly I need to know how to generate server
> > key and client key having only user name and password. Somehow like
> > this:
> > 
> > ```
> >  >>> import base64
> >  >>> import hashlib
> >  >>> base64.b16encode(hashlib.pbkdf2_hmac('sha256', b'pass',
> >  ...base64.b64decode(b'xtznkRN/nc/1DQ=='), 4096))
> > b'14B90CFCF690120399A0E6D30C60DD9D9D221CD9C2E31EA0A00514C41823E6C3'
> > ```
> > 
> > Naturally it doesn't work because both SCRAM_SERVER_KEY_NAME and
> > SCRAM_CLIENT_KEY_NAME should also be involved. I see PostgreSQL
> > implementation just in front of me but unfortunately I'm still having
> > problems calculating exactly the same server key and client key. It makes
> > me worry whether PostgreSQL implementation is OK.
> > 
> > Could you please give an example of how to do it?
> > 
> > On Mon, Feb 20, 2017 at 03:29:19PM +0900, Michael Paquier wrote:
> > > On Sun, Feb 19, 2017 at 10:07 PM, Michael Paquier
> > >  wrote:
> > > > There is something that I think is still unwelcome in this patch: the
> > > > interface in pg_hba.conf. I mentioned that in the previous thread but
> > > > now if you want to match a user and a database with a scram password
> > > > you need to do that with the current set of patches:
> > > > local $dbname $user scram
> > > > That's not really portable as SCRAM is one protocol in the SASL
> > > > family, and even worse in our case we use SCRAM-SHA-256. I'd like to
> > > > change pg_hba.conf to be as follows:
> > > > local $dbname $user sasl protocol=scram_sha_256
> > > > This is extensible for the future, and protocol is a mandatory option
> > > > that would have now just a single value: scram_sha_256. Heikki,
> > > > others, are you fine with that?
> > > 
> > > I have implemented that as 0009 which is attached, and need to be
> > > applied on the rest of upthread. I am not sure if we want to make the
> > > case where no protocol is specified map to everything. This would be a
> > > tricky support for users in the future if new authentication
> > > mechanisms for SASL are added in the future.
> > > 
> > > Another issue that I have is: do we really want to have
> > > password_encryption being set to "scram" for verifiers of
> > > SCRAM-SHA-256? I would think that scram_sha_256 makes the most sense.
> > > Who knows, perhaps there could be in a couple of years a SHA-SHA-512..
> > > 
> > > At the same time, attached is a new version of 0008 that implements
> > > SASLprep, I have stabilized the beast after fixing some allocation
> > > calculations when converting the decomposed pg_wchar array back to a
> > > utf8 string.
> > > -- 
> > > Michael
> > 
> > -- 
> > Best regards,
> > Aleksander Alekseev
> 
> 
> 
> -- 
> Best regards,
> Aleksander Alekseev



-- 
Best regards,
Aleksander Alekseev


signature.asc
Description: PGP signature


Re: [HACKERS] 答复: [HACKERS] Adding new output parameter of pg_stat_statements toidentify operation of the query.(Internet mail)

2017-02-20 Thread Tom Lane
jasonysli (李跃森)  writes:
> Yes, it seems the pg_stat_sql function can fit the individual need of 
> collecting tags of query.  However the new function can not return other 
> values of query  at the same time, such as block number info, run time and so 
> on. Returning these values at the same time are very important.  

Meh ... that seems like a justification that was made up on the fly,
rather than being a demonstrable requirement.  What's the specific use
case that requires it?


After thinking about this some more I'm really pretty dubious that storing
the command tag will get you anything you can't get as well or better by
looking at the first word or two of the query text.  I don't believe the
original claim that doing so is "too expensive for a monitoring system".
It does not take that much to pull out a couple of whitespace-separated
keywords and perhaps case-fold them, and by the time you've fetched the
string out of the database you've spent way more cycles than that.
A slightly better argument is that WITH will confuse matters, but really
it does anyway: consider

WITH x AS (INSERT INTO ... RETURNING *)
SELECT * FROM x;

This will get a command tag of "SELECT", but any reasonable person would
deem that this would be better characterized as an INSERT.  So I think
if you want useful results for WITH cases you're going to need to spend
effort looking at the querystring anyway.
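For the record, the kind of cheap keyword extraction I mean is no more than
this sketch (illustrative only, not tied to any particular driver):

```python
import re

def query_tag(sql: str) -> str:
    """Approximate a command tag from the first keyword of the query text.

    WITH queries are deliberately reported as 'WITH': as noted above,
    characterizing them properly requires looking deeper into the statement.
    """
    m = re.match(r'\s*([A-Za-z]+)', sql)
    return m.group(1).upper() if m else ''

print(query_tag("select * from t"))   # SELECT
print(query_tag("WITH x AS (INSERT INTO t VALUES (1) RETURNING *) SELECT * FROM x"))  # WITH
```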

regards, tom lane




Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-20 Thread Simon Riggs
On 15 February 2017 at 19:15, Andres Freund  wrote:

> I think I previously
> mentioned, even just removing the MyPgXact->xmin assignment in
> SnapshotResetXmin() is measurable performance wise and cache-hit ratio
> wise.

Currently, we issue SnapshotResetXmin() pointlessly at end of xact, so
a patch is attached to remove that call, plus some comments to explain
why. This removes one cause of the problem.

Also attached is another patch that uses a simple heuristic to reduce
the number of calls to SnapshotResetXmin(), reducing the effects further.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


skip_SnapshotResetXmin_if_idle_timeout.v1.patch
Description: Binary data


reduce_pgxact_access_AtEOXact.v1.patch
Description: Binary data



Re: [HACKERS] SCRAM authentication, take three

2017-02-20 Thread Aleksander Alekseev
Speaking about flaws, it looks like there is a memory leak in
array_to_utf procedure - result is allocated twice.

On Mon, Feb 20, 2017 at 02:51:13PM +0300, Aleksander Alekseev wrote:
> Hi!
> 
> Currently I don't see any significant flaws in these patches. However I
> would like to verify that implemented algorithms are compatible with
> algorithms implemented by third party.
> 
> For instance, for user 'eax' and password 'pass' I got the following
> record in pg_authid:
> 
> ```
> scram-sha-256:
> xtznkRN/nc/1DQ==:
> 4096:
> 2387c124a3139a276b848c910f43ece05dd670d0977ace4f20d724af522312e4:
> 5e3bdd6584880198b0375acabd8754c460af2389499f71a756660a10a8aaa843
> ```
> 
> Let's say I would like to implement SCRAM in pure Python, for instance
> add it to pg8000 driver. Firstly I need to know how to generate server
> key and client key having only user name and password. Somehow like
> this:
> 
> ```
>  >>> import base64
>  >>> import hashlib
>  >>> base64.b16encode(hashlib.pbkdf2_hmac('sha256', b'pass',
>  ...base64.b64decode(b'xtznkRN/nc/1DQ=='), 4096))
> b'14B90CFCF690120399A0E6D30C60DD9D9D221CD9C2E31EA0A00514C41823E6C3'
> ```
> 
> Naturally it doesn't work because both SCRAM_SERVER_KEY_NAME and
> SCRAM_CLIENT_KEY_NAME should also be involved. I see PostgreSQL
> implementation just in front of me but unfortunately I'm still having
> problems calculating exactly the same server key and client key. It makes
> me worry whether PostgreSQL implementation is OK.
> 
> Could you please give an example of how to do it?
> 
> On Mon, Feb 20, 2017 at 03:29:19PM +0900, Michael Paquier wrote:
> > On Sun, Feb 19, 2017 at 10:07 PM, Michael Paquier
> >  wrote:
> > > There is something that I think is still unwelcome in this patch: the
> > > interface in pg_hba.conf. I mentioned that in the previous thread but
> > > now if you want to match a user and a database with a scram password
> > > you need to do that with the current set of patches:
> > > local $dbname $user scram
> > > That's not really portable as SCRAM is one protocol in the SASL
> > > family, and even worse in our case we use SCRAM-SHA-256. I'd like to
> > > change pg_hba.conf to be as follows:
> > > local $dbname $user sasl protocol=scram_sha_256
> > > This is extensible for the future, and protocol is a mandatory option
> > > that would have now just a single value: scram_sha_256. Heikki,
> > > others, are you fine with that?
> > 
> > I have implemented that as 0009 which is attached, and need to be
> > applied on the rest of upthread. I am not sure if we want to make the
> > case where no protocol is specified map to everything. This would be a
> > tricky support for users in the future if new authentication
> > mechanisms for SASL are added in the future.
> > 
> > Another issue that I have is: do we really want to have
> > password_encryption being set to "scram" for verifiers of
> > SCRAM-SHA-256? I would think that scram_sha_256 makes the most sense.
> > Who knows, perhaps there could be in a couple of years a SHA-SHA-512..
> > 
> > At the same time, attached is a new version of 0008 that implements
> > SASLprep, I have stabilized the beast after fixing some allocation
> > calculations when converting the decomposed pg_wchar array back to a
> > utf8 string.
> > -- 
> > Michael
> 
> -- 
> Best regards,
> Aleksander Alekseev



-- 
Best regards,
Aleksander Alekseev




Re: [HACKERS] Replication vs. float timestamps is a disaster

2017-02-20 Thread Petr Jelinek
On 20/02/17 12:04, Andres Freund wrote:
> On 2017-02-20 11:58:12 +0100, Petr Jelinek wrote:
>> That being said, I did wonder myself if we should just deprecate float
>> timestamps as well.
> 
> I think we need a proper deprecation period for that, given that the
> conversion away will be painful for pg_upgrade using people with big
> clusters.  So I think we should fix this regardless... :(
> 

That's a good point.

Attached should fix the logical replication problems. I am not quite
sure if there is anything in physical that needs changing.

I opted for GetCurrentIntegerTimestamp() in the reply code as that's the
same coding walreceiver uses.
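As a reminder of what the conversion actually does: float-timestamp builds
store a TimestampTz as seconds (a double) since 2000-01-01, while integer
builds store microseconds (int64) since the same epoch, so
TimestampTzToIntegerTimestamp() is just a scale by USECS_PER_SEC. A rough
sketch of the arithmetic (Python, for illustration only; it ignores rounding
of sub-microsecond fractions):

```python
USECS_PER_SEC = 1_000_000

def float_ts_to_integer_ts(seconds: float) -> int:
    # Mirrors TimestampTzToIntegerTimestamp(): seconds -> microseconds.
    return int(seconds * USECS_PER_SEC)

def integer_ts_to_float_ts(usecs: int) -> float:
    # Mirrors IntegerTimestampToTimestampTz(): microseconds -> seconds.
    return usecs / USECS_PER_SEC

print(float_ts_to_integer_ts(1.5))    # 1500000
```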

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 142cd99..dd5bdcc 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -46,7 +46,7 @@ logicalrep_write_begin(StringInfo out, ReorderBufferTXN *txn)
 
 	/* fixed fields */
 	pq_sendint64(out, txn->final_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, TimestampTzToIntegerTimestamp(txn->commit_time));
 	pq_sendint(out, txn->xid, 4);
 }
 
@@ -60,7 +60,7 @@ logicalrep_read_begin(StringInfo in, LogicalRepBeginData *begin_data)
 	begin_data->final_lsn = pq_getmsgint64(in);
 	if (begin_data->final_lsn == InvalidXLogRecPtr)
 		elog(ERROR, "final_lsn not set in begin message");
-	begin_data->committime = pq_getmsgint64(in);
+	begin_data->committime = IntegerTimestampToTimestampTz(pq_getmsgint64(in));
 	begin_data->xid = pq_getmsgint(in, 4);
 }
 
@@ -82,7 +82,7 @@ logicalrep_write_commit(StringInfo out, ReorderBufferTXN *txn,
 	/* send fields */
 	pq_sendint64(out, commit_lsn);
 	pq_sendint64(out, txn->end_lsn);
-	pq_sendint64(out, txn->commit_time);
+	pq_sendint64(out, TimestampTzToIntegerTimestamp(txn->commit_time));
 }
 
 /*
@@ -100,7 +100,7 @@ logicalrep_read_commit(StringInfo in, LogicalRepCommitData *commit_data)
 	/* read fields */
 	commit_data->commit_lsn = pq_getmsgint64(in);
 	commit_data->end_lsn = pq_getmsgint64(in);
-	commit_data->committime = pq_getmsgint64(in);
+	commit_data->committime = IntegerTimestampToTimestampTz(pq_getmsgint64(in));
 }
 
 /*
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 0b19fec..225ea4c 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1183,7 +1183,7 @@ send_feedback(XLogRecPtr recvpos, bool force, bool requestReply)
 	pq_sendint64(reply_message, recvpos);		/* write */
 	pq_sendint64(reply_message, flushpos);		/* flush */
 	pq_sendint64(reply_message, writepos);		/* apply */
-	pq_sendint64(reply_message, now);			/* sendTime */
+	pq_sendint64(reply_message, GetCurrentIntegerTimestamp());	/* sendTime */
 	pq_sendbyte(reply_message, requestReply);	/* replyRequested */
 
 	elog(DEBUG2, "sending feedback (force %d) to recv %X/%X, write %X/%X, flush %X/%X",
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 9b4c012..1dd469d 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -1749,6 +1749,20 @@ IntegerTimestampToTimestampTz(int64 timestamp)
 #endif
 
 /*
+ * TimestampTzToIntegerTimestamp -- convert a native format timestamp to int64
+ *
+ * When compiled with --enable-integer-datetimes, this is implemented as a
+ * no-op macro.
+ */
+#ifndef HAVE_INT64_TIMESTAMP
+int64
+TimestampTzToIntegerTimestamp(TimestampTz timestamp)
+{
+	return timestamp * USECS_PER_SEC;
+}
+#endif
+
+/*
  * GetSQLCurrentTimestamp -- implements CURRENT_TIMESTAMP, CURRENT_TIMESTAMP(n)
  */
 TimestampTz
diff --git a/src/include/utils/timestamp.h b/src/include/utils/timestamp.h
index 21651b1..765fa81 100644
--- a/src/include/utils/timestamp.h
+++ b/src/include/utils/timestamp.h
@@ -108,9 +108,11 @@ extern bool TimestampDifferenceExceeds(TimestampTz start_time,
 #ifndef HAVE_INT64_TIMESTAMP
 extern int64 GetCurrentIntegerTimestamp(void);
 extern TimestampTz IntegerTimestampToTimestampTz(int64 timestamp);
+extern int64 TimestampTzToIntegerTimestamp(TimestampTz timestamp);
 #else
 #define GetCurrentIntegerTimestamp()	GetCurrentTimestamp()
 #define IntegerTimestampToTimestampTz(timestamp) (timestamp)
+#define TimestampTzToIntegerTimestamp(timestamp) (timestamp)
 #endif
 
 extern TimestampTz time_t_to_timestamptz(pg_time_t tm);



Re: [HACKERS] SCRAM authentication, take three

2017-02-20 Thread Aleksander Alekseev
Hi!

Currently I don't see any significant flaws in these patches. However I
would like to verify that implemented algorithms are compatible with
algorithms implemented by third party.

For instance, for user 'eax' and password 'pass' I got the following
record in pg_authid:

```
scram-sha-256:
xtznkRN/nc/1DQ==:
4096:
2387c124a3139a276b848c910f43ece05dd670d0977ace4f20d724af522312e4:
5e3bdd6584880198b0375acabd8754c460af2389499f71a756660a10a8aaa843
```

Let's say I would like to implement SCRAM in pure Python, for instance to
add it to the pg8000 driver. First, I need to know how to generate the
server key and client key having only the user name and password. Something
like this:

```
 >>> import base64
 >>> import hashlib
 >>> base64.b16encode(hashlib.pbkdf2_hmac('sha256', b'pass',
 ...base64.b64decode(b'xtznkRN/nc/1DQ=='), 4096))
b'14B90CFCF690120399A0E6D30C60DD9D9D221CD9C2E31EA0A00514C41823E6C3'
```

Naturally it doesn't work, because both SCRAM_SERVER_KEY_NAME and
SCRAM_CLIENT_KEY_NAME should also be involved. I have the PostgreSQL
implementation right in front of me, but unfortunately I'm still having
trouble calculating exactly the same server key and client key. It makes
me worry whether the PostgreSQL implementation is OK.

Could you please give an example of how to do it?
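Reading RFC 5802, the client and server keys are derived by HMAC-ing the
salted password with the fixed strings "Client Key" and "Server Key", and
the StoredKey is SHA-256 of the ClientKey — so I would expect something like
the following sketch (I have not confirmed that it reproduces the pg_authid
record above):

```python
import base64
import hashlib
import hmac

salt = base64.b64decode(b'xtznkRN/nc/1DQ==')
salted_password = hashlib.pbkdf2_hmac('sha256', b'pass', salt, 4096)

# RFC 5802: ClientKey := HMAC(SaltedPassword, "Client Key")
#           ServerKey := HMAC(SaltedPassword, "Server Key")
#           StoredKey := H(ClientKey)
client_key = hmac.new(salted_password, b'Client Key', hashlib.sha256).digest()
server_key = hmac.new(salted_password, b'Server Key', hashlib.sha256).digest()
stored_key = hashlib.sha256(client_key).digest()

print(stored_key.hex())
print(server_key.hex())
```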

On Mon, Feb 20, 2017 at 03:29:19PM +0900, Michael Paquier wrote:
> On Sun, Feb 19, 2017 at 10:07 PM, Michael Paquier
>  wrote:
> > There is something that I think is still unwelcome in this patch: the
> > interface in pg_hba.conf. I mentioned that in the previous thread but
> > now if you want to match a user and a database with a scram password
> > you need to do that with the current set of patches:
> > local $dbname $user scram
> > That's not really portable as SCRAM is one protocol in the SASL
> > family, and even worse in our case we use SCRAM-SHA-256. I'd like to
> > change pg_hba.conf to be as follows:
> > local $dbname $user sasl protocol=scram_sha_256
> > This is extensible for the future, and protocol is a mandatory option
> > that would have now just a single value: scram_sha_256. Heikki,
> > others, are you fine with that?
> 
> I have implemented that as 0009, which is attached and needs to be
> applied on top of the rest of the patches upthread. I am not sure if we
> want to make the case where no protocol is specified map to everything.
> That could become tricky for users if new SASL authentication mechanisms
> are added in the future.
> 
> Another issue that I have is: do we really want to have
> password_encryption being set to "scram" for verifiers of
> SCRAM-SHA-256? I would think that scram_sha_256 makes the most sense.
> Who knows, perhaps in a couple of years there could be a SCRAM-SHA-512..
> 
> At the same time, attached is a new version of 0008 that implements
> SASLprep, I have stabilized the beast after fixing some allocation
> calculations when converting the decomposed pg_wchar array back to a
> utf8 string.
> -- 
> Michael

-- 
Best regards,
Aleksander Alekseev




[HACKERS] pg_monitor role

2017-02-20 Thread Dave Page
Further to the patch I just submitted
(https://www.postgresql.org/message-id/CA%2BOCxow-X%3DD2fWdKy%2BHP%2BvQ1LtrgbsYQ%3DCshzZBqyFT5jOYrFw%40mail.gmail.com)
I'd like to propose the addition of a default role, pg_monitor.

The intent is to make it easy for users to setup a role for fully
monitoring their servers, without requiring superuser level privileges
which is a problem for many users working within strict security
policies.

At present, functions or system config info that divulge any
installation path related info typically require superuser privileges.
This makes monitoring for unexpected changes in configuration or
filesystem level monitoring (e.g. checking for large numbers of WAL
files or log file info) impossible for non-privileged roles.

A similar example is the restriction on the pg_stat_activity.query
column, which prevents non-superusers seeing any query strings other
than their own.

Using ACLs is a problem for a number of reasons:

- Users often don't like their database schemas to be modified
(cluttered with GRANTs).
- ACL modifications would potentially have to be made in every
database in a cluster.
- Using a pre-defined role minimises the setup that different tools
would have to require.
- Not all functionality has an ACL (e.g. SHOW)

Other DBMSs solve this problem in a similar way.

Initially I would propose that permission be granted to the role to:

- Execute pg_ls_logdir() and pg_ls_waldir()
- Read pg_stat_activity, including the query column for all queries.
- Allow "SELECT pg_tablespace_size('pg_global')"
- Read all GUCs

In the future I would also like to see us add additional roles for
system administration functions, for example, a backup operator role
that would have the appropriate rights to make and restore backups.

Comments?

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] pg_ls_waldir() & pg_ls_logdir()

2017-02-20 Thread Dave Page
Hi

Following various conversations on list and in person, including the
developer meeting in Brussels earlier this month, here is a patch that
implements pg_ls_logdir() and pg_ls_waldir() functions.

The ultimate aim of this (and followup work I'll be doing) is to
provide functionality to enable monitoring of PostgreSQL without
requiring a user with superuser permissions as many of us have users
for whom security policies prevent this or make it very difficult.

In order to achieve that, there are various pieces of functionality
such as pg_ls_dir() that need to have superuser checks removed to
allow permissions to be granted to a monitoring role. There were
objections in previous discussions to doing this with such generic
functions, hence this patch which adds two narrowly focussed functions
to allow tools to monitor the contents of the log and WAL directories.
Neither function has a hard-coded superuser check, but have ACLs that
prevent public execution by default.

Patch includes the code and doc updates.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d7738b18b7..ecd17a3528 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19360,6 +19360,24 @@ postgres=# SELECT * FROM 
pg_walfile_name_offset(pg_stop_backup());
   
   

+pg_ls_logdir()
+   
+   setof record
+   
+List the name, last modification time and size of files in the log 
directory.
+   
+  
+  
+   
+pg_ls_waldir()
+   
+   setof record
+   
+List the name, last modification time and size of files in the WAL 
directory.
+   
+  
+  
+   
 pg_read_file(filename text 
[, offset bigint, length bigint 
[, missing_ok boolean] ])

text
@@ -19390,7 +19408,7 @@ postgres=# SELECT * FROM 
pg_walfile_name_offset(pg_stop_backup());

 

-All of these functions take an optional missing_ok parameter,
+Some of these functions take an optional missing_ok 
parameter,
 which specifies the behavior when the file or directory does not exist.
 If true, the function returns NULL (except
 pg_ls_dir, which returns an empty result set). If
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index 38be9cf1a0..4b67102439 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1096,3 +1096,6 @@ REVOKE EXECUTE ON FUNCTION pg_stat_reset() FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_shared(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_single_table_counters(oid) FROM 
public;
 REVOKE EXECUTE ON FUNCTION pg_stat_reset_single_function_counters(oid) FROM 
public;
+
+REVOKE EXECUTE ON FUNCTION pg_ls_logdir() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_ls_waldir() FROM public;
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 66d09bcb0c..a8cdf3bcbf 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -15,12 +15,14 @@
 #include "postgres.h"
 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 
 #include "access/sysattr.h"
+#include "access/xlog_internal.h"
 #include "catalog/pg_authid.h"
 #include "catalog/catalog.h"
 #include "catalog/pg_tablespace.h"
@@ -46,6 +48,8 @@
 
 #define atooid(x)  ((Oid) strtoul((x), NULL, 10))
 
+/* Generic function to return a directory listing of files */
+Datum pg_ls_dir_files(FunctionCallInfo fcinfo, char *dir);
 
 /*
  * Common subroutine for num_nulls() and num_nonnulls().
@@ -892,3 +896,109 @@ parse_ident(PG_FUNCTION_ARGS)
 
PG_RETURN_DATUM(makeArrayResult(astate, CurrentMemoryContext));
 }
+
+
+typedef struct
+{
+   char*location;
+   DIR *dirdesc;
+} directory_fctx;
+
+/* Generic function to return a directory listing of files */
+Datum
+pg_ls_dir_files(FunctionCallInfo fcinfo, char *dir)
+{
+   FuncCallContext *funcctx;
+   struct dirent *de;
+   directory_fctx *fctx;
+
+   if (SRF_IS_FIRSTCALL())
+   {
+   MemoryContext oldcontext;
+   TupleDesc   tupdesc;
+
+   funcctx = SRF_FIRSTCALL_INIT();
+   oldcontext = 
MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+   fctx = palloc(sizeof(directory_fctx));
+
+   tupdesc = CreateTemplateTupleDesc(3, false);
+   TupleDescInitEntry(tupdesc, (AttrNumber) 1, "mtime",
+  TIMESTAMPTZOID, -1, 0);
+   TupleDescInitEntry(tupdesc, (AttrNumber) 2, "size",
+  INT8OID, -1, 0);
+   TupleDescInitEntry(tupdesc, (AttrNumber) 3, "file",
+  TEXTOID, -1, 0);
+
+   funcctx->attinmeta = 

Re: [HACKERS] Replication vs. float timestamps is a disaster

2017-02-20 Thread Andres Freund
On 2017-02-20 11:58:12 +0100, Petr Jelinek wrote:
> That being said, I did wonder myself if we should just deprecate float
> timestamps as well.

I think we need a proper deprecation period for that, given that the
conversion away will be painful for pg_upgrade using people with big
clusters.  So I think we should fix this regardless... :(




Re: [HACKERS] Replication vs. float timestamps is a disaster

2017-02-20 Thread Petr Jelinek
On 20/02/17 08:03, Andres Freund wrote:
> On 2017-02-19 10:49:29 -0500, Tom Lane wrote:
>> Robert Haas  writes:
>>> On Sun, Feb 19, 2017 at 3:31 AM, Tom Lane  wrote:
 Thoughts?  Should we double down on trying to make this work according
 to the "all integer timestamps" protocol specs, or cut our losses and
 change the specs?
>>
>>> I vote for doubling down.  It's bad enough that we have so many
>>> internal details that depend on this setting; letting that cascade
>>> into the wire protocol seems like it's just letting the chaos spread
>>> farther and wider.
>>
>> How do you figure that it's not embedded in the wire protocol already?
>> Not only the replicated data for a timestamp column, but also the
>> client-visible binary I/O format, depend on this.  I think having some
>> parts of the protocol use a different timestamp format than other parts
>> is simply weird, and as this exercise has shown, it's bug-prone as all
>> get out.
> 
> I don't think it's that closely tied together atm. Things like
> pg_basebackup, pg_receivexlog etc should work, without having to match
> timestamp storage.  Logical replication, unless your output plugin dumps
> data in binary / "raw" output, also works just fine across the timestamp
> divide.
> 
> It doesn't sound that hard to add a SystemToIntTimestamp() function,
> given it only needs to do something if float timestamps are enabled?
> 

It's definitely not hard, we already have
IntegerTimestampToTimestampTz() which does the opposite conversion anyway.

That being said, I did wonder myself if we should just deprecate float
timestamps as well.

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] wait events for disk I/O

2017-02-20 Thread Rushabh Lathia
My colleague Rahila reported a compilation issue with the patch. The
issue only appeared when doing a clean build on the branch.

I have fixed it in the latest version of the patch.

Thanks,


On Tue, Jan 31, 2017 at 11:09 AM, Rushabh Lathia 
wrote:

>
>
> On Tue, Jan 31, 2017 at 8:54 AM, Michael Paquier <
> michael.paqu...@gmail.com> wrote:
>
>> On Mon, Jan 30, 2017 at 10:01 PM, Rushabh Lathia
>>  wrote:
>> > Attached is the patch, which extend the existing wait event
>> infrastructure
>> > to implement the wait events for the disk I/O. Basically
>> pg_stat_activity's
>> > wait event information to show data about disk I/O as well as IPC
>> primitives.
>> >
>> > Implementation details:
>> >
>> > - Added PG_WAIT_IO to pgstat.h and a new enum WaitEventIO
>> > - Added a wait_event_info argument to FileRead, FileWrite, FilePrefetch,
>> > FileWriteback, FileSync, and FileTruncate. Set this wait event just
>> before
>> > performing the file system operation and clear it just after.
>> > - Pass down an appropriate wait event from  caller of any of those
>> > functions.
>> > - Also set and clear a wait event around standalone calls to read(),
>> > write(), fsync() in other parts of the system.
>> > - Added documentation for all newly added wait event.
>>
>> Looks neat, those are unlikely to overlap with other wait events.
>>
>
> Thanks.
>
>
>>
>> > Open issue:
>> > - Might missed few standalone calls to read(), write(), etc which need
>> > to pass the wait_event_info.
>>
>> It may be an idea to use LD_PRELOAD with custom versions of read(),
>> write() and fsync(), and look at the paths where no flags are set in
>> MyProc->wait_event_info, and log information when that happens.
>>
>>
> Yes, maybe I will try this.
>
>
>> > Thanks to my colleague Robert Haas for his help in design.
>> > Please let me know your thought, and thanks for reading.
>>
>> Did you consider a wrapper of the type pg_read_event() or
>> pg_write_event(), etc.?
>>
>
> I thought about that, but eventually stuck with this approach as it looks
> neater and more uniform with the other wait event implementations.
>
>
>
>> --
>> Michael
>>
>
>
>
> Thanks,
> Rushabh Lathia
> www.EnterpriseDB.com
>



-- 
Rushabh Lathia


wait_event_for_disk_IO_v1.patch
Description: application/download



Re: [HACKERS] Partitioned tables and relfilenode

2017-02-20 Thread Amit Langote
On 2017/02/19 18:53, Robert Haas wrote:
> On Fri, Feb 17, 2017 at 1:12 PM, Amit Langote wrote:
>> Do you mean that if database-wide analyze is to be run, we should also
>> exclude those RELKIND_RELATION relations that are partitions?
>>
>> So the only way to update a partition's statistics is to directly specify
>> it in the command or by autovacuum.
> 
> I think if you type:
> 
> ANALYZE;
> 
> ...that should process all partitioned tables and all tables that are
> not themselves partitions.

OK.

> If you type:
> 
> ANALYZE name;
> 
> ...that should ANALYZE that relation, whatever it is.  If it's a
> partitioned table, it should recurse.

To be clear, by "recurse" I assume you mean performing ANALYZE on
individual partitions, not just collecting the inheritance statistics. So
ANALYZE partitioned_table would both a) collect the inheritance statistics
for the specified table and other partitioned tables in the hierarchy, and
b) ANALYZE every leaf partition, updating its statistics in pg_class.

While working on this, I noticed that autovacuum.c does not collect
RELKIND_PARTITIONED_TABLE relations, which I think is not right.  It
should match what get_rel_oids() does, which in the database-wide
VACUUM/ANALYZE case collects them.

Attached updated patches:

Updated 0001 addresses the following comments:

 - should recurse when vacuum/analyze is performed on a partitioned table
 - database-wide vacuum should ignore partitioned tables
 - database-wide analyze should ignore partitions; only the inheritance
   statistics of the partitioned tables must be collected in this case
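
To illustrate, a minimal SQL sketch of the intended behaviour (the table
names are hypothetical, and this describes the proposed semantics, not
anything committed):

```sql
CREATE TABLE measurements (logdate date NOT NULL) PARTITION BY RANGE (logdate);
CREATE TABLE measurements_y2017 PARTITION OF measurements
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

-- Database-wide: processes all partitioned tables and all tables that
-- are not themselves partitions; for partitioned tables only the
-- inheritance statistics are collected.
ANALYZE;

-- Naming the partitioned table recurses: inheritance statistics for the
-- hierarchy, plus per-leaf-partition statistics in pg_class.
ANALYZE measurements;

-- A leaf partition's statistics can still be updated by naming it
-- directly (or via autovacuum).
ANALYZE measurements_y2017;
```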

Thanks,
Amit
>From 982366eb019585438513b0c7b6e86acc74e459d1 Mon Sep 17 00:00:00 2001
From: amit 
Date: Mon, 6 Feb 2017 18:03:28 +0900
Subject: [PATCH 1/3] Partitioned tables are empty themselves

So there is not much point in trying to do things to them that require
accessing files (a later commit will make it so that there is no file
at all to access).  Things that needed attention are: vacuum, analyze,
truncate, ATRewriteTables.
---
 src/backend/commands/analyze.c  | 39 ++--
 src/backend/commands/tablecmds.c| 15 ++--
 src/backend/commands/vacuum.c   | 71 +
 src/backend/postmaster/autovacuum.c | 20 +++
 4 files changed, 118 insertions(+), 27 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index ed3acb1673..12cd0110b0 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -201,8 +201,7 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
 	 * locked the relation.
 	 */
 	if (onerel->rd_rel->relkind == RELKIND_RELATION ||
-		onerel->rd_rel->relkind == RELKIND_MATVIEW ||
-		onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		onerel->rd_rel->relkind == RELKIND_MATVIEW)
 	{
 		/* Regular table, so we'll use the regular row acquisition function */
 		acquirefunc = acquire_sample_rows;
@@ -234,6 +233,12 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
 			return;
 		}
 	}
+	else if (onerel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+	{
+		/*
+		 * For partitioned tables, we want to do the recursive ANALYZE below.
+		 */
+	}
 	else
 	{
 		/* No need for a WARNING if we already complained during VACUUM */
@@ -253,13 +258,15 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
 	LWLockRelease(ProcArrayLock);
 
 	/*
-	 * Do the normal non-recursive ANALYZE.
+	 * Do the normal non-recursive ANALYZE, non-partitioned relations.
 	 */
-	do_analyze_rel(onerel, options, params, va_cols, acquirefunc, relpages,
-				   false, in_outer_xact, elevel);
+	if (onerel->rd_rel->relkind != RELKIND_PARTITIONED_TABLE)
+		do_analyze_rel(onerel, options, params, va_cols, acquirefunc,
+					   relpages, false, in_outer_xact, elevel);
 
 	/*
-	 * If there are child tables, do recursive ANALYZE.
+	 * If there are child tables, do recursive ANALYZE.  This includes
+	 * partitioned tables.
 	 */
 	if (onerel->rd_rel->relhassubclass)
 		do_analyze_rel(onerel, options, params, va_cols, acquirefunc, relpages,
@@ -1316,10 +1323,20 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 			continue;
 		}
 
+		/* Ignore partitioned tables as there are no tuples to collect */
+		if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+		{
+			/* Don't try to unlock the passed-in root table. */
+			if (childrel == onerel)
+				heap_close(childrel, NoLock);
+			else
+				heap_close(childrel, AccessShareLock);
+			continue;
+		}
+
 		/* Check table type (MATVIEW can't happen, but might as well allow) */
 		if (childrel->rd_rel->relkind == RELKIND_RELATION ||
-			childrel->rd_rel->relkind == RELKIND_MATVIEW ||
-			childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+			childrel->rd_rel->relkind == RELKIND_MATVIEW)
 		{
 			/* Regular table, so use the regular row acquisition function */
 			acquirefunc = acquire_sample_rows;
@@ -1366,9 +1383,11 @@ 

Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-20 Thread Amit Kapila
On Mon, Feb 20, 2017 at 3:01 PM, Simon Riggs  wrote:
> On 20 February 2017 at 09:15, Amit Kapila  wrote:
>> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada  
>> wrote:
>>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas  wrote:
>>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs  wrote:
>>>>> On 15 February 2017 at 08:07, Masahiko Sawada  wrote:
>>>>>> It's a bug. Attached latest version patch, which passed make check.
>>>>>
>>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>>>>> half-dead pages. So skipping an index vacuum might mean that second
>>>>> index scan never happens at all, which would be bad.
>>>>
>>>> Maybe.  If there are a tiny number of those half-dead pages in a huge
>>>> index, it probably doesn't matter.  Also, I don't think it would never
>>>> happen, unless the table just never gets any more updates or deletes -
>>>> but that case could also happen today.  It's just a matter of
>>>> happening less frequently.
>>>
>>
>> Yeah, that's right, and I am not sure it is worth performing a
>> complete pass to reclaim dead/deleted pages unless we know somehow
>> that there are many such pages.
>
> Agreed which is why
> On 16 February 2017 at 11:17, Simon Riggs  wrote:
>> I suggest that we store the number of half-dead pages in the metapage
>> after each VACUUM, so we can decide whether to skip the scan or not.
>
>
>> Also, I think we do reclaim the
>> complete page while allocating a new page in btree.
>
> That's not how it works according to the README at least.
>

I am referring to the code path _bt_getbuf() -> if (_bt_page_recyclable(page));
won't that help us reclaim the space?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Should we cacheline align PGXACT?

2017-02-20 Thread Ashutosh Sharma
Hi,

> >> On Thu, Feb 16, 2017 at 5:07 AM, Alexander Korotkov
> >>  wrote:
> >> > On Wed, Feb 15, 2017 at 8:49 PM, Alvaro Herrera
> >> > 
> >> > wrote:
> >> >> Alexander Korotkov wrote:
> >> >>
> >> >> > Difference between master, pgxact-align-2 and pgxact-align-3
> doesn't
> >> >> > exceed
> >> >> > per run variation.
> >> >>
> >> >> FWIW this would be more visible if you added error bars to each data
> >> >> point.  Should be simple enough in gnuplot ...
> >> >
> >> > Good point.
> >> > Please find graph of mean and errors in attachment.
> >>
> >> So ... no difference?
> >
> >
> > Yeah, nothing surprising.  It's just another graph based on the same
> data.
> > I wonder how pgxact-align-3 would work on machine of Ashutosh Sharma,
> > because I observed regression there in write-heavy benchmark of
> > pgxact-align-2.
>
> I am yet to benchmark pgxact-align-3 patch on a read-write workload. I
> could not do it as our benchmarking machine was already reserved for
> some other test work. But, I am planning to do it on this weekend.
> Will try to post my results by Monday evening. Thank you and sorry for
> the delayed response.
>

Following are the pgbench results for a read-write workload that I got
with the pgxact-align-3 patch. The results are for scale factor 300
with 8GB of shared buffers, i.e., when the data fits into shared
buffers. The test for scale factor 1000 with 8GB shared buffers is
still running; once it completes, I will share those results as well.

*pgbench settings:*
pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading  postgres

where, time_for_reading = 30mins

*non default GUC param:*
shared_buffers=8GB
max_connections=300

pg_xlog is located on an SSD.

*Machine details:*
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 8830 @ 2.13GHz


CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
           4         4283          4220     -1.47093159
           8         7291          7261    -0.4114661912
          16        11909         12149     2.015282559
          32        20789         20745    -0.211650392
          64        28412         27831    -2.044910601
         128        29369         28765    -2.056590282
         156        27949         27189    -2.719238613
         180        27876         27171    -2.529057254
         196        28849         27872    -3.386599189
         256        30321         28188    -7.034728406

Also, Excel sheet (results-read-write-300-SF) containing the results for
all the 3 runs is attached.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


results-read-write-300-SF.xlsx
Description: MS-Excel 2007 spreadsheet



Re: [HACKERS] UPDATE of partition key

2017-02-20 Thread Thomas Munro
On Mon, Feb 20, 2017 at 3:36 PM, Thomas Munro
 wrote:
> On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas  wrote:
>> Generally speaking, we don't throw
>> serialization errors today at READ COMMITTED, so if we do so here,
>> that's going to be a noticeable and perhaps unwelcome change.
>
> Yes we do:
>
> https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ

Oops -- please ignore, I misread that as repeatable read.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] UPDATE of partition key

2017-02-20 Thread Thomas Munro
On Thu, Feb 16, 2017 at 8:53 PM, Robert Haas  wrote:
> Generally speaking, we don't throw
> serialization errors today at READ COMMITTED, so if we do so here,
> that's going to be a noticeable and perhaps unwelcome change.

Yes we do:

https://www.postgresql.org/docs/9.6/static/transaction-iso.html#XACT-REPEATABLE-READ

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] case_preservation_and_insensitivity = on

2017-02-20 Thread Joel Jacobson
On Mon, Feb 20, 2017 at 1:45 AM, Tom Lane  wrote:
> The versions of autocommit that have actually stood the test of time were
> implemented on the client side (in psql and JDBC, and I think ODBC as
> well), where the scope of affected code was lots smaller.  I wonder
> whether there's any hope of providing something useful for case-folding
> in that way.  psql's lexer is already smart enough that you could teach it
> rules like "smash any unquoted identifier to lower case" (probably it
> would fold keywords too, but that seems OK).  That's probably not much
> help for custom applications, which aren't likely to be going through
> psql scripts; but the fact that such behavior is in reach at all on the
> client side seems encouraging.

This sounds like a really good solution to me, since there is actually
nothing missing on the PostgreSQL server side; it's merely a matter of
inconvenience on the client side.

As long as the definitions of the database objects stored in the git
repo can be written without double-quotes, i.e.

CREATE TABLE Users (

instead of

CREATE TABLE "Users" (

with the object created as "Users" with a capital "U", then I see no
problem.

Most people probably use psql to initialize a database instance of
their project locally, so if psql had a --preserve-case option, that
would solve the problem of creating new objects.
Or maybe --no-case-folding is a better name for the option.




Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-20 Thread Simon Riggs
On 20 February 2017 at 09:15, Amit Kapila  wrote:
> On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada  
> wrote:
>> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas  wrote:
>>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs  wrote:
>>>> On 15 February 2017 at 08:07, Masahiko Sawada  wrote:
>>>>> It's a bug. Attached latest version patch, which passed make check.
>>>>
>>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>>>> half-dead pages. So skipping an index vacuum might mean that second
>>>> index scan never happens at all, which would be bad.
>>>
>>> Maybe.  If there are a tiny number of those half-dead pages in a huge
>>> index, it probably doesn't matter.  Also, I don't think it would never
>>> happen, unless the table just never gets any more updates or deletes -
>>> but that case could also happen today.  It's just a matter of
>>> happening less frequently.
>>
>
> Yeah, that's right, and I am not sure it is worth performing a
> complete pass to reclaim dead/deleted pages unless we know somehow
> that there are many such pages.

Agreed which is why
On 16 February 2017 at 11:17, Simon Riggs  wrote:
> I suggest that we store the number of half-dead pages in the metapage
> after each VACUUM, so we can decide whether to skip the scan or not.


> Also, I think we do reclaim the
> complete page while allocating a new page in btree.

That's not how it works according to the README at least.

You might be referring to cleaning out killed tuples just before a
page split? That's something different.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] case_preservation_and_insensitivity = on

2017-02-20 Thread Joel Jacobson
On Mon, Feb 20, 2017 at 2:40 AM, Jim Nasby  wrote:
> Even if the project decided that "Users" and users is stupid and that we
> should deprecate it, I think the odds of also deciding to tell existing
> users to re-write their apps are zero.

But if the feature can't be turned on without also enforcing lowercase
uniqueness, then the described problem situation will never happen.
Any existing projects that want to use the new feature but can't, due
to conflicting names, will simply have to live without it, just as
they already do.

> So no matter how this is designed, there has to be some way for existing
> users to be able to continue relying on "Users" and users being different.

There is a way: simply don't switch on the feature,
and "Users" and "users" will continue to be different.

> What would work is an initdb option that controls this: when ignoring case
> for uniqueness is disabled, your new column would simply be left as NULL.
> With some extra effort you could probably allow changing that on a running
> database as well, just not with something as easy to change as a GUC.

initdb option sounds good to me, just like you specify e.g.  --encoding.

Also, I think the --lowercase-uniqueness feature would be useful by
itself, even without the --case-preserving feature, since it might be
a good way to enforce good design in new databases, as a mix of
"users" and "Users" is probably considered ugly by many system
designers.




Re: [HACKERS] UPDATE of partition key

2017-02-20 Thread Amit Khandekar
On 16 February 2017 at 20:53, Robert Haas  wrote:
> On Thu, Feb 16, 2017 at 5:47 AM, Greg Stark  wrote:
>> On 13 February 2017 at 12:01, Amit Khandekar  wrote:
>>> There are a few things that can be discussed about :
>>
>> If you do a normal update the new tuple is linked to the old one using
>> the ctid forming a chain of tuple versions. This tuple movement breaks
>> that chain.  So the question I had reading this proposal is what
>> behaviour depends on ctid and how is it affected by the ctid chain
>> being broken.
>
> I think this is a good question.
>
>> I think the concurrent update case is just a symptom of this. If you
>> try to update a row that's locked for a concurrent update you normally
>> wait until the concurrent update finishes, then follow the ctid chain
>> and recheck the where clause on the target of the link and if it still
>> matches you perform the update there.
>
> Right.  EvalPlanQual behavior, in short.
>
>> At least you do that if you have isolation_level set to
>> repeatable_read. If you have isolation level set to serializable then
>> you just fail with a serialization failure. I think that's what you
>> should do if you come across a row that's been updated with a broken
>> ctid chain even in repeatable read mode. Just fail with a
>> serialization failure and document that in partitioned tables if you
>> perform updates that move tuples between partitions then you need to
>> be ensure your updates are prepared for serialization failures.
>
> Now, this part I'm not sure about.  What's pretty clear is that,
> barring some redesign of the heap format, we can't keep the CTID chain
> intact when the tuple moves to a different relfilenode.  What's less
> clear is what to do about that.  We can either (1) give up on
> EvalPlanQual behavior in this case and act just as we would if the row
> had been deleted; no update happens.

This is what the current patch has done.

> or (2) throw a serialization
> error.  You're advocating for #2, but I'm not sure that's right,
> because:
>
> 1. It's a lot more work,
>
> 2. Your proposed implementation needs an on-disk format change that
> uses up a scarce infomask bit, and
>
> 3. It's not obvious to me that it's clearly preferable from a user
> experience standpoint.  I mean, either way the user doesn't get the
> behavior that they want.  Either they're hoping for EPQ semantics and
> they instead do a no-op update, or they're hoping for EPQ semantics
> and they instead get an ERROR.  Generally speaking, we don't throw
> serialization errors today at READ COMMITTED, so if we do so here,
> that's going to be a noticeable and perhaps unwelcome change.
>
> More opinions welcome.

I am inclined to at least have some option for the user to decide the
behaviour. In the future we can even consider supporting walking
through the ctid chain across multiple relfilenodes. But till then, we
need to decide on a default behaviour. My inclination is more towards
erroring out in the unfortunate event where there is an UPDATE while
the row movement is happening. One option is to not bother finding out
whether the DELETE was part of partition row movement or was indeed a
DELETE, and always error out the UPDATE when heap_update() returns
HeapTupleUpdated, but only if the table is a leaf partition. But this
will obviously cause annoyance because of the chance of getting such
errors when there are concurrent updates and deletes in the same
partition. But we could keep a table-level option for determining
whether to error out or silently lose the UPDATE.

Another option I was thinking of: when the UPDATE is on a partition
key, acquire ExclusiveLock (not AccessExclusiveLock) only on that
partition, so that selects will continue to execute, but UPDATE/DELETE
will wait before opening the table for a scan. An UPDATE of the
partition key is not going to be a very routine operation; it sounds
more like a DBA maintenance operation, so it does not look like it
would come in between usual transactions.
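
A minimal sketch of the concurrent scenario under discussion (schema
hypothetical; the two outcomes shown are the options being debated, not
settled behaviour):

```sql
CREATE TABLE t (k int, v text) PARTITION BY RANGE (k);
CREATE TABLE t1 PARTITION OF t FOR VALUES FROM (0) TO (10);
CREATE TABLE t2 PARTITION OF t FOR VALUES FROM (10) TO (20);
INSERT INTO t VALUES (5, 'old');

-- Session 1: the UPDATE changes the partition key, so the row moves
-- from t1 to t2 (internally a DELETE plus an INSERT, which breaks the
-- ctid chain).
BEGIN;
UPDATE t SET k = 15 WHERE k = 5;

-- Session 2 (READ COMMITTED): blocks on the row in t1.  After session 1
-- commits, either (1) the UPDATE behaves as if the row was deleted and
-- becomes a no-op, or (2) it raises a serialization error -- the choice
-- this subthread is about.
UPDATE t SET v = 'new' WHERE k = 5;
```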




Re: [HACKERS] Gather Merge

2017-02-20 Thread Amit Kapila
On Mon, Feb 20, 2017 at 1:58 PM, Dilip Kumar  wrote:
> On Mon, Feb 20, 2017 at 12:05 PM, Rushabh Lathia
>  wrote:
>> Thanks Amit for raising this point. I was not at all aware of mark/restore.
>> I tried to come up with the case, but haven't found such case.
>>
>> For now here is the patch with comment update.
>
> I think for reproducing this you need a plan something like the one
> below (I think this is a really bad plan, but you can use it to test
> this particular case).
>
> MergeJoin
> -> Index Scan
> -> Gather Merge
>    -> Parallel Index Scan
>
> So if only an Index Scan node is present as the inner node, which
> supports Mark/Restore, then we don't need to insert any Materialize
> node. But after we put a Gather Merge there (which doesn't support
> Mark/Restore), we need a Materialize node on top of it. Therefore, the
> plan should become like this, I think.
> (But anyway, if we had a Gather instead of the Gather Merge, we would
> require a Sort node on top of the Gather, and a Materialize is
> obviously cheaper than the Sort.)
>
> MergeJoin
> -> Index Scan
> -> Materialize
>    -> Gather Merge  (does not support Mark/Restore)
>       -> Parallel Index Scan
>

Yes, that's exactly what will happen. However, we are not sure how
often such a plan (Gather Merge on the inner side of a merge join)
will be helpful, and whether adding Mark/Restore support would make it
any faster than just adding a Materialize on top of the Gather Merge.
So it seems better not to go there unless we see some use for it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] increasing the default WAL segment size

2017-02-20 Thread Beena Emerson
Hello,

On Thu, Feb 16, 2017 at 3:37 PM, Kuntal Ghosh 
wrote:

> On Mon, Feb 6, 2017 at 11:09 PM, Beena Emerson 
> wrote:
> >
> > Hello,
> > PFA the updated patches.
> I've started reviewing the patches.
> 01-add-XLogSegmentOffset-macro.patch looks clean to me. I'll post my
> detailed review after looking into the second patch. But, both the
> patches need a rebase based on the commit 85c11324cabaddcfaf3347df7
> (Rename user-facing tools with "xlog" in the name to say "wal").
>
>

PFA the rebased patches.


Thank you,

Beena Emerson

Have a Great Day!


01-add-XLogSegmentOffset-macro.patch
Description: Binary data


02-initdb-walsegsize-v2.patch
Description: Binary data



Re: [HACKERS] GUC for cleanup indexes threshold.

2017-02-20 Thread Amit Kapila
On Mon, Feb 20, 2017 at 7:26 AM, Masahiko Sawada  wrote:
> On Fri, Feb 17, 2017 at 3:41 AM, Robert Haas  wrote:
>> On Thu, Feb 16, 2017 at 6:17 AM, Simon Riggs  wrote:
>>> On 15 February 2017 at 08:07, Masahiko Sawada  wrote:
>>>> It's a bug. Attached latest version patch, which passed make check.
>>>
>>> In its current form, I'm not sure this is a good idea. Problems...
>>>
>>> 1. I'm pretty sure the world doesn't need another VACUUM parameter
>>>
>>> I suggest that we use the existing vacuum scale factor/4 to reflect
>>> that indexes are more sensitive to bloat.
>>
>> I do not think it's a good idea to control multiple behaviors with a
>> single GUC.  We don't really know that dividing by 4 will be right for
>> everyone, or even for most people.  It's better to have another
>> parameter with a sensible default than to hardcode a ratio that might
>> work out poorly for some people.
>>
>>> 2. The current btree vacuum code requires 2 vacuums to fully reuse
>>> half-dead pages. So skipping an index vacuum might mean that second
>>> index scan never happens at all, which would be bad.
>>
>> Maybe.  If there are a tiny number of those half-dead pages in a huge
>> index, it probably doesn't matter.  Also, I don't think it would never
>> happen, unless the table just never gets any more updates or deletes -
>> but that case could also happen today.  It's just a matter of
>> happening less frequently.
>

Yeah, that's right, and I am not sure it is worth performing a
complete pass to reclaim dead/deleted pages unless we know somehow
that there are many such pages.  Also, I think we do reclaim the
complete page while allocating a new page in btree.

> The half-dead pages are never cleaned up if the ratio of pages
> containing garbage is always lower than the threshold.
>

Which threshold are you referring here?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Documentation improvements for partitioning

2017-02-20 Thread Robert Haas
On Mon, Feb 20, 2017 at 2:09 AM, Simon Riggs  wrote:
> On 15 February 2017 at 15:46, Robert Haas  wrote:
>>> It leaves me asking what else is missing.
>>
>> There is certainly a lot of room for improvement here but I don't
>> understand your persistent negativity about what's been done thus far.
>> I think it's pretty clearly a huge step forward, and I think Amit
>> deserves a ton of credit for making it happen.  The improvements in
>> bulk loading performance alone are stupendous.  You apparently have
>> the idea that somebody could have written an even larger patch that
>> solved even more problems at once, but this was already a really big
>> patch, and IMHO quite a good one.
>
> Please explain these personal comments against me.

Several of your emails, including your first post to this thread,
seemed to me to be quite negative about the state of this feature.  I
don't think that's warranted, though perhaps I am misreading your
tone, as I have been known to do.  I also don't think that expressing
the opinion that the feature is better than you're giving it credit
for is a personal comment against you.  Where exactly do you see a
personal comment against you in what I wrote?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Gather Merge

2017-02-20 Thread Dilip Kumar
On Mon, Feb 20, 2017 at 12:05 PM, Rushabh Lathia
 wrote:
> Thanks Amit for raising this point. I was not at all aware of mark/restore.
> I tried to come up with the case, but haven't found such case.
>
> For now here is the patch with comment update.

I think for reproducing this you need a plan something like the one
below (I think this is a really bad plan, but you can use it to test
this particular case).

MergeJoin
-> Index Scan
-> Gather Merge
   -> Parallel Index Scan

So if only an Index Scan node is present as the inner node, which
supports Mark/Restore, then we don't need to insert any Materialize
node. But after we put a Gather Merge there (which doesn't support
Mark/Restore), we need a Materialize node on top of it. Therefore, the
plan should become like this, I think.
(But anyway, if we had a Gather instead of the Gather Merge, we would
require a Sort node on top of the Gather, and a Materialize is
obviously cheaper than the Sort.)

MergeJoin
-> Index Scan
-> Materialize
   -> Gather Merge  (does not support Mark/Restore)
      -> Parallel Index Scan
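
For what it's worth, a hedged sketch of how one might try to coax this
plan shape (the relation names and GUC toggling here are illustrative
assumptions, not a recipe from the thread):

```sql
-- Assume big_a and big_b are large tables, each with a btree index on
-- key, and that parallel plans are enabled.
SET enable_hashjoin = off;
SET enable_nestloop = off;

EXPLAIN (COSTS OFF)
SELECT * FROM big_a a JOIN big_b b ON a.key = b.key;
-- If the inner path ends up as a Gather Merge, the planner has to put
-- a Materialize node above it, since Gather Merge lacks Mark/Restore.
```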


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Push down more UPDATEs/DELETEs in postgres_fdw

2017-02-20 Thread Etsuro Fujita

On 2017/02/13 18:24, Rushabh Lathia wrote:

I started reviewing the patch again. The patch applied cleanly on the
latest source, and the regression tests pass with it. I also performed
some manual testing and haven't found any regression. The patch looks
much cleaner than the earlier version, and I don't have any major
concerns as such.


Thanks for the review!


Here are few comments:

1)

@@ -211,6 +211,12 @@ typedef struct PgFdwDirectModifyState
 	PGresult   *result;			/* result for query */
 	int			num_tuples;		/* # of result tuples */
 	int			next_tuple;		/* index of next one to return */
+	Relation	resultRel;		/* relcache entry for the target table */



Why we need resultRel? Can't we directly use dmstate->rel ?


The reason we need it is that in get_returning_data we pass
dmstate->rel to make_tuple_from_result_row, which requires that
dmstate->rel be NULL when the scan tuple is described by
fdw_scan_tlist.  So in that case we set dmstate->rel to NULL and use
dmstate->resultRel, the relcache entry for the target relation, which
is set in postgresBeginDirectModify.



2) In the patch, the scanrelid condition is sometimes written as
fscan->scan.scanrelid == 0 whereas in other places it is written as
fsplan->scan.scanrelid > 0. In fact, it is used differently within the
same function, for example in postgresBeginDirectModify. Can we make
this consistent?


Ok, done.


3)

+ * If UPDATE/DELETE on a join, create a RETURNING list used in the remote
+ * query.
+ */
+if (fscan->scan.scanrelid == 0)
+    returningList = make_explicit_returning_list(resultRelation, rel,
+                                                 returningList);
+

The above block can be moved inside the if (plan->returningLists)
condition above it, like this:

/*
 * Extract the relevant RETURNING list if any.
 */
if (plan->returningLists)
{
    returningList = (List *) list_nth(plan->returningLists,
                                      subplan_index);

    /*
     * If UPDATE/DELETE on a join, create a RETURNING list used in
     * the remote query.
     */
    if (fscan->scan.scanrelid == 0)
        returningList = make_explicit_returning_list(resultRelation,
                                                     rel,
                                                     returningList);
}


Done that way.

Another thing I noticed is duplicate work in apply_returning_filter; it 
initializes tableoid of an updated/deleted tuple if needed, but the core 
will do that (see ExecProcessReturning).  I removed that work from 
apply_returning_filter.



I am still doing some more testing with the patch; if I find anything
apart from this, I will raise it in another mail.


Thanks again!

Attached is an updated version of the patch.

Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/deparse.c
--- b/contrib/postgres_fdw/deparse.c
***
*** 130,142  static void deparseTargetList(StringInfo buf,
    Bitmapset *attrs_used,
    bool qualify_col,
    List **retrieved_attrs);
! static void deparseExplicitTargetList(List *tlist, List **retrieved_attrs,
  		  deparse_expr_cxt *context);
  static void deparseReturningList(StringInfo buf, PlannerInfo *root,
  	 Index rtindex, Relation rel,
  	 bool trig_after_row,
  	 List *returningList,
  	 List **retrieved_attrs);
  static void deparseColumnRef(StringInfo buf, int varno, int varattno,
   PlannerInfo *root, bool qualify_col);
  static void deparseRelation(StringInfo buf, Relation rel);
--- 130,151 
    Bitmapset *attrs_used,
    bool qualify_col,
    List **retrieved_attrs);
! static void deparseExplicitTargetList(List *tlist,
! 		  bool is_returning,
! 		  List **retrieved_attrs,
  		  deparse_expr_cxt *context);
  static void deparseReturningList(StringInfo buf, PlannerInfo *root,
  	 Index rtindex, Relation rel,
  	 bool trig_after_row,
  	 List *returningList,
  	 List **retrieved_attrs);
+ static void deparseExplicitReturningList(List *rlist,
+ 			 List **retrieved_attrs,
+ 			 deparse_expr_cxt *context);
+ static void pull_up_target_conditions(PlannerInfo *root, RelOptInfo *foreignrel,
+ 		  Index target_rel, List **target_conds);
+ static void extract_target_conditions(List **joinclauses, Index target_rel,
+ 		  List **target_conds);
  static void deparseColumnRef(StringInfo buf, int varno, int varattno,
   PlannerInfo *root, bool qualify_col);
  static void deparseRelation(StringInfo buf, Relation rel);
***
*** 165,171  static void deparseLockingClause(deparse_expr_cxt *context);
  static void appendOrderByClause(List *pathkeys, deparse_expr_cxt *context);
  static void appendConditions(List *exprs, deparse_expr_cxt *context);
  static void deparseFromExprForRel(StringInfo buf, PlannerInfo *root,
! 	RelOptInfo *joinrel, bool use_alias, List **params_list);
  static