Re: User defined data types in Logical Replication

2017-12-22 Thread Dang Minh Huong

On 2017/12/21 10:05, Masahiko Sawada wrote:


On Wed, Dec 20, 2017 at 5:39 PM, Huong Dangminh wrote:

Hi Sawada-san,


Thank you for the quick response. The changes look good to me. But I wonder
if the following changes need some comments to describe what each check
does.

-if (errarg->attnum < 0)
+if (errarg->local_attnum <0)
+return;
+
+rel = errarg->rel;
+remote_attnum = rel->attrmap[errarg->local_attnum];
+
+if (remote_attnum < 0)
  return;

Thanks, I have added some comments as you mentioned.


Thank you for updating the patch.

-   if (errarg->attnum < 0)
+   /* Check case of slot_store_error_callback() is called before
+* errarg.local_attnum is set. */
+   if (errarg->local_attnum <0)

This comment style isn't preferred by PostgreSQL code. Please refer to
https://www.postgresql.org/docs/current/static/source-format.html
--
$ git diff --check
src/backend/replication/logical/worker.c:291: trailing whitespace.
+   /* Check case of slot_store_error_callback() is called before

Thanks, I will be careful next time.

There is an extra white space in the patch.

I'm inclined to think SlotErrCallbackArg can have remote_attnum and
pass it to the callback function. That way, we don't need to handle the
case where local_attnum is not set yet.
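Something along these lines (just a sketch; the actual struct in the patch
may carry more fields):

    typedef struct SlotErrCallbackArg
    {
        LogicalRepRelMapEntry *rel;
        int         remote_attnum;  /* -1 if the error is not column-specific */
    } SlotErrCallbackArg;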

Nice.

Attached is a new version of the patch incorporating the above. Please review it.

Thanks for updating the patch.
It looks fine to me.


---
Thanks and best regards,
Dang Minh Huong




PATCH: logical_work_mem and logical streaming of large in-progress transactions

2017-12-22 Thread Tomas Vondra
Hi all,

Attached is a patch series that adds two features to logical
replication - the ability to define a memory limit for the reorderbuffer
(responsible for building the decoded transactions), and the ability to
stream large in-progress transactions (exceeding the memory limit).

I'm submitting those two changes together, because one builds on the
other, and it's beneficial to discuss them together.


PART 1: adding logical_work_mem memory limit (0001)
---

Currently, limiting the amount of memory consumed by logical decoding is
tricky (or you might say impossible) for several reasons:

* The value is hard-coded, so it's not quite possible to customize it.

* The amount of decoded changes to keep in memory is restricted by the
number of changes. It's not very clear how this relates to memory
consumption, as the change size depends on table structure, etc.

* The number is "per (sub)transaction", so a transaction with many
subtransactions may easily consume a significant amount of memory without
actually hitting the limit.

So the patch does two things. Firstly, it introduces logical_work_mem, a
GUC restricting memory consumed by all transactions currently kept in
the reorder buffer.

Secondly, it adds simple memory accounting, tracking the amount of
memory used in total (for the whole reorder buffer, to compare against
logical_work_mem) and per transaction (so that we can quickly pick a
transaction to spill to disk).
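So with the patch one simply sets something like this in postgresql.conf
(the name comes from the patch; the value below is just an illustration,
not a recommendation):

    logical_work_mem = 64MB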

The one wrinkle in the patch is that the memory limit can't be enforced
when reading changes spilled to disk - with multiple subtransactions, we
can't easily predict how many changes to pre-read for each of them. At
that point we still use the existing max_changes_in_memory limit.

Luckily, changes introduced in the other parts of the patch should allow
addressing this deficiency.


PART 2: streaming of large in-progress transactions (0002-0006)
---

Note: This part is split into multiple smaller chunks, addressing
different parts of the logical decoding infrastructure. That's mostly to
allow easier reviews, though. Ultimately, it's just one patch.

Processing large transactions often results in significant apply lag,
for a couple of reasons. One reason is network bandwidth - while we do
decode the changes incrementally (as we read the WAL), we keep them
locally, either in memory, or spilled to files. Then at commit time, all
the changes get sent to the downstream (and applied) at the same time.
For large transactions the time to do the network transfer may be
significant, causing apply lag.

This patch extends the logical replication infrastructure (output plugin
API, reorder buffer, pgoutput, replication protocol etc.) to allow
streaming of in-progress transactions instead of spilling them to local
files.

The extensions to the API are pretty straightforward. Aside from adding
methods to stream changes/messages and commit a streamed transaction,
the API needs a function to abort a streamed (sub)transaction, and
functions to demarcate a block of streamed changes.
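Roughly speaking, the new output plugin callbacks have the following shape.
The names and signatures below are only illustrative - see the attached
patches for the actual definitions:

    /* illustrative sketch, not the patch's exact signatures */
    void (*stream_start) (LogicalDecodingContext *ctx, ReorderBufferTXN *txn);
    void (*stream_stop) (LogicalDecodingContext *ctx, ReorderBufferTXN *txn);
    void (*stream_change) (LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                           Relation relation, ReorderBufferChange *change);
    void (*stream_message) (LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                            XLogRecPtr lsn, bool transactional,
                            const char *prefix, Size sz, const char *message);
    void (*stream_abort) (LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                          XLogRecPtr abort_lsn);
    void (*stream_commit) (LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                           XLogRecPtr commit_lsn);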

To decode a transaction, we need to know all its subtransactions, and
invalidations. Currently, those are only known at commit time (although
some assignments may be known earlier), but invalidations are only ever
written in the commit record.

So far that was fine, because we only decode/replay transactions at
commit time, when all of this is known (because it's either in commit
record, or written before it).

But for in-progress transactions (i.e. the subject of interest here),
that is not the case. So the patch modifies WAL-logging to ensure those
two bits of information are written immediately (for wal_level=logical).

For assignments that was fairly simple, thanks to existing caching. For
invalidations, it requires a new WAL record type and a couple of changes
in inval.c.

On the apply side, we simply receive the streamed changes and write them
into a file (one file per toplevel transaction, which is possible thanks
to the assignments being known immediately). Then at commit time the
changes are replayed locally, without having to copy a large chunk of
data over the network.


WAL overhead
---

Of course, these changes to WAL logging are not for free - logging
assignments individually (instead of multiple subtransactions at once)
means higher xlog record overhead. Similarly, (sub)transactions doing a
lot of DDL may result in a lot of invalidations written to WAL (again,
with full xlog record overhead per invalidation).

I've done a number of tests to measure the impact, and for extreme
corner cases the additional amount of WAL is about 40% in both cases.

By an "extreme corner case" I mean a workload intentionally triggering
many assignments/invalidations, without doing a lot of meaningful work.

For assignments, imagine a single-row table (no indexes), and a
transaction like this one:

BEGIN;
UPDATE t

Re: [HACKERS] [PATCH] Lockable views

2017-12-22 Thread Michael Paquier
On Fri, Dec 22, 2017 at 04:19:46PM +0900, Yugo Nagata wrote:
> I was busy and could not work on this patch. After reading the
> previous discussion, I still think the behavior of this patch would
> be right. So, I would like to reregister to CF 2018-1. Do I need to
> create a new entry on CF? or should I change the status to
> "Moved to next CF"?

This is entirely up to you. It may make sense to spawn a new thread and mark
the new patch set as v2, based on the feedback received for v1; it could also
make sense to continue with this thread. If the move involves a complete patch
rewrite with a rather different concept, a new thread is more appropriate in
my opinion.
-- 
Michael


signature.asc
Description: PGP signature


Re: AS OF queries

2017-12-22 Thread Michael Paquier
On Fri, Dec 22, 2017 at 11:08:02PM +, Greg Stark wrote:
> On 20 December 2017 at 12:45, Konstantin Knizhnik wrote:
> 
> > It seems to me that it will be not so difficult to implement them in
> > Postgres - we already have versions of tuples.
> > Looks like we only need to do three things:
> > 1. Disable autovacuum (autovacuum = off)
> 
> "The Wheel of Time turns, and Ages come and pass, leaving memories
> that become legend. Legend fades to myth, and even myth is long
> forgotten when the Age that gave it birth comes again"

I would be amazed if you have been able to finish the 14 volumes of the
series. There is a lot of content to take in.

> Postgres used to have time travel. I think it's come up more than once
> in the past as something that can probably never come back due to
> other decisions made. If more decisions have made it possible again
> that will be fascinating.

This subject has shown up a couple of times lately, and it would be
interesting to see where it goes. What I am sure about is that people are not
willing to emulate that with triggers and two extra columns per table.
-- 
Michael


signature.asc
Description: PGP signature


Re: genomic locus

2017-12-22 Thread Gene Selkov

> On Dec 22, 2017, at 1:53 AM, Teodor Sigaev  wrote:
> 
> Hmm, would you try to implement a separate type for querying? Similar to 
> tsquery, lquery (for ltree), jsquery etc.

That sounds like a good idea if I want to make an app that will only be
accessed through a purpose-built front end. Right now I'm messing with lots of
data, making numerous small one-off experiments. Since all of that stuff
consumes my brain power and keystrokes, I want to minimize all the ancillary
stuff or at least make it invisible. But I'll probably go ahead and add a
separate query-friendly type that, if nothing else, helps.

I think I can wrangle this type into GiST just by tweaking consistent(), 
union(), and picksplit(), if I manage to express my needs in C without breaking 
too many things. My first attempt segfaulted.

Here’s my plan of attack. I think that by setting a union of inconsistent loci 
(those on different contigs) to [0, MAX_INT], I will expose such a union to a 
huge penalty, even without having to do anything with the current penalty(). No 
query will go there.

The required change to consistent() is obvious: contigs do not match — go away.
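Roughly like this (just a sketch; "key" and "query" stand for the decompressed
index entry and the query locus):

    if (strcmp(key->contig, query->contig) != 0)
        return false;   /* different contigs can never match */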

The situation with picksplit() may be more tricky; I can’t imagine all possible 
consequences until I’ve seen how it works. Currently, with simple intervals, it 
sorts them by center points and splits the sorted list in half. I have already 
changed the internal sort function, picksplit_item_cmp(), to make sure the data 
are sorted on the contig first, then by center point (or any other geometric 
feature). I am thinking about splitting the list at the first inconsistent 
contig, sending the consistent first part with its well-defined geometry to the 
left page, and filling the right page with whatever remains. If the right page 
has inconsistent contigs, its bounding box will be [0, MAX_INT], and it should 
be again picked for splitting at the next iteration (if I understand the 
algorithm correctly).

If all goes to plan, I will end up with an index tree partitioned by contig at 
the top level and geometrically down from there. That will be as close as I can 
get to an array of contig-specific indices, without having to store data in 
separate tables.

What do you think of that?



I have a low-level technical question. Because I can’t anticipate the maximum 
length of contig names (and do not want to waste space), I have made the new 
locus type a varlena, like this:

#include "utils/varlena.h"

typedef struct LOCUS
{
  int32 l_len_; /* varlena header (do not touch directly!) */
  int32 start;
  int32 end;
  char  contig[FLEXIBLE_ARRAY_MEMBER];
} LOCUS;

#define LOCUS_SIZE(str) (offsetof(LOCUS, contig) + sizeof(str))

That flexible array member messes with me every time I need to copy it while 
deriving a new locus object from an existing one (or from a pair). What I ended 
up doing is this:

  LOCUS  *l = PG_GETARG_LOCUS_P(0);
  LOCUS  *new_locus;
  char   *contig;
  int     size;

  new_locus = (LOCUS *) palloc0(sizeof(*new_locus));
  contig = pstrdup(l->contig); // need this to determine the length of contig name at runtime
  size = LOCUS_SIZE(contig);
  SET_VARSIZE(new_locus, size);
  strcpy(new_locus->contig, contig);
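(The more direct variant I'm considering is to size the new object from the
contig itself and copy everything in one go - untested sketch, so I may well
be missing something:)

    int    contig_len = strlen(l->contig) + 1;      /* include the NUL */
    int    size = offsetof(LOCUS, contig) + contig_len;
    LOCUS *new_locus = (LOCUS *) palloc0(size);

    SET_VARSIZE(new_locus, size);
    new_locus->start = l->start;
    new_locus->end = l->end;
    memcpy(new_locus->contig, l->contig, contig_len);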

Is there a more direct way to clone a varlena structure (possibly assigning a 
differently-sized contig to it)? One that is also memory-safe?


Thank you,

—Gene


> Gene Selkov wrote:
>>> On Dec 17, 2017, at 7:57 PM, Robert Haas wrote:
>>> 
>>> On Fri, Dec 15, 2017 at 2:49 PM, Gene Selkov wrote:
 I need a data type to represent genomic positions, which will consist of a
 string and a pair of integers with interval logic and access methods. Sort
 of like my seg type, but more straightforward.
>>> 
>>> Have you thought about just using a composite type?
>> Yes, I have. That is sort of what I have been doing; a composite type 
>> certainly gets the job done but I don’t feel it reduces query complexity, at 
>> least from the user’s point of view. Maybe I don’t know enough.
>> Here’s an example of how I imagine a composite genomic locus (conventionally 
>> represented as text ‘:’ integer ‘-‘ integer):
>> CREATE TYPE locus AS (contig text, coord int4range);
>> CREATE TABLE test_locus (
>>   pos locus,
>>   ref text,
>>   alt text,
>>   id text
>> );
>> CREATE INDEX test_locus_coord_ix ON test_locus (((pos).coord));
>> \copy test_locus from test_locus.tab
>> Where test_locus.tab has stuff like:
>> (chr3,"[178916937,178916940]")GAACHP2_PIK3CA_2
>> (chr3,"[178916939,178916948]")AGGATCHP2_PIK3CA_2
>> (chr3,"[178916940,178916941]")GACHP2_PIK3CA_2
>> (chr3,"[178916943,178916944]")AGCHP2_PIK3CA_2
>> (chr3,"[178916943,178916946]")AAGCHP2_PIK3CA_2
>> (chr3,"[178916943,178916952]")AAGATCCTCCHP2_PIK3CA_2
>> (chr3,"[178916944,1

Re: AS OF queries

2017-12-22 Thread Greg Stark
On 20 December 2017 at 12:45, Konstantin Knizhnik wrote:

> It seems to me that it will be not so difficult to implement them in
> Postgres - we already have versions of tuples.
> Looks like we only need to do three things:
> 1. Disable autovacuum (autovacuum = off)

"The Wheel of Time turns, and Ages come and pass, leaving memories
that become legend. Legend fades to myth, and even myth is long
forgotten when the Age that gave it birth comes again"

I think you'll find it a lot harder to get this to work than just
disabling autovacuum. Notably HOT updates can get cleaned up (and even
non-HOT updates can now leave tombstone dead line pointers iirc) even
if vacuum hasn't run.

We do have the infrastructure to deal with that. c.f.
vacuum_defer_cleanup_age. So in _theory_ you could create a snapshot
with xmin older than recent_global_xmin as long as it's not more than
vacuum_defer_cleanup_age older. But the devil will be in the details.
It does mean that you'll be making recent_global_xmin move backwards
which it has always been promised to *not* do.

Then there's another issue that logical replication has had to deal
with -- catalog changes. You can't start looking at tuples that have a
different structure than the current catalog unless you can figure out
how to use the logical replication infrastructure to use the old
catalogs. That's a huge problem to bite off and probably can just be
left for another day if you can find a way to reliably detect the
problem and raise an error if the schema is inconsistent.

Postgres used to have time travel. I think it's come up more than once
in the past as something that can probably never come back due to
other decisions made. If more decisions have made it possible again
that will be fascinating.

-- 
greg



Unique indexes & constraints on partitioned tables

2017-12-22 Thread Alvaro Herrera
Hello,

I'm giving this patch its own thread for mental sanity, but this is
essentially what I already posted in [1], plus some doc fixes.  This patch
depends on the main "local partitioned indexes" patch in that thread, the
last version of which is at [2].

I also added a mechanism to set the constraints in partitions as
dependent on the constraint in the parent partitioned table, so deletion
is sensible: the PK in partitions goes away when the PK in the parent
table is dropped; and you can't drop the PK in partitions on their own.
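In other words, with the full series applied something like this works, and
the dependency means the partition's index goes away together with the
parent's constraint (just a sketch):

    CREATE TABLE t (a int PRIMARY KEY, b text) PARTITION BY RANGE (a);
    CREATE TABLE t1 PARTITION OF t FOR VALUES FROM (1) TO (1000);
    -- t1 gets an index backing the PK, dependent on the parent's constraint
    ALTER TABLE t DROP CONSTRAINT t_pkey;  -- also removes the index on t1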

In order to implement that I dress Rube Goldberg as Santa: the
constraint OID of the parent is needed by index_constraint_create when
creating the child; but it's the previous index_constraint_create itself
that generates the OID when creating the parent, and it's DefineIndex
that does the recursion.  So index_constraint_create returns the value
to index_create who returns it to DefineIndex, so that the recursive
step can pass it down to index_create to give it to
index_constraint_create.  It seems crazy, but it's correct.

As far as I can tell, pg_dump works correctly without any additional
changes.

[1] https://postgr.es/m/20171220194937.pldcecyx7yrwmgkg@alvherre.pgsql
[2] https://postgr.es/m/20171220212503.aamhlrs425flg47f@alvherre.pgsql

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 311cdeeae542a3f4cc20bf2308488756b8d3da80 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 6 Nov 2017 17:04:55 +0100
Subject: [PATCH] allow indexes on partitioned tables to be unique

---
 doc/src/sgml/ref/alter_table.sgml  |   9 +-
 doc/src/sgml/ref/create_table.sgml |  16 +++-
 src/backend/bootstrap/bootparse.y  |   2 +
 src/backend/catalog/index.c|  28 +-
 src/backend/catalog/toasting.c |   4 +-
 src/backend/commands/indexcmds.c   |  80 ++--
 src/backend/commands/tablecmds.c   |  12 ++-
 src/backend/parser/parse_utilcmd.c |  31 +--
 src/backend/tcop/utility.c |   1 +
 src/include/catalog/index.h|   5 +-
 src/include/commands/defrem.h  |   1 +
 src/include/parser/parse_utilcmd.h |   3 +-
 src/test/regress/expected/alter_table.out  |   8 --
 src/test/regress/expected/create_table.out |  12 ---
 src/test/regress/expected/indexing.out | 142 -
 src/test/regress/sql/alter_table.sql   |   2 -
 src/test/regress/sql/create_table.sql  |   8 --
 src/test/regress/sql/indexing.sql  |  73 ++-
 18 files changed, 355 insertions(+), 82 deletions(-)

diff --git a/doc/src/sgml/ref/alter_table.sgml 
b/doc/src/sgml/ref/alter_table.sgml
index 0a2f3e3646..ee6a45c9ad 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -782,8 +782,9 @@ ALTER TABLE [ IF EXISTS ] name
   This form attaches an existing table (which might itself be partitioned)
   as a partition of the target table. The table can be attached
   as a partition for specific values using FOR VALUES
-   or as a default partition by using DEFAULT
-  .  For each index in the target table, a corresponding
+   or as a default partition by using
+  DEFAULT.
+  For each index in the target table, a corresponding
   one will be created in the attached table; or, if an equivalent
   index already exists, will be attached to the target table's index,
   as if ALTER INDEX ATTACH had been executed.
@@ -798,8 +799,10 @@ ALTER TABLE [ IF EXISTS ] name
   as the target table and no more; moreover, the column types must also
   match.  Also, it must have all the NOT NULL and
   CHECK constraints of the target table.  Currently
-  UNIQUE, PRIMARY KEY, and
   FOREIGN KEY constraints are not considered.
+  UNIQUE and PRIMARY KEY constraints
+  from the parent table will be created in the partition, if they don't
+  already exist.
   If any of the CHECK constraints of the table being
   attached is marked NO INHERIT, the command will fail;
   such a constraint must be recreated without the NO 
INHERIT
diff --git a/doc/src/sgml/ref/create_table.sgml 
b/doc/src/sgml/ref/create_table.sgml
index a0c9a6d257..98ab39473b 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -546,8 +546,8 @@ WITH ( MODULUS numeric_literal, REM
  
 
  
-  Partitioned tables do not support UNIQUE,
-  PRIMARY KEY, EXCLUDE, or
+  Partitioned tables do not support
+  EXCLUDE, or
   FOREIGN KEY constraints; however, you can define
   these constraints on individual partitions.
  
@@ -786,6 +786,11 @@ WITH ( MODULUS numeric_literal, REM
   primary key constraint defined for the table.  (Otherwise it
   would just be the same constraint listed twice.)
  
+
+ 
+  When used on partitioned tables, UNIQUE

Re: pgsql: Get rid of copy_partition_key

2017-12-22 Thread Alvaro Herrera
Seems I misremembered the whole opfuncid getting reset thing (it applies
to reading a node from string, not copying) and it was undone by
9f1255ac8593 anyway.  I don't think it makes much of a difference, but I
mention this in case you're wondering why I changed the fix_opfuncids()
call.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Get rid of copy_partition_key

2017-12-22 Thread Alvaro Herrera
I believe this patch (which also fixes a comment I neglected to fix in
the previous one) should satisfy your concerns.

I'm still running a few relevant tests (create_function_1 create_type
point polygon circle create_table copy create_misc create_index
alter_table partition_join partition_prune hash_part) under
CLOBBER_FREED_MEMORY + CLOBBER_CACHE_ALWAYS, but so far it's looking
good.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 8d72b371c9d46bf509e75ed8e33ac7d6bb4f1840 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 22 Dec 2017 13:19:32 -0300
Subject: [PATCH] Protect against hypothetical memory leaks in
 RelationGetPartitionKey

---
 src/backend/utils/cache/relcache.c | 40 ++
 1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c 
b/src/backend/utils/cache/relcache.c
index e2760daac4..f25e9ba536 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -807,17 +807,16 @@ RelationBuildRuleLock(Relation relation)
  * RelationBuildPartitionKey
  * Build and attach to relcache partition key data of relation
  *
- * Partitioning key data is stored in CacheMemoryContext to ensure it survives
- * as long as the relcache.  To avoid leaking memory in that context in case
- * of an error partway through this function, we build the structure in the
- * working context (which must be short-lived) and copy the completed
- * structure into the cache memory.
- *
- * Also, since the structure being created here is sufficiently complex, we
- * make a private child context of CacheMemoryContext for each relation that
- * has associated partition key information.  That means no complicated logic
- * to free individual elements whenever the relcache entry is flushed - just
- * delete the context.
+ * Partitioning key data is a complex structure; to avoid complicated logic to
+ * free individual elements whenever the relcache entry is flushed, we give it
+ * its own memory context, child of CacheMemoryContext, which can easily be
+ * deleted on its own.  To avoid leaking memory in that context in case of an
+ * error partway through this function, the context is initially created as a
+ * child of CurTransactionContext and only re-parented to CacheMemoryContext
+ * at the end, when no further errors are possible.  Also, we don't make this
+ * context the current context except in very brief code sections, out of fear
+ * that some of our callees allocate memory on their own which would be leaked
+ * permanently.
  */
 static void
 RelationBuildPartitionKey(Relation relation)
@@ -850,9 +849,9 @@ RelationBuildPartitionKey(Relation relation)

   RelationGetRelationName(relation),

   MEMCONTEXT_COPY_NAME,

   ALLOCSET_SMALL_SIZES);
-   oldcxt = MemoryContextSwitchTo(partkeycxt);
 
-   key = (PartitionKey) palloc0(sizeof(PartitionKeyData));
+   key = (PartitionKey) MemoryContextAllocZero(partkeycxt,
+   
sizeof(PartitionKeyData));
 
/* Fixed-length attributes */
form = (Form_pg_partitioned_table) GETSTRUCT(tuple);
@@ -900,11 +899,15 @@ RelationBuildPartitionKey(Relation relation)
 */
expr = eval_const_expressions(NULL, (Node *) expr);
 
+   oldcxt = MemoryContextSwitchTo(partkeycxt);
+   key->partexprs = (List *) copyObject(expr);
+   MemoryContextSwitchTo(oldcxt);
+
/* May as well fix opfuncids too */
-   fix_opfuncids((Node *) expr);
-   key->partexprs = (List *) expr;
+   fix_opfuncids((Node *) key->partexprs);
}
 
+   oldcxt = MemoryContextSwitchTo(partkeycxt);
key->partattrs = (AttrNumber *) palloc0(key->partnatts * 
sizeof(AttrNumber));
key->partopfamily = (Oid *) palloc0(key->partnatts * sizeof(Oid));
key->partopcintype = (Oid *) palloc0(key->partnatts * sizeof(Oid));
@@ -919,6 +922,7 @@ RelationBuildPartitionKey(Relation relation)
key->parttypbyval = (bool *) palloc0(key->partnatts * sizeof(bool));
key->parttypalign = (char *) palloc0(key->partnatts * sizeof(char));
key->parttypcoll = (Oid *) palloc0(key->partnatts * sizeof(Oid));
+   MemoryContextSwitchTo(oldcxt);
 
/* For the hash partitioning, an extended hash function will be used. */
procnum = (key->strategy == PARTITION_STRATEGY_HASH) ?
@@ -989,11 +993,13 @@ RelationBuildPartitionKey(Relation relation)
 
ReleaseSysCache(tuple);
 
-   /* Success --- make

Re: [HACKERS] Runtime Partition Pruning

2017-12-22 Thread Robert Haas
On Thu, Dec 21, 2017 at 8:37 PM, David Rowley wrote:
>> No, I don't think comparing to previous custom plans is a workable
>> approach.  I was thinking, rather, that if we know for example that
>> we've doing pruning on partition_column = $1, then we know that only
>> one partition will match.  That's probably a common case.  If we've
>> got partition_column > $1, we could assume that, say, 75% of the
>> partitions would match.  partition_column BETWEEN $1 and $2 is
>> probably a bit more selective, so maybe we assume 50% of the
>> partitions would match.
>
> Okay. Do you think this is something we need to solve for this patch?
> When I complained originally I didn't quite see any way to even test
> the majority of this patch with the regression tests, but Beena has
> since proven me wrong about that.

Although I have done one round of review of this patch, I haven't really
got my head around it completely yet and I haven't spent a lot of time on
it yet, so my opinions are not as well-formed as maybe they should be.
I'm glad, by the way, that you are putting some effort into it, as I
think that will help move this forward more quickly.   At a high
level, I want to avoid trying to solve too many problems in one patch
(which was the motivation behind my comment near the top of the
thread), but I also want to end up with something useful (which I
think is your concern).

Leaving aside the difficulty of implementation, I have some questions
about what the right thing to do actually is.  In a simple case, I'm
guessing that the cost of creating a custom plan will exceed the
amount that the plan saves, but in more complex cases, I'm not sure
that will be true.  For instance, if we know the particular parameter
value at plan time, we can make a more accurate estimate of how many
times that value appears, which can then feed into our choice of what
plan shape to use.  That is, for a query like SELECT * FROM a JOIN b
ON a.x = b.x WHERE a.y = $1, the generic plan might choose, say, a
nested loop with b on the inner side, but if we know that a particular
value for $1 will match a lot of rows in a, we might prefer a hash or
merge join for that specific case.  Run-time pruning doesn't give us
that flexibility.  My intuition is that the more complex we make the
query, the more point there will be to making custom plans, and the
simpler the query, the more likely it is that a generic plan will be
good enough that it's not worth replanning every time.
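To make that concrete, I'm thinking of the difference between something like
this (schematically):

    PREPARE q(int) AS
        SELECT * FROM a JOIN b ON a.x = b.x WHERE a.y = $1;
    EXPLAIN EXECUTE q(42);  -- custom plan: can use the statistics for a.y = 42

and the generic plan, which has to pick one join strategy for every possible
parameter value.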

Now, in my experience, the current system for custom plans vs. generic
plans doesn't approach the problem in this way at all, and in my
experience that results in some pretty terrible behavior.  It will do
things like form a custom plan every time because the estimated cost
of the custom plan is lower than the estimated cost of the generic
plan even though the two plans are structurally identical; only the
estimates differ.  It will waste gobs of CPU cycles by replanning a
primary key lookup 5 times just on the off chance that a lookup on the
primary key index isn't the best option.  But this patch isn't going
to fix any of that.  The best we can probably do is try to adjust the
costing for Append paths in some way that reflects the costs and
benefits of pruning.  I'm tentatively in favor of trying to do
something modest in that area, but I don't have a detailed proposal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-22 Thread Maksim Milyutin

On 22.12.2017 16:56, Craig Ringer wrote:

On 22 December 2017 at 20:50, Maksim Milyutin  wrote:


On 19.12.2017 16:54, Pavel Stehule wrote:


Sorry for the small off-topic question. Can this mechanism be used to log the
executed plan or full query?



That's a really good idea. I'd love to be able to pg_explain_backend(...)

I left the mechanism as a generic diagnostic signal exactly so that we 
could add other things we wanted to be able to ask backends. I think a 
follow-on patch that adds support for dumping explain-format plans 
would be great, if it's practical to do that while a query's already 
running.


Noticing the interest in calling some routines on a remote backend 
through signals: in a parallel thread[1] I have proposed the possibility 
of defining user-defined signal handlers in extensions. There is a patch, 
taken from the pg_query_state module.


1. 
https://www.postgresql.org/message-id/3f905f10-cf7a-d4e0-64a1-7fd9b8351592%40gmail.com


--
Regards,
Maksim Milyutin



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-22 Thread Alvaro Herrera
Hello Jesper,

Jesper Pedersen wrote:

> Passes check-world here too w/ TAP + cassert.

Great, thanks for checking.

> index.c:
>  [and a few other comments]

I believe these are all fixed by the attached delta patch.

If you have wording suggestions for the doc changes, please send them
along.

Thanks for reading,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 24ec874da257cf617359cce253ff4f8ad33166d7 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 22 Dec 2017 11:41:15 -0300
Subject: [PATCH] some fixes per Jesper

---
 doc/src/sgml/ref/create_index.sgml | 28 +---
 doc/src/sgml/ref/reindex.sgml  |  5 +
 src/backend/catalog/index.c|  2 +-
 src/bin/psql/tab-complete.c|  7 ++-
 4 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml 
b/doc/src/sgml/ref/create_index.sgml
index b7b0506d6d..5137fe6383 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -557,17 +557,23 @@ Indexes:
 
   
When CREATE INDEX is invoked on a partitioned
-   table, the default behavior is to recurse to all partitions to
-   ensure they all have a matching indexes.  Each partition is first
-   checked to determine whether equivalent index already exists,
-   and if so, that index will become attached as a partition index to
-   the index being created, which will be its parent index.  If no
-   matching index exists, a new index will be created and automatically
-   attached.  If ONLY is specified, no recursion
-   is done.  Note, however, that any partition that is created in the
-   future using CREATE TABLE .. PARTITION OF will
-   automatically contain the index regardless of whether this option
-   was specified.
+   table, the default behavior is to recurse to all partitions to ensure
+   they all have matching indexes.
+   Each partition is first checked to determine whether an equivalent
+   index already exists, and if so, that index will become attached as a
+   partition index to the index being created, which will become its
+   parent index.
+   If no matching index exists, a new index will be created and
+   automatically attached; the name of the new index in each partition
+   will be determined as if no index name had been specified in the
+   command.
+   If the ONLY option is specified, no recursion
+   is done, and the index is marked invalid
+   (ALTER INDEX ... ATTACH PARTITION turns the index
+   valid, once all partitions acquire the index.)  Note, however, that
+   any partition that is created in the future using
+   CREATE TABLE ... PARTITION OF will automatically
+   contain the index regardless of whether this option was specified.
   
 
   
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 79f6931c6a..1c21fafb80 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -231,6 +231,11 @@ REINDEX [ ( VERBOSE ) ] { INDEX | TABLE | SCHEMA | 
DATABASE | SYSTEM } 
 
+  
+   Reindexing partitioned tables or partitioned indexes is not supported.
+   Each individual partition can be reindexed separately instead.
+  
+
  
 
  
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index a62fe158ce..eee16dde25 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -629,7 +629,7 @@ UpdateIndexRelation(Oid indexoid,
 
values[Anum_pg_index_indexrelid - 1] = ObjectIdGetDatum(indexoid);
values[Anum_pg_index_indrelid - 1] = ObjectIdGetDatum(heapoid);
-   values[Anum_pg_index_indparentidx - 1 ] = 
ObjectIdGetDatum(parentIndexOid);
+   values[Anum_pg_index_indparentidx - 1] = 
ObjectIdGetDatum(parentIndexOid);
values[Anum_pg_index_indnatts - 1] = 
Int16GetDatum(indexInfo->ii_NumIndexAttrs);
values[Anum_pg_index_indisunique - 1] = 
BoolGetDatum(indexInfo->ii_Unique);
values[Anum_pg_index_indisprimary - 1] = BoolGetDatum(primary);
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 6af773f1bf..159ce4129e 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1695,14 +1695,11 @@ psql_completion(const char *text, int start, int end)
/* ALTER INDEX  */
else if (Matches3("ALTER", "INDEX", MatchAny))
COMPLETE_WITH_LIST7("ALTER COLUMN", "OWNER TO", "RENAME TO", 
"SET",
-   "RESET", "ATTACH 
PARTITION", "DETACH PARTITION");
-   else if (Matches4("ALTER", "INDEX", MatchAny, "ATTACH|DETACH"))
+   "RESET", "ATTACH 
PARTITION");
+   else if (Matches4("ALTER", "INDEX", MatchAny, "ATTACH"))
COMPLETE_WITH_CONST("PARTITION");
else if (Matches5("ALTER", "INDEX", MatchAny, "ATTACH", "PARTITION"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_o

Re: WIP: a way forward on bootstrap data

2017-12-22 Thread Alvaro Herrera
Robert Haas wrote:
> On Thu, Dec 21, 2017 at 5:32 PM, Alvaro Herrera  
> wrote:
> > Hmm, patch 0008 removes data lines from the .h but leaves the dependent
> > OID define lines around:
> 
> Just a question here -- do we actually have consensus on doing the
> stuff that these patches do?  Because I'm not sure we do.

My reading of the old threads (links provided by John in his initial
email in this thread) is that we have a consensus that we want to get
rid of the old data representation, because it causes an endless amount of
pain.  The proposed format seems to satisfy the constraints that we all
discussed, namely

1. be easier to modify than the current format,
2. in particular, allow for default values on certain columns
3. not cause git merge problems because of too similar lines in every
record
4. not require onerous Perl modules

The one thing we seem to lack is a tool to edit the data files, as you
suggested[1].  Stephen Frost mentioned[2] that we could do this by
allowing the .data files be loaded in a database table, have the changes
made via SQL, then have a way to create an updated .data file.  Tom
said[3] he didn't like that particular choice.

So we already have Catalog.pm that (after these patches) knows how to
load .data files; we could use that as a basis to enable easy oneliners
to do whatever editing is needed.

Also, the CPP symbols remaining in the pg_*.h files that I commented on
yesterday were already mentioned[4] before.

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoa4%3D5oz7wSMcLNLh8h6cXzPoxxNJKckkdSQA%2BzpUG0%2B0A%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/20150304150712.GV29780%40tamriel.snowman.net
[3] https://www.postgresql.org/message-id/24766.1478821303%40sss.pgh.pa.us
[4] https://www.postgresql.org/message-id/15697.1479161...@sss.pgh.pa.us

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Observations in Parallel Append

2017-12-22 Thread Amit Kapila
A few observations about the Parallel Append commit (ab727167):

1.
+++ b/src/include/nodes/execnodes.h
@@ -21,6 +21,7 @@
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
+#include "storage/spin.h"
..

There doesn't seem to be any need for including spin.h.  I think some
prior version of the patch might have needed it.  Patch attached to
remove it.

2.
+ * choose_next_subplan_for_worker
..
+ * We start from the first plan and advance through the list;
+ * when we get back to the end, we loop back to the first
+ * nonpartial plan.
..
+choose_next_subplan_for_worker(AppendState *node)
{
..
+   if (pstate->pa_next_plan < node->as_nplans - 1)
+   {
+   /* Advance to next plan. */
+   pstate->pa_next_plan++;
+   }
+   else if (append->first_partial_plan < node->as_nplans)
+   {
+   /* Loop back to first partial plan. */
+   pstate->pa_next_plan = append->first_partial_plan;
+   }
..

The code and comment don't seem to match.  The comments indicate that
after reaching the end of the list, we loop back to the first nonpartial
plan, whereas the code indicates that we loop back to the first partial
plan.  I think one of those needs to be changed, unless I am missing
something obvious.

3.
+cost_append(AppendPath *apath)
{
..
+   /*
+* Apply parallel divisor to non-partial subpaths.  Also add the
+* cost of partial paths to the total cost, but ignore non-partial
+* paths for now.
+*/
+   if (i < apath->first_partial_path)
+   apath->path.rows += subpath->rows / parallel_divisor;
+   else
+   {
+   apath->path.rows += subpath->rows;
+   apath->path.total_cost += subpath->total_cost;
+   }
..
}

I think it is better to use clamp_row_est on rows in the case where
we apply the parallel_divisor, so that the value of rows is always sane.
Also, don't we need to use the parallel_divisor for partial paths instead
of non-partial paths, as those are the ones that will actually be
distributed among workers?
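For the clamp_row_est part, I mean something like this (untested sketch):

    apath->path.rows += clamp_row_est(subpath->rows / parallel_divisor);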

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


spurious_inclusion_v1.patch
Description: Binary data


Re: Using ProcSignal to get memory context stats from a running backend

2017-12-22 Thread Pavel Stehule
2017-12-22 13:50 GMT+01:00 Maksim Milyutin :

> On 19.12.2017 16:54, Pavel Stehule wrote:
>
> Hi
>
> 2017-12-19 14:44 GMT+01:00 Craig Ringer :
>
>> On 18 December 2017 at 10:05, Robert Haas  wrote:
>>
>>> On Thu, Dec 14, 2017 at 9:34 PM, Craig Ringer 
>>> wrote:
>>> > On 15 December 2017 at 09:24, Greg Stark  wrote:
>>> >> Another simpler option would be to open up a new file in the log
>>> >> directory
>>> >
>>> > ... if we have one.
>>> >
>>> > We might be logging to syslog, or whatever else.
>>> >
>>> > I'd rather keep it simple(ish).
>>>
>>> +1.  I still think just printing it to the log is fine.
>>>
>>>
>> Here we go. Implemented pretty much as outlined above. A new
>> pg_diag_backend(pid) function sends a new ProcSignalReason 
>> PROCSIG_DIAG_REQUEST.
>> It's handled by CHECK_FOR_INTERRUPTS() and logs MemoryContextStats() output
>> to elog(...).
>>
>> I didn't want to mess with the MemoryContextMethods and expose a
>> printf-wrapper style typedef in memnodes.h, so I went with a hook global.
>> It's a diagnostic routine so I don't think that's going to be a great
>> bother. By default it's set to write_stderr. That just writes to vfprintf
>> on unix so the outcome's the same as our direct use of fprintf now.
>>
>> On Windows, though, using write_stderr will make Pg attempt to write
>> memory context dumps to the eventlog with ERROR level if running as a
>> service with no console. That means we vsnprintf the string into a static
>> buffer first. I'm not 100% convinced of the wisdom of that because it could
>> flood the event log, which is usually size limited by # of events and
>> recycled. It'd be better to write one event than write one per memory
>> context line, but it's not safe to do that when OOM. I lean toward this
>> being a win: at least Windows users should have some hope of finding out
>> why Pg OOM'd, which currently they don't when it runs as a service. If we
>> don't, we should look at using SetStdHandle to write stderr to a secondary
>> logfile instead.
>>
>> I'm happy to just add a trivial vfprintf wrapper so we preserve exactly
>> the same behaviour instead, but I figured I'd start with reusing
>> write_stderr.
>>
>> I'd really like to preserve the writing to elog(...) not fprintf, because
>> on some systems simply finding where stderr is written can be a challenge,
>> if it's not going to /dev/null via some detached session. Is it in
>> journald? In some separate log? Being captured by the logging collector
>> (and probably silently eaten)? Who knows!
>>
>> Using elog(...) provides a log_line_prefix and other useful info to
>> associate the dump with the process of interest and what it's doing at the
>> time.
>>
>> Also, an astonishing number of deployed systems I've worked with actually
>> don't put the pid or anything useful in log_line_prefix to make grouping up
>> multi-line output practical. Which is insane. But it's only PGC_SIGHUP so
>> fixing it is easy enough.
>>
>
> Sorry for the small off-topic question. Can this mechanism be used to log the
> executed plan or full query?
>
>
> This idea (but without logging) is implemented in the work on pg_progress
> command proposed by Remi Colinet[1] and in extension pg_query_state[2].
>
> 1. https://www.postgresql.org/message-id/CADdR5nxQUSh5kCm9MKmNga8%
> 2Bc1JLxLHDzLhAyXpfo9-Wmc6s5g%40mail.gmail.com
> 2. https://github.com/postgrespro/pg_query_state
>
>
I remember the discussion - and I hope that one day we will have some
EXPLAIN PROCESS pid command.

Using a signal and the PG log can be a very simple solution that is
immediately available.

Regards

Pavel

> --
> Regards,
> Maksim Milyutin
>
>


Re: Using ProcSignal to get memory context stats from a running backend

2017-12-22 Thread Craig Ringer
On 22 December 2017 at 20:50, Maksim Milyutin  wrote:

> On 19.12.2017 16:54, Pavel Stehule wrote:
>
> Hi
>
> 2017-12-19 14:44 GMT+01:00 Craig Ringer :
>
>> On 18 December 2017 at 10:05, Robert Haas  wrote:
>>
>>> On Thu, Dec 14, 2017 at 9:34 PM, Craig Ringer 
>>> wrote:
>>> > On 15 December 2017 at 09:24, Greg Stark  wrote:
>>> >> Another simpler option would be to open up a new file in the log
>>> >> directory
>>> >
>>> > ... if we have one.
>>> >
>>> > We might be logging to syslog, or whatever else.
>>> >
>>> > I'd rather keep it simple(ish).
>>>
>>> +1.  I still think just printing it to the log is fine.
>>>
>>>
>> Here we go. Implemented pretty much as outlined above. A new
>> pg_diag_backend(pid) function sends a new ProcSignalReason 
>> PROCSIG_DIAG_REQUEST.
>> It's handled by CHECK_FOR_INTERRUPTS() and logs MemoryContextStats() output
>> to elog(...).
>>
>> I didn't want to mess with the MemoryContextMethods and expose a
>> printf-wrapper style typedef in memnodes.h, so I went with a hook global.
>> It's a diagnostic routine so I don't think that's going to be a great
>> bother. By default it's set to write_stderr. That just writes to vfprintf
>> on unix so the outcome's the same as our direct use of fprintf now.
>>
>> On Windows, though, using write_stderr will make Pg attempt to write
>> memory context dumps to the eventlog with ERROR level if running as a
>> service with no console. That means we vsnprintf the string into a static
>> buffer first. I'm not 100% convinced of the wisdom of that because it could
>> flood the event log, which is usually size limited by # of events and
>> recycled. It'd be better to write one event than write one per memory
>> context line, but it's not safe to do that when OOM. I lean toward this
>> being a win: at least Windows users should have some hope of finding out
>> why Pg OOM'd, which currently they don't when it runs as a service. If we
>> don't, we should look at using SetStdHandle to write stderr to a secondary
>> logfile instead.
>>
>> I'm happy to just add a trivial vfprintf wrapper so we preserve exactly
>> the same behaviour instead, but I figured I'd start with reusing
>> write_stderr.
>>
>> I'd really like to preserve the writing to elog(...) not fprintf, because
>> on some systems simply finding where stderr is written can be a challenge,
>> if it's not going to /dev/null via some detached session. Is it in
>> journald? In some separate log? Being captured by the logging collector
>> (and probably silently eaten)? Who knows!
>>
>> Using elog(...) provides a log_line_prefix and other useful info to
>> associate the dump with the process of interest and what it's doing at the
>> time.
>>
>> Also, an astonishing number of deployed systems I've worked with actually
>> don't put the pid or anything useful in log_line_prefix to make grouping up
>> multi-line output practical. Which is insane. But it's only PGC_SIGHUP so
>> fixing it is easy enough.
>>
>
> Sorry for the small off-topic question. Can this mechanism be used to log the
> executed plan or full query?
>
>
That's a really good idea. I'd love to be able to pg_explain_backend(...)

I left the mechanism as a generic diagnostic signal exactly so that we
could add other things we wanted to be able to ask backends. I think a
follow-on patch that adds support for dumping explain-format plans would be
great, if it's practical to do that while a query's already running.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Huge backend memory footprint

2017-12-22 Thread Konstantin Knizhnik



On 22.12.2017 16:13, Claudio Freire wrote:



On Fri, Dec 22, 2017 at 10:07 AM, Konstantin Knizhnik wrote:


While my experiments with pthreads version of Postgres I find out
that I can not create more than 100k backends even at the system
with 4Tb of RAM.
I do not want to discuss now the idea of creating so large number
of backends - yes, most of the real production systems are using
pgbouncer or similar connection pooling
tool allowing to restrict number of connections to the database.
But there are 144 cores at this system and if we want to utilize
all system resources then optimal number of
backends will be several hundreds (especially taken in account
that Postgres backends are usually not CPU bounded and have to
read data from the disk, so number of backends
should be much larger than number of cores).

There are several per-backend arrays in postgres which size
depends on maximal number of backends.
For max_connections=100000 Postgres allocates 26Mb for each snapshot:

        CurrentRunningXacts->xids = (TransactionId *)
            malloc(TOTAL_MAX_CACHED_SUBXIDS * sizeof(TransactionId));

It seems to be too overestimated value, because
TOTAL_MAX_CACHED_SUBXIDS is defined as:

    /*
     * During Hot Standby processing we have a data structure called
     * KnownAssignedXids, created in shared memory. Local data
structures are
     * also created in various backends during GetSnapshotData(),
     * TransactionIdIsInProgress() and
GetRunningTransactionData(). All of the
     * main structures created in those functions must be
identically sized,
     * since we may at times copy the whole of the data structures
around. We
     * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
     *
     * Ideally we'd only create this structure if we were actually
doing hot
     * standby in the current run, but we don't know that yet at
the time
     * shared memory is being set up.
     */
#define TOTAL_MAX_CACHED_SUBXIDS \
    ((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)


Another 12Mb array is used for deadlock detection:

#2  0x008ac397 in InitDeadLockChecking () at deadlock.c:196
196            (EDGE *) palloc(maxPossibleConstraints * sizeof(EDGE));
(gdb) list
191         * last MaxBackends entries in possibleConstraints[]
are reserved as
192         * output workspace for FindLockCycle.
193         */
194        maxPossibleConstraints = MaxBackends * 4;
195        possibleConstraints =
196            (EDGE *) palloc(maxPossibleConstraints * sizeof(EDGE));
197


As  result amount of dynamic memory allocated for each backend
exceeds 50Mb and so 100k backends can not be launched even at the
system with 4Tb!
I think that we should use more accurate allocation policy in this
places and do not waste memory in such manner (even if it is virtual).


Don't forget each thread also has its own stack. I don't think you can 
expect 100k threads to ever work.


Yes, Postgres requires a large stack. Although the minimal pthread stack size
is 16kb, Postgres requires at least 512kb, and even that is not enough to
pass the regression tests.
But even with a 1Mb thread stack size, 100k connections require just (!)
100Gb. 50Mb of heap per backend, however, is too much.




If you get to that point, you really need to consider async query 
execution. There was a lot of work related to that in other threads, 
you may want to take a look.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Huge backend memory footprint

2017-12-22 Thread Andres Freund
Hi,

On 2017-12-22 16:07:23 +0300, Konstantin Knizhnik wrote:
> While my experiments with pthreads version of Postgres I find out that I can
> not create more than 100k backends even at the system with 4Tb of RAM.

I don't think this is a problem we need to address at this point. Would
you care to argue otherwise?

For now we've so much more relevant limitations, e.g. the O(N)
complexity of computing a snapshot, that I think addressing these
concerns is a waste of time at this point.

Greetings,

Andres Freund



Re: Huge backend memory footprint

2017-12-22 Thread Claudio Freire
On Fri, Dec 22, 2017 at 10:07 AM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> While my experiments with pthreads version of Postgres I find out that I
> can not create more than 100k backends even at the system with 4Tb of RAM.
> I do not want to discuss now the idea of creating so large number of
> backends - yes, most of the real production systems are using pgbouncer or
> similar connection pooling
> tool allowing to restrict number of connections to the database. But there
> are 144 cores at this system and if we want to utilize all system resources
> then optimal number of
> backends will be several hundreds (especially taken in account that
> Postgres backends are usually not CPU bounded and have to read data from
> the disk, so number of backends
> should be much larger than number of cores).
>
> There are several per-backend arrays in postgres which size depends on
> maximal number of backends.
> For max_connections=100000 Postgres allocates 26Mb for each snapshot:
>
> CurrentRunningXacts->xids = (TransactionId *)
> malloc(TOTAL_MAX_CACHED_SUBXIDS * sizeof(TransactionId));
>
> It seems to be too overestimated value, because TOTAL_MAX_CACHED_SUBXIDS
> is defined as:
>
> /*
>  * During Hot Standby processing we have a data structure called
>  * KnownAssignedXids, created in shared memory. Local data structures
> are
>  * also created in various backends during GetSnapshotData(),
>  * TransactionIdIsInProgress() and GetRunningTransactionData(). All of
> the
>  * main structures created in those functions must be identically
> sized,
>  * since we may at times copy the whole of the data structures around.
> We
>  * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
>  *
>  * Ideally we'd only create this structure if we were actually doing
> hot
>  * standby in the current run, but we don't know that yet at the time
>  * shared memory is being set up.
>  */
> #define TOTAL_MAX_CACHED_SUBXIDS \
> ((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
>
>
> Another 12Mb array is used for deadlock detection:
>
> #2  0x008ac397 in InitDeadLockChecking () at deadlock.c:196
> 196(EDGE *) palloc(maxPossibleConstraints * sizeof(EDGE));
> (gdb) list
> 191 * last MaxBackends entries in possibleConstraints[] are
> reserved as
> 192 * output workspace for FindLockCycle.
> 193 */
> 194maxPossibleConstraints = MaxBackends * 4;
> 195possibleConstraints =
> 196(EDGE *) palloc(maxPossibleConstraints * sizeof(EDGE));
> 197
>
>
> As  result amount of dynamic memory allocated for each backend exceeds
> 50Mb and so 100k backends can not be launched even at the system with 4Tb!
> I think that we should use more accurate allocation policy in this places
> and do not waste memory in such manner (even if it is virtual).
>

Don't forget each thread also has its own stack. I don't think you can
expect 100k threads to ever work.

If you get to that point, you really need to consider async query
execution. There was a lot of work related to that in other threads, you
may want to take a look.


Huge backend memory footprint

2017-12-22 Thread Konstantin Knizhnik
While experimenting with a pthreads version of Postgres I found out that I
cannot create more than 100k backends even on a system with 4Tb of RAM.
I do not want to discuss here the idea of creating such a large number of
backends - yes, most real production systems use pgbouncer or a similar
connection pooling tool to restrict the number of connections to the
database. But there are 144 cores in this system, and if we want to utilize
all system resources then the optimal number of backends will be several
hundred (especially taking into account that Postgres backends are usually
not CPU bound and have to read data from disk, so the number of backends
should be much larger than the number of cores).

There are several per-backend arrays in postgres whose size depends on
the maximal number of backends.

For max_connections=100000 Postgres allocates 26Mb for each snapshot:

        CurrentRunningXacts->xids = (TransactionId *)
            malloc(TOTAL_MAX_CACHED_SUBXIDS * sizeof(TransactionId));

This seems to be a heavily overestimated value, because TOTAL_MAX_CACHED_SUBXIDS
is defined as:


    /*
     * During Hot Standby processing we have a data structure called
     * KnownAssignedXids, created in shared memory. Local data 
structures are

     * also created in various backends during GetSnapshotData(),
     * TransactionIdIsInProgress() and GetRunningTransactionData(). All 
of the
     * main structures created in those functions must be identically 
sized,
     * since we may at times copy the whole of the data structures 
around. We

     * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
     *
     * Ideally we'd only create this structure if we were actually 
doing hot

     * standby in the current run, but we don't know that yet at the time
     * shared memory is being set up.
     */
#define TOTAL_MAX_CACHED_SUBXIDS \
    ((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
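That is where the 26Mb comes from: with PGPROC_MAX_CACHED_SUBXIDS = 64 and
PROCARRAY_MAXPROCS being roughly max_connections here, we get approximately

    (64 + 1) * ~100000 * sizeof(TransactionId)
        = 65 * ~100000 * 4 bytes
        ≈ 26 Mb per backend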


Another 12Mb array is used for deadlock detection:

#2  0x008ac397 in InitDeadLockChecking () at deadlock.c:196
196            (EDGE *) palloc(maxPossibleConstraints * sizeof(EDGE));
(gdb) list
191         * last MaxBackends entries in possibleConstraints[] are 
reserved as

192         * output workspace for FindLockCycle.
193         */
194        maxPossibleConstraints = MaxBackends * 4;
195        possibleConstraints =
196            (EDGE *) palloc(maxPossibleConstraints * sizeof(EDGE));
197


As a result, the amount of dynamic memory allocated for each backend exceeds
50Mb, and so 100k backends cannot be launched even on a system with 4Tb!
I think that we should use a more accurate allocation policy in these
places and not waste memory in such a manner (even if it is virtual).


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Using ProcSignal to get memory context stats from a running backend

2017-12-22 Thread Maksim Milyutin

On 19.12.2017 16:54, Pavel Stehule wrote:


Hi

2017-12-19 14:44 GMT+01:00 Craig Ringer :


On 18 December 2017 at 10:05, Robert Haas  wrote:

On Thu, Dec 14, 2017 at 9:34 PM, Craig Ringer  wrote:
> On 15 December 2017 at 09:24, Greg Stark  wrote:
>> Another simpler option would be to open up a new file in
the log
>> directory
>
> ... if we have one.
>
> We might be logging to syslog, or whatever else.
>
> I'd rather keep it simple(ish).

+1.  I still think just printing it to the log is fine.


Here we go. Implemented pretty much as outlined above. A new
pg_diag_backend(pid) function sends a new
ProcSignalReason PROCSIG_DIAG_REQUEST. It's handled by
CHECK_FOR_INTERRUPTS() and logs MemoryContextStats() output to
elog(...).

I didn't want to mess with the MemoryContextMethods and expose a
printf-wrapper style typedef in memnodes.h, so I went with a hook
global. It's a diagnostic routine so I don't think that's going to
be a great bother. By default it's set to write_stderr. That just
writes to vfprintf on unix so the outcome's the same as our direct
use of fprintf now.

On Windows, though, using write_stderr will make Pg attempt to
write memory context dumps to the eventlog with ERROR level if
running as a service with no console. That means we vsnprintf the
string into a static buffer first. I'm not 100% convinced of the
wisdom of that because it could flood the event log, which is
usually size limited by # of events and recycled. It'd be better
to write one event than write one per memory context line, but
it's not safe to do that when OOM. I lean toward this being a win:
at least Windows users should have some hope of finding out why Pg
OOM'd, which currently they don't when it runs as a service. If we
don't, we should look at using SetStdHandle to write stderr to a
secondary logfile instead.

I'm happy to just add a trivial vfprintf wrapper so we preserve
exactly the same behaviour instead, but I figured I'd start with
reusing write_stderr.

I'd really like to preserve the writing to elog(...) not fprintf,
because on some systems simply finding where stderr is written can
be a challenge, if it's not going to /dev/null via some detached
session. Is it in journald? In some separate log? Being captured
by the logging collector (and probably silently eaten)? Who knows!

Using elog(...) provides a log_line_prefix and other useful info
to associate the dump with the process of interest and what it's
doing at the time.

Also, an astonishing number of deployed systems I've worked with
actually don't put the pid or anything useful in log_line_prefix
to make grouping up multi-line output practical. Which is insane.
But it's only PGC_SIGHUP so fixing it is easy enough.


Sorry for the small off-topic question: can this mechanism be used to log 
the executed plan or the full query?


This idea (but without logging) is implemented in the work on the 
pg_progress command proposed by Remi Colinet [1] and in the pg_query_state 
extension [2].


1. 
https://www.postgresql.org/message-id/CADdR5nxQUSh5kCm9MKmNga8%2Bc1JLxLHDzLhAyXpfo9-Wmc6s5g%40mail.gmail.com

2. https://github.com/postgrespro/pg_query_state

--
Regards,
Maksim Milyutin



[HACKERS] PoC: custom signal handler for extensions

2017-12-22 Thread Maksim Milyutin

Hi, hackers!


I want to propose a patch that allows custom signals and their handlers to 
be defined on the extension side. It is based on the ProcSignal module: it 
defines a fixed set (the number is specified by a constant) of custom 
signals that can be reserved at postgres initialization time (in the 
_PG_init function of a shared preloaded module) through specific interface 
functions. The functions serving as custom signal handlers are defined in 
the extension. The relationship between custom signals and their assigned 
handlers (function addresses) is replicated from the postmaster to child 
processes, including working backends. Using these signals we are able to 
invoke a specific handler (via the SendProcSignal function) on a remote 
backend; the handler actually comes into action at a CHECK_FOR_INTERRUPTS point.


Looking ahead, this mechanism provides a low-level instrument for defining 
remote procedure calls on the extension side. A simple RPC that determines 
the effective user id on a remote backend (the remote_effective_user 
function) is attached as an example.
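
To make the intended usage concrete, an extension would do roughly the
following (a sketch only; it assumes ProcSignalHandler_type is a plain
void (*)(void) callback, see the attached patch for the authoritative
definitions):

```
#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "storage/procsignal.h"

PG_MODULE_MAGIC;

static ProcSignalReason my_reason = INVALID_PROCSIGNAL;

/* Runs in the signalled backend at its next CHECK_FOR_INTERRUPTS(). */
static void
my_signal_handler(void)
{
    elog(LOG, "custom signal received in backend with PID %d", MyProcPid);
}

void
_PG_init(void)
{
    /* Reserve one of the custom slots while the library is being preloaded. */
    my_reason = RegisterCustomProcSignalHandler(my_signal_handler);
    if (my_reason == INVALID_PROCSIGNAL)
        elog(ERROR, "no free custom ProcSignal slots");
}

/* Any backend can then poke a peer backend by PID. */
static void
poke_backend(pid_t pid)
{
    SendProcSignal(pid, my_reason, InvalidBackendId);
}
```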


C&C welcome!


--
Regards,
Maksim Milyutin

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index b9302ac630..75d4ea72b7 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "tcop/tcopprot.h"
+#include "utils/memutils.h"
 
 
 /*
@@ -60,12 +61,17 @@ typedef struct
  */
 #define NumProcSignalSlots (MaxBackends + NUM_AUXPROCTYPES)
 
+static bool CustomSignalPendings[NUM_CUSTOM_PROCSIGNALS];
+static ProcSignalHandler_type CustomHandlers[NUM_CUSTOM_PROCSIGNALS];
+
 static ProcSignalSlot *ProcSignalSlots = NULL;
 static volatile ProcSignalSlot *MyProcSignalSlot = NULL;
 
 static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 
+static void CustomSignalInterrupt(ProcSignalReason reason);
+
 /*
  * ProcSignalShmemSize
  * Compute space needed for procsignal's shared memory
@@ -166,6 +172,70 @@ CleanupProcSignalState(int status, Datum arg)
 }
 
 /*
+ * RegisterCustomProcSignalHandler
+ * Assign specific handler of custom process signal with new ProcSignalReason key
+ *
+ * Return INVALID_PROCSIGNAL if all custom signals have been assigned.
+ */
+ProcSignalReason
+RegisterCustomProcSignalHandler(ProcSignalHandler_type handler)
+{
+   ProcSignalReason reason;
+
+   /* iterate through custom signal keys to find free spot */
+   for (reason = PROCSIG_CUSTOM_1; reason <= PROCSIG_CUSTOM_N; reason++)
+   if (!CustomHandlers[reason - PROCSIG_CUSTOM_1])
+   {
+   CustomHandlers[reason - PROCSIG_CUSTOM_1] = handler;
+   return reason;
+   }
+   return INVALID_PROCSIGNAL;
+}
+
+/*
+ * ReleaseCustomProcSignal
+ *  Release slot of specific custom signal
+ */
+void
+ReleaseCustomProcSignal(ProcSignalReason reason)
+{
+   CustomHandlers[reason - PROCSIG_CUSTOM_1] = NULL;
+}
+
+/*
+ * AssignCustomProcSignalHandler
+ * Assign handler of custom process signal with specific ProcSignalReason key
+ *
+ * Return old ProcSignal handler.
+ * Assume incoming reason is one of custom ProcSignals.
+ */
+ProcSignalHandler_type
AssignCustomProcSignalHandler(ProcSignalReason reason, ProcSignalHandler_type handler)
+{
+   ProcSignalHandler_type old;
+
+   Assert(reason >= PROCSIG_CUSTOM_1 && reason <= PROCSIG_CUSTOM_N);
+
+   old = CustomHandlers[reason - PROCSIG_CUSTOM_1];
+   CustomHandlers[reason - PROCSIG_CUSTOM_1] = handler;
+   return old;
+}
+
+/*
+ * GetCustomProcSignalHandler
+ * Get handler of custom process signal
+ *
+ * Assume incoming reason is one of custom ProcSignals.
+ */
+ProcSignalHandler_type
+GetCustomProcSignalHandler(ProcSignalReason reason)
+{
+   Assert(reason >= PROCSIG_CUSTOM_1 && reason <= PROCSIG_CUSTOM_N);
+
+   return CustomHandlers[reason - PROCSIG_CUSTOM_1];
+}
+
+/*
  * SendProcSignal
  * Send a signal to a Postgres process
  *
@@ -260,7 +330,8 @@ CheckProcSignal(ProcSignalReason reason)
 void
 procsignal_sigusr1_handler(SIGNAL_ARGS)
 {
-   int save_errno = errno;
+   int save_errno = errno;
+   ProcSignalReason reason;
 
if (CheckProcSignal(PROCSIG_CATCHUP_INTERRUPT))
HandleCatchupInterrupt();
@@ -292,9 +363,84 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+   for (reason = PROCSIG_CUSTOM_1; reason <= PROCSIG_CUSTOM_N; reason++)
+   if (CheckProcSignal(reason))
+   CustomSignalInterrupt(reason);
+
SetLatch(MyLatch);
 
latch_sigusr1_handler();
 
errno = save_errno;
 }
+
+/*
+ * Handle re

Re: [HACKERS] Restricting maximum keep segments by repslots

2017-12-22 Thread Sergei Kornilov
Hello
I think limiting the WAL kept by replication slots is useful in some cases. But at 
first I was confused by the proposed terminology of secured/insecured/broken/unknown states.

patch -p1 gives some "Stripping trailing CRs from patch" messages for me, but it 
applied to current HEAD and builds. After a little testing I understood the 
difference in the secured/insecured/broken terminology: secured means the WAL is 
guaranteed to be kept, insecured means the WAL may be deleted at the next 
checkpoint, and broken means the WAL has already been deleted.
I think we could split "secured" into "streaming" and... hmm... "waiting"? 
"keeping"? - according to the active flag, to make the "status" field clearer and more readable.

regards, Sergei



Re: After dropping the rule - Not able to insert / server crash (one time ONLY)

2017-12-22 Thread Dilip Kumar
On Mon, Dec 11, 2017 at 4:33 PM, Dilip Kumar  wrote:

> On Mon, Dec 11, 2017 at 3:54 PM, tushar 
> wrote:
>
>> Hi,
>>
>> While testing something, I found that even after the rule has been dropped it
>> is not possible to insert data, and in another scenario there is a crash.
>>
>> Please refer to these scenarios -
>>
>> 1) Rows not inserted after dropping the RULE
>>
>> postgres=# create table e(n int);
>> CREATE TABLE
>> postgres=# create rule e1 as on insert to e do instead nothing;
>> CREATE RULE
>> postgres=# create function e11(n int) returns int as $$ begin insert into
>> e values(10); return 1; end; $$ language 'plpgsql';
>> CREATE FUNCTION
>> postgres=# select e11(1);
>>  e11
>> -
>>1
>> (1 row)
>>
>> postgres=# select e11(1);
>>  e11
>> -
>>1
>> (1 row)
>>
>> postgres=# select * from e;  => Expected, due to the RULE being enforced
>>  n
>> ---
>> (0 rows)
>>
>>
>> postgres=# drop rule e1 on e;  ==>Drop the rule
>> DROP RULE
>>
>> postgres=# select e11(1);  =>again try to insert into table
>>  e11
>> -
>>1
>> (1 row)
>>
>> postgres=# select * from e;  => This time the value should be inserted,
>> but it still shows 0 rows.
>>  n
>> ---
>> (0 rows)
>>
>
I think this is because we are not invalidating the plan cache when a rule is
dropped.  You can reproduce the same with prepared statements.


> 2)Server crash (one time)
>>
>> postgres=# create table en(n int);
>> CREATE TABLE
>> postgres=# create function en1(n int) returns int as $$ begin insert into
>> en values(10); return 1; end; $$ language 'plpgsql';
>> CREATE FUNCTION
>> postgres=#
>> postgres=# select en1(1);
>>  en1
>> -
>>1
>> (1 row)
>>
>> postgres=# select * from en;
>>  n
>> 
>>  10
>> (1 row)
>>
>> postgres=# create rule en11 as on insert to en do instead nothing;
>> CREATE RULE
>> postgres=# select en1(1);
>> postgres=# select en1(1);
>> TRAP: FailedAssertion("!(!stmt->mod_stmt)", File: "pl_exec.c", Line:
>> 3721)
>> server closed the connection unexpectedly
>> This probably means the server terminated abnormally
>> before or while processing the request.
>>
>
> I have looked into this crash, and it seems the below assert in
> exec_stmt_execsql is not correct:
>
> case SPI_OK_REWRITTEN:
>     Assert(!stmt->mod_stmt);
>
> In this issue, the first execution of the select en1(1); statement sets stmt->mod_stmt
> to true for the insert query inside the function. Then before the next
> execution we have created the rule "create rule en11 as on insert to en
> do instead nothing;".  Because of this nothing will be executed and the
> output will be SPI_OK_REWRITTEN.  But we are asserting that stmt->mod_stmt
> should be false (whereas it was set to true by the first execution).
>

IMHO if the query is rewritten, then this assert is not valid. I have
attached a patch which removes this assert.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


remove_assert_in_rewritequery.patch
Description: Binary data


Should we nonblocking open FIFO files in COPY?

2017-12-22 Thread Adam Lee
Hi,

I have an issue where a COPY from a FIFO that has no writers cannot be
canceled, because COPY invokes AllocateFile() -> fopen() -> a blocking open().

```
[postgres@s1 ~]$ mkfifo /tmp/test0
[postgres@s1 ~]$ /usr/local/pgsql/bin/psql test
psql (11devel)
Type "help" for help.

test=# create table test(t text);
CREATE TABLE
test=# copy test from '/tmp/test0';
^CCancel request sent
^CCancel request sent
^CCancel request sent
^CCancel request sent
^CCancel request sent
^CCancel request sent
...
```

Should we nonblocking open FIFO files?

And a follow-up question if we do open them non-blocking: say there is a
FIFO file that no one will ever write into, and a utility runs `COPY FROM`
on it, should we just return `COPY 0` or wait for writers? If we wait, then
users have to interrupt or write an EOF into the FIFO after a timeout;
I see some utilities do that (gptransfer for instance), which just seems
not right.

My plan is to write a new function which opens FIFOs non-blocking, just
for COPY, and does not wait for writers. What do you think?
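
For illustration, the helper I have in mind would look roughly like this
(a hypothetical sketch, not an existing API: open with O_NONBLOCK so the
open() itself cannot hang, then switch back to blocking mode for the
actual reads):

```
#include <fcntl.h>
#include <stdio.h>

static FILE *
open_fifo_for_copy(const char *path)
{
    int     fd;
    int     flags;

    /* O_NONBLOCK makes open() on a writer-less FIFO return immediately. */
    fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return NULL;

    /* Restore blocking mode for the data transfer itself. */
    flags = fcntl(fd, F_GETFL);
    if (flags >= 0)
        (void) fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);

    /* With no writer, the first read will simply see EOF ("COPY 0"). */
    return fdopen(fd, "r");
}
```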

-- 
Adam Lee



Re: Suspicious call of initial_cost_hashjoin()

2017-12-22 Thread Thomas Munro
On Fri, Dec 22, 2017 at 10:45 PM, Antonin Houska  wrote:
> try_partial_hashjoin_path() passes constant true to for the parallel_hash
> argument of initial_cost_hashjoin(). Shouldn't it instead pass the
> parallel_hash argument that it receives?

Thanks.  Yeah.  When initial_cost_hashjoin() calls
get_parallel_divisor() on a non-partial inner path I think it would
return 1.0, so no damage was done there, but when
ExecChooseHashTableSize() receives try_combined_work_mem == true it
might underestimate the number of batches required for a partial hash
join without parallel hash, because it would incorrectly assume that a
single batch join could use the combined work_mem budget.  This was
quite well hidden because ExecHashTableCreate() calls
ExecChooseHashTableSize() again (rather than reusing the results from
planning time), so the bad nbatch estimate doesn't show up anywhere.

-- 
Thomas Munro
http://www.enterprisedb.com


fix.patch
Description: Binary data


Re: General purpose hashing func in pgbench

2017-12-22 Thread Ildar Musin

21/12/2017 18:26, Fabien COELHO wrote:
>
>> I think it is not commitfest ready yet -- I need to add some
>> documentation and tests first.
>
> Yes, doc & test are missing.
>
> From your figures, the murmur2 algorithm output looks way better. I'm
> wondering whether it makes sense to provide a bad hash function if a
> good/better one is available, unless the bad one actually appears in
> some benchmark... So I would suggest to remove fnv1a.
Actually the "bad" one appears in YCSB. But if we should choose the only
one I would stick to murmur too given it provides better results while
having similar computational complexity.
>
> One implementation put constants in defines, the other one uses "const
> int". The practice in pgbench seems to use defines (eg
> MIN_GAUSSIAN_PARAM...), so I would suggest to stick to this style.
That had been my intention from the start until I coded it that way and
it looked ugly and hard to read (IMHO), like:

    k *= MURMUR2_M;
    k ^= k >> MURMUR2_R;
    k *= MURMUR2_M;
    result ^= k;
    result *= MURMUR2_M;

...etc. Besides, it is not a parameter to be tuned, nor something someone
would ever want to change; it appears in just a single function.
So I'd rather leave it the way it is. Actually I was thinking of doing the
same to fnv1a too : )
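
For reference, with defines the complete function for an 8-byte key would
look roughly like this (a sketch assuming the standard MurmurHash64A
constants; the fragment quoted above is its middle part):

```
#include <stdint.h>

#define MURMUR2_M   0xc6a4a7935bd1e995ULL
#define MURMUR2_R   47

/* MurmurHash64A specialised for a single 8-byte key. */
static uint64_t
murmur2_hash64(uint64_t k, uint64_t seed)
{
    uint64_t    result = seed ^ (MURMUR2_M * sizeof(uint64_t));

    k *= MURMUR2_M;
    k ^= k >> MURMUR2_R;
    k *= MURMUR2_M;
    result ^= k;
    result *= MURMUR2_M;

    /* finalization */
    result ^= result >> MURMUR2_R;
    result *= MURMUR2_M;
    result ^= result >> MURMUR2_R;

    return result;
}
```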

-- 
Ildar Musin
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company 




Re: [HACKERS] pgbench more operators & functions

2017-12-22 Thread Fabien COELHO


Hello Teodor,


replaced -1 by 0x so that the code is hopefully clearer.

I changed the 0xff constant to ~INT64CONST(0); it seems a more consistent way.
Also I removed some whitespace in exprparse.y. Fixed version in attachment.


Fine, quite readable this way.

Actually, I would prefer to see a single scripting implementation in both psql and 
pgbench,


I'll push for the implementations to share more stuff in the future.

For instance, the pgbench-if patch shares the conditional stack 
implementation. I intend to move the pgbench expression engine to a shared 
front-end utility, once its capabilities are extended and stable (which is 
basically after this patch), so that client-side expressions can be used in 
psql.


Now, psql & pgbench contexts are slightly different, with an interactive 
thing which must evaluate on the fly on one side and a scripting thing on 
the other, so it would not be easy to share everything or to do everything 
the same way.


but I suppose nobody has the power to do that in the foreseeable future. And 
maybe it's not a very good idea to invent one scripting language instead of 
using one of the existing bunch, but, again, I'm afraid of a several-months/years 
discussion about how and which one to embed. But scripting is needed now, I 
believe; at least I see several test scenarios which cannot be implemented 
with the current pgbench and which this patch makes possible.


That is exactly why I'm pushing different things into pgbench (\gset, 
\if, ...), to improve its capabilities with respect to benchmarking.


So, I intend to push this patch in its current state. I saw several objections 
from committers in the thread, but it seems those objections have been lifted. Am I right?


I think so.

--
Fabien.



Re: [HACKERS] pgbench more operators & functions

2017-12-22 Thread Teodor Sigaev
I've checked, but truexxx is not accepted as true. I have added a test case 
which fails on "malformed variable", i.e. it went up to scanning a double. When 
comparing ("truexxx", "true", 7) the fifth char is different, so it is != 0. Or 
I'm missing something.

Oh, my fault. I missed that. Thank you for the test.



Ok, I agree that it looks strange. I have added comments for both. I have 
replaced -1 by 0x so that the code is hopefully clearer.

I changed the 0xff constant to ~INT64CONST(0); it seems a more consistent way.
Also I removed some whitespace in exprparse.y. Fixed version in attachment.


Looking to psql and pgbench scripting implementation, isn't it better to 
integrate lua in psql & pgbench?


Hmmm... if it starts down this slope, everyone will have their own opinion (lua, tcl, 
python, ruby, perl, insert-script-name-here...), and it must interact with SQL; 
I'm not sure how to embed SQL & another language cleanly. So the idea is just to 
extend the backslash command capabilities of psql & pgbench, preferably 
consistently, when the need (i.e. use cases) arises.


Actually, I would prefer to see a single scripting implementation in both psql and 
pgbench, but I suppose nobody has the power to do that in the foreseeable future. And 
maybe it's not a very good idea to invent one scripting language instead of using one 
of the existing bunch, but, again, I'm afraid of a several-months/years discussion about 
how and which one to embed. But scripting is needed now, I believe; at least I 
see several test scenarios which cannot be implemented with the current pgbench and 
which this patch makes possible.


So, I intend to push this patch in its current state. I saw several objections from 
committers in the thread, but it seems those objections have been lifted. Am I right?


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 4431fc3eb7..ea8f305834 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -904,14 +904,32 @@ pgbench  options  d
  
   Sets variable varname to a value calculated
   from expression.
-  The expression may contain integer constants such as 5432,
+  The expression may contain the NULL constant,
+  boolean constants TRUE and FALSE,
+  integer constants such as 5432,
   double constants such as 3.14159,
   references to variables :variablename,
-  unary operators (+, -) and binary operators
-  (+, -, *, /,
-  %) with their usual precedence and associativity,
-  function calls, and
-  parentheses.
+  operators
+  with their usual SQL precedence and associativity,
+  function calls,
+  SQL CASE generic conditional
+  expressions and parentheses.
+ 
+
+ 
+  Functions and most operators return NULL on
+  NULL input.
+ 
+
+ 
+  For conditional purposes, non zero numerical values are
+  TRUE, zero numerical values and NULL
+  are FALSE.
+ 
+
+ 
+  When no final ELSE clause is provided to a
+  CASE, the default value is NULL.
  
 
  
@@ -920,6 +938,7 @@ pgbench  options  d
 \set ntellers 10 * :scale
 \set aid (1021 * random(1, 10 * :scale)) % \
(10 * :scale) + 1
+\set divx CASE WHEN :x <> 0 THEN :y/:x ELSE NULL END
 
 

@@ -996,6 +1015,177 @@ pgbench  options  d
   
  
 
+ 
+  Built-In Operators
+
+  
+   The arithmetic, bitwise, comparison and logical operators listed in
+are built into pgbench
+   and may be used in expressions appearing in
+   \set.
+  
+
+  
+   pgbench Operators by increasing precedence
+   
+
+ 
+  Operator
+  Description
+  Example
+  Result
+ 
+
+
+ 
+  OR
+  logical or
+  5 or 0
+  TRUE
+ 
+ 
+  AND
+  logical and
+  3 and 0
+  FALSE
+ 
+ 
+  NOT
+  logical not
+  not false
+  TRUE
+ 
+ 
+  IS [NOT] (NULL|TRUE|FALSE)
+  value tests
+  1 is null
+  FALSE
+ 
+ 
+  ISNULL|NOTNULL
+  null tests
+  1 notnull
+  TRUE
+ 
+ 
+  =
+  is equal
+  5 = 4
+  FALSE
+ 
+ 
+  <>
+  is not equal
+  5 <> 4
+  TRUE
+ 
+ 
+  !=
+  is not equal
+  5 != 5
+  FALSE
+ 
+ 
+  <
+  lower than
+  5 < 4
+  FALSE
+ 
+ 
+  <=
+  lower or equal
+  5 <= 4
+  FALSE
+ 
+ 
+  >
+  greater than
+  5 > 4
+  TRUE
+ 
+ 
+  >=
+  greater or equal
+  5 >= 4
+  TRUE
+ 
+ 
+  |
+  integer bitwise OR
+  1 | 2
+  3
+ 
+ 
+  #
+  integer bitwise XOR
+  1 # 3
+  2
+ 
+ 
+  &
+  integer bitwise AND
+  1 & 3
+  1
+ 
+ 
+  ~
+  integer bitwise NOT
+  ~ 1
+  -2
+ 
+ 
+  <<
+

Re: [HACKERS] pow support for pgbench

2017-12-22 Thread Fabien COELHO


Hello,


If a double is always returned, I'm wondering whether keeping the ipow
version makes much sense: In case of double loss of precision, the
precision is lost, too bad, and casting back to int won't bring it back.


I've kept it because knowing that both are ints allows skipping a lot of
checks (done in math.h pow), so it's way faster: on my system it's 2-3ns vs
~40ns. I'm willing to settle for using just pow() if that makes it clearer.


Ok, performance is a good argument. I would not have thought that the 
double performance would be so bad, but probably no miracle.


As for precision, there is another case where the int computation 
overflows (for instance pow(2, 63), which exceeds the int64 range), so that 
the int result is stupid and the double version is a better approximation. 
Now that can be controlled by providing double or int arguments to the 
function, so for me it is ok.



Attached is the updated patch.


Ok.

I have marked it as ready to committer.

Basically for me this is an inferior version, not behaving especially 
better with respect to SQL, but if it eventually gets through a committer 
maybe it is worth it.


--
Fabien.



Suspicious call of initial_cost_hashjoin()

2017-12-22 Thread Antonin Houska
try_partial_hashjoin_path() passes constant true for the parallel_hash
argument of initial_cost_hashjoin(). Shouldn't it instead pass the
parallel_hash argument that it receives?

This is related to commit 1804284042e659e7d16904e7bbb0ad546394b6a3.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-12-22 Thread Rajkumar Raghuwanshi
On Wed, Dec 20, 2017 at 5:21 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> Thanks. Here are some comments
>
> Thanks Ashutosh for review and suggestions.


> +-- test default partition behavior for range
> +ALTER TABLE prt1 DETACH PARTITION prt1_p3;
> +ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
> +ALTER TABLE prt2 DETACH PARTITION prt2_p3;
> +ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;
>
> I think we need an ANALYZE here in case the statistics gets updated while
> DETACH and ATTACH is going on. Other testcases also need to be updated with
> ANALYZE, including the negative one.
>
Done.


>
> +-- partition-wise join can not be applied if the only one of joining
> table have
>
> Correction: ... if only one of the joining tables has ...
>
Done.


> Please add the patch to the next commitfest so that it's not
> forgotten.

Done.
Added to CF: https://commitfest.postgresql.org/16/1426/


> I think we can get rid of the multi-level partition-wise
> testcase as well. Also, since we are re-attaching existing partition
> tables as default partitions, we don't need to check the output as
> well; just plan should be enough.
>
Ok. Done.

updated test patch attached.
diff --git a/src/test/regress/expected/partition_join.out b/src/test/regress/expected/partition_join.out
index 27ab852..d4c875a 100644
--- a/src/test/regress/expected/partition_join.out
+++ b/src/test/regress/expected/partition_join.out
@@ -1337,6 +1337,74 @@ SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM pht1 t1, ph
  574. | 574.5000 | 1148. | 0011 | 0011 | A0011
 (12 rows)
 
+-- test default partition behavior for range
+ALTER TABLE prt1 DETACH PARTITION prt1_p3;
+ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
+ANALYZE prt1;
+ALTER TABLE prt2 DETACH PARTITION prt2_p3;
+ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;
+ANALYZE prt2;
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
+QUERY PLAN
+--
+ Sort
+   Sort Key: t1.a
+   ->  Append
+ ->  Hash Join
+   Hash Cond: (t2.b = t1.a)
+   ->  Seq Scan on prt2_p1 t2
+   ->  Hash
+ ->  Seq Scan on prt1_p1 t1
+   Filter: (b = 0)
+ ->  Hash Join
+   Hash Cond: (t2_1.b = t1_1.a)
+   ->  Seq Scan on prt2_p2 t2_1
+   ->  Hash
+ ->  Seq Scan on prt1_p2 t1_1
+   Filter: (b = 0)
+ ->  Hash Join
+   Hash Cond: (t2_2.b = t1_2.a)
+   ->  Seq Scan on prt2_p3 t2_2
+   ->  Hash
+ ->  Seq Scan on prt1_p3 t1_2
+   Filter: (b = 0)
+(21 rows)
+
+-- test default partition behavior for list
+ALTER TABLE plt1 DETACH PARTITION plt1_p3;
+ALTER TABLE plt1 ATTACH PARTITION plt1_p3 DEFAULT;
+ANALYZE plt1;
+ALTER TABLE plt2 DETACH PARTITION plt2_p3;
+ALTER TABLE plt2 ATTACH PARTITION plt2_p3 DEFAULT;
+ANALYZE plt2;
+EXPLAIN (COSTS OFF)
+SELECT avg(t1.a), avg(t2.b), t1.c, t2.c FROM plt1 t1 RIGHT JOIN plt2 t2 ON t1.c = t2.c GROUP BY t1.c, t2.c ORDER BY t1.c, t2.c;
+  QUERY PLAN  
+--
+ Sort
+   Sort Key: t1.c, t2.c
+   ->  HashAggregate
+ Group Key: t1.c, t2.c
+ ->  Result
+   ->  Append
+ ->  Hash Right Join
+   Hash Cond: (t1.c = t2.c)
+   ->  Seq Scan on plt1_p1 t1
+   ->  Hash
+ ->  Seq Scan on plt2_p1 t2
+ ->  Hash Right Join
+   Hash Cond: (t1_1.c = t2_1.c)
+   ->  Seq Scan on plt1_p2 t1_1
+   ->  Hash
+ ->  Seq Scan on plt2_p2 t2_1
+ ->  Hash Right Join
+   Hash Cond: (t1_2.c = t2_2.c)
+   ->  Seq Scan on plt1_p3 t1_2
+   ->  Hash
+ ->  Seq Scan on plt2_p3 t2_2
+(21 rows)
+
 --
 -- multiple levels of partitioning
 --
@@ -1868,3 +1936,30 @@ SELECT t1.a, t1.c, t2.b, t2.c FROM prt1_n t1 FULL JOIN prt1 t2 ON (t1.c = t2.c);
->  Seq Scan on prt1_n_p2 t1_1
 (10 rows)
 
+-- partition-wise join can not be applied if only one of joining table has
+-- default partition
+ALTER TABLE prt2 DETACH PARTITION prt2_p3;
+ALTER TABLE prt2 ATTACH PARTITION prt2_p3 FOR VALUES FROM (500) TO (600);
+ANALYZE prt2;
+EXPLAIN (COSTS OFF)
+SELECT t1.a, t1.c, t2.b, t2.c FROM prt1 t1, prt2 t2 WHERE t1.a = t2.b AND t1.b = 0 ORDER BY t1.a, t2.b;
+QUERY PLAN
+

Re: [HACKERS] pow support for pgbench

2017-12-22 Thread Raúl Marín Rodríguez
Hi Fabien,

Thanks for the review.

If a double is always returned, I'm wondering whether keeping the ipow
> version makes much sense: In case of double loss of precision, the
> precision is lost, too bad, and casting back to int won't bring it back.

I've kept it because knowing that both are ints allows skipping a lot of
checks (done in math.h pow), so it's way faster: on my system it's 2-3ns vs
~40ns. I'm willing to settle for using just pow() if that makes it clearer.

In the doc, I'm not sure that "Numeric" brings anything. "Exponentiation"
> would be enough.

Done.

Also, in pg I just noticed that POW is a shorthand for POWER. Maybe both
> should be supported? Or not.

I've never used power instead of pow, but I've added it for compatibility's
sake.

Attached is the updated patch.

On Thu, Dec 21, 2017 at 10:48 PM, Fabien COELHO  wrote:

>
> Hello Raúl,
>
> v7 needs a rebase.
>>>
>>> Also, you might try to produce a version which is compatible with
>>> Robert's
>>> constraints.
>>>
>>
> My 0.02€ on this new version: Applies cleanly, compiles and works.
>
> I cannot say that I like it more than the previous version.
>
> If a double is always returned, I'm wondering whether keeping the ipow
> version makes much sense: In case of double loss of precision, the
> precision is lost, too bad, and casting back to int won't bring it back.
>
> In the doc, I'm not sure that "Numeric" brings anything. "Exponentiation"
> would be enough.
>
> Also, in pg I just noticed that POW is a shorthand for POWER. Maybe both
> should be supported? Or not.
>
> --
> Fabien.




-- 

*Raúl Marín Rodríguez *carto.com
From 58463d2751c016a68410d5357625bf3d7f519ca2 Mon Sep 17 00:00:00 2001
From: Raul Marin 
Date: Thu, 21 Dec 2017 20:28:02 +0100
Subject: [PATCH] Add pow() support to pgbench

---
 doc/src/sgml/ref/pgbench.sgml|  7 
 src/bin/pgbench/exprparse.y  |  6 +++
 src/bin/pgbench/pgbench.c| 60 
 src/bin/pgbench/pgbench.h|  3 +-
 src/bin/pgbench/t/001_pgbench_with_server.pl | 18 -
 5 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 4431fc3eb7..1519fe78ef 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1069,6 +1069,13 @@ pgbench  options  d
pi()
3.14159265358979323846
   
+  
+   pow(x, y), power(x, y)
+   double
+   exponentiation
+   pow(2.0, 10), power(2.0, 10)
+   1024.0
+  
   
random(lb, ub)
integer
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index 25d5ad48e5..74ffe5e7a7 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -194,6 +194,12 @@ static const struct
 	{
 		"random_zipfian", 3, PGBENCH_RANDOM_ZIPFIAN
 	},
+	{
+		"pow", 2, PGBENCH_POW
+	},
+	{
+		"power", 2, PGBENCH_POW
+	},
 	/* keep as last array element */
 	{
 		NULL, 0, 0
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 7ce6f607f5..f1c1e568e1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -911,6 +911,25 @@ getZipfianRand(TState *thread, int64 min, int64 max, double s)
 	  : computeHarmonicZipfian(thread, n, s));
 }
 
+/* Faster power function for integer values and exp >= 0 */
+static int64
+ipow(int64 base, int64 exp)
+{
+	int64 result = 1;
+
+	Assert(exp >= 0);
+
+	while (exp)
+	{
+		if (exp & 1)
+			result *= base;
+		exp >>= 1;
+		base *= base;
+	}
+
+	return result;
+}
+
 /*
  * Initialize the given SimpleStats struct to all zeroes
  */
@@ -1850,6 +1869,47 @@ evalFunc(TState *thread, CState *st,
 return true;
 			}
 
+		case PGBENCH_POW:
+			{
+PgBenchValue *lval = &vargs[0];
+PgBenchValue *rval = &vargs[1];
+
+Assert(nargs == 2);
+
+/*
+ * If both operands are int and exp >= 0 use
+ * the faster ipow() function, else use pow()
+ */
+if (lval->type == PGBT_INT &&
+	 rval->type == PGBT_INT)
+{
+
+	int64		li,
+ri;
+
+	if (!coerceToInt(lval, &li) ||
+		!coerceToInt(rval, &ri))
+		return false;
+
+	if (ri >= 0)
+		setDoubleValue(retval, ipow(li, ri));
+	else
+		setDoubleValue(retval, pow(li, ri));
+}
+else
+{
+	double		ld,
+rd;
+
+	if (!coerceToDouble(lval, &ld) ||
+		!coerceToDouble(rval, &rd))
+		return false;
+
+	setDoubleValue(retval, pow(ld, rd));
+}
+return true;
+			}
+
 		default:
 			/* cannot get here */
 			Assert(0);
diff --git a/src/bin/pgbench/pgbench.h b/src/bin/pgbench/pgbench.h
index 83fee1ae74..0e92882a4c 100644
--- a/src/bin/pgbench/pgbench.h
+++ b/src/bin/pgbench/pgbench.h
@@ -76,7 +76,8 @@ typedef enum PgBenchFunction
 	PGBENCH_RANDOM,
 	PGBENCH_RANDOM_GAUSSIAN,
 	PGBENCH_RANDOM_EXPONENTIAL,
-	PGBENCH_RANDOM_ZIPFIAN
+	PGBENCH_RANDOM_ZIPFIAN,
+	PGBENCH_POW
 } PgBenchFunction;
 
 typedef struct PgBenchExpr PgBenchExpr;

Re: [JDBC] [HACKERS] Channel binding support for SCRAM-SHA-256

2017-12-22 Thread Michael Paquier
On Fri, Dec 22, 2017 at 11:59 AM, Michael Paquier
 wrote:
> I have looked at how things could be done in symmetry for both the frontend
> and backend code, and I have produced the attached patch 0002, which
> can be applied on top of 0001 implementing tls-server-end-point. This
> simplifies the interfaces to initialize the SCRAM status data by saving
> into scram_state and fe_scram_state respectively Port* and PGconn* which
> holds most of the data needed for the exchange. With this patch, cbind_data
> is generated only if a specific channel binding type is used with the
> appropriate data. So if no channel binding is used there is no additional
> SSL call done to get the TLS finished data or the server certificate hash.
>
> 0001 has no real changes compared to the last versions.

Second thoughts on 0002: there is actually no need to move errorMessage
around if the PGconn* pointer is saved in the SCRAM status data, as both
are linked. The attached simplifies the logic even more.
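
As a side note, the tls-server-end-point rule that 0001 implements boils
down to the following on the OpenSSL side (a simplified sketch, not the
patch itself: per RFC 5929 the DER-encoded certificate is hashed with its
signature digest, except that MD5 and SHA-1 are upgraded to SHA-256; error
handling is trimmed and the function name is illustrative):

```
#include <openssl/evp.h>
#include <openssl/objects.h>
#include <openssl/x509.h>

static int
scram_certificate_hash(X509 *cert, unsigned char *out, unsigned int *outlen)
{
    int             algo_nid;
    const EVP_MD   *md;

    /* Find the digest used to sign the certificate... */
    if (!OBJ_find_sigid_algs(X509_get_signature_nid(cert), &algo_nid, NULL))
        return 0;

    /* ...upgrading MD5 and SHA-1 to SHA-256 as required by RFC 5929. */
    if (algo_nid == NID_md5 || algo_nid == NID_sha1)
        md = EVP_sha256();
    else
        md = EVP_get_digestbynid(algo_nid);
    if (md == NULL)
        return 0;

    /* Hash the DER form of the certificate. */
    return X509_digest(cert, md, out, outlen);
}
```

The frontend applies this to the certificate from SSL_get_peer_certificate()
and the backend to the one from SSL_get_certificate().
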
-- 
Michael
From 69dcf31f5ce5938f9f56a94bf55c8439ea53ed27 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 22 Dec 2017 10:49:10 +0900
Subject: [PATCH 1/2] Implement channel binding tls-server-end-point for SCRAM

As referenced in RFC 5929, this channel binding is not the default value
and uses a hash of the certificate as binding data. On the frontend, this
can be resumed in getting the data from SSL_get_peer_certificate() and
on the backend SSL_get_certificate().

The hashing algorithm needs also to switch to SHA-256 if the signature
algorithm is MD5 or SHA-1, so let's be careful about that.
---
 doc/src/sgml/protocol.sgml   |  5 +-
 src/backend/libpq/auth-scram.c   | 26 ---
 src/backend/libpq/auth.c |  8 +++-
 src/backend/libpq/be-secure-openssl.c| 61 +
 src/include/common/scram-common.h|  1 +
 src/include/libpq/libpq-be.h |  1 +
 src/include/libpq/scram.h|  3 +-
 src/interfaces/libpq/fe-auth-scram.c | 18 ++--
 src/interfaces/libpq/fe-auth.c   | 12 -
 src/interfaces/libpq/fe-auth.h   |  4 +-
 src/interfaces/libpq/fe-secure-openssl.c | 78 
 src/interfaces/libpq/libpq-int.h |  1 +
 src/test/ssl/t/002_scram.pl  |  5 +-
 13 files changed, 207 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 8174e3defa..365f72b51d 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1576,8 +1576,9 @@ the password is in.
   
 Channel binding is supported in PostgreSQL builds with
 SSL support. The SASL mechanism name for SCRAM with channel binding
-is SCRAM-SHA-256-PLUS.  The only channel binding type
-supported at the moment is tls-unique, defined in RFC 5929.
+is SCRAM-SHA-256-PLUS.  Two channel binding types are
+supported at the moment: tls-unique, which is the default,
+and tls-server-end-point, both defined in RFC 5929.
   
 
 
diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c
index d52a763457..849587d141 100644
--- a/src/backend/libpq/auth-scram.c
+++ b/src/backend/libpq/auth-scram.c
@@ -114,6 +114,8 @@ typedef struct
 	bool		ssl_in_use;
 	const char *tls_finished_message;
 	size_t		tls_finished_len;
+	const char *certificate_hash;
+	size_t		certificate_hash_len;
 	char	   *channel_binding_type;
 
 	int			iterations;
@@ -176,7 +178,9 @@ pg_be_scram_init(const char *username,
  const char *shadow_pass,
  bool ssl_in_use,
  const char *tls_finished_message,
- size_t tls_finished_len)
+ size_t tls_finished_len,
+ const char *certificate_hash,
+ size_t certificate_hash_len)
 {
 	scram_state *state;
 	bool		got_verifier;
@@ -187,6 +191,8 @@ pg_be_scram_init(const char *username,
 	state->ssl_in_use = ssl_in_use;
 	state->tls_finished_message = tls_finished_message;
 	state->tls_finished_len = tls_finished_len;
+	state->certificate_hash = certificate_hash;
+	state->certificate_hash_len = certificate_hash_len;
 	state->channel_binding_type = NULL;
 
 	/*
@@ -857,13 +863,15 @@ read_client_first_message(scram_state *state, char *input)
 }
 
 /*
- * Read value provided by client; only tls-unique is supported
- * for now.  (It is not safe to print the name of an
- * unsupported binding type in the error message.  Pranksters
- * could print arbitrary strings into the log that way.)
+ * Read value provided by client; only tls-unique and
+ * tls-server-end-point are supported for now.  (It is
+ * not safe to print the name of an unsupported binding
+ * type in the error message.  Pranksters could print
+ * arbitrary strings into the log that way.)
  */
 channel_binding_type = read_attr_value(&input, 'p');
-if (strcmp(channel_binding_type, SCRAM_CHANNEL_BINDING_TLS_UNIQUE) != 0)
+if (strcmp(channel_binding_type, SCRAM_CHANNEL_BINDING_TLS_UNIQUE) != 0 &&
+	strcmp(chann