Re: [PATCH] Allow UNLISTEN during recovery

2019-01-08 Thread Mi Tar
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            tested, failed

Hi!

I read through the discussion and I agree with the idea here. If DISCARD
ALL is allowed during hot standby mode, and it includes UNLISTEN *, then a
plain UNLISTEN * should also be allowed. It seems this patch fixes that, but I
could not test it (I do not know how to force this state). I would go further,
though, and also allow UNLISTEN a (a specific channel), since DISCARD likewise
allows discarding only part of the resources.

So the patch looks good to me, but documentation changes are missing from it.
UNLISTEN should be removed from the list of commands that are not allowed and
moved to the list of those which are allowed [1]. Moreover, the
src/test/regress/sql/hs_standby_allowed.sql and
src/test/regress/sql/hs_standby_disallowed.sql tests should be updated to
match these changes in the patch. What surprises me is that make check-world
does not fail with this change, even though there is an explicit check for
UNLISTEN *. So does this mean those tests are not run, or does it mean that
this patch does not fix the problem?

[1] https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-USERS
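
For illustration, this is roughly what I would expect hs_standby_allowed.sql to
gain (and hs_standby_disallowed.sql to lose) once the patch is in -- just a
sketch, since I could not get a server into hot standby mode to verify it
myself:

-- run on a hot standby; with the patch these should succeed instead of
-- raising "cannot execute UNLISTEN during recovery"
UNLISTEN a;
UNLISTEN *;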


Mitar

The new status of this patch is: Waiting on Author


Re: commitfest: When are you assigned patches to review?

2019-01-08 Thread Michael Paquier
On Tue, Jan 08, 2019 at 10:56:32PM -0800, Mitar wrote:
> I see that some patches were sent to bugs mailing list, not hackers
> [1]. I thought that all patches have to be send to the hackers mailing
> list, as per this wiki page [2]. Moreover, because they were send to
> the bugs mailing list, I am unsure how can it be discussed/reviewed on
> hackers mailing list while keeping the thread, as per this wiki page
> [3]. Furthermore, I thought that each commitfest entry should be about
> one patch, but [1] seems to provide 3 patches, with multiple versions,
> which makes it a bit unclear to understand which one and how should
> they apply.

That's not a strict process per se.  Sometimes while discussing we
finish by splitting a patch into multiple ones where it makes sense,
and the factor which mainly matters is keeping the commit history clean.
Keeping that point in mind, we may have one commitfest entry dealing
with one or more patches depending on how the author feels things
should be handled.  My take is that additional CF entries make sense
when working on patches which require a different audience and a
different kind of review, while refactoring and preparatory work may
be included with a main patch as long as the patch set remains in
roughly the same area of expertise and stays close to the concept of
the thread dealing with a new feature.

Bugs can be added as CF entries, and posting patches on a bug thread is
also fine.  If a bug fix needs more input, moving it to -hackers can
also make sense, changing the subject along the way.  This depends on
the circumstances and is usually handled case by case.

> [1] https://commitfest.postgresql.org/21/1924/

This item is fun to work with, though all of them apply to unaccent
and are not that invasive, so a single entry looks fine IMO.
--
Michael




Re: [HACKERS] pgbench - allow to store select results into variables

2019-01-08 Thread Fabien COELHO


Attached is a v26 with, I hope, everything OK, except for the documentation,
which I'm unsure how to improve.


Attached v27 is the same but with improved documentation, where full examples,
with and without a prefix, are provided for both \cset and \gset.


--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index b5e3a62a33..cc369c423e 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -954,6 +954,91 @@ pgbench  options  d
   
 
   
+   
+
+ \cset [prefix]
+
+
+
+ 
+  This command may be used to end SQL queries, replacing an embedded
+  semicolon (\;) within a compound SQL command.
+ 
+
+ 
+  When this command is used, the preceding SQL query is expected to
+  return one row, the columns of which are stored into variables named after
+  column names, and prefixed with prefix if provided.
+ 
+
+ 
+  The following example sends four queries as one compound SQL command,
+  inducing one message sent at the protocol level.
+  The result of the first query is stored into variable one,
+  the results of the third query are stored into variables z_three
+  and z_four,
+  whereas the results of the other queries are discarded.
+
+-- compound of four queries
+SELECT 1 AS one \cset
+SELECT 2 AS two \;
+SELECT 3 AS three, 4 AS four \cset z_
+SELECT 5;
+
+ 
+
+ 
+  
+\cset does not work when empty SQL queries appear
+within a compound SQL command.
+  
+ 
+
+   
+
+   
+
+ \gset [prefix]
+
+
+
+ 
+  This command may be used to end SQL queries, replacing a final semicolon
+  (;). 
+ 
+
+ 
+  When this command is used, the preceding SQL query is expected to
+  return one row, the columns of which are stored into variables named after
+  column names, and prefixed with prefix if provided.
+ 
+
+ 
+  The following example puts the final account balance from the first query
+  into variable abalance, and fills variables
+  p_two and p_three 
+  with integers from the last query.
+  The result of the second query is discarded.
+
+UPDATE pgbench_accounts
+  SET abalance = abalance + :delta
+  WHERE aid = :aid
+  RETURNING abalance \gset
+-- compound of two queries
+SELECT 1 \;
+SELECT 2 AS two, 3 AS three \gset p_
+
+ 
+
+ 
+  
+\gset does not work when empty SQL queries appear
+within a compound SQL command.
+  
+ 
+
+   
+

 \if expression
 \elif expression
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 649297ae4f..4b8b6ba10f 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -499,14 +499,17 @@ typedef enum QueryMode
 static QueryMode querymode = QUERY_SIMPLE;
 static const char *QUERYMODE[] = {"simple", "extended", "prepared"};
 
-typedef struct
+typedef struct Command
 {
-	char	   *line;			/* text of command line */
-	int			command_num;	/* unique index of this Command struct */
+	PQExpBufferData lines;		/* raw command text (possibly multi-line) */
+	char	   *first_line;		/* first line for short display */
 	int			type;			/* command type (SQL_COMMAND or META_COMMAND) */
 	MetaCommand meta;			/* meta command identifier, or META_NONE */
 	int			argc;			/* number of command words */
 	char	   *argv[MAX_ARGS]; /* command word list */
+	int			nqueries;		/* number of compounds within an SQL command */
+	int			prefix_size;	/* allocated size of prefix, >= nqueries */
+	char	  **prefix;			/* per-compound command prefix for [cg]set */
 	PgBenchExpr *expr;			/* parsed expression, if needed */
 	SimpleStats stats;			/* time spent in this command */
 } Command;
@@ -521,7 +524,6 @@ typedef struct ParsedScript
 
 static ParsedScript sql_script[MAX_SCRIPTS];	/* SQL script files */
 static int	num_scripts;		/* number of scripts in sql_script[] */
-static int	num_commands = 0;	/* total number of Command structs */
 static int64 total_weight = 0;
 
 static int	debug = 0;			/* debug flag */
@@ -1732,6 +1734,107 @@ valueTruth(PgBenchValue *pval)
 	}
 }
 
+/* read all responses from backend, storing into variable or discarding */
+static bool
+read_response(CState *st, char **prefix)
+{
+	PGresult   *res;
+	int			compound = 0;
+
+	while ((res = PQgetResult(st->con)) != NULL)
+	{
+		switch (PQresultStatus(res))
+		{
+			case PGRES_COMMAND_OK:	/* non-SELECT commands */
+			case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
+				if (prefix[compound] != NULL)
+				{
+					fprintf(stderr,
+							"client %d file %d command %d compound %d: "
+							"\\gset/cset expects a row\n",
+							st->id, st->use_file, st->command, compound);
+					st->ecnt++;
+					return false;
+				}
+				break;			/* OK */
+
+			case PGRES_TUPLES_OK:
+				if (prefix[compound] != NULL)
+				{
+					/* store result into variables if required */
+					int			ntuples = PQntuples(res),
+								nfields = PQnfields(r

Re: commitfest: When are you assigned patches to review?

2019-01-08 Thread Mitar
Hi!

Few more questions.

I see that some patches were sent to bugs mailing list, not hackers
[1]. I thought that all patches have to be send to the hackers mailing
list, as per this wiki page [2]. Moreover, because they were send to
the bugs mailing list, I am unsure how can it be discussed/reviewed on
hackers mailing list while keeping the thread, as per this wiki page
[3]. Furthermore, I thought that each commitfest entry should be about
one patch, but [1] seems to provide 3 patches, with multiple versions,
which makes it a bit unclear to understand which one and how should
they apply.

[1] https://commitfest.postgresql.org/21/1924/
[2] https://wiki.postgresql.org/wiki/Submitting_a_Patch
[3] https://wiki.postgresql.org/wiki/CommitFest


Mitar



unique, partitioned index fails to distinguish index key from INCLUDEd columns

2019-01-08 Thread Justin Pryzby
eb7ed3f3063401496e4aa4bd68fa33f0be31a72f Allow UNIQUE indexes on partitioned 
tables
8224de4f42ccf98e08db07b43d52fed72f962ebb Indexes with INCLUDE columns and their 
support in B-tree

postgres=# CREATE TABLE t(i int,j int) PARTITION BY LIST (i);
postgres=# CREATE TABLE t1 PARTITION OF t FOR VALUES IN (1);
postgres=# CREATE TABLE t2 PARTITION OF t FOR VALUES IN (2);

-- Correctly errors
postgres=# CREATE UNIQUE INDEX ON t(j);
ERROR:  insufficient columns in UNIQUE constraint definition
DETAIL:  UNIQUE constraint on table "t" lacks column "i" which is part of the 
partition key.

-- Fails to error
postgres=# CREATE UNIQUE INDEX ON t(j) INCLUDE(i);

-- Fails to enforce uniqueness across partitions, due to the failure to
-- enforce inclusion of the partition key in the index KEY
postgres=# INSERT INTO t VALUES(1,1);
postgres=# INSERT INTO t VALUES(2,1); 

postgres=# SELECT * FROM t;
 i | j 
---+---
 1 | 1
 2 | 1
(2 rows)
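
-- For comparison, a sketch of the intended rule (not part of the report
-- above): the partition key has to be part of the index *key*, not merely an
-- INCLUDEd column, for cross-partition uniqueness to be sound.
postgres=# CREATE UNIQUE INDEX ON t(j, i);   -- accepted, and safe
-- whereas CREATE UNIQUE INDEX ON t(j) INCLUDE(i) should arguably be rejected,
-- just like CREATE UNIQUE INDEX ON t(j) is.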

I found this thread appears to have been close to discovering the issue ~9
months ago.
https://www.postgresql.org/message-id/flat/CAJGNTeO%3DBguEyG8wxMpU_Vgvg3nGGzy71zUQ0RpzEn_mb0bSWA%40mail.gmail.com

Justin



Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-01-08 Thread Pavan Deolasee
On Tue, Jan 1, 2019 at 6:09 AM Alvaro Herrera 
wrote:

> For discussion, here's an preliminary patch.  This is just a first
> skeleton; needs to grow a lot of flesh yet, per my previous proposal.
> As far as the generic CREATE INDEX stuff goes, I think this is complete;
> it's missing the AM-specific bits.
>

Looks like it's missing the validate_index blocks-scanned report, which is
not AM-specific and which your original proposal does mention.

I thought a bit about the index_build part. If most AMs follow a somewhat
standard set of phases while building an index, it might be simpler to define
those phases and have AM-agnostic code report them. Or maybe just report the
most significant information, instead of reporting each sub-phase of
index_build. I think the most important progress to know would be how far the
heap has been scanned for to-be-indexed tuples. AFAICS all AMs use
IndexBuildHeapScan() to scan the heap. Can we simply do some reporting from
that routine, like the number of blocks scanned against the total number of
blocks requested?

Some minor comments on the patch, though I suspect you might be already
updating the patch since you marked it as WIP.

+CREATE VIEW pg_stat_progress_create_index AS
+ SELECT
+ s.pid AS pid, S.datid AS datid, D.datname AS datname,
+ S.relid AS relid,
+ CASE s.param1 WHEN 0 THEN 'initializing'
+   WHEN 1 THEN 'waiting for lockers 1'
+   WHEN 2 THEN 'building index'
+   WHEN 3 THEN 'waiting for lockers 2'
+   WHEN 4 THEN 'validating index'
+   WHEN 5 THEN 'waiting for lockers 3'

Can we have more descriptive text for waiters, such as "waiting for existing
writers", "waiting for intermediate writers" and "waiting for old readers"?
I'm not sure if I got those correct, but something of that sort would
definitely give more insight into what the transaction is waiting for.

Can we actually also report the list of transactions the command is waiting
on?
That could be useful to the user if CIC appears to be stuck too long.
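
For instance, a DBA could then keep an eye on a long-running CIC with something
like the query below.  This is only a sketch against the WIP view: the phase
column name and the waitfor_total/waitfor_done column names are my assumptions,
not taken from the patch.

SELECT pid, datname, relid::regclass AS table_name, phase,
       waitfor_done, waitfor_total
  FROM pg_stat_progress_create_index;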


+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
+ nparts);
+

IMHO we should just use the full term INDEX instead of IDX, such as
PROGRESS_CREATE_INDEX_PARTITIONS_TOTAL. It's already a long name, so a couple
of extra characters won't make a difference. I did not see much precedent for
shortening INDEX to IDX elsewhere in the code (though we tend to do that for
variable names etc).


@@ -1282,6 +1318,9 @@ DefineIndex(Oid relationId,
  old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
&n_old_snapshots);
+ pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+ PROGRESS_CREATEIDX_PHASE_WAIT_3);
+ pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);

I think we should clear the PROGRESS_WAITFOR_TOTAL and PROGRESS_WAITFOR_DONE
when the wait phase is over, to avoid any confusion. For example, I noticed
that the counters from WAIT_1 are reported as-is if WAIT_2 had no lockers.

Shouldn't PROGRESS_WAITFOR_DONE be updated when we skip a snapshot in the
code below?
if (!VirtualTransactionIdIsValid(old_snapshots[i]))
continue;   /* found uninteresting in previous cycle */


@@ -2817,7 +2818,7 @@ FastPathGetRelationLockEntry(LOCALLOCK *locallock)
  * uses of the result.
  */
 VirtualTransactionId *
-GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode)
+GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode, int *ocount)


Could that out variable be named differently? "countp" or something like
that. I did not check whether there is a convention that we follow, but I
remember suffixing with "p" rather than prefixing with "o" (for "out", I
assume).


+
+/* Progress parameters for CREATE INDEX */
+#define PROGRESS_CREATEIDX_PHASE 0
+/* 1 and 2 reserved for "waitfor" metrics */
+#define PROGRESS_CREATEIDX_PARTITIONS_TOTAL 3
+#define PROGRESS_CREATEIDX_PARTITIONS_DONE 4
+

Is there a reason to leave those reserved placeholders, only to fill them a
few lines down?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: proposal: variadic argument support for least, greatest function

2019-01-08 Thread Pavel Stehule
st 9. 1. 2019 v 1:07 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > We cannot to write
> > SELECT least(VARIADIC ARRAY[1,2,3]);
> > Attached patch add this possibility to least, greatest functions.
>
> TBH, I don't find that natural at all.  If I were looking for the
> functionality "smallest element of an array", I think I'd expect to find
> that exposed as "array_smallest(anyarray) returns anyelement", not as
> some weird syntax option for LEAST.
>

The target of this patch is consistency of the LEAST and GREATEST variadic
functions (as implemented) with generic variadic functions.

Sure, it is possible to implement array_smallest(anyarray), but that is
different. This patch tries to eliminate the unpleasant surprise that LEAST
and GREATEST behave differently from other variadic functions.


> It also seems rather inconsistent that this behaves so differently
> from, eg,
>
> =# select least(array[1,2], array[3,4]);
>  least
> ---
>  {1,2}
> (1 row)
>
> Normally, if you have a variadic function, it doesn't also take arrays,
> so that there's less possibility for confusion.
>

This is a different case - the keyword VARIADIC was not used here.


> The implementation seems mighty ugly too, in that it has to treat this
> as entirely disjoint from MinMaxExpr's normal argument interpretation.
> But that seems like a symptom of the fact that the definition is
> disjointed itself.
>

I don't think there is any other possibility - I have no way to unpack an
array into its elements at the analyze stage.


> In short, I'd rather see this done with a couple of array functions,
> independently of MinMaxExpr.
>

That doesn't help users when they try to use the VARIADIC keyword with the
LEAST and GREATEST functions.
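
For the record, here is what the patch allows versus what one has to write
today (the latter being roughly the array-function route Tom suggests):

SELECT least(VARIADIC ARRAY[1,2,3]);              -- proposed by this patch
SELECT min(x) FROM unnest(ARRAY[1,2,3]) AS t(x);  -- works today, returns 1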

Regards

Pavel


>
> regards, tom lane
>


Re: Undo logs

2019-01-08 Thread Dilip Kumar
On Tue, Jan 8, 2019 at 2:11 PM Amit Kapila  wrote:

>
> Few more comments:
> 
> Few comments:
> 
> 1.
> + * undorecord.c
> + *   encode and decode undo records
> + *
> + * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
>
> Change the year in Copyright notice for all new files?
>

Done

>
> 2.
> + * This function sets uur->uur_info as a side effect.
> + */
> +bool
> +InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
> + int starting_byte, int *already_written, bool header_only)
>
> Is the above part of comment still correct? I don't see uur_info being set
> here.
>

Changed

>
> 3.
> + work_txn.urec_next = uur->uur_next;
> + work_txn.urec_xidepoch = uur->uur_xidepoch;
> + work_txn.urec_progress = uur->uur_progress;
> + work_txn.urec_prevurp = uur->uur_prevurp;
> + work_txn.urec_dbid = uur->uur_dbid;
>
> It would be better if we initialize these members in the order in
> which they appear in the actual structure.  All other undo header
> structures are initialized that way, so this looks out-of-place.
>

Fixed

>
> 4.
> + * 'my_bytes_written' is a pointer to the count of previous-written bytes
> + * from this and following structures in this undo record; that is, any
> + * bytes that are part of previous structures in the record have already
> + * been subtracted out.  We must update it for the bytes we write.
> + *
> ..
> +static bool
> +InsertUndoBytes(char *sourceptr, int sourcelen,
> + char **writeptr, char *endptr,
> + int *my_bytes_written, int *total_bytes_written)
> +{
> ..
> +
> + /* Update bookkeeeping infrormation. */
> + *writeptr += can_write;
> + *total_bytes_written += can_write;
> + *my_bytes_written = 0;
>
> I don't understand the above comment where it is written: "We must
> update it for the bytes we write.".  We always set  'my_bytes_written'
> as 0 if we write.  Can you clarify?  I guess this part of the comment
> is about total_bytes_written or here does it mean to say that caller
> should update it.  I think some wording change might be required based
> on what we intend to say here.
>
> Similar to above, there is a confusion in the description of
> my_bytes_read atop ReadUndoBytes.
>

Fixed

>
> 5.
> +uint32
> +GetEpochForXid(TransactionId xid)
> {
> ..
> + /*
> + * Xid can be on either side when near wrap-around.  Xid is certainly
> + * logically later than ckptXid.
> ..
>
> From the usage of this function in the patch, can we say that Xid is
> always later than ckptxid, if so, how?  Also, I think you previously
> told in this thread that usage of uur_xidepoch is mainly for zheap, so
> we might want to postpone the inclusion of this undo records. On
> thinking again, I think we should follow your advice as I think the
> correct usage here would require the patch by Thomas to fix our epoch
> stuff [1]?  Am, I correct, if so, I think we should postpone it for
> now.
>

Removed

>
> 6.
>  /*
> + * SetCurrentUndoLocation
> + */
> +void
> +SetCurrentUndoLocation(UndoRecPtr urec_ptr)
> {
> ..
> }
>
> I think you can use some comments atop this function to explain the
> usage of this function or how will callers use it.
>

Done

>
> I am done with the first level of code-review for this patch.  I am
> sure we might need few interface changes here and there while
> integrating and testing this with other patches, but the basic idea
> and code look reasonable to me.  I have modified the proposed commit
> message in the attached patch, see if that looks fine to you.
>
> To be clear, this patch can't be independently committed/tested, we
> need undo logs and undo worker machinery patches to be ready as well.
> I will review those next.
>

Make sense

>
> [1] -
> https://www.postgresql.org/message-id/CAEepm%3D2YYAtkSnens%3DTR2S%3DoRcAF9%3D2P7GPMK0wMJtxKF1QRig%40mail.gmail.com



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0003-Provide-interfaces-to-store-and-fetch-undo-records_v15.patch
Description: Binary data


Re: FETCH FIRST clause PERCENT option

2019-01-08 Thread Surafel Temesgen
On Sun, Jan 6, 2019 at 5:51 PM Tomas Vondra 
wrote:

>
> On 1/6/19 12:40 PM, Surafel Temesgen wrote:
> >
> >
> > On Fri, Jan 4, 2019 at 5:27 PM Tomas Vondra
> > mailto:tomas.von...@2ndquadrant.com>>
> wrote:
> >
> >
> > What formula? All the math remains exactly the same, you just need to
> > update the number of rows to return and track how many rows are
> already
> > returned.
> >
> > I haven't tried doing that, but AFAICS you'd need to tweak how/when
> > node->count is computed - instead of computing it only once it needs
> to
> > be updated after fetching each row from the subplan.
> >
> > Furthermore, you'll need to stash the subplan rows somewhere (into a
> > tuplestore probably), and whenever the node->count value increments,
> > you'll need to grab a row from the tuplestore and return that (i.e.
> > tweak node->position and set node->subSlot).
> >
> > I hope that makes sense. The one thing I'm not quite sure about is
> > whether tuplestore allows adding and getting rows at the same time.
> >
> > Does that make sense
> >
> >
> >
> > In current implementation in LIMIT_INITIAL state we execute outer
> > plan to the end , store the resulting tuples in tuplestore and
> > calculate limitCount in number .We are in this state only once and
> > did not return any tuple. Once we are at LIMIT_INWINDOW state and
> > inner node execution asking for tuple it return from tuple store
> > immediately.
> >
>
> Right.
>
> > Inorder to do fast startup what I was thinking was dividing the work
> > done at LIMIT_INITIAL state in to limitCount. For example we want
> > 0.5 percent of the result which means in LIMIT_INWINDOW state we
> > execute outer node 200 times ,store the result in tuplestore and
> > return the first tuple. if we don’t get that much tuple that means we
> > reach end of the limit.
> Yes, pretty much.
>


While working on this method I found an issue: it does not work well for a
percentage with a fractional part, like 2.6 percent, which means we have to
emit one tuple per every 38.4615 input tuples. That has to be rounded up to 39
or down to 38, and either way it leads to an incorrect result. Even with an
integer percentage the interval sometimes comes out with a fractional part and
needs rounding.
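
To make the arithmetic concrete (using the FETCH FIRST ... PERCENT numbers from
above): for 1000 input rows and 2.6 percent we must return 26 rows, which works
out to one emitted row per 38.46 input rows on average -- not a whole number,
so a fixed per-N-tuples emission interval cannot be exact:

SELECT ceil(1000 * 2.6 / 100.0) AS rows_to_return,   -- 26
       1000 / 26.0              AS avg_rows_between; -- 38.4615...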

regards
Surafel


Re: Offline enabling/disabling of data checksums

2019-01-08 Thread Fabien COELHO



I changed that to the switches -c/--verify (-c for check as -v is taken,
should it be --check as well? I personally like verify better), 
-d/--disable and -e/--enable.


I agree that checking the checksum sounds repetitive, but I think that for 
consistency --check should be provided.


About the patch: applies, compiles, global & local "make check" are ok.

There is still no documentation.

I think that there is a consensus about renaming the command.

The --help string documents --action which does not exists anymore.

The code in "updateControlFile" seems to allow creating the file
(O_CREAT). I do not think that it should; it should only apply to an
existing file.


ISTM that some generalized version of this function should be in 
"src/common/controldata_utils.c" instead of duplicating it from command to 
command (as suggested by Michaël as well).


In "scan_file" verbose output, ISTM that the checksum is more computed 
than enabled on the file. It is really enabled at the cluster level in the 
end.


Maybe there could be only one open call with a ?: for RO vs RW.

Non root check: as files are only manipulated RW, ISTM that there is no 
reason why the ownership would be changed, so I do not think that this 
constraint is useful.


There is kind of a copy paste for enabling/disabling, I'd consider 
skipping the scan when not necessary and merge both branches.



Also, the full page is rewritten... would it make sense to only overwrite
the checksum part itself?


So just writing the page header? I find that a bit scary and don't
expect much speedup as the OS would write the whole block anyway I
guess? I haven't touched that yet.


Possibly the OS would write its block size, which is not necessarily the
same as the postgres page size?



It seems that the control file is unlinked and then rewritten. If the
rewritting fails, or the command is interrupted, the user has a problem.

Could the control file be simply opened RW? Else, I would suggest to
rename (eg add .tmp), write the new one, then unlink the old one, so that
recovering the old state in case of problem is possible.


I have mostly taken the pg_rewind code here; if there was a function
that allowed for safe offline changes of the control file, I'd be happy
to use it but I don't think it should be this patch to invent that.


It is reinventing it somehow by duplicating the stuff anyway. I'd suggest 
a separate preparatory patch to do the cleanup.


--
Fabien.

Re: add_partial_path() may remove dominated path but still in use

2019-01-08 Thread Kohei KaiGai
On Wed, Jan 9, 2019 at 13:18, Kyotaro HORIGUCHI wrote:
>
> At Sun, 30 Dec 2018 12:31:22 +0900, Kohei KaiGai  wrote 
> in 
> On Sun, Dec 30, 2018 at 4:12, Tom Lane wrote:
> > On the other hands, the later hook must be dedicated to add regular paths,
> > and also provides a chance for extensions to manipulate pre-built path-list
> > including Gather-path.
> > As long as I know, pg_hint_plan uses the set_rel_pathlist_hook to enforce
> > a particular path-node, including Gather-node, by manipulation of the cost
> > value. Horiguchi-san, is it right?
>
> Mmm. I hadn't expected that it would be mentioned here.
>
> Actually, in the hook it changes enable_* planner variables, or
> directly manipulates path costs, or can even clear and
> regenerate the path list and gather paths for the parallel
> case. It would be nice if we had a chance to manipulate partial
> paths before generating gather paths.
>
So, is it sufficient if set_rel_pathlist_hook is just relocated in
front of generate_gather_paths?
If we have no use case for the second hook, there is little necessity
to have the post_rel_pathlist_hook() here.
(At least, PG-Strom will use the first hook only.)

Thanks,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 



Re: Problem during Windows service start

2019-01-08 Thread Michael Paquier
On Wed, Jan 09, 2019 at 05:28:29AM +, Higuchi, Daisuke wrote:
> One solution is that the status of the Windows service should be changed to
> "SERVICE_RUNNING" even if the timeout occurs because of a long recovery. I
> attached a patch implementing this solution.

You should register this patch to the next commit fest in the section
for bug fixes to not lose sight of it;
https://commitfest.postgresql.org/22/

+            case PQPING_NO_ATTEMPT:
+                write_eventlog(EVENTLOG_ERROR_TYPE,
+                               _("Server startup failed because of wrong parameter or something\n"));
+                pgwin32_SetServiceStatus(SERVICE_STOPPED);

I haven't put much thoughts into what you propose here, but this status
message is not really helpful for the user.

> Another solution is to backport commit [1] to 9.6 and earlier versions.
> However, that fix changes the content of the PID file, so it is an
> incompatible change, I think.

Yeah, let's not go down that road.
--
Michael




RE: Problem during Windows service start

2019-01-08 Thread Higuchi, Daisuke
Hi, 

This thread is inactive, but I want to solve this problem.
I think this problem rarely occurs in version 10 or later because of commit
[1], since "pg_ctl start -w" waits only for PID file creation.  That means a
timeout does not occur even if crash recovery takes a long time.
However, 9.6 and earlier still wait for a long recovery to complete.

> How do you propose to fix it?

I think there are two solutions. 

One solution is that the status of the Windows service should be changed to
"SERVICE_RUNNING" even if the timeout occurs because of a long recovery. I
attached a patch implementing this solution.

Another solution is to backport commit [1] to 9.6 and earlier versions.
However, that fix changes the content of the PID file, so it is an
incompatible change, I think.

I would appreciate any comments.

[1] 
https://github.com/postgres/postgres/commit/f13ea95f9e473a43ee4e1baeb94daaf83535d37c

Regards, 
Daisuke, Higuchi



windows_service_bug_fix.patch
Description: windows_service_bug_fix.patch


Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2019-01-08 Thread Haribabu Kommi
On Wed, Jan 9, 2019 at 2:25 PM Amit Kapila  wrote:

> Few minor comments:
>

Thanks for the review.


> 1.
> bench=# SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
>nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
>   FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
>
-[ RECORD 5 ]---------------------------------------------
> query   | vacuum analyze pgbench_accounts
> calls   | 1
> total_time  | 136.448116
> rows| 0
> hit_percent | 99.9201915403032721
>
> There is nothing between the first and second time you ran this query,
> so where did this vacuum analyze .. record came?
>

Because the pg_stat_statements_reset() call reset the first query's entry,
the above statement shows up when the query is executed the second time, as
the SELECT limits the output to 5 rows.


> 2.
> bench=# SELECT pg_stat_statements_reset(0,0,s.queryid) FROM
> pg_stat_statements AS s
> bench-# WHERE s.query = 'UPDATE pgbench_branches SET bbalance =
> bbalance + $1 WHERE bid = $2';
>
> bench=#
>
> I think it would be good if you show the output of
> pg_stat_statements_reset statement instead of showing empty line.
>

The pg_stat_statements_reset() function just returns void; for that reason
the already existing _reset() example also just shows an empty line, and I
followed the same style.

I feel we should either show the returned value for all of the _reset()
function calls or leave all of them empty.


> Another minor point is that in the second statement
> (pg_stat_statements_reset), you seem to have made it a two-line
> statement whereas first one looks to be a single-line statement, it
> would be good from the readability point of view if both looks same.
> I would prefer the second to look similar to the first one.
>

OK. Corrected.

Updated patch attached.

Regards,
Haribabu Kommi
Fujitsu Australia


0001-Extend-pg_stat_statements_reset-to-reset-statistics_v16.patch
Description: Binary data


Re: A few new options for vacuumdb

2019-01-08 Thread Michael Paquier
On Wed, Jan 09, 2019 at 10:33:00AM +0900, Masahiko Sawada wrote:
> Since pg_(total)_relation_size() returns 0 for a parent table, specifying
> the parent table to vacuumdb with --min-relation-size always does nothing.
> Maybe we will need to deal with this case when a function returning the
> whole partitioned table size is introduced.

Good point.  I am not sure if we want to go down to having a size
function dedicated to partitions especially as this would just now be
a wrapper around pg_partition_tree(), but the size argument with
partitioned tables is something to think about.  If we cannot sort out
this part cleanly, perhaps we could just focus on the age-ing
parameters and the other ones first?  It seems to me that what is
proposed on this thread has value, so we could shave things and keep
the essential, and focus on what we are sure about for simplicity.
--
Michael




Re: add_partial_path() may remove dominated path but still in use

2019-01-08 Thread Kyotaro HORIGUCHI
At Sun, 30 Dec 2018 12:31:22 +0900, Kohei KaiGai  wrote in 

> On Sun, Dec 30, 2018 at 4:12, Tom Lane wrote:
> On the other hands, the later hook must be dedicated to add regular paths,
> and also provides a chance for extensions to manipulate pre-built path-list
> including Gather-path.
> As long as I know, pg_hint_plan uses the set_rel_pathlist_hook to enforce
> a particular path-node, including Gather-node, by manipulation of the cost
> value. Horiguchi-san, is it right?

Mmm. I hadn't expected that it would be mentioned here.

Actually, in the hook it changes enable_* planner variables, or
directly manipulates path costs, or can even clear and
regenerate the path list and gather paths for the parallel
case. It would be nice if we had a chance to manipulate partial
paths before generating gather paths.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Installation instructions update (pg_ctl)

2019-01-08 Thread Ryan Lambert
I used the updated instructions from pg_ctl.diff to install from source.
It worked well for me; the new version is more consistent.

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2019-01-08 Thread Amit Kapila
On Tue, Jan 8, 2019 at 6:00 AM Haribabu Kommi  wrote:
>
> On Tue, Dec 18, 2018 at 3:53 PM Amit Kapila  wrote:
>>
>> Can you add a few examples in the documentation?  See if it is
>> possible to extend the existing documentation section (F.29.4),
>> otherwise, add a new section.
>
>
> Apologies for the delay.
>

No problem.

> Attached the updated patch with addition of few examples usage
> of pg_stat_statements_reset() function with new parameters to the
> existing sample output.
>

Few minor comments:
1.
bench=# SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
   nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
  FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;

-[ RECORD 5 ]---------------------------------------------
query   | vacuum analyze pgbench_accounts
calls   | 1
total_time  | 136.448116
rows| 0
hit_percent | 99.9201915403032721

There is nothing between the first and second time you ran this query,
so where did this vacuum analyze .. record came?

2.
bench=# SELECT pg_stat_statements_reset(0,0,s.queryid) FROM
pg_stat_statements AS s
bench-# WHERE s.query = 'UPDATE pgbench_branches SET bbalance =
bbalance + $1 WHERE bid = $2';

bench=#

I think it would be good if you show the output of
pg_stat_statements_reset statement instead of showing empty line.
Another minor point is that in the second statement
(pg_stat_statements_reset), you seem to have made it a two-line
statement whereas first one looks to be a single-line statement, it
would be good from the readability point of view if both looks same.
I would prefer the second to look similar to the first one.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Improve selectivity estimate for range queries

2019-01-08 Thread Kyotaro HORIGUCHI
At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190108.162638.106314087.horiguchi.kyot...@lab.ntt.co.jp>
> Hello.
> 
> At Fri, 21 Dec 2018 11:50:28 -0500, Tom Lane  wrote in 
> <28533.1545411...@sss.pgh.pa.us>
> > seem that that's just moving the problem around, but I think it
> > might be possible to show that such a value couldn't be computed
> by scalarltsel given a histogram with no more than 10000 members.
> > (I haven't tried to actually prove that, but it seems intuitive
> > that the set of possible results would be quantized with no more
> > than about 5 digits precision.)

I think we don't need a perfect proof for that. The fact that
exactly 1/3 is quite natural and common but 1/3 + ε is not would
be enough.

> FWIW, I got the following result on my environment. It seems
> different enough if this holds on all supported platforms, though
> there still is a case where the result of a sequence of
> arithmetic operations makes a false match.

A simple selectivity of a relation theoretically cannot match the value with
the epsilon. (Of course, only on *my* environment.)

(0.333..)
binary format: 3f d5 55 55 55 55 55 55
x = 0.333333333333333315
   231 matches, 79 no_matches

(0.3{13}42..)
binary format: 3f d5 55 55 55 55 55 f1
x = 0.333333333333341975
   0 matches, 310 no_matches

(0.3{15}42..)
binary format: 3f d5 55 55 55 55 55 57
x = 0.333333333333333426
   0 matches, 310 no_matches


It seems that 0.3{13}42 is correctly 0.3{15}42, which makes just
two LSBs difference from 1/3. I believe C is well standardized on
the translation. Other DEFAULT_*_SELs are not compared in this
way.

The attached small patch fixes the case mentioned in this thread, but I'm not
sure where to put a test. A static assertion is not usable. An assertion
somewhere, perhaps in clauselist_selectivity, seems somewhat overdone. I can't
find a suitable place in the regression tests.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#include 
#include 

int test(double x)
{
  double d = 1.0;
  double d0 = 0;
  unsigned char *c_x = (unsigned char *) &x;
  int nmatches = 0;
  int nnomatches = 0;
  int i;
  
  fprintf(stderr, "binary format: ");
  for (i = 7 ; i >= 0 ; i--)
fprintf(stderr, "%s%02x", i < 7 ? " " : "", c_x[i]);
  fprintf(stderr, "\n");

  fprintf(stderr, "x = %20.18f\n", x);
  while (d != d0)
  {
double z = floor(d * 3);
double z1 = z + 1.0;
double y = d / z;
double y1 = d / z1;

/* Check if both sides of d * 3 doesn't make match */
if (y == x || y1 == x)
  nmatches++;
else
  nnomatches++;

d0 = d;
d = d * 10;
  }

  fprintf(stderr, "   %d matches, %d no_matches\n", nmatches, nnomatches);

}

int main(void)
{
  test(0.3333333333333333);
  test(0.333333333333342);
  test(0.33333333333333342);

  return 0;
}
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 3739b9817a..cdeaac22c8 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -109,7 +109,7 @@ clauselist_selectivity(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
-	/*
+ 	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity(). None of what we might do below is relevant.
 	 */
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 5cc4cf15e2..15a8d2402a 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -33,8 +33,12 @@
 /* default selectivity estimate for equalities such as "A = b" */
 #define DEFAULT_EQ_SEL	0.005
 
-/* default selectivity estimate for inequalities such as "A < b" */
-#define DEFAULT_INEQ_SEL  0.3333333333333333
+/*
+ * default selectivity estimate for inequalities such as "A < b"
+ * The last two digits prevent it from making a false match with 1/3 computed
+ * from histogram and/or MCV.
+ */
+#define DEFAULT_INEQ_SEL  0.33333333333333342
 
 /* default selectivity estimate for range inequalities "A > b AND A < c" */
 #define DEFAULT_RANGE_INEQ_SEL	0.005


RE: speeding up planning with partitions

2019-01-08 Thread Imai, Yoshikazu
Hi,

I ran performance tests for a non-prepared query and for a prepared query with
plan_cache_mode='auto' and plan_cache_mode='force_custom_plan'. I also varied
the number of partitions between 256 and 4096. I ran the tests on master and on
v9-patched.

[settings]
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

[partitioning table definitions(with 4096 partitions)]
create table rt (a int, b int, c int) partition by range (a);

\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

[pgbench(with 4096 partitions)]
no prepared: pgbench -n -f select4096.sql -T 60
prepared:pgbench -n -f select4096.sql -T 60 -M prepared

[select4096.sql]
\set a random(1, 4096)
select a from rt where a = :a;

[results]
master:
  part_num  no-prepared   auto   force_custom_plan  (1-auto/force_custom_plan)
       256          604    571                 576    0.01
      4096         17.5   17.5                15.1   -0.16

patched:
  part_num  no-prepared   auto   force_custom_plan  (1-auto/force_custom_plan)
       256         8614   9446                9384   -0.006
      4096         7158   7165                7864    0.089


There is almost no difference between auto and force_custom_plan with 256
partitions, but there is a difference between them with 4096 partitions. While
auto is faster than force_custom_plan on master, force_custom_plan is faster
than auto on patched.

I wonder why force_custom_plan is faster than auto after applying the patch.

When we use PREPARE/EXECUTE with plan_cache_mode='auto', a generic plan is
created and used if its cost is cheaper than creating and using a custom plan,
while with plan_cache_mode='force_custom_plan' a custom plan is always created
and used. So one could think the difference in the above results comes from
creating or using a generic plan.

I checked how many times a generic plan is created during the pgbench run and
found that with plan_cache_mode='auto' a generic plan is created only once and
custom plans are created all the other times. I also checked the time taken to
create the generic plan, but it wasn't much (250 ms or so with 4096
partitions). So the time of creating a generic plan does not affect the
performance.
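
(For reference, one way to see which kind of plan is being used -- a sketch,
not necessarily how I measured it above: a generic plan shows the parameter
symbol instead of the bound constant.)

PREPARE q (int) AS SELECT a FROM rt WHERE a = $1;
EXECUTE q(1);   -- repeat; with plan_cache_mode = 'auto' a generic plan is
                -- considered from the sixth execution
EXPLAIN (COSTS OFF) EXECUTE q(1);  -- "a = $1" in the plan means a generic plan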

Currently, a generic plan is created on the sixth execution of an EXECUTE
query. I changed that to happen much later (e.g. on the 400,000th execution on
master with 4096 partitions, because 7000 TPS x 60 sec = 420,000 transactions
are run during the pgbench run), and then there is almost no difference between
auto and force_custom_plan. So I think that the creation of a generic plan
affects the time of executing the queries that come after it.

If my assumption is right, we would expect some additional work to happen when
executing the queries that come after creating a generic plan, which would make
auto slower than force_custom_plan. But looking at the above results, on master
with 4096 partitions auto is faster than force_custom_plan. So now I am
confused.

Do you have any ideas what does affect the performance?

--
Yoshikazu Imai


Re: Offline enabling/disabling of data checksums

2019-01-08 Thread Michael Paquier
On Tue, Jan 08, 2019 at 01:03:25PM +0100, Michael Banck wrote:
> I changed that to the switches -c/--verify (-c for check as -v is taken,
> should it be --check as well? I personally like verify better), 
> -d/--disable and -e/--enable.

Indeed we could use --check, pg_checksums --check looks repetitive
still that makes the short option more consistent with the rest.

+   printf(_("  -A, --action   action to take on the cluster, can be set as\n"));
+   printf(_(" \"verify\", \"enable\" and \"disable\"\n"));
Not reflected yet in the --help portion.

>> Also, the full page is rewritten... would it make sense to only overwrite 
>> the checksum part itself?
> 
> So just writing the page header? I find that a bit scary and don't
> expect much speedup as the OS would write the whole block anyway I
> guess? I haven't touched that yet.

The OS would write blocks of 4kB out of the 8kB as that's the usual
page size, no?  So this could save a lot of I/O.

> I have mostly taken the pg_rewind code here; if there was a function
> that allowed for safe offline changes of the control file, I'd be happy
> to use it but I don't think it should be this patch to invent that.
> 
> In any case, I have removed the unlink() now (not sure where that came
> from), and changed it to open(O_WRONLY) same as in Michael's code and
> pg_rewind.

My own stuff in pg_checksums.c does not have an unlink(), anyway...  I
think that there is room for improvement for both pg_rewind and
pg_checksums here.  What about refactoring updateControlFile() and
move it to controldata_utils.c()?  This centralizes the CRC check,
static assertions, file open and writes into a single place.  The
backend has a similar flavor with UpdateControlFile.  By combining
both we need some extra "ifdef FRONTEND" for BasicOpenFile and the
wait events which generates some noise, still both share a lot.  The
backend also includes a fsync() for the control file which happens
when the file is written, but for pg_checksums and pg_rewind we just
do it in one go at the end, so we would need an extra flag to decide
if fsync should happen or not.  pg_rewind has partially the right
interface by passing ControlFileData contents as an argument.

> V2 attached.

+/* Filename components */
+#define PG_TEMP_FILES_DIR "pgsql_tmp"
+#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
This may look strange, but these are needed because pg_checksums
calls some of the sync-related routines which are defined in fd.c.
Amen.

+   if (fsync(fd) != 0)
+   {
+   fprintf(stderr, _("%s: fsync error: %s\n"), progname, strerror(errno));
+   exit(1);
+   }
No need for that as fsync_pgdata() gets called at the end.
--
Michael




Re: A few new options for vacuumdb

2019-01-08 Thread Masahiko Sawada
On Wed, Jan 9, 2019 at 10:06 AM Michael Paquier  wrote:
>
> On Tue, Jan 08, 2019 at 06:46:11PM +, Bossart, Nathan wrote:
> > This was done in order to maintain the current behavior that
> > appendQualifiedRelation() gives us.  I found that skipping the
> > search_path handling here forced us to specify the schema in the
> > argument for --table in most cases.  At the very least, I could add a
> > comment here to highlight the importance of fully qualifying
> > everything in the catalog query.  What do you think?
>
> A comment sounds like a good thing.  And we really shouldn't hijack
> search_path even for one query...
>
> > Looks good to me, except for one small thing in the documentation:
> >
> > +   
> > +Disable all page-skipping behavior during processing based on
> > +the visibility map, similarly to the option
> > +DISABLE_PAGE_SKIPPING for 
> > VACUUM.
> > +   
> >
> > I think the "similarly to the option" part is slightly misleading.
> > It's not just similar, it _is_ using that option in the generated
> > commands.  Perhaps we could point to the VACUUM documentation for more
> > information about this one.
>
> Sure.  If you have any suggestions, please feel free.  Adding a link
> to VACUUM documentation sounds good to me, as long as we avoid
> duplicating the description related to DISABLE_PAGE_SKIPPING on the
> VACUUM page.
>
> > Good point.  I think allowing multiple different relation size options
> > here would be confusing, too (e.g. --min-relation-size versus
> > --min-total-relation-size).  IMO pg_total_relation_size() is the way
> > to go here, as we'll most likely need to process the indexes and TOAST
> > tables, too.  If/when there is a DISABLE_INDEX_CLEANUP option for
> > VACUUM, we'd then want to use pg_table_size() when --min-relation-size
> > and --disable-index-cleanup are used together in vacuumdb.
> > Thoughts?
>
> Yes, using pg_total_relation_size() looks like the best option to me
> here as well, still this does not make me 100% comfortable from the
> user perspective.

Agreed.

Since pg_(total)_relation_size() returns 0 for a parent table, specifying
the parent table to vacuumdb with --min-relation-size always does nothing.
Maybe we will need to deal with this case when a function returning the
whole partitioned table size is introduced.
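
For example (a sketch, assuming a partitioned table rt with leaf partitions):

SELECT pg_total_relation_size('rt');          -- 0: the parent has no storage
SELECT sum(pg_total_relation_size(inhrelid))
  FROM pg_inherits
 WHERE inhparent = 'rt'::regclass;            -- direct partitions only; a
                                              -- recursive walk (or the new
                                              -- pg_partition_tree()) is needed
                                              -- for multi-level partitioning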
Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



RE: [Proposal] Add accumulated statistics

2019-01-08 Thread Tsunakawa, Takayuki
From: Adrien NAYRAT [mailto:adrien.nay...@anayrat.info]
> FYI, wait events have been added to PoWA by using the pg_wait_sampling
> extension:
> https://rjuju.github.io/postgresql/2018/07/09/wait-events-support-for-
> powa.html
> 
> pg_wait_sampling samples the wait events in shared memory and PoWA stores
> them.

Great.  Also FYI, Amazon RDS/Aurora Performance Insights already provides a
load profiling feature based on the wait events collected from
pg_stat_activity, which it samples once a second.  An intuitive 5-minute
introductory video is here:

Using Amazon RDS Performance Insights
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html

Although I cannot see the graphics as I'm almost blind, the explanation  sounds 
like it mimics Oracle Enterprise Manager.  Cool.
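
(For those unfamiliar with the approach, the kind of once-a-second sampling
being discussed can be approximated by hand with a query like the one below,
run repeatedly:)

SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
 WHERE state = 'active' AND wait_event IS NOT NULL
 GROUP BY wait_event_type, wait_event
 ORDER BY count(*) DESC;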

Aren't you thinking of incorporating your work into PostgreSQL as a contrib
module like pg_stat_statements?


Regards
Takayuki Tsunakawa






Re: Using logical replication with older version subscribers

2019-01-08 Thread Masahiko Sawada
On Tue, Jan 8, 2019 at 1:12 AM Magnus Hagander  wrote:
>
> On Mon, Jan 7, 2019 at 3:37 PM Masahiko Sawada  wrote:
>>
>> On Mon, Jan 7, 2019 at 6:54 PM Magnus Hagander  wrote:
>> >
>> > On Mon, Jan 7, 2019 at 9:01 AM Masahiko Sawada  
>> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> Logical replication enables us to replicate data changes to different
>> >> major version PostgreSQL as the doc says[1]. However the current
>> >> logical replication can work fine only if replicating to a newer major
>> >> version PostgreSQL such as from 10 to 11. Regarding using logical
>> >> replication with older major version, say sending from 11 to 10, it
>> >> will stop when a subscriber receives a truncate change because it's
>> >> not supported at PostgreSQL 10.  I think there are use cases where
>> >> using logical replication with a subscriber of an older version
>> >> PostgreSQL but I'm not sure we should support it.
>> >>
>> >> Of course in such case we can set the publication with publish =
>> >> 'insert, update, delete' to not send truncate changes but it requres
>> >> users to recognize the feature differences between major vesions and
>> >> in the future it will get more complex. So I think it would be better
>> >> to be configured autometically by PostgreSQL.
>> >>
>> >> To fix it we can make subscribers send its supporting message types to
>> >> the publisher at a startup time so that the publisher doesn't send
>> >> unsupported message types on the subscriber. Or as an another idea, we
>> >> can make subscribers ignore unsupported logical replication message
>> >> types instead of raising an error. Feedback is very welcome.
>> >>
>> >> [1] https://www.postgresql.org/docs/devel/logical-replication.html
>> >
>> >
>> > How would that work in practice?
>> >
>> > If an 11 server is sent a message saying "client does not support 
>> > truncate", and immediately generates an error, then you can no longer 
>> > replicate even if you turn off truncate. And if it delays it until the 
>> > actual replication of the item, then you just get the error on the primary 
>> > ìnstead of the standby?
>> >
>> > I assume you are not suggesting a publication with truncation enabled 
>> > should just ignore replicating truncation if the downstream server doesn't 
>> > support it? Because if that's the suggestion, then a strong -1 from me on 
>> > that.
>> >
>>
>> I'm thinking that the we can make the pgoutput plugin recognize that
>> the downstream server doesn't support it and not send it. For example,
>> even if we create a publication with publish = 'truncate' we send
>> nothing due to checking supported message types by pgoutput plugin if
>> the downstream server is PostgreSQL server and its version is older
>> than 10.
>
>
> That's the idea I definitely say a strong -1 to.
>
> Ignoring the truncate message isn't going to make it work. It's just going to 
> mean that the downstream data is incorrect vs what the publisher thought. The 
> correct solution here is to not publish the truncate, which we already have. 
> I can see the point in changing it so the error message becomes more obvious 
> (already when the subscriber connects, and not a random time later when the 
> first truncate replicates), but *silently* ignoring it seems like a terrible 
> choice.

Understood, that makes more sense. And raising the error at connection
time seems good to me. Thank you!
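
(For the archives, the existing knob being referred to -- simply not publishing
truncates; the publication name here is just an example:)

CREATE PUBLICATION mypub FOR ALL TABLES
  WITH (publish = 'insert, update, delete');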
Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: commitfest: When are you assigned patches to review?

2019-01-08 Thread Michael Paquier
On Tue, Jan 08, 2019 at 10:14:10AM +0100, Fabien COELHO wrote:
>> Also, how is the level at which I should review it
>> determined?
> 
> Patches as complex as the one you submitted?

The usual expectation is to review one patch of equal difficulty for
each patch submitted.  The way to measure a patch's difficulty is not
based on hard facts but mostly on how complicated the patch feels.
When it comes to reviews, the more you can look at the better, of
course; still, doing a proper review takes time, and that is often
surprising even for so-called simple patches.

> Based on your area of expertise?

Taking new challenges on a regular basis is not bad either :)
--
Michael




Re: A few new options for vacuumdb

2019-01-08 Thread Michael Paquier
On Tue, Jan 08, 2019 at 06:46:11PM +, Bossart, Nathan wrote:
> This was done in order to maintain the current behavior that
> appendQualifiedRelation() gives us.  I found that skipping the
> search_path handling here forced us to specify the schema in the
> argument for --table in most cases.  At the very least, I could add a
> comment here to highlight the importance of fully qualifying
> everything in the catalog query.  What do you think?

A comment sounds like a good thing.  And we really shouldn't hijack
search_path even for one query...

> Looks good to me, except for one small thing in the documentation:
> 
> +   
> +Disable all page-skipping behavior during processing based on
> +the visibility map, similarly to the option
> +DISABLE_PAGE_SKIPPING for 
> VACUUM.
> +   
> 
> I think the "similarly to the option" part is slightly misleading.
> It's not just similar, it _is_ using that option in the generated
> commands.  Perhaps we could point to the VACUUM documentation for more
> information about this one.

Sure.  If you have any suggestions, please feel free.  Adding a link
to VACUUM documentation sounds good to me, as long as we avoid
duplicating the description related to DISABLE_PAGE_SKIPPING on the
VACUUM page.

> Good point.  I think allowing multiple different relation size options
> here would be confusing, too (e.g. --min-relation-size versus
> --min-total-relation-size).  IMO pg_total_relation_size() is the way
> to go here, as we'll most likely need to process the indexes and TOAST
> tables, too.  If/when there is a DISABLE_INDEX_CLEANUP option for
> VACUUM, we'd then want to use pg_table_size() when --min-relation-size
> and --disable-index-cleanup are used together in vacuumdb.
> Thoughts?

Yes, using pg_total_relation_size() looks like the best option to me
here as well, still this does not make me 100% comfortable from the
user perspective.

> I've realized that I forgot to add links to the XID/MXID wraparound
> documentation like you suggested a while back.  I'll make sure that
> gets included in the next patch set, too.

Nice, thanks.
--
Michael




Re: New vacuum option to do only freezing

2019-01-08 Thread Masahiko Sawada
On Fri, Nov 9, 2018 at 2:56 AM Robert Haas  wrote:
>
> On Thu, Nov 8, 2018 at 4:36 AM Masahiko Sawada  wrote:
> > We already have disable_page_skipping option, not (page_skipping
> > false). So ISMT disable_index_cleanup would be more natural.
>
> Sure.
>
> > Also,
> > since what to do with this option is not only skipping vacuum indexes
> > but also skipping removeing dead tuples on heap, I think that the
> > option should have a more understandable name for users indicating
> > that both it removes dead tuples less than the normal vacuum and it's
> > aimed to freeze tuple more faster. Of course we document them, though.
>
> Well, I actually don't think that you should control two behaviors
> with the same option.  If you want to vacuum and skip index cleanup,
> you should say VACUUM (disable_index_cleanup).  If you want to vacuum,
> disable index cleanup, and skip aggressively, you should say VACUUM
> (freeze, disable_index_cleanup).  Both behaviors seem useful.
>

Attached the updated version patch incorporated all comments I got.
The patch includes the following changes.

* Changed the option name to DISABLE_INDEX_CLEANUP. I agreed with
Robert and Nathan on having an option separate from FREEZE.

* Instead of freezing xmax, I've changed the behaviour of the new
option (DISABLE_INDEX_CLEANUP) so that it marks dead tuples as dead
instead of as unused, and skips both index vacuum and index cleanup.
That is, we remove the storage of dead tuples but leave the dead itemids
for index lookups. These are removed by the next vacuum execution that
runs with index cleanup enabled. Currently HOT-pruning doesn't set the
root of the chain as unused even if the whole chain is dead. Since
marking tuples as dead removes their storage, freezing xmax is no longer
required.
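
With the patch applied, the intended usage looks like this (option name as of
this version of the patch, and the table name is just an example):

VACUUM (FREEZE, DISABLE_INDEX_CLEANUP) mytable;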

The second change might conflict with the retail index deletion
patch[1], which makes HOT-pruning *mark* tuples as dead (i.e., using
ItemIdMarkDead() instead). I think that this patch would not block
that patch, but I've not yet thought deeply about how the two patches
combine.

[1] 
https://www.postgresql.org/message-id/flat/425db134-8bba-005c-b59d-56e50de3b41e%40postgrespro.ru

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


v3-0001-Add-DISABLE_INDEX_CLEANUP-option-to-VACUUM-comman.patch
Description: Binary data


Re: BUG #15446: Crash on ALTER TABLE

2019-01-08 Thread Andrew Dunstan


On 1/8/19 4:48 PM, Dean Rasheed wrote:
> On Tue, 8 Jan 2019 at 19:34, Andrew Dunstan
>  wrote:
>> Here's a patch that I think cures the problem.
>>
> Hmm, that doesn't quite work because the table might not actually be
> rewritten as a result of the type change. For example:
>
> DROP TABLE IF EXISTS foo;
> CREATE TABLE foo (a int);
> INSERT INTO foo VALUES (1);
> ALTER TABLE foo ADD COLUMN b varchar(10) DEFAULT 'xxx';
> ALTER TABLE foo ALTER COLUMN b SET DEFAULT 'yyy';
> INSERT INTO foo VALUES (2);
> SELECT * FROM foo;
>  a |  b
> ---+-
>  1 | xxx
>  2 | yyy
> (2 rows)
>
> ALTER TABLE foo ALTER COLUMN b TYPE varchar(20) USING b::varchar(20);
> SELECT * FROM foo;
>  a |  b
> ---+-
>  1 |
>  2 | yyy
> (2 rows)
>


Ouch, OK, looks like we need something more complex.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0

2019-01-08 Thread Amit Langote
Fujita-san,

(sorry about the repeated email, but my previous attempt failed due to
trying to send to the -hackers and -performance lists at the same time, so
trying again after removing -performance)

On 2019/01/08 20:07, Etsuro Fujita wrote:
> (2018/12/07 20:14), Ashutosh Bapat wrote:
>> On Fri, Dec 7, 2018 at 11:13 AM Ashutosh Bapat
>> mailto:ashutosh.bapat@gmail.com>> wrote:
> 
>>     Robert, Ashutosh, any comments on this?  I'm unfamiliar with the
>>     partitionwise join code.
> 
>>     As the comment says it has to do with the equivalence classes being
>>     used during merge append. EC's are used to create pathkeys used for
>>     sorting. Creating a sort node which has column on the nullable side
>>     of an OUTER join will fail if it doesn't find corresponding
>>     equivalence class. You may not notice this if both the partitions
>>     being joined are pruned for some reason. Amit's idea to make
>>     partition-wise join code do this may work, but will add a similar
>>     overhead esp. in N-way partition-wise join once those equivalence
>>     classes are added.
> 
>> I looked at the patch. The problem there is that for a given relation,
>> we will add child ec member multiple times, as many times as the number
>> of joins it participates in. We need to avoid that to keep ec_member
>> list length in check.
> 
> Amit-san, are you still working on this, perhaps as part of the
> speeding-up-planning-with-partitions patch [1]?

I had tried to continue working on it after PGConf.ASIA last month, but
got distracted by something else.

So, while the patch at [1] can take care of this issue as I also mentioned
upthread, I was trying to come up with a solution that can be back-patched
to PG 11.  The patch I posted above is one such solution and as Ashutosh
points out it's perhaps not the best, because it can result in potentially
creating many copies of the same child EC member if we do it in joinrel.c,
as the patch proposes.  I will try to respond to the concerns he raised in
the next week if possible.

Thanks,
Amit







Re: speeding up planning with partitions

2019-01-08 Thread David Rowley
On Tue, 8 Jan 2019 at 19:30, Amit Langote  wrote:
> Rebased due to changed copyright year in prepunion.c.  Also realized that
> the new files added by patch 0004 still had 2018 in them.

I've made a pass over 0001. There's probably enough for you to look at
while I look at 0002 and the others.

0001

1. In your doc changes, just below a paragraph that you removed,
there's a paragraph starting "Both of these behaviors are likely to be
changed in a future release". This needs to be fixed since you've
removed the first of the two reasons.

2. This part should be left alone.

-technique similar to partition pruning.  While it is primarily used
-for partitioning implemented using the legacy inheritance method, it can be
-used for other purposes, including with declarative partitioning.
+technique similar to partition pruning.  It is primarily used
+for partitioning implemented using the legacy inheritance method.

Looking at set_inherit_target_rel_sizes(), constraint exclusion still
is applied to partitions, it's just applied after pruning, according
to:

if (did_pruning && !bms_is_member(appinfo->child_relid, live_children))
{
/* This partition was pruned; skip it. */
set_dummy_rel_pathlist(childrel);
continue;
}

if (relation_excluded_by_constraints(root, childrel, childRTE))

3. add_child_rel_equivalences().  You're replacing parent EMs with
their child equivalent, but only when the eclass has no volatile
functions. Is this really what you want? I think this would misbehave
if we ever allowed: UPDATE ... SET .. ORDER BY, of which there's a
legitimate use case of wanting to reduce the chances of deadlocks
caused by non-deterministic UPDATE order.  Or if you think skipping
the volatile eclasses is fine today, then I think the comment you've
added to add_child_rel_equivalences should mention that.

4. Do you think it's okay that add_child_rel_equivalences() does not
update the ec_relids when removing the member?

UPDATE: I see you're likely leaving this alone since you're only doing
a shallow copy of the eclasses in
adjust_inherited_target_child_root(). It seems like a pretty bad idea
to do a shallow copy there.

5. What's CE?

+ /* CE failed, so finish copying/modifying join quals. */

6. Typo:

+ * ass dummy.  We must do this in this phase so that the rel's

ass -> as

7. There's no accumulation going on here:

+ /*
+ * Accumulate size information from each live child.
+ */
+ Assert(childrel->rows > 0);

8. Any point in this? We're about to loop again anyway...

+ /*
+ * If child is dummy, ignore it.
+ */
+ if (IS_DUMMY_REL(childrel))
+ continue;
+ }

9. It's a bit confusing to mention SELECT in this comment. The Assert
ensures it's an UPDATE or DELETE.

+ /*
+ * For UPDATE/DELETE queries, the top parent can only ever be a table.
+ * As a contrast, it could be a UNION ALL subquery in the case of SELECT.
+ */
+ Assert(root->parse->commandType == CMD_UPDATE ||
+root->parse->commandType == CMD_DELETE);

10. I'd say the subroot assignment can go after the IS_DUMMY_REL
check. Keeping that loop as tight as possible for pruned rels seems
like a good idea.

+ subroot = root->inh_target_child_roots[appinfo->child_relid];
+ Assert(subroot->parse->resultRelation > 0);
+ childrel = find_base_rel(root, appinfo->child_relid);
+
+ /* Ignore excluded/pruned children. */
+ if (IS_DUMMY_REL(childrel))
+ continue;

11. I don't think you should reuse the childrel variable here:

+ childrel->reloptkind = RELOPT_BASEREL;
+
+ Assert(subroot->join_rel_list == NIL);
+ Assert(subroot->join_rel_hash == NULL);
+
+ /* Perform join planning and save the resulting RelOptInfo. */
+ childrel = make_rel_from_joinlist(subroot, translated_joinlist);
+
+ /*
+ * Remember this child target rel.  inheritance_planner will perform
+ * the remaining steps of planning for each child relation separately.
+ * Specifically, it will call grouping_planner on every
+ * RelOptInfo contained in the inh_target_child_rels list, each of
+ * which represents the source of tuples to be modified for a given
+ * target child rel.
+ */
+ root->inh_target_child_joinrels =
+ lappend(root->inh_target_child_joinrels, childrel);

12. The following comment makes less sense now that you've modified
the previous paragraph:

+ * Fortunately, the UPDATE/DELETE target can never be the nullable side of an
+ * outer join, so it's OK to generate the plan this way.

This text used to refer to:

but target inheritance has to be expanded at
 * the top.  The reason is that for UPDATE, each target relation needs a
 * different targetlist matching its own column set.  Fortunately,
 * the UPDATE/DELETE target can never be the nullable side of an outer join,
 * so it's OK to generate the plan this way.

you no longer describe plan as being expanded from the top rather than
at the bottom, which IMO is what "this way" refers to.

13. "tree is" -> "tree are" (references is plural)

+ * references in the join tree to the original target relation that's the
+ * root pare

Re: proposal: variadic argument support for least, greatest function

2019-01-08 Thread Tom Lane
Pavel Stehule  writes:
> We cannot to write
> SELECT least(VARIADIC ARRAY[1,2,3]);
> Attached patch add this possibility to least, greatest functions.

TBH, I don't find that natural at all.  If I were looking for the
functionality "smallest element of an array", I think I'd expect to find
that exposed as "array_smallest(anyarray) returns anyelement", not as
some weird syntax option for LEAST.

It also seems rather inconsistent that this behaves so differently
from, eg,

=# select least(array[1,2], array[3,4]);
 least 
---
 {1,2}
(1 row)

Normally, if you have a variadic function, it doesn't also take arrays,
so that there's less possibility for confusion.

The implementation seems mighty ugly too, in that it has to treat this
as entirely disjoint from MinMaxExpr's normal argument interpretation.
But that seems like a symptom of the fact that the definition is
disjointed itself.

In short, I'd rather see this done with a couple of array functions,
independently of MinMaxExpr.

regards, tom lane



Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2019-01-08 Thread Joerg Sonnenberger
On Tue, Jan 08, 2019 at 05:53:25PM -0500, Tom Lane wrote:
> John Naylor  writes:
> > -As for the graph algorithm, I'd have to play with it to understand
> > how it works.
> 
> I improved the comment about how come the hash table entry assignment
> works.  One thing I'm not clear about myself is
> 
>   # A cycle-free graph is either empty or has some vertex of degree 1.
> 
> That sounds like a standard graph theory result, but I'm not familiar
> with a proof for it.

Let's assume the graph is acyclic and non-empty, and that all vertices
have degree > 1. Pick any vertex and start constructing a path from it.
It is connected to at least one other vertex, so follow that edge. The
next vertex must again be connected to at least one more vertex, and the
path can't go back to the starting point (since that would be a cycle).
Each new vertex still has another connection, and it can't lead back to
any already visited vertex. Continuing like this consumes a fresh vertex
at every step, which is impossible in a finite graph. So some vertex
must have degree <= 1 after all.

Joerg
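
To see the lemma in action, here is a small standalone Perl sketch (an
editorial illustration, not taken from any patch in this thread):
repeatedly delete vertices of degree <= 1.  In an acyclic graph such a
vertex always exists, so the peeling eventually removes every vertex;
any cycle keeps all of its vertices at degree >= 2 and therefore
survives.

use strict;
use warnings;

# Return true if the undirected graph is acyclic.  $edges is a reference
# to a list of [$a, $b] vertex-number pairs.
sub graph_is_acyclic
{
	my ($nvertices, $edges) = @_;
	my @degree = (0) x $nvertices;
	my @alive  = (1) x scalar(@$edges);
	my $left   = $nvertices;

	foreach my $e (@$edges)
	{
		$degree[ $e->[0] ]++;
		$degree[ $e->[1] ]++;
	}

	my $progress = 1;
	while ($left > 0 && $progress)
	{
		$progress = 0;
		foreach my $v (0 .. $nvertices - 1)
		{
			next if $degree[$v] < 0 || $degree[$v] > 1;

			# Remove $v together with its (at most one) remaining edge.
			foreach my $e (0 .. $#{$edges})
			{
				next unless $alive[$e];
				my ($a, $b) = @{ $edges->[$e] };
				if ($a == $v || $b == $v)
				{
					$alive[$e] = 0;
					$degree[$a]--;
					$degree[$b]--;
				}
			}
			$degree[$v] = -1;    # mark as removed
			$left--;
			$progress = 1;
		}
	}
	return $left == 0;
}

# A path 0-1-2 is acyclic; a triangle is not.
print graph_is_acyclic(3, [ [ 0, 1 ], [ 1, 2 ] ]) ? "acyclic\n" : "cyclic\n";
print graph_is_acyclic(3, [ [ 0, 1 ], [ 1, 2 ], [ 2, 0 ] ]) ? "acyclic\n" : "cyclic\n";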



Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2019-01-08 Thread Tom Lane
John Naylor  writes:
> -As for the graph algorithm, I'd have to play with it to understand
> how it works.

I improved the comment about how come the hash table entry assignment
works.  One thing I'm not clear about myself is

# A cycle-free graph is either empty or has some vertex of degree 1.

That sounds like a standard graph theory result, but I'm not familiar
with a proof for it.

regards, tom lane

#--
#
# PerfectHash.pm
#Perl module that constructs minimal perfect hash functions
#
# This code constructs a minimal perfect hash function for the given
# set of keys, using an algorithm described in
# "An optimal algorithm for generating minimal perfect hash functions"
# by Czech, Havas and Majewski in Information Processing Letters,
# 43(5):256-264, October 1992.
# This implementation is loosely based on NetBSD's "nbperf",
# which was written by Joerg Sonnenberger.
#
# The resulting hash function is perfect in the sense that if the presented
# key is one of the original set, it will return the key's index in the set
# (in range 0..N-1).  However, the caller must still verify the match,
# as false positives are possible.  Also, the hash function may return
# values that are out of range (negative, or >= N).  This indicates that
# the presented key is definitely not in the set.
#
#
# Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
# Portions Copyright (c) 1994, Regents of the University of California
#
# src/tools/PerfectHash.pm
#
#--

package PerfectHash;

use strict;
use warnings;


# At runtime, we'll compute two simple hash functions of the input key,
# and use them to index into a mapping table.  The hash functions are just
# multiply-and-add in uint32 arithmetic, with different multipliers but
# the same initial seed.  All the complexity in this module is concerned
# with selecting hash parameters that will work and building the mapping
# table.

# We support making case-insensitive hash functions, though this only
# works for a strict-ASCII interpretation of case insensitivity,
# ie, A-Z maps onto a-z and nothing else.
my $case_insensitive = 0;


#
# Construct a C function implementing a perfect hash for the given keys.
# The C function definition is returned as a string.
#
# The keys can be any set of Perl strings; it is caller's responsibility
# that there not be any duplicates.  (Note that the "strings" can be
# binary data, but endianness is the caller's problem.)
#
# The name to use for the function is caller-specified, but its signature
# is always "int f(const void *key, size_t keylen)".  The caller may
# prepend "static " to the result string if it wants a static function.
#
# If $ci is true, the function is case-insensitive, for the limited idea
# of case-insensitivity explained above.
#
sub generate_hash_function
{
my ($keys_ref, $funcname, $ci) = @_;

# It's not worth passing this around as a parameter; just use a global.
$case_insensitive = $ci;

# Try different hash function parameters until we find a set that works
# for these keys.  In principle we might need to change multipliers,
# but these two multipliers are chosen to be primes that are cheap to
# calculate via shift-and-add, so don't change them without care.
my $hash_mult1 = 31;
my $hash_mult2 = 2053;

# We just try successive hash seed values until we find one that works.
# (Commonly, random seeds are tried, but we want reproducible results
# from this program so we don't do that.)
my $hash_seed;
my @subresult;
for ($hash_seed = 0; $hash_seed < 1000; $hash_seed++)
{
@subresult =
  _construct_hash_table($keys_ref, $hash_mult1, $hash_mult2,
$hash_seed);
last if @subresult;
}

# Choke if we didn't succeed in a reasonable number of tries.
die "failed to generate perfect hash" if !@subresult;

# Extract info from the function result array.
my $elemtype = $subresult[0];
my @hashtab  = @{ $subresult[1] };
my $nhash= scalar(@hashtab);

# OK, construct the hash function definition including the hash table.
my $f = '';
$f .= sprintf "int\n";
$f .= sprintf "%s(const void *key, size_t keylen)\n{\n", $funcname;
$f .= sprintf "\tstatic const %s h[%d] = {\n", $elemtype, $nhash;
for (my $i = 0; $i < $nhash; $i++)
{
$f .= sprintf "%s%6d,%s",
  ($i % 8 == 0 ? "\t\t" : " "),
  $hashtab[$i],
  ($i % 8 == 7 ? "\n" : "");
}
$f .= sprintf "\n" if ($nhash % 8 != 0);
$f .= sprintf "\t};\n\n";
$f .= sprintf "\tconst unsigned char *k = key;\n";
$f .=

Re: Displaying and dumping of table access methods

2019-01-08 Thread Michael Paquier
On Tue, Jan 08, 2019 at 09:29:49AM -0800, Andres Freund wrote:
> On 2019-01-08 13:02:00 +0900, Michael Paquier wrote:
>> The specific-heap tests could be included as an extra module in
>> src/test/modules easily, so removing from the main tests what is not
>> completely transparent may make sense.
> 
> Why does it need to be an extra module, rather than just extra regression
> files / sections of files that just explicitly specify the AM?  Seems a
> lot easier and faster.

The point would be to keep individual Makefiles simpler to maintain,
and separating things can make it simpler.  I cannot say for sure
without seeing how things would change though based on what you are
suggesting, and that may finish by being a matter of taste.

> In the core tests there's a fair number of things that can be cured by
> adding an ORDER BY to the tests, which seems sensible. We've added a lot
> of those over the years anyway.

When working on Postgres-XC I cursed about the need to add many ORDER
BY queries to ensure the ordering of tuples fetched from different
nodes, and we actually had an option to enforce the default
distribution used by tables, so that would be really nice to close the
gap.

> There's additionally a number of plans
> that change, which currently is handled by alternatives output files,
> but I think we should move to reduce those differences, that's probably
> not too hard.

Okay, that's not surprising.  I guess it depends on how many alternate
files are needed and whether it is possible to split things so that we
avoid unnecessary output in alternate files.  A lot of what you are
proposing on this thread makes sense, in my experience.
--
Michael


signature.asc
Description: PGP signature


Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2019-01-08 Thread Tom Lane
John Naylor  writes:
> Just a couple comments about the module:

> -If you qualify the function's module name as you did
> (PerfectHash::generate_hash_function), you don't have to export the
> function into the callers namespace, so you can skip the @EXPORT_OK
> setting. Most of our modules don't export.

OK by me.  I was more concerned about hiding the stuff that isn't
supposed to be exported.

> -There is a bit of a cognitive clash between $case_sensitive in
> gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each
> make sense in their own file, but might it be worth using one or the
> other?

Yeah, dunno.  It seems to make sense for the command-line-level default of
gen_keywordlist.pl to be "case insensitive", since most users want that.
But that surely shouldn't be the default in PerfectHash.pm, and I'm not
very sure how to reconcile the discrepancy.

> In the committed keyword patch, I noticed that in common/keywords.c,
> the array length is defined with
> ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS]
> but other keyword arrays just have ...[]. Is there a reason for the 
> difference?

The length macro was readily available there so I used it.  AFAIR
that wasn't true elsewhere, though I might've missed something.
It's pretty much just belt-and-suspenders coding anyway, since all
those arrays are machine generated ...

regards, tom lane



Re: Statement-level Triggers For Uniqueness Checks

2019-01-08 Thread Corey Huinker
On Fri, Jan 4, 2019 at 7:49 AM Peter Eisentraut
 wrote:
>
> On 25/12/2018 00:56, Corey Huinker wrote:
> > The regression diff (attached) seems to imply that the triggers simply
> > are not firing, though.
>
> The reason for this was explained by Dean.  If you take out the check
> that he mentioned, then your trigger fires but crashes.  In your changed
> unique_key_recheck(), "slot" is not initialized before use (or ever).

Thanks. I'll be revisiting this shortly. Dean's information made me
think the potential for a gain is smaller than initially imagined.



Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2019-01-08 Thread John Naylor
On Tue, Jan 8, 2019 at 3:04 PM Tom Lane  wrote:
> > I'll take a crack at separating into a module.  I'll wait a bit in
> > case there are any stylistic suggestions on the patch as it stands.
>
> I had a go at that myself.  I'm sure there's plenty to criticize in
> the result, but at least it passes make check-world ;-)

Just a couple comments about the module:

-If you qualify the function's module name as you did
(PerfectHash::generate_hash_function), you don't have to export the
function into the caller's namespace, so you can skip the @EXPORT_OK
setting. Most of our modules don't export. (A small usage sketch
follows after these points.)

-There is a bit of a cognitive clash between $case_sensitive in
gen_keywordlist.pl and $case_insensitive in PerfectHash.pm. They each
make sense in their own file, but might it be worth using one or the
other?

-As for the graph algorithm, I'd have to play with it to understand
how it works.
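
As an illustration of the first point, a caller could use the module
with a fully qualified call roughly as below.  This is only a sketch:
the keyword sample and function name are invented, and it assumes
PerfectHash.pm is reachable via @INC (e.g. run with -I src/tools).

use strict;
use warnings;
use PerfectHash;

# Invented keyword sample, purely for illustration.
my @keys = qw(abort add all alter and);

# The third argument asks for the case-insensitive variant described in
# the module's header comment; the result is C source text.
print PerfectHash::generate_hash_function(\@keys, "demo_keyword_hash", 1);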


In the committed keyword patch, I noticed that in common/keywords.c,
the array length is defined with

ScanKeywordCategories[SCANKEYWORDS_NUM_KEYWORDS]

but other keyword arrays just have ...[]. Is there a reason for the difference?



Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2019-01-08 Thread Tom Lane
Andres Freund  writes:
> On 2019-01-08 13:41:16 -0500, John Naylor wrote:
>> Do you mean the fmgr table?

> Not the entire fmgr table, but just the builtin oid index, generated by
> the following section:
> ...
> The generated fmgr_builtin_oid_index is pretty sparse, and a more dense
> hashtable might e.g. more efficient from a cache perspective.

I experimented with this, but TBH I think it's a dead loss.  We currently
have 2768 built-in functions, so the perfect hash table requires 5537
int16 entries, which is not *that* much less than the 10000 entries
that are in fmgr_builtin_oid_index presently.  When you consider the
extra cycles needed to do the hashing, and the fact that you have to
touch (usually) two cache lines not one in the lookup table, it's hard
to see how this could net out as a win performance-wise.

Also, I fail to understand why fmgr_builtin_oid_index has 10000 entries
anyway.  We could easily have fmgrtab.c expose the last actually assigned
builtin function OID (presently 6121) and make the index array only
that big, which just about eliminates the space advantage completely.

BTW, I found out while trying this that Joerg's fear of the hash
multipliers being too simplistic is valid: the perfect hash generator
failed until I changed them.  I picked a larger value that should be
just as easy to use for shift-and-add purposes.

regards, tom lane

diff --git a/src/backend/utils/Gen_fmgrtab.pl b/src/backend/utils/Gen_fmgrtab.pl
index cafe408..9fceb60 100644
*** a/src/backend/utils/Gen_fmgrtab.pl
--- b/src/backend/utils/Gen_fmgrtab.pl
*** use Catalog;
*** 18,23 
--- 18,24 
  
  use strict;
  use warnings;
+ use PerfectHash;
  
  # Collect arguments
  my @input_files;
*** foreach my $s (sort { $a->{oid} <=> $b->
*** 219,237 
  	print $pfh "extern Datum $s->{prosrc}(PG_FUNCTION_ARGS);\n";
  }
  
! # Create the fmgr_builtins table, collect data for fmgr_builtin_oid_index
  print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n";
  my %bmap;
  $bmap{'t'} = 'true';
  $bmap{'f'} = 'false';
! my @fmgr_builtin_oid_index;
  my $fmgr_count = 0;
  foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr)
  {
  	print $tfh
  	  "  { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }";
  
! 	$fmgr_builtin_oid_index[ $s->{oid} ] = $fmgr_count++;
  
  	if ($fmgr_count <= $#fmgr)
  	{
--- 220,244 
  	print $pfh "extern Datum $s->{prosrc}(PG_FUNCTION_ARGS);\n";
  }
  
! # Create the fmgr_builtins table, collect data for hash function
  print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n";
  my %bmap;
  $bmap{'t'} = 'true';
  $bmap{'f'} = 'false';
! my @fmgr_builtin_oids;
! my $prev_oid = 0;
  my $fmgr_count = 0;
  foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr)
  {
  	print $tfh
  	  "  { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, $bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }";
  
! 	die "duplicate OIDs" if $s->{oid} <= $prev_oid;
! 	$prev_oid = $s->{oid};
! 
! 	push @fmgr_builtin_oids, pack("n", $s->{oid});
! 
! 	$fmgr_count++;
  
  	if ($fmgr_count <= $#fmgr)
  	{
*** print $tfh "};\n";
*** 246,283 
  
  print $tfh qq|
  const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin));
- |;
  
- 
- # Create fmgr_builtins_oid_index table.
- #
- # Note that the array has to be filled up to FirstGenbkiObjectId,
- # as we can't rely on zero initialization as 0 is a valid mapping.
- print $tfh qq|
- const uint16 fmgr_builtin_oid_index[FirstGenbkiObjectId] = {
  |;
  
- for (my $i = 0; $i < $FirstGenbkiObjectId; $i++)
- {
- 	my $oid = $fmgr_builtin_oid_index[$i];
  
! 	# fmgr_builtin_oid_index is sparse, map nonexistant functions to
! 	# InvalidOidBuiltinMapping
! 	if (not defined $oid)
! 	{
! 		$oid = 'InvalidOidBuiltinMapping';
! 	}
  
! 	if ($i + 1 == $FirstGenbkiObjectId)
! 	{
! 		print $tfh "  $oid\n";
! 	}
! 	else
! 	{
! 		print $tfh "  $oid,\n";
! 	}
! }
! print $tfh "};\n";
  
  
  # And add the file footers.
--- 253,267 
  
  print $tfh qq|
  const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin));
  
  |;
  
  
! # Create perfect hash function for searching fmgr_builtin by OID.
  
! print $tfh PerfectHash::generate_hash_function(\@fmgr_builtin_oids,
! 	   "fmgr_builtin_oid_hash",
! 	   0);
  
  
  # And add the file footers.
diff --git a/src/backend/utils/fmgr/fmgr.c b/src/backend/utils/fmgr/fmgr.c
index b41649f..ad93032 100644
*** a/src/backend/utils/fmgr/fmgr.c
--- b/src/backend/utils/fmgr/fmgr.c
*** extern Datum fmgr_security_definer(PG_FU
*** 72,92 
  static const FmgrBuiltin *
  fmgr_isbuiltin(Oid id)
  {
! 	uint16		index;
  
  	/* fast lookup only possible if original oid still assigned */
  	if (id >= FirstGenbkiObjectId)
  		return NULL;
  
  	/*
! 	 * Lookup function data. If there's a miss in that range it's likely a
! 	 * nonexistant function, returning NULL here will trigger an ERROR later.

Question about autovacuum function autovac_balance_cost()

2019-01-08 Thread CNG L
I am new to autovacuum. After reading its code, I am still confused
about what autovac_balance_cost() does and how the cost logic works to
make the autovacuum workers consume I/O equally. Can anyone shed some
light on it?

Thanks


Re: WIP: Avoid creation of the free space map for small tables

2019-01-08 Thread John Naylor
On Sun, Dec 30, 2018 at 10:50 PM Amit Kapila  wrote:
>
> On Sun, Dec 30, 2018 at 3:49 AM John Naylor  wrote:
> > > I also tried to insert more
> > > records till 8 pages and same regression is observed! So I guess even
> > > HEAP_FSM_CREATION_THRESHOLD = 4 is not perfect!
> >
> > That's curious, because once the table exceeds the threshold, it would
> > be allowed to update the FSM, and in the process write 3 pages that it
> > didn't have to in the 4 page test. The master branch has the FSM
> > already, so I would expect the 8 page case to regress more.
> >
>
> It is not clear to me why you think there should be regression at 8
> pages when HEAP_FSM_CREATION_THRESHOLD is 4.  Basically, once FSM
> starts getting updated, we should be same as HEAD as it won't take any
> special path?

In this particular test, the FSM is already created ahead of time for
the master branch, so we can compare accessing FSM versus checking
every page. My reasoning is that passing the threshold would take some
time to create 3 FSM pages with the patch, leading to a larger
regression. It seems we don't observe this, however.

On Sun, Dec 30, 2018 at 10:59 PM Mithun Cy  wrote:
> I have some minor comments for pg_upgrade patch
> 1. Now we call stat main fork file in transfer_relfile()
> +sret = stat(old_file, &statbuf);
>
> +/* Save the size of the first segment of the main fork. */
> +if (type_suffix[0] == '\0' && segno == 0)
> +first_seg_size = statbuf.st_size;
>
> But we do not handle the case if stat has returned any error!

How about this:

/* Did file open fail? */
if (stat(old_file, &statbuf) != 0)
{
/* Extent, fsm, or vm does not exist?  That's OK, just return */
if (errno == ENOENT &&
(type_suffix[0] != '\0' || segno != 0))
return first_seg_size;
else
pg_fatal("error while checking for file existence \"%s.%s\"
(\"%s\" to \"%s\"): %s\n",
 map->nspname, map->relname, old_file, new_file,
 strerror(errno));
}

/* Save the size of the first segment of the main fork. */
else if (type_suffix[0] == '\0' && segno == 0)
first_seg_size = statbuf.st_size;

/* If extent, fsm, or vm is empty, just return */
else if (statbuf.st_size == 0)
return first_seg_size;

> 2. src/bin/pg_upgrade/pg_upgrade.h
>
>  char   *relname;
> +
> +charrelkind;/* relation relkind -- see pg_class.h */
>
> I think we can remove the added empty line.

In the full context:

-/* the rest are used only for logging and error reporting */
+
+/* These are used only for logging and error reporting. */
 char   *nspname;/* namespaces */
 char   *relname;
+
+charrelkind;/* relation relkind -- see pg_class.h */

Relkind is not used for logging or error reporting, so the space sets
it apart from the previous members. I could instead put relkind before
those other two...

-John Naylor



Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2019-01-08 Thread Tom Lane
John Naylor  writes:
> On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan
>  wrote:
>> If he doesn't I will.

> I'll take a crack at separating into a module.  I'll wait a bit in
> case there are any stylistic suggestions on the patch as it stands.

I had a go at that myself.  I'm sure there's plenty to criticize in
the result, but at least it passes make check-world ;-)

I resolved the worry I had last night about the range of table values
by putting in logic to check the range and choose a suitable table
element type.  There are a couple of existing calls where we manage
to fit the hashtable elements into int8 that way; of course, by
definition that doesn't save a whole lot of space since such tables
couldn't have many elements, but it seems cleaner anyway.

regards, tom lane

diff --git a/src/common/Makefile b/src/common/Makefile
index 317b071..d0c2b97 100644
*** a/src/common/Makefile
--- b/src/common/Makefile
*** OBJS_FRONTEND = $(OBJS_COMMON) fe_memuti
*** 63,68 
--- 63,73 
  OBJS_SHLIB = $(OBJS_FRONTEND:%.o=%_shlib.o)
  OBJS_SRV = $(OBJS_COMMON:%.o=%_srv.o)
  
+ # where to find gen_keywordlist.pl and subsidiary files
+ TOOLSDIR = $(top_srcdir)/src/tools
+ GEN_KEYWORDLIST = $(PERL) -I $(TOOLSDIR) $(TOOLSDIR)/gen_keywordlist.pl
+ GEN_KEYWORDLIST_DEPS = $(TOOLSDIR)/gen_keywordlist.pl $(TOOLSDIR)/PerfectHash.pm
+ 
  all: libpgcommon.a libpgcommon_shlib.a libpgcommon_srv.a
  
  distprep: kwlist_d.h
*** libpgcommon_srv.a: $(OBJS_SRV)
*** 118,125 
  	$(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@
  
  # generate SQL keyword lookup table to be included into keywords*.o.
! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(top_srcdir)/src/tools/gen_keywordlist.pl
! 	$(PERL) $(top_srcdir)/src/tools/gen_keywordlist.pl --extern $<
  
  # Dependencies of keywords*.o need to be managed explicitly to make sure
  # that you don't get broken parsing code, even in a non-enable-depend build.
--- 123,130 
  	$(CC) $(CFLAGS) $(subst -DFRONTEND,, $(CPPFLAGS)) -c $< -o $@
  
  # generate SQL keyword lookup table to be included into keywords*.o.
! kwlist_d.h: $(top_srcdir)/src/include/parser/kwlist.h $(GEN_KEYWORDLIST_DEPS)
! 	$(GEN_KEYWORDLIST) --extern $<
  
  # Dependencies of keywords*.o need to be managed explicitly to make sure
  # that you don't get broken parsing code, even in a non-enable-depend build.
diff --git a/src/common/kwlookup.c b/src/common/kwlookup.c
index d72842e..9dc1fee 100644
*** a/src/common/kwlookup.c
--- b/src/common/kwlookup.c
***
*** 35,94 
   * receive a different case-normalization mapping.
   */
  int
! ScanKeywordLookup(const char *text,
    const ScanKeywordList *keywords)
  {
! 	int			len,
! i;
! 	char		word[NAMEDATALEN];
! 	const char *kw_string;
! 	const uint16 *kw_offsets;
! 	const uint16 *low;
! 	const uint16 *high;
! 
! 	len = strlen(text);
  
  	if (len > keywords->max_kw_len)
! 		return -1;/* too long to be any keyword */
! 
! 	/* We assume all keywords are shorter than NAMEDATALEN. */
! 	Assert(len < NAMEDATALEN);
  
  	/*
! 	 * Apply an ASCII-only downcasing.  We must not use tolower() since it may
! 	 * produce the wrong translation in some locales (eg, Turkish).
  	 */
! 	for (i = 0; i < len; i++)
! 	{
! 		char		ch = text[i];
  
! 		if (ch >= 'A' && ch <= 'Z')
! 			ch += 'a' - 'A';
! 		word[i] = ch;
! 	}
! 	word[len] = '\0';
  
  	/*
! 	 * Now do a binary search using plain strcmp() comparison.
  	 */
! 	kw_string = keywords->kw_string;
! 	kw_offsets = keywords->kw_offsets;
! 	low = kw_offsets;
! 	high = kw_offsets + (keywords->num_keywords - 1);
! 	while (low <= high)
  	{
! 		const uint16 *middle;
! 		int			difference;
  
! 		middle = low + (high - low) / 2;
! 		difference = strcmp(kw_string + *middle, word);
! 		if (difference == 0)
! 			return middle - kw_offsets;
! 		else if (difference < 0)
! 			low = middle + 1;
! 		else
! 			high = middle - 1;
  	}
  
! 	return -1;
  }
--- 35,89 
   * receive a different case-normalization mapping.
   */
  int
! ScanKeywordLookup(const char *str,
    const ScanKeywordList *keywords)
  {
! 	size_t		len;
! 	int			h;
! 	const char *kw;
  
+ 	/*
+ 	 * Reject immediately if too long to be any keyword.  This saves useless
+ 	 * hashing and downcasing work on long strings.
+ 	 */
+ 	len = strlen(str);
  	if (len > keywords->max_kw_len)
! 		return -1;
  
  	/*
! 	 * Compute the hash function.  We assume it was generated to produce
! 	 * case-insensitive results.  Since it's a perfect hash, we need only
! 	 * match to the specific keyword it identifies.
  	 */
! 	h = keywords->hash(str, len);
  
! 	/*
! 	 * An out-of-range result implies no match.  (This can happen for
! 	 * non-keyword inputs because the hash function will sum two unrelated
! 	 * hashtable entries.)
! 	 */
! 	if (h < 0 || h >= keywords->num_keywords)
! 		return -1;
  
  	/*
! 	 * Compare character-by-character to see if we have a match, applying an
! 	 * ASCII-only 

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2019-01-08 Thread Andres Freund
Hi,

On 2019-01-08 13:41:16 -0500, John Naylor wrote:
> On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan
>  wrote:
> > On 1/7/19 7:52 PM, Andres Freund wrote:
> > > Builtin functions for one, which we'd swatted down last time round due
> > > to gperfs defficiencies.
> 
> Do you mean the fmgr table?

Not the entire fmgr table, but just the builtin oid index, generated by
the following section:

# Create the fmgr_builtins table, collect data for fmgr_builtin_oid_index
print $tfh "\nconst FmgrBuiltin fmgr_builtins[] = {\n";
my %bmap;
$bmap{'t'} = 'true';
$bmap{'f'} = 'false';
my @fmgr_builtin_oid_index;
my $fmgr_count = 0;
foreach my $s (sort { $a->{oid} <=> $b->{oid} } @fmgr)
{
print $tfh
  "  { $s->{oid}, $s->{nargs}, $bmap{$s->{strict}}, 
$bmap{$s->{retset}}, \"$s->{prosrc}\", $s->{prosrc} }";

$fmgr_builtin_oid_index[ $s->{oid} ] = $fmgr_count++;

if ($fmgr_count <= $#fmgr)
{
print $tfh ",\n";
}
else
{
print $tfh "\n";
}
}
print $tfh "};\n";

print $tfh qq|
const int fmgr_nbuiltins = (sizeof(fmgr_builtins) / sizeof(FmgrBuiltin));
|;

The generated fmgr_builtin_oid_index is pretty sparse, and a denser
hashtable might, e.g., be more efficient from a cache perspective.

Greetings,

Andres Freund



Re: A few new options for vacuumdb

2019-01-08 Thread Bossart, Nathan
On 1/7/19, 1:12 AM, "Michael Paquier"  wrote:
> I have been looking at the patch set, and 0001 can actually happen
> only once 0005 is applied because this modifies the query doing on
> HEAD a full scan of pg_class which would include at least catalog
> tables so it can never be empty.  For this reason it seems to me that
> 0001 and 0004 should be merged together as common refactoring, making
> sure on the way that all relations exist before processing anything.
> As 0005 and 0006 change similar portions of the code, we could also
> merge these together in a second patch introducing the new options.
> Keeping 0001, 0004, 0005 and 0006 separated eases the review, but I
> would likely merge things when they make sense together to reduce
> diff chunks.

Thanks for reviewing.  Sure, I can merge these together so that it's
just 2 patches.

> 0002 removes some carefully-written query generation, so as it is
> never possible to generate something like ANALYZE FULL.  HEAD splits
> ANALYZE and VACUUM clearly, but 0002 merges them together so as
> careless coding in vacuumdb.c makes it easier to generate command
> strings which would fail in the backend relying on the option checks
> to make sure that for example combining --full and --analyze-only
> never happens.  Introducing 0002 is mainly here for 0003, so let's
> merge them together.

Makes sense.  I was trying to simplify this code a bit, but I agree
with your point about relying on the option checks.

> From patch 0004:
> +   executeCommand(conn, "RESET search_path;", progname, echo);
> +   res = executeQuery(conn, catalog_query.data, progname, echo);
> +   termPQExpBuffer(&catalog_query);
> +   PQclear(executeQuery(conn, ALWAYS_SECURE_SEARCH_PATH_SQL,
> +   progname, echo));
> We should really avoid that.  There are other things like casts which
> tend to be forgotten, and if the catalog lookup query gets more
> complicated, I feel that this would again be forgotten, reintroducing
> the issues that ALWAYS_SECURE_SEARCH_PATH_SQL is here to fix.

This was done in order to maintain the current behavior that
appendQualifiedRelation() gives us.  I found that skipping the
search_path handling here forced us to specify the schema in the
argument for --table in most cases.  At the very least, I could add a
comment here to highlight the importance of fully qualifying
everything in the catalog query.  What do you think?

> I have put my hands into what you sent, and worked on putting
> 0002/0003 first into shape, finishing with the attached.  I have
> simplified the query construction a bit, making it IMO easier to read
> and to add new options, with comments documenting what's supported
> across versions.  I have also added a regression test for
> --analyze-only --skip-locked.  Then I have done some edit of the docs.
> What do you think for this portion?

Looks good to me, except for one small thing in the documentation:

+   
+Disable all page-skipping behavior during processing based on
+the visibility map, similarly to the option
+DISABLE_PAGE_SKIPPING for VACUUM.
+   

I think the "similarly to the option" part is slightly misleading.
It's not just similar, it _is_ using that option in the generated
commands.  Perhaps we could point to the VACUUM documentation for more
information about this one.

On 1/7/19, 8:03 PM, "Michael Paquier"  wrote:
> On Mon, Jan 07, 2019 at 06:10:21PM +0900, Masahiko Sawada wrote:
>> 0002 and 0003 are merged and posted by Michael-san and it looks good
>> to me, so I've looked at the 0001, 0004, 0005 and 0006 patches. Here
>> is a few review comments.
>
> I have done another round on 0002/0003 (PQfinish was lacking after
> error-ing on incompatible options) and committed the thing.

Thanks!

>> Even if all tables are filtered by --min-relation-size, min-mxid-age
>> or min-xid-age the following message appeared.
>> 
>> $ vacuumdb --verbose --min-relation-size 100 postgres
>> vacuumdb: vacuuming database "postgres"
>> 
>> Since no tables are processed in this case isn't is better not to
>> output this message by moving the message after checking if ntup ==
>> 0?
>
> Hmm.  My take on this one is that we attempt to vacuum relations on
> this database, but this may finish by doing nothing.  Printing this
> message is still important to let the user know that an attempt was
> done so I would keep the message.

+1 for keeping the message.

>> You use pg_total_relation_size() to check the relation size but it
>> might be better to use pg_relation_size() instead. The
>> pg_total_relation_size() includes the all index size but I think it's
>> common based on my experience that we check only the table size
>> (including toast table) to do vacuum it or not.
>
> Yes, I am also not completely sure if the way things are defined in
> the patch are completely what we are looking for.  Still, I have seen
> as well people complain about the total amount of space a table is
> p

Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2019-01-08 Thread John Naylor
On Tue, Jan 8, 2019 at 12:06 PM Andrew Dunstan
 wrote:
> On 1/7/19 7:52 PM, Andres Freund wrote:
> > Builtin functions for one, which we'd swatted down last time round due
> > to gperfs defficiencies.

Do you mean the fmgr table?

> >> But in any case, that sounds like a task for someone with
> >> more sense of Perl style than I have.
> > John, any chance you could help out with that... :)
>
> If he doesn't I will.

I'll take a crack at separating into a module.  I'll wait a bit in
case there are any stylistic suggestions on the patch as it stands.



Re: Changing SQL Inlining Behaviour (or...?)

2019-01-08 Thread Adam Brightwell
>> > * Solution #2 - Quick and dirty and invisible. Tom suggested a hack that 
>> > achieves the aims of #1 but without adding syntax to CREATE FUNCTION: have 
>> > the inlining logic look at the cost of the wrapper and the cost of 
>> > parameters, and if the cost of the wrapper "greatly exceeded" the cost of 
>> > the parameters, then inline. So the PostGIS project would just set the 
>> > cost of our wrappers very high, and we'd get the behaviour we want, while 
>> > other users who want to use wrappers to force caching of calculations 
>> > would have zero coded wrapper functions.  Pros: Solves the problem and 
>> > easy to implement, I'm happy to contribute. Cons: it's so clearly a hack 
>> > involving hidden (from users) magic.
>> > ...
>> > So my question to hackers is: which is less worse, #1 or #2, to implement 
>> > and submit to commitfest, in case #3 does not materialize in time for 
>> > PgSQL 12?
>>
>> I've been working with Paul to create and test a patch (attached) that
>> addresses Solution #2. This patch essentially modifies the inlining
>> logic to compare the cost of the function with the total cost of the
>> parameters. The goal as stated above, is that for these kinds of
>> functions, they would be assigned relatively high cost to trigger the
>> inlining case.
>
>
> I've tested this patch for both the internal types case and for the PostGIS 
> functions case, and in both cases it provides a way to get around the 
> undesirable inlining behaviour for our case without impacting people for whom 
> the behaviour has been desirable in the past.
>
> Is this already in the commitfest?

Yes.

https://commitfest.postgresql.org/21/1965/

-Adam



Re: partitioned tables referenced by FKs

2019-01-08 Thread Jesper Pedersen

Hi,

On 1/7/19 1:07 PM, Jesper Pedersen wrote:
> While I'm still looking at 0004 - should we have a test that updates one
> of the constraints of fk to another partition in pk ?




In addition:

* Document pre_drop_class_check()
* I think index_get_partition() would be more visible in partition.c
* Remove code in #if 0 block

I have tested this with a hash based setup.

Thanks again !

Best regards,
 Jesper



Re: GSoC 2019

2019-01-08 Thread Alexander Korotkov
On Tue, Jan 8, 2019 at 1:06 AM Stephen Frost  wrote:
> All the entries are marked with '2018' to indicate they were pulled from
> last year.  If the project from last year is still relevant, please
> update it to be '2019' and make sure to update all of the information
> (in particular, make sure to list yourself as a mentor and remove the
> other mentors, as appropriate).

I can confirm that I'm ready to mentor projects, where I'm listed as
potential mentor.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Displaying and dumping of table access methods

2019-01-08 Thread Andres Freund
Hi,

On 2019-01-08 11:30:56 +0100, Peter Eisentraut wrote:
> On 08/01/2019 00:56, Andres Freund wrote:
> > A patch at [2] adds display of a table's access method to \d+ - but that
> > means that running the tests with a different default table access
> > method (e.g. using PGOPTIONS='-c default_table_access_method=...)
> > there'll be a significant number of test failures, even though the test
> > results did not meaningfully differ.
> 
> For psql, a variable that hides the access method if it's the default.

Yea, I think that seems the least contentious solution.  Don't like it
too much, but it seems better than the alternative. I wonder if we want
one for multiple regression related issues, or whether one specifically
about table AMs is more appropriate. I lean towards the latter.


> > Similarly, if pg_dump starts to dump table access methods either
> > unconditionally, or for all non-heap AMS, the pg_dump tests fail due to
> > unimportant differences.
> 
> For pg_dump, track and set the default_table_access_method setting
> throughout the dump (similar to how default_with_oids was handled, I
> believe).

Yea, that's similar to that, and I think that makes sense.

Greetings,

Andres Freund



Re: Displaying and dumping of table access methods

2019-01-08 Thread Andres Freund
Hi,

On 2019-01-08 13:02:00 +0900, Michael Paquier wrote:
> On Mon, Jan 07, 2019 at 06:31:52PM -0800, Andres Freund wrote:
> > Huh? It's absolutely *trivial* from a buildsystem POV to run the tests
> > again with a different default AM. That's precisely why I'm talking
> > about this. Just setting PGOPTIONS='-c
> > default_table_access_method=zheap' in the new makefile target (the ms
> > run scripts are similar) is sufficient.  And we don't need to force
> > everyone to constantly run tests with e.g. both heap and zheap, it's
> > sufficient to do so on a few buildfarm machines, and whenever changing
> > AM level code.  Rerunning all the tests with a different AM is just
> > setting the same environment variable, but running check-world as the
> > target.
> 
> Another point is that having default_table_access_method facilitates
> the restore of tables across AMs similarly to tablespaces, so CREATE
> TABLE dumps should not include the AM part.

That's what I suggested in the first message in this thread, or did I
miss a difference?


> > And even if you were to successfully argue that it's sufficient during
> > normal development to only have a few zheap specific additional tests,
> > we'd certainly want to make it possible to occasionally explicitly run
> > the rest of the tests under zheap to see whether additional stuff has
> > been broken - and that's much harder to sift through if there's a lot of
> > spurious test failures due to \d[+] outputting additional/differing
> > data.
> 
> The specific-heap tests could be included as an extra module in
> src/test/modules easily, so removing from the main tests what is not
> completely transparent may make sense.

Why does it need to be an extra module, rather than just extra regression
files / sections of files that just explicitly specify the AM?  Seems a
lot easier and faster.


> > We are working seriously hard on making AMs pluggable. Zheap is not yet,
> > and won't be that soon, part of core. The concerns for an in-core zheap
> > (which needs to maintain the test infrastructure during the remainder of
> > its out-of-core development!) and out-of-core AMs are pretty aligned.  I
> > don't get your confusion.
> 
> I would imagine that a full-fledged AM should be able to maintain
> compatibility with the full set of queries that heap is able to
> support, so if you can make the tests transparent enough so as they
> can be run for any AMs without alternate input in the core tree, then
> that's a goal worth it.  Don't you have plan inconsistencies as well
> with zheap?

In the core tests there's a fair number of things that can be cured by
adding an ORDER BY to the tests, which seems sensible. We've added a lot
of those over the years anyway.  There's additionally a number of plans
that change, which currently is handled by alternatives output files,
but I think we should move to reduce those differences, that's probably
not too hard.

Greetings,

Andres Freund



Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)

2019-01-08 Thread Andrew Dunstan


On 1/7/19 7:52 PM, Andres Freund wrote:
> Hi,
>
> On 2019-01-07 19:37:51 -0500, Tom Lane wrote:
>> Andres Freund  writes:
>>> Hm, shouldn't we extract the perfect hash generation into a perl module
>>> or such? It seems that there's plenty other possible uses for it.
>> Such as?
> Builtin functions for one, which we'd swatted down last time round due
> to gperfs defficiencies. But I think there's plenty more potential,
> e.g. it'd make sense from a performance POV to use a perfect hash
> function for locks on builtin objects (the hashtable for lookups therein
> shows up prominently in a fair number of profiles, and they are a large
> percentage of the acquistions). I'm certain there's plenty more, I've
> not though too much about it.
>


Yeah, this is pretty neat,



>> But in any case, that sounds like a task for someone with
>> more sense of Perl style than I have.
> John, any chance you could help out with that... :)
>

If he doesn't I will.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Offline enabling/disabling of data checksums

2019-01-08 Thread Fabien COELHO




> > > But then you have to make sure the control flag gets cleared in any
> > > case pg_verify_checksums crashes somehow or gets SIGKILL'ed ...


> > The usual approach is a restart with some --force option?

> > > Setting the checksum flag is done after having finished all blocks, so
> > > there is no problem.


> > There is also a problem if the db is started while the checksum is
> > being enabled.


> What i mean is that interrupting pg_verify_checksums won't leave
> pg_control in a state where starting the cluster won't work without any
> further interaction.


Yep, I understood that, and agree that a way out is needed, hence the 
--force option suggestion.


--
Fabien.



Re: Offline enabling/disabling of data checksums

2019-01-08 Thread Bernd Helmle
Am Dienstag, den 08.01.2019, 16:17 +0100 schrieb Fabien COELHO:
> > > Adding a new state to ControlFileData which would prevent it from
> > > starting?
> > 
> > But then you have to make sure the control flag gets cleared in any
> > case pg_verify_checksums crashes somehow or gets SIGKILL'ed ...
> 
> The usual approach is a restart with some --force option?
> 
> > Setting the checksum flag is done after having finished all blocks,
> > so
> > there is no problem.
> 
> There is also a problem if the db is started while the checksum is
> being 
> enabled.

What i mean is that interrupting pg_verify_checksums won't leave
pg_control in a state where starting the cluster won't work without any
further interaction.

Bernd.





Re: Fast path for empty relids in check_outerjoin_delay()

2019-01-08 Thread Tom Lane
Richard Guo  writes:
> On Fri, Jan 4, 2019 at 10:32 PM Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
>> I think code readability and maintainability would be improved by having
>> fewer special cases and fast paths.  In this particular case, I'd even
>> remove the existing fast path and just let the subsequent loop run zero
>> times.  I doubt there is a performance benefit to the existing coding.
>> And if there is no performance benefit, then it's not really a fast
>> path, just another path.

> This code was added in commit d7a6a0, by Tom.
> Hi Tom, what is your opinion about the fast path in
> check_outerjoin_delay()?

I quite reject Peter's argument that the existing fast path is not
worthwhile.  It's a cheap test and it saves cycles whenever the query has
no outer joins, which is common for simple queries.  (The general rule
I've followed for planner fast-path cases is to make simple queries as
cheap as possible; you don't want the planner expending a whole lot of
overhead on trivial cases.)  Moreover, while it's true that the loop
will fall through quickly if root->join_info_list is nil, there's still
a bms_copy, bms_int_members, and bms_free that will be expended uselessly
if we remove the fast-path check.

I'm not, however, convinced that the case of *relids_p being empty is
common enough to justify adding a test for that.  In fact, it looks
to me like it's guaranteed to be nonempty for the main call sites in
distribute_qual_to_rels.

Possibly what would make more sense is to add the consideration to
check_equivalence_delay, which is the only call site that can pass
an empty set; that is

+   /* A variable-free expression cannot be outerjoin-delayed */
+   if (restrictinfo->left_relids)
+   {
/* must copy restrictinfo's relids to avoid changing it */
relids = bms_copy(restrictinfo->left_relids);
/* check left side does not need delay */
if (check_outerjoin_delay(root, &relids, &nullable_relids, true))
return false;
+   }

and likewise for the right side.

The bigger picture here, of course, is that check_outerjoin_delay's
API is not at all well matched to what check_equivalence_delay needs:
it has to make the extra bms_copy shown above, and it has no use
for either of the relid sets that check_outerjoin_delay wants to
return.  I believe it's like that because check_outerjoin_delay is
much older than the EquivalenceClass logic, and I didn't want to
redesign it when I put in ECs, and I also didn't want to just
duplicate all that logic.  But maybe it's time to put some more thought
into that, and try to find a refactoring of check_outerjoin_delay that
suits both callers better.

Anyway, I concur with Peter that the patch you've presented should
probably be rejected: I doubt it's a win performance-wise and I don't
see that it adds anything to readability either.  But if you want to
think about a more effective refactoring of this logic, maybe there
is something good to be had out of that.

regards, tom lane



Re: Offline enabling/disabling of data checksums

2019-01-08 Thread Fabien COELHO



> > Setting the checksum flag is done after having finished all blocks, so
> > there is no problem. But we need to set this new flag before and reset
> > it afterwards, so in between strange things can happen (as the various
> > calls to exit() within error handling illustrates).


> It seems writing a note like "pg_checksums is running" into the
> postmaster.pid would work, and would give a hopefully useful hint to
> somebody trying to start Postgres while pg_checksums is running:
>
> postgres@kohn:~$ echo  "pg_checksums running with pid 1231, cluster disabled" > 
> data/postmaster.pid
> postgres@kohn:~$ pg_ctl -D data -l logfile start
> pg_ctl: invalid data in PID file "data/postmaster.pid"
> postgres@kohn:~$ echo $?
> 1
> postgres@kohn:~$


Looks ok, but I'm unsure how portable it is. What if it is started with
"postmaster" directly?



> If the DBA then just simply deletes postmaster.pid and starts over, well
> then I call pilot error; though we could in theory change pg_ctl (or
> whatever checks postmaster.pid) to emit an even more useful error
> message if it encounters a "cluster is locked" keyword in it.
>
> Not sure whether everybody likes that (or is future-proof for that
> matter), but I like it better than adding a new field to the control
> file, for the reasons Bernd outlined above.


ISTM that the point of the control file is exactly to tell what the
current status of the cluster is, so that is where this information
really belongs?


AFAICS all commands take care of the status in some way to avoid 
accidents.


--
Fabien.

Re: Offline enabling/disabling of data checksums

2019-01-08 Thread Fabien COELHO




> > Adding a new state to ControlFileData which would prevent it from
> > starting?


> But then you have to make sure the control flag gets cleared in any
> case pg_verify_checksums crashes somehow or gets SIGKILL'ed ...


The usual approach is a restart with some --force option?


> Setting the checksum flag is done after having finished all blocks, so
> there is no problem.


There is also a problem if the db is started while the checksum is being 
enabled.


> But we need to set this new flag before and reset it afterwards, so in
> between strange things can happen (as the various calls to exit() within
> error handling illustrates).


Sure, there is some need for a backup plan if it fails and the control
file is left in a wrong state.


--
Fabien.



Re: Mentoring for GCI-19

2019-01-08 Thread Stephen Frost
Greetings,

* Padam Chopra (padamchopra1...@gmail.com) wrote:
> I want to join GCI as a mentor for the year 2019, please guide me about the
> procedure,
> thanks in anticipation.

Thanks for your interest but we aren't likely to even start thinking
about that until we're close to when GCI 2019 starts.

If you're interested in mentoring, then I'd suggest you participate on
our existing mailing lists and become familiar with the community and
help out with all of the questions that are constantly being asked on
lists like -general, -admin, etc.  Doing so would make it much more
likely that we'd consider you for a mentor for GCI 2019.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Offline enabling/disabling of data checksums

2019-01-08 Thread Michael Banck
Am Dienstag, den 08.01.2019, 15:39 +0100 schrieb Bernd Helmle:
> Am Dienstag, den 08.01.2019, 15:09 +0100 schrieb Fabien COELHO:
> > > The question is how to reliably do this in an acceptable way? Just
> > > faking a postmaster.pid sounds pretty hackish to me, do you have
> > > any
> > > suggestions here?
> > 
> > Adding a new state to ControlFileData which would prevent it from 
> > starting?
> 
> But then you have to make sure the control flag gets cleared in any
> case pg_verify_checksums crashes somehow or gets SIGKILL'ed ...
> 
> Setting the checksum flag is done after having finished all blocks, so
> there is no problem. But we need to set this new flag before and reset
> it afterwards, so in between strange things can happen (as the various
> calls to exit() within error handling illustrates). 

It seems writing a note like "pg_checksums is running" into the
postmaster.pid would work, and would give a hopefully useful hint to
somebody trying to start Postgres while pg_checksums is running:

postgres@kohn:~$ echo  "pg_checksums running with pid 1231, cluster disabled" > 
data/postmaster.pid 
postgres@kohn:~$ pg_ctl -D data -l logfile start
pg_ctl: invalid data in PID file "data/postmaster.pid"
postgres@kohn:~$ echo $?
1
postgres@kohn:~$ 

If the DBA then just simply deletes postmaster.pid and starts over, well
then I call pilot error; though we could in theory change pg_ctl (or
whatever checks postmaster.pid) to emit an even more useful error
message if it encounters a "cluster is locked" keyword in it.
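
(Just to illustrate the idea, roughly the kind of check pg_ctl could do; this
is a standalone sketch with made-up function and message names, not actual
pg_ctl code:)

/* Standalone sketch; assumes the lock note is written as the first line. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void
check_for_offline_lock(const char *pid_path)
{
    FILE       *fp = fopen(pid_path, "r");
    char        line[256];

    if (fp == NULL)
        return;                 /* no postmaster.pid, nothing to check */

    /* the first line is normally the postmaster PID; a lock note is not */
    if (fgets(line, sizeof(line), fp) != NULL &&
        strstr(line, "cluster is locked") != NULL)
    {
        fclose(fp);
        fprintf(stderr, "pg_ctl: cannot start server: %s", line);
        exit(1);
    }
    fclose(fp);
}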

Not sure whether everybody likes that (or whether it is future-proof, for that
matter), but I like it better than adding a new field to the control
file, for the reasons Bernd outlined above.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Offline enabling/disabling of data checksums

2019-01-08 Thread Bernd Helmle
Am Dienstag, den 08.01.2019, 15:09 +0100 schrieb Fabien COELHO:
> > The question is how to reliably do this in an acceptable way? Just
> > faking a postmaster.pid sounds pretty hackish to me, do you have
> > any
> > suggestions here?
> 
> Adding a new state to ControlFileData which would prevent it from 
> starting?

But then you have to make sure the control flag gets cleared in any
case pg_verify_checksums crashes somehow or gets SIGKILL'ed ...

Setting the checksum flag is done after having finished all blocks, so
there is no problem. But we need to set this new flag before and reset
it afterwards, so in between strange things can happen (as the various
calls to exit() within error handling illustrates). 

Bernd.





Re: OpenBSD versus semaphores

2019-01-08 Thread Tom Lane
Mikael Kjellström  writes:
> On 2019-01-08 07:14, Tom Lane wrote:
>> Raising the annoyance factor considerably, AFAICT the only way to
>> increase these settings is to build your own custom kernel.

> You don't need to build your custom kernel to change those settings.
> Just add:
> kern.seminfo.semmni=20
> to /etc/sysctl.conf and reboot

Hm, I wonder when that came in?  Our documentation doesn't know about it.

regards, tom lane



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2019-01-08 Thread Dean Rasheed
On Mon, 7 Jan 2019 at 00:45, Tomas Vondra  wrote:
>
> FWIW the main unsolved issue (at least on the MCV part) is how it
> decides which items to keep in the list.
>
> As explained in [1], in the multivariate case we can't simply look at
> the group frequency and compare it to the average frequency (of the
> non-MCV items), which is what analyze_mcv_list() does in the
> single-column case. In the multivariate case we also care about observed
> vs. base frequency, i.e. we want the MCV list to include groups that are
> present significantly more/less often than the product of per-column stats.
>
> I've repeatedly tried to come up with a criteria that would address
> that, but it seems rather difficult because we can't abandon the other
> criteria either. So the MCV list should include groups that match both
>
> (a) items that are statistically more common than the non-MCV part (i.e.
> the rule from per-column analyze_mcv_list)
>
> (b) items that are statistically more/less common than estimated from
> per-column stats (i.e. the new rule)

Thinking about this some more, I think that it probably isn't
appropriate to use analyze_mcv_list() directly because that's making
specific assumptions about how items not in the list will be estimated
that aren't actually true for groups of values in multivariate stats.
If a group of values isn't in the MCV list, it gets estimated based on
the product of the selectivities from the per-column stats (modulo the
additional logic preventing the selectivity from exceeding the total
non-MCV selectivity).

So actually, the estimate for a group of values will be either the MCV
item's frequency (if the MCV item is kept), or (roughly) the MCV
item's base_frequency (if the MCV item is not kept). That suggests
that we should simply keep items that are significantly more or less
common than the item's base frequency -- i.e., keep rule (b) and ditch
rule (a).

> Enforcing rule (a) seems reasonable because it ensures the MCV list
> includes all items more frequent than the last one. Without it, it's
> difficult to know whether the absent item is very common (but
> close to base frequency) or very uncommon (so less frequent than the
> last MCV item).

I'm not sure there's much we can do about that. Keeping the item will
result in keeping a frequency that we know is close to the base
frequency, and not keeping the item will result in per-column stats
being used that we expect to also give an estimate close to the base
frequency. So either way, the result is about the same, and it's
probably better to discard it, leaving more room for other items about
which we may have more information.

That said, there is a separate benefit to keeping items in the list
even if their frequency is close to the base frequency -- the more
items kept, the larger their total selectivity will be, giving a
better cap on the non-MCV selectivities. So if, after keeping all
items satisfying rule (b), there are free slots available, perhaps
they should be used for the most common remaining values satisfying
rule (a).
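
To make the rule concrete, here is a rough standalone sketch of that selection
order (the struct, the function name and the significance test are all
placeholders, not the actual extended-statistics code):

#include <math.h>
#include <stdbool.h>

/* Placeholder shape of an MCV candidate; not the real data structures. */
typedef struct McvCandidate
{
    double      frequency;      /* observed frequency of the group */
    double      base_frequency; /* product of the per-column frequencies */
} McvCandidate;

/*
 * Rule (b): keep the group if its observed frequency deviates from its base
 * frequency by more than some significance threshold.  The threshold stands
 * in for whatever statistical test is eventually chosen.
 */
static bool
keep_by_rule_b(const McvCandidate *c, double threshold)
{
    return fabs(c->frequency - c->base_frequency) > threshold * c->base_frequency;
}

/*
 * Rule (a) as a tie-breaker: after taking everything that passes rule (b),
 * fill any remaining slots with the most frequent leftover groups, which
 * improves the cap on the non-MCV selectivities.
 */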

Regards,
Dean



Re: Offline enabling/disabling of data checksums

2019-01-08 Thread Fabien COELHO






One difference between pg_rewind and pg_checksums is that the latter
potentially runs for a longer time (or rather a non-trivial amount of
time, compared to pg_rewind), so the margin of error of another DBA
saying "oh, that DB is down, let me start it again" might be much
higher.

The question is how to reliably do this in an acceptable way? Just
faking a postmaster.pid sounds pretty hackish to me, do you have any
suggestions here?


Adding a new state to ControlFileData which would prevent it from 
starting?


--
Fabien.



Re: commitfest: When are you assigned patches to review?

2019-01-08 Thread David Rowley
On Wed, 9 Jan 2019 at 01:20, Andrey Borodin  wrote:
> By the way, is it ok to negotiate review exchange?

I think it happens fairly often. There's no need for the list to know
anything about it when it does.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: emacs configuration for new perltidy settings

2019-01-08 Thread Andrew Dunstan


On 1/3/19 12:53 AM, Noah Misch wrote:
> On Thu, Jul 12, 2012 at 12:35:26AM +0300, Peter Eisentraut wrote:
>> This might be useful for some people.  Here is an emacs configuration
>> for perl-mode that is compatible with the new perltidy settings.  Note
>> that the default perl-mode settings produce indentation that will be
>> completely shredded by the new perltidy settings.
>>
>> (defun pgsql-perl-style ()
>>   "Perl style adjusted for PostgreSQL project"
>>   (interactive)
>>   (setq tab-width 4)
>>   (setq perl-indent-level 4)
>>   (setq perl-continued-statement-offset 4)
>>   (setq perl-continued-brace-offset 4)
> (Later, commit 56fb890 changed perl-continued-statement-offset to 2.)  This
> indents braces (perltidy aligns the brace with "if", but perl-mode adds
> perl-continued-statement-offset + perl-continued-brace-offset = 6 columns):
>
>   if (-s "src/backend/snowball/stopwords/$lang.stop")
> {
> $stop = ", StopWords=$lang";
> }
>
> If I run perltidy on 60d9979, then run perl-mode indent, the diff between the
> perltidy run and perl-mode indent run is:
>  129 files changed, 8468 insertions(+), 8468 deletions(-)
> If I add (perl-continued-brace-offset . -2):
>  119 files changed, 3515 insertions(+), 3515 deletions(-)
> If I add (perl-indent-continued-arguments . 4) as well:
>  86 files changed, 2626 insertions(+), 2626 deletions(-)
> If I add (perl-indent-parens-as-block . t) as well:
>  65 files changed, 2373 insertions(+), 2373 deletions(-)
>
> That's with GNU Emacs 24.5.1.  Versions 24.3.1 and 21.4.1 show similar trends,
> though 21.4.1 predates perl-indent-continued-arguments and
> perl-indent-parens-as-block.
>
> I'm attaching the patch to make it so, along with a patch that illustrates my
> testing method.  "sh reindent-perl.sh" will test emacs.samples using your
> Emacs installation.  (I don't plan to push the testing patch.)


Sounds good. What do the remaining diffs look like?


cheers


andrew



-- 

Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: commitfest: When are you assigned patches to review?

2019-01-08 Thread Andrey Borodin
Hi everyone!

> On 8 Jan 2019, at 14:14, Fabien COELHO  wrote:
> 
> The process is that *you* choose the patches to review and register as such 
> for the patch on the CF app.

By the way, is it ok to negotiate review exchange?

Best regards, Andrey Borodin.


Re: Using POPCNT and other advanced bit manipulation instructions

2019-01-08 Thread Dmitry Dolgov
> On Fri, Jan 4, 2019 at 1:38 PM David Rowley  
> wrote:
>
> On Thu, 20 Dec 2018 at 23:56, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > I've checked for Clang 6, it turns out that indeed it generates popcnt 
> > without
> > any macro, but only in one place for bloom_prop_bits_set. After looking at 
> > this
> > function it seems that it would be beneficial to actually use popcnt there 
> > too.
>
> Yeah, that's the pattern that's mentioned in
> https://lemire.me/blog/2016/05/23/the-surprising-cleverness-of-modern-compilers/
> It would need to be changed to call the popcount function.  This
> existing usage makes me a bit more worried that some extension could be
> using a similar pattern and end up being compiled with -mpopcnt due to
> pg_config having that CFLAG. That's all fine until the binary makes
> its way over to a machine without that instruction.

It surprises me that it's not that obvious how to disable this feature for
clang. I guess one should be able to turn it off by invoking opt manually:

clang -S -mpopcnt -emit-llvm *.c
opt -S -mattr=+popcnt  *.ll
llc -mattr=+popcnt *.optimized.ll
clang -mpopcnt *optimized.s

But for some reason this doesn't work for me (popcnt is not appearing in
the first place).

> > > I am able to measure performance gains from the patch.  In a 3.4GB
> > > table containing a single column with just 10 statistics targets, I
> > > got the following times after running ANALYZE on the table.
> >
> > I've tested it too a bit, and got similar results when the patched version 
> > is
> > slightly faster. But then I wonder if popcnt is the best solution here, 
> > since
> > after some short research I found a paper [1], where authors claim that:
> >
> > Maybe surprisingly, we show that a vectorized approach using SIMD
> > instructions can be twice as fast as using the dedicated instructions on
> > recent Intel processors.
> >
> >
> > [1]: https://arxiv.org/pdf/1611.07612.pdf
>
> I can't imagine that using the number_of_ones[] array processing
> 8-bits at a time would be slower than POPCNT though.

Yeah, you're probably right. If I understand correctly, even with the lookup
table in the cache, the access would be a bit slower than a POPCNT instruction.
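
For reference, a minimal standalone sketch of the two approaches being
compared here (the table-driven byte-at-a-time count versus the compiler
builtin that can emit POPCNT); this is only an illustration, not code from
the patch:

#include <stdint.h>

/* number of one bits in each possible byte value, filled once at startup */
static uint8_t number_of_ones[256];

static void
init_number_of_ones(void)
{
    for (int i = 1; i < 256; i++)
        number_of_ones[i] = (uint8_t) ((i & 1) + number_of_ones[i / 2]);
}

/* table-driven popcount, processing 8 bits at a time */
static int
popcount_table(uint64_t x)
{
    int         count = 0;

    while (x != 0)
    {
        count += number_of_ones[x & 0xFF];
        x >>= 8;
    }
    return count;
}

/* gcc/clang builtin; compiles to a POPCNT instruction under -mpopcnt */
static int
popcount_builtin(uint64_t x)
{
    return __builtin_popcountll(x);
}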



Re: Offline enabling/disabling of data checksums

2019-01-08 Thread Michael Banck
Hi,

Am Mittwoch, den 26.12.2018, 19:43 +0100 schrieb Fabien COELHO:
> > It adds a (now mandatory) --action parameter that takes either verify,
> > enable or disable as argument.
> 
> I'd rather have explicit switches for verify, enable & disable, and verify 
> would be the default if none is provided.

I changed that to the switches -c/--verify (-c for check as -v is taken,
should it be --check as well? I personally like verify better), 
-d/--disable and -e/--enable.
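
(For illustration, the option table for that CLI shape could look roughly like
the following; this is just a sketch of the interface, not the patch itself:)

#include <getopt.h>

static const struct option long_options[] = {
    {"pgdata", required_argument, NULL, 'D'},
    {"verify", no_argument, NULL, 'c'},     /* -c for "check", -v is taken */
    {"disable", no_argument, NULL, 'd'},
    {"enable", no_argument, NULL, 'e'},
    {"verbose", no_argument, NULL, 'v'},
    {"version", no_argument, NULL, 'V'},
    {NULL, 0, NULL, 0}
};

/* e.g.: while ((c = getopt_long(argc, argv, "cdeD:r:vV", long_options, NULL)) != -1) ... */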

> About the patch: applies, compiles, "make check" ok.
> 
> There is no documentation.

Yeah, I'll write that once the CLI is settled.

> In "scan_file", I would open RW only for enable, but keep RO for verify.

OK, I've changed that.

> Also, the full page is rewritten... would it make sense to only overwrite 
> the checksum part itself?

So just writing the page header? I find that a bit scary, and I don't
expect much speedup, as the OS would write the whole block anyway, I
guess? I haven't touched that yet.

> It seems that the control file is unlinked and then rewritten. If the 
> rewritting fails, or the command is interrupted, the user has a problem.
> 
> Could the control file be simply opened RW? Else, I would suggest to 
> rename (eg add .tmp), write the new one, then unlink the old one, so that 
> recovering the old state in case of problem is possible.

I have mostly taken the pg_rewind code here; if there were a function
that allowed for safe offline changes of the control file, I'd be happy
to use it, but I don't think it should be this patch's job to invent one.

In any case, I have removed the unlink() now (not sure where that came
from), and changed it to open(O_WRONLY), the same as in Michael's code and
pg_rewind.
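
(For reference, the write-temp-then-rename approach suggested above would look
roughly like the sketch below; it is only an illustration of the pattern, with
made-up names, not code from the patch or from pg_rewind. A fully durable
version would also fsync the containing directory after the rename.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write the new contents to <path>.tmp, fsync it, then atomically rename it
 * over the old file, so an interruption leaves either the old or the new
 * control file intact.
 */
static int
rewrite_file_atomically(const char *path, const char *buffer, size_t len)
{
    char        tmppath[1024];
    int         fd;

    snprintf(tmppath, sizeof(tmppath), "%s.tmp", path);

    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buffer, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    return rename(tmppath, path);   /* atomic on POSIX filesystems */
}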

V2 attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index cc6ebb9df0..3ecce3da32 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -1,7 +1,7 @@
 /*
  * pg_verify_checksums
  *
- * Verifies page level checksums in an offline cluster
+ * Verifies/enables/disables page level checksums in an offline cluster
  *
  *	Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -13,15 +13,16 @@
 #include 
 #include 
 
+#include "access/xlog_internal.h"
 #include "catalog/pg_control.h"
 #include "common/controldata_utils.h"
+#include "common/file_perm.h"
+#include "common/file_utils.h"
 #include "getopt_long.h"
 #include "pg_getopt.h"
 #include "storage/bufpage.h"
 #include "storage/checksum.h"
 #include "storage/checksum_impl.h"
-#include "storage/fd.h"
-
 
 static int64 files = 0;
 static int64 blocks = 0;
@@ -31,16 +32,31 @@ static ControlFileData *ControlFile;
 static char *only_relfilenode = NULL;
 static bool verbose = false;
 
+typedef enum
+{
+	PG_ACTION_DISABLE,
+	PG_ACTION_ENABLE,
+	PG_ACTION_VERIFY
+} ChecksumAction;
+
+/* Filename components */
+#define PG_TEMP_FILES_DIR "pgsql_tmp"
+#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
+
+static ChecksumAction action = PG_ACTION_VERIFY;
+
 static const char *progname;
 
 static void
 usage(void)
 {
-	printf(_("%s verifies data checksums in a PostgreSQL database cluster.\n\n"), progname);
+	printf(_("%s enables/disables/verifies data checksums in a PostgreSQL database cluster.\n\n"), progname);
 	printf(_("Usage:\n"));
 	printf(_("  %s [OPTION]... [DATADIR]\n"), progname);
 	printf(_("\nOptions:\n"));
 	printf(_(" [-D, --pgdata=]DATADIR  data directory\n"));
+	printf(_("  -A, --action   action to take on the cluster, can be set as\n"));
+	printf(_(" \"verify\", \"enable\" and \"disable\"\n"));
 	printf(_("  -v, --verbose  output verbose messages\n"));
 	printf(_("  -r RELFILENODE check only relation with specified relfilenode\n"));
 	printf(_("  -V, --version  output version information, then exit\n"));
@@ -80,6 +96,77 @@ skipfile(const char *fn)
 }
 
 static void
+updateControlFile(char *DataDir, ControlFileData *ControlFile)
+{
+	int			fd;
+	char		buffer[PG_CONTROL_FILE_SIZE];
+	char		ControlFilePath[MAXPGPATH];
+
+	Assert(action == PG_ACTION_ENABLE ||
+		   action == PG_ACTION_DISABLE);
+
+	/*
+	 * For good luck, apply the same static assertions as in backend's
+	 * WriteControlFile().
+	 */
+#if PG_VERSION_NUM >= 10
+	StaticAssertStmt(sizeof(ControlFileData) <= PG_CONTROL_MAX_SAFE_SIZE,
+	 "pg_control is too large for atomic disk writes");
+#endif
+	StaticAssertStmt(sizeof(ControlFileData) <= PG_CONTROL_FILE_SIZE,
+	 "sizeof(ControlFileData) exceeds PG_CONTRO

GCI-2019 Mentoring

2019-01-08 Thread Padam Chopra
Respected Concern,

I want to join GCI as a mentor for the year 2019, please guide me about the
procedure,
thanks in anticipation.

--
Regards
Padam Chopra
Google Grand Prize Winner
Microsoft Imagine Cup India winner
TedX Event Organizer

Contact:

Email:padamchopra1337(at)gmail(dot)com

More details about me and my work:

GitHub Profile: https://github.com/padamchopra
Website: http://padamchopra.me/


Re: Offline enabling/disabling of data checksums

2019-01-08 Thread Michael Banck
Hi,

Am Donnerstag, den 27.12.2018, 12:26 +0100 schrieb Fabien COELHO:
> > > For enable/disable, while the command is running, it should mark the 
> > > cluster as opened to prevent an unwanted database start. I do not see 
> > > where this is done.
> > > 
> > > You have pretty much the same class of problems if you attempt to
> > > start a cluster on which pg_rewind or the existing pg_verify_checksums
> > > is run after these have scanned the control file to make sure that
> > > they work on a cleanly-stopped instance. [...]
> > 
> > I think it comes down to what the outcome is. If we're going to end up with
> > a corrupt database (e.g. one where checksums aren't set everywhere but they
> > are marked as such in pg_control) then it's not acceptable. If the only
> > outcome is the tool gives an error that's not an error and if re-run it's
> > fine, then it's a different story.
> 
> ISTM that such an outcome is indeed a risk, as a starting postgres could 
> update already checksummed pages without putting a checksum. It could be 
> even worse, although with a (very) low probability, with updates 
> overwritten on a race condition between the processes. In any case, no 
> error would be reported before much later, with invalid checksums or 
> inconsistent data, or undetected forgotten committed data.

One difference between pg_rewind and pg_checksums is that the latter
potentially runs for a longer time (or rather a non-trivial amount of
time, compared to pg_rewind), so the margin of error of another DBA
saying "oh, that DB is down, let me start it again" might be much
higher. 

The question is how to reliably do this in an acceptable way? Just
faking a postmaster.pid sounds pretty hackish to me, do you have any
suggestions here?

The alternative would be to document that it needs to be made sure that
the database is not started up during enabling of checksums, yielding to
pilot error.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Displaying and dumping of table access methods

2019-01-08 Thread Peter Eisentraut
On 08/01/2019 00:56, Andres Freund wrote:
> A patch at [2] adds display of a table's access method to \d+ - but that
> means that running the tests with a different default table access
> method (e.g. using PGOPTIONS='-c default_table_access_method=...)
> there'll be a significant number of test failures, even though the test
> results did not meaningfully differ.

For psql, a variable that hides the access method if it's the default.

> Similarly, if pg_dump starts to dump table access methods either
> unconditionally, or for all non-heap AMS, the pg_dump tests fail due to
> unimportant differences.

For pg_dump, track and set the default_table_access_method setting
throughout the dump (similar to how default_with_oids was handled, I
believe).

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fast path for empty relids in check_outerjoin_delay()

2019-01-08 Thread Richard Guo
On Fri, Jan 4, 2019 at 10:32 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 12/12/2018 08:32, Richard Guo wrote:
> > This small revision is not expected to bring performance improvements, but
> > can improve the readability of the code, in that for empty relids, the qual
> > is always considered as being not-outerjoin_delayed.
>
> I think code readability and maintainability would be improved by having
> fewer special cases and fast paths.  In this particular case, I'd even
> remove the existing fast path and just let the subsequent loop run zero
> times.  I doubt there is a performance benefit to the existing coding.
> And if there is no performance benefit, then it's not really a fast
> path, just another path.
>

I did not find any performance data in the commit history about the
existing fast path. Not sure about its influence.

This code was added in commit d7a6a0, by Tom.
Hi Tom, what is your opinion about the fast path in
check_outerjoin_delay()?

Thanks
Richard


>
> --
> Peter Eisentraut
> http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: commitfest: When are you assigned patches to review?

2019-01-08 Thread Fabien COELHO



Hello Mitar,


I am new to this community. I have submitted a few patches to this
commitfest and I have read that it is expected that I also review some
other patches. But I am not sure about the process here. Should I wait
for some other patches to be assigned to me to review? Or is there
some other process?


The process is that *you* choose the patches to review and register as 
such for the patch on the CF app.



Also, how is the level at which I should review it
determined?


Patches as complex as the one you submitted?

Based on your area of expertise?


I am not really sure of my skills and understanding of the
PostgreSQL codebase to feel confident that I can review well, but I am
willing to try. I have read [1] and [2].


There are doc patches, client-side code patches, compilation 
infrastructure patches...


--
Fabien.



RE: Cache relation sizes?

2019-01-08 Thread Jamison, Kirk
Hi Thomas,

On Friday, December 28, 2018 6:43 AM Thomas Munro 
 wrote:
> [...]if you have ideas about the validity of the assumptions, the reason it 
> breaks initdb, or any other aspect of this approach (or alternatives), please 
> don't let me stop you, and of course please feel free to submit this, an 
> improved version or an alternative proposal [...]

Sure. Thanks. I'd like to try to work on the idea. I also took a look at the 
code, and I hope you don't mind if I ask for clarifications 
(explanation/advice/opinions) on the following, since my postgres experience is 
not substantial enough yet.

(1) I noticed that you used a "relsize_change_counter" to store in 
MdSharedData. Is my understanding below correct?

The intention is to cache the rel_size per-backend (lock-free), with an array 
of relsize_change_counter values, to skip the lseek syscall when the counter does not 
change.
In _mdnblocks(), if the counter did not change, the cached rel size (nblocks) 
is used and the call to FileSize() (lseek to get and cache the rel size) is skipped. That 
means that in the default Postgres master, the lseek syscall (through FileSize()) is 
called whenever we want to get the rel size (nblocks).

On the other hand, the simplest method I thought could also work is to 
cache the file size (nblocks) only in shared memory, not in the backend process, 
since both nblocks and relsize_change_counter are of the uint32 data type anyway. If 
relsize_change_counter can be changed without a lock, then nblocks can be changed 
without a lock too, is that right? In that case, nblocks can be accessed directly in 
shared memory. In this case, is it still necessary to cache the relation size in the 
backend?
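
To check my understanding, here is a generic sketch of that counter-based fast
path (plain POSIX and made-up struct names; the actual patch would use
PostgreSQL's atomics and fd layer instead):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Made-up illustration of the pattern, not the patch's data structures. */
typedef struct SharedRelSizeState
{
    volatile uint32_t change_counter;   /* bumped on every extend/truncate */
} SharedRelSizeState;

typedef struct BackendRelSizeCache
{
    uint32_t    cached_counter;
    off_t       cached_size;    /* -1 means "not cached yet" */
} BackendRelSizeCache;

static off_t
get_relation_size(int fd, SharedRelSizeState *shared, BackendRelSizeCache *cache)
{
    uint32_t    counter = shared->change_counter;

    if (cache->cached_size >= 0 && cache->cached_counter == counter)
        return cache->cached_size;              /* fast path: no lseek */

    cache->cached_size = lseek(fd, 0, SEEK_END);    /* slow path */
    cache->cached_counter = counter;
    return cache->cached_size;
}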

(2) Is the MdSharedData temporary or permanent in shared memory?
from the patch:
 typedef struct MdSharedData
 {
/* XXX could have an array of these, and use rel OID % nelements? */ 
pg_atomic_uint32relsize_change_counter;
 } MdSharedData;
 
 static MdSharedData *MdShared;

What I intend to have is a permanent hashtable that will keep the file size 
(eventually/future dev, including table addresses) in the shared memory for 
faster access by backend processes. The idea is to keep track of the 
unallocated blocks, based on how much the relation has been extended or 
truncated. Memory for this hashtable will be dynamically allocated.

Thanks, 
Kirk Jamison


commitfest: When are you assigned patches to review?

2019-01-08 Thread Mitar
Hi!

I am new to this community. I have submitted a few patches to this
commitfest and I have read that it is expected that I also review some
other patches. But I am not sure about the process here. Should I wait
for some other patches to be assigned to me to review? Or is there
some other process? Also, how is the level at which I should review it
determined? I am not really sure of my skills and understanding of the
PostgreSQL codebase to feel confident that I can review well, but I am
willing to try. I have read [1] and [2].

[1] https://wiki.postgresql.org/wiki/CommitFest
[2] https://wiki.postgresql.org/wiki/Reviewing_a_Patch


Mitar

-- 
http://mitar.tnode.com/
https://twitter.com/mitar_m



Re: Undo logs

2019-01-08 Thread Amit Kapila
On Sat, Jan 5, 2019 at 11:29 AM Dilip Kumar  wrote:
>
> On Tue, Jan 1, 2019 at 4:37 PM Amit Kapila  wrote:
> >
> > Thanks, the new changes look mostly okay to me, but I have few comments:
> > 1.
> > + /*
> > + * WAL log, for log switch.  This is required to identify the log switch
> > + * during recovery.
> > + */
> > + if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
> > + {
> > + XLogBeginInsert();
> > + XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
> > + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
> > + }
> > +
> >
> > Don't we want to do this under critical section?
> I think we are not making any buffer changes here and just inserting a
> WAL record, so IMHO we don't need any critical section.  Am I missing
> something?
>

No, you are correct.

A few more comments:

1.
+ * undorecord.c
+ *   encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group

Change the year in Copyright notice for all new files?

2.
+ * This function sets uur->uur_info as a side effect.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+ int starting_byte, int *already_written, bool header_only)

Is the above part of the comment still correct? I don't see uur_info being set here.

3.
+ work_txn.urec_next = uur->uur_next;
+ work_txn.urec_xidepoch = uur->uur_xidepoch;
+ work_txn.urec_progress = uur->uur_progress;
+ work_txn.urec_prevurp = uur->uur_prevurp;
+ work_txn.urec_dbid = uur->uur_dbid;

It would be better if we initialize these members in the order in
which they appear in the actual structure.  All other undo header
structures are initialized that way, so this looks out-of-place.

4.
+ * 'my_bytes_written' is a pointer to the count of previous-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
..
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+ char **writeptr, char *endptr,
+ int *my_bytes_written, int *total_bytes_written)
+{
..
+
+ /* Update bookkeeeping infrormation. */
+ *writeptr += can_write;
+ *total_bytes_written += can_write;
+ *my_bytes_written = 0;

I don't understand the above comment where it is written: "We must
update it for the bytes we write."  We always set 'my_bytes_written'
to 0 if we write.  Can you clarify?  I guess this part of the comment
is about total_bytes_written, or does it mean to say that the caller
should update it?  I think some wording change might be required based
on what we intend to say here.

Similar to above, there is a confusion in the description of
my_bytes_read atop ReadUndoBytes.

5.
+uint32
+GetEpochForXid(TransactionId xid)
{
..
+ /*
+ * Xid can be on either side when near wrap-around.  Xid is certainly
+ * logically later than ckptXid.
..

From the usage of this function in the patch, can we say that Xid is
always later than ckptxid, and if so, how?  Also, I think you previously
said in this thread that the usage of uur_xidepoch is mainly for zheap, so
we might want to postpone including it in the undo records. On
thinking again, I think we should follow your advice, as I think the
correct usage here would require the patch by Thomas to fix our epoch
stuff [1].  Am I correct?  If so, I think we should postpone it for
now.

6.
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
{
..
}

I think you can add some comments atop this function to explain its
usage or how callers will use it.

I am done with the first level of code-review for this patch.  I am
sure we might need a few interface changes here and there while
integrating and testing this with other patches, but the basic idea
and code look reasonable to me.  I have modified the proposed commit
message in the attached patch, see if that looks fine to you.

To be clear, this patch can't be independently committed/tested, we
need undo logs and undo worker machinery patches to be ready as well.
I will review those next.

[1] - 
https://www.postgresql.org/message-id/CAEepm%3D2YYAtkSnens%3DTR2S%3DoRcAF9%3D2P7GPMK0wMJtxKF1QRig%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


0003-Provide-interfaces-to-store-and-fetch-undo-records-v14.patch
Description: Binary data


Mentoring for GCI-19

2019-01-08 Thread Padam Chopra
Respected Concern,

I want to join GCI as a mentor for the year 2019, please guide me about the
procedure,
thanks in anticipation.

--
Regards
Padam Chopra
Google Grand Prize Winner
Microsoft Imagine Cup India winner
TedX Event Organizer

Contact:

Email:padamchopra1...@gmail.com

More details about me and my work:

GitHub Profile: https://github.com/padamchopra

Website: http://padamchopra.me/ 


Re: Improve selectivity estimate for range queries

2019-01-08 Thread Kyotaro HORIGUCHI
Sigh... In the first place, the loop should not stop until the maximum
exponent.
Sorry, but I don't have time now...

On Tue, 8 Jan 2019 at 16:43, Kyotaro HORIGUCHI  wrote:

> Mmm.
>
> At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro
> HORIGUCHI  wrote in <
> 20190108.162638.106314087.horiguchi.kyot...@lab.ntt.co.jp>
> > FWIW, I got the following result on my environment. It seems
> > different enough if this holds on all supported platforms, though
> > there still is a case where the result of a sequence of
> > arithmetics makes false match.
> >
> > x = 0.1
> >d = 1.00e+30 match
> >d = 1.00e+31: 0.00 match
>
> Of course the "match" in the last line above is a mistake of "NOT
> match".

-- 
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
>
>


Re: [PATCH] Pass COPT and PROFILE to CXXFLAGS as well

2019-01-08 Thread Christoph Berg
Re: Andres Freund 2019-01-08 <20190108011837.n4mx7dadvojv2...@alap3.anarazel.de>
> > Here's another revision that doesn't add an extra CXXOPT variable but
> > just extends CXXFLAGS whenever COPT or PROFILE are set, which seems
> > more usable.
> 
> Why does that seem more usable? How's that supposed to be used for flags
> that aren't valid for both languages?

The existing COPT and PROFILE are already catch-all type flags that
add to CFLAGS and LDFLAGS. Possibly we should leave those alone and
only add PG_CXXFLAGS and PG_LDFLAGS?

Christoph
-- 
Senior Berater, Tel.: +49 2166 9901 187
credativ GmbH, HRB Mönchengladbach 12080, USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Displaying and dumping of table access methods

2019-01-08 Thread Haribabu Kommi
On Tue, Jan 8, 2019 at 3:02 PM Michael Paquier  wrote:

> On Mon, Jan 07, 2019 at 06:31:52PM -0800, Andres Freund wrote:
> > Huh? It's absolutely *trivial* from a buildsystem POV to run the tests
> > again with a different default AM. That's precisely why I'm talking
> > about this. Just setting PGOPTIONS='-c
> > default_table_access_method=zheap' in the new makefile target (the ms
> > run scripts are similar) is sufficient.  And we don't need to force
> > everyone to constantly run tests with e.g. both heap and zheap, it's
> > sufficient to do so on a few buildfarm machines, and whenever changing
> > AM level code.  Rerunning all the tests with a different AM is just
> > setting the same environment variable, but running check-world as the
> > target.
>

PGOPTIONS or any similar options are good for AM developers
to test their AMs with all the existing PostgreSQL features.


> Another point is that having default_table_access_method facilitates
> the restore of tables across AMs similarly to tablespaces, so CREATE
> TABLE dumps should not include the AM part.
>

+1 to the above approach to dump "set default_table_access_method".


> > And even if you were to successfully argue that it's sufficient during
> > normal development to only have a few zheap specific additional tests,
> > we'd certainly want to make it possible to occasionally explicitly run
> > the rest of the tests under zheap to see whether additional stuff has
> > been broken - and that's much harder to sift through if there's a lot of
> > spurious test failures due to \d[+] outputting additional/differing
> > data.
>
> The specific-heap tests could be included as an extra module in
> src/test/modules easily, so removing from the main tests what is not
> completely transparent may make sense.  Most users use make-check to
> test a patch quickly, so we could miss some bugs because of that
> during review.  Still, people seem to be better-educated lately in the
> fact that they need to do an actual check-world when checking a patch
> at full.  So personally I can live with a split where it makes sense.
> Being able to easily validate an AM implementation would be nice.
> Isolation tests may be another deal though for DMLs.
>
> > We are working seriously hard on making AMs pluggable. Zheap is not yet,
> > and won't be that soon, part of core. The concerns for an in-core zheap
> > (which needs to maintain the test infrastructure during the remainder of
> > its out-of-core development!) and out-of-core AMs are pretty aligned.  I
> > don't get your confusion.
>
> I would imagine that a full-fledged AM should be able to maintain
> compatibility with the full set of queries that heap is able to
> support, so if you can make the tests transparent enough so as they
> can be run for any AMs without alternate input in the core tree, then
> that's a goal worth it.  Don't you have plan inconsistencies as well
> with zheap?
>
> In short, improving portability incrementally is good for the
> long-term prospective.  From that point of view adding the AM to \d+
> output may be a bad idea, as there are modules out of core which
> rely on psql meta-commands, and it would be nice to be able to test
> those tests as well for those plugins with different types of AMs.
>

I also agree that adding AM details to \d+ will lead to many unnecessary
failures. Currently \d+ also doesn't show all the details of the relation,
like the owner, etc.

Regards,
Haribabu Kommi
Fujitsu Australia