[bug fix] Produce a crash dump before main() on Windows

2018-02-15 Thread Tsunakawa, Takayuki
Hello,

postgres.exe on Windows doesn't output a crash dump when it crashes before 
main() is called.  The attached patch fixes this.  I'd like this to be 
back-patched.  I'll add this to the next CF.

The original problem happened on our customer's production system.  Their 
application sometimes failed to connect to the database.  That was because 
postgres.exe crashed due to an access violation (exception code C0000005).  
But there was no crash dump, so we had difficulty finding the cause.  The 
frequency was low -- about ten times in half a year.

What caused the access violation was Symantec's antivirus software.  It seems 
that the software's sysfer.dll intercepts registry access during C runtime 
library initialization, before main() is called.  So the direct cause of this 
problem is not PostgreSQL.

On the other hand, it's PostgreSQL's problem that we can't get the crash dump, 
which makes the investigation difficult.  The cause is that postmaster calls 
SetErrorMode() to disable the output of crash dumps by WER (Windows Error 
Reporting).  This error mode is inherited by postmaster's children.  If a 
crash happens before a child sets up its exception handler, no crash dump 
is produced.
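
For illustration, a minimal sketch of one possible workaround (this is not 
necessarily what the attached patch does): the parent can ask CreateProcess() 
to give the child the default error mode instead of the inherited one, so 
that a crash during the child's CRT initialization can still produce a dump.

    /* Hypothetical sketch, not the attached patch.  A child normally
     * inherits its parent's error mode; the documented flag
     * CREATE_DEFAULT_ERROR_MODE makes the child start with the system
     * default error mode instead, so WER stays enabled until the child
     * disables it itself. */
    #include <windows.h>

    static BOOL
    spawn_child(char *cmdline, PROCESS_INFORMATION *pi)
    {
        STARTUPINFO si = { sizeof(si) };

        return CreateProcess(NULL, cmdline, NULL, NULL, FALSE,
                             CREATE_DEFAULT_ERROR_MODE,
                             NULL, NULL, &si, pi);
    }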


Regards
Takayuki Tsunakawa



Attachment: crash_dump_before_main.patch


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-15 Thread Michael Paquier
On Wed, Feb 14, 2018 at 04:37:05AM +, Tsunakawa, Takayuki wrote:
> The PostgreSQL version is 9.5.  The cluster consists of a master, a
> cascading standby (SB1), and a cascaded standby (SB2).  The WAL flows
> like this: master -> SB1 -> SB2. 
> 
> The user shut down SB2 and tried to restart it, but failed with the
> following message: 
> 
> FATAL:  invalid memory alloc request size 3075129344
> 
> The master and SB1 continued to run.

This matches a couple of recent bug reports we have seen around like
this one:
https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com
I also recall this one, though in that case the user shot himself in the
foot with the failover scenario he designed:
https://www.postgresql.org/message-id/CAAc9rOyKAyzipiq7ee1%3DVbcRy2fRqV_hRujLbC0mrBkL07%3D7wQ%40mail.gmail.com

> CAUSE
> ==
> 
> total_len in the code below was about 3GB, so palloc() rejected the
> memory allocation request.

Yes, it is limited to 1GB.

> [xlogreader.c]
>   record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
>   total_len = record->xl_tot_len;
> ...
>   /*
>* Enlarge readRecordBuf as needed.
>*/
>   if (total_len > state->readRecordBufSize &&
>   !allocate_recordbuf(state, total_len))
>   {
> 
> Why was XLogRecord->xl_tot_len so big?  That's because the XLOG reader
> read the garbage portion of a WAL file, which lies immediately after
> the end of the valid WAL records. 
> 
> Why was there the garbage?  Because the cascading standby sends just
> the valid WAL records, not the whole WAL blocks, to the cascaded
> standby.  When the cascaded standby receives those WAL records and
> writes them into a recycled WAL file, the last WAL block in the file
> contains a mix of valid WAL records and leftover garbage.

Wait a minute here: recycled past WAL segments would be filled with
zeros before being used, no?

> Why did the XLOG reader try to allocate memory for garbage?  Doesn't
> it stop reading the WAL?  As the following code shows, when a WAL
> record header crosses a WAL block boundary, the XLOG reader first
> allocates memory for the whole record, and then checks the validity of
> the record header as soon as it has read the entire header.
>
> [...]
> 
> FIX
> ==
> 
> One idea is to defer the memory allocation until the entire record
> header is read and ValidXLogRecordHeader() is called.  However,
> ValidXLogRecordHeader() might misjudge the validity if the garbage
> contains an xl_prev value that happens to look correct. 

It seems to me that the consolidation of the page read should happen
directly in xlogreader.c, and not in one of its callbacks, so that no
garbage data is presented back to callers using their own XLogReader.
I think that you need to initialize XLogReaderState->readBuf directly in
ReadPageInternal() before reading a page and you should be good.  Your
patch visibly addresses only one portion of the problem; what about
other code paths using xlogreader.c's APIs, like pg_rewind, the 2PC
code, and such?
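
To illustrate the idea with a self-contained sketch (plain C, not the actual
xlogreader.c code; the names merely mirror the XLogReader fields): zeroing
the buffer before each page read means any bytes past the valid data are
zeros, so a stale xl_tot_len can no longer be read out of leftover buffer
contents.

    #include <stdio.h>
    #include <string.h>

    #define XLOG_BLCKSZ 8192

    static char readBuf[XLOG_BLCKSZ];

    /* Read one page worth of WAL into readBuf, zero-filling first so a
     * short read leaves zeros, not the previously read page's bytes. */
    static int
    read_page(FILE *walfile, long offset, int reqLen)
    {
        int     nread;

        memset(readBuf, 0, XLOG_BLCKSZ);    /* the suggested initialization */
        if (fseek(walfile, offset, SEEK_SET) != 0)
            return -1;
        nread = (int) fread(readBuf, 1, XLOG_BLCKSZ, walfile);
        return (nread >= reqLen) ? nread : -1;
    }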

> FYI, the following unsolved problem may be solved, too.
> 
> https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com

Yeah, I noticed this one too before going through your message in
detail ;)

Please note that your patch has some unnecessary garbage in two places:
- * Portions Copyright (c) 2010-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2010-2017, PostgreSQL Global Development Group
[...]
-* Only superusers and members of pg_read_all_stats can see details.
-* Other users only get the pid value
+* Only superusers can see details. Other users only get the pid valu

You may want to indent your patch as well before sending it.

+   if (zbuffer == NULL)
+   zbuffer = palloc0(XLOG_BLCKSZ);
You could just use a static buffer which is MemSet'd with 0 only if
necessary.
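
A sketch of that suggestion, assuming XLOG_BLCKSZ and MemSet are in scope as
elsewhere in the backend (static storage is zero-initialized anyway, so the
flag only avoids redundant clearing):

    static char zbuffer[XLOG_BLCKSZ];
    static bool zbuffer_zeroed = false;

    if (!zbuffer_zeroed)
    {
        MemSet(zbuffer, 0, XLOG_BLCKSZ);
        zbuffer_zeroed = true;
    }
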
--
Michael




Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-02-15 Thread Michael Paquier
On Mon, Feb 12, 2018 at 04:33:51PM +0530, Pavan Deolasee wrote:
> I'm not sure if Michael has spotted a real problem or whether that was
> just a concern.  He himself later rightly pointed out that when a WAL
> file is switched, the old file is filled with zeros.  So I don't see a
> problem there.  Maybe I am missing something and Michael can explain
> further.

Sorry for my late reply here.  I have been eyeballing the code for some
time to be sure that I was not missing something.  A new segment is
filled with zeros when it is freshly created (see XLogFileInit), or when
doing a forced segment switch, but the checkpointer just renames old
files, which can perfectly well still contain past data that is reused
afterwards.  So the old data is only overwritten once a second recycling
cycle happens.  One thing that worries me, though, is timeline switches.
The last, partial, WAL segment may still have garbage data.  Since 9.5
that segment is renamed with a dedicated suffix, but it could be renamed
back by an operator willing to finish replay up to the end of the
timeline where this file is the last one, and it could have junk data
in the area your checks rely on.  There may be other things I have not
considered, but at least that's one problem.
--
Michael




Re: Removing useless #include's.

2018-02-15 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> At Thu, 15 Feb 2018 11:12:05 -0500, Tom Lane  wrote in 
> <6748.1518711...@sss.pgh.pa.us>
>> Kyotaro HORIGUCHI  writes:
>>> I found 641 includes that are just removable with no side effects
>>> with two exceptions.

>> I tend to be a bit suspicious of this sort of thing, especially for
>> old files that have been through previous rounds of "unnecessary
>> include" removal.

> As another point of view, placing an #include just because the
> source file uses a definition from that file is also
> reasonable.  Header files are kept safe to include multiple times.
> I wouldn't be surprised if someone said that this is rather harmful,
> and I'm glad to read the clear reasoning.

Note that I'm *not* saying there's nothing to be done here.  Scanning
through your patch quickly, the #include "postgres.h" in knapsack.h
is clearly contrary to project policy, and there should surely be
no need for any backend .c file to #include elog.h or palloc.h
because postgres.h pulls in both of those.  But I'd like to see a bit
more analysis of most of the rest of these changes.  The bad experience
we had with the last round of #include-removal was really because some
poor decisions had been taken before that about which headers should
include which other headers.  I think it'd be a good idea to work backward
and see whether any of these proposed changes are pointing to header
inclusions that maybe weren't well thought out.

regards, tom lane



Re: reorganizing partitioning code

2018-02-15 Thread Kyotaro HORIGUCHI
Hello. I'd like to make a humble comment.

At Thu, 15 Feb 2018 19:31:47 +0900, Amit Langote 
 wrote in 
<8906c861-ea47-caee-c860-eff8d7e1d...@lab.ntt.co.jp>
> Added to CF here: https://commitfest.postgresql.org/17/1520/

The reorganization adds and removes header files in .[ch] files.  That
can easily leave useless includes behind, and they're hardly noticed.
The following files touched in this patch have such useless includes
after the patch.

On the other hand, a naive decision on this kind of cleanup can lead
to trouble [1], so I don't insist that all of the following *should* be
amended, especially for rel.h.

[1] https://www.postgresql.org/message-id/6748.1518711...@sss.pgh.pa.us


 nodeModifyTable.c:
> +#include "rewrite/rewriteManip.h"

rewriteManip.h is changed to include rel.h, so rel.h no longer
needs to be included in this file.  (It is also included via lmgr.h,
so it was needless even before this patch.)

 hba.c:
> +#include "catalog/objectaddress.h"

objectaddress.h includes acl.h so acl.h is no longer useful.

 joinrels.c:
> +#include "utils/partcache.h"

partcache.h includes lsyscache.h.

 prepunion.c:
> +#include "utils/partcache.h"

partcache.h includes lsyscache.h and partcache.h is included in
rel.h. So partcache.h and lsyscache.h are not required.

 utility.c:
> +#include "utils/rel.h"

rel.h includes xlog.h (and xlog.h includes fd.h). The last two
are removable.

 partcache.c:
parsenodes.h is included from some other files,
rel.h is included from rewriteManip.h, and
partcache.h is included from rel.h.
As a result, parsenodes.h, rel.h, and partcache.h are not required.

 relcache.c:
> +#include "utils/partcache.h"

lsyscache.h is included by partcache.h.

 rel.h:
> +#include "utils/partcache.h"

partcache.h includes fmgr.h and relcache.h so the last two are
no longer useful.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Let's remove DSM_IMPL_NONE.

2018-02-15 Thread Kyotaro HORIGUCHI
Hello, thank you for the comment.

At Fri, 16 Feb 2018 13:07:08 +0900, Michael Paquier  wrote 
in <20180216040708.ga1...@paquier.xyz>
> On Thu, Feb 15, 2018 at 07:58:57PM +0900, Kyotaro HORIGUCHI wrote:
> It is almost a just-delete-it work, but I see two issues raised so
> far.
> 
The patch looks good as-is.  This simplifies a couple of code paths
deciding if parallel queries can be used or not, so future features in
need of the same decision-making won't fall into the same trap as the
recent parallel btree builds.  So I am +1 for having the buildfarm
express its opinion about it.

Agreed. I'd like to see how buildfarm machines respond to this.

> 1. That change can cause regression test failures on some
>    buildfarm machines[3].
> 
> The reason is that initdb creates a database with
> max_connections=10 when DSM probing fails due to semaphore
> exhaustion, even though the regression tests require at least 20
> connections.  At the time, this was "fixed" by setting
> dynamic_shared_memory_type=none while detecting the number of
> usable max_connections and shared buffers[4].  Regardless of
> whether the probing succeeded or not, initdb finally creates a
> database in which some DSM implementation is activated, since
> initdb never chooses "none".  So the test items for parallel
> query actually succeeded.
> 
> For the reason above, I believe just increasing the minimum
> fallback value of max_connections to 20 will work[5].  Anyway, it
> is a problem to be fixed that initdb creates an unusable
> database when it uses the fallback value of max_connections=10.
> > 
> > [3]
> > https://www.postgresql.org/message-id/ca+tgmoyhiigrcvssjhmbsebmof2zx_9_9rwd75cwvu99yrd...@mail.gmail.com
> 
Do buildfarm machines actually select max_connections = 10 now?  We
would surely have seen failures, since max_wal_senders has its default
value set to 10 as well.  Hence this is a separate problem.

I'm not entirely confident, but that's because the current initdb
runs its probing with dynamic_shared_memory_type=none.  So the same
buildfarm animal could fail if initdb probed with
dynamic_shared_memory_type=sysv and semaphores were exhausted at the
time.

> > [4] Commit: d41ab71712a4457ed39d5471b23949872ac91def
> > 
> > [5] 
> > https://www.postgresql.org/message-id/20180209.170823.42008365.horiguchi.kyot...@lab.ntt.co.jp
> > 
> > 
> > > Feel free to.  Be just careful that the connection attempts in
> > > test_config_settings() should try to use the DSM implementation defined
> > > by choose_dsm_implementation().
> > 
> Thank you for the advice.  The probing fails with the default
> setting if POSIX shared memory somehow fails on a platform like
> Linux that is usually expected to have it.  It's enough to choose
> the implementation before probing.  Some messages from initdb are
> transposed as a result.
> > 
> > |   creating directory /home/horiguti/data/data_ndsm ... ok
> > |   creating subdirectories ... ok
> > | + selecting dynamic shared memory implementation ... posix
> > |   selecting default max_connections ... 100
> > |   selecting default shared_buffers ... 128MB
> > | - selecting dynamic shared memory implementation ... posix
> > 
> > Even though probing can end with failure in the case of [3], it
> > will not be a problem with the patch [4].
> 
> That's fine by me, as this is actually the purpose of your patch.
> 
> +++ b/src/include/storage/dsm_impl.h
> @@ -14,7 +14,6 @@
>  #define DSM_IMPL_H
> 
>  /* Dynamic shared memory implementations. */
> -#define DSM_IMPL_NONE		0
>  #define DSM_IMPL_POSIX		1
>  #define DSM_IMPL_SYSV		2
>  #define DSM_IMPL_WINDOWS	3
> You may as well assign numbers from 0 here and reorder the whole set.

The reason is that I preferred to leave 0 as an unused default value,
but that has no significance after a clean build.  I'm fine with
reordering (or renumbering) them.
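
For illustration, the renumbering suggested above would make the quoted
block look like this (a sketch covering only the constants shown in the
diff):

    /* Dynamic shared memory implementations. */
    #define DSM_IMPL_POSIX      0
    #define DSM_IMPL_SYSV       1
    #define DSM_IMPL_WINDOWS    2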

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: spin.c includes pg_sema.h even if unnecessary

2018-02-15 Thread Kyotaro HORIGUCHI
At Thu, 15 Feb 2018 10:01:57 -0800, Andres Freund  wrote in 
<20180215180157.7q55aytrhif7v...@alap3.anarazel.de>
> On 2018-02-15 20:11:07 +0900, Kyotaro HORIGUCHI wrote:
> > As in another mail just before, spin.c seems a bit strange
> > (without actual harm).
> > 
> > spin.h doesn't include pg_sema.h when HAVE_SPINLOCKS is defined,
> > but spin.c always includes it even in that case.  The file is
> > included only to use sizeof(PGSemaphore) to calculate
> > SpinlockSemaSize as 0.
> > 
> > The code that requires PGSemaphore is deactivated when the
> > symbol is defined in all other places, so it seems to me that we
> > ought to refrain from using it there, too.  The attached patch does
> 
> IDK, I don't quite see the point of the change here...

No actual gain, but I feel uneasy that an utterly-unused symbol is
used just to yield zero.  From another point of view, it seems
inconsistent that a header file is disabled in one file but not in
another under the same configuration.
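
Concretely, the change amounts to mirroring spin.h's guard in spin.c; a
sketch of the idea (not necessarily the exact attached patch):

    /* Include pg_sema.h only when the semaphore-based fallback spinlock
     * implementation is compiled in, exactly as spin.h does. */
    #ifndef HAVE_SPINLOCKS
    #include "storage/pg_sema.h"
    #endif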

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: non-bulk inserts and tuple routing

2018-02-15 Thread Amit Langote
On 2018/02/16 12:41, Etsuro Fujita wrote:
> (2018/02/16 10:49), Amit Langote wrote:
>> I think you're right.  If node->returningLists is non-NULL at all,
>> ExecInitModifyTable() would've initialized the needed slot and expression
>> context.  I added Assert()s to that effect.
> 
> OK, but one thing I'd like to ask is:
> 
> +   /*
> +    * Use the slot that would have been set up in ExecInitModifyTable()
> +    * for the output of the RETURNING projection(s).  Just make sure to
> +    * assign its rowtype using the RETURNING list.
> +    */
> +   Assert(mtstate->ps.ps_ResultTupleSlot != NULL);
> +   tupDesc = ExecTypeFromTL(returningList, false);
> +   ExecAssignResultType(&mtstate->ps, tupDesc);
> +   slot = mtstate->ps.ps_ResultTupleSlot;
> 
> Do we need that assignment here?

I guess you mean the assignment of the rowtype, that is, the
ExecAssignResultType() line.  On looking at this some more, it seems
we don't need ExecAssignResultType() here, as you suspected, because we
want the RETURNING projection output to use the rowtype of the first of
the returningLists, and that's what mtstate->ps.ps_ResultTupleSlot has
been set up to use in the first place.  So, removed the
ExecAssignResultType().

Attached v9.  Thanks again for the review!

Regards,
Amit
From e74ce73c574bc1892fa957b105adccf1eada1c5f Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 19 Dec 2017 16:20:09 +0900
Subject: [PATCH v9] During tuple-routing, initialize per-partition objects
 lazily

Those objects include ResultRelInfo, tuple conversion map,
WITH CHECK OPTION quals and RETURNING projections.

This means we don't allocate these objects for partitions that are
never inserted into.
---
 src/backend/commands/copy.c|  10 +-
 src/backend/executor/execPartition.c   | 305 -
 src/backend/executor/nodeModifyTable.c | 131 ++
 src/include/executor/execPartition.h   |   9 +-
 4 files changed, 248 insertions(+), 207 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b3933df9af..118452b602 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2470,7 +2470,7 @@ CopyFrom(CopyState cstate)
PartitionTupleRouting *proute;
 
proute = cstate->partition_tuple_routing =
-   ExecSetupPartitionTupleRouting(NULL, cstate->rel, 1, estate);
+   ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
/*
 * If we are capturing transition tuples, they may need to be
@@ -2607,6 +2607,14 @@ CopyFrom(CopyState cstate)
 */
saved_resultRelInfo = resultRelInfo;
resultRelInfo = proute->partitions[leaf_part_index];
+   if (resultRelInfo == NULL)
+   {
+   resultRelInfo = ExecInitPartitionInfo(NULL,
+                                         saved_resultRelInfo,
+                                         proute, estate,
+                                         leaf_part_index);
+   Assert(resultRelInfo != NULL);
+   }
 
/* We do not yet have a way to insert into a foreign 
partition */
if (resultRelInfo->ri_FdwRoutine)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 4048c3ebc6..5fa6e2de09 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -44,18 +44,23 @@ static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
  *
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
+ *
+ * While we allocate the arrays of pointers of ResultRelInfo and
+ * TupleConversionMap for all partitions here, actual objects themselves are
+ * lazily allocated for a given partition if a tuple is actually routed to it;
+ * see ExecInitPartitionInfo.  However, if the function is invoked for update
+ * tuple routing, caller would already have initialized ResultRelInfo's for
+ * some of the partitions, which are reused and assigned to their respective
+ * slot in the aforementioned array.
  */
 PartitionTupleRouting *
-ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
-                              Relation rel, Index resultRTindex,
-                              EState *estate)
+ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
TupleDesc   tupDesc = RelationGetDescr(rel);
List   *leaf_parts;
ListCell   *cell;

Re: Removing useless #include's.

2018-02-15 Thread Kyotaro HORIGUCHI
Hello.

At Thu, 15 Feb 2018 11:12:05 -0500, Tom Lane  wrote in 
<6748.1518711...@sss.pgh.pa.us>
> Kyotaro HORIGUCHI  writes:
> > While looking at some patch, just out of curiosity, I checked for
> > redundant #include's in the source tree (except
> > contrib).  "Redundant" here means that a file is included in
> > another include file nearby.
> 
> > I found 641 includes that are just removable with no side effects
> > with two exceptions.
> 
> I tend to be a bit suspicious of this sort of thing, especially for
> old files that have been through previous rounds of "unnecessary
> include" removal.  It's usually a good idea to ask *why* is a header
> no longer needed?  The answer, usually, is that somebody added the
> same #include to some other header, and it's not uncommon for that
> to have been a bad idea.  It's usually best to minimize cross-header
> inclusions, IMV, and it's always necessary to exercise judgment
> when adding one.

As another point of view, placing an #include just because the
source file uses a definition from that file is also
reasonable.  Header files are kept safe to include multiple times.
I wouldn't be surprised if someone said that this is rather harmful,
and I'm glad to read the clear reasoning.

> We've also had more than a few problems with automatic scripts deciding
> that an #include could be removed because they weren't testing with the
> combination of build options that made it necessary.
> 
> See for instance commits 6416a82a6 through 1609797c2 for some history
> of poorly managed #include removal.

I'm surprised to find no circular/mutual dependencies, nor even
multilevel inclusion, among the header files.  I think I understand the
reason.

In this patch I didn't touch the header files, since that felt somewhat
dangerous.  But anyway, I understand that doing all of this at once can
be dangerous.

Thank you for the suggestion, Tom.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Let's remove DSM_IMPL_NONE.

2018-02-15 Thread Michael Paquier
On Thu, Feb 15, 2018 at 10:00:39AM -0800, Andres Freund wrote:
> Hm, I'm not quite convinced by this. Seems to make testing a bunch of
> codepaths harder.  I think it's fine to say that pg doesn't work
> correctly with them disabled though.

Well, for what it's worth, that's one thing less to forget when
implementing features related to parallel query.  That's the lesson
88ef48c teaches us here, however unlikely it is that anybody would
disable it after an initdb.
--
Michael




Re: Let's remove DSM_IMPL_NONE.

2018-02-15 Thread Michael Paquier
On Thu, Feb 15, 2018 at 07:58:57PM +0900, Kyotaro HORIGUCHI wrote:
> It is almost a just-delete-it work, but I see two issues raised so
> far.

The patch looks good as-is.  This simplifies a couple of code paths
deciding if parallel queries can be used or not, so future features in
need of the same decision-making won't fall into the same trap as the
recent parallel btree builds.  So I am +1 for having the buildfarm
express its opinion about it.

> 1. That change can cause regression test failures on some
>    buildfarm machines[3].
> 
> The reason is that initdb creates a database with
> max_connections=10 when DSM probing fails due to semaphore
> exhaustion, even though the regression tests require at least 20
> connections.  At the time, this was "fixed" by setting
> dynamic_shared_memory_type=none while detecting the number of
> usable max_connections and shared buffers[4].  Regardless of
> whether the probing succeeded or not, initdb finally creates a
> database in which some DSM implementation is activated, since
> initdb never chooses "none".  So the test items for parallel
> query actually succeeded.
> 
> For the reason above, I believe just increasing the minimum
> fallback value of max_connections to 20 will work[5].  Anyway, it
> is a problem to be fixed that initdb creates an unusable
> database when it uses the fallback value of max_connections=10.
> 
> [3]
> https://www.postgresql.org/message-id/ca+tgmoyhiigrcvssjhmbsebmof2zx_9_9rwd75cwvu99yrd...@mail.gmail.com

Do buildfarm machines actually select max_connections = 10 now?  We
would surely have seen failures, since max_wal_senders has its default
value set to 10 as well.  Hence this is a separate problem.

> [4] Commit: d41ab71712a4457ed39d5471b23949872ac91def
> 
> [5] 
> https://www.postgresql.org/message-id/20180209.170823.42008365.horiguchi.kyot...@lab.ntt.co.jp
> 
> 
> > Feel free to.  Be just careful that the connection attempts in
> > test_config_settings() should try to use the DSM implementation defined
> > by choose_dsm_implementation().
> 
> Thank you for the advice.  The probing fails with the default
> setting if POSIX shared memory somehow fails on a platform like
> Linux that is usually expected to have it.  It's enough to choose
> the implementation before probing.  Some messages from initdb are
> transposed as a result.
> 
> |   creating directory /home/horiguti/data/data_ndsm ... ok
> |   creating subdirectories ... ok
> | + selecting dynamic shared memory implementation ... posix
> |   selecting default max_connections ... 100
> |   selecting default shared_buffers ... 128MB
> | - selecting dynamic shared memory implementation ... posix
> 
> Even though probing can end with failure in the case of [3], it
> will not be a problem with the patch [4].

That's fine by me, as this is actually the purpose of your patch.

+++ b/src/include/storage/dsm_impl.h
@@ -14,7 +14,6 @@
 #define DSM_IMPL_H

 /* Dynamic shared memory implementations. */
-#define DSM_IMPL_NONE		0
 #define DSM_IMPL_POSIX		1
 #define DSM_IMPL_SYSV		2
 #define DSM_IMPL_WINDOWS	3
You may as well assign numbers from 0 here and reorder the whole set.
--
Michael




Re: [HACKERS] why not parallel seq scan for slow functions

2018-02-15 Thread Ashutosh Bapat
On Thu, Feb 15, 2018 at 7:47 PM, Amit Kapila  wrote:
> On Thu, Feb 15, 2018 at 4:18 PM, Ashutosh Bapat
>  wrote:
>> I happened to look at the patch for something else. But here are some
>> comments. If any of those have been already discussed, please feel
>> free to ignore. I have gone through the thread cursorily, so I might
>> have missed something.
>>
>> In grouping_planner() we call query_planner() first which builds the
>> join tree and creates paths, then calculate the target for scan/join
>> rel which is applied on the topmost scan join rel. I am wondering
>> whether we can reverse this order to calculate the target list of
>> scan/join relation and pass it to standard_join_search() (or the hook)
>> through query_planner().
>>
>
> I think the reason for doing in that order is that we can't compute
> target's width till after query_planner().  See the below note in
> code:
>
> /*
> * Convert the query's result tlist into PathTarget format.
> *
> * Note: it's desirable to not do this till after query_planner(),
> * because the target width estimates can use per-Var width numbers
> * that were obtained within query_planner().
> */
> final_target = create_pathtarget(root, tlist);
>
> Now, I think we can try to juggle the code in a way that the width can
> be computed later, but the rest can be done earlier.

/* Convenience macro to get a PathTarget with valid cost/width fields */
#define create_pathtarget(root, tlist) \
set_pathtarget_cost_width(root, make_pathtarget_from_tlist(tlist))
create_pathtarget already works that way. We will need to split it.

Create the PathTarget without widths, then apply width estimates once
we know the widths of the Vars, somewhere here in query_planner():
/*
 * We should now have size estimates for every actual table involved in
 * the query, and we also know which if any have been deleted from the
 * query by join removal; so we can compute total_table_pages.
 *
 * Note that appendrels are not double-counted here, even though we don't
 * bother to distinguish RelOptInfos for appendrel parents, because the
 * parents will still have size zero.
 *
The next step is building the join tree. Set the pathtarget before that.
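
A sketch of that split (the two underlying functions are the ones named in
the macro above; their placement inside query_planner() is my assumption):

    /* Step 1, before building the join tree: build the target without
     * cost/width estimates. */
    PathTarget *final_target = make_pathtarget_from_tlist(tlist);

    /* ... query_planner() computes per-Var width numbers here ... */

    /* Step 2, once widths are known: fill in cost/width. */
    set_pathtarget_cost_width(root, final_target);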

> However, I think
> that will be a somewhat major change

I agree.

> still won't address all kinds of
> cases (like ordered paths) unless we can try to get all kinds of
> targets pushed down in the call stack.

I didn't understand that.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: non-bulk inserts and tuple routing

2018-02-15 Thread Etsuro Fujita

(2018/02/16 10:49), Amit Langote wrote:
> On 2018/02/15 21:10, Etsuro Fujita wrote:
>> here are some minor comments:
>>
>> o On changes to ExecCleanupTupleRouting:
>>
>> -   ExecCloseIndices(resultRelInfo);
>> -   heap_close(resultRelInfo->ri_RelationDesc, NoLock);
>> +   if (resultRelInfo)
>> +   {
>> +   ExecCloseIndices(resultRelInfo);
>> +   heap_close(resultRelInfo->ri_RelationDesc, NoLock);
>> +   }
>>
>> You might check at the top of the loop whether resultRelInfo is NULL and
>> if so do continue.  I think that would save cycles a bit.
>
> Good point, done.

Thanks.

>> o On ExecInitPartitionInfo:
>>
>> +   int firstVarno;
>> +   Relation    firstResultRel;
>>
>> My old compiler got "variable may be used uninitialized" warnings.
>
> Fixed.  Actually, I moved those declarations from here into the blocks
> where they're actually needed.

OK, my compiler gets no warnings now.

>> +   /*
>> +    * Build the RETURNING projection if any for the partition.  Note that
>> +    * we didn't build the returningList for each partition within the
>> +    * planner, but simple translation of the varattnos will suffice.
>> +    * This only occurs for the INSERT case; in the UPDATE/DELETE cases,
>> +    * ExecInitModifyTable() would've initialized this.
>> +    */
>>
>> I think the last comment should be the same as for WCO lists: "This only
>> occurs for the INSERT case or in the case of UPDATE for which we didn't
>> find a result rel above to reuse."
>
> Fixed.  The "above" is no longer needed, because there is no code left in
> ExecInitPartitionInfo() to find UPDATE result rels to reuse.  That code is
> now in ExecSetupPartitionTupleRouting().

OK, that makes sense.

>> +   /*
>> +    * Initialize result tuple slot and assign its rowtype using the first
>> +    * RETURNING list.  We assume the rest will look the same.
>> +    */
>> +   tupDesc = ExecTypeFromTL(returningList, false);
>> +
>> +   /* Set up a slot for the output of the RETURNING projection(s) */
>> +   ExecInitResultTupleSlot(estate, &mtstate->ps);
>> +   ExecAssignResultType(&mtstate->ps, tupDesc);
>> +   slot = mtstate->ps.ps_ResultTupleSlot;
>> +
>> +   /* Need an econtext too */
>> +   if (mtstate->ps.ps_ExprContext == NULL)
>> +   ExecAssignExprContext(estate, &mtstate->ps);
>> +   econtext = mtstate->ps.ps_ExprContext;
>>
>> Do we need this initialization?  I think we would already have the slot
>> and econtext initialized when we get here.
>
> I think you're right.  If node->returningLists is non-NULL at all,
> ExecInitModifyTable() would've initialized the needed slot and expression
> context.  I added Assert()s to that effect.

OK, but one thing I'd like to ask is:

+   /*
+    * Use the slot that would have been set up in ExecInitModifyTable()
+    * for the output of the RETURNING projection(s).  Just make sure to
+    * assign its rowtype using the RETURNING list.
+    */
+   Assert(mtstate->ps.ps_ResultTupleSlot != NULL);
+   tupDesc = ExecTypeFromTL(returningList, false);
+   ExecAssignResultType(&mtstate->ps, tupDesc);
+   slot = mtstate->ps.ps_ResultTupleSlot;

Do we need that assignment here?

> Attached updated patch.

Thanks for updating the patch!

Best regards,
Etsuro Fujita



Re: FOR EACH ROW triggers on partitioned tables

2018-02-15 Thread Amit Langote
On 2018/02/16 6:55, Alvaro Herrera wrote:
> Amit Langote wrote:
>> On 2018/02/15 6:26, Alvaro Herrera wrote:
>>> Another option is to rethink this feature from the ground up: instead of
>>> cloning catalog rows for each children, maybe we should have the trigger
>>> lookup code, when running DML on the child relation (the partition),
>>> obtain trigger entries not only for the child relation itself but also
>>> for its parents recursively -- so triggers defined in the parent are
>>> fired for the partitions, too.  I'm not sure what implications this has
>>> for constraint triggers.
>>>
>>> The behavior should be the same, except that you cannot modify the
>>> trigger (firing conditions, etc) on the partition individually -- it
>>> works at the level of the whole partitioned table instead.
>>
>> You mean to fire these triggers only if the parent table (not a child
>> table/partition) is addressed in the DML, right?  If the table directly
>> addressed in the DML is a partition whose parent has a row-level trigger,
>> then that trigger should not get fired, I suppose.
> 
> No, I think that would be strange and would cause data inconsistencies.
> Inserting directly into the partition is seen as a performance
> optimization (compared to inserting into the partitioned table), so we
> don't get to skip firing the triggers defined on the parent; otherwise
> the behavior would become different.  In other words, the performance
> optimization would break the database.

OK, that makes sense.

Thanks,
Amit




Re: non-bulk inserts and tuple routing

2018-02-15 Thread Amit Langote
Fujita-san,

Thanks for the review.

On 2018/02/15 21:10, Etsuro Fujita wrote:
> (2018/02/13 10:12), Amit Langote wrote:
>> Updated patch is attached.
> 
> Thanks, here are some minor comments:
> 
> o On changes to ExecCleanupTupleRouting:
> 
> -   ExecCloseIndices(resultRelInfo);
> -   heap_close(resultRelInfo->ri_RelationDesc, NoLock);
> +   if (resultRelInfo)
> +   {
> +   ExecCloseIndices(resultRelInfo);
> +   heap_close(resultRelInfo->ri_RelationDesc, NoLock);
> +   }
> 
> You might check at the top of the loop whether resultRelInfo is NULL and
> if so do continue.  I think that would save cycles a bit.

Good point, done.

> o On ExecInitPartitionInfo:
> 
> +   int firstVarno;
> +   Relation    firstResultRel;
> 
> My old compiler got "variable may be used uninitialized" warnings.

Fixed.  Actually, I moved those declarations from here into the blocks
where they're actually needed.

> +   /*
> +    * Build the RETURNING projection if any for the partition.  Note that
> +    * we didn't build the returningList for each partition within the
> +    * planner, but simple translation of the varattnos will suffice.
> +    * This only occurs for the INSERT case; in the UPDATE/DELETE cases,
> +    * ExecInitModifyTable() would've initialized this.
> +    */
> 
> I think the last comment should be the same as for WCO lists: "This only
> occurs for the INSERT case or in the case of UPDATE for which we didn't
> find a result rel above to reuse."

Fixed.  The "above" is no longer needed, because there is no code left in
ExecInitPartitionInfo() to find UPDATE result rels to reuse.  That code is
now in ExecSetupPartitionTupleRouting().

> +   /*
> +    * Initialize result tuple slot and assign its rowtype using the first
> +    * RETURNING list.  We assume the rest will look the same.
> +    */
> +   tupDesc = ExecTypeFromTL(returningList, false);
> +
> +   /* Set up a slot for the output of the RETURNING projection(s) */
> +   ExecInitResultTupleSlot(estate, &mtstate->ps);
> +   ExecAssignResultType(&mtstate->ps, tupDesc);
> +   slot = mtstate->ps.ps_ResultTupleSlot;
> +
> +   /* Need an econtext too */
> +   if (mtstate->ps.ps_ExprContext == NULL)
> +   ExecAssignExprContext(estate, &mtstate->ps);
> +   econtext = mtstate->ps.ps_ExprContext;
> 
> Do we need this initialization?  I think we would already have the slot
> and econtext initialized when we get here.

I think you're right.  If node->returningLists is non-NULL at all,
ExecInitModifyTable() would've initialized the needed slot and expression
context.  I added Assert()s to that effect.

> Other than that, the patch looks good to me.
> 
> Sorry for the delay.

No problem! Thanks again.

Attached updated patch.

Thanks,
Amit
From d4266a3072640b64f758e3de1f896fffbd9332ae Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 19 Dec 2017 16:20:09 +0900
Subject: [PATCH v8] During tuple-routing, initialize per-partition objects
 lazily

Those objects include ResultRelInfo, tuple conversion map,
WITH CHECK OPTION quals and RETURNING projections.

This means we don't allocate these objects for partitions that are
never inserted into.
---
 src/backend/commands/copy.c|  10 +-
 src/backend/executor/execPartition.c   | 294 +++--
 src/backend/executor/nodeModifyTable.c | 131 ++-
 src/include/executor/execPartition.h   |   9 +-
 4 files changed, 237 insertions(+), 207 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b3933df9af..118452b602 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2470,7 +2470,7 @@ CopyFrom(CopyState cstate)
PartitionTupleRouting *proute;
 
proute = cstate->partition_tuple_routing =
-   ExecSetupPartitionTupleRouting(NULL, cstate->rel, 1, estate);
+   ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
/*
 * If we are capturing transition tuples, they may need to be
@@ -2607,6 +2607,14 @@ CopyFrom(CopyState cstate)
 */
saved_resultRelInfo = resultRelInfo;
resultRelInfo = proute->partitions[leaf_part_index];
+   if (resultRelInfo == NULL)
+   {
+   resultRelInfo = ExecInitPartitionInfo(NULL,
+                                         saved_resultRelInfo,
+                                         proute, estate,
+                                         leaf_part_index);
+   Assert(resultRelInfo != NULL);
+   }
 
/* We do not 

Re: JIT compiling with LLVM v9.1

2018-02-15 Thread Andreas Karlsson

On 02/15/2018 06:23 PM, Andres Freund wrote:
>> 2) When I build with --with-cassert I expect the assertions to be there,
>> both in the binaries and the bitcode. Is that just a bug or is there any
>> thought behind this?
>
> Not sure what you mean by that. NDEBUG and cassert are independent
> mechanisms, no?

Yeah, I think I just managed to confuse myself there.

The actual issue is that --with-llvm changes whether NDEBUG is set,
which is quite surprising.  I would not expect assertions to be disabled
in the frontend code just because I compiled PostgreSQL with llvm.


Andreas



Re: FOR EACH ROW triggers on partitioned tables

2018-02-15 Thread Alvaro Herrera
Amit Langote wrote:
> On 2018/02/15 6:26, Alvaro Herrera wrote:
> > Another option is to rethink this feature from the ground up: instead of
> > cloning catalog rows for each children, maybe we should have the trigger
> > lookup code, when running DML on the child relation (the partition),
> > obtain trigger entries not only for the child relation itself but also
> > for its parents recursively -- so triggers defined in the parent are
> > fired for the partitions, too.  I'm not sure what implications this has
> > for constraint triggers.
> >
> > The behavior should be the same, except that you cannot modify the
> > trigger (firing conditions, etc) on the partition individually -- it
> > works at the level of the whole partitioned table instead.
> 
> You mean to fire these triggers only if the parent table (not a child
> table/partition) is addressed in the DML, right?  If the table directly
> addressed in the DML is a partition whose parent has a row-level trigger,
> then that trigger should not get fired, I suppose.

No, I think that would be strange and would cause data inconsistencies.
Inserting directly into the partition is seen as a performance
optimization (compared to inserting into the partitioned table), so we
don't get to skip firing the triggers defined on the parent; otherwise
the behavior would become different.  In other words, the performance
optimization would break the database.

Example: suppose the trigger is used to maintain an audit record trail.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-15 Thread Claudio Freire
On Thu, Feb 15, 2018 at 4:47 PM, Claudio Freire  wrote:
> On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada  
> wrote:

>>> The final FSM vacuum pass isn't partial, to finally correct all those
>>> small inconsistencies.
>>
>> Yes, but the purpose of this patch is to prevent table bloating from
>> concurrent modifications?
>
> Yes, by *marking* unmarked space. If the slot is above the threshold,
> it means there's already marked free space on that branch, and search
> will go into it already, so no pressing need to refine the mark. The
> space already marked can serve while vacuum makes further progress.
> Ie: it can wait.

Although... I think I see what you mean.

If you have this:

255
.   0
.   .   0
.   .   255
.   0
.   .   64
.   .   64
.   0
.   .   0
.   .   64


A partial vacuum won't even recurse beyond the root node, so it won't
expose the free space 2 levels down.

This could be arrived at by having an almost fully packed table that
contains some free space near the end, and it gets deletions near the
beginning. Vacuum will mark the leaf levels at the beginning of the
table, and we'd like for that free space to be discoverable to avoid
packing more rows near the end where they prevent truncation.

The patch as it is now will only vacuum that part of the FSM when the
root value drops, which usually requires all the free space on that
region of the heap to be exhausted.

With the current recursive FSM vacuum method, however, that's
unavoidable. We can't detect that there's some free space below the
current level to be exposed without recursing into the tree, and
recursing is exactly the kind of work we want to prevent with tree
pruning.

In all my trials, however, I did not run into that case. It must not
be easy to get into that situation, because the root node already has
~4000 slots in it. Each of those 4000 slots should be in that
situation to keep the space at the end of the table as the sole
discoverable free space, which seems like a rather contorted worst
case.



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-15 Thread Claudio Freire
On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada  wrote:
> On Fri, Feb 9, 2018 at 11:48 PM, Claudio Freire  
> wrote:
>> On Fri, Feb 9, 2018 at 1:36 AM, Masahiko Sawada  
>> wrote:
>>> On Fri, Feb 9, 2018 at 12:45 AM, Claudio Freire  
>>> wrote:
 On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada  
 wrote:
> On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire  
> wrote:
>> I can look into doing 3, that *might* get rid of the need to do that
>> initial FSM vacuum, but all other intermediate vacuums are still
>> needed.
>
> Understood.  So how about making this patch focus only on doing FSM
> vacuum more frequently, and leaving out the initial FSM vacuum and the
> handling of cancellation cases?  The rest can be a separate patch.

 Ok.

 Attached split patches. All but the initial FSM pass is in 1, the
 initial FSM pass as in the original patch is in 2.

 I will post an alternative patch for 2 when/if I have one. In the
 meanwhile, we can talk about 1.
>>>
>>> Thank you for updating the patch!
>>>
>>> +/* Tree pruning for partial vacuums */
>>> +if (threshold)
>>> +{
>>> +child_avail = fsm_get_avail(page, slot);
>>> +if (child_avail >= threshold)
>>> +continue;
>>> +}
>>>
>>> I think slots in a non-leaf page might not have a correct value
>>> because fsm_set_avail doesn't propagate up beyond pages.  So do we need
>>> to check the root of the child page rather than a slot in the non-leaf
>>> page?  I might be missing something though.
>>
>> I'm not sure I understand what you're asking.
>>
>> Yes, a partial FSM vacuum won't correct those inconsistencies. It
>> doesn't matter though, because as long as the root node (or whichever
>> inner node we're looking at the time) reports free space, the lookup
>> for free space will recurse and find the lower nodes, even if they
>> report more space than the inner node reported. The goal of making
>> free space discoverable will have been achieved nonetheless.
>
> For example, if a slot of an internal node of the FSM tree has a value
> greater than the threshold while the root of a leaf node has a value
> less than the threshold, we want to update the internal node, but we
> will skip it with your patch.  I wonder if this can happen.

If I understand your point correctly, that would mean that space was
used since the last vacuum. Partial FSM vacuums don't concern
themselves with that case, the normal free space search algorithm will
handle that, retrying with the next slot until it succeeds (or until
it gives up).

IIUC, each time a failure of such kind is found while searching, the
FSM gets updated to avoid following that link a second time.

See, in fsm_search:

/*
 * At lower level, failure can happen if the value in the upper-
 * level node didn't reflect the value on the lower page. Update
 * the upper node, to avoid falling into the same trap again, and
 * start over.


>
>> The final FSM vacuum pass isn't partial, to finally correct all those
>> small inconsistencies.
>
> Yes, but the purpose of this patch is to prevent table bloating from
> concurrent modifications?

Yes, by *marking* unmarked space. If the slot is above the threshold,
it means there's already marked free space on that branch, and search
will go into it already, so no pressing need to refine the mark. The
space already marked can serve while vacuum makes further progress.
Ie: it can wait.

> Here is other review comments.
>
> +/* Tree pruning for partial vacuums */
> +if (threshold)
> +{
> +child_avail = fsm_get_avail(page, slot);
> +if (child_avail >= threshold)
> +continue;
> +}
>
> 'child_avail' is a category value while 'threshold' is an actual size
> of freespace.

Good catch. It was meant to be a category, but I see the public
interface of the FSM doesn't usually expose categories, so fixing
that.
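
For illustration, a sketch of the corrected comparison, assuming a
size-to-category conversion helper along the lines of the internal ones in
freespace.c (the exact function name may differ):

    /* Convert the byte threshold into an FSM category once, then prune
     * branches whose slots already advertise that much free space. */
    uint8   threshold_cat = fsm_space_needed_to_cat(threshold);

    /* Tree pruning for partial vacuums */
    if (threshold_cat && fsm_get_avail(page, slot) >= threshold_cat)
        continue;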

> The logic for finding max_freespace seems not to work well for a table
> with indexes, because max_freespace is not updated based on
> post-compaction free space.  I think we need to get the max free space
> size while vacuuming the heap (i.e. at lazy_vacuum_heap) and use it as
> the threshold.

Good point.

>
> +   vacuum_fsm_every_pages = nblocks / 8;
> +   vacuum_fsm_every_pages = Max(vacuum_fsm_every_pages, 1048576);
>
> I think a comment for 1048576 is needed. And perhaps we can define it
> as a macro.

1M pages = 8GB with an 8kB page size.

I can clarify.
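
Something along these lines, as a sketch (the constant matches the 1048576
above, i.e. 8GB of heap at the default 8kB block size):

    /*
     * Vacuum the FSM at most once every 8GB worth of heap pages
     * (1048576 pages at the default BLCKSZ of 8192 bytes).
     */
    #define VACUUM_FSM_EVERY_PAGES  1048576

    vacuum_fsm_every_pages = Max(nblocks / 8, VACUUM_FSM_EVERY_PAGES);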

Attached are new versions of the patches with these corrections.
From 0587d8fb9855314bbde73297856fe2116d3310a5 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Thu, 8 Feb 2018 12:39:15 -0300
Subject: [PATCH 2/2] 

Re: JIT compiling with LLVM v10.1

2018-02-15 Thread Andres Freund
Hi,

On 2018-02-15 11:59:46 +0300, Konstantin Knizhnik wrote:
> It is a well-known fact that, for warm data, Postgres spends most of its
> time in sequential scan queries deforming tuples (17% in the case of
> TPC-H Q1).

I think that the majority of the time therein is not actually
bottlenecked by CPU, but by cache misses.  It might be worthwhile to
repeat your analysis with the last patch of my series applied, and the
#define FASTORDER
uncommented.


> Postgres tries to optimize access to the tuple by caching fixed-size
> offsets to the fields whenever possible and loading attributes on
> demand.  It is also a well-known recommendation to put fixed-size,
> non-null, frequently used attributes at the beginning of a table's
> attribute list to make this optimization work more efficiently.

FWIW, I think this optimization causes vastly more trouble than it's
worth.


> You can see in the code of heap_deform_tuple that the first NULL value
> will switch it to "slow" mode:

Note that in most workloads the relevant codepath isn't
heap_deform_tuple but slot_deform_tuple.
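
For readers following along, the "slow mode" switch being discussed boils
down to the following pattern (a simplified sketch of the deforming loop,
not the actual backend code; the usual locals such as natts, bp, tp,
values[] and isnull[] are assumed to be set up as in heap_deform_tuple):

    bool    slow = false;
    uint32  off = 0;

    for (int attnum = 0; attnum < natts; attnum++)
    {
        Form_pg_attribute att = TupleDescAttr(tupleDesc, attnum);

        if (hasnulls && att_isnull(attnum, bp))
        {
            values[attnum] = (Datum) 0;
            isnull[attnum] = true;
            slow = true;        /* cached offsets unusable from here on */
            continue;
        }
        isnull[attnum] = false;
        if (!slow && att->attcacheoff >= 0)
            off = att->attcacheoff;     /* fast path: cached offset */
        else
            off = att_align_pointer(off, att->attalign, att->attlen,
                                    tp + off);
        values[attnum] = fetchatt(att, tp + off);
        off = att_addlength_pointer(off, att->attlen, tp + off);
        if (att->attlen <= 0)
            slow = true;        /* varlena also disables offset caching */
    }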


> 1. Modern platforms are mostly limited by memory access time; the number
> of performed instructions is less critical.

I don't think this is quite the correct conclusion. Especially because a lot
of time is spent accessing memory, having code that the CPU can execute
out-of-order (by speculatively executing forward) is hugely
beneficial.  Some of the benefit of JITing comes from being able to
start deforming the next field while memory fetches for the previous one
are still ongoing (iff dealing with fixed width cols).


> 2. For a large number of attributes, JIT-ing of tuple deforming can
> improve speed up to two times, which is quite a good result from my
> point of view.

+1

Note the last version has a small deficiency in decoding varlena datums
that I need to fix (varsize_any isn't inlined anymore).

Greetings,

Andres Freund



Re: Add void cast to StaticAssertExpr?

2018-02-15 Thread Andres Freund
Hi,

On 2018-02-15 12:19:46 -0500, Tom Lane wrote:
> While poking around in buildfarm results, I noticed that some members are
> giving warnings like
> 
> analyze.c:386:181: warning: expression result unused [-Wunused-value]
> 
> which is apparently caused by the use of StaticAssertExpr in
> 
> #define AllocSetContextCreate(parent, name, allocparams) \
> (StaticAssertExpr(__builtin_constant_p(name), \
>   "Use AllocSetContextCreateExtended with 
> MEMCONTEXT_COPY_NAME for non-constant context names"), \
>  AllocSetContextCreateExtended(parent, name, 0, allocparams))
> 
> Now, I could silence those warnings via
> 
> -(StaticAssertExpr(__builtin_constant_p(name), \
> +((void) StaticAssertExpr(__builtin_constant_p(name), \
> 
> as I see has already been done in two similar uses of StaticAssertExpr in
> c.h.  However, this seems a bit silly.  Wouldn't it be better to put
> the void cast right into StaticAssertExpr?

No argument against it here.

Greetings,

Andres Freund



Re: spin.c includes pg_sema.h even if unnecessary

2018-02-15 Thread Andres Freund
On 2018-02-15 20:11:07 +0900, Kyotaro HORIGUCHI wrote:
> As in another mail just before, spin.c seems a bit strange
> (without actual harm).
> 
> spin.h doesn't include pg_sema.h when HAVE_SPINLOCKS is defined,
> but spin.c always includes it even in that case.  The file is
> included only to use sizeof(PGSemaphore) to calculate
> SpinlockSemaSize as 0.
> 
> The code that requires PGSemaphore is deactivated when the
> symbol is defined in all other places, so it seems to me that we
> ought to refrain from using it there, too.  The attached patch does

IDK, I don't quite see the point of the change here...

Greetings,

Andres Freund



Re: Let's remove DSM_IMPL_NONE.

2018-02-15 Thread Andres Freund
Hi,

On 2018-02-15 19:58:57 +0900, Kyotaro HORIGUCHI wrote:
> As in the other threads[1][2], we have had several good reasons
> to remove DSM_IMPL_NONE in PG11.  The attached patch does that.
> 
> [1] 
> https://www.postgresql.org/message-id/CA+Tgmoa0e23YC3SCwB85Yf5oUTRyirV=sq_rxyxaxabldpp...@mail.gmail.com
> 
> [2] 
> https://www.postgresql.org/message-id/20180214.103858.02713585.horiguchi.kyot...@lab.ntt.co.jp

Hm, I'm not quite convinced by this. Seems to make testing a bunch of
codepaths harder.  I think it's fine to say that pg doesn't work
correctly with them disabled though.

- Andres



Re: JIT compiling with LLVM v9.1

2018-02-15 Thread Andres Freund
Hi,

On 2018-02-15 12:54:34 +0100, Andreas Karlsson wrote:
> 1) Why does config/llvm.m4 modify CPPFLAGS? That affects the building of the
> binaries too which may be done with gcc like in my case. Shouldn't it use a
> LLVM_CPPFLAGS or something?

Well, most of the time cppflags are just things like additional include
directories, and the established precedent is to add those to the
global CPPFLAGS (c.f. the libxml stuff in configure.in).  I've no problem
changing this, I just followed established practice.


> 2) When I build with --with-cassert I expect the assertions to be there,
> both in the binaries and the bitcode. Is that just a bug or is there any
> thought behind this?

Not sure what you mean by that. NDEBUG and cassert are independent
mechanisms, no?

Greetings,

Andres Freund



Add void cast to StaticAssertExpr?

2018-02-15 Thread Tom Lane
While poking around in buildfarm results, I noticed that some members are
giving warnings like

analyze.c:386:181: warning: expression result unused [-Wunused-value]

which is apparently caused by the use of StaticAssertExpr in

#define AllocSetContextCreate(parent, name, allocparams) \
(StaticAssertExpr(__builtin_constant_p(name), \
  "Use AllocSetContextCreateExtended with 
MEMCONTEXT_COPY_NAME for non-constant context names"), \
 AllocSetContextCreateExtended(parent, name, 0, allocparams))

Now, I could silence those warnings via

-(StaticAssertExpr(__builtin_constant_p(name), \
+((void) StaticAssertExpr(__builtin_constant_p(name), \

as I see has already been done in two similar uses of StaticAssertExpr in
c.h.  However, this seems a bit silly.  Wouldn't it be better to put
the void cast right into StaticAssertExpr?
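
For concreteness, a sketch of what folding the cast in might look like,
assuming the statement-expression-based variant in c.h (the definitions
there may differ):

    /* Casting to void inside the macro silences -Wunused-value at
     * every call site at once. */
    #define StaticAssertExpr(condition, errmessage) \
        ((void) ({ StaticAssertStmt(condition, errmessage); true; }))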

regards, tom lane



Re: [PATCH][PROPOSAL] Add enum releation option type

2018-02-15 Thread Nikolay Shaplov
In a message from 15 February 2018 12:53:27, Alvaro Herrera wrote:

> > > Maybe we need some in-core user to verify the string case still works.
> > > A new module in src/test/modules perhaps?
> > 
> > I've looked attentively in src/test/modules...  To properly test all
> > reloption hooks for modules we need to create an extension with some
> > dummy index in it; that way we can properly test everything.  Creating
> > a dummy index would be fun, but it is quite a big deal to create, so it
> > will require a separate patch.  So I suppose string reloptions are
> > better left untested for now, with a note that it should be done
> > sooner or later
> 
Do we really need a dummy index?  I was thinking something that just
calls a few C functions to create and parse a string reloption should
be more than enough.

Technically we could call add_reloption_kind(), then add some string
options to it using add_string_reloption(), and then call
fillRelOptions() with some dummy data and the validate argument on, and
see what happens.

But I do not think that would test a lot.  I think it is better to test
it all together, not just some part of it.

Moreover, when my whole patch is committed all of this would have to be
rewritten, as I changed some of the options internals for good reasons.

So I would prefer to keep it untested until we are done with reloptions,
and then test it in a good way, with a dummy index being created and so
on.  I think that will be needed for more tests and educational
purposes...

But if you, as a reviewer, insist on it, I will do as you say.

> > > I don't much like the way you've represented the list of possible values
> > > for each enum.  I think it'd be better to have a struct that represents
> > > everything about each value (string name and C symbol.  Maybe the
> > > numerical value too if that's needed, but is it?  I suppose all code
> > > should use the C symbol only, so why do we care about the numerical
> > > value?).
> > 
> > One more reason to keep the numeric value, which came to my mind, is
> > that it seems logical to represent enum values in postgres internals
> > as a C enum, and a C enum is actually an int value (it can easily be
> > cast both ways).  It is not strictly necessary, but it seems somewhat
> > logical...
> 
> Oh, I didn't mean to steer you away from a C enum.  I just meant that we
> don't need to define the numerical values ourselves -- it should be fine
> to use whatever the C compiler chooses for each C symbol (enum member
> name).  In the code we don't refer to the values by numerical value,
> only by the C symbol.
Ah, that is what you are talking about :-)

I needed these numbers for debugging purposes, nothing more.  If it is
not good to keep them, they can be removed now...
(I would prefer to keep them for further debugging, but if that is not
right, I can easily remove them; I do not need them right now.)

-- 
Do code for fun.



Re: Removing useless #include's.

2018-02-15 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> While looking some patch, just from curiosity, I checked for
> redundant #include's in the source tree (except
> contrib). "redundant" here means that a file is included in
> another include file nearby.

> I found 641 includes that is just removable with no side effect
> with two exceptions.

I tend to be a bit suspicious of this sort of thing, especially for
old files that have been through previous rounds of "unnecessary
include" removal.  It's usually a good idea to ask *why* is a header
no longer needed?  The answer, usually, is that somebody added the
same #include to some other header, and it's not uncommon for that
to have been a bad idea.  It's usually best to minimize cross-header
inclusions, IMV, and it's always necessary to exercise judgment
when adding one.

We've also had more than a few problems with automatic scripts deciding
that an #include could be removed because they weren't testing with the
combination of build options that made it necessary.

See for instance commits 6416a82a6 through 1609797c2 for some history
of poorly managed #include removal.

regards, tom lane



Re: [PATCH][PROPOSAL] Add enum releation option type

2018-02-15 Thread Alvaro Herrera
Nikolay Shaplov wrote:
> In a message from 9 February 2018 18:45:29, Alvaro Herrera wrote:
> 
> > Maybe we need some in-core user to verify the string case still works.
> > A new module in src/test/modules perhaps? 
> I've looked attentively in src/test/modules... To properly test all reloptions
> hooks for modules we need to create an extension with some dummy index in it.
> This way we can properly test everything. Creating a dummy index would be fun,
> but it is quite a big deal to create, so it will require a separate patch to
> deal with. So I suppose string reloptions are better left untested for now,
> with a note that it should be done sooner or later.

Do we really need a dummy index?  I was thinking that something which just
calls a few C functions to create and parse a string reloption should be
more than enough.
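
For what it's worth, a rough sketch of the kind of C-level exercise such a
module could do (all names here are hypothetical; it assumes the existing
add_reloption_kind()/add_string_reloption() API):

#include "postgres.h"
#include "access/reloptions.h"

/* hypothetical validator: reject empty strings */
static void
validate_dummy_opt(char *value)
{
    if (value == NULL || value[0] == '\0')
        elog(ERROR, "dummy_opt must not be empty");
}

/* hypothetical test entry point */
void
test_string_reloption(void)
{
    /* register a new reloption kind and a string option under it */
    relopt_kind kind = add_reloption_kind();

    add_string_reloption(kind, "dummy_opt",
                         "a dummy string option for testing",
                         "default_value", validate_dummy_opt);

    /* parsing would then go through parseRelOptions()/fillRelOptions() */
}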

> > I don't much like the way you've represented the list of possible values
> > for each enum.  I think it'd be better to have a struct that represents
> > everything about each value (string name and C symbol.  Maybe the
> > numerical value too if that's needed, but is it?  I suppose all code
> > should use the C symbol only, so why do we care about the numerical
> > value?).
> One more reason to keep the numeric values, which came to my mind, is that it seems
> logical to keep enum values in Postgres internals represented as a C enum.
> And a C enum is actually an int value (it can easily be cast both ways). It is
> not strictly necessary, but it seems a bit logical...

Oh, I didn't mean to steer you away from a C enum.  I just meant that we
don't need to define the numerical values ourselves -- it should be fine
to use whatever the C compiler chooses for each C symbol (enum member
name).  In the code we don't refer to the values by numerical value,
only by the C symbol.
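
For illustration only, a minimal sketch of what that means (the names are
hypothetical; the comments merely show what the compiler would assign):

/* a hypothetical enum reloption's values */
typedef enum DummyEnumOption
{
    DUMMY_OPTION_FIRST,     /* compiler assigns 0 */
    DUMMY_OPTION_SECOND,    /* 1 */
    DUMMY_OPTION_THIRD      /* 2 */
} DummyEnumOption;

/* code then refers to DUMMY_OPTION_FIRST etc. only by symbol,
 * never by the underlying number */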

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH][PROPOSAL] Add enum relation option type

2018-02-15 Thread Nikolay Shaplov
In a message of 9 February 2018 18:45:29, Alvaro Herrera wrote:

> Maybe we need some in-core user to verify the string case still works.
> A new module in src/test/modules perhaps? 
I've looked attentively in src/test/modules... To properly test all reloptions 
hooks for modules we need to create an extension with some dummy index in it. 
This way we can properly test everything. Creating a dummy index would be fun, 
but it is quite a big deal to create, so it will require a separate patch to 
deal with. So I suppose string reloptions are better left untested for now, 
with a note that it should be done sooner or later.

> I don't much like the way you've represented the list of possible values
> for each enum.  I think it'd be better to have a struct that represents
> everything about each value (string name and C symbol.  Maybe the
> numerical value too if that's needed, but is it?  I suppose all code
> should use the C symbol only, so why do we care about the numerical
> value?).
One more reason to keep the numeric values, which came to my mind, is that it seems 
logical to keep enum values in Postgres internals represented as a C enum. 
And a C enum is actually an int value (it can easily be cast both ways). It is 
not strictly necessary, but it seems a bit logical...


-- 
Do code for fun.

signature.asc
Description: This is a digitally signed message part.


Re: Cached/global query plans, autopreparation

2018-02-15 Thread Jorge Solórzano
On Thu, Feb 15, 2018 at 8:00 AM, Shay Rojansky  wrote:

>> I am the author of one of the proposals (autoprepare, which is in the commit
>> fest now), but I think that sooner or later Postgres has to come to a solution
>> with shared DB caches/prepared plans.
>> Please correct me if I am wrong, but it seems to me that most of the
>> other top DBMSes have something like this.
>> Such a decision can provide a lot of different advantages:
>> 1. Better memory utilization: no need to store the same data N times,
>> where N is the number of backends, and to spend time warming the cache.
>> 2. The optimizer can spend more time choosing a better plan, which can then
>> be used by all clients. Even now the compilation time of some queries
>> exceeds their execution time several times over.
>> 3. It is simpler to add facilities for query plan tuning and maintenance
>> (storing, comparing, ...)
>> 4. It makes it possible to control the size of memory used by caches. Right
>> now the catalog cache for a DB with hundreds of thousands of tables and
>> indexes, multiplied by hundreds of backends, can consume terabytes of memory.
>> 5. Shared caches can simplify the invalidation mechanism.
>> 6. Almost all enterprise systems working with Postgres have to use some
>> kind of connection pooling (pgbouncer, pgpool, ...), which almost excludes
>> the possibility of using prepared statements. That can slow down performance
>> by up to two times.
>>
>
> Just wanted to say I didn't see this email before my previous response,
> but I agree with all of the above. The last point is particularly
> important, especially for short-lived connection scenarios, the most
> typical of which is web.
>
>
>> There is just one (but very important) problem which needs to be solved:
>> access to the shared cache has to be synchronized.
>> But there are a lot of other shared resources in Postgres (procarray,
>> shared buffers, ...), so I do not think that it is an unsolvable problem or
>> that it would degrade performance.
>>
>> So it seems obvious that shared caches/plans can provide a lot of
>> advantages. But the value of these advantages for real customers is still
>> not clear to me.
>> Using the -M prepared protocol in a pgbench workload can improve speed by up
>> to two times. But I have asked real Postgres users at Avito, Yandex, MyOffice,
>> and they told me
>> that on their workloads the advantage of prepared statements is about 10%.
>> A 10% performance improvement is definitely not good compensation for
>> rewriting a substantial part of the Postgres core...
>>
>
> Just wanted to say that I've seen more than 10% improvement in some
> real-world applications when preparation was done properly. Also, I'm
> assuming that implementing this wouldn't involve "rewriting a substantial
> part of the Postgres core", and that even 10% is quite a big gain, especially
> if it's a transparent/free one as far as the user is concerned (no
> application changes).
>


A 10% improvement in the real world can be pretty significant. I don't know
how complicated it would be to implement this in Postgres core; how about
adding it to the GSoC 2018 ideas[1]?

[1] https://wiki.postgresql.org/wiki/GSoC_2018


Re: [HACKERS] why not parallel seq scan for slow functions

2018-02-15 Thread Amit Kapila
On Thu, Feb 15, 2018 at 4:18 PM, Ashutosh Bapat
 wrote:
> I happened to look at the patch for something else. But here are some
> comments. If any of those have been already discussed, please feel
> free to ignore. I have gone through the thread cursorily, so I might
> have missed something.
>
> In grouping_planner() we call query_planner() first which builds the
> join tree and creates paths, then calculate the target for scan/join
> rel which is applied on the topmost scan join rel. I am wondering
> whether we can reverse this order to calculate the target list of
> scan/join relation and pass it to standard_join_search() (or the hook)
> through query_planner().
>

I think the reason for doing it in that order is that we can't compute the
target's width till after query_planner().  See the note below in the
code:

/*
* Convert the query's result tlist into PathTarget format.
*
* Note: it's desirable to not do this till after query_planner(),
* because the target width estimates can use per-Var width numbers
* that were obtained within query_planner().
*/
final_target = create_pathtarget(root, tlist);

Now, I think we can try to juggle the code in a way that the width can
be computed later, but the rest can be done earlier.  However, I think
that will be a somewhat major change and still won't address all kinds of
cases (like ordered paths) unless we can try to get all kinds of
targets pushed down the call stack.  Also, I am not sure if that is
the only reason or whether there are some other assumptions about this
calling order as well.

>
> Here are some comments on the patch.
>

Thanks for looking into the patch.  As of now, we are evaluating the
right approach for this patch, so let's wait for Robert's reply.
After we agree on the approach, I will address your comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Cached/global query plans, autopreparation

2018-02-15 Thread Shay Rojansky
>
> I am the author of one of the proposals (autoprepare, which is in the commit
> fest now), but I think that sooner or later Postgres has to come to a solution
> with shared DB caches/prepared plans.
> Please correct me if I am wrong, but it seems to me that most of the other
> top DBMSes have something like this.
> Such a decision can provide a lot of different advantages:
> 1. Better memory utilization: no need to store the same data N times, where
> N is the number of backends, and to spend time warming the cache.
> 2. The optimizer can spend more time choosing a better plan, which can then be
> used by all clients. Even now the compilation time of some queries exceeds
> their execution time several times over.
> 3. It is simpler to add facilities for query plan tuning and maintenance
> (storing, comparing, ...)
> 4. It makes it possible to control the size of memory used by caches. Right now
> the catalog cache for a DB with hundreds of thousands of tables and indexes,
> multiplied by hundreds of backends, can consume terabytes of memory.
> 5. Shared caches can simplify the invalidation mechanism.
> 6. Almost all enterprise systems working with Postgres have to use some
> kind of connection pooling (pgbouncer, pgpool, ...), which almost excludes
> the possibility of using prepared statements. That can slow down performance
> by up to two times.
>

Just wanted to say I didn't see this email before my previous response, but
I agree with all of the above. The last point is particularly important,
especially for short-lived connection scenarios, the most typical of which
is web.


> There is just one (but very important) problem which needs to be solved:
> access to the shared cache has to be synchronized.
> But there are a lot of other shared resources in Postgres (procarray,
> shared buffers, ...), so I do not think that it is an unsolvable problem or
> that it would degrade performance.
>
> So it seems obvious that shared caches/plans can provide a lot of
> advantages. But the value of these advantages for real customers is still
> not clear to me.
> Using the -M prepared protocol in a pgbench workload can improve speed by up
> to two times. But I have asked real Postgres users at Avito, Yandex, MyOffice,
> and they told me
> that on their workloads the advantage of prepared statements is about 10%.
> A 10% performance improvement is definitely not good compensation for
> rewriting a substantial part of the Postgres core...
>

Just wanted to say that I've seen more than 10% improvement in some
real-world applications when preparation was done properly. Also, I'm
assuming that implementing this wouldn't involve "rewriting a substantial
part of the Postgres core", and that even 10% is quite a big gain, especially
if it's a transparent/free one as far as the user is concerned (no
application changes).


Re: Cached/global query plans, autopreparation

2018-02-15 Thread Shay Rojansky
>
> > Well, the issue is that implementing this is a major piece of work. This
> > post doesn't offer either resources nor a simpler way to do so. There's
> > no huge debate about the benefit of having a global plan cache, so I'm
> > not that surprised there's not a huge debate about a post arguing that
> > we should have one.
>
> Actually, I'm pretty darn skeptical about the value of such a cache for
> most use-cases.  But as long as it can be turned off and doesn't leave
> residual overhead nor massive added code cruft, I won't stand in the way
> of someone else investing their time in it.
>
> In any case, as you say, it's moot until somebody steps up to do it.
>

Well, looking at previous conversations and also at the comment above, it
doesn't seem like there's a consensus on whether this feature would even be
beneficial... The point of my email above was to have that conversation
before looking into implementation. Tom, I'm especially interested in
understanding why you think this cache wouldn't help most use-cases: I see
many applications which don't prepare (e.g. because they use data access
layers/O/RMs which don't do it or don't expose it), and implementing this in
the driver seems like the wrong way (although Npgsql and JDBC do it, at least
some other languages don't).

In addition, there are also various options/possibilities here and there
seems no consensus about that either:

* How should statement plan caching be done for unprepared statements, what
strategy should be used? Should a threshold number of unprepared executions
be used before PostgreSQL decides to prepare? Should there be a maximum
number of autoprepared statements, ejecting the least-recently-used one to
make room for a new one (see the sketch after this list)? Or something else?
* Should the cached plans be shared across connections ("global" cached
statements)? Are the savings from global caching greater than the cost of
the contention? The savings include both (a) not having to re-prepare the
same statement N times on different connections (typically just a one-time
application warm-up cost), and (b) not having the memory duplication of N
identical statements across connections (a constant cost, not warm-up - but
not sure how significant this is). Note that the global/shared discussion
is a bit orthogonal to the general autopreparation conversation - the
latter has value with or without the former.
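
For concreteness, here is a minimal sketch of the threshold-plus-LRU strategy
mentioned in the first bullet; it is not taken from any driver or from
PostgreSQL, and all names and constants are hypothetical:

#include <string.h>
#include <stdlib.h>

#define AUTOPREPARE_THRESHOLD 5
#define AUTOPREPARE_CACHE_SIZE 256

typedef struct AutoPrepEntry
{
    char       *sql;            /* statement text (cache key) */
    int         exec_count;     /* executions seen so far */
    int         prepared;       /* server-side statement created? */
    long        last_used;      /* logical clock for LRU eviction */
} AutoPrepEntry;

static AutoPrepEntry cache[AUTOPREPARE_CACHE_SIZE];
static int  cache_used;
static long clock_tick;

/* Returns 1 if the caller should execute via a prepared statement. */
int
autoprepare_on_execute(const char *sql)
{
    AutoPrepEntry *victim;
    int         i;

    for (i = 0; i < cache_used; i++)
    {
        if (strcmp(cache[i].sql, sql) == 0)
        {
            cache[i].last_used = ++clock_tick;
            if (!cache[i].prepared &&
                ++cache[i].exec_count >= AUTOPREPARE_THRESHOLD)
                cache[i].prepared = 1;  /* driver would send Parse here */
            return cache[i].prepared;
        }
    }

    /* Not found: insert, evicting the LRU entry if the cache is full. */
    if (cache_used < AUTOPREPARE_CACHE_SIZE)
        victim = &cache[cache_used++];
    else
    {
        victim = &cache[0];
        for (i = 1; i < AUTOPREPARE_CACHE_SIZE; i++)
            if (cache[i].last_used < victim->last_used)
                victim = &cache[i];
        free(victim->sql);      /* driver would also send Close here */
    }
    victim->sql = strdup(sql);
    victim->exec_count = 1;
    victim->prepared = 0;
    victim->last_used = ++clock_tick;
    return 0;
}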

Essentially I think it's a good idea to have a conversation about all this
before anyone jumps into implementation.


Changing baserel to foreignrel in postgres_fdw functions

2018-02-15 Thread Ashutosh Bapat
Hi,
I noticed that functions is_foreign_expr(), classifyConditions() and
appendOrderByClause() had variables/arguments named baserel when the
relations passed to those could be join or upper relation as well.
Here's patch renaming those as foreignrel.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index e111b09..bcf9bea 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -198,7 +198,7 @@ static void get_relation_column_alias_ids(Var *node, RelOptInfo *foreignrel,
  */
 void
 classifyConditions(PlannerInfo *root,
-   RelOptInfo *baserel,
+   RelOptInfo *foreignrel,
    List *input_conds,
    List **remote_conds,
    List **local_conds)
@@ -212,7 +212,7 @@ classifyConditions(PlannerInfo *root,
 	{
 		RestrictInfo *ri = lfirst_node(RestrictInfo, lc);
 
-		if (is_foreign_expr(root, baserel, ri->clause))
+		if (is_foreign_expr(root, foreignrel, ri->clause))
 			*remote_conds = lappend(*remote_conds, ri);
 		else
 			*local_conds = lappend(*local_conds, ri);
@@ -224,29 +224,29 @@ classifyConditions(PlannerInfo *root,
  */
 bool
 is_foreign_expr(PlannerInfo *root,
-RelOptInfo *baserel,
+RelOptInfo *foreignrel,
 Expr *expr)
 {
 	foreign_glob_cxt glob_cxt;
 	foreign_loc_cxt loc_cxt;
-	PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) (baserel->fdw_private);
+	PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) (foreignrel->fdw_private);
 
 	/*
 	 * Check that the expression consists of nodes that are safe to execute
 	 * remotely.
 	 */
 	glob_cxt.root = root;
-	glob_cxt.foreignrel = baserel;
+	glob_cxt.foreignrel = foreignrel;
 
 	/*
 	 * For an upper relation, use relids from its underneath scan relation,
 	 * because the upperrel's own relids currently aren't set to anything
 	 * meaningful by the core code.  For other relation, use their own relids.
 	 */
-	if (IS_UPPER_REL(baserel))
+	if (IS_UPPER_REL(foreignrel))
 		glob_cxt.relids = fpinfo->outerrel->relids;
 	else
-		glob_cxt.relids = baserel->relids;
+		glob_cxt.relids = foreignrel->relids;
 	loc_cxt.collation = InvalidOid;
 	loc_cxt.state = FDW_COLLATE_NONE;
 	if (!foreign_expr_walker((Node *) expr, &glob_cxt, &loc_cxt))
@@ -301,7 +301,7 @@ foreign_expr_walker(Node *node,
 	if (node == NULL)
 		return true;
 
-	/* May need server info from baserel's fdw_private struct */
+	/* May need server info from foreignrel's fdw_private struct */
 	fpinfo = (PgFdwRelationInfo *) (glob_cxt->foreignrel->fdw_private);
 
 	/* Set up inner_cxt for possible recursion to child nodes */
@@ -2965,7 +2965,7 @@ appendGroupByClause(List *tlist, deparse_expr_cxt *context)
 }
 
 /*
- * Deparse ORDER BY clause according to the given pathkeys for given base
+ * Deparse ORDER BY clause according to the given pathkeys for given foreign
  * relation. From given pathkeys expressions belonging entirely to the given
  * base relation are obtained and deparsed.
  */
@@ -2975,7 +2975,7 @@ appendOrderByClause(List *pathkeys, deparse_expr_cxt *context)
 	ListCell   *lcell;
 	int			nestlevel;
 	char	   *delim = " ";
-	RelOptInfo *baserel = context->scanrel;
+	RelOptInfo *foreignrel = context->scanrel;
 	StringInfo	buf = context->buf;
 
 	/* Make sure any constants in the exprs are printed portably */
@@ -2987,7 +2987,7 @@ appendOrderByClause(List *pathkeys, deparse_expr_cxt *context)
 		PathKey*pathkey = lfirst(lcell);
 		Expr	   *em_expr;
 
-		em_expr = find_em_expr_for_rel(pathkey->pk_eclass, baserel);
+		em_expr = find_em_expr_for_rel(pathkey->pk_eclass, foreignrel);
 		Assert(em_expr != NULL);
 
 		appendStringInfoString(buf, delim);


Re: autovacuum: change priority of the vacuumed tables

2018-02-15 Thread Grigory Smolkin

On 02/15/2018 09:28 AM, Masahiko Sawada wrote:


Hi,

On Thu, Feb 8, 2018 at 11:01 PM, Ildus Kurbangaliev
 wrote:

Hi,

The attached patch adds 'autovacuum_table_priority' to the current list of
automatic vacuuming settings. It's used for sorting the tables to be vacuumed
in an autovacuum worker before the actual vacuum.

The idea is to give users the possibility to prioritize their tables
in the autovacuum process.


Hmm, I couldn't understand the benefit of this patch. Would you
elaborate on it a little more?

Multiple autovacuum workers can work on one database. So even if a
table that you want to vacuum first is at the back of the list,
another worker would pick it up. If vacuuming of the table gets delayed
because some big tables are in front of it, I think you can deal
with that by increasing the number of autovacuum workers.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



A database can contain thousands of tables, and updates/deletes often 
concentrate in only a handful of them.

Going through thousands of less-bloated tables can take ages.
Currently autovacuum knows nothing about prioritizing its work with 
respect to the user's understanding of his data and application.
It would also be great to sort tables according to dead/live tuple 
ratio and relfrozenxid.
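
As a rough illustration of that last idea, here is a sketch (not the patch's
code; all names are hypothetical) of a comparator autovacuum could use to
order its work list by dead/live tuple ratio:

#include "postgres.h"

/* hypothetical per-table stats snapshot */
typedef struct AvTableInfo
{
    Oid     relid;
    double  n_dead_tup;
    double  n_live_tup;
    /* relfrozenxid age could serve as a further tiebreaker */
} AvTableInfo;

/* qsort() comparator: tables with a higher dead/live ratio sort first */
static int
av_priority_cmp(const void *a, const void *b)
{
    const AvTableInfo *ta = (const AvTableInfo *) a;
    const AvTableInfo *tb = (const AvTableInfo *) b;
    double      ra = ta->n_dead_tup / Max(ta->n_live_tup, 1.0);
    double      rb = tb->n_dead_tup / Max(tb->n_live_tup, 1.0);

    if (ra > rb)
        return -1;
    if (ra < rb)
        return 1;
    return 0;
}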


--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




spin.c includes pg_sema.h even if unnecessary

2018-02-15 Thread Kyotaro HORIGUCHI
Ouch! Sorry for the bogus subject of just-submitted mail.
I'll make another thread with the fixed subject.


Hello.

As in another mail just before, spin.c seems a bit strange
(without actual harm).

spin.h doesn't include pg_sema.h when HAVE_SPINLOCKS is defined,
but spin.c always includes it even in that case. The file is
included only to use sizeof(PGSemaphore) to calculate
SpinlockSemaSize as 0.

The code that requires PGSemaphore is deactivated in all other
places when the symbol is defined, so it seems to me that we
ought to refrain from using it there, too. The attached patch does
that.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index 6d59a7f..353a3a9 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -22,13 +22,15 @@
  */
 #include "postgres.h"
 
-#include "storage/pg_sema.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 
 
 #ifndef HAVE_SPINLOCKS
+#define RequiredShmemSize (SpinlockSemas() * sizeof(PGSemaphore))
 PGSemaphore *SpinlockSemaArray;
+#else
+#define RequiredShmemSize 0
 #endif
 
 /*
@@ -38,7 +40,7 @@ PGSemaphore *SpinlockSemaArray;
 Size
 SpinlockSemaSize(void)
 {
-	return SpinlockSemas() * sizeof(PGSemaphore);
+	return RequiredShmemSize;
 }
 
 #ifdef HAVE_SPINLOCKS


Re: non-bulk inserts and tuple routing

2018-02-15 Thread Etsuro Fujita

(2018/02/13 10:12), Amit Langote wrote:

On 2018/02/09 21:20, Etsuro Fujita wrote:

* Please add a brief description about partition_oids to the comments for
this struct.

@@ -91,6 +91,7 @@ typedef struct PartitionTupleRouting
   {
  PartitionDispatch *partition_dispatch_info;
  int num_dispatch;
+   Oid*partition_oids;


Done.


Thanks, but one thing I'm wondering is: do we really need this array?  I
think we could store into PartitionTupleRouting the list of oids returned
by RelationGetPartitionDispatchInfo in ExecSetupPartitionTupleRouting,
instead.  Sorry, I should have commented this in a previous email, but
what do you think about that?


ExecInitPartitionInfo() will have to iterate the list to get the OID of
the partition to be initialized.  Isn't it much cheaper with the array?


Good point!  So I agree with adding that array.


Here are other comments:

o On changes to ExecSetupPartitionTupleRouting:

* This is nitpicking, but it would be better to define partrel and
part_tupdesc within the if (update_rri_index < num_update_rri &&
RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) ==
leaf_oid) block.

-   ResultRelInfo *leaf_part_rri;
+   ResultRelInfo *leaf_part_rri = NULL;
 Relationpartrel = NULL;
 TupleDesc   part_tupdesc;
 Oid leaf_oid = lfirst_oid(cell);


Sure, done.


* Do we need this?  For a leaf partition that is already present in the
subplan resultrels, the partition's indices (if any) would have already
been opened.

+   /*
+    * Open partition indices.  We wouldn't need speculative
+    * insertions though.
+    */
+   if (leaf_part_rri->ri_RelationDesc->rd_rel->relhasindex &&
+       leaf_part_rri->ri_IndexRelationDescs == NULL)
+       ExecOpenIndices(leaf_part_rri, false);


You're right.  Removed the call.


Thanks for the above changes!


Updated patch is attached.


Thanks, here are some minor comments:

o On changes to ExecCleanupTupleRouting:

-   ExecCloseIndices(resultRelInfo);
-   heap_close(resultRelInfo->ri_RelationDesc, NoLock);
+   if (resultRelInfo)
+   {
+   ExecCloseIndices(resultRelInfo);
+   heap_close(resultRelInfo->ri_RelationDesc, NoLock);
+   }

You might check at the top of the loop whether resultRelInfo is NULL, and 
if so do continue.  I think that would save some cycles.
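
Something like the following is what I mean (a simplified sketch only; the
field names follow the patch under discussion):

static void
cleanup_partitions_sketch(PartitionTupleRouting *proute)
{
    int         i;

    for (i = 0; i < proute->num_partitions; i++)
    {
        ResultRelInfo *resultRelInfo = proute->partitions[i];

        /* skip leaf partitions that were never initialized */
        if (resultRelInfo == NULL)
            continue;

        ExecCloseIndices(resultRelInfo);
        heap_close(resultRelInfo->ri_RelationDesc, NoLock);
    }
}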


o On ExecInitPartitionInfo:

+   int firstVarno;
+   RelationfirstResultRel;

My old compiler got "variable may be used uninitialized" warnings.

+   /*
+* Build the RETURNING projection if any for the partition.  Note that
+* we didn't build the returningList for each partition within the
+* planner, but simple translation of the varattnos will suffice.
+* This only occurs for the INSERT case; in the UPDATE/DELETE cases,
+* ExecInitModifyTable() would've initialized this.
+*/

I think the last comment should be the same as for WCO lists: "This only 
occurs for the INSERT case or in the case of UPDATE for which we didn't 
find a result rel above to reuse."


+   /*
+* Initialize result tuple slot and assign its rowtype using the first

+* RETURNING list.  We assume the rest will look the same.
+*/
+   tupDesc = ExecTypeFromTL(returningList, false);
+
+   /* Set up a slot for the output of the RETURNING projection(s) */
+   ExecInitResultTupleSlot(estate, &mtstate->ps);
+   ExecAssignResultType(&mtstate->ps, tupDesc);
+   slot = mtstate->ps.ps_ResultTupleSlot;
+
+   /* Need an econtext too */
+   if (mtstate->ps.ps_ExprContext == NULL)
+   ExecAssignExprContext(estate, &mtstate->ps);
+   econtext = mtstate->ps.ps_ExprContext;

Do we need this initialization?  I think we would already have the slot 
and econtext initialized when we get here.


Other than that, the patch looks good to me.

Sorry for the delay.

Best regards,
Etsuro Fujita



Re: JIT compiling with LLVM v9.1

2018-02-15 Thread Andreas Karlsson

On 02/05/2018 10:44 PM, Pierre Ducroquet wrote:

psqlscanslash.l: In function ‘psql_scan_slash_option’:
psqlscanslash.l:550:8: warning: variable ‘lexresult’ set but not used
[-Wunused-but-set-variable]
int   final_state;
  ^


I'm not sure Andres's patches have anything to do with psql, it's surprising.


I managed to track down the bug and apparently when building with 
--with-llvm the -DNDEBUG option is added to CPPFLAGS, but I am not 
entirely sure what the code in config/llvm.m4 is trying to do in the 
first place.


The two issues I see with what the code does are:

1) Why does config/llvm.m4 modify CPPFLAGS? That affects the building of 
the binaries too which may be done with gcc like in my case. Shouldn't 
it use a LLVM_CPPFLAGS or something?


2) When I build with --with-cassert I expect the assertions to be there, 
both in the binaries and the bitcode. Is that just a bug or is there any 
thought behind this?


Below is the diff in src/Makefile.global between when I run configure 
with --with-llvm or not.


diff src/Makefile.global-nollvm  src/Makefile.global-llvm
78c78
< configure_args =  '--prefix=/home/andreas/dev/postgresql-inst' 
'--enable-tap-tests' '--enable-cassert' '--enable-debug'

---
> configure_args =  '--prefix=/home/andreas/dev/postgresql-inst' 
'--enable-tap-tests' '--enable-cassert' '--enable-debug' '--with-llvm'

190c190
< with_llvm  = no
---
> with_llvm  = yes
227,229c227,229
< LLVM_CONFIG =
< LLVM_BINPATH =
< CLANG =
---
> LLVM_CONFIG = /usr/bin/llvm-config
> LLVM_BINPATH = /usr/lib/llvm-4.0/bin
> CLANG = /usr/bin/clang
238c238
< CPPFLAGS =  -D_GNU_SOURCE
---
> CPPFLAGS = -D__STDC_LIMIT_MACROS -D__STDC_FORMAT_MACROS 
-D__STDC_CONSTANT_MACROS -D_GNU_SOURCE -DNDEBUG 
-I/usr/lib/llvm-4.0/include  -D_GNU_SOURCE

261c261
< LLVM_CXXFLAGS =
---
> LLVM_CXXFLAGS =  -std=c++0x -std=c++11 -fno-exceptions
283c283
< LLVM_LIBS=
---
> LLVM_LIBS= -lLLVM-4.0
297c297
< LDFLAGS +=   -Wl,--as-needed
---
> LDFLAGS +=  -L/usr/lib/llvm-4.0/lib  -Wl,--as-needed




Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-15 Thread Pavan Deolasee
Hi Andreas,

On Sun, Feb 4, 2018 at 5:45 PM, Andreas Seltenreich 
wrote:

>
>
> Attached are testcases that trigger the assertions when run against an
> empty database instead of the one left behind by make installcheck.  The
> "unrecognized node type" one appears to be a known issue according to
> the wiki, so I didn't bother doing the work for this one as well.
>

Thanks for doing those tests. I've just sent v16a version of the patch and
I think it fixes the issues reported so far. Can you please recheck? Please
let me know if there are other issues detected by sqlsmith or otherwise.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Removing useless #include's.

2018-02-15 Thread Kyotaro HORIGUCHI
Hello.

While looking at some patch, just out of curiosity, I checked for
redundant #include's in the source tree (except
contrib). "redundant" here means that a file is included in
another include file nearby.

I found 641 includes that are just removable with no side effects,
with two exceptions.

- src/common/saslprep.c

  A comment that suggests linking wchar.c was placed just above
  '#include "mb/pg_wchar.h"'; it is now just above '#include
  "common/unicode_norm.h"', but the comment seems meant to be used as-is.

 
- backend/storage/lmgr/spin.c

  spin.c and spin.h don't agree on the necessity of pg_sema.h
  when HAVE_SPINLOCKS is defined. I posted a mail about this issue
  separately.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


remove_useless_includes.patch.gz
Description: Binary data


Re: HAVE_SPINLOCKS still requests extra shared memory

2018-02-15 Thread Kyotaro HORIGUCHI
Please ignore this thread and see the following one instead.

https://www.postgresql.org/message-id/20180215.201107.78574525.horiguchi.kyot...@lab.ntt.co.jp

At Thu, 15 Feb 2018 20:07:03 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20180215.200703.242189033.horiguchi.kyot...@lab.ntt.co.jp>
> As in another mail just before, spin.c seems a bit strange
> (without actual harm).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




HAVE_SPINLOCKS still requests extra shared memory

2018-02-15 Thread Kyotaro HORIGUCHI
Hello.

As in another mail just before, spin.c seems a bit strange
(without actual harm).

spin.h doesn't include pg_sema.h when HAVE_SPINLOCKS is defined,
but spin.c always includes it even in that case. The file is
included only to use sizeof(PGSemaphore) to calculate
SpinlockSemaSize as 0.

The code that requires PGSemaphore is deactivated in all other
places when the symbol is defined, so it seems to me that we
ought to refrain from using it there, too. The attached patch does
that.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/storage/lmgr/spin.c b/src/backend/storage/lmgr/spin.c
index 6d59a7f..353a3a9 100644
--- a/src/backend/storage/lmgr/spin.c
+++ b/src/backend/storage/lmgr/spin.c
@@ -22,13 +22,15 @@
  */
 #include "postgres.h"
 
-#include "storage/pg_sema.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
 
 
 #ifndef HAVE_SPINLOCKS
+#define RequiredShmemSize (SpinlockSemas() * sizeof(PGSemaphore))
 PGSemaphore *SpinlockSemaArray;
+#else
+#define RequiredShmemSize 0
 #endif
 
 /*
@@ -38,7 +40,7 @@ PGSemaphore *SpinlockSemaArray;
 Size
 SpinlockSemaSize(void)
 {
-	return SpinlockSemas() * sizeof(PGSemaphore);
+	return RequiredShmemSize;
 }
 
 #ifdef HAVE_SPINLOCKS


Let's remove DSM_IMPL_NONE.

2018-02-15 Thread Kyotaro HORIGUCHI
Hello.

As in the other threads[1][2], we have had several good reasons
to remove DSM_IMPL_NONE in PG11. The attached patch does that.

[1] 
https://www.postgresql.org/message-id/CA+Tgmoa0e23YC3SCwB85Yf5oUTRyirV=sq_rxyxaxabldpp...@mail.gmail.com

[2] 
https://www.postgresql.org/message-id/20180214.103858.02713585.horiguchi.kyot...@lab.ntt.co.jp


It is almost just deletion work, but I see two issues raised so
far.

1. That change can cause regression test failures on some
   buildfarm machines[3].

The reason is that initdb creates a database with
max_connections=10 after a DSM failure caused by semaphore
exhaustion, even though the regression test requires at least 20
connections. At that time it was "fixed" by setting
dynamic_shared_memory_type=none while detecting the number of
usable max_connections and shared buffers[4].  Regardless of
whether the probing succeeded or not, initdb finally creates a
database in which some DSM implementation is activated, since
initdb doesn't choose "none". So the test items for parallel
query actually succeeded.

For the reason above, I believe just increasing the minimum
fallback value of max_connections to 20 will work[5]. In any case,
it is a problem to be fixed that initdb creates an unrunnable
database if it uses the fallback value of max_connections=10.

[3] 
https://www.postgresql.org/message-id/ca+tgmoyhiigrcvssjhmbsebmof2zx_9_9rwd75cwvu99yrd...@mail.gmail.com

[4] Commit: d41ab71712a4457ed39d5471b23949872ac91def

[5] 
https://www.postgresql.org/message-id/20180209.170823.42008365.horiguchi.kyot...@lab.ntt.co.jp


2. The server may choose an unusable DSM implementation during initdb probing.

https://www.postgresql.org/message-id/20180214033248.ga1...@paquier.xyz

> Feel free to.  Be just careful that the connection attempts in
> test_config_settings() should try to use the DSM implementation defined
> by choose_dsm_implementation().

Thank you for the advice. The probing fails with the default
setting if POSIX shared memory somehow fails on a platform like
Linux that is usually expected to have it.  It's enough to choose
the implementation before probing. Some messages from initdb are
transposed as a result.

|   creating directory /home/horiguti/data/data_ndsm ... ok
|   creating subdirectories ... ok
| + selecting dynamic shared memory implementation ... posix
|   selecting default max_connections ... 100
|   selecting default shared_buffers ... 128MB
| - selecting dynamic shared memory implementation ... posix

Even though probing can end in failure in the case of [3], it
will not be a problem with the patch [4].

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 44411ee17e881d1ee42fb91106d62aa97a7e45c9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 15 Feb 2018 11:22:50 +0900
Subject: [PATCH] Remove dynamic_shared_memory_type=none

Nowadays PostgreSQL on all supported platforms offers some kind of
dynamic shared memory feature. Allowing "none" hinders us from using
DSM for core features.  Just removing the option leads to intermittent
regression test failures on some platforms due to insufficient
max_connections at initdb time. This patch requires the patch that
increases the minimum fallback value of max_connections in initdb.
---
 doc/src/sgml/config.sgml  |  6 +++---
 src/backend/access/transam/parallel.c |  7 ---
 src/backend/optimizer/plan/planner.c  |  4 +---
 src/backend/storage/ipc/dsm.c | 15 ---
 src/backend/storage/ipc/dsm_impl.c|  3 ---
 src/backend/utils/misc/postgresql.conf.sample |  1 -
 src/bin/initdb/initdb.c   | 21 ++---
 src/include/storage/dsm_impl.h|  1 -
 8 files changed, 18 insertions(+), 40 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c45979d..c2d86b3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1601,9 +1601,9 @@ include_dir 'conf.d'
 should use.  Possible values are posix (for POSIX shared
 memory allocated using shm_open), sysv
 (for System V shared memory allocated via shmget),
-windows (for Windows shared memory), mmap
-(to simulate shared memory using memory-mapped files stored in the
-data directory), and none (to disable this feature).
+windows (for Windows shared memory),
+and mmap (to simulate shared memory using
+memory-mapped files stored in the data directory).
 Not all values are supported on all platforms; the first supported
 option is the default for that platform.  The use of the
 mmap option, which is not the default on any platform,
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 9d4efc0..4d27241 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -162,13 +162,6 @@ 

Re: [HACKERS] why not parallel seq scan for slow functions

2018-02-15 Thread Ashutosh Bapat
I happened to look at the patch for something else. But here are some
comments. If any of those have been already discussed, please feel
free to ignore. I have gone through the thread cursorily, so I might
have missed something.

In grouping_planner() we call query_planner() first which builds the
join tree and creates paths, then calculate the target for scan/join
rel which is applied on the topmost scan join rel. I am wondering
whether we can reverse this order to calculate the target list of
scan/join relation and pass it to standard_join_search() (or the hook)
through query_planner(). standard_join_search() would then set this as
reltarget of the topmost relation and every path created for it will
have that target, applying projection if needed. This way we avoid
calling generate_gather_path() in two places. Right now
generate_gather_path() seems to be the only thing benefitting from
this, but what about FDWs and custom paths whose costs may change when
the targetlist changes? For custom paths I am considering GPU
optimization paths. Also this might address Tom's worry: "But having
to do that in a normal code path suggests that something's not right
about the design ..."

Here are some comments on the patch.

+/*
+ * Except for the topmost scan/join rel, consider gathering
+ * partial paths.  We'll do the same for the topmost scan/join
This function only works on join relations. Mentioning scan rel is confusing.

index 6e842f9..5206da7 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -481,14 +481,21 @@ set_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 }

+ *
+ * Also, if this is the topmost scan/join rel (that is, the only baserel),
+ * we postpone this until the final scan/join targelist is available (see

Mentioning join rel here is confusing since we deal with base relations here.

+bms_membership(root->all_baserels) != BMS_SINGLETON)

set_tablesample_rel_pathlist() is also using this method to decide whether
there are any joins in the query. Maybe macro-ize this and use that macro in
these two places? (See the sketch below.)
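
Something along these lines, say (the name is hypothetical; bms_membership()
comes from nodes/bitmapset.h):

/* true when the query has exactly one base relation, i.e. no joins */
#define IS_SINGLE_BASEREL(root) \
    (bms_membership((root)->all_baserels) == BMS_SINGLETON)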

- * for the specified relation.  (Otherwise, add_partial_path might delete a
+ * for the specified relation. (Otherwise, add_partial_path might delete a

Unrelated change?

+/* Add projection step if needed */
+if (target && simple_gather_path->pathtarget != target)

If the target was copied someplace, this test will fail. Probably we want to
check the contents of the PathTarget structure? Right now copy_pathtarget() is not
called from many places and all those places modify the copied target. So this
isn't a problem. But we can't guarantee that in the future. Similar comment for
gather_merge path creation.

+simple_gather_path = apply_projection_to_path(root,
+  rel,
+  simple_gather_path,
+  target);
+

Why don't we incorporate those changes in create_gather_path() by passing it
the projection target instead of rel->reltarget? Similar comment for
gather_merge path creation.

+/*
+ * Except for the topmost scan/join rel, consider gathering
+ * partial paths.  We'll do the same for the topmost scan/join rel

Mentioning scan rel is confusing here.


 deallocate tenk1_count;
+explain (costs off) select ten, costly_func(ten) from tenk1;

verbose output will show that the parallel seq scan's targetlist has
costly_func() in it. Isn't that what we want to test here?


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: reorganizing partitioning code

2018-02-15 Thread Amit Langote
Added to CF here: https://commitfest.postgresql.org/17/1520/

Thanks,
Amit




Re: Transform for pl/perl

2018-02-15 Thread Anthony Bykov
On Wed, 31 Jan 2018 13:36:22 +0300
Arthur Zakirov  wrote:

> I've noticed a possible bug:
> 
> > +   /* json key in v */
> > +   key = pstrdup(v.val.string.val);
> > +   keyLength = v.val.string.len;
> > +   JsonbIteratorNext(&it, &v, true);
> 
> I think it is worth to use pnstrdup() here, because v.val.string.val
> is not necessarily null-terminated as the comment says:
> 
> > struct JsonbValue
> > ...
> > struct
> > {
> > int len;
> > char   *val;/* Not
> > necessarily null-terminated */ }
> > string; /* String primitive type */  
> 
> Consider an example:
> 
> =# CREATE FUNCTION testSVToJsonb3(val jsonb) RETURNS jsonb
> LANGUAGE plperl
> TRANSFORM FOR TYPE jsonb
> AS $$
> return $_->{"1"};
> $$;
> 
> =# SELECT testSVToJsonb3('{"1":{"2":[3,4,5]},"2":3}');
>  testsvtojsonb3 
> 
>  (null)
> 
> But my perl isn't good, so the example maybe isn't good too.
> 

Hello.
Glad you've noticed this. Thank you.

I've fixed this possible bug in the new patch, but your example
can't check that.

The problem is that $_ is a pointer to the array of incoming
parameters. So, if you return $_[0]->{"1"} instead of $_->{"1"}, the
test will return exactly the expected output: {"2":[3,4,5]}

I've tried to test "chop" and even "=~ s/\0$//", but that didn't catch
the problem.
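
Presumably the fixed fragment looks something like this (a sketch only,
mirroring the quoted code above; pnstrdup() copies exactly len bytes and
appends the terminating NUL):

/* json key in v; val is not necessarily null-terminated */
key = pnstrdup(v.val.string.val, v.val.string.len);
keyLength = v.val.string.len;
JsonbIteratorNext(&it, &v, true);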

--
Anthony Bykov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/Makefile b/contrib/Makefile
index 8046ca4..53d44fe 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -75,9 +75,9 @@ ALWAYS_SUBDIRS += sepgsql
 endif
 
 ifeq ($(with_perl),yes)
-SUBDIRS += hstore_plperl
+SUBDIRS += hstore_plperl jsonb_plperl
 else
-ALWAYS_SUBDIRS += hstore_plperl
+ALWAYS_SUBDIRS += hstore_plperl jsonb_plperl
 endif
 
 ifeq ($(with_python),yes)
diff --git a/contrib/jsonb_plperl/Makefile b/contrib/jsonb_plperl/Makefile
new file mode 100644
index 000..cd86553
--- /dev/null
+++ b/contrib/jsonb_plperl/Makefile
@@ -0,0 +1,40 @@
+# contrib/jsonb_plperl/Makefile
+
+MODULE_big = jsonb_plperl
+OBJS = jsonb_plperl.o $(WIN32RES)
+PGFILEDESC = "jsonb_plperl - jsonb transform for plperl"
+
+PG_CPPFLAGS = -I$(top_srcdir)/src/pl/plperl
+
+EXTENSION = jsonb_plperlu jsonb_plperl
+DATA = jsonb_plperlu--1.0.sql jsonb_plperl--1.0.sql
+
+REGRESS = jsonb_plperl jsonb_plperl_relocatability jsonb_plperlu jsonb_plperlu_relocatability
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/jsonb_plperl
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We must link libperl explicitly
+ifeq ($(PORTNAME), win32)
+# these settings are the same as for plperl
+override CPPFLAGS += -DPLPERL_HAVE_UID_GID -Wno-comment
+# ... see silliness in plperl Makefile ...
+SHLIB_LINK += $(sort $(wildcard ../../src/pl/plperl/libperl*.a))
+else
+rpathdir = $(perl_archlibexp)/CORE
+SHLIB_LINK += $(perl_embed_ldflags)
+endif
+
+# As with plperl we need to make sure that the CORE directory is included
+# last, probably because it sometimes contains some header files with names
+# that clash with some of ours, or with some that we include, notably on
+# Windows.
+override CPPFLAGS := $(CPPFLAGS) $(perl_embed_ccflags) -I$(perl_archlibexp)/CORE
diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl.out b/contrib/jsonb_plperl/expected/jsonb_plperl.out
new file mode 100644
index 000..152e62d
--- /dev/null
+++ b/contrib/jsonb_plperl/expected/jsonb_plperl.out
@@ -0,0 +1,243 @@
+CREATE EXTENSION jsonb_plperl CASCADE;
+NOTICE:  installing required extension "plperl"
+-- test hash -> jsonb
+CREATE FUNCTION testHVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = {a => 1, b => 'boo', c => undef};
+return $val;
+$$;
+SELECT testHVToJsonb();
+  testhvtojsonb  
+-
+ {"a": 1, "b": "boo", "c": null}
+(1 row)
+
+-- test array -> jsonb
+CREATE FUNCTION testAVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = [{a => 1, b => 'boo', c => undef}, {d => 2}];
+return $val;
+$$;
+SELECT testAVToJsonb();
+testavtojsonb
+-
+ [{"a": 1, "b": "boo", "c": null}, {"d": 2}]
+(1 row)
+
+-- test scalar -> jsonb
+CREATE FUNCTION testSVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = 1;
+return $val;
+$$;
+SELECT testSVToJsonb();
+ testsvtojsonb 
+---
+ 1
+(1 row)
+
+-- test blessed scalar -> jsonb
+CREATE FUNCTION testBlessedToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+my $class = shift;
+my $tmp = { a=>"a", 1=>"1" };

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2018-02-15 Thread Amit Langote
Hi Ashutosh.

On 2018/02/09 14:27, Ashutosh Bapat wrote:
> Here's updated patch set with those comments added.

I looked at patches 0002 and 0003.

In 0002:

+ * In case of hash partition we store modulus and remainder in datums array

In case of hash partitioned table?

+ * which has the same data type irrespective of the number of partition keys
+ * and their data types. Hence we can compare the hash bound collection without
+ * any partition key specific information.

"has the same data type" sounds like it means a Postgres data type,
whereas I think you mean that they are simple int32 values, so we don't
need any PartitionKey information to compare them.
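
For illustration, comparing two hash bounds then needs nothing beyond the two
(modulus, remainder) pairs; a self-contained sketch (PostgreSQL's own
comparator of this shape is partition_hbound_cmp()):

#include "postgres.h"

/* compare hash partition bounds: by modulus first, then by remainder */
static int32
hash_bound_cmp(int32 modulus1, int32 remainder1,
               int32 modulus2, int32 remainder2)
{
    if (modulus1 != modulus2)
        return (modulus1 < modulus2) ? -1 : 1;
    if (remainder1 != remainder2)
        return (remainder1 < remainder2) ? -1 : 1;
    return 0;
}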

In 0003:

A portion of code in both partition_range_bounds_merge(),
partition_list_bounds_merge(), and merge_null_partitions() has an extra
semi-colon at the end of a line starting with else if:

    if (default_index < 0)
        default_index = merged_index;
    else if(default_index != merged_index);
    {

which emits warnings like this:

partition.c: In function ‘partition_range_bounds_merge’:
partition.c:4192:11: warning: this ‘if’ clause does not guard...
[-Wmisleading-indentation]
  else if(default_index != merged_index);

   ^~
partition.c: In function ‘partition_list_bounds_merge’:
partition.c:4261:11: warning: this ‘if’ clause does not guard...
[-Wmisleading-indentation]
  else if(default_index != merged_index);
   ^~
Also, I get this warning.

partition.c:3955:1: warning: ‘is_next_range_continuous’ defined but not
used [-Wunused-function]

I'm trying to understand the optimizer side of this patch.  In your commit
message, I read:

This commit allows partition-wise join to be applied under
following conditions

1. the partition bounds of joining relations are such that rows from
given partition on one side join can join with rows from maximum one
partition on the other side i.e. bounds of a given partition on one
side match/overlap with those of maximum one partition on the other
side. If the mapping happens to be m:n where m > 1 or n > 1, we have
to gang multiple partition relations together into a single relation.
This means that we have to add simple relations during join
processing, something which is not supported right now.  Also, in such
a case, different pairs of joining relations can produce different
partition bounds for the same join relation, which again is not
supported right now.


So, is the currently allowed case (partition bounds on both sides match
exactly) a special case of this new feature which tries to match
partitions in a more generalized manner?  I see that this patch removes
the partition_bound_equal(outer_rel->boundinfo, inner_rel->boundinfo)
check in build_joinrel_partition_info() in favor of reconciling any
differences in the representation of the partition bounds by calling
partition_bounds_merge() from try_partition_wise_join().

2. For every partition on outer side that can contribute to the result
of an OUTER side, there exists at least one (taken along with item 1,
it means exactly one)  matching partition on the inner side. To
support partition-wise join when the inner matching partition doesn't
exist, we have to add a dummy base relation corresponding to the
non-existent inner partition. We don't have support for adding base
relations during join processing.

Sorry, but I'm a bit confused by the last sentence; does it mean we're not
able to allow partition-wise join to happen in this case?  But this is in
the list of the new cases that the patch makes partition-wise join
happen for.


Looking at the code changes under src/backend/optimizer:

+else
+{
+Assert(partition_bounds_equal(part_scheme->partnatts,
+  part_scheme->parttyplen,
+  part_scheme->parttypbyval,
+  join_boundinfo, joinrel->boundinfo));

IIUC, this else block would run when try_partition_wise_join() is called
again for the same pair of relations.

+/*
+ * Every pair of joining relations should result in the same number
+ * of child-joins.
+ */

Sorry if I'm misreading this, but does it mean: a given pair of joining
relations should always result in the same number of (set of, too?)
child-joins?

In the new comment in build_joinrel_partition_info():

+ * Because of restrictions in partition_bounds_merge(), not every pair of
+ * joining relation

joining relations


I will try to hop into partition_bounds_merge() next...

Thanks,
Amit




Re: JIT compiling with LLVM v10.1

2018-02-15 Thread Konstantin Knizhnik



On 14.02.2018 21:17, Andres Freund wrote:

Hi,

On 2018-02-07 06:54:05 -0800, Andres Freund wrote:

I've pushed v10.0. The big (and pretty painful to make) change is that
now all the LLVM specific code lives in src/backend/jit/llvm, which is
built as a shared library which is loaded on demand.

The layout is now as follows:

src/backend/jit/jit.c:
 Part of JITing always linked into the server. Supports loading the
 LLVM using JIT library.

src/backend/jit/llvm/
Infrastructure:
  llvmjit.c:
 General code generation and optimization infrastructure
  llvmjit_error.cpp, llvmjit_wrap.cpp:
 Error / backward compat wrappers
  llvmjit_inline.cpp:
 Cross module inlining support
Code-Gen:
   llvmjit_expr.c
 Expression compilation
   llvmjit_deform.c
 Deform compilation

I've pushed a revised version that hopefully should address Jeff's
wish/need of being able to experiment with this out of core. There's now
a "jit_provider" PGC_POSTMASTER GUC that's by default set to
"llvmjit". llvmjit.so is the .so implementing JIT using LLVM. It fills a
set of callbacks via
extern void _PG_jit_provider_init(JitProviderCallbacks *cb);
which can also be implemented by any other potential provider.

The other two biggest changes are that I've added a README
https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=blob;f=src/backend/jit/README;hb=jit
and that I've revised the configure support so it does more error
checks, and moved it into config/llvm.m4.

There's a larger smattering of small changes too.

I'm pretty happy with how the separation of core / shlib looks now. I'm
planning to work on cleaning and then pushing some of the preliminary
patches (fixed tupledesc, grouping) over the next few days.

Greetings,

Andres Freund



I have made some more experiments with the efficiency of JIT-ing of tuple 
deforming and I want to share the results (I hope they will be interesting).
It is a well-known fact that for warm data Postgres spends most of the time 
of sequential scan queries in deforming tuples (17% in the case of TPC-H Q1).
Postgres tries to optimize access to the tuple by caching fixed-size 
offsets to the fields whenever possible and loading attributes on demand.
It is also a well-known recommendation to put fixed-size, non-null, 
frequently used attributes at the beginning of a table's attribute list to 
make this optimization work more efficiently.
The code of heap_deform_tuple shows that the first NULL value 
will switch it to "slow" mode:


for (attnum = 0; attnum < natts; attnum++)
    {
        Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum);

        if (hasnulls && att_isnull(attnum, bp))
        {
            values[attnum] = (Datum) 0;
            isnull[attnum] = true;
            slow = true;        /* can't use attcacheoff anymore */
            continue;
        }


I tried to investigate the importance of this optimization and what the 
actual penalty of "slow" mode is.
At the same time I wanted to understand how JIT helps to speed up tuple 
deforming.


I have populated with data three tables:

create table t1(id integer primary key,c1 integer,c2 integer,c3 
integer,c4 integer,c5 integer,c6 integer,c7 integer,c8 integer,c9 integer);
create table t2(id integer primary key,c1 integer,c2 integer,c3 
integer,c4 integer,c5 integer,c6 integer,c7 integer,c8 integer,c9 integer);
create table t3(id integer primary key,c1 integer not null,c2 integer 
not null,c3 integer not null,c4 integer not null,c5 integer not null,c6 
integer not null,c7 integer not null,c8 integer not null,c9 integer not 
null);
insert into t1 (id,c1,c2,c3,c4,c5,c6,c7,c8) values 
(generate_series(1,1000),0,0,0,0,0,0,0,0);
insert into t2 (id,c2,c3,c4,c5,c6,c7,c8,c9) values 
(generate_series(1,1000),0,0,0,0,0,0,0,0);
insert into t3 (id,c1,c2,c3,c4,c5,c6,c7,c8,c9) values 
(generate_series(1,1000),0,0,0,0,0,0,0,0,0);

vacuum analyze t1;
vacuum analyze t2;
vacuum analyze t3;

t1 contains NULL in the last column (c9), t2 in the first column (c1), and 
t3 has all attributes declared as not-null (and JIT can use this knowledge to 
generate more efficient deforming code).
The whole data set is held in memory (shared buffer size is greater than the 
database size) and I intentionally switched off parallel execution to make 
the results more deterministic.

I run two queries calculating aggregates on one/all not-null fields:

select sum(c8) from t*;
select sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), sum(c7), sum(c8) 
from t*;


As expected, 35% of the time was spent in heap_deform_tuple.
But the results (msec) were slightly confusing and unexpected:

select sum(c8) from t*;


        w/o JIT    with JIT
t1      763        563
t2      772        570
t3      776        592


select sum(c2), sum(c3), sum(c4), sum(c5), sum(c6), sum(c7), sum(c8) 
from t*;



        w/o JIT    with JIT
t1      1239       742
t2      1233       747
t3      1255       803


I repeated each query 10 times and took the minimal time (I think that it 

Re: Cached/global query plans, autopreparation

2018-02-15 Thread Konstantin Knizhnik



On 13.02.2018 20:13, Shay Rojansky wrote:

Hi all,

Was wondering if anyone has a reaction to my email below about 
statement preparation, was it too long? :)


(and sorry for top-posting)

On Tue, Feb 6, 2018 at 9:27 PM, Shay Rojansky wrote:


Hi all.

Various versions of having PostgreSQL caching and/or autopreparing
statement plans have been discussed

(https://www.postgresql.org/message-id/op.t9ggb3wacigqcu%40apollo13.peufeu.com,
https://www.postgresql.org/message-id/8e76d8fc-8b8c-14bd-d4d1-e9cf193a74f5%40postgrespro.ru),
without clear conclusions or even an agreement on what might be
worthwhile to implement. I wanted to bring this up again from a
PostgreSQL driver maintainer's perspective (I'm the owner of
Npgsql, the open source .NET driver), apologies in advance if I'm
repeating things or I've missed crucial information. Below I'll
describe three relevant issues and what I've done to deal with them.

When the same statement is rerun, preparing it has a very
significant performance boost. However, in short-lived connection
scenarios it's frequently not possible to benefit from this -
think of a typical webapp which allocates a connection from a
pool, run a query and then return the connection. To make sure
prepared statements are used, Npgsql's connection pool doesn't
send DISCARD ALL when a connection is returned (to avoid wiping
out the connections), and maintains an internal table mapping SQL
(and parameter types) to a PostgreSQL statement name. The next
time the application attempts to prepare the same SQL, the
prepared statement is found in the table and no preparation needs
to occur. This means that prepared statements persist across
pooled connection open/close, and are never discarded unless the
user uses a specific API. While this works, the disadvantages are
that:
1. This kind of mechanism needs to be implemented again and again,
in each driver;
2. It relies on Npgsql's internal pooling, which can track
persistent prepared statements on physical connections. If an
external pool is used (e.g. pgpool), this isn't really possible;
3. It complicates resetting the session state (instead of DISCARD
ALL, a combination of all other reset commands except DEALLOCATE
ALL needs to be sent). This is minor.

The second issue is that many applications don't work directly
against the database API (ADO.NET in .NET, JDBC
in Java). If any sort of O/RM or additional layer is used, there's
a good chance that that layer doesn't prepare in any way, and
indeed hides your access to the database API's preparation method.
Two examples from the .NET world is dapper (a very popular
micro-O/RM) and Entity Framework. In order to provide the best
possible performance in these scenarios, Npgsql has an opt-in
feature whereby it tracks how many times a given statement was
executed, and once it passes a certain threshold automatically
prepares it. An LRU cache is then used to determine which prepared
statements to discard, to avoid explosion. In effect, statement
auto-preparation is implemented in the driver. I know that the
JDBC driver also implements such a mechanism (it was actually the
inspiration for the Npgsql feature). The issues with this are:

1. As above, this has to be implemented by every driver (and is
quite complex to do well)
2. There's a possible missed opportunity in having a single plan
on the server, as each connection has its own (the "global plan"
option). Many apps out there send the same statements across many
connections so this seems relevant - but I don't know if the gains
outweigh the contention impact in PostgreSQL.

Finally, since quite a few (most?) other databases include
autopreparation (SQL Server, Oracle...), users porting their
applications - which don't explicitly prepare - experience a big
performance drop. It can rightly be said that porting an
application across databases isn't a trivial task and that
adjustments need to be made, but from experience I can say that
PostgreSQL is losing quite a few users to this.

The above issues could be helped by having PostgreSQL cache on its
side (especially the second issue, which is the most important).
Ideally, any adopted solution would be transparent and not require
any modification to applications. It would also not impact
explicitly-prepared statements in any way.

Note that I'm not arguing for any specific implementation on the
PostgreSQL side (e.g. global or not), but just describing a need
and hoping to restart a conversation