Re: SEARCH and CYCLE clauses

2020-10-09 Thread Pavel Stehule
Hi

On Fri, Oct 9, 2020 at 12:17 PM Pavel Stehule 
wrote:

>
>
> On Fri, Oct 9, 2020 at 11:40 AM Peter Eisentraut <
> peter.eisentr...@2ndquadrant.com> wrote:
>
>> On 2020-09-22 20:29, Pavel Stehule wrote:
>> > The result is correct. When I tried to use UNION instead of UNION ALL,
>> > pg crashed
>>
>> I fixed the crash, but UNION [DISTINCT] won't actually work here because
>> row/record types are not hashable.  I'm leaving the partial support in,
>> but I'm documenting it as currently not supported.
>>
>
>  I think UNION is a common solution against cycles, so missing support
> for this specific case is not nice. How much work is needed for hashing
> rows? It should not be too much code.
>
>
>> > it looks like the USING clause in cycle detection is not supported by DB2 and
>> > Oracle - the examples from these databases don't work on PG without
>> > modifications
>>
>> Yeah, the path clause is actually not necessary from a user's
>> perspective, but it's required for internal bookkeeping.  We could
>> perhaps come up with a mechanism to make it invisible coming out of the
>> CTE (maybe give the CTE a target list internally), but that seems like a
>> separate project.
>>
>> The attached patch fixes the issues you have reported (also the view
>> issue from the other email).  I have also moved the whole rewrite
>> support to a new file to not blow up rewriteHandler.c so much.
>>
>> --
>> Peter Eisentraut  http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>
This patch is based on transforming the CYCLE and SEARCH clauses into
specific expressions - this is in agreement with ANSI SQL.

There is no problem with compilation
Nobody had objections in the discussion
There are enough regression tests and documentation
check-world passed
doc build passed

I'll mark this patch as ready for committer

Possible enhancements for this feature (can be done in later steps)

1. support UNION DISTINCT
2. better compatibility with Oracle and DB2 (USING clause can be optional)

Regards

Pavel


Re: [PATCH] Add `truncate` option to subscription commands

2020-10-09 Thread Amit Kapila
On Sat, Oct 10, 2020 at 12:24 AM David Christensen  wrote:
>
> -hackers,
>
> Enclosed find a patch to add a “truncate” option to subscription commands.
>
> When adding new tables to a subscription (either via `CREATE SUBSCRIPTION` or 
> `REFRESH PUBLICATION`), tables on the target which are being newly subscribed 
> will be truncated before the data copy step.  This saves explicit 
> coordination of a manual `TRUNCATE` on the target tables and allows the 
> results of the initial data sync to be the same as on the publisher at the 
> time of sync.
>

So IIUC, this will either truncate all the tables for a particular
subscription or none of them?  Is it possible that the user wants only
some of those tables to be truncated? That made me wonder what exactly
made you propose this feature. Basically, is it from a user complaint, or
is it some optimization that you think will be helpful to users?

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 7:32 PM Greg Nancarrow  wrote:
>
> On Fri, Oct 9, 2020 at 11:57 PM Amit Kapila  wrote:
> >
> > >
> > > Sure but in that case 'top_plan->parallel_safe' should be false and it
> > > should stick Gather node atop Insert node.
> > >
> >
> > /should/should not.
> >
>
> OK, for the minimal patch, just allowing INSERT with parallel SELECT,
> you're right, neither of those additional "commandType == CMD_SELECT"
> checks are needed, so I'll remove them.
>

Okay, that makes sense.

> (In the main patch, I think
> the first check can be removed, once the XID handling is fixed; the
> second check is definitely needed though).
>

Okay, then move that check but please do add some comments to state the reason.

-- 
With Regards,
Amit Kapila.




Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread David G. Johnston
On Fri, Oct 9, 2020 at 6:41 PM Ranier Vilela  wrote:

> The problem is not only in nodeIncrementalSort.c, but in several others
> too, where people are using TupIsNull with ExecCopySlot.
> I would call this a design flaw.
> If (TupIsNull)
>  ExecCopySlot
>
> The callers, think they are using TupIsNotNullAndEmpty.
> If (TupIsNotNullAndEmpty)
>  ExecCopySlot
>

IMO both names are problematic - too data-value centric, not semantic.
Renaming to TupIsValid and negating the existing tests would help to at
least clear that part up.  Then, things operating on invalid tuples would
be expected to know about both representations.  In the case of
ExecCopySlot there is nothing it can do with a null representation of an
invalid tuple, so it would have to fail if presented one.  An assertion
seems sufficient.
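
A minimal sketch of that suggestion, with hypothetical names (TupIsValid and
ExecCopySlotChecked are illustrative only; TTS_EMPTY, Assert and tts_ops are
the existing executor definitions):

#include "postgres.h"
#include "executor/tuptable.h"

/* Hypothetical name: true only for a non-NULL, non-empty slot. */
#define TupIsValid(slot) ((slot) != NULL && !TTS_EMPTY(slot))

/* Sketch of ExecCopySlot with an explicit check on the destination slot. */
static inline TupleTableSlot *
ExecCopySlotChecked(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	Assert(dstslot != NULL);	/* a NULL destination cannot be copied into */
	Assert(!TTS_EMPTY(srcslot));
	Assert(srcslot != dstslot);

	dstslot->tts_ops->copyslot(dstslot, srcslot);

	return dstslot;
}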

David J.


Use list_delete_xxxcell O(1) instead of list_delete_ptr O(N) in some places

2020-10-09 Thread Hou, Zhijie
Hi

I found some places in the code where calls to list_delete_ptr can be
replaced by list_delete_xxxcell, which can be faster.

diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index db54a6b..61ef7c8 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -1005,8 +1005,8 @@ sort_inner_and_outer(PlannerInfo *root,
 		/* Make a pathkey list with this guy first */
 		if (l != list_head(all_pathkeys))
 			outerkeys = lcons(front_pathkey,
-							  list_delete_ptr(list_copy(all_pathkeys),
-											  front_pathkey));
+							  list_delete_nth_cell(list_copy(all_pathkeys),
+												   foreach_current_index(l)));
 		else
 			outerkeys = all_pathkeys;	/* no work at first one... */

diff --git a/src/backend/rewrite/rewriteHandler.c b/src/backend/rewrite/rewriteHandler.c
index fe777c3..d0f15b8 100644
--- a/src/backend/rewrite/rewriteHandler.c
+++ b/src/backend/rewrite/rewriteHandler.c
@@ -650,7 +650,7 @@ adjustJoinTreeList(Query *parsetree, bool removert, int rt_index)
 		if (IsA(rtr, RangeTblRef) &&
 			rtr->rtindex == rt_index)
 		{
-			newjointree = list_delete_ptr(newjointree, rtr);
+			newjointree = list_delete_cell(newjointree, l);
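
The general pattern behind these changes, sketched with hypothetical names
(delete_first_match is illustrative only; the List calls are the existing
pg_list.h API):

#include "postgres.h"
#include "nodes/pg_list.h"

/*
 * While iterating with foreach(), the current cell's index is already known
 * via foreach_current_index(), so the O(1) list_delete_nth_cell() can replace
 * the O(N) pointer search done by list_delete_ptr().
 */
static List *
delete_first_match(List *list, void *target)
{
	ListCell   *lc;

	foreach(lc, list)
	{
		if (lfirst(lc) == target)
		{
			list = list_delete_nth_cell(list, foreach_current_index(lc));
			break;				/* cells may have moved; stop iterating */
		}
	}

	return list;
}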


Best regards,
houzj




0001-Use-list_delete_xxxcell-instead-of-list_delete_ptr.patch
Description: 0001-Use-list_delete_xxxcell-instead-of-list_delete_ptr.patch


Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Ranier Vilela
On Fri, Oct 9, 2020 at 7:45 PM Tom Lane  wrote:

> Tomas Vondra  writes:
> > My (admittedly very subjective) opinion is that it looks much worse. The
> > TupIsNull is pretty self-descriptive, unlike this proposed code.
>
> +1
>
> > That could be fixed by defining a new macro, perhaps something like
> > SlotIsEmpty() or so. But as was already explained, Incremental Sort
> > can't actually have a NULL slot here, so it makes no difference there.
> > And in the other places we can't just mechanically replace the macros
> > because it'd quite likely silently hide pre-existing bugs.
>
> IME, there are basically two use-cases for TupIsNull in the executor:
>
> 1. Checking whether a lower-level plan node has returned an actual
> tuple or an EOF indicator.  In current usage, both parts of the
> TupIsNull test are needed here, because some node types like to
> return NULL pointers while others do something like
> "return ExecClearTuple(myslot)".
>
> 2. Manipulating a locally-managed slot.  In just about every case
> of this sort, the slot is created during the node Init function,
> so that the NULL test in TupIsNull is unnecessary and what we are
> really interested in is the empty-or-not state of the slot.
>
> Thus, Ranier's concern would be valid if a node ever did anything
> with a returned-from-lower-level slot after failing the TupIsNull
> check on it.  But there's really no reason to do so, and furthermore
> doing so would be a logic bug in itself.  (Something like ExecCopySlot
> into the slot, for example, is flat out wrong, because an upper level
> node is *never* entitled to scribble on the output slot of a lower-level
> node.)  So I seriously, seriously doubt that there are any live bugs
> of this ilk.
>
> In principle we could invent SlotIsEmpty() and apply it in use
> cases of type 2, but I don't really think that'd be a productive
> activity.  In return for saving a few cycles we'd have a nontrivial
> risk of new bugs from using the wrong macro for the case at hand.
>
> I do wonder whether we should try to simplify the inter-node
> API by allowing only one of the two cases for EOF indicators.
> Not convinced it's worth troubling over, though.
>
The problem is not only in nodeIncrementalSort.c, but in several other
places too, where people are using TupIsNull together with ExecCopySlot.
I would call this a design flaw.
If (TupIsNull)
 ExecCopySlot

The callers think they are using TupIsNotNullAndEmpty:
If (TupIsNotNullAndEmpty)
 ExecCopySlot

regards,
Ranier Vilela


Re: pg_upgrade: fail early if a tablespace dir already exists for new cluster version

2020-10-09 Thread Bruce Momjian
On Fri, Oct  9, 2020 at 02:23:10PM -0500, Justin Pryzby wrote:
> In my local branch, I had revised this comment to say:
> 
> + * Note, v8.4 has no tablespace_suffix, which is fine so long as the version
> + * being upgraded *to* has a suffix, since it's not allowed to pg_upgrade from
> + * a version to the same version if tablespaces are in use.

OK, updated patch attached.  Also, from your original patch, I didn't
need to call canonicalize_path() since we are not comparing paths, and I
didn't need to include common/relpath.h.  I also renamed a variable.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 168058a873..df4274d89a 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -27,6 +27,7 @@ static void check_for_tables_with_oids(ClusterInfo *cluster);
 static void check_for_reg_data_type_usage(ClusterInfo *cluster);
 static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
 static void check_for_pg_role_prefix(ClusterInfo *cluster);
+static void check_for_new_tablespace_dir(ClusterInfo *new_cluster);
 static char *get_canonical_locale_name(int category, const char *locale);
 
 
@@ -187,6 +188,8 @@ check_new_cluster(void)
	check_is_install_user(&new_cluster);
 
	check_for_prepared_transactions(&new_cluster);
+
+	check_for_new_tablespace_dir(&new_cluster);
 }
 
 
@@ -527,6 +530,43 @@ create_script_for_cluster_analyze(char **analyze_script_file_name)
 }
 
 
+/*
+ * A previous run of pg_upgrade might have failed and the new cluster
+ * directory recreated, but they might have forgotten to remove
+ * the new cluster's tablespace directories.  Therefore, check that
+ * new cluster tablespace directories do not already exist.  If
+ * they do, it would cause an error while restoring global objects.
+ * This allows the failure to be detected at check time, rather than
+ * during schema restore.
+ *
+ * Note, v8.4 has no tablespace_suffix, which is fine so long as the
+ * version being upgraded *to* has a suffix, since it's not allowed
+ * to pg_upgrade from a version to the same version if tablespaces are
+ * in use.
+ */
+static void
+check_for_new_tablespace_dir(ClusterInfo *new_cluster)
+{
+	char	new_tablespace_dir[MAXPGPATH];
+
+ 	prep_status("Checking for new cluster tablespace directories");
+
+	for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+	{
+		struct stat statbuf;
+
+		snprintf(new_tablespace_dir, MAXPGPATH, "%s%s",
+os_info.old_tablespaces[tblnum],
+new_cluster->tablespace_suffix);
+
+		if (stat(new_tablespace_dir, &statbuf) == 0 || errno != ENOENT)
+			pg_fatal("new cluster tablespace directory already exists: \"%s\"\n",
+	 new_tablespace_dir);
+	}
+
+	check_ok();
+}
+
 /*
  * create_script_for_old_cluster_deletion()
  *


Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Tom Lane
Tomas Vondra  writes:
> My (admittedly very subjective) opinion is that it looks much worse. The
> TupIsNull is pretty self-descriptive, unlike this proposed code.

+1

> That could be fixed by defining a new macro, perhaps something like
> SlotIsEmpty() or so. But as was already explained, Incremental Sort
> can't actually have a NULL slot here, so it makes no difference there.
> And in the other places we can't just mechanically replace the macros
> because it'd quite likely silently hide pre-existing bugs.

IME, there are basically two use-cases for TupIsNull in the executor:

1. Checking whether a lower-level plan node has returned an actual
tuple or an EOF indicator.  In current usage, both parts of the
TupIsNull test are needed here, because some node types like to
return NULL pointers while others do something like
"return ExecClearTuple(myslot)".

2. Manipulating a locally-managed slot.  In just about every case
of this sort, the slot is created during the node Init function,
so that the NULL test in TupIsNull is unnecessary and what we are
really interested in is the empty-or-not state of the slot.
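
A short sketch of those two patterns (the function and variable names here
are hypothetical; ExecProcNode, TupIsNull, TTS_EMPTY and ExecCopySlot are the
existing executor definitions):

#include "postgres.h"
#include "executor/executor.h"
#include "executor/tuptable.h"

static TupleTableSlot *
example_fetch(PlanState *outerPlan, TupleTableSlot *local_slot)
{
	/*
	 * Case 1: a slot handed back by a child node.  Both halves of
	 * TupIsNull() matter, because some node types return a NULL pointer at
	 * EOF while others return ExecClearTuple(myslot).
	 */
	TupleTableSlot *outerslot = ExecProcNode(outerPlan);

	if (TupIsNull(outerslot))
		return NULL;			/* EOF from the child */

	/*
	 * Case 2: a locally managed slot created in the node's Init function.
	 * The pointer itself is never NULL here; only the empty-or-not state
	 * matters, so the NULL half of TupIsNull() is dead weight.
	 */
	if (TTS_EMPTY(local_slot))
		ExecCopySlot(local_slot, outerslot);

	return local_slot;
}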

Thus, Ranier's concern would be valid if a node ever did anything
with a returned-from-lower-level slot after failing the TupIsNull
check on it.  But there's really no reason to do so, and furthermore
doing so would be a logic bug in itself.  (Something like ExecCopySlot
into the slot, for example, is flat out wrong, because an upper level
node is *never* entitled to scribble on the output slot of a lower-level
node.)  So I seriously, seriously doubt that there are any live bugs
of this ilk.

In principle we could invent SlotIsEmpty() and apply it in use
cases of type 2, but I don't really think that'd be a productive
activity.  In return for saving a few cycles we'd have a nontrivial
risk of new bugs from using the wrong macro for the case at hand.

I do wonder whether we should try to simplify the inter-node
API by allowing only one of the two cases for EOF indicators.
Not convinced it's worth troubling over, though.

regards, tom lane




Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Tomas Vondra

On Fri, Oct 09, 2020 at 05:50:09PM -0300, Ranier Vilela wrote:

On Fri, Oct 9, 2020 at 5:47 PM Tom Lane  wrote:


Ranier Vilela  writes:
> The trap is not on the second part of expression. Is in the first.
> If the pointer is NULL, ExecCopySlot will be called.

Your initial comment indicated that you were worried about
IncrementalSortState's group_pivot slot, but that is never going
to be null in any execution function of nodeIncrementalSort.c,
because ExecInitIncrementalSort always creates it.

(The test whether it's NULL in ExecReScanIncrementalSort therefore
seems rather useless and misleading, but it's not actually a bug.)

The places that use TupIsNull are just doing so because that's
the standard way to check whether a slot is empty.  The null
test inside the macro is pointless in this context (and in a lot
of its other use-cases, too) but we don't worry about that.


So I said that TupIsNull was not the most appropriate.

Doesn't it look better?
(node->group_pivot != NULL && TTS_EMPTY(node->group_pivot))



My (admittedly very subjective) opinion is that it looks much worse. The
TupIsNull is pretty self-descriptive, unlike this proposed code.

That could be fixed by defining a new macro, perhaps something like
SlotIsEmpty() or so. But as was already explained, Incremental Sort
can't actually have a NULL slot here, so it makes no difference there.
And in the other places we can't just mechanically replace the macros
because it'd quite likely silently hide pre-existing bugs.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Tomas Vondra

On Fri, Oct 09, 2020 at 05:25:02PM -0300, Ranier Vilela wrote:

On Fri, Oct 9, 2020 at 2:05 PM Tomas Vondra <
tomas.von...@2ndquadrant.com> wrote:


On Fri, Oct 09, 2020 at 12:24:16PM -0300, Ranier Vilela wrote:
>I think that TupIsNull macro is no longer appropriate, to protect
>ExecCopySlot.
>
>See at tuptable.h:
>#define TupIsNull(slot) \
>((slot) == NULL || TTS_EMPTY(slot))
>
>If var node->group_pivot is NULL, ExecCopySlot will
>dereference a null pointer (first arg).
>

No. The C standard says there's a "sequence point" [1] between the left
and right arguments of the || operator, and that the expressions are
evaluated from left to right. So the program will do the first check,
and if the pointer really is NULL it won't do the second one (because
that is not necessary for determining the result). Similarly for the &&
operator, of course.


Really.
The trap is not on the second part of expression. Is in the first.
If the pointer is NULL, ExecCopySlot will be called.



Ah, OK. Now I see what you meant. Well, yeah - calling ExecCopySlot with
NULL would be bad, but as others pointed out most of the call sites
don't really have the issue for other reasons.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Peter Geoghegan
On Fri, Oct 9, 2020 at 2:04 PM Ranier Vilela  wrote:
>> > The author of deduplication claimed that he thinks the problem may be in 
>> > IncrementalSort,
>> > he did not specify which part.
>>
>> No I didn't.
>
> https://www.postgresql.org/message-id/CAH2-Wz=ae84z0pxtbc+ssgi9ec8ngkn9d16op-dgh47jcrv...@mail.gmail.com

That thread is obviously totally unrelated to what you're talking
about. I cannot imagine how you made the connection. The only
commonality is the term "incremental sort".

Moreover, the point that I make in the thread that you link to is that
the bug in question could not possibly be related to the incremental
sort commit. That was an initial quick guess that I made that turned
out to be wrong.

-- 
Peter Geoghegan




Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Stephen Frost
Greetings,

* Ranier Vilela (ranier...@gmail.com) wrote:
> On Fri, Oct 9, 2020 at 6:02 PM Stephen Frost 
> wrote:
> > As a side-note, this kind of further analysis and looking for other,
> > similar, cases that might be problematic is really helpful and important
> > to do whenever you come across a case like this, and will also lend a
> > bit more validation that this is really an issue and something we need
> > to look at and not a one-off mistake (which, as much as we'd like to
> > think we never make mistakes, isn't typically the case...).
> >
> Several places.
> TupIsNull it looks like a minefield...

Is it though?  Tom already pointed out that the specific case you were
concerned about isn't an issue - I'd encourage you to go review the other
cases that I found and see if you can find any cases where it's actually
going to result in a crash.

If there is such a case, then perhaps we should consider changing
things, but if not, then perhaps there isn't any need to make a change.
I do wonder if maybe some of those cases don't actually need the
TupIsNull() check at all, as we could prove that it won't be NULL by the
time we reach that point, but it depends - and requires more review.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Ranier Vilela
On Fri, Oct 9, 2020 at 6:02 PM Stephen Frost 
wrote:

> Greetings,
>
> * Ranier Vilela (ranier...@gmail.com) wrote:
> > On Fri, Oct 9, 2020 at 2:05 PM Tomas Vondra <
> > tomas.von...@2ndquadrant.com> wrote:
> >
> > > On Fri, Oct 09, 2020 at 12:24:16PM -0300, Ranier Vilela wrote:
> > > >I think that TupIsNull macro is no longer appropriate, to protect
> > > >ExecCopySlot.
> > > >
> > > >See at tuptable.h:
> > > >#define TupIsNull(slot) \
> > > >((slot) == NULL || TTS_EMPTY(slot))
> > > >
> > > >If var node->group_pivot is NULL, ExecCopySlot will
> > > >dereference a null pointer (first arg).
>
> [...]
>
> > The trap is not on the second part of expression. Is in the first.
> > If the pointer is NULL, ExecCopySlot will be called.
>
> Yeah, that's interesting, and this isn't the only place there's a risk
> of that happening, in doing a bit of review of TupIsNull() callers:
>
> src/backend/executor/nodeGroup.c:
>
> if (TupIsNull(firsttupleslot))
> {
> outerslot = ExecProcNode(outerPlanState(node));
> if (TupIsNull(outerslot))
> {
> /* empty input, so return nothing */
> node->grp_done = true;
> return NULL;
> }
> /* Copy tuple into firsttupleslot */
> ExecCopySlot(firsttupleslot, outerslot);
>
> Seems that 'firsttupleslot' could possibly be a NULL pointer at this
> point?
>
> src/backend/executor/nodeWindowAgg.c:
>
> /* Fetch next row if we didn't already */
> if (TupIsNull(agg_row_slot))
> {
> if (!window_gettupleslot(agg_winobj, winstate->aggregatedupto,
>  agg_row_slot))
> break;  /* must be end of partition */
> }
>
> If agg_row_slot ends up being an actual NULL pointer, this looks likely
> to end up resulting in a crash too.
>
> /*
>  * If this is the very first partition, we need to fetch the first
> input
>  * row to store in first_part_slot.
>  */
> if (TupIsNull(winstate->first_part_slot))
> {
> TupleTableSlot *outerslot = ExecProcNode(outerPlan);
>
> if (!TupIsNull(outerslot))
> ExecCopySlot(winstate->first_part_slot, outerslot);
> else
> {
> /* outer plan is empty, so we have nothing to do */
> winstate->partition_spooled = true;
> winstate->more_partitions = false;
> return;
> }
> }
>
> This seems like another risky case, since we don't check that
> winstate->first_part_slot is a non-NULL pointer.
>
> if (winstate->frameheadpos == 0 &&
> TupIsNull(winstate->framehead_slot))
> {
> /* fetch first row into framehead_slot, if we didn't
> already */
> if (!tuplestore_gettupleslot(winstate->buffer, true, true,
>  winstate->framehead_slot))
> elog(ERROR, "unexpected end of tuplestore");
> }
>
> There's a few of these 'framehead_slot' cases, and then some with
> 'frametail_slot', all a similar pattern to above.
>
> > For convenience, I will reproduce it:
> > static inline TupleTableSlot *
> > ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
> > {
> > Assert(!TTS_EMPTY(srcslot));
> > AssertArg(srcslot != dstslot);
> >
> > dstslot->tts_ops->copyslot(dstslot, srcslot);
> >
> > return dstslot;
> > }
> >
> > The second arg is not empty? Yes.
> > The second arg is different from the first arg (NULL)? Yes.
> >
> > dstslot->tts_ops->copyslot(dstslot, srcslot); // dereference dstslot
> (which
> > is NULL)
>
> Right, just to try and clarify further, the issue here is with this code:
>
> if (TupIsNull(node->group_pivot))
> ExecCopySlot(node->group_pivot, node->transfer_tuple);
>
> With TupIsNull defined as:
>
> ((slot) == NULL || TTS_EMPTY(slot))
>
> That means we get:
>
> if ((node->group_pivot) == NULL || TTS_EMPTY(node->group_pivot))
> ExecCopySlot(node->group_pivot, node->transfer_tuple);
>
> Which means that if we reach this point with node->group_pivot as NULL,
> then we'll pass that to ExecCopySlot() and eventually end up
> dereferencing it and crashing.
>
> I haven't tried to run back farther up to see if it's possible that
> there's other checks which prevent node->group_pivot (and the other
> cases) from actually being a NULL pointer by the time we reach this
> code, but I agree that it seems to be a bit concerning to have the code
> written this way- TupIsNull() allows the pointer *itself* to be NULL and
> callers of it need to realize that (if nothing else by at least
> commenting that there's other checks in place to make sure that it can't
> end up actually being a NULL pointer when we're passing it to some other
> function that'll dereference it).
>
> As a side-note, this kind of further analysis and looking for other,
> similar, cases that might be problematic is really helpful and important
> 

Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Tom Lane
Ranier Vilela  writes:
> So I said that TupIsNull was not the most appropriate.

[ shrug... ]  You're entitled to your opinion, but I see essentially
no value in running around and trying to figure out which TupIsNull
calls actually can see a null pointer and which never will.  It'd
likely introduce bugs, it would certainly not remove any, and there's
no reason to believe that any meaningful performance improvement
could be gained.

(It's possible that the compiler can remove some of the useless
tests, so I'm satisfied to leave such micro-optimization to it.)

regards, tom lane




Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Ranier Vilela
On Fri, Oct 9, 2020 at 5:58 PM Peter Geoghegan  wrote:

> On Fri, Oct 9, 2020 at 1:28 PM Ranier Vilela  wrote:
> > Sorry, can't find the thread.
> > The author of deduplication claimed that he thinks the problem may be in
> IncrementalSort,
> > he did not specify which part.
>
> No I didn't.
>
https://www.postgresql.org/message-id/CAH2-Wz=ae84z0pxtbc+ssgi9ec8ngkn9d16op-dgh47jcrv...@mail.gmail.com
" On Tue, Jul 28, 2020 at 8:47 AM Peter Geoghegan  wrote:
> This is very likely to be related to incremental sort because it's a"

Ranier Vilela


Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Stephen Frost
Greetings,

* Ranier Vilela (ranier...@gmail.com) wrote:
> On Fri, Oct 9, 2020 at 2:05 PM Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
> 
> > On Fri, Oct 09, 2020 at 12:24:16PM -0300, Ranier Vilela wrote:
> > >I think that TupIsNull macro is no longer appropriate, to protect
> > >ExecCopySlot.
> > >
> > >See at tuptable.h:
> > >#define TupIsNull(slot) \
> > >((slot) == NULL || TTS_EMPTY(slot))
> > >
> > >If var node->group_pivot is NULL, ExecCopySlot will
> > >dereference a null pointer (first arg).

[...]

> The trap is not on the second part of expression. Is in the first.
> If the pointer is NULL, ExecCopySlot will be called.

Yeah, that's interesting, and this isn't the only place there's a risk
of that happening, in doing a bit of review of TupIsNull() callers:

src/backend/executor/nodeGroup.c:

if (TupIsNull(firsttupleslot))
{
outerslot = ExecProcNode(outerPlanState(node));
if (TupIsNull(outerslot))
{
/* empty input, so return nothing */
node->grp_done = true;
return NULL;
}
/* Copy tuple into firsttupleslot */
ExecCopySlot(firsttupleslot, outerslot);

Seems that 'firsttupleslot' could possibly be a NULL pointer at this
point?

src/backend/executor/nodeWindowAgg.c:

/* Fetch next row if we didn't already */
if (TupIsNull(agg_row_slot))
{
if (!window_gettupleslot(agg_winobj, winstate->aggregatedupto,
 agg_row_slot))
break;  /* must be end of partition */
}

If agg_row_slot ends up being an actual NULL pointer, this looks likely
to end up resulting in a crash too.

/*
 * If this is the very first partition, we need to fetch the first input
 * row to store in first_part_slot.
 */
if (TupIsNull(winstate->first_part_slot))
{
TupleTableSlot *outerslot = ExecProcNode(outerPlan);

if (!TupIsNull(outerslot))
ExecCopySlot(winstate->first_part_slot, outerslot);
else
{
/* outer plan is empty, so we have nothing to do */
winstate->partition_spooled = true;
winstate->more_partitions = false;
return;
}
}

This seems like another risky case, since we don't check that
winstate->first_part_slot is a non-NULL pointer.

if (winstate->frameheadpos == 0 &&
TupIsNull(winstate->framehead_slot))
{
/* fetch first row into framehead_slot, if we didn't already */
if (!tuplestore_gettupleslot(winstate->buffer, true, true,
 winstate->framehead_slot))
elog(ERROR, "unexpected end of tuplestore");
}

There's a few of these 'framehead_slot' cases, and then some with
'frametail_slot', all a similar pattern to above.

> For convenience, I will reproduce it:
> static inline TupleTableSlot *
> ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
> {
> Assert(!TTS_EMPTY(srcslot));
> AssertArg(srcslot != dstslot);
> 
> dstslot->tts_ops->copyslot(dstslot, srcslot);
> 
> return dstslot;
> }
> 
> The second arg is not empty? Yes.
> The second arg is different from the first arg (NULL)? Yes.
> 
> dstslot->tts_ops->copyslot(dstslot, srcslot); // dereference dstslot (which
> is NULL)

Right, just to try and clarify further, the issue here is with this code:

if (TupIsNull(node->group_pivot))
ExecCopySlot(node->group_pivot, node->transfer_tuple);

With TupIsNull defined as:

((slot) == NULL || TTS_EMPTY(slot))

That means we get:

if ((node->group_pivot) == NULL || TTS_EMPTY(node->group_pivot))
ExecCopySlot(node->group_pivot, node->transfer_tuple);

Which means that if we reach this point with node->group_pivot as NULL,
then we'll pass that to ExecCopySlot() and eventually end up
dereferencing it and crashing.

I haven't tried to run back farther up to see if it's possible that
there's other checks which prevent node->group_pivot (and the other
cases) from actually being a NULL pointer by the time we reach this
code, but I agree that it seems to be a bit concerning to have the code
written this way- TupIsNull() allows the pointer *itself* to be NULL and
callers of it need to realize that (if nothing else by at least
commenting that there's other checks in place to make sure that it can't
end up actually being a NULL pointer when we're passing it to some other
function that'll dereference it).
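
A sketch of the kind of caller-side annotation being suggested (the comment
and the Assert are illustrative only, not from any posted patch; the struct
fields are the existing IncrementalSortState ones):

#include "postgres.h"
#include "executor/tuptable.h"
#include "nodes/execnodes.h"

static void
save_group_pivot(IncrementalSortState *node)
{
	/*
	 * node->group_pivot is created unconditionally by
	 * ExecInitIncrementalSort(), so only the "empty slot" half of
	 * TupIsNull() can be true here; make that assumption explicit.
	 */
	Assert(node->group_pivot != NULL);

	if (TupIsNull(node->group_pivot))
		ExecCopySlot(node->group_pivot, node->transfer_tuple);
}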

As a side-note, this kind of further analysis and looking for other,
similar, cases that might be problematic is really helpful and important
to do whenever you come across a case like this, and will also lend a
bit more validation that this is really an issue and something we need
to look at and not a one-off mistake (which, as much as we'd like to
think we never make mistakes, isn't typically the case...).

Thanks,

Stephen


signature.asc

Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Peter Geoghegan
On Fri, Oct 9, 2020 at 1:28 PM Ranier Vilela  wrote:
> Sorry, can't find the thread.
> The author of deduplication claimed that he thinks the problem may be in 
> IncrementalSort,
> he did not specify which part.

No I didn't.

-- 
Peter Geoghegan




Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Ranier Vilela
On Fri, Oct 9, 2020 at 5:47 PM Tom Lane  wrote:

> Ranier Vilela  writes:
> > The trap is not on the second part of expression. Is in the first.
> > If the pointer is NULL, ExecCopySlot will be called.
>
> Your initial comment indicated that you were worried about
> IncrementalSortState's group_pivot slot, but that is never going
> to be null in any execution function of nodeIncrementalSort.c,
> because ExecInitIncrementalSort always creates it.
>
> (The test whether it's NULL in ExecReScanIncrementalSort therefore
> seems rather useless and misleading, but it's not actually a bug.)
>
> The places that use TupIsNull are just doing so because that's
> the standard way to check whether a slot is empty.  The null
> test inside the macro is pointless in this context (and in a lot
> of its other use-cases, too) but we don't worry about that.
>
That is why I said that TupIsNull was not the most appropriate choice.

Doesn't it look better?
(node->group_pivot != NULL && TTS_EMPTY(node->group_pivot))

regards,
Ranier Vilela


Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Tom Lane
Ranier Vilela  writes:
> The trap is not on the second part of expression. Is in the first.
> If the pointer is NULL, ExecCopySlot will be called.

Your initial comment indicated that you were worried about
IncrementalSortState's group_pivot slot, but that is never going
to be null in any execution function of nodeIncrementalSort.c,
because ExecInitIncrementalSort always creates it.

(The test whether it's NULL in ExecReScanIncrementalSort therefore
seems rather useless and misleading, but it's not actually a bug.)

The places that use TupIsNull are just doing so because that's
the standard way to check whether a slot is empty.  The null
test inside the macro is pointless in this context (and in a lot
of its other use-cases, too) but we don't worry about that.

regards, tom lane




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Thomas Munro
On Fri, Oct 9, 2020 at 11:58 PM Greg Nancarrow  wrote:
> On Fri, Oct 9, 2020 at 8:41 PM Amit Kapila  wrote:
> > That will be true for the number of rows/pages we need to scan not for
> > the number of tuples we need to return as a result. The formula here
> > considers the number of rows the parallel scan will return and the
> > more the number of rows each parallel node needs to pass via shared
> > memory to gather node the more costly it will be.
> >
> > We do consider the total pages we need to scan in
> > compute_parallel_worker() where we use a logarithmic formula to
> > determine the number of workers.
>
> Despite all the best intentions, the current costings seem to be
> geared towards selection of a non-parallel plan over a parallel plan,
> the more rows there are in the table. Yet the performance of a
> parallel plan appears to be better than non-parallel-plan the more
> rows there are in the table.

Right, but as Amit said, we still have to account for the cost of
schlepping tuples between processes.  Hmm... could the problem be that
we're incorrectly estimating that Insert (without RETURNING) will send
a bazillion tuples, even though that isn't true?  I didn't look at the
code but that's what the plan seems to imply when it says stuff like
"Gather  (cost=15428.00..16101.14 rows=100 width=4)".  I suppose
the row estimates for ModifyTable paths are based on what they write,
not what they emit, and in the past that distinction didn't matter
much because it wasn't something that was used for comparing
alternative plans.  Now it is.




Re: Improving connection scalability: GetSnapshotData()

2020-10-09 Thread Andrew Dunstan


On 10/5/20 10:33 PM, Andres Freund wrote:
> Hi,
>
> On 2020-10-01 19:21:14 -0400, Andrew Dunstan wrote:
>> On 10/1/20 4:22 PM, Andres Freund wrote:
>>> # Note: on Windows, IPC::Run seems to convert \r\n to \n in program output
>>> # if we're using native Perl, but not if we're using MSys Perl.  So do it
>>> # by hand in the latter case, here and elsewhere.
>>> that IPC::Run converts things, but that native windows perl uses
>>> https://perldoc.perl.org/perlrun#PERLIO
>>> a PERLIO that includes :crlf, whereas msys probably doesn't?
>>>
>>> Any chance you could run something like
>>> perl -mPerlIO -e 'print(PerlIO::get_layers(STDIN), "\n");'
>>> on both native and msys perl?
>>>
>>>
 possibly also for stderr, just to make it more futureproof, and at the
 top of the file:

 use Config;


 Do you want me to test that first?
>>> That'd be awesome.
>> The change I suggested makes jacana happy.
> Thanks, pushed. Hopefully that fixes the mingw animals.
>

I don't think we're out of the woods yet. This test is also having bad
effects on bowerbird, which is an MSVC animal. It's hanging completely :-(


Digging some more.


cheers


andrew


-- 
Andrew Dunstan
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Ranier Vilela
On Fri, Oct 9, 2020 at 2:05 PM Tomas Vondra <
tomas.von...@2ndquadrant.com> wrote:

> On Fri, Oct 09, 2020 at 12:24:16PM -0300, Ranier Vilela wrote:
> >I think that TupIsNull macro is no longer appropriate, to protect
> >ExecCopySlot.
> >
> >See at tuptable.h:
> >#define TupIsNull(slot) \
> >((slot) == NULL || TTS_EMPTY(slot))
> >
> >If var node->group_pivot is NULL, ExecCopySlot will
> >dereference a null pointer (first arg).
> >
>
> No. The C standard says there's a "sequence point" [1] between the left
> and right arguments of the || operator, and that the expressions are
> evaluated from left to right. So the program will do the first check,
> and if the pointer really is NULL it won't do the second one (because
> that is not necessary for determining the result). Similarly for the &&
> operator, of course.
>
Really, the trap is not in the second part of the expression; it is in the
first. If the pointer is NULL, ExecCopySlot will still be called.

For convenience, I will reproduce it:
static inline TupleTableSlot *
ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
Assert(!TTS_EMPTY(srcslot));
AssertArg(srcslot != dstslot);

dstslot->tts_ops->copyslot(dstslot, srcslot);

return dstslot;
}

The second arg is not empty? Yes.
The second arg is different from the first arg (NULL)? Yes.

dstslot->tts_ops->copyslot(dstslot, srcslot); // dereference dstslot (which
is NULL)


>
> Had this been wrong, surely some of the the other places TupIsNull would
> be wrong too (and there are hundreds of them).
>
> >Maybe, this can be related to a bug reported in the btree deduplication.
> >
>
> Not sure which bug you mean, but this piece of code is pretty unrelated
> to btree in general, so I don't see any link.
>
Sorry, I can't find the thread.
The author of the deduplication patch claimed that he thinks the problem may
be in IncrementalSort; he did not specify which part.

regards,
Ranier Vilela


Re: BUG #15858: could not stat file - over 4GB

2020-10-09 Thread Tom Lane
Emil Iggland  writes:
> I tested the patch at hand, and it performs as expected. Files larger than 
> 4GB can be imported.

Thanks for testing!

I'd been expecting one of our Windows-savvy committers to pick this up,
but since nothing has been happening, I took it on myself to push it.
I'll probably regret that :-(

I made a few cosmetic changes, mostly reorganizing comments in a way
that made more sense to me.

regards, tom lane




Re: [PATCH] ecpg: fix progname memory leak

2020-10-09 Thread John Naylor
On Fri, Oct 9, 2020 at 12:50 PM Tom Lane  wrote:

> Consulting the wiki page's edit history shows that Bruce has been
> doing some maintenance work, but almost nobody else does.  Evidently
> that's not enough.
>
> I'd be +1 for the "aggressive culling" suggested upthread, but
> I'm not particularly volunteering to do it, either.
>

I'll start a separate thread for this. Working on one or a small number of
sections at a time should be manageable.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pg_upgrade: fail early if a tablespace dir already exists for new cluster version

2020-10-09 Thread Justin Pryzby
On Fri, Oct 09, 2020 at 02:53:25PM -0400, Bruce Momjian wrote:
> On Thu, Sep 24, 2020 at 07:55:31PM -0500, Justin Pryzby wrote:
> > This should be caught during --check, or earlier in the upgrade, rather than
> > only after dumping the schema.
> > 
> > Typically, the tablespace dir would be left behind by a previous failed
> > upgrade attempt, causing a cascade of failed upgrades.  I guess I may not
> > be the only one with a 3 year old wrapper which has a hack to check for this.
> 
> This is an interesting failure case I never considered --- running
> pg_upgrade, having it fail, deleting and recreating the _new_ cluster
> directory, but not removing the new cluster's tablespace directories.  I
> was able to recreate the failure, and verify that your patch properly
> throws an error during 'check' for this case.
> 
> Modified patch attached.  I plan to apply this to all supported Postgres
> versions.

Thanks for looking.  It hits me a bunch of times every year.

> +  * Check that tablespace directories do not already exist for new cluster.
> +  * If they do, it would cause an error while restoring global objects.
> +  * This allows the failure to be detected at check time, rather than
> +  * during schema restore.
> +  *
> +  * Note, v8.4 has no tablespace_suffix, which is fine as long as the new
> +  * cluster version has a suffix.

In my local branch, I had revised this comment to say:

+ * Note, v8.4 has no tablespace_suffix, which is fine so long as the version
+ * being upgraded *to* has a suffix, since it's not allowed to pg_upgrade from
+ * a version to the same version if tablespaces are in use.

Cheers,
-- 
Justin




Re: dynamic result sets support in extended query protocol

2020-10-09 Thread Dave Cramer
On Fri, 9 Oct 2020 at 14:59, Andres Freund  wrote:

> Hi,
>
> On 2020-10-09 14:49:11 -0400, Dave Cramer wrote:
> > On Fri, 9 Oct 2020 at 14:46, Andres Freund  wrote:
> > > > (I suspect what would be more useful in practice is to designate
> > > > output formats per data type.)
> > >
> > > Yea, that'd be *really* useful. It sucks that we basically require
> > > multiple round trips to make realistic use of the binary data for the
> > > few types where it's a huge win (e.g. bytea).
> > >
> >
> > Yes!!! Ideally in the startup message.
>
> I don't think startup is a good choice. For one, it's size limited. But
> more importantly, before having successfully established a connection,
> there's really no way the driver can know which types it should list as
> to be sent in binary (consider e.g. some postgis types, which'd greatly
> benefit from being sent in binary, but also just version dependent
> stuff).
>
For the most part we know exactly which types we want in binary for 99% of
queries.


> The hard part around this really is whether and how to deal with changes
> in type definitions. From types just being created - comparatively
> simple - to extensions being dropped and recreated, with oids
> potentially being reused.
>

Fair point, but this is going to be much more complex than just sending most
of the results in binary, which would speed up the overwhelming majority of
queries.

Dave Cramer

>
>


Re: dynamic result sets support in extended query protocol

2020-10-09 Thread Andres Freund
Hi,

On 2020-10-09 14:49:11 -0400, Dave Cramer wrote:
> On Fri, 9 Oct 2020 at 14:46, Andres Freund  wrote:
> > > (I suspect what would be more useful in practice is to designate
> > > output formats per data type.)
> >
> > Yea, that'd be *really* useful. It sucks that we basically require
> > multiple round trips to make realistic use of the binary data for the
> > few types where it's a huge win (e.g. bytea).
> >
> 
> Yes!!! Ideally in the startup message.

I don't think startup is a good choice. For one, it's size limited. But
more importantly, before having successfully established a connection,
there's really no way the driver can know which types it should list as
to be sent in binary (consider e.g. some postgis types, which'd greatly
benefit from being sent in binary, but also just version dependent
stuff).

The hard part around this really is whether and how to deal with changes
in type definitions. From types just being created - comparatively
simple - to extensions being dropped and recreated, with oids
potentially being reused.

Greetings,

Andres Freund




Re: [PATCH] ecpg: fix progname memory leak

2020-10-09 Thread Bruce Momjian
On Fri, Oct  9, 2020 at 12:50:36PM -0400, Tom Lane wrote:
> Magnus Hagander  writes:
> > That said, having this in a different format would in no way fix the fact
> > that the information is mislabeled and obsolete.  It would likely be
> > equally mislabeled and obsolete in an issue tracker. Just look at almost
> > any of the other projects out there that *do* use an issue tracker and have
> > been around for 20+ years.
> 
> The long and the short of it is that a list like this is only of value
> if it gets diligently maintained.  Putting the data in a different tool
> changes nothing about that.  (A tool might reduce the effort involved,
> *if* it's a good match to your workflow.  But it won't reduce the effort
> to zero.)
> 
> Consulting the wiki page's edit history shows that Bruce has been
> doing some maintenance work, but almost nobody else does.  Evidently
> that's not enough.
> 
> I'd be +1 for the "aggressive culling" suggested upthread, but
> I'm not particularly volunteering to do it, either.

+1

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





[PATCH] Add `truncate` option to subscription commands

2020-10-09 Thread David Christensen
-hackers,

Enclosed find a patch to add a “truncate” option to subscription commands.

When adding new tables to a subscription (either via `CREATE SUBSCRIPTION` or 
`REFRESH PUBLICATION`), tables on the target which are being newly subscribed 
will be truncated before the data copy step.  This saves explicit coordination 
of a manual `TRUNCATE` on the target tables and allows the results of the 
initial data sync to be the same as on the publisher at the time of sync.

To preserve compatibility with existing behavior, the default value for this 
parameter is `false`.

Best,

David



0001-Add-truncate-option-to-subscription-commands.patch
Description: Binary data


--
David Christensen
Senior Software and Database Engineer
End Point Corporation
da...@endpoint.com
785-727-1171




signature.asc
Description: Message signed with OpenPGP


Re: pg_upgrade: fail early if a tablespace dir already exists for new cluster version

2020-10-09 Thread Bruce Momjian
On Thu, Sep 24, 2020 at 07:55:31PM -0500, Justin Pryzby wrote:
> This should be caught during --check, or earlier in the upgrade, rather than
> only after dumping the schema.
> 
> Typically, the tablespace dir would be left behind by a previous failed
> upgrade attempt, causing a cascade of failed upgrades.  I guess I may not
> be the only one with a 3 year old wrapper which has a hack to check for this.

This is an interesting failure case I never considered --- running
pg_upgrade, having it fail, deleting and recreating the _new_ cluster
directory, but not removing the new cluster's tablespace directories.  I
was able to recreate the failure, and verify that your patch properly
throws an error during 'check' for this case.

Modified patch attached.  I plan to apply this to all supported Postgres
versions.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
new file mode 100644
index 168058a..0c845b7
*** a/src/bin/pg_upgrade/check.c
--- b/src/bin/pg_upgrade/check.c
*** static void check_for_tables_with_oids(C
*** 27,32 
--- 27,33 
  static void check_for_reg_data_type_usage(ClusterInfo *cluster);
  static void check_for_jsonb_9_4_usage(ClusterInfo *cluster);
  static void check_for_pg_role_prefix(ClusterInfo *cluster);
+ static void check_for_new_tablespace_dir(ClusterInfo *new_cluster);
  static char *get_canonical_locale_name(int category, const char *locale);
  
  
*** check_new_cluster(void)
*** 187,192 
--- 188,195 
	check_is_install_user(&new_cluster);
  
	check_for_prepared_transactions(&new_cluster);
+ 
+ 	check_for_new_tablespace_dir(&new_cluster);
  }
  
  
*** create_script_for_cluster_analyze(char *
*** 528,533 
--- 531,568 
  
  
  /*
+  * Check that tablespace directories do not already exist for new cluster.
+  * If they do, it would cause an error while restoring global objects.
+  * This allows the failure to be detected at check time, rather than
+  * during schema restore.
+  *
+  * Note, v8.4 has no tablespace_suffix, which is fine as long as the new
+  * cluster version has a suffix.
+  */
+ static void
+ check_for_new_tablespace_dir(ClusterInfo *new_cluster)
+ {
+ 	char	new_tablespace_dir[MAXPGPATH];
+ 
+  	prep_status("Checking for new cluster tablespace directories");
+ 
+ 	for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+ 	{
+ 		struct stat statbuf;
+ 
+ 		snprintf(new_tablespace_dir, MAXPGPATH, "%s%s",
+ os_info.old_tablespaces[tblnum],
+ new_cluster->tablespace_suffix);
+ 
+ 		if (stat(new_tablespace_dir, &statbuf) == 0 || errno != ENOENT)
+ 			pg_fatal("new cluster tablespace directory already exists: \"%s\"\n",
+ 	 new_tablespace_dir);
+ 	}
+ 
+ 	check_ok();
+ }
+ 
+ /*
   * create_script_for_old_cluster_deletion()
   *
   *	This is particularly useful for tablespace deletion.


Re: dynamic result sets support in extended query protocol

2020-10-09 Thread Dave Cramer
Hi,


On Fri, 9 Oct 2020 at 14:46, Andres Freund  wrote:

> Hi,
>
> On 2020-10-08 09:46:38 +0200, Peter Eisentraut wrote:
> > New would be that the server would now also respond with a new message,
> say,
> >
> > S: DynamicResultInfo
>
> > Now, if the client has seen DynamicResultInfo earlier, it should now go
> into
> > a new subsequence to get the remaining result sets, like this (naming
> > obviously to be refined):
>
> Hm. Isn't this going to be a lot more latency sensitive than we'd like?
> This would basically require at least one additional roundtrip for
> everything that *potentially* could return multiple result sets, even if
> no additional results are returned, right? And it'd add at least one
> additional roundtrip for every result set that's actually sent.
>

Agreed as mentioned.

>
> Is there really a good reason for forcing the client to issue
> NextResult, Describe, Execute for each of the dynamic result sets? It's
> not like there's really a case for allowing the clients to skip them,
> right?  Why aren't we sending something more like
>
> S: CommandPartiallyComplete
> S: RowDescription
> S: DataRow...
> S: CommandPartiallyComplete
> S: RowDescription
> S: DataRow...
> ...
> S: CommandComplete
> C: Sync
>
> gated by a _pq_ parameter, of course.
>
>
> > I think this would all have to use the unnamed portal, but perhaps there
> > could be other uses with named portals.  Some details to be worked out.
>
> Which'd avoid this too, but:
>
> > One thing that's missing in this sequence is a way to specify the desired
> > output format (text/binary) for each result set.
>
> Is a good point. I personally think avoiding the back and forth is more
> important though. But if we could address both at the same time...
>
>
> > (I suspect what would be more useful in practice is to designate
> > output formats per data type.)
>
> Yea, that'd be *really* useful. It sucks that we basically require
> multiple round trips to make realistic use of the binary data for the
> few types where it's a huge win (e.g. bytea).
>

Yes!!! Ideally in the startup message.

Dave


Re: dynamic result sets support in extended query protocol

2020-10-09 Thread Andres Freund
Hi,

On 2020-10-08 09:46:38 +0200, Peter Eisentraut wrote:
> New would be that the server would now also respond with a new message, say,
> 
> S: DynamicResultInfo

> Now, if the client has seen DynamicResultInfo earlier, it should now go into
> a new subsequence to get the remaining result sets, like this (naming
> obviously to be refined):

Hm. Isn't this going to be a lot more latency sensitive than we'd like?
This would basically require at least one additional roundtrip for
everything that *potentially* could return multiple result sets, even if
no additional results are returned, right? And it'd add at least one
additional roundtrip for every result set that's actually sent.

Is there really a good reason for forcing the client to issue
NextResult, Describe, Execute for each of the dynamic result sets? It's
not like there's really a case for allowing the clients to skip them,
right?  Why aren't we sending something more like

S: CommandPartiallyComplete
S: RowDescription
S: DataRow...
S: CommandPartiallyComplete
S: RowDescription
S: DataRow...
...
S: CommandComplete
C: Sync

gated by a _pq_ parameter, of course.


> I think this would all have to use the unnamed portal, but perhaps there
> could be other uses with named portals.  Some details to be worked out.

Which'd avoid this too, but:

> One thing that's missing in this sequence is a way to specify the desired
> output format (text/binary) for each result set.

Is a good point. I personally think avoiding the back and forth is more
important though. But if we could address both at the same time...


> (I suspect what would be more useful in practice is to designate
> output formats per data type.)

Yea, that'd be *really* useful. It sucks that we basically require
multiple round trips to make realistic use of the binary data for the
few types where it's a huge win (e.g. bytea).

Greetings,

Andres Freund




Re: dynamic result sets support in extended query protocol

2020-10-09 Thread Dave Cramer
On Fri, 9 Oct 2020 at 13:33, Andrew Dunstan  wrote:

>
> On 10/8/20 3:46 AM, Peter Eisentraut wrote:
> > I want to progress work on stored procedures returning multiple result
> > sets.  Examples of how this could work on the SQL side have previously
> > been shown [0].  We also have ongoing work to make psql show multiple
> > result sets [1].  This appears to work fine in the simple query
> > protocol.  But the extended query protocol doesn't support multiple
> > result sets at the moment [2].  This would be desirable to be able to
> > use parameter binding, and also since one of the higher-level goals
> > would be to support the use case of stored procedures returning
> > multiple result sets via JDBC.
> >
> > [0]:
> >
> https://www.postgresql.org/message-id/flat/4580ff7b-d610-eaeb-e06f-4d686896b93b%402ndquadrant.com
> > [1]: https://commitfest.postgresql.org/29/2096/
> > [2]:
> > https://www.postgresql.org/message-id/9507.1534370765%40sss.pgh.pa.us
> >
> > (Terminology: I'm calling this project "dynamic result sets", which
> > includes several concepts: 1) multiple result sets, 2) those result
> > sets can have different structures, 3) the structure of the result
> > sets is decided at run time, not declared in the schema/procedure
> > definition/etc.)
> >
> > One possibility I rejected was to invent a third query protocol beside
> > the simple and extended one.  This wouldn't really match with the
> > requirements of JDBC and similar APIs because the APIs for sending
> > queries don't indicate whether dynamic result sets are expected or
> > required, you only indicate that later by how you process the result
> > sets.  So we really need to use the existing ways of sending off the
> > queries.  Also, avoiding a third query protocol is probably desirable
> > in general to avoid extra code and APIs.
> >
> > So here is my sketch on how this functionality could be woven into the
> > extended query protocol.  I'll go through how the existing protocol
> > exchange works and then point out the additions that I have in mind.
> >
> > These additions could be enabled by a _pq_ startup parameter sent by
> > the client.  Alternatively, it might also work without that because
> > the client would just reject protocol messages it doesn't understand,
> > but that's probably less desirable behavior.
> >
> > So here is how it goes:
> >
> > C: Parse
> > S: ParseComplete
> >
> > At this point, the server would know whether the statement it has
> > parsed can produce dynamic result sets.  For a stored procedure, this
> > would be declared with the procedure definition, so when the CALL
> > statement is parsed, this can be noticed.  I don't actually plan any
> > other cases, but for the sake of discussion, perhaps some variant of
> > EXPLAIN could also return multiple result sets, and that could also be
> > detected from parsing the EXPLAIN invocation.
> >
> > At this point a client would usually do
> >
> > C: Describe (statement)
> > S: ParameterDescription
> > S: RowDescription
> >
> > New would be that the server would now also respond with a new
> > message, say,
> >
> > S: DynamicResultInfo
> >
> > that indicates that dynamic result sets will follow later.  The
> > message would otherwise be empty.  (We could perhaps include the
> > number of result sets, but this might not actually be useful, and
> > perhaps it's better not to spent effort on counting things that don't
> > need to be counted.)
> >
> > (If we don't guard this by a _pq_ startup parameter from the client,
> > an old client would now error out because of an unexpected protocol
> > message.)
> >
> > Now the normal bind and execute sequence follows:
> >
> > C: Bind
> > S: BindComplete
> > (C: Describe (portal))
> > (S: RowDescription)
> > C: Execute
> > S: ... (DataRows)
> > S: CommandComplete
> >
> > In the case of a CALL with output parameters, this "primary" result
> > set contains one row with the output parameters (existing behavior).
> >
> > Now, if the client has seen DynamicResultInfo earlier, it should now
> > go into a new subsequence to get the remaining result sets, like this
> > (naming obviously to be refined):
> >
> > C: NextResult
> > S: NextResultReady
> > C: Describe (portal)
> > S: RowDescription
> > C: Execute
> > 
> > S: CommandComplete
> > C: NextResult
> > ...
> > C: NextResult
> > S: NoNextResult
> > C: Sync
> > S: ReadyForQuery
> >
> > I think this would all have to use the unnamed portal, but perhaps
> > there could be other uses with named portals.  Some details to be
> > worked out.
> >
> > One could perhaps also do without the DynamicResultInfo message and
> > just put extra information into the CommandComplete message indicating
> > "there are more result sets after this one".
> >
> > (Following the model from the simple query protocol, CommandComplete
> > really means one result set complete, not the whole top-level command.
> > ReadyForQuery means the whole command is complete.  This is perhaps
> > debatable, and 

Re: Batching page logging during B-tree build

2020-10-09 Thread Andrey Borodin



> On 23 Sep 2020, at 23:19, Peter Geoghegan wrote:
> 
> On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin  
> wrote:
>> Here is PoC with porting that same routine to B-tree. It allows to build 
>> B-trees ~10% faster on my machine.
> 
> It doesn't seem to make any difference on my machine, which has an
> NVME SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> sync time isn't exceptional. My test case is "reindex index
> pgbench_accounts_pkey", with pgbench scale 500. I thought that this
> would be a sympathetic case, since it's bottlenecked on writing the
> index, with relatively little time spent scanning and sorting in
> parallel workers.
> Can you provide a test case that is sympathetic towards the patch?
Thanks for looking into this!

I've tried this test on my machine (a 2019 MacBook) at scale 10 for 20 seconds.
With the patch I consistently get ~ tps = 2.403440; without the patch, ~ tps = 1.951975.
On scale 500, with the patch:
postgres=# reindex index pgbench_accounts_pkey;
REINDEX
Time: 21577,640 ms (00:21,578)
without the patch:
postgres=# reindex index pgbench_accounts_pkey;
REINDEX
Time: 26139,175 ms (00:26,139)

I think it's hardware-dependent; I will try it on servers.
> 
> BTW, I noticed that the index build is absurdly bottlenecked on
> compressing WAL with wal_compression=on. It's almost 3x slower with
> compression turned on!



> On 24 Sep 2020, at 00:33, Andres Freund wrote:
> 
>> I know that we've tested different compression methods in the past,
>> but perhaps index build performance was overlooked.
> 
> I am pretty sure we have known that pglz for this was much much slower
> than alternatives. I seem to recall somebody posting convincing numbers,
> but can't find them just now.

There was a thread about different compression methods [0]. It was demonstrated 
there that lz4 is about 10 times faster at compression.
We have a patch to speed up pglz compression by ~1.43x [1], but I was hoping that 
we would go the lz4/zstd way. It seems to me now that I should actually finish that 
speedup patch; it's a very focused local refactoring.

Thanks!

Best regards, Andrey Borodin.


[0] 
https://www.postgresql.org/message-id/flat/ea57b49a-ecf0-481a-a77b-631833354f7d%40postgrespro.ru#dcac101f8a73dfce98924066f6a12a13
[1] 
https://www.postgresql.org/message-id/flat/169163A8-C96F-4DBE-A062-7D1CECBE9E5D%40yandex-team.ru#996a194c12bacd2d093be2cb7ac54ca6



Re: dynamic result sets support in extended query protocol

2020-10-09 Thread Andrew Dunstan


On 10/8/20 3:46 AM, Peter Eisentraut wrote:
> I want to progress work on stored procedures returning multiple result
> sets.  Examples of how this could work on the SQL side have previously
> been shown [0].  We also have ongoing work to make psql show multiple
> result sets [1].  This appears to work fine in the simple query
> protocol.  But the extended query protocol doesn't support multiple
> result sets at the moment [2].  This would be desirable to be able to
> use parameter binding, and also since one of the higher-level goals
> would be to support the use case of stored procedures returning
> multiple result sets via JDBC.
>
> [0]:
> https://www.postgresql.org/message-id/flat/4580ff7b-d610-eaeb-e06f-4d686896b93b%402ndquadrant.com
> [1]: https://commitfest.postgresql.org/29/2096/
> [2]:
> https://www.postgresql.org/message-id/9507.1534370765%40sss.pgh.pa.us
>
> (Terminology: I'm calling this project "dynamic result sets", which
> includes several concepts: 1) multiple result sets, 2) those result
> sets can have different structures, 3) the structure of the result
> sets is decided at run time, not declared in the schema/procedure
> definition/etc.)
>
> One possibility I rejected was to invent a third query protocol beside
> the simple and extended one.  This wouldn't really match with the
> requirements of JDBC and similar APIs because the APIs for sending
> queries don't indicate whether dynamic result sets are expected or
> required, you only indicate that later by how you process the result
> sets.  So we really need to use the existing ways of sending off the
> queries.  Also, avoiding a third query protocol is probably desirable
> in general to avoid extra code and APIs.
>
> So here is my sketch on how this functionality could be woven into the
> extended query protocol.  I'll go through how the existing protocol
> exchange works and then point out the additions that I have in mind.
>
> These additions could be enabled by a _pq_ startup parameter sent by
> the client.  Alternatively, it might also work without that because
> the client would just reject protocol messages it doesn't understand,
> but that's probably less desirable behavior.
>
> So here is how it goes:
>
> C: Parse
> S: ParseComplete
>
> At this point, the server would know whether the statement it has
> parsed can produce dynamic result sets.  For a stored procedure, this
> would be declared with the procedure definition, so when the CALL
> statement is parsed, this can be noticed.  I don't actually plan any
> other cases, but for the sake of discussion, perhaps some variant of
> EXPLAIN could also return multiple result sets, and that could also be
> detected from parsing the EXPLAIN invocation.
>
> At this point a client would usually do
>
> C: Describe (statement)
> S: ParameterDescription
> S: RowDescription
>
> New would be that the server would now also respond with a new
> message, say,
>
> S: DynamicResultInfo
>
> that indicates that dynamic result sets will follow later.  The
> message would otherwise be empty.  (We could perhaps include the
> number of result sets, but this might not actually be useful, and
> perhaps it's better not to spend effort on counting things that don't
> need to be counted.)
>
> (If we don't guard this by a _pq_ startup parameter from the client,
> an old client would now error out because of an unexpected protocol
> message.)
>
> Now the normal bind and execute sequence follows:
>
> C: Bind
> S: BindComplete
> (C: Describe (portal))
> (S: RowDescription)
> C: Execute
> S: ... (DataRows)
> S: CommandComplete
>
> In the case of a CALL with output parameters, this "primary" result
> set contains one row with the output parameters (existing behavior).
>
> Now, if the client has seen DynamicResultInfo earlier, it should now
> go into a new subsequence to get the remaining result sets, like this
> (naming obviously to be refined):
>
> C: NextResult
> S: NextResultReady
> C: Describe (portal)
> S: RowDescription
> C: Execute
> 
> S: CommandComplete
> C: NextResult
> ...
> C: NextResult
> S: NoNextResult
> C: Sync
> S: ReadyForQuery
>
> I think this would all have to use the unnamed portal, but perhaps
> there could be other uses with named portals.  Some details to be
> worked out.
>
> One could perhaps also do without the DynamicResultInfo message and
> just put extra information into the CommandComplete message indicating
> "there are more result sets after this one".
>
> (Following the model from the simple query protocol, CommandComplete
> really means one result set complete, not the whole top-level command.
> ReadyForQuery means the whole command is complete.  This is perhaps
> debatable, and interesting questions could also arise when considering
> what should happen in the simple query protocol when a query string
> consists of multiple commands each returning multiple result sets. 
> But it doesn't really seem sensible to cater to that.)
>
> One thing that's missing in this sequence 
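
As a point of reference, a libpq client can already consume multiple result
sets over the simple query protocol with a plain PQgetResult() loop; the
minimal sketch below uses only existing libpq calls (the connection string and
query text are made up for illustration) and does not implement any of the
proposed extended-protocol additions:

/* build: cc dynres.c -lpq (libpq headers on the include path) */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");	/* placeholder conninfo */
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* two statements in one string -> two result sets (simple protocol) */
	if (!PQsendQuery(conn, "SELECT 1 AS a; SELECT 2 AS b, 3 AS c"))
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* one PGresult per result set; NULL when the whole command is done */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			printf("result set: %d column(s), %d row(s)\n",
				   PQnfields(res), PQntuples(res));
		PQclear(res);
	}

	PQfinish(conn);
	return 0;
}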

Re: Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Tomas Vondra

On Fri, Oct 09, 2020 at 12:24:16PM -0300, Ranier Vilela wrote:

I think the TupIsNull macro is no longer appropriate for protecting
ExecCopySlot.

See tuptable.h:
#define TupIsNull(slot) \
((slot) == NULL || TTS_EMPTY(slot))

If the var node->group_pivot is NULL, ExecCopySlot will
dereference a null pointer (its first arg).



No. The C standard says there's a "sequence point" [1] between the left
and right arguments of the || operator, and that the expressions are
evaluated from left to right. So the program will do the first check,
and if the pointer really is NULL it won't do the second one (because
that is not necessary for determining the result). Similarly for the &&
operator, of course.

Had this been wrong, surely some of the other places using TupIsNull would
be wrong too (and there are hundreds of them).
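
If it helps, the guarantee is easy to demonstrate with a self-contained toy
(this is not the executor code, just the same || pattern):

#include <stdbool.h>
#include <stdio.h>

struct slot
{
	bool		empty;
};

/* same shape as TupIsNull: NULL test first, emptiness test second */
#define SLOT_IS_NULL(s)	((s) == NULL || (s)->empty)

int
main(void)
{
	struct slot *p = NULL;
	struct slot	s = {false};

	/* the right-hand operand is never evaluated when p is NULL */
	printf("%d\n", SLOT_IS_NULL(p));	/* prints 1, no crash */
	printf("%d\n", SLOT_IS_NULL(&s));	/* prints 0 */
	return 0;
}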


Maybe, this can be related to a bug reported in the btree deduplication.



Not sure which bug you mean, but this piece of code is pretty unrelated
to btree in general, so I don't see any link.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] ecpg: fix progname memory leak

2020-10-09 Thread Tom Lane
Magnus Hagander  writes:
> That said, having this in a different format would in no way fix the fact
> that the information is mislabeled and obsolete.  It would likely be
> equally mislabeled and obsolete in an issue tracker. Just look at almost
> any of the other projects out there that *do* use an issue tracker and have
> been around for 20+ years.

The long and the short of it is that a list like this is only of value
if it gets diligently maintained.  Putting the data in a different tool
changes nothing about that.  (A tool might reduce the effort involved,
*if* it's a good match to your workflow.  But it won't reduce the effort
to zero.)

Consulting the wiki page's edit history shows that Bruce has been
doing some maintenance work, but almost nobody else does.  Evidently
that's not enough.

I'd be +1 for the "aggressive culling" suggested upthread, but
I'm not particularly volunteering to do it, either.

regards, tom lane




Re: Expansion of our checks for connection-loss errors

2020-10-09 Thread Tom Lane
I wrote:
> Hmm, excellent point.  While our code response to all these errors
> should be the same, you are right that that doesn't extend to emitting
> identical error texts.  For EHOSTUNREACH/ENETDOWN/ENETUNREACH, we
> should say something like "connection to server lost", without claiming
> that the server crashed.  It is less clear what to do with ECONNABORTED,
> but I'm inclined to put it in the network-problem bucket not the
> server-crash bucket, despite lorikeet's behavior.  Thoughts?

> This also destroys the patch's idea that switch statements should be
> able to handle these all alike.  If we group things as "ECONNRESET means
> server crash and the others are all network failures", then I'd be
> inclined to leave the ECONNRESET cases alone and just introduce
> new infrastructure to recognize all the network-failure errnos.

Actually, treating it that way seems like a good thing because it nets
out as (nearly) no change to our error message behavior.  The connection
failure errnos fall through to the default case, which produces a
perfectly reasonable report that includes strerror().  The only big thing
we're changing is the set of errnos that errcode_for_socket_access will
map to ERRCODE_CONNECTION_FAILURE, so this is spiritually closer to your
original patch.

Some other changes in the attached v2:

* I incorporated Kyotaro-san's suggested improvements.

* I went ahead and included ENETRESET and EHOSTDOWN, figuring that
if they exist we definitely want to class them as network failures.
We can worry about ifdef'ing them when and if we find a platform
that hasn't got them.  (I don't see any non-ugly way to make the
ALL_NETWORK_FAILURE_ERRNOS macro vary for missing symbols, so I'd
rather not deal with that unless it's proven necessary.)

* I noticed that we were not terribly consistent about whether
EPIPE is regarded as indicating a server failure like ECONNRESET
does.  So this patch also makes sure that EPIPE is treated like
ECONNRESET everywhere.  (Hence, pqsecure_raw_read's error reporting
does change, since it'll now report EPIPE as server failure.)

I lack a way to test this on Windows, but otherwise it feels
like it's about ready.

regards, tom lane

diff --git a/src/backend/port/win32/socket.c b/src/backend/port/win32/socket.c
index 6fbd1ed6fb..b4ef9fbd8f 100644
--- a/src/backend/port/win32/socket.c
+++ b/src/backend/port/win32/socket.c
@@ -121,12 +121,20 @@ TranslateSocketError(void)
 			errno = EADDRNOTAVAIL;
 			break;
 		case WSAEHOSTUNREACH:
-		case WSAEHOSTDOWN:
 		case WSAHOST_NOT_FOUND:
+			errno = EHOSTUNREACH;
+			break;
+		case WSAEHOSTDOWN:
+			errno = EHOSTDOWN;
+			break;
 		case WSAENETDOWN:
+			errno = ENETDOWN;
+			break;
 		case WSAENETUNREACH:
+			errno = ENETUNREACH;
+			break;
 		case WSAENETRESET:
-			errno = EHOSTUNREACH;
+			errno = ENETRESET;
 			break;
 		case WSAENOTCONN:
 		case WSAESHUTDOWN:
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index d0b368530e..637768d776 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -712,9 +712,8 @@ errcode_for_socket_access(void)
 	{
 			/* Loss of connection */
 		case EPIPE:
-#ifdef ECONNRESET
 		case ECONNRESET:
-#endif
+		case ALL_NETWORK_FAILURE_ERRNOS:
 			edata->sqlerrcode = ERRCODE_CONNECTION_FAILURE;
 			break;
 
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index f0587f41e4..fa798431ea 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -1825,10 +1825,15 @@ piperead(int s, char *buf, int len)
 {
 	int			ret = recv(s, buf, len, 0);
 
-	if (ret < 0 && WSAGetLastError() == WSAECONNRESET)
+	if (ret < 0)
 	{
-		/* EOF on the pipe! */
-		ret = 0;
+		int			werrno = TranslateSocketError(WSAGetLastError());
+
+		if (errno_is_connection_loss(werrno))
+		{
+			/* EOF on the pipe! */
+			ret = 0;
+		}
 	}
 	return ret;
 }
diff --git a/src/include/port.h b/src/include/port.h
index 84bf2c363f..766bb3dd6d 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -99,6 +99,36 @@ extern void pgfnames_cleanup(char **filenames);
 )
 #endif
 
+/*
+ * This macro provides a centralized list of all errnos that identify
+ * network-level failure of a previously-established TCP connection.  This
+ * excludes ECONNRESET, which we treat as reporting a probable server crash.
+ * (EPIPE is also excluded, since it's likewise a server-crash indication,
+ * and not TCP either.)  The macro is intended to be used in a switch
+ * statement, in the form "case ALL_NETWORK_FAILURE_ERRNOS:".
+ */
+#define ALL_NETWORK_FAILURE_ERRNOS \
+	ECONNABORTED: \
+	case EHOSTDOWN: \
+	case EHOSTUNREACH: \
+	case ENETDOWN: \
+	case ENETRESET: \
+	case ENETUNREACH
+
+/* Test for all connection-loss errnos, whether server or network failure */
+static inline bool
+errno_is_connection_loss(int e)
+{
+	switch (e)
+	{
+		case EPIPE:
+		case ECONNRESET:
+		case ALL_NETWORK_FAILURE_ERRNOS:
+			return true;
+	}
+	return false;
+}
+
 /* 

Re: [PATCH] ecpg: fix progname memory leak

2020-10-09 Thread Magnus Hagander
On Fri, Oct 9, 2020 at 1:08 PM John Naylor 
wrote:

>
> On Fri, Oct 9, 2020 at 4:47 AM Daniel Gustafsson  wrote:
>
>> Now, I don't actually suggest we *remove* it, as there is valuable curated
>> content, but that we rename it to something which won't encourage
>> newcomers to
>> pick something from the list simply because it exists.  The name "TODO"
>> implies
>> that the list is something which it isn't.
>
>
> +1
>

+1 as well. The way things are today, that is a terrible name, and has been
for a double-digit number of years.


> The name is very much at odds with the content. To pick one baffling
> example, there's an entry under Free Space Map that dates to before the
> on-disk FSM was invented. A more accurate, if not charitable, description
> is "archive of stalled discussions". This is just a side effect of
> non-controversial todos getting finished over time. That could be helped
> with a round of aggressive culling.
>


That is one very good example indeed.


> Also, it's organized by functionality, which makes sense in a way, but I
> don't find very useful in this context. Better categories might be
>
> help-wanted,
> good-first-issue (I know this is a tough one),
> feature-request,
> won't-fix (The memory leaks under discussion go here, it seems),
> code-cleanup,
> research-project,
> documentation,
> performance,
> user-experience,
> sql-standard
> etc.
>
> I suspect we will eventually want something like a full-blown issue
> tracker, although I gather that's been rejected previously. But a wiki
> page is not really suited for this.
>

Talk about "possibly controversial proposal" eh?

That said, having this in a different format would in no way fix the fact
that the information is mislabeled and obsolete.  It would likely be
equally mislabeled and obsolete in an issue tracker. Just look at almost
any of the other projects out there that *do* use an issue tracker and have
been around for 20+ years.

That said, I do believe that we should have one, yes. But I am
equally certain that it would not help with *this* particular problem.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Possible NULL dereferencing null pointer (src/backend/executor/nodeIncrementalSort.c)

2020-10-09 Thread Ranier Vilela
I think the TupIsNull macro is no longer appropriate for protecting
ExecCopySlot.

See tuptable.h:
#define TupIsNull(slot) \
((slot) == NULL || TTS_EMPTY(slot))

If the var node->group_pivot is NULL, ExecCopySlot will
dereference a null pointer (its first arg).

Maybe, this can be related to a bug reported in the btree deduplication.

regards,
Ranier Vilela


avoid_dereferencing_null_pointer.patch
Description: Binary data


Re: Wrong example in the bloom documentation

2020-10-09 Thread Bruce Momjian
On Fri, Oct  9, 2020 at 05:44:57AM +, Daniel Westermann (DWE) wrote:
> Hi Bruce, Tom,
> 
> On Thu, Oct  8, 2020 at 03:43:32PM -0400, Tom Lane wrote:
> >> "Daniel Westermann (DWE)"  writes:
> >> >> I was hoping someone more experienced with this would comment, but
> >> >> seeing none, I will apply it in a day or two to all supported versions?
> >> >> Have you tested this output back to 9.5?
> >> 
> >> > I hoped that as well. No, I tested down to 9.6 because the change 
> >> > happened in 10.
> >> 
> >> The patch assumes that parallel query is enabled, which is not true by
> >> default before v10, so it should certainly not be applied before v10
> >> (at least not without significant revisions).
> 
> Yes, the behavior change was in 10. Before 10 the example is fine, I would 
> not apply that to any prior version, otherwise the whole example needs to be 
> rewritten.

Agreed.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: libpq debug log

2020-10-09 Thread Alvaro Herrera
On 2020-Oct-09, Kyotaro Horiguchi wrote:

> I saw that version and have some comments.
> 
> +pqGetProtocolMsgType(unsigned char c, PGCommSource commsource)
> +{
> + const char *message_type;
> 
> Compiler complains that this is unused.
> 
> +static const char *
> +pqGetProtocolMsgType(unsigned char c, PGCommSource commsource)
> +{
> ...
> + else
> + return "UnknownCommand";
> +}
> 
> Compiler complains as "control reached end of non-void function"

Yeah, those two warnings are caused by the same problem, namely that I
was editing this function to make it simpler and apparently the patch
version I sent does not include all such changes.  The fix is to remove
the message_type variable and have the two assignments be "return".

> +pqLogMsgString(PGconn *conn, const char *v, int length, PGCommSource 
> commsource)
> +{
> + if (length < 0)
> + length = strlen(v) + 1;
> +

> pqLogMsgString(conn, str, -1, FROM_*) means the actual length may be
> different from what the caller thinks, but pqLogLineBreak() subtracts
> that value from the message length remembered in logging_message.
> Anyway, AFAICS the patch doesn't use that code path, so we should remove
> the first two lines.

True, +1 for removing it.

> By the way, appendBinaryPQExpBuffer() enlarges its buffer by exactly
> the length of the given data, but appends '\0' at the end of
> the copied data. Couldn't that lead to a memory overrun?

Doesn't enlargePQExpBuffer() include room for the trailing zero?  I
think it does.
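
Assuming it does, the pattern is the usual one of reserving datalen plus one
byte before copying and then writing the terminator; a toy append buffer
(deliberately not libpq's pqexpbuffer code) showing that pattern:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct
{
	char	   *data;
	size_t		len;			/* bytes used, excluding the trailing '\0' */
	size_t		maxlen;			/* bytes allocated */
} buf_t;

static bool
buf_reserve(buf_t *b, size_t more)
{
	size_t		needed = b->len + more + 1;		/* +1 keeps room for '\0' */

	if (needed <= b->maxlen)
		return true;
	while (b->maxlen < needed)
		b->maxlen = b->maxlen ? b->maxlen * 2 : 64;
	b->data = realloc(b->data, b->maxlen);		/* error handling kept minimal */
	return b->data != NULL;
}

static bool
buf_append_binary(buf_t *b, const char *data, size_t datalen)
{
	if (!buf_reserve(b, datalen))
		return false;
	memcpy(b->data + b->len, data, datalen);
	b->len += datalen;
	b->data[b->len] = '\0';		/* safe: buf_reserve accounted for it */
	return true;
}

int
main(void)
{
	buf_t		b = {0};

	buf_append_binary(&b, "abc", 3);
	buf_append_binary(&b, "def", 3);
	printf("%s (%zu bytes)\n", b.data, b.len);
	free(b.data);
	return 0;
}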




Re: Uninitialized var utilized (src/backend/tsearch/spell.c)

2020-10-09 Thread Ranier Vilela
On Fri, 9 Oct 2020 at 11:37, Tom Lane wrote:

> Ranier Vilela  writes:
> > On Fri, 9 Oct 2020 at 11:08, Daniel Gustafsson wrote:
> >> To help reviewers, your report should contain an explanation of when
> that
> >> can happen.
>
> > When option "flag" is not handled.
> > if (STRNCMP(pstr, "flag") == 0)
>
> I think what he means is that if the file contains no "flag" command
> before an affix entry then we would arrive at NIAddAffix with an
> undefined flag buffer.  That's illegal syntax according to a quick scan
> of the ispell(5) man page, which explains the lack of complaints; but
> it might be worth guarding against.
>
> Aside from failing to initialize some variables that need it, it looks to
> me like NIImportAffixes is uselessly initializing some variables that
> don't need it.  I'd also be inclined to figure out which values are
> actually meant to be carried across lines, and declare the ones that
> aren't inside the loop, just for clarity.
>
Thanks Tom, for the great explanation.

regards,
Ranier Vilela


Re: Uninitialized var utilized (src/backend/tsearch/spell.c)

2020-10-09 Thread Tom Lane
Ranier Vilela  writes:
> On Fri, 9 Oct 2020 at 11:08, Daniel Gustafsson wrote:
>> To help reviewers, your report should contain an explanation of when that
>> can happen.

> When option "flag" is not handled.
> if (STRNCMP(pstr, "flag") == 0)

I think what he means is that if the file contains no "flag" command
before an affix entry then we would arrive at NIAddAffix with an
undefined flag buffer.  That's illegal syntax according to a quick scan
of the ispell(5) man page, which explains the lack of complaints; but
it might be worth guarding against.

Aside from failing to initialize some variables that need it, it looks to
me like NIImportAffixes is uselessly initializing some variables that
don't need it.  I'd also be inclined to figure out which values are
actually meant to be carried across lines, and declare the ones that
aren't inside the loop, just for clarity.
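
For illustration only (this is not the spell.c code), the kind of guard being
suggested might look like this: give the flag buffer a defined empty value up
front and refuse to use it until a "flag" line has actually been seen:

#include <stdbool.h>
#include <stdio.h>

#define BUFSIZE 256

int
main(void)
{
	char		flag[BUFSIZE] = "";		/* defined even without a "flag" line */
	bool		flag_seen = false;
	char		line[BUFSIZE];

	while (fgets(line, sizeof(line), stdin))
	{
		if (sscanf(line, "flag %250s", flag) == 1)
		{
			flag_seen = true;
			continue;
		}
		if (!flag_seen)
		{
			fprintf(stderr, "entry before any \"flag\" line, skipped: %s", line);
			continue;
		}
		printf("entry under flag \"%s\": %s", flag, line);
	}
	return 0;
}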

regards, tom lane




Re: Expansion of our checks for connection-loss errors

2020-10-09 Thread Tom Lane
Fujii Masao  writes:
> On 2020/10/09 4:15, Tom Lane wrote:
>> -#ifdef ECONNRESET
>> -case ECONNRESET:
>> +case ALL_CONNECTION_LOSS_ERRNOS:
>>  printfPQExpBuffer(&conn->errorMessage,
>>
>> libpq_gettext("server closed the connection unexpectedly\n"
>>  
>> "\tThis probably means the server terminated abnormally\n"
>>  
>> "\tbefore or while processing the request.\n"));

> This change causes the same error message to be reported for those five errnos.
> That is, we cannot identify which errno was actually reported from the error
> message. But I just wonder if it would be more helpful for troubleshooting if we,
> for example, appended strerror() to the message so that we can easily
> identify the errno. Thoughts?

Hmm, excellent point.  While our code response to all these errors
should be the same, you are right that that doesn't extend to emitting
identical error texts.  For EHOSTUNREACH/ENETDOWN/ENETUNREACH, we
should say something like "connection to server lost", without claiming
that the server crashed.  It is less clear what to do with ECONNABORTED,
but I'm inclined to put it in the network-problem bucket not the
server-crash bucket, despite lorikeet's behavior.  Thoughts?

This also destroys the patch's idea that switch statements should be
able to handle these all alike.  If we group things as "ECONNRESET means
server crash and the others are all network failures", then I'd be
inclined to leave the ECONNRESET cases alone and just introduce
new infrastructure to recognize all the network-failure errnos.

regards, tom lane




Re: Uninitialized var utilized (src/backend/tsearch/spell.c)

2020-10-09 Thread Daniel Gustafsson
> On 9 Oct 2020, at 14:36, Ranier Vilela  wrote:

> At function NIImportAffixes (src/backend/tsearch/spell.c).
> 
> If option "flag" is not handled, variable char flag[BUFSIZE] will remain 
> uninitialized.

To help reviewers, your report should contain an explanation of when that can
happen.

cheers ./daniel



Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Greg Nancarrow
On Fri, Oct 9, 2020 at 11:57 PM Amit Kapila  wrote:
>
> >
> > Sure but in that case 'top_plan->parallel_safe' should be false and it
> > should stick Gather node atop Insert node.
> >
>
> /should/should not.
>

OK, for the minimal patch, just allowing INSERT with parallel SELECT,
you're right, neither of those additional "commandType == CMD_SELECT"
checks are needed, so I'll remove them. (In the main patch, I think
the first check can be removed, once the XID handling is fixed; the
second check is definitely needed though).

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Two fsync related performance issues?

2020-10-09 Thread Michael Banck
Hi,

Am Mittwoch, den 07.10.2020, 18:17 +1300 schrieb Thomas Munro:
> On Mon, Oct 5, 2020 at 2:38 PM Thomas Munro  wrote:
> > On Wed, Sep 9, 2020 at 3:49 PM Thomas Munro  wrote:
> > > For the record, Andres Freund mentioned a few problems with this
> > > off-list and suggested we consider calling Linux syncfs() for each top
> > > level directory that could potentially be on a different filesystem.
> > > That seems like a nice idea to look into.
> > 
> > Here's an experimental patch to try that.  One problem is that before
> > Linux 5.8, syncfs() doesn't report failures[1].  I'm not sure what to
> > think about that; in the current coding we just log them and carry on
> > anyway, but any theoretical problems that causes for BLK_DONE should
> > be moot anyway because of FPIs which result in more writes and syncs.
> > Another is that it may affect other files that aren't under pgdata as
> > collateral damage, but that seems acceptable.  It also makes me a bit
> > sad that this wouldn't help other OSes.
> 
> ... and for comparison/discussion, here is an alternative patch that
> figures out precisely which files need to be fsync'd using information
> in the WAL.  On a system with full_page_writes=on, this effectively
> means that we don't have to do anything at all for relation files, as
> described in more detail in the commit message.  You still need to
> fsync the WAL files to make sure you can't replay records from the log
> that were written but not yet fdatasync'd, addressed in the patch.
> I'm not yet sure which other kinds of special files might need special
> treatment.
> 
> Some thoughts:
> 1.  Both patches avoid the need to open many files.  With 1 million
> tables this can take over a minute even on a fast system with warm
> caches and/or fast local storage, before replay begins, and for a cold
> system with high latency storage it can be a serious problem.

You mention "serious problem" and "over a minute", but I don't recall
you mentioning how long it takes with those two patches (or maybe I
missed it), so here goes:

I created an instance with 250 databases on 250 tablespaces on SAN
storage. This is on 12.4; the patches can be backpatched with minimal
changes.

After pg_ctl stop -m immediate, a pg_ctl start -w (or rather, the time
between the two log messages "database system was interrupted; last
known up at %s" and "database system was not properly shut down;
automatic recovery in progress") took:

1. 12-13 seconds on vanilla
2. usually < 10 ms, sometimes 70-80 ms with the syncfs patch
3. 4 ms with the optimized sync patch

That's a dramatic improvement, but maybe also a best-case scenario, as no
traffic happened since the last checkpoint. I did some light pgbench
before killing the server again, but couldn't get the optimized sync
patch to take any longer; I might try harder at some point but won't have
much more time to test today.
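
For anyone who wants to poke at the syncfs() side independently of the patch,
the call pattern is just open() on any file or directory on the filesystem
followed by syncfs() on that descriptor; a minimal standalone sketch, assuming
Linux and with made-up paths (this is not the actual patch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void
sync_filesystem_of(const char *path)
{
	int			fd = open(path, O_RDONLY);

	if (fd < 0)
	{
		perror(path);
		return;
	}
	/* flushes the whole filesystem containing 'path'; note that before
	 * Linux 5.8 write-back errors are not reported to the caller */
	if (syncfs(fd) < 0)
		perror("syncfs");
	close(fd);
}

int
main(void)
{
	/* made-up layout: data directory plus one tablespace mount point */
	sync_filesystem_of("/var/lib/postgresql/12/main");
	sync_filesystem_of("/mnt/tblspc1");
	return 0;
}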


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz





Re: Expansion of our checks for connection-loss errors

2020-10-09 Thread Fujii Masao




On 2020/10/09 4:15, Tom Lane wrote:

Over in the thread at [1], we've tentatively determined that the
reason buildfarm member lorikeet is currently failing is that its
network stack returns ECONNABORTED for (some?) connection failures,
whereas our code is only expecting ECONNRESET.  Fujii Masao therefore
proposes that we treat ECONNABORTED the same as ECONNRESET.  I think
this is a good idea, but after a bit of research I feel it does not
go far enough.  I find these POSIX-standard errnos that also seem
likely candidates to be returned for a hard loss of connection:

ECONNABORTED
EHOSTUNREACH
ENETDOWN
ENETUNREACH

All of these have been in POSIX since SUSv2, so it seems unlikely
that we need to #ifdef any of them.  (It is in any case pretty silly
that we have #ifdefs around a very small minority of our references
to ECONNRESET :-(.)

There are some other related errnos, such as ECONNREFUSED, that
don't seem like they'd be returned for a failure of a pre-existing
connection, so we don't need to include them in such tests.

Accordingly, I propose the attached patch (an expansion of
Fujii-san's) that causes us to test for all five errnos anyplace
we had been checking for ECONNRESET.


+1

Thanks for expanding the patch!

-#ifdef ECONNRESET
-   case ECONNRESET:
+   case ALL_CONNECTION_LOSS_ERRNOS:
printfPQExpBuffer(&conn->errorMessage,
  
libpq_gettext("server closed the connection unexpectedly\n"
  
  "\tThis probably means the server terminated abnormally\n"
  
  "\tbefore or while processing the request.\n"));

This change causes the same error message to be reported for those five errnos.
That is, we cannot identify which errno was actually reported from the error
message. But I just wonder if it would be more helpful for troubleshooting if we,
for example, appended strerror() to the message so that we can easily
identify the errno. Thoughts?



I felt that this was getting to
the point where we'd better centralize the knowledge of what to check,
so the patch does that, via an inline function and an admittedly hacky
macro.  I also upgraded some places such as strerror.c to have full
support for these symbols.

All of the machines I have (even as far back as HPUX 10.20) also
define ENETRESET and EHOSTDOWN.  However, those symbols do not appear
in SUSv2.  ENETRESET was added at some later point, but EHOSTDOWN is
still not in POSIX.  For the moment I've left these second-tier
symbols out of the patch, but there's a case for adding them.  I'm
not sure whether there'd be any point in trying to #ifdef them.

BTW, I took out the conditional defines of some of these errnos in
libpq's win32.h; AFAICS that's been dead code ever since we added
#define's for them to win32_port.h.  Am I missing something?

This seems like a bug fix to me, so I'm inclined to back-patch.


+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Assertion failure with LEFT JOINs among >500 relations

2020-10-09 Thread Tom Lane
David Rowley  writes:
> On Fri, 9 Oct 2020 at 15:06, Tom Lane  wrote:
>> I notice there are some other ad-hoc isnan() checks scattered
>> about costsize.c, too.  Maybe we should indeed consider fixing
>> clamp_row_estimate to get rid of inf (and nan too, I suppose)
>> so that we'd not need those.  I don't recall the exact cases
>> that made us introduce those checks, but they were for cases
>> a lot more easily reachable than this one, I believe.

> Is there actually a case where nrows could be NaN?  If not, then it
> seems like a wasted check.  Wouldn't it take one of the input
> relations or the input rels to have an Inf row estimate (which won't
> happen after changing clamp_row_estimate()), or the selectivity
> estimate being NaN.

I'm fairly certain that every one of the existing NaN checks was put
there on the basis of hard experience.  Possibly digging in the git
history would offer more info about exactly where the NaNs came from.

regards, tom lane




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 6:26 PM Amit Kapila  wrote:
>
> On Fri, Oct 9, 2020 at 3:51 PM Greg Nancarrow  wrote:
> >
> > On Fri, Oct 9, 2020 at 8:09 PM Amit Kapila  wrote:
> >
> > OK, I will update the comments for this.
> > Basically, up to now, the "force_parallel_mode" has only ever operated
> > on a SELECT.
> > But since we are now allowing CMD_INSERT to be assessed for parallel
> > mode too, we need to prevent the force_parallel_mode logic from
> > sticking a Gather node over the top of arbitrary INSERTs and causing
> > them to be run in parallel. Not all INSERTs are suitable for parallel
> > operation, and also there are further considerations for
> > parallel-safety for INSERTs compared to SELECT. INSERTs can also
> > trigger UPDATEs.
> >
>
> Sure but in that case 'top_plan->parallel_safe' should be false and it
> should stick Gather node atop Insert node.
>

/should/should not.

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 3:51 PM Greg Nancarrow  wrote:
>
> On Fri, Oct 9, 2020 at 8:09 PM Amit Kapila  wrote:
> >
> > 0001-InsertParallelSelect
> > 1.
> > ParallelContext *pcxt;
> >
> > + /*
> > + * We need to avoid an attempt on INSERT to assign a
> > + * FullTransactionId whilst in parallel mode (which is in
> > + * effect due to the underlying parallel query) - so the
> > + * FullTransactionId is assigned here. Parallel mode must
> > + * be temporarily escaped in order for this to be possible.
> > + * The FullTransactionId will be included in the transaction
> > + * state that is serialized in the parallel DSM.
> > + */
> > + if (estate->es_plannedstmt->commandType == CMD_INSERT)
> > + {
> > + Assert(IsInParallelMode());
> > + ExitParallelMode();
> > + GetCurrentFullTransactionId();
> > + EnterParallelMode();
> > + }
> > +
> >
> > This looks like a hack to me. I think you are doing this to avoid the
> > parallel mode checks in GetNewTransactionId(), right?
>
> > Yes, agreed, it is a hack to avoid that (mind you, it's not exactly great
> > that ExecutePlan() sets parallel mode for the entire plan execution).
> > Also, I did not expect that to necessarily remain in a final patch.
>
> >If so, I have
> > already mentioned above [1] that we can change it so that we disallow
> > assigning xids for parallel workers only. The same is true for the
> > check in ExecGatherMerge. Do you see any problem with that suggestion?
> >
>
> No, should be OK I guess, but will update and test to be sure.
>
> > 2.
> > @@ -337,7 +337,7 @@ standard_planner(Query *parse, const char
> > *query_string, int cursorOptions,
> >   */
> >   if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
> >   IsUnderPostmaster &&
> > - parse->commandType == CMD_SELECT &&
> > + (parse->commandType == CMD_SELECT || parse->commandType == CMD_INSERT) &&
> >   !parse->hasModifyingCTE &&
> >   max_parallel_workers_per_gather > 0 &&
> >   !IsParallelWorker())
> >
> > I think the comments above this need to be updated especially the part
> > where we says:"Note that we do allow CREATE TABLE AS, SELECT INTO, and
> > CREATE MATERIALIZED VIEW to use parallel plans, but as of now, only
> > the leader backend writes into a completely new table.". Don't we need
> > to include Insert also?
>
> Yes, Insert needs to be mentioned somewhere there.
>
> >
> > 3.
> > @@ -371,6 +371,7 @@ standard_planner(Query *parse, const char
> > *query_string, int cursorOptions,
> >   * parallel-unsafe, or else the query planner itself has a bug.
> >   */
> >   glob->parallelModeNeeded = glob->parallelModeOK &&
> > + (parse->commandType == CMD_SELECT) &&
> >   (force_parallel_mode != FORCE_PARALLEL_OFF);
> >
> > Why do you need this change? The comments above this code should be
> > updated to reflect this change. I think for the same reason the below
> > code seems to be modified but I don't understand the reason for the
> > below change as well, also it is better to update the comments for
> > this as well.
> >
>
> OK, I will update the comments for this.
> Basically, up to now, the "force_parallel_mode" has only ever operated
> on a SELECT.
> But since we are now allowing CMD_INSERT to be assessed for parallel
> mode too, we need to prevent the force_parallel_mode logic from
> sticking a Gather node over the top of arbitrary INSERTs and causing
> them to be run in parallel. Not all INSERTs are suitable for parallel
> operation, and also there are further considerations for
> parallel-safety for INSERTs compared to SELECT. INSERTs can also
> trigger UPDATEs.
>

Sure but in that case 'top_plan->parallel_safe' should be false and it
should stick Gather node atop Insert node. For the purpose of this
patch, the scan beneath Insert should be considered as parallel_safe.

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 5:54 PM Greg Nancarrow  wrote:
>
> On Fri, Oct 9, 2020 at 8:09 PM Amit Kapila  wrote:
> >
> > + /*
> > + * We need to avoid an attempt on INSERT to assign a
> > + * FullTransactionId whilst in parallel mode (which is in
> > + * effect due to the underlying parallel query) - so the
> > + * FullTransactionId is assigned here. Parallel mode must
> > + * be temporarily escaped in order for this to be possible.
> > + * The FullTransactionId will be included in the transaction
> > + * state that is serialized in the parallel DSM.
> > + */
> > + if (estate->es_plannedstmt->commandType == CMD_INSERT)
> > + {
> > + Assert(IsInParallelMode());
> > + ExitParallelMode();
> > + GetCurrentFullTransactionId();
> > + EnterParallelMode();
> > + }
> > +
> >
> > This looks like a hack to me. I think you are doing this to avoid the
> > parallel mode checks in GetNewTransactionId(), right? If so, I have
> > already mentioned above [1] that we can change it so that we disallow
> > assigning xids for parallel workers only. The same is true for the
> > check in ExecGatherMerge. Do you see any problem with that suggestion?
> >
>
> Actually, there is a problem.
> If I remove that "hack", and change the code in GetNewTransactionId()
> to disallow xid assignment for parallel workers only, then there is
> also similar code in AssignTransactionId() which gets called.
>

I don't think workers need to call AssignTransactionId(); before that,
the transaction id passed from the leader should be set in
CurrentTransactionState. Why would
GetCurrentTransactionId()/GetCurrentFullTransactionId() need to
call AssignTransactionId() when called from a worker?

> GetCurrentFullTransactionId() must be called in the leader, somewhere
> (and will be included in the transaction state that is serialized in
> the parallel DSM).
>

Yes, it should have been done in the leader, and then it should have been
set in the workers via StartParallelWorkerTransaction before we do any
actual operation. If that happens, then GetCurrentTransactionId() won't
need to call AssignTransactionId().
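
In other words, something along these lines on the leader side (only a sketch
of the ordering, reusing the names from the quoted snippet, not the final
patch):

	/*
	 * Sketch of the ordering only: the leader acquires the xid for the
	 * INSERT before entering parallel mode, so the transaction state
	 * serialized into the parallel DSM already carries it and workers
	 * never need AssignTransactionId() themselves.
	 */
	if (estate->es_plannedstmt->commandType == CMD_INSERT)
		(void) GetCurrentFullTransactionId();

	EnterParallelMode();
	/* ... build the DSM; workers later pick the xid up via
	 * StartParallelWorkerTransaction() ... */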

-- 
With Regards,
Amit Kapila.




Re: Transactions involving multiple postgres foreign servers, take 2

2020-10-09 Thread Masahiko Sawada
On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi  wrote:
>
> At Fri, 9 Oct 2020 02:33:37 +, "tsunakawa.ta...@fujitsu.com" 
>  wrote in
> > From: Masahiko Sawada 
> > > What about temporary network failures? I think there are users who
> > > don't want to give up resolving foreign transactions failed due to a
> > > temporary network failure. Or even they might want to wait for
> > > transaction completion until they send a cancel request. If we want to
> > > call the commit routine only once and therefore want FDW to retry
> > > connecting the foreign server within the call, it means we require all
> > > FDW implementors to write a retry loop code that is interruptible and
> > > ensures not to raise an error, which increases difficulty.
> > >
> > > Yes, but if we don’t retry to resolve foreign transactions at all on
> > > an unreliable network environment, the user might end up requiring
> > > every transaction to check the status of foreign transactions of the
> > > previous distributed transaction before starts. If we allow to do
> > > retry, I guess we ease that somewhat.
> >
> > OK.  As I said, I'm not against trying to cope with temporary network 
> > failure.  I just don't think it's mandatory.  If the network failure is 
> > really temporary and thus recovers soon, then the resolver will be able to 
> > commit the transaction soon, too.
>
> I should be missing something, though...
>
> I don't understand why we hate ERRORs from fdw-2pc-commit routine so
> much. I think remote-commits should be performed before local commit
> passes the point-of-no-return and the v26-0002 actually places
> AtEOXact_FdwXact() before the critical section.
>

So you're thinking the following sequence?

1. Prepare all foreign transactions.
2. Commit all the prepared foreign transactions.
3. Commit the local transaction.

Suppose we have the backend process call the commit routine: what if
one of the FDWs raises an ERROR while committing a foreign transaction
after other foreign transactions have been committed? The transaction will
end up aborted, but some foreign transactions are already committed.
Also, what if the backend process fails to commit the local
transaction? Since it has already committed all the foreign transactions, it
cannot ensure global atomicity in this case either. Therefore, I
think we should commit the distributed transactions in the following
sequence:

1. Prepare all foreign transactions.
2. Commit the local transaction.
3. Commit all the prepared foreign transactions.

But this is still not a perfect solution. If we have the backend
process call the commit routine and an error happens while executing
the commit routine of an FDW (i.e., at step 3), it's too late to report
an error to the client because we have already committed the local
transaction. So the current solution is to have a background process
commit the foreign transactions so that the backend can just wait
without the possibility of errors.

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Uninitialized var utilized (src/backend/tsearch/spell.c)

2020-10-09 Thread Ranier Vilela
At function NIImportAffixes (src/backend/tsearch/spell.c).

If option "flag" is not handled, variable char flag[BUFSIZE] will remain
uninitialized.

regards,
Ranier Vilela


fix_uninitialized_var_flag_spell.patch
Description: Binary data


Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Greg Nancarrow
On Fri, Oct 9, 2020 at 8:09 PM Amit Kapila  wrote:
>
> + /*
> + * We need to avoid an attempt on INSERT to assign a
> + * FullTransactionId whilst in parallel mode (which is in
> + * effect due to the underlying parallel query) - so the
> + * FullTransactionId is assigned here. Parallel mode must
> + * be temporarily escaped in order for this to be possible.
> + * The FullTransactionId will be included in the transaction
> + * state that is serialized in the parallel DSM.
> + */
> + if (estate->es_plannedstmt->commandType == CMD_INSERT)
> + {
> + Assert(IsInParallelMode());
> + ExitParallelMode();
> + GetCurrentFullTransactionId();
> + EnterParallelMode();
> + }
> +
>
> This looks like a hack to me. I think you are doing this to avoid the
> parallel mode checks in GetNewTransactionId(), right? If so, I have
> already mentioned above [1] that we can change it so that we disallow
> assigning xids for parallel workers only. The same is true for the
> check in ExecGatherMerge. Do you see any problem with that suggestion?
>

Actually, there is a problem.
If I remove that "hack", and change the code in GetNewTransactionId()
to disallow xid assignment for parallel workers only, then there is
also similar code in AssignTransactionId() which gets called. If I
change that code too, in the same way, then on a parallel INSERT, that
code gets called by a parallel worker (from GetCurrentTransactionId())
and the ERROR "cannot assign XIDs in a parallel worker" results.
GetCurrentFullTransactionId() must be called in the leader, somewhere
(and will be included in the transaction state that is serialized in
the parallel DSM).
If not done here, then where?

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 4:28 PM Greg Nancarrow  wrote:
>
> On Fri, Oct 9, 2020 at 8:41 PM Amit Kapila  wrote:
> >
> > On Fri, Oct 9, 2020 at 2:37 PM Greg Nancarrow  wrote:
> > >
> > > Speaking of costing, I'm not sure I really agree with the current
> > > costing of a Gather node. Just considering a simple Parallel SeqScan
> > > case, the "run_cost += parallel_tuple_cost * path->path.rows;" part of
> > > Gather cost always completely drowns out any other path costs when a
> > > large number of rows are involved (at least with default
> > > parallel-related GUC values), such that Parallel SeqScan would never
> > > be the cheapest path. This linear relationship in the costing based on
> > > the rows and a parallel_tuple_cost doesn't make sense to me. Surely
> > > after a certain amount of rows, the overhead of launching workers will
> > > be out-weighed by the benefit of their parallel work, such that the
> > > more rows, the more likely a Parallel SeqScan will benefit.
> > >
> >
> > That will be true for the number of rows/pages we need to scan not for
> > the number of tuples we need to return as a result. The formula here
> > considers the number of rows the parallel scan will return and the
> > more the number of rows each parallel node needs to pass via shared
> > memory to gather node the more costly it will be.
> >
> > We do consider the total pages we need to scan in
> > compute_parallel_worker() where we use a logarithmic formula to
> > determine the number of workers.
> >
>
> Despite all the best intentions, the current costings seem to be
> geared towards selection of a non-parallel plan over a parallel plan,
> the more rows there are in the table. Yet the performance of a
> parallel plan appears to be better than non-parallel-plan the more
> rows there are in the table.
> This doesn't seem right to me. Is there a rationale behind this costing model?
>

Yes, AFAIK, there is no proof that we can get any (much) gain by
dividing the I/O among workers. It is primarily the CPU effort which
gives the benefit. So, parallel plans show greater benefit when we
have to scan a large table and then project far fewer rows.
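
For a feel of the magnitudes involved, here is a quick standalone calculation
of just the Gather overhead term using the documented GUC defaults
(parallel_setup_cost = 1000, parallel_tuple_cost = 0.1); it is plain
arithmetic, not planner code:

#include <stdio.h>

int
main(void)
{
	const double parallel_tuple_cost = 0.1;		/* GUC default */
	const double parallel_setup_cost = 1000.0;	/* GUC default */
	const double rows[] = {1e4, 1e6, 1e8};		/* rows returned through Gather */

	for (int i = 0; i < 3; i++)
		printf("rows returned: %.0f  gather overhead: %.0f\n",
			   rows[i], parallel_setup_cost + parallel_tuple_cost * rows[i]);
	return 0;
}

So that term grows linearly with the number of rows passed back through the
Gather, which is why returning (projecting) fewer rows matters much more than
how many pages were scanned.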

-- 
With Regards,
Amit Kapila.




Re: [PATCH] ecpg: fix progname memory leak

2020-10-09 Thread John Naylor
On Fri, Oct 9, 2020 at 4:47 AM Daniel Gustafsson  wrote:

> Now, I don't actually suggest we *remove* it, as there is valuable curated
> content, but that we rename it to something which won't encourage
> newcomers to
> pick something from the list simply because it exists.  The name "TODO"
> implies
> that the list is something which it isn't.


+1

The name is very much at odds with the content. To pick one baffling
example, there's an entry under Free Space Map that dates to before the
on-disk FSM was invented. A more accurate, if not charitable, description
is "archive of stalled discussions". This is just a side effect of
non-controversial todos getting finished over time. That could be helped
with a round of aggressive culling.

Also, it's organized by functionality, which makes sense in a way, but I
don't find very useful in this context. Better categories might be

help-wanted,
good-first-issue (I know this is a tough one),
feature-request,
won't-fix (The memory leaks under discussion go here, it seems),
code-cleanup,
research-project,
documentation,
performance,
user-experience,
sql-standard
etc.

I suspect we will eventually want something like a full-blown issue
tracker, although I gather that's been rejected previously. But a wiki
page is not really suited for this.

I've found that a better source of actual "todo"s from developers can be
found in some commit messages, by grepping for "left for future work" or
"for another day". I've gotten patch ideas (and more than one committed)
from these. If I were asked, that's where I'd point aspiring developers.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Transactions involving multiple postgres foreign servers, take 2

2020-10-09 Thread Masahiko Sawada
On Fri, 9 Oct 2020 at 11:33, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Masahiko Sawada 
> > What about temporary network failures? I think there are users who
> > don't want to give up resolving foreign transactions failed due to a
> > temporary network failure. Or even they might want to wait for
> > transaction completion until they send a cancel request. If we want to
> > call the commit routine only once and therefore want FDW to retry
> > connecting the foreign server within the call, it means we require all
> > FDW implementors to write a retry loop code that is interruptible and
> > ensures not to raise an error, which increases difficulty.
> >
> > Yes, but if we don’t retry to resolve foreign transactions at all on
> > an unreliable network environment, the user might end up requiring
> > every transaction to check the status of foreign transactions of the
> > previous distributed transaction before starts. If we allow to do
> > retry, I guess we ease that somewhat.
>
> OK.  As I said, I'm not against trying to cope with temporary network 
> failure.  I just don't think it's mandatory.  If the network failure is 
> really temporary and thus recovers soon, then the resolver will be able to 
> commit the transaction soon, too.

Well, I agree that it's not mandatory. I think it's better if the user
can choose.

I also doubt how useful the per-foreign-server timeout setting you
mentioned before would be. For example, suppose the transaction involves
three foreign servers that have different timeout settings: what if the
backend fails to commit on the first of those servers due to a
timeout? Does it attempt to commit on the other two servers? Or does
it give up and return control to the client? In the former case,
what if the backend fails again on one of the other two servers due
to a timeout? The backend might end up waiting for all the timeouts, and in
practice the user is not aware of how many servers are involved in
the transaction, for example under sharding. So it seems hard to
predict the total timeout. In the latter case, the backend might
succeed in committing on the other two nodes. Also, the timeout setting of
the first foreign server is then effectively used as the timeout for the
whole foreign transaction resolution. However, the user cannot control the
order of resolution, so again it seems hard for the user to
predict the timeout. So if we have a timeout mechanism, I think it's
better if the user can control the timeout for each transaction.
Probably the same is true for the retry.

>
> Then, we can have a commit retry timeout or retry count like the following 
> WebLogic manual says.  (I couldn't quickly find the English manual, so below 
> is in Japanese.  I quoted some text that got through machine translation, 
> which appears a bit strange.)
>
> https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm
> --
> Abandon timeout
> Specifies the maximum time (in seconds) that the transaction manager attempts 
> to complete the second phase of a two-phase commit transaction.
>
> In the second phase of a two-phase commit transaction, the transaction 
> manager attempts to complete the transaction until all resource managers 
> indicate that the transaction is complete. After the abort transaction timer 
> expires, no attempt is made to resolve the transaction. If the transaction 
> enters a ready state before it is destroyed, the transaction manager rolls 
> back the transaction and releases the held lock on behalf of the destroyed 
> transaction.
> --

Yeah per-transaction timeout for 2nd phase of 2PC seems a good idea.

>
>
>
> > Also, what if the user sets the statement timeout to 60 sec and they
> > want to cancel the waits after 5 sec by pressing ctl-C? You mentioned
> > that client libraries of other DBMSs don't have asynchronous execution
> > functionality. If the SQL execution function is not interruptible, the
> > user will end up waiting for 60 sec, which seems not good.
>
> FDW functions can be uninterruptible in general, aren't they?  We experienced 
> that odbc_fdw didn't allow cancellation of SQL execution.

For example, postgres_fdw executes SQL in an asynchronous manner
using PQsendQuery(), PQconsumeInput(), PQgetResult() and so on
(see do_sql_command() and pgfdw_get_result()). Therefore, if the user
pressed Ctrl-C, the remote query would be canceled and raise an ERROR.


Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Greg Nancarrow
On Fri, Oct 9, 2020 at 8:41 PM Amit Kapila  wrote:
>
> On Fri, Oct 9, 2020 at 2:37 PM Greg Nancarrow  wrote:
> >
> > Speaking of costing, I'm not sure I really agree with the current
> > costing of a Gather node. Just considering a simple Parallel SeqScan
> > case, the "run_cost += parallel_tuple_cost * path->path.rows;" part of
> > Gather cost always completely drowns out any other path costs when a
> > large number of rows are involved (at least with default
> > parallel-related GUC values), such that Parallel SeqScan would never
> > be the cheapest path. This linear relationship in the costing based on
> > the rows and a parallel_tuple_cost doesn't make sense to me. Surely
> > after a certain amount of rows, the overhead of launching workers will
> > be out-weighed by the benefit of their parallel work, such that the
> > more rows, the more likely a Parallel SeqScan will benefit.
> >
>
> That will be true for the number of rows/pages we need to scan not for
> the number of tuples we need to return as a result. The formula here
> considers the number of rows the parallel scan will return and the
> more the number of rows each parallel node needs to pass via shared
> memory to gather node the more costly it will be.
>
> We do consider the total pages we need to scan in
> compute_parallel_worker() where we use a logarithmic formula to
> determine the number of workers.
>

Despite all the best intentions, the current costing seems to be
geared towards selecting a non-parallel plan over a parallel plan
the more rows there are in the table. Yet the performance of a
parallel plan appears to be better than that of a non-parallel plan the more
rows there are in the table.
This doesn't seem right to me. Is there a rationale behind this costing model?
I have pointed out the part of the parallel_tuple_cost calculation
that seems to drown out all other costs (causing the cost value to be
huge) the more rows there are in the table.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Parallel copy

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 3:50 PM Bharath Rupireddy
 wrote:
>
> On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila  wrote:
> >
> > On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
> >  wrote:
> > >
> > > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  
> > > wrote:
> > > >
> > > > From the testing perspective,
> > > > 1. Test by having something force_parallel_mode = regress which means
> > > > that all existing Copy tests in the regression will be executed via
> > > > new worker code. You can have this as a test-only patch for now and
> > > > make sure all existing tests passed with this.
> > > >
> > >
> > > I don't think all the existing copy test cases(except the new test cases 
> > > added in the parallel copy patch set) would run inside the parallel 
> > > worker if force_parallel_mode is on. This is because, the parallelism 
> > > will be picked up for parallel copy only if parallel option is specified 
> > > unlike parallelism for select queries.
> > >
> >
> > Sure, you need to change the code such that when force_parallel_mode =
> > 'regress' is specified then it always uses one worker. This is
> > primarily for testing purposes and will help during the development of
> > this patch as it will make all exiting Copy tests to use quite a good
> > portion of the parallel infrastructure.
> >
>
> IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS
> as default value in guc.c,
>

No need to set this as the default value. You can change it in
postgresql.conf before running tests.

> and then adjust the parallelism related
> code in copy.c such that it always picks 1 worker and spawns it. This
> way, all the existing copy test cases would be run in parallel worker.
> Please let me know if this is okay.
>

Yeah, this sounds fine.

> If yes, I will do this and update
> here.
>

Okay, thanks, but ensure the difference in test execution before and
after your change. After your change, all the 'copy' tests should
invoke the worker to perform a copy.
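
To illustrate the shape of that tweak (the function and variable names below
are made up; this is not the actual patch):

#include "postgres.h"
#include "optimizer/planmain.h"     /* force_parallel_mode, FORCE_PARALLEL_REGRESS */

/*
 * Hypothetical sketch only: where copy.c decides how many workers to use,
 * force a single worker under force_parallel_mode = regress so that the
 * existing COPY tests exercise the parallel worker code path.
 */
static int
decide_copy_workers_sketch(int requested_workers)
{
    if (requested_workers > 0)
        return requested_workers;       /* PARALLEL option given explicitly */
    if (force_parallel_mode == FORCE_PARALLEL_REGRESS)
        return 1;                       /* test-only: run via one worker */
    return 0;                           /* fall back to serial COPY */
}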

> >
> > > All the above tests are performed on the latest v6 patch set (attached 
> > > here in this thread) with custom postgresql.conf[1]. The results are of 
> > > the triplet form (exec time in sec, number of workers, gain)
> > >
> >
> > Okay, so I am assuming the performance is the same as we have seen
> > with the earlier versions of patches.
> >
>
> Yes. Most recent run on v5 patch set [1]
>

Okay, good to know that.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-09 Thread Bharath Rupireddy
On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila  wrote:
>
> On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
>  wrote:
> >
> > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  wrote:
> > >
> > > From the testing perspective,
> > > 1. Test by having something force_parallel_mode = regress which means
> > > that all existing Copy tests in the regression will be executed via
> > > new worker code. You can have this as a test-only patch for now and
> > > make sure all existing tests passed with this.
> > >
> >
> > I don't think all the existing copy test cases(except the new test cases 
> > added in the parallel copy patch set) would run inside the parallel worker 
> > if force_parallel_mode is on. This is because, the parallelism will be 
> > picked up for parallel copy only if parallel option is specified unlike 
> > parallelism for select queries.
> >
>
> Sure, you need to change the code such that when force_parallel_mode =
> 'regress' is specified then it always uses one worker. This is
> primarily for testing purposes and will help during the development of
> this patch as it will make all exiting Copy tests to use quite a good
> portion of the parallel infrastructure.
>

IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS
as the default value in guc.c, and then adjust the parallelism-related
code in copy.c such that it always picks 1 worker and spawns it. This
way, all the existing copy test cases would be run in a parallel worker.
Please let me know if this is okay. If yes, I will do this and update
here.

>
> > All the above tests are performed on the latest v6 patch set (attached here 
> > in this thread) with custom postgresql.conf[1]. The results are of the 
> > triplet form (exec time in sec, number of workers, gain)
> >
>
> Okay, so I am assuming the performance is the same as we have seen
> with the earlier versions of patches.
>

Yes. Most recent run on v5 patch set [1]

>
> > Overall, we have below test cases to cover the code and for performance 
> > measurements. We plan to run these tests whenever a new set of patches is 
> > posted.
> >
> > 1. csv
> > 2. binary
>
> Don't we need the tests for plain text files as well?
>

Will add one.

>
> > 3. force parallel mode = regress
> > 4. toast data csv and binary
> > 5. foreign key check, before row, after row, before statement, after 
> > statement, instead of triggers
> > 6. partition case
> > 7. foreign partitions and partitions having trigger cases
> > 8. where clause having parallel unsafe and safe expression, default 
> > parallel unsafe and safe expression
> > 9. temp, global, local, unlogged, inherited tables cases, foreign tables
> >
>
> Sounds like good coverage. So, are you doing all this testing
> manually? How are you maintaining these tests?
>

Yes, running them manually. A few of the tests (1, 2, 4) require huge
datasets for performance measurements, and the other test cases are to
ensure we don't choose parallelism. We will try to add the test cases
that are not meant for performance measurement to the patch's tests.

[1] - 
https://www.postgresql.org/message-id/CALj2ACW%3Djm5ri%2B7rXiQaFT_c5h2rVS%3DcJOQVFR5R%2Bbowt3QDkw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Greg Nancarrow
On Fri, Oct 9, 2020 at 8:09 PM Amit Kapila  wrote:
>
> 0001-InsertParallelSelect
> 1.
> ParallelContext *pcxt;
>
> + /*
> + * We need to avoid an attempt on INSERT to assign a
> + * FullTransactionId whilst in parallel mode (which is in
> + * effect due to the underlying parallel query) - so the
> + * FullTransactionId is assigned here. Parallel mode must
> + * be temporarily escaped in order for this to be possible.
> + * The FullTransactionId will be included in the transaction
> + * state that is serialized in the parallel DSM.
> + */
> + if (estate->es_plannedstmt->commandType == CMD_INSERT)
> + {
> + Assert(IsInParallelMode());
> + ExitParallelMode();
> + GetCurrentFullTransactionId();
> + EnterParallelMode();
> + }
> +
>
> This looks like a hack to me. I think you are doing this to avoid the
> parallel mode checks in GetNewTransactionId(), right?

Yes, agreed, it is a hack to avoid that (mind you, it's not exactly great
that ExecutePlan() sets parallel mode for the entire plan execution).
Also, I did not expect that to necessarily remain in a final patch.

>If so, I have
> already mentioned above [1] that we can change it so that we disallow
> assigning xids for parallel workers only. The same is true for the
> check in ExecGatherMerge. Do you see any problem with that suggestion?
>

No, it should be OK I guess, but I will update and test to be sure.
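
For reference, the shape of the two checks (illustrative only, not the
actual patch):

#include "postgres.h"
#include "access/parallel.h"    /* IsParallelWorker() */
#include "access/xact.h"        /* IsInParallelMode() */

/*
 * Sketch of the suggested relaxation: only a parallel worker would be
 * forbidden from assigning an XID, so the leader running the INSERT can
 * still call GetCurrentFullTransactionId() while parallel mode is active.
 * CURRENT_BEHAVIOUR is just an illustrative switch between the two forms.
 */
static void
assign_xid_check_sketch(void)
{
#ifdef CURRENT_BEHAVIOUR
    /* what GetNewTransactionId() does today */
    if (IsInParallelMode())
        elog(ERROR, "cannot assign TransactionIds during a parallel operation");
#else
    /* the suggested, narrower check */
    if (IsParallelWorker())
        elog(ERROR, "cannot assign TransactionIds in a parallel worker");
#endif
}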

> 2.
> @@ -337,7 +337,7 @@ standard_planner(Query *parse, const char
> *query_string, int cursorOptions,
>   */
>   if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
>   IsUnderPostmaster &&
> - parse->commandType == CMD_SELECT &&
> + (parse->commandType == CMD_SELECT || parse->commandType == CMD_INSERT) &&
>   !parse->hasModifyingCTE &&
>   max_parallel_workers_per_gather > 0 &&
>   !IsParallelWorker())
>
> I think the comments above this need to be updated especially the part
> where we says:"Note that we do allow CREATE TABLE AS, SELECT INTO, and
> CREATE MATERIALIZED VIEW to use parallel plans, but as of now, only
> the leader backend writes into a completely new table.". Don't we need
> to include Insert also?

Yes, Insert needs to be mentioned somewhere there.

>
> 3.
> @@ -371,6 +371,7 @@ standard_planner(Query *parse, const char
> *query_string, int cursorOptions,
>   * parallel-unsafe, or else the query planner itself has a bug.
>   */
>   glob->parallelModeNeeded = glob->parallelModeOK &&
> + (parse->commandType == CMD_SELECT) &&
>   (force_parallel_mode != FORCE_PARALLEL_OFF);
>
> Why do you need this change? The comments above this code should be
> updated to reflect this change. I think for the same reason the below
> code seems to be modified but I don't understand the reason for the
> below change as well, also it is better to update the comments for
> this as well.
>

OK, I will update the comments for this.
Basically, up to now, "force_parallel_mode" has only ever operated
on a SELECT.
But since we are now allowing CMD_INSERT to be assessed for parallel
mode too, we need to prevent the force_parallel_mode logic from
sticking a Gather node over the top of arbitrary INSERTs and causing
them to be run in parallel. Not all INSERTs are suitable for parallel
operation, and there are also further parallel-safety considerations
for INSERTs compared to SELECTs. INSERTs can also trigger UPDATEs.
If we need to support force_parallel_mode for INSERT, more work will
need to be done.

> @@ -425,7 +426,7 @@ standard_planner(Query *parse, const char
> *query_string, int cursorOptions,
>   * Optionally add a Gather node for testing purposes, provided this is
>   * actually a safe thing to do.
>   */
> - if (force_parallel_mode != FORCE_PARALLEL_OFF && top_plan->parallel_safe)
> + if (force_parallel_mode != FORCE_PARALLEL_OFF && parse->commandType
> == CMD_SELECT && top_plan->parallel_safe)
>   {
>   Gather*gather = makeNode(Gather);
>
> [1] - 
> https://www.postgresql.org/message-id/CAA4eK1%2BE-pM0U6qw7EOF0yO0giTxdErxoJV9xTqN%2BLo9zdotFQ%40mail.gmail.com
>

Regards,
Greg Nancarrow
Fujitsu Australia




Re: SEARCH and CYCLE clauses

2020-10-09 Thread Pavel Stehule
pá 9. 10. 2020 v 11:40 odesílatel Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> napsal:

> On 2020-09-22 20:29, Pavel Stehule wrote:
> > The result is correct. When I tried to use UNION instead UNION ALL, the
> > pg crash
>
> I fixed the crash, but UNION [DISTINCT] won't actually work here because
> row/record types are not hashable.  I'm leaving the partial support in,
> but I'm documenting it as currently not supported.
>

 I think so UNION is a common solution against the cycles. So missing
support for this specific case is not a nice thing. How much work is needed
for hashing rows. It should not be too much code.


> > looks so clause USING in cycle detection is unsupported for DB2 and
> > Oracle - the examples from these databases doesn't work on PG without
> > modifications
>
> Yeah, the path clause is actually not necessary from a user's
> perspective, but it's required for internal bookkeeping.  We could
> perhaps come up with a mechanism to make it invisible coming out of the
> CTE (maybe give the CTE a target list internally), but that seems like a
> separate project.
>
> The attached patch fixes the issues you have reported (also the view
> issue from the other email).  I have also moved the whole rewrite
> support to a new file to not blow up rewriteHandler.c so much.
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: powerpc pg_atomic_compare_exchange_u32_impl: error: comparison of integer expressions of different signedness (Re: pgsql: For all ppc compilers, implement compare_exchange and) fetch_add

2020-10-09 Thread Noah Misch
On Fri, Oct 09, 2020 at 11:28:25AM +0200, Christoph Berg wrote:
> pg-cron on powerpc/ppc64/ppc64el is raising this warning inside the
> ppc atomics:
> 
> gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
> -Werror=vla -Wendif-labels -Wmissing-format-attribute 
> -Wimplicit-fallthrough=3 -Wformat-security -fno-strict-aliasing -fwrapv 
> -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation 
> -g -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -g -O2 
> -fdebug-prefix-map=/<>=. -fstack-protector-strong -Wformat 
> -Werror=format-security -fPIC -std=c99 -Wall -Wextra -Werror 
> -Wno-unknown-warning-option -Wno-unused-parameter -Wno-maybe-uninitialized 
> -Wno-implicit-fallthrough -Iinclude -I/usr/include/postgresql -I. -I./ 
> -I/usr/include/postgresql/13/server -I/usr/include/postgresql/internal  
> -Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o 
> src/job_metadata.o src/job_metadata.c
> In file included from /usr/include/postgresql/13/server/port/atomics.h:74,
>  from /usr/include/postgresql/13/server/utils/dsa.h:17,
>  from /usr/include/postgresql/13/server/nodes/tidbitmap.h:26,
>  from /usr/include/postgresql/13/server/access/genam.h:19,
>  from src/job_metadata.c:21:
> /usr/include/postgresql/13/server/port/atomics/arch-ppc.h: In function 
> ‘pg_atomic_compare_exchange_u32_impl’:
> /usr/include/postgresql/13/server/port/atomics/arch-ppc.h:97:42: error: 
> comparison of integer expressions of different signedness: ‘uint32’ {aka 
> ‘unsigned int’} and ‘int’ [-Werror=sign-compare]
>97 |   *expected <= PG_INT16_MAX && *expected >= PG_INT16_MIN)
>   |  ^~
> src/job_metadata.c: At top level:
> cc1: note: unrecognized command-line option ‘-Wno-unknown-warning-option’ may 
> have been intended to silence earlier diagnostics
> cc1: all warnings being treated as errors
> 
> Looking at the pg_atomic_compare_exchange_u32_impl, this looks like a
> genuine problem:
> 
> static inline bool
> pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
> uint32 *expected, uint32 newval)
> ...
> if (__builtin_constant_p(*expected) &&
> *expected <= PG_INT16_MAX && *expected >= PG_INT16_MIN)
> 
> If *expected is an unsigned integer, comparing it to PG_INT16_MIN
> which is a negative number doesn't make sense.
> 
> src/include/c.h:#define PG_INT16_MIN    (-0x7FFF-1)

Agreed.  I'll probably fix it like this:

--- a/src/include/port/atomics/arch-ppc.h
+++ b/src/include/port/atomics/arch-ppc.h
@@ -96,3 +96,4 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 
*ptr,
if (__builtin_constant_p(*expected) &&
-   *expected <= PG_INT16_MAX && *expected >= PG_INT16_MIN)
+   (int32) *expected <= PG_INT16_MAX &&
+   (int32) *expected >= PG_INT16_MIN)
__asm__ __volatile__(
@@ -185,3 +186,4 @@ pg_atomic_compare_exchange_u64_impl(volatile 
pg_atomic_uint64 *ptr,
if (__builtin_constant_p(*expected) &&
-   *expected <= PG_INT16_MAX && *expected >= PG_INT16_MIN)
+   (int64) *expected <= PG_INT16_MAX &&
+   (int64) *expected >= PG_INT16_MIN)
__asm__ __volatile__(




Re: Parallel copy

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
 wrote:
>
> On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila  wrote:
> >
> > From the testing perspective,
> > 1. Test by having something force_parallel_mode = regress which means
> > that all existing Copy tests in the regression will be executed via
> > new worker code. You can have this as a test-only patch for now and
> > make sure all existing tests passed with this.
> >
>
> I don't think all the existing copy test cases(except the new test cases 
> added in the parallel copy patch set) would run inside the parallel worker if 
> force_parallel_mode is on. This is because, the parallelism will be picked up 
> for parallel copy only if parallel option is specified unlike parallelism for 
> select queries.
>

Sure, you need to change the code such that when force_parallel_mode =
'regress' is specified it always uses one worker. This is
primarily for testing purposes and will help during the development of
this patch, as it will make all existing Copy tests use quite a good
portion of the parallel infrastructure.

>
> All the above tests are performed on the latest v6 patch set (attached here 
> in this thread) with custom postgresql.conf[1]. The results are of the 
> triplet form (exec time in sec, number of workers, gain)
>

Okay, so I am assuming the performance is the same as we have seen
with the earlier versions of patches.

> Overall, we have below test cases to cover the code and for performance 
> measurements. We plan to run these tests whenever a new set of patches is 
> posted.
>
> 1. csv
> 2. binary

Don't we need the tests for plain text files as well?

> 3. force parallel mode = regress
> 4. toast data csv and binary
> 5. foreign key check, before row, after row, before statement, after 
> statement, instead of triggers
> 6. partition case
> 7. foreign partitions and partitions having trigger cases
> 8. where clause having parallel unsafe and safe expression, default parallel 
> unsafe and safe expression
> 9. temp, global, local, unlogged, inherited tables cases, foreign tables
>

Sounds like good coverage. So, are you doing all this testing
manually? How are you maintaining these tests?

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 2:37 PM Greg Nancarrow  wrote:
>
> Speaking of costing, I'm not sure I really agree with the current
> costing of a Gather node. Just considering a simple Parallel SeqScan
> case, the "run_cost += parallel_tuple_cost * path->path.rows;" part of
> Gather cost always completely drowns out any other path costs when a
> large number of rows are involved (at least with default
> parallel-related GUC values), such that Parallel SeqScan would never
> be the cheapest path. This linear relationship in the costing based on
> the rows and a parallel_tuple_cost doesn't make sense to me. Surely
> after a certain amount of rows, the overhead of launching workers will
> be out-weighed by the benefit of their parallel work, such that the
> more rows, the more likely a Parallel SeqScan will benefit.
>

That will be true for the number of rows/pages we need to scan, not for
the number of tuples we need to return as a result. The formula here
considers the number of rows the parallel scan will return, and the
more rows each parallel node needs to pass via shared memory to the
Gather node, the more costly it will be.

We do consider the total pages we need to scan in
compute_parallel_worker() where we use a logarithmic formula to
determine the number of workers.

-- 
With Regards,
Amit Kapila.




Re: SEARCH and CYCLE clauses

2020-10-09 Thread Peter Eisentraut

On 2020-09-22 20:29, Pavel Stehule wrote:
The result is correct. When I tried to use UNION instead UNION ALL, the 
pg crash


I fixed the crash, but UNION [DISTINCT] won't actually work here because 
row/record types are not hashable.  I'm leaving the partial support in, 
but I'm documenting it as currently not supported.


looks so clause USING in cycle detection is unsupported for DB2 and 
Oracle - the examples from these databases doesn't work on PG without 
modifications


Yeah, the path clause is actually not necessary from a user's 
perspective, but it's required for internal bookkeeping.  We could 
perhaps come up with a mechanism to make it invisible coming out of the 
CTE (maybe give the CTE a target list internally), but that seems like a 
separate project.


The attached patch fixes the issues you have reported (also the view 
issue from the other email).  I have also moved the whole rewrite 
support to a new file to not blow up rewriteHandler.c so much.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 4b36e81def50be44bd1f68247e33a22343520b5e Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 9 Oct 2020 11:32:10 +0200
Subject: [PATCH v3] SEARCH and CYCLE clauses

Discussion: 
https://www.postgresql.org/message-id/flat/db80ceee-6f97-9b4a-8ee8-3ba0c58e5...@2ndquadrant.com
---
 doc/src/sgml/queries.sgml| 213 ++-
 doc/src/sgml/ref/select.sgml |  44 ++
 src/backend/nodes/copyfuncs.c|  39 ++
 src/backend/nodes/equalfuncs.c   |  35 ++
 src/backend/nodes/nodeFuncs.c|   6 +
 src/backend/nodes/outfuncs.c |  35 ++
 src/backend/nodes/readfuncs.c|  43 ++
 src/backend/parser/analyze.c |  47 +-
 src/backend/parser/gram.y|  58 +-
 src/backend/parser/parse_agg.c   |   7 +
 src/backend/parser/parse_cte.c   | 117 
 src/backend/parser/parse_expr.c  |   4 +
 src/backend/parser/parse_func.c  |   3 +
 src/backend/parser/parse_relation.c  |  54 +-
 src/backend/parser/parse_target.c|  21 +-
 src/backend/rewrite/Makefile |   1 +
 src/backend/rewrite/rewriteHandler.c |  18 +
 src/backend/rewrite/rewriteSearchCycle.c | 709 +++
 src/backend/utils/adt/ruleutils.c|  47 ++
 src/include/nodes/nodes.h|   2 +
 src/include/nodes/parsenodes.h   |  28 +-
 src/include/parser/analyze.h |   2 +
 src/include/parser/kwlist.h  |   2 +
 src/include/parser/parse_node.h  |   2 +
 src/include/rewrite/rewriteSearchCycle.h |  21 +
 src/test/regress/expected/with.out   | 433 ++
 src/test/regress/sql/with.sql| 210 +++
 27 files changed, 2153 insertions(+), 48 deletions(-)
 create mode 100644 src/backend/rewrite/rewriteSearchCycle.c
 create mode 100644 src/include/rewrite/rewriteSearchCycle.h

diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml
index 77fb1991ae..34b3e48986 100644
--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@@ -2011,6 +2011,10 @@ SELECT in 
WITH
but we'd have needed two levels of nested sub-SELECTs.  
It's a bit
easier to follow this way.
   
+ 
+
+ 
+  Recursive Queries
 
   

@@ -2114,6 +2118,153 @@ Recursive Query Evaluation
 
   
 
+  
+   Search Order
+
+   
+When computing a tree traversal using a recursive query, you might want to
+order the results in either depth-first or breadth-first order.  This can
+be done by computing an ordering column alongside the other data columns
+and using that to sort the results at the end.  Note that this does not
+actually control in which order the query evaluation visits the rows; that
+is as always in SQL implementation-dependent.  This approach merely
+provides a convenient way to order the results afterwards.
+   
+
+   
+To create a depth-first order, we compute for each result row an array of
+rows that we have visited so far.  For example, consider the following
+query that searches a table tree using a
+link field:
+
+
+WITH RECURSIVE search_tree(id, link, data) AS (
+SELECT t.id, t.link, t.data
+FROM tree t
+  UNION ALL
+SELECT t.id, t.link, t.data
+FROM tree t, search_tree st
+WHERE t.id = st.link
+)
+SELECT * FROM search_tree;
+
+
+To add depth-first ordering information, you can write this:
+
+
+WITH RECURSIVE search_tree(id, link, data, path) AS (
+SELECT t.id, t.link, t.data, ARRAY[t.id]
+FROM tree t
+  UNION ALL
+SELECT t.id, t.link, t.data, path || t.id
+FROM tree t, search_tree st
+WHERE t.id = st.link
+)
+SELECT * FROM search_tree ORDER BY path;
+
+   
+
+  
+   In the general case where more than one field needs to be used to identify
+   a row, use an array of rows.  For example, if we needed to track fields
+   f1 and f2:
+
+
+WITH RECURSIVE 

Re: [HACKERS] Custom compression methods

2020-10-09 Thread Dilip Kumar
On Fri, Oct 9, 2020 at 3:24 AM Tomas Vondra
 wrote:
>
> On Thu, Oct 08, 2020 at 02:38:27PM +0530, Dilip Kumar wrote:
> >On Wed, Oct 7, 2020 at 5:00 PM Dilip Kumar  wrote:
> >>
> >> On Wed, Oct 7, 2020 at 10:26 AM Dilip Kumar  wrote:
> >> >
> >> > On Tue, Oct 6, 2020 at 10:21 PM Tomas Vondra
> >> >  wrote:
> >> > >
> >> > > On Tue, Oct 06, 2020 at 11:00:55AM +0530, Dilip Kumar wrote:
> >> > > >On Mon, Oct 5, 2020 at 9:34 PM Tomas Vondra
> >> > > > wrote:
> >> > > >>
> >> > > >> On Mon, Oct 05, 2020 at 07:57:41PM +0530, Dilip Kumar wrote:
> >> > > >> >On Mon, Oct 5, 2020 at 5:53 PM Tomas Vondra
> >> > > >> > wrote:
> >> > > >> >>
> >> > > >> >> On Mon, Oct 05, 2020 at 11:17:28AM +0530, Dilip Kumar wrote:
> >> > > >> >> >On Mon, Oct 5, 2020 at 3:37 AM Tomas Vondra
> >> > > >> >> > wrote:
> >> > > >> >> >
> >> > > >> >> >Thanks, Tomas for your feedback.
> >> > > >> >> >
> >> > > >> >> >> 9) attcompression ...
> >> > > >> >> >>
> >> > > >> >> >> The main issue I see is what the patch does with 
> >> > > >> >> >> attcompression. Instead
> >> > > >> >> >> of just using it to store a the compression method, it's also 
> >> > > >> >> >> used to
> >> > > >> >> >> store the preserved compression methods. And using NameData 
> >> > > >> >> >> to store
> >> > > >> >> >> this seems wrong too - if we really want to store this info, 
> >> > > >> >> >> the correct
> >> > > >> >> >> way is either using text[] or inventing charvector or similar.
> >> > > >> >> >
> >> > > >> >> >The reason for using the NameData is the get it in the fixed 
> >> > > >> >> >part of
> >> > > >> >> >the data structure.
> >> > > >> >> >
> >> > > >> >>
> >> > > >> >> Why do we need that? It's possible to have varlena fields with 
> >> > > >> >> direct
> >> > > >> >> access (see pg_index.indkey for example).
> >> > > >> >
> >> > > >> >I see.  While making it NameData I was thinking whether we have an
> >> > > >> >option to direct access the varlena. Thanks for pointing me there. 
> >> > > >> >I
> >> > > >> >will change this.
> >> > > >> >
> >> > > >> > Adding NameData just to make
> >> > > >> >> it fixed-length means we're always adding 64B even if we just 
> >> > > >> >> need a
> >> > > >> >> single byte, which means ~30% overhead for the 
> >> > > >> >> FormData_pg_attribute.
> >> > > >> >> That seems a bit unnecessary, and might be an issue with many 
> >> > > >> >> attributes
> >> > > >> >> (e.g. with many temp tables, etc.).
> >> > > >> >
> >> > > >> >You are right.  Even I did not like to keep 64B for this, so I 
> >> > > >> >will change it.
> >> > > >> >
> >> > > >> >>
> >> > > >> >> >> But to me this seems very much like a misuse of 
> >> > > >> >> >> attcompression to track
> >> > > >> >> >> dependencies on compression methods, necessary because we 
> >> > > >> >> >> don't have a
> >> > > >> >> >> separate catalog listing compression methods. If we had that, 
> >> > > >> >> >> I think we
> >> > > >> >> >> could simply add dependencies between attributes and that 
> >> > > >> >> >> catalog.
> >> > > >> >> >
> >> > > >> >> >Basically, up to this patch, we are having only built-in 
> >> > > >> >> >compression
> >> > > >> >> >methods and those can not be dropped so we don't need any 
> >> > > >> >> >dependency
> >> > > >> >> >at all.  We just want to know what is the current compression 
> >> > > >> >> >method
> >> > > >> >> >and what is the preserve compression methods supported for this
> >> > > >> >> >attribute.  Maybe we can do it better instead of using the 
> >> > > >> >> >NameData
> >> > > >> >> >but I don't think it makes sense to add a separate catalog?
> >> > > >> >> >
> >> > > >> >>
> >> > > >> >> Sure, I understand what the goal was - all I'm saying is that it 
> >> > > >> >> looks
> >> > > >> >> very much like a workaround needed because we don't have the 
> >> > > >> >> catalog.
> >> > > >> >>
> >> > > >> >> I don't quite understand how could we support custom compression 
> >> > > >> >> methods
> >> > > >> >> without listing them in some sort of catalog?
> >> > > >> >
> >> > > >> >Yeah for supporting custom compression we need some catalog.
> >> > > >> >
> >> > > >> >> >> Moreover, having the catalog would allow adding compression 
> >> > > >> >> >> methods
> >> > > >> >> >> (from extensions etc) instead of just having a list of 
> >> > > >> >> >> hard-coded
> >> > > >> >> >> compression methods. Which seems like a strange limitation, 
> >> > > >> >> >> considering
> >> > > >> >> >> this thread is called "custom compression methods".
> >> > > >> >> >
> >> > > >> >> >I think I forgot to mention while submitting the previous patch 
> >> > > >> >> >that
> >> > > >> >> >the next patch I am planning to submit is, Support creating the 
> >> > > >> >> >custom
> >> > > >> >> >compression methods wherein we can use pg_am catalog to insert 
> >> > > >> >> >the new
> >> > > >> >> >compression method.  And for dependency handling, we can create 
> >> > > >> >> >an
> >> > > >> >> >attribute dependency on the pg_am row.  Basically, we will 
> 

powerpc pg_atomic_compare_exchange_u32_impl: error: comparison of integer expressions of different signedness (Re: pgsql: For all ppc compilers, implement compare_exchange and) fetch_add

2020-10-09 Thread Christoph Berg
Re: Noah Misch
> For all ppc compilers, implement compare_exchange and fetch_add with asm.
> 
> This is more like how we handle s_lock.h and arch-x86.h.
> 
> Reviewed by Tom Lane.
> 
> Discussion: https://postgr.es/m/20191005173400.ga3979...@rfd.leadboat.com

Hi,

pg-cron on powerpc/ppc64/ppc64el is raising this warning inside the
ppc atomics:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
-Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 
-Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard 
-Wno-format-truncation -Wno-stringop-truncation -g -g -O2 
-fstack-protector-strong -Wformat -Werror=format-security -g -O2 
-fdebug-prefix-map=/<>=. -fstack-protector-strong -Wformat 
-Werror=format-security -fPIC -std=c99 -Wall -Wextra -Werror 
-Wno-unknown-warning-option -Wno-unused-parameter -Wno-maybe-uninitialized 
-Wno-implicit-fallthrough -Iinclude -I/usr/include/postgresql -I. -I./ 
-I/usr/include/postgresql/13/server -I/usr/include/postgresql/internal  
-Wdate-time -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o 
src/job_metadata.o src/job_metadata.c
In file included from /usr/include/postgresql/13/server/port/atomics.h:74,
 from /usr/include/postgresql/13/server/utils/dsa.h:17,
 from /usr/include/postgresql/13/server/nodes/tidbitmap.h:26,
 from /usr/include/postgresql/13/server/access/genam.h:19,
 from src/job_metadata.c:21:
/usr/include/postgresql/13/server/port/atomics/arch-ppc.h: In function 
‘pg_atomic_compare_exchange_u32_impl’:
/usr/include/postgresql/13/server/port/atomics/arch-ppc.h:97:42: error: 
comparison of integer expressions of different signedness: ‘uint32’ {aka 
‘unsigned int’} and ‘int’ [-Werror=sign-compare]
   97 |   *expected <= PG_INT16_MAX && *expected >= PG_INT16_MIN)
  |  ^~
src/job_metadata.c: At top level:
cc1: note: unrecognized command-line option ‘-Wno-unknown-warning-option’ may 
have been intended to silence earlier diagnostics
cc1: all warnings being treated as errors

Looking at the pg_atomic_compare_exchange_u32_impl, this looks like a
genuine problem:

static inline bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
uint32 *expected, uint32 newval)
...
if (__builtin_constant_p(*expected) &&
*expected <= PG_INT16_MAX && *expected >= PG_INT16_MIN)

If *expected is an unsigned integer, comparing it to PG_INT16_MIN
which is a negative number doesn't make sense.

src/include/c.h:#define PG_INT16_MIN    (-0x7FFF-1)
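
A minimal standalone reproduction (outside PostgreSQL, compiled with -Wextra
or -Wsign-compare):

#include <stdint.h>

#define PG_INT16_MAX    (0x7FFF)
#define PG_INT16_MIN    (-0x7FFF - 1)

/*
 * *expected is unsigned, so the comparison against the negative
 * PG_INT16_MIN triggers -Wsign-compare (and is always true anyway).
 */
int
is_small_constant(uint32_t *expected)
{
    return *expected <= PG_INT16_MAX && *expected >= PG_INT16_MIN;
}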

Christoph




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Amit Kapila
On Tue, Oct 6, 2020 at 3:08 PM Greg Nancarrow  wrote:
>
> On Mon, Oct 5, 2020 at 10:36 PM Dilip Kumar  wrote:
>
> Also, I have attached a separate patch (requested by Andres Freund)
> that just allows the underlying SELECT part of "INSERT INTO ... SELECT
> ..." to be parallel.
>

It might be a good idea to first just get this patch committed, if
possible. So, I have reviewed the latest version of this patch:

0001-InsertParallelSelect
1.
ParallelContext *pcxt;

+ /*
+ * We need to avoid an attempt on INSERT to assign a
+ * FullTransactionId whilst in parallel mode (which is in
+ * effect due to the underlying parallel query) - so the
+ * FullTransactionId is assigned here. Parallel mode must
+ * be temporarily escaped in order for this to be possible.
+ * The FullTransactionId will be included in the transaction
+ * state that is serialized in the parallel DSM.
+ */
+ if (estate->es_plannedstmt->commandType == CMD_INSERT)
+ {
+ Assert(IsInParallelMode());
+ ExitParallelMode();
+ GetCurrentFullTransactionId();
+ EnterParallelMode();
+ }
+

This looks like a hack to me. I think you are doing this to avoid the
parallel mode checks in GetNewTransactionId(), right? If so, I have
already mentioned above [1] that we can change it so that we disallow
assigning xids for parallel workers only. The same is true for the
check in ExecGatherMerge. Do you see any problem with that suggestion?

2.
@@ -337,7 +337,7 @@ standard_planner(Query *parse, const char
*query_string, int cursorOptions,
  */
  if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
  IsUnderPostmaster &&
- parse->commandType == CMD_SELECT &&
+ (parse->commandType == CMD_SELECT || parse->commandType == CMD_INSERT) &&
  !parse->hasModifyingCTE &&
  max_parallel_workers_per_gather > 0 &&
  !IsParallelWorker())

I think the comments above this need to be updated, especially the part
that says: "Note that we do allow CREATE TABLE AS, SELECT INTO, and
CREATE MATERIALIZED VIEW to use parallel plans, but as of now, only
the leader backend writes into a completely new table.". Don't we need
to include Insert also?

3.
@@ -371,6 +371,7 @@ standard_planner(Query *parse, const char
*query_string, int cursorOptions,
  * parallel-unsafe, or else the query planner itself has a bug.
  */
  glob->parallelModeNeeded = glob->parallelModeOK &&
+ (parse->commandType == CMD_SELECT) &&
  (force_parallel_mode != FORCE_PARALLEL_OFF);

Why do you need this change? The comments above this code should be
updated to reflect it. I think the code below was modified for the same
reason, but I don't understand the reason for that change either; it
would also be better to update the comments for it.

@@ -425,7 +426,7 @@ standard_planner(Query *parse, const char
*query_string, int cursorOptions,
  * Optionally add a Gather node for testing purposes, provided this is
  * actually a safe thing to do.
  */
- if (force_parallel_mode != FORCE_PARALLEL_OFF && top_plan->parallel_safe)
+ if (force_parallel_mode != FORCE_PARALLEL_OFF && parse->commandType
== CMD_SELECT && top_plan->parallel_safe)
  {
  Gather*gather = makeNode(Gather);

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BE-pM0U6qw7EOF0yO0giTxdErxoJV9xTqN%2BLo9zdotFQ%40mail.gmail.com


-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Greg Nancarrow
On Fri, Oct 9, 2020 at 6:31 PM Thomas Munro  wrote:
>
> A couple more observations:
>
> +   pathnode->path.parallel_aware = parallel_workers > 0 ? true : false;
>
> Hmm, I think this may be bogus window dressing only affecting EXPLAIN.
> If you change it to assign false always, it works just the same,
> except EXPLAIN says:
>
>  Gather  (cost=15428.00..16101.14 rows=100 width=4)
>Workers Planned: 2
>->  Insert on s  (cost=15428.00..16101.14 rows=208334 width=4)
>  ->  Parallel Hash Join  (cost=15428.00..32202.28 rows=416667 width=4)
>
> ... instead of:
>
>  Gather  (cost=15428.00..16101.14 rows=100 width=4)
>Workers Planned: 2
>->  Parallel Insert on s  (cost=15428.00..16101.14 rows=208334 width=4)
>  ->  Parallel Hash Join  (cost=15428.00..32202.28 rows=416667 width=4)
>
> AFAICS it's not parallel-aware, it just happens to be running in
> parallel with a partial input and partial output (and in this case,
> effect in terms of writes).  Parallel-aware is our term for nodes that
> actually know they are running in parallel and do some special
> coordination with their twins in other processes.
>

Ah, thanks, I see the distinction now. I'll fix that, to restore
parallel_aware=false for the ModifyTable node.

> The estimated row count also looks wrong; at a guess, the parallel
> divisor is applied twice.  Let me try that with
> parallel_leader_particiation=off (which disables some funky maths in
> the row estimation and makes it straight division by number of
> processes):
>
>  Gather  (cost=17629.00..18645.50 rows=100 width=4)
>Workers Planned: 2
>->  Insert on s  (cost=17629.00..18645.50 rows=25 width=4)
>  ->  Parallel Hash Join  (cost=17629.00..37291.00 rows=50 width=4)
> [more nodes omitted]
>
> Yeah, that was a join that spat out a million rows, and we correctly
> estimated 500k per process, and then Insert (still with my hack to
> turn off the bogus "Parallel" display in this case, but it doesn't
> affect the estimation) estimated 250k per process, which is wrong.

Thanks, I did suspect the current costing was wrong for the ModifyTable
(workers>0) case, as I'd thrown it in (moving the current costing code
into costsize.c) without a lot of checking or great thought, and it was
on my TODO list of things to check. At least I created a placeholder
for it. Looks like I've applied a parallel divisor again (not allowing
for that of the underlying query), as you said.
Speaking of costing, I'm not sure I really agree with the current
costing of a Gather node. Just considering a simple Parallel SeqScan
case, the "run_cost += parallel_tuple_cost * path->path.rows;" part of
Gather cost always completely drowns out any other path costs when a
large number of rows are involved (at least with default
parallel-related GUC values), such that Parallel SeqScan would never
be the cheapest path. This linear relationship in the costing based on
the rows and a parallel_tuple_cost doesn't make sense to me. Surely
after a certain amount of rows, the overhead of launching workers will
be out-weighed by the benefit of their parallel work, such that the
more rows, the more likely a Parallel SeqScan will benefit. That seems
to suggest something like a logarithmic formula (or similar) would
better match reality than what we have now. Am I wrong on this? Every
time I use default GUC values, the planner doesn't want to generate a
parallel plan. Lowering parallel-related GUCs like parallel_tuple_cost
(which I normally do for testing) influences it of course, but the
linear relationship still seems wrong.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-09 Thread Amit Kapila
On Fri, Oct 9, 2020 at 2:19 PM Simon Riggs  wrote:
>
> On Fri, 9 Oct 2020 at 04:10, Amit Kapila  wrote:
> >
> > On Thu, Oct 8, 2020 at 2:34 PM Simon Riggs  wrote:
> > >
> > > On Thu, 8 Oct 2020 at 09:47, Dilip Kumar  wrote:
> > >
> > > > > This script will wait 10 seconds after INSERT exits
> > > > > before executing TRUNCATE, please wait for it to run.
> > >
> > > Has this been tested with anything other than the one test case?
> > >
> > > It would be good to know how the patch handles a transaction that
> > > contains many aborted subtransactions that contain invals.
> > >
> >
> > Are you thinking from the angle of performance or functionality? I
> > don't see how this patch can impact either of those. Basically, it
> > will not execute any extra invalidations than it is executing without
> > the patch for aborted subtransactions. Can you please explain in a bit
> > more detail about your fear?
> >
> > Having said that, I think it would be a good idea to test the scenario
> > you mentioned to ensure that we have not broken anything unknowingly.
>
> The test appears to only cover the case of many subtransactions, all
> of which commit, and then top-level commit occurs.
>
> We should be testing cases where the top-level commit occurs, yet some
> proportion of the subtransactions abort. "Normal" would be 10-50%
> aborts.
>
> I presume we support this case already, but wish to ensure the
> performance tweak is not just for the one special case.
>

Okay, I think this makes sense. I think we should see the performance
benefit for this case as well but maybe to a bit lesser degree because
we will exclude some of the subtransactions from processing.

-- 
With Regards,
Amit Kapila.




Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-09 Thread Simon Riggs
On Fri, 9 Oct 2020 at 04:10, Amit Kapila  wrote:
>
> On Thu, Oct 8, 2020 at 2:34 PM Simon Riggs  wrote:
> >
> > On Thu, 8 Oct 2020 at 09:47, Dilip Kumar  wrote:
> >
> > > > This script will wait 10 seconds after INSERT exits
> > > > before executing TRUNCATE, please wait for it to run.
> >
> > Has this been tested with anything other than the one test case?
> >
> > It would be good to know how the patch handles a transaction that
> > contains many aborted subtransactions that contain invals.
> >
>
> Are you thinking from the angle of performance or functionality? I
> don't see how this patch can impact either of those. Basically, it
> will not execute any extra invalidations than it is executing without
> the patch for aborted subtransactions. Can you please explain in a bit
> more detail about your fear?
>
> Having said that, I think it would be a good idea to test the scenario
> you mentioned to ensure that we have not broken anything unknowingly.

The test appears to only cover the case of many subtransactions, all
of which commit, and then top-level commit occurs.

We should be testing cases where the top-level commit occurs, yet some
proportion of the subtransactions abort. "Normal" would be 10-50%
aborts.

I presume we support this case already, but wish to ensure the
performance tweak is not just for the one special case.

Thanks

-- 
Simon Riggs                http://www.enterprisedb.com/




Re: [PATCH] ecpg: fix progname memory leak

2020-10-09 Thread Daniel Gustafsson
> On 8 Oct 2020, at 19:08, Bruce Momjian  wrote:

> OK, TODO removed.

Potentially controversial proposal: Should we perhaps remove (for some value
of) the TODO list altogether?

The value of the list is questionable as it's not actually a TODO list for
established developers, and for aspiring new developers it's unlikely to give
good ideas for where to start (I know there's a big warning but not everyone
will read everything, the page is quite long).

Now, I don't actually suggest we *remove* it, as there is valuable curated
content, but that we rename it to something which won't encourage newcomers to
pick something from the list simply because it exists.  The name "TODO" implies
that the list is something which it isn't.

Thoughts?

cheers ./daniel



Re: Wrong statistics for size of XLOG_SWITCH during pg_waldump.

2020-10-09 Thread Kyotaro Horiguchi
At Fri, 9 Oct 2020 13:41:25 +0800, "movead...@highgo.ca"  
wrote in 
> Hello hackers,
> 
> We know that pg_waldump can statistics size for every kind of records. When I 
> use
> the feature I find it misses some size for XLOG_SWITCH records. When a user 
> does
> a pg_wal_switch(), then postgres will discard the remaining size in the 
> current wal
> segment, and the pg_waldump tool misses the discard size.
> 
> I think it will be better if pg_waldump  can show the matter, so I make a 
> patch
> which regards the discard size as a part of XLOG_SWITCH record, it works if we
> want to display the detail of wal records or the statistics, and patch 
> attached.
> 
> What's your opinion?

I think that the length of the XLOG_SWITCH record is nothing other than 24
bytes. Just adding the padding/garbage bytes to that length doesn't
seem like the right thing to me.

If we want pg_waldump to show that length somewhere, it could be shown
at the end of that record explicitly:

rmgr: XLOG        len (rec/tot): 24/16776848, tx:          0, lsn: 0/02000148, prev 0/02000110, desc: SWITCH, trailing-bytes: 16776944
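
Computing that trailing figure could look roughly like this (a hypothetical
sketch; the helper name and placement are illustrative, not a patch):

#include "postgres.h"
#include "access/xlog_internal.h"   /* XLogSegmentOffset() */

/*
 * Hypothetical helper: how many bytes between the end of the XLOG_SWITCH
 * record and the segment boundary are skipped, without counting them as
 * part of the record itself.
 */
static uint64
xlog_switch_trailing_bytes(XLogRecPtr end_of_record, int wal_segment_size)
{
    uint32      off = XLogSegmentOffset(end_of_record, wal_segment_size);

    /* If the record happens to end exactly on a boundary, nothing is skipped. */
    return (off == 0) ? 0 : (uint64) wal_segment_size - off;
}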

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Add a description to the documentation that toast_tuple_target affects "Main"

2020-10-09 Thread Shinya Okano

Hi,

Regarding the toast_tuple_target parameter of CREATE TABLE, the 
documentation says that it only affects External or Extended, but it 
actually affects the compression of Main as well.


The attached patch modifies the document to match the actual behavior.

Regards,

--
Shinya Okano
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 28f844071b..76d80fb567 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1359,10 +1359,11 @@ WITH ( MODULUS numeric_literal, REM
 
  
   The toast_tuple_target specifies the minimum tuple length required before
-  we try to move long column values into TOAST tables, and is also the
-  target length we try to reduce the length below once toasting begins.
-  This only affects columns marked as either External or Extended
-  and applies only to new tuples; there is no effect on existing rows.
+  we try to compress long column values or to move into TOAST tables, and
+  is also the target length we try to reduce the length below once toasting
+  begins. This affects columns marked as External (for move),
+  Main (for compression), or Extended (for both) and applies only to new
+  tuples. There is no effect on existing rows.
   By default this parameter is set to allow at least 4 tuples per block,
   which with the default block size will be 2040 bytes. Valid values are
   between 128 bytes and the (block size - header), by default 8160 bytes.


Re: libpq debug log

2020-10-09 Thread Kyotaro Horiguchi
At Wed, 16 Sep 2020 17:41:55 -0300, Alvaro Herrera  
wrote in 
> Hello Aya Iwata,
> 
> I like this patch, and I think we should have it.  I have updated it, as
> it had conflicts.
> 
> In 0002, I remove use of ftime(), because that function is obsolete.
> However, with that change we lose printing of milliseconds in the trace
> lines.  I think that's not great, but I also think we can live without
> them until somebody gets motivated to write the code for that.  It seems
> a little messy so I'd prefer to leave that for a subsequent commit.
> (Maybe we can just use pg_strftime.)
> 
> Looking at the message type tables, I decided to get rid of the "bf"
> table, which had only two bytes, and instead put CopyData and CopyDone
> in the other two tables.  That seems simpler.  Also, the COMMAND_x_MAX
> macros were a bit pointless; I just expanded at the only caller of each,
> using lengthof().  And since the message type is unsigned, there's no
> need to do "c >= 0" since it must always be true.
> 
> I added setlinebuf() to the debug file.  Works better than without,
> because "tail -f /tmp/libpqtrace.log" works as you'd expect.
> 
> One thing that it might be good to do is to avoid creating the message
> type tables as const but instead initialize them if and only if tracing
> is enabled, on the first call.  I think that would not only save space
> in the constant area of the library for the 99.99% of the cases where
> tracing isn't used, but also make the initialization code be more
> sensible (since presumably you wouldn't have to initialize all the
> zeroes.)
> 
> Opinions?  I experimented by patching psql as below (not intended for
> commit) and it looks good.  Only ErrorResponse prints the terminator as
> a control character, or something:
> 
> 2020-09-16 13:27:29.072 -03  < ErrorResponse 117 S "ERROR" V "ERROR" C 
> "42704" M "no existe el slot de replicación «foobar»" F "slot.c" L "408" R 
> "ReplicationSlotAcquireInternal" ^@ 
> 
> 
> diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
> index 8232a0143b..01728ba8e8 100644
> --- a/src/bin/psql/startup.c
> +++ b/src/bin/psql/startup.c
> @@ -301,6 +301,11 @@ main(int argc, char *argv[])
>  
>   psql_setup_cancel_handler();
>  
> + {
> + FILE *trace = fopen("/tmp/libpqtrace.log", "a+");
> + PQtrace(pset.db, trace);
> + }
> +
>   PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
>  
>   SyncVariables();

I saw that version and have some comments.

+pqGetProtocolMsgType(unsigned char c, PGCommSource commsource)
+{
+   const char *message_type;

Compiler complains that this is unused.

+static const char *
+pqGetProtocolMsgType(unsigned char c, PGCommSource commsource)
+{
...
+   else
+   return "UnknownCommand";
+}

Compiler complains as "control reached end of non-void function"


+pqLogMsgString(PGconn *conn, const char *v, int length, PGCommSource 
commsource)
+{
+   if (length < 0)
+   length = strlen(v) + 1;
+
+   if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
+   {
+   switch (conn->logging_message.state)
+   {
+   case LOG_CONTENTS:
+   fprintf(conn->Pfdebug, "\"%s\" ", v);
+   pqLogLineBreak(length, conn);

pqLogMsgString(conn, str, -1, FROM_*) means the actual length may be
different from what the caller thinks, but pqLogLineBreak() subtracts
that value from the message length remembered in logging_message.
Anyway, AFAICS the patch doesn't use that code path, so we should remove
the first two lines.

+   case LOG_CONTENTS:
+   fprintf(conn->Pfdebug, "%s%d ", prefix, v);

That prints #65535 for v = -1 and length = 2. I think it should be
properly expanded as a signed integer.
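
A small standalone illustration of that, assuming the 2-byte field ends up
in an unsigned variable:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint16_t    raw = 0xFFFF;   /* a 2-byte field holding -1 on the wire */

    printf("#%d\n", raw);               /* prints #65535 */
    printf("#%d\n", (int16_t) raw);     /* prints #-1, as suggested */
    return 0;
}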


@@ -139,8 +447,7 @@ pqGets_internal(PQExpBuffer buf, PGconn *conn, bool 
resetbuffer)
conn->inCursor = ++inCursor;
 
if (conn->Pfdebug)
-   fprintf(conn->Pfdebug, "From backend> \"%s\"\n",
-   buf->data);
+   pqLogMsgString(conn, buf->data, buf->len + 1, FROM_BACKEND);

By the way, appendBinaryPQExpBuffer() enlarges its buffer by exactly
the length of the given data, but appends '\0' at the end of
the copied data. Couldn't that lead to a memory overrun?


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Parallel copy

2020-10-09 Thread Greg Nancarrow
On Fri, Oct 9, 2020 at 5:40 PM Amit Kapila  wrote:
>
> > Looking a bit deeper into this, I'm wondering if in fact your
> > EstimateStringSize() and EstimateNodeSize() functions should be using
> > BUFFERALIGN() for EACH stored string/node (rather than just calling
> > shm_toc_estimate_chunk() once at the end, after the length of packed
> > strings and nodes has been estimated), to ensure alignment of start of
> > each string/node. Other Postgres code appears to be aligning each
> > stored chunk using shm_toc_estimate_chunk(). See the definition of
> > that macro and its current usages.
> >
>
> I am not sure if this required for the purpose of correctness. AFAIU,
> we do store/estimate multiple parameters in same way at other places,
> see EstimateParamListSpace and SerializeParamList. Do you have
> something else in mind?
>

The point I was trying to make is that potentially more efficient code
can be used if the individual strings/nodes are aligned rather than
packed (as they are now). But as you point out, there are already
cases (e.g. SerializeParamList) where, within the separately-aligned
chunks, the data is not aligned, so maybe it's not a big deal. Oh well,
without alignment, use of memcpy() cannot really be avoided
here for serializing/de-serializing ints etc.; let's hope the compiler
optimizes it as best it can.
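
For what it's worth, the usual portable pattern is a memcpy() into a local,
which compilers generally turn into a single load where alignment allows.
A minimal sketch:

#include <string.h>

static int
read_packed_int(const char *buf)
{
    int         val;

    memcpy(&val, buf, sizeof(val));     /* safe on unaligned data */
    return val;
}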

Regards,
Greg Nancarrow
Fujitsu Australia




Wrong statistics for size of XLOG_SWITCH during pg_waldump.

2020-10-09 Thread movead...@highgo.ca
Hello hackers,

We know that pg_waldump can report size statistics for every kind of record.
When I use the feature, I find it misses some size for XLOG_SWITCH records.
When a user does a pg_wal_switch(), postgres discards the remaining space in
the current wal segment, and the pg_waldump tool misses that discarded size.

I think it would be better if pg_waldump could show this, so I made a patch
which regards the discarded size as a part of the XLOG_SWITCH record. It works
whether we want to display the detail of wal records or the statistics; patch
attached.

What's your opinion?



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


fix_waldump_size_for_wal_switch.patch
Description: Binary data


Re: dynamic result sets support in extended query protocol

2020-10-09 Thread Peter Eisentraut

On 2020-10-08 10:23, Tatsuo Ishii wrote:

Are you proposing to bump up the protocol version (either major or
minor)?  I am asking because it seems you are going to introduce some
new message types.


It wouldn't be a new major version.  It could either be a new minor 
version, or it would be guarded by a _pq_ protocol message to enable 
this functionality from the client, as described.  Or both?  We haven't 
done this sort of thing a lot, so some discussion on the details might 
be necessary.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-09 Thread Thomas Munro
On Fri, Oct 9, 2020 at 3:48 PM Greg Nancarrow  wrote:
> It does give me the incentive to look beyond that issue and see
> whether parallel Update and parallel Delete are indeed possible. I'll
> be sure to give it a go!

Cool!

A couple more observations:

+   pathnode->path.parallel_aware = parallel_workers > 0 ? true : false;

Hmm, I think this may be bogus window dressing only affecting EXPLAIN.
If you change it to assign false always, it works just the same,
except EXPLAIN says:

 Gather  (cost=15428.00..16101.14 rows=100 width=4)
   Workers Planned: 2
   ->  Insert on s  (cost=15428.00..16101.14 rows=208334 width=4)
 ->  Parallel Hash Join  (cost=15428.00..32202.28 rows=416667 width=4)

... instead of:

 Gather  (cost=15428.00..16101.14 rows=100 width=4)
   Workers Planned: 2
   ->  Parallel Insert on s  (cost=15428.00..16101.14 rows=208334 width=4)
 ->  Parallel Hash Join  (cost=15428.00..32202.28 rows=416667 width=4)

AFAICS it's not parallel-aware, it just happens to be running in
parallel with a partial input and partial output (and in this case,
effects in terms of writes).  Parallel-aware is our term for nodes that
actually know they are running in parallel and do some special
coordination with their twins in other processes.

The estimated row count also looks wrong; at a guess, the parallel
divisor is applied twice.  Let me try that with
parallel_leader_participation=off (which disables some funky maths in
the row estimation and makes it straight division by number of
processes):

 Gather  (cost=17629.00..18645.50 rows=100 width=4)
   Workers Planned: 2
   ->  Insert on s  (cost=17629.00..18645.50 rows=25 width=4)
 ->  Parallel Hash Join  (cost=17629.00..37291.00 rows=50 width=4)
[more nodes omitted]

Yeah, that was a join that spat out a million rows, and we correctly
estimated 500k per process, and then Insert (still with my hack to
turn off the bogus "Parallel" display in this case, but it doesn't
affect the estimation) estimated 250k per process, which is wrong.




abstract Unix-domain sockets

2020-10-09 Thread Peter Eisentraut
During the discussion on Unix-domain sockets on Windows, someone pointed 
out[0] abstract Unix-domain sockets.  This is a variant of the normal 
Unix-domain sockets that don't use the file system but a separate 
"abstract" namespace.  At the user interface, such sockets are 
represented by names starting with "@".  I took a look at this and it 
wasn't hard to get working, so here is a patch.  It's supposed to be 
supported on Linux and Windows right now, but I haven't tested on Windows.


I figure, there are so many different deployment options nowadays, this 
could be useful somewhere.  It relieves you from dealing with the file 
system, you don't have to set up /tmp or something under /var/run, you 
don't need to make sure file system permissions are right.  Also, there 
is no need for a lock file or file cleanup.  (Unlike file-system 
namespace sockets, abstract namespace sockets give an EADDRINUSE when 
trying to bind to a name already in use.)  Conversely, of course, you 
don't get to use file-system permissions to manage access to the socket, 
but that isn't essential functionality, so it's a trade-off users can 
make on their own.
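
For anyone unfamiliar with the mechanism, here is a minimal standalone Linux
demonstration of what an abstract-namespace bind looks like (this is not the
patch's code; the name is what you would write after the "@"):

#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * An abstract-namespace socket is just a sockaddr_un whose path starts with
 * a NUL byte, and bind() fails with EADDRINUSE if the name is already taken,
 * hence no lock file or cleanup is needed.
 */
static int
bind_abstract(const char *name)     /* name without the leading '@' */
{
    struct sockaddr_un addr;
    socklen_t   len;
    int         fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    /* addr.sun_path[0] stays '\0': that is the abstract-namespace marker */
    strncpy(addr.sun_path + 1, name, sizeof(addr.sun_path) - 2);
    len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);
    if (bind(fd, (struct sockaddr *) &addr, len) < 0)
    {
        close(fd);
        return -1;              /* EADDRINUSE if another server holds the name */
    }
    return fd;
}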


And then some extra patches for surrounding cleanup.  During testing I 
noticed that the bind() failure hint "Is another postmaster already 
running ..." was shown in inappropriate situations, so I changed that to 
only show for EADDRINUSE errors.  (Maybe other error codes could be 
appropriate, but I couldn't find any more.)


And then looking for other uses of EADDRINUSE I found some dead 
Windows-related code that can be cleaned up.



[0]: 
https://www.postgresql.org/message-id/20191218142419.fvv4ikm4wq4gn...@isc.upenn.edu


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 5afc3342a306a1d9599bb5e6c77aeb2581b6799c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 9 Oct 2020 09:10:41 +0200
Subject: [PATCH 1/3] Add support for abstract Unix-domain sockets

This is a variant of the normal Unix-domain sockets that don't use the
file system but a separate "abstract" namespace.  At the user
interface, such sockets are represented by names starting with "@".
Supported on Linux and Windows right now.
---
 doc/src/sgml/config.sgml  | 24 +++-
 doc/src/sgml/libpq.sgml   |  5 -
 src/backend/libpq/pqcomm.c|  8 
 src/bin/psql/command.c| 15 +--
 src/bin/psql/prompt.c |  3 ++-
 src/common/ip.c   | 24 +++-
 src/include/libpq/pqcomm.h|  9 +
 src/interfaces/libpq/fe-connect.c |  4 ++--
 8 files changed, 76 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee914740cc..8fc76bb40f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -749,6 +749,21 @@ Connection Settings
 An empty value
 specifies not listening on any Unix-domain sockets, in which case
 only TCP/IP sockets can be used to connect to the server.
+   
+
+   
+A value that starts with @ specifies that a
+Unix-domain socket in the abstract namespace should be created
+(currently supported on Linux and Windows).  In that case, this value
+does not specify a directory but a prefix from which
+the actual socket name is computed in the same manner as for the
+file-system namespace.  While the abstract socket name prefix can be
+chosen freely, since it is not a file-system location, the convention
+is to nonetheless use file-system-like values such as
+@/tmp.
+   
+
+   
 The default value is normally
 /tmp, but that can be changed at build time.
 On Windows, the default is empty, which means no Unix-domain socket is
@@ -763,6 +778,7 @@ Connection Settings
 named .s.PGSQL..lock 
will be
 created in each of the unix_socket_directories 
directories.
 Neither file should ever be removed manually.
+For sockets in the abstract namespace, no lock file is created.

   
  
@@ -787,7 +803,8 @@ Connection Settings
 

 This parameter is not supported on Windows.  Any setting will be
-ignored.
+ignored.  Also, sockets in the abstract namespace have no file owner,
+so this setting is also ignored in that case.

   
  
@@ -834,6 +851,11 @@ Connection Settings
 similar effect by pointing unix_socket_directories 
to a
 directory having search permission limited to the desired audience.

+
+   
+Sockets in the abstract namespace have no file permissions, so this
+setting is also ignored in that case.
+   
   
  
 
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 3315f1dd05..040750b2b2 100644
--- a/doc/src/sgml/libpq.sgml
+++ 

RE: Transactions involving multiple postgres foreign servers, take 2

2020-10-09 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> I don't understand why we hate ERRORs from fdw-2pc-commit routine so
> much. I think remote-commits should be performed before local commit
> passes the point-of-no-return and the v26-0002 actually places
> AtEOXact_FdwXact() before the critical section.

I don't hate ERROR, but it would be simpler and more understandable for the FDW 
commit routine to just return control to the caller (TM) and let the TM do 
whatever is appropriate (ask the resolver to handle the failed commit, and 
continue to request the commit from the next FDW).
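
To illustrate the control flow I have in mind, here is a rough, hypothetical 
sketch (the FdwParticipant type and the hand_over_to_resolver() helper are 
made up for illustration and are not the API of the patch):

#include "postgres.h"
#include "nodes/pg_list.h"

/* hypothetical: one entry per remote server participating in the transaction */
typedef struct FdwParticipant
{
    void   *fdw_state;
    bool  (*commit_prepared) (void *fdw_state);     /* returns false on failure */
} FdwParticipant;

/* hypothetical: queue the failed participant for the background resolver */
extern void hand_over_to_resolver(FdwParticipant *participant);

static void
commit_all_prepared(List *participants)
{
    ListCell   *lc;

    foreach(lc, participants)
    {
        FdwParticipant *p = (FdwParticipant *) lfirst(lc);

        if (!p->commit_prepared(p->fdw_state))
            hand_over_to_resolver(p);   /* resolve this one later, asynchronously */
        /* keep asking the remaining servers to commit regardless */
    }
}

The point is only that a failed remote commit is recorded and handed off while 
the loop keeps going, rather than throwing ERROR in the middle of it.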


> > https://docs.oracle.com/cd/E92951_01/wls/WLJTA/trxcon.htm
> > --
> > Abandon timeout
> > Specifies the maximum time (in seconds) that the transaction manager
> attempts to complete the second phase of a two-phase commit transaction.
> >
> > In the second phase of a two-phase commit transaction, the transaction
> manager attempts to complete the transaction until all resource managers
> indicate that the transaction is complete. After the abort transaction timer
> expires, no attempt is made to resolve the transaction. If the transaction 
> enters
> a ready state before it is destroyed, the transaction manager rolls back the
> transaction and releases the held lock on behalf of the destroyed transaction.
> > --
> 
> That's not a retry timeout but a timeout for total time of all
> 2nd-phase-commits.  But I think it would be sufficient.  Even if an
> fdw could retry 2pc-commit, it's a matter of that fdw and the core has
> nothing to do with.

Yeah, the WebLogic documentation doesn't say whether it performs retries during 
the timeout period.  I just cited it as an example of a product that has a timeout 
parameter for the second phase of 2PC.


> At least postgres_fdw is interruptible while waiting the remote.
> 
> create view lt as select 1 as slp from (select pg_sleep(10)) t;
> create foreign table ft(slp int) server sv1 options (table_name 'lt');
> select * from ft;
> ^CCancel request sent
> ERROR:  canceling statement due to user request

I'm afraid cancellation doesn't work while postgres_fdw is trying to 
connect to a down server.  Also, the Postgres manual doesn't say anything about 
cancellation, so we cannot expect FDWs to respond to a user's cancel request.


 Regards
Takayuki Tsunakawa



Re: range_agg

2020-10-09 Thread Pavel Stehule
On Thu, Sep 24, 2020 at 2:05 AM Paul A Jungwirth <
p...@illuminatedcomputing.com> wrote:

> On Sun, Aug 16, 2020 at 12:55 PM Paul A Jungwirth
>  wrote:
> > This is rebased on the current master, including some changes to doc
> > tables and pg_upgrade handling of type oids.
>
> Here is a rebased version of this patch, including a bunch of cleanup
> from Alvaro. (Thanks Alvaro!)
>

I tested this patch and it looks good; I have no objections.

1. there are no new warnings
2. make check-world passed
3. the documentation builds without problems
4. the documentation is sufficient, and so are the regression tests
5. there were no objections against this feature in the discussion, and I think
it is an interesting and useful feature - a good addition to arrays

Regards

Pavel




> Paul
>


Re: Parallel copy

2020-10-09 Thread Amit Kapila
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow  wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C  wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
> v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>

Your suggestion makes sense to me if the assumption related to string
length is correct. If we can't ensure that, then we probably need to
use a four-byte uint32 to store the length.
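
To be concrete, with a four-byte length the helper could look something like 
this (a sketch only, not taken from the patch; the unused cstate parameter is 
dropped here, and the reader side would of course need the matching change):

static void
CopyStringToSharedMemory(char *srcPtr, char *destptr, uint32 *copiedsize)
{
    uint32 len = srcPtr ? (uint32) strlen(srcPtr) + 1 : 0;

    /* store the full four-byte length: no byte-order or 64kB-limit concerns */
    memcpy(destptr, &len, sizeof(uint32));
    *copiedsize += sizeof(uint32);
    if (len)
    {
        /* the NUL-terminated string follows its length word directly */
        memcpy(destptr + sizeof(uint32), srcPtr, len);
        *copiedsize += len;
    }
}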

> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>

I am not sure if this is required for correctness. AFAIU,
we store/estimate multiple parameters in the same way at other places;
see EstimateParamListSpace and SerializeParamList. Do you have
something else in mind?
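
The pattern I am referring to is roughly the following (a sketch with an 
illustrative key and helper names, not code from the patch): shm_toc aligns 
the chunk as a whole, and the variable-length items inside it are simply 
packed back to back and copied out with memcpy, so per-item BUFFERALIGN is 
not needed for correctness.

#include "postgres.h"
#include "access/parallel.h"
#include "storage/shm_toc.h"

#define ILLUSTRATIVE_KEY    UINT64CONST(0xE000000000000001)

/* estimation phase (before InitializeParallelDSM) */
static Size
estimate_packed_string(ParallelContext *pcxt, const char *str)
{
    Size size = sizeof(uint32) + strlen(str) + 1;

    shm_toc_estimate_chunk(&pcxt->estimator, size);     /* aligns the chunk as a whole */
    shm_toc_estimate_keys(&pcxt->estimator, 1);
    return size;
}

/* serialization phase (after InitializeParallelDSM) */
static void
serialize_packed_string(ParallelContext *pcxt, const char *str, Size size)
{
    uint32 len = (uint32) strlen(str) + 1;
    char *space = shm_toc_allocate(pcxt->toc, size);

    /* items are packed back to back; readers pull them out with memcpy */
    memcpy(space, &len, sizeof(uint32));
    memcpy(space + sizeof(uint32), str, len);
    shm_toc_insert(pcxt->toc, ILLUSTRATIVE_KEY, space);
}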

While looking at the latest code, I observed the issue below in patch
v6-0003-Allow-copy-from-command-to-process-data-from-file:

+ /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
+ est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
+ shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ strsize = EstimateCstateSize(pcxt, cstate, attnamelist, ,
+ , ,
+ , ,
+ );

Here, do we need to separately estimate the size of
SerializedParallelCopyState when it is also done in
EstimateCstateSize?

-- 
With Regards,
Amit Kapila.




RE: [PoC] Non-volatile WAL buffer

2020-10-09 Thread Deng, Gang
Hi Takashi,

There are some differences between our HW/SW configurations and test steps. I 
attached the postgresql.conf I used for your reference. I would like to try the 
postgresql.conf and steps you provided in the coming days to see if I can find 
the cause.

I also ran pgbench and the postgres server on the same machine but on different 
NUMA nodes, and ensured the server process and PMEM were on the same NUMA node. I 
used steps similar to yours from step 1 to 9, but with some differences in the 
later steps; the major ones are:

In step 10), I created a database and a table for the test by running:
#create database:
psql -c "create database insert_bench;"
#create table:
psql -d insert_bench -c "create table test(crt_time timestamp, info text 
default 
'75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d79a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');"

In step 15), I did not use pg_prewarm, but just ran pgbench for 180 seconds to 
warm up.
In step 16), I ran pgbench using the command: pgbench -M prepared -n -r -P 10 -f 
./test.sql -T 600 -c _ -j _ insert_bench. (test.sql can be found in the attachment)

For HW/SW conf, the major differences are:
CPU: I used Xeon 8268 (24c@2.9GHz, HT enabled)
OS Distro: CentOS 8.2.2004 
Kernel: 4.18.0-193.6.3.el8_2.x86_64
GCC: 8.3.1

Best regards
Gang

-Original Message-
From: Takashi Menjo  
Sent: Tuesday, October 6, 2020 4:49 PM
To: Deng, Gang 
Cc: pgsql-hack...@postgresql.org; 'Takashi Menjo' 
Subject: RE: [PoC] Non-volatile WAL buffer

Hi Gang,

I have tried to, but cannot yet reproduce the performance degradation you reported 
when inserting 328-byte records. So I think your conditions and mine must be 
different, such as the steps to reproduce, postgresql.conf, installation setup, and 
so on.

My results and conditions are as follows. May I have your conditions in more 
detail? Note that I refer to your "Storage over App Direct" as my "Original 
(PMEM)" and to your "NVWAL patch" as my "Non-volatile WAL buffer."

Best regards,
Takashi


# Results
See the attached figure. In short, Non-volatile WAL buffer got better 
performance than Original (PMEM).

# Steps
Note that I ran the postgres server and pgbench on a single machine but on two 
separate NUMA nodes. The PMEM and PCIe SSD for the server process are on the 
server-side NUMA node.

01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M 
dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo 
mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0 /mnt/pmem0)
03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 -q 
-F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1)
04) Make /mnt/pmem0/pg_wal directory for WAL
05) Make /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile 
WAL buffer
07) Edit postgresql.conf as the attached one
- Please remove nvwal_* lines in the case of Original (PMEM)
08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl 
-l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Change # characters of "filler" column of "pgbench_history" table to 300 
(ALTER TABLE pgbench_history ALTER filler TYPE character(300);)
- This would make the row size of the table 328 bytes
12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
13) Remount the PMEM and the PCIe SSD
14) Start postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- 
pg_ctl -l pg.log start)
15) Run pg_prewarm for all the four pgbench_* tables
16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r 
-M prepared -T 1800 -c __ -j __)
- It executes the default tpcb-like transactions

I repeated all the steps three times for each (c, j) pair, then took the median 
"tps = __ (including connections establishing)" of the three runs as the 
throughput and the "latency average = __ ms" of that run as the average latency.

# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by 
BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels 
per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 
sockets (256 GiB per channel x 6 channels per socket; interleaving enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C compiler: gcc 9.3.0
- libc: glibc 2.31
- Linux kernel: 5.7 (vanilla)
- Filesystem: ext4 (DAX enabled when using Optane PMem)